Llama 3.2 1B Instruct FP8 Dynamic

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-llama-3-2-1b-instruct-fp8-dynamic
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "Llama 3.2 1B Instruct FP8 Dynamic (fp8)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic"
        - "--max-num-seqs=32"
        - "--max-model-len=131072"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-2-1b-instruct-fp8-dynamic
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-llama-3-2-1b-instruct-fp8-dynamic
      storageUri: hf://RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic

Overview

Llama 3.2 1B Instruct FP8 Dynamic is an FP8-quantized version of Meta's Llama-3.2-1B-Instruct, created by Red Hat AI (Neural Magic). Quantizing weights and activations from 16 bits to 8 bits reduces the GPU memory needed for the model weights by approximately 50% while retaining about 98% of the original model's accuracy on OpenLLM benchmarks.

Prerequisites

OpenShift 4.14+ with the NVIDIA GPU Operator installed
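At least 1 NVIDIA GPU with 16 GB+ VRAM (A10, L4, A100, H100, etc.); native FP8 compute needs an Ada Lovelace or Hopper GPU (e.g. L4, H100), while on Ampere GPUs (A10, A100) vLLM falls back to weight-only FP8 kernels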
Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually

Quick Start

Save the manifests above to deploy.yaml
Create a namespace: oc new-project llm-serving
Apply: oc apply -f deploy.yaml
Wait for the pod to pull the model and start serving
Get the URL: oc get inferenceservice llama-3-2-1b-instruct-fp8-dynamic -n llm-serving
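
The "wait" step above can be scripted. The following is a minimal sketch, assuming the oc CLI is on the PATH and already logged in, and that the InferenceService reports readiness through a status condition of type Ready (the usual KServe convention). The helper names is_ready and fetch_isvc are hypothetical and not part of any library.

```python
import json
import subprocess


def is_ready(isvc: dict) -> bool:
    """Return True when the object has a condition of type Ready with status "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def fetch_isvc(name: str, namespace: str) -> dict:
    """Fetch the InferenceService as JSON through the oc CLI (assumes an active login)."""
    result = subprocess.run(
        ["oc", "get", "inferenceservice", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example usage against a live cluster (not executed here):
#   isvc = fetch_isvc("llama-3-2-1b-instruct-fp8-dynamic", "llm-serving")
#   print("ready" if is_ready(isvc) else "not ready yet")
```

Calling fetch_isvc in a loop with a short sleep gives a simple readiness poll. The first start can take a while because the image and model weights have to be downloaded.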

Testing the Endpoint

export URL=$(oc get inferenceservice llama-3-2-1b-instruct-fp8-dynamic -n llm-serving -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
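
The same request can be sent from Python using only the standard library. This is a sketch rather than an official client: build_chat_request and chat are hypothetical helper names, and the response parsing assumes the OpenAI-compatible chat completions shape (choices[0].message.content) that the vLLM server exposes.

```python
import json
import urllib.request

MODEL = "RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic"


def build_chat_request(base_url: str, prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the text of the first completion choice."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the URL exported as in the shell example, chat(os.environ["URL"], "Hello!") (after import os) returns the assistant's reply as a string.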

Notes

Small enough to run on a single GPU with 16GB+ VRAM
Supports 128K context natively (--max-model-len=131072); reduce --max-model-len if VRAM is limited
Uses compressed-tensors format for native vLLM FP8 support
Supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
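
To judge how far to reduce --max-model-len, a rough KV-cache estimate can help. The sketch below assumes the Llama-3.2-1B architecture values (16 layers, 8 key/value heads, head dimension 64) and a 16-bit KV cache; these are assumptions to check against the model's config.json and the vLLM settings in use. The function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: keys and values (factor 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed values for Llama-3.2-1B; verify against config.json before relying on them.
LAYERS, KV_HEADS, HEAD_DIM = 16, 8, 64
DTYPE_BYTES = 2  # assumes a 16-bit KV cache


def kv_cache_gib(context_len: int) -> float:
    """KV cache size in GiB for one sequence that fills the whole context."""
    per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES)
    return per_token * context_len / 2**30


for ctx in (8192, 32768, 131072):
    print(f"--max-model-len={ctx}: ~{kv_cache_gib(ctx):.2f} GiB per full-length sequence")
# --max-model-len=8192: ~0.25 GiB per full-length sequence
# --max-model-len=32768: ~1.00 GiB per full-length sequence
# --max-model-len=131072: ~4.00 GiB per full-length sequence
```

Under these assumptions a single 128K-token sequence needs about 4 GiB of KV cache on top of the model weights, so concurrent long requests multiply that figure. This is an estimate only; actual usage depends on how vLLM allocates GPU memory.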