Red Hat AI

Llama 3.2 1B Instruct FP8 Dynamic

FP8-quantized, 1.5B-parameter, instruction-tuned Llama 3.2 model. Reduces GPU memory by roughly 50% compared with BF16, with minimal accuracy loss.

1.5B · 131,072 ctx · text · dense

These manifests assume an OpenShift cluster with the GPU Operator and RHOAI/KServe installed. See Prerequisites below.

FP8 dynamic quantization on a single GPU

Deploy with oc apply

Save the manifests below as deploy.yaml, then run oc apply -f deploy.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-llama-3-2-1b-instruct-fp8-dynamic
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "Llama 3.2 1B Instruct FP8 Dynamic (fp8)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic"
        - "--max-num-seqs=32"
        - "--max-model-len=131072"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-2-1b-instruct-fp8-dynamic
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-llama-3-2-1b-instruct-fp8-dynamic
      storageUri: hf://RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic

Overview

Llama 3.2 1B Instruct FP8 Dynamic is an FP8-quantized version of Meta's Llama-3.2-1B-Instruct, created by Red Hat AI (Neural Magic). FP8 weight and activation quantization reduces GPU memory by approximately 50% while maintaining 98% of the original model's accuracy on OpenLLM benchmarks.
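The roughly 50% figure follows from simple arithmetic: BF16 stores each weight in 2 bytes and FP8 stores it in 1. The sketch below works this out for the 1.5B parameter count quoted on this page. It is only an approximation, since it ignores any tensors kept in higher precision, quantization scales, and runtime memory such as the KV cache.

```shell
# Rough weight-memory estimate: parameters x bytes per parameter.
# PARAMS is the 1.5B figure quoted on this page; the rest is plain arithmetic.
PARAMS=1500000000
BF16_BYTES=2   # 16-bit weights
FP8_BYTES=1    # 8-bit weights

bf16_mb=$(( PARAMS * BF16_BYTES / 1000000 ))
fp8_mb=$(( PARAMS * FP8_BYTES / 1000000 ))
saving_pct=$(( 100 - 100 * fp8_mb / bf16_mb ))

echo "BF16 weights: ${bf16_mb} MB"
echo "FP8 weights:  ${fp8_mb} MB"
echo "Saving:       ${saving_pct}%"
```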

Prerequisites

  • OpenShift 4.14+ with the NVIDIA GPU Operator installed
  • At least 1 NVIDIA GPU with 16GB+ VRAM (A10, L4, A100, H100, etc.)
  • Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
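One way to confirm the GPU prerequisite is to list the nodes that advertise GPUs. The sketch below assumes the GPU Operator has labelled GPU nodes with nvidia.com/gpu.present=true, which may differ on your cluster. The function name gpu_nodes and the OC variable (set it to kubectl if oc is unavailable) are illustrative, not part of any tool.

```shell
# List GPU nodes and how many GPUs each one can allocate.
# Assumption: the GPU Operator labels GPU nodes with nvidia.com/gpu.present=true.
gpu_nodes() {
  "${OC:-oc}" get nodes -l nvidia.com/gpu.present=true \
    -o 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Run gpu_nodes after logging in to the cluster. An empty list suggests the GPU Operator is not installed or no GPU nodes are ready.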

Quick Start

  1. Save the generated manifests above to deploy.yaml
  2. Create a namespace: oc new-project llm-serving
  3. Apply: oc apply -f deploy.yaml
  4. Wait for the pod to pull the model and start serving
  5. Get the endpoint URL: oc get inferenceservice -n llm-serving
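Step 4 can be scripted rather than watched by hand. The sketch below blocks until the InferenceService reports the Ready condition. The function name wait_for_isvc, the OC variable, and the 15-minute timeout are assumptions chosen for illustration; adjust the timeout to your image and model pull times.

```shell
# Block until KServe marks the InferenceService Ready, or time out.
# Defaults match the names used in the manifests on this page.
wait_for_isvc() {
  local name="${1:-llama-3-2-1b-instruct-fp8-dynamic}"
  local ns="${2:-llm-serving}"
  # OC can be set to kubectl if oc is not installed (illustrative override).
  "${OC:-oc}" wait --for=condition=Ready "inferenceservice/${name}" \
    -n "$ns" --timeout=15m
}
```

Call wait_for_isvc right after oc apply -f deploy.yaml. A non-zero exit status means the service did not become ready in time.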

Testing the Endpoint

export URL=$(oc get inferenceservice llama-3-2-1b-instruct-fp8-dynamic -n llm-serving -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
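If the request fails with a model-not-found error, the "model" field probably does not match the name the server registered. vLLM's OpenAI-compatible server lists its served models at /v1/models, so a quick check might look like the sketch below. The function name list_models and the CURL override are illustrative only.

```shell
# Print the models the endpoint is serving.
# $1 is the base URL of the InferenceService; a trailing slash is tolerated.
list_models() {
  # CURL defaults to curl; the override exists only to allow a dry run.
  "${CURL:-curl}" -s "${1%/}/v1/models"
}
```

With the URL variable exported as above, run list_models "$URL" and copy the returned model id into the "model" field of the chat request.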

Notes

  • Small enough to run on a single GPU with 16GB+ VRAM
  • Supports a 128K-token (131,072) context natively; lower --max-model-len if VRAM is limited
  • Uses compressed-tensors format for native vLLM FP8 support
  • Supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
