Overview
Llama 3.2 1B Instruct FP8 Dynamic is an FP8-quantized version of Meta's Llama-3.2-1B-Instruct, created by Red Hat AI (Neural Magic). Quantizing both weights and activations to FP8 reduces GPU memory requirements by approximately 50% relative to the 16-bit original, while retaining 98% of the original model's accuracy on OpenLLM benchmarks.
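The roughly 50% figure follows from storing each weight in one byte instead of two. The sketch below works through that arithmetic. The parameter count is an approximation chosen for illustration, not a measured value, and the estimate covers weights only. Real savings may be somewhat smaller if some layers are kept at higher precision.

```python
# Rough weight-memory estimate. PARAM_COUNT is an approximation (assumption),
# and the result ignores activations, KV cache, and runtime overhead.
PARAM_COUNT = 1.24e9  # approximate parameter count of Llama-3.2-1B


def weight_gib(params: float, bytes_per_param: int) -> float:
    """Return the memory needed to hold `params` weights, in GiB."""
    return params * bytes_per_param / 2**30


bf16_gib = weight_gib(PARAM_COUNT, 2)  # 16-bit baseline: 2 bytes per weight
fp8_gib = weight_gib(PARAM_COUNT, 1)   # FP8: 1 byte per weight
savings = 1 - fp8_gib / bf16_gib

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"FP8 weights:    {fp8_gib:.2f} GiB")
print(f"Reduction:      {savings:.0%}")
```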
Prerequisites
- OpenShift 4.14+ with the NVIDIA GPU Operator installed
- At least 1 NVIDIA GPU with 16GB+ VRAM (A10, L4, A100, H100, etc.)
- Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
Quick Start
- Save the generated manifests above to deploy.yaml
- Create a namespace: oc new-project llm-serving
- Apply: oc apply -f deploy.yaml
- Wait for the pod to pull the model and start serving
- Get the route: oc get inferenceservice
Testing the Endpoint
export URL=$(oc get inferenceservice llama-3-2-1b-instruct-fp8-dynamic -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
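The same request can be sent from Python. This is a minimal sketch using only the standard library. The helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the server returns an OpenAI-style body with `choices[0].message.content`.

```python
import json
import urllib.request

MODEL = "RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic"


def build_chat_request(base_url: str, prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With `URL` exported as above, calling `chat(os.environ["URL"], "Hello!")` after `import os` should return the model's reply as a string. If the route uses a self-signed certificate, `urlopen` will reject it unless a suitable SSL context is supplied.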
Notes
- Small enough to run on a single GPU with 16GB+ VRAM
- Supports 128K context natively; reduce --max-model-len if VRAM is limited
- Uses compressed-tensors format for native vLLM FP8 support
- Supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
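To see why lowering --max-model-len helps when VRAM is limited, it can be useful to estimate how the KV cache grows with context length. The sketch below is a back-of-the-envelope calculation, not a measurement. The layer count, KV head count, head dimension, and 16-bit cache precision are assumptions about Llama-3.2-1B and should be checked against the model's config.json.

```python
# Architecture values assumed for Llama-3.2-1B; verify against config.json.
NUM_LAYERS = 16      # assumed num_hidden_layers
NUM_KV_HEADS = 8     # assumed num_key_value_heads
HEAD_DIM = 64        # assumed per-head dimension
KV_DTYPE_BYTES = 2   # assumes a 16-bit KV cache


def kv_cache_gib(context_tokens: int) -> float:
    """Memory for one sequence's KV cache at the given context length, in GiB."""
    # The factor of 2 covers the separate key and value tensors in every layer.
    bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES
    return context_tokens * bytes_per_token / 2**30


for max_model_len in (8192, 32768, 131072):
    gib = kv_cache_gib(max_model_len)
    print(f"--max-model-len {max_model_len}: {gib:.2f} GiB per full-length sequence")
```

Under these assumptions a single full-length 128K sequence needs several GiB of cache on top of the weights, so a smaller --max-model-len leaves more room for concurrent requests.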