Overview
DiffusionGemma 26B A4B IT is a mixture-of-experts diffusion language model with 26B total parameters and 4B active parameters per token. This Red Hat AI recipe uses FP8 dynamic quantization to fit the model on a single H100 80GB GPU.
Unlike autoregressive models, DiffusionGemma generates text via iterative denoising, so it requires the v2 model runner in vLLM.
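As a rough check on the single-GPU claim, the weight footprint can be estimated from the parameter count. This is a minimal sketch that assumes about 1 byte per parameter for FP8 and 2 bytes for a 16-bit format. It ignores the KV cache, activations, and any layers left unquantized, so the real numbers will differ. Note that all 26B parameters must be resident in GPU memory even though only 4B are active per token. The helper name weights_gb is hypothetical.

```shell
# Back-of-the-envelope weight memory estimate (assumption: decimal GB,
# 1 byte/param for FP8, 2 bytes/param for a 16-bit format).
weights_gb() {
  # $1 = parameters in billions, $2 = bytes per parameter
  echo $(( $1 * $2 ))
}

gpu_gb=80                      # H100 80GB, from the prerequisites
fp8=$(weights_gb 26 1)
bf16=$(weights_gb 26 2)

echo "FP8 weights:    ~${fp8} GB, leaving ~$(( gpu_gb - fp8 )) GB for everything else"
echo "16-bit weights: ~${bf16} GB, leaving ~$(( gpu_gb - bf16 )) GB for everything else"
```

Under these assumptions, FP8 roughly halves the weight footprint, which leaves more of the 80 GB for the KV cache and activations.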
Prerequisites
- OpenShift 4.14+ with the NVIDIA GPU Operator installed
- At least 1 NVIDIA H100 80GB GPU
- Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
Quick Start
- Save the generated manifests above to deploy.yaml
- Create a namespace: oc new-project llm-serving
- Apply: oc apply -f deploy.yaml
- Wait for the pod to pull the model and start serving
- Get the route: oc get inferenceservice
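The wait and route steps above can be scripted rather than polled by hand. The sketch below assumes the InferenceService is named diffusiongemma (as in the test command in this recipe) and lives in the llm-serving namespace. The wait_for_isvc helper and its 20m default timeout are hypothetical, so adjust them to how long the model pull takes in your cluster.

```shell
# Sketch: block until the InferenceService reports Ready, then print its URL.
# Assumptions: resource name and namespace match this recipe; the default
# timeout is arbitrary.
wait_for_isvc() {
  name=${1:-diffusiongemma}
  timeout=${2:-20m}
  # oc wait returns non-zero if the condition is not met before the timeout,
  # in which case the URL lookup is skipped.
  oc wait --for=condition=Ready "inferenceservice/$name" \
    -n llm-serving --timeout="$timeout" &&
  oc get inferenceservice "$name" -n llm-serving \
    -o jsonpath='{.status.url}'
}
```

Call it as wait_for_isvc, or as wait_for_isvc diffusiongemma 45m to allow more time for a slow image or model download.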
Testing the Endpoint
export URL=$(oc get inferenceservice diffusiongemma -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
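If the chat request fails, it helps to check the server itself first. The sketch below assumes the route exposes vLLM's OpenAI-compatible server directly, so its /health and /v1/models endpoints are reachable without authentication. If token auth is enabled on the route, an Authorization header would also be needed. The helper names endpoint_ready and list_models are hypothetical.

```shell
# Sketch: confirm the server is up before sending chat requests.
endpoint_ready() {
  # $1 = base URL. /health answers HTTP 200 once the server accepts requests.
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1/health")
  [ "$code" = "200" ]
}

list_models() {
  # $1 = base URL. /v1/models lists the names accepted in the "model" field.
  curl -s "$1/v1/models"
}
```

With URL set as above, run endpoint_ready "$URL" && list_models "$URL" and check that the model name in the response matches the one sent in the chat request.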
Notes
- Requires the VLLM_USE_V2_MODEL_RUNNER=1 environment variable
- --trust-remote-code is required for this model
- --max-num-seqs is set to 4 to manage memory with diffusion decoding
- Uses shared memory (/dev/shm) for inter-process communication; 16Gi recommended
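For reference, here is a sketch of how the environment variable and flags from these notes would appear on a plain vLLM command line. It assumes vLLM is installed locally with a version that supports this model, and that the recipe's manifests set the same values on the serving container. The serve_diffusiongemma wrapper is hypothetical, and the /dev/shm sizing is a pod-level setting that is not shown here.

```shell
# Sketch: the settings from the notes expressed as a vLLM invocation.
# Extra arguments (for example --port) are passed through unchanged.
serve_diffusiongemma() {
  VLLM_USE_V2_MODEL_RUNNER=1 vllm serve \
    RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic \
    --trust-remote-code \
    --max-num-seqs 4 \
    "$@"
}
```

Calling serve_diffusiongemma --port 8000 would start the server with at most 4 sequences scheduled concurrently.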