DiffusionGemma 26B A4B IT FP8

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-diffusiongemma-26b-a4b-it-fp8-dynamic
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "DiffusionGemma 26B A4B IT FP8 (fp8)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic"
        - "--trust-remote-code"
        - "--max-num-seqs=4"
        - "--max-model-len=8192"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
      env:
        - name: VLLM_USE_V2_MODEL_RUNNER
          value: "1"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: diffusiongemma-26b-a4b-it-fp8-dynamic
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-diffusiongemma-26b-a4b-it-fp8-dynamic
      storageUri: hf://RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic

Overview

DiffusionGemma 26B A4B IT is a mixture-of-experts diffusion language model with 26B total parameters and 4B active parameters per token. This Red Hat AI recipe uses FP8 dynamic quantization to fit the model on a single H100 80GB GPU.

Unlike autoregressive models, DiffusionGemma generates text via iterative denoising, so it requires the v2 model runner in vLLM (enabled in the manifest with VLLM_USE_V2_MODEL_RUNNER=1).

Prerequisites

OpenShift 4.14+ with the NVIDIA GPU Operator installed
At least 1 NVIDIA H100 80GB GPU
Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
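
One way to confirm the GPU prerequisite is to list the nodes that the GPU Operator has labeled. This is a minimal sketch: check_gpu_nodes is a hypothetical helper name, and it assumes the operator's feature discovery applies the nvidia.com/gpu.present and nvidia.com/gpu.product labels, which may differ on your cluster.

```shell
# Hypothetical helper: list GPU nodes and the GPU model reported on each.
# Assumes the GPU Operator's feature discovery has labeled the nodes with
# nvidia.com/gpu.present and nvidia.com/gpu.product.
check_gpu_nodes() {
  oc get nodes \
    -l nvidia.com/gpu.present=true \
    -L nvidia.com/gpu.product
}
```

Run check_gpu_nodes after logging in with oc and look for an H100 80GB entry in the last column. An empty list suggests the operator is not installed or the labels are named differently.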

Quick Start

Save the generated manifests above to deploy.yaml
Create a namespace: oc new-project llm-serving
Apply: oc apply -f deploy.yaml
Wait for the pod to pull the model and start serving
Get the route: oc get inferenceservice
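
The steps above can be sketched as a single shell function. The resource names come from the manifests in this recipe; deploy_model is a hypothetical helper name, and the 30-minute timeout is a guess rather than a measured value, since the first start has to download the model.

```shell
# Names taken from the manifests in this recipe.
NAMESPACE="llm-serving"
ISVC="diffusiongemma-26b-a4b-it-fp8-dynamic"

# Hypothetical helper that runs the Quick Start steps in order.
# Assumes deploy.yaml is in the current directory and you are logged in with oc.
deploy_model() {
  # Create the project, or switch to it if it already exists.
  oc new-project "$NAMESPACE" || oc project "$NAMESPACE" || return 1

  oc apply -f deploy.yaml || return 1

  # Block until KServe reports the InferenceService as Ready.
  # The 30m timeout is an assumption; adjust it for your network and storage.
  oc wait --for=condition=Ready "inferenceservice/$ISVC" \
    -n "$NAMESPACE" --timeout=30m || return 1

  # Print the URL that KServe publishes in the resource status.
  oc get inferenceservice "$ISVC" -n "$NAMESPACE" -o jsonpath='{.status.url}'
}
```

Calling deploy_model prints the service URL once the InferenceService is Ready. If the wait times out, oc describe on the InferenceService and oc logs on the predictor pod are the usual places to look.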

Testing the Endpoint

export URL=$(oc get inferenceservice diffusiongemma-26b-a4b-it-fp8-dynamic -n llm-serving -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
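
Before sending chat requests, it can help to confirm that the server is up and to see which model name it serves. The sketch below assumes the vLLM OpenAI-compatible server exposes its /health and /v1/models routes at the InferenceService URL; check_endpoint is a hypothetical helper name.

```shell
# Hypothetical helper: confirm the server is up before sending chat requests.
# Assumes the vLLM OpenAI-compatible server's /health and /v1/models routes
# are reachable at the URL passed as the first argument.
check_endpoint() {
  local url="$1"

  # -f makes curl return non-zero on HTTP errors such as 503.
  if ! curl -sf "$url/health" > /dev/null; then
    echo "endpoint not ready: $url" >&2
    return 1
  fi

  # List the served models; the id should match the "model" field in requests.
  curl -sf "$url/v1/models"
}
```

For example, check_endpoint "$URL" should print a JSON list of models once the pod has finished loading weights.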

Notes

Requires the VLLM_USE_V2_MODEL_RUNNER=1 environment variable, which the ServingRuntime above already sets
--trust-remote-code is required for this model
--max-num-seqs is set to 4 to manage memory with diffusion decoding
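Uses shared memory (/dev/shm) for inter-process communication; 16Gi is recommended. The manifests above do not define a /dev/shm volume, so add one to the ServingRuntime if the default size is too small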
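
The manifests in this recipe do not mount a /dev/shm volume, so the 16Gi recommendation needs an extra volume on the ServingRuntime. The function below prints a fragment to merge by hand into deploy.yaml. It is a sketch that assumes the ServingRuntime accepts pod-level volumes and container volumeMounts; the helper name print_shm_fragment and the volume name shm are illustrative.

```shell
# Hypothetical helper: print a fragment to merge by hand into the
# ServingRuntime in deploy.yaml. The volume name "shm" is illustrative.
print_shm_fragment() {
  cat <<'EOF'
spec:
  containers:
    - name: kserve-container
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory   # back the volume with RAM (tmpfs)
        sizeLimit: 16Gi  # the size recommended in the notes above
EOF
}
```

Data written to a memory-backed emptyDir generally counts against the container's memory limit (24Gi in this recipe), so the limit may need to be raised if shared memory usage approaches 16Gi.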
