Red Hat AI

DiffusionGemma 26B A4B IT FP8

FP8-quantized 26B mixture-of-experts diffusion language model (4B active). Fits on a single H100 GPU.

26B · 4B active · 8,192 ctx · text · moe
These manifests assume you have OpenShift with the GPU Operator and RHOAI/KServe installed; see Prerequisites.

FP8 dynamic quantization on a single H100 80GB

Deploy with oc apply

Save to a file and run oc apply -f deploy.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-diffusiongemma-26b-a4b-it-fp8-dynamic
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "DiffusionGemma 26B A4B IT FP8 (fp8)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic"
        - "--trust-remote-code"
        - "--max-num-seqs=4"
        - "--max-model-len=8192"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
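      env:
        - name: VLLM_USE_V2_MODEL_RUNNER
          value: "1"
      volumeMounts:
        # Memory-backed /dev/shm, sized per the 16Gi recommendation in Notes
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 16Gi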
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: diffusiongemma-26b-a4b-it-fp8-dynamic
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-diffusiongemma-26b-a4b-it-fp8-dynamic
      storageUri: hf://RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic

Overview

DiffusionGemma 26B A4B IT is a mixture-of-experts diffusion language model with 26B total parameters and 4B active parameters per token. This Red Hat AI recipe uses FP8 dynamic quantization to fit the model on a single H100 80GB GPU.

Unlike autoregressive models, DiffusionGemma generates text by iterative denoising, so it requires the v2 model runner in vLLM (enabled in the manifest with VLLM_USE_V2_MODEL_RUNNER=1).

Prerequisites

  • OpenShift 4.14+ with the NVIDIA GPU Operator installed
  • At least 1 NVIDIA H100 80GB GPU
  • Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
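The prerequisites above can be checked from the command line before deploying. The following is a minimal sketch, not an official check: the node label nvidia.com/gpu.present=true is assumed to be the one applied by the GPU Operator's feature discovery, and the function name check_prereqs and the OC override are illustrative only.

```shell
#!/bin/sh
# Sketch: read-only checks for the prerequisites listed above.
# Set OC=echo to print the commands instead of running them.
OC="${OC:-oc}"

check_prereqs() {
  # Client and cluster versions (the recipe asks for OpenShift 4.14+).
  $OC version || return 1
  # GPU nodes, selected by the discovery label (assumed label name).
  # An empty list means no node was labeled as having a GPU.
  $OC get nodes -l nvidia.com/gpu.present=true || return 1
  # The KServe CRDs that the manifests rely on.
  $OC get crd servingruntimes.serving.kserve.io inferenceservices.serving.kserve.io || return 1
}
```

Run check_prereqs while logged in to the cluster. A missing CRD makes the last command fail, which usually means RHOAI or KServe is not installed.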

Quick Start

  1. Save the generated manifests above to deploy.yaml
  2. Create a namespace: oc new-project llm-serving
  3. Apply: oc apply -f deploy.yaml
  4. Wait for the pod to pull the model and start serving
  5. Get the route: oc get inferenceservice
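The steps above can be sketched as a single script. This is an illustration under a few assumptions: the manifests are saved as deploy.yaml in the current directory, the 30-minute timeout is an arbitrary allowance for the image pull and model download, and the function name deploy and the OC override are hypothetical helpers rather than part of the recipe.

```shell
#!/bin/sh
# Sketch of the Quick Start steps. Set OC=echo for a dry run.
OC="${OC:-oc}"
NAMESPACE="llm-serving"
ISVC="diffusiongemma-26b-a4b-it-fp8-dynamic"

deploy() {
  # Step 2: create the namespace (fails if the project already exists).
  $OC new-project "$NAMESPACE" || return 1
  # Step 3: apply the ServingRuntime and InferenceService.
  $OC apply -f deploy.yaml || return 1
  # Step 4: block until the InferenceService reports Ready (timeout is an assumption).
  $OC wait --for=condition=Ready "inferenceservice/$ISVC" -n "$NAMESPACE" --timeout=30m || return 1
  # Step 5: print the serving URL.
  $OC get inferenceservice "$ISVC" -n "$NAMESPACE" -o jsonpath='{.status.url}'
}
```

Calling deploy runs the four commands in order and stops at the first failure, so a timeout in the wait step leaves the resources in place for inspection.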

Testing the Endpoint

export URL=$(oc get inferenceservice diffusiongemma-26b-a4b-it-fp8-dynamic -n llm-serving -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
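For repeated checks, the request can be wrapped in small helpers. The sketch below assumes the endpoint is reachable at the URL passed in and that the server exposes the OpenAI-compatible /v1/models listing alongside /v1/chat/completions; chat_payload and smoke_test are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
MODEL="RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic"

# Build a chat-completions body for one user message.
# The prompt is inserted verbatim, so it must not contain
# double quotes or backslashes (no JSON escaping is done here).
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' "$MODEL" "$1"
}

# smoke_test URL PROMPT: list the served models, then send one prompt.
# -f makes curl return non-zero on HTTP errors.
smoke_test() {
  curl -sf "$1/v1/models" || return 1
  curl -sf "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$2")"
}
```

For example, smoke_test "$URL" "Hello!" first confirms that the model name appears in the listing and then sends the same request as the curl command above.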

Notes

  • Requires the VLLM_USE_V2_MODEL_RUNNER=1 environment variable
  • --trust-remote-code is required for this model
  • --max-num-seqs is set to 4 to limit memory use during diffusion decoding
  • Uses shared memory (/dev/shm) for inter-process communication; 16Gi is recommended

References