Gemma 4 12B IT

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-gemma-4-12b-it
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "Gemma 4 12B IT (bf16)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "google/gemma-4-12b-it"
        - "--max-num-seqs=16"
        - "--max-model-len=256000"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma-4-12b-it
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-gemma-4-12b-it
      storageUri: hf://google/gemma-4-12b-it

Overview

Gemma 4 12B IT is Google's encoder-free multimodal model. It handles text, image, and audio inputs by projecting raw patches directly into the LLM embedding space, and it is released under the Apache 2.0 license.

Prerequisites

OpenShift 4.14+ with the NVIDIA GPU Operator installed
At least 1 NVIDIA GPU with 40GB+ VRAM (A100 40GB, A100 80GB, H100, etc.)
Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
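
As a rough check on the 40GB+ requirement, the memory taken by the weights alone can be estimated from the parameter count and the bf16 width. The sketch below assumes a nominal 12 billion parameters at 2 bytes each (both read off the model name and display name above) and ignores KV cache, activations, and runtime overhead, all of which need additional headroom. The function name is illustrative.

```python
def weights_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in GiB (weights only, no KV cache)."""
    return params_billion * 1e9 * bytes_per_param / 2**30


# Nominal 12B parameters stored in bf16 (2 bytes per parameter).
estimate = weights_gib(12)
print(f"weights: ~{estimate:.1f} GiB")
```

Under these assumptions the weights take a little over 22 GiB, so the rest of a 40 GB card is what remains for the KV cache and everything else.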

Quick Start

Save the generated manifests above to deploy.yaml
Create a namespace: oc new-project llm-serving
Apply: oc apply -f deploy.yaml
Wait for the pod to pull the model and start serving
Get the endpoint URL: oc get inferenceservice gemma-4-12b-it
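
The waiting step can also be scripted from the client side. The following is a minimal sketch that polls the server's /v1/models endpoint until it answers with HTTP 200, assuming the URL reported by the InferenceService is reachable from where the script runs. The function names, the 30-minute default, and the 10-second interval are illustrative choices, not part of the recipe.

```python
import http.client
import time
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_until_ready(
    base_url: str,
    max_wait_s: float = 1800.0,
    interval_s: float = 10.0,
    probe: Callable[[str], bool] = models_endpoint_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe succeeds or max_wait_s of sleep time has elapsed."""
    waited = 0.0
    while True:
        if probe(base_url):
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Calling wait_until_ready with the URL from the last step returns True once the server responds and False if the wait budget runs out. The probe and sleep parameters exist only so the loop can be exercised without a live endpoint.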

Testing the Endpoint

export URL="$(oc get inferenceservice gemma-4-12b-it -o jsonpath='{.status.url}')"
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
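
The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, the base URL is whatever the InferenceService reports, and the response is assumed to follow the OpenAI chat completions shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    prompt: str,
    model: str = "google/gemma-4-12b-it",
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl call above."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str, timeout_s: float = 120.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    req = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With URL exported as above, chat(os.environ["URL"], "Hello!") would return the reply text. Building the request separately from sending it keeps the payload easy to inspect before anything goes over the network.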

Notes

Supports variable image resolution via configurable token budgets
Has built-in thinking/reasoning mode (toggle with enable_thinking)
Native function calling for agentic workflows
Context length is controlled by --max-model-len, which the manifest above sets to 256000. If the KV cache does not fit on your GPU, lower it (for example to 32768) to conserve VRAM
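
The image input and thinking toggle mentioned above can be illustrated with a request payload. This is a sketch under assumptions: the image part uses the OpenAI-style image_url content format, and enable_thinking is passed through chat_template_kwargs, a vLLM request extension. Whether this model's chat template reads the flag from there is an assumption drawn from the note above, not something the recipe confirms. The function name is hypothetical.

```python
from typing import Any, Dict, Optional


def build_payload(
    text: str,
    image_url: Optional[str] = None,
    enable_thinking: Optional[bool] = None,
    model: str = "google/gemma-4-12b-it",
) -> Dict[str, Any]:
    """Build a chat completions payload with optional image and thinking flag."""
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        # OpenAI-style image part; the URL may be http(s) or a data: URI.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    if enable_thinking is not None:
        # ASSUMPTION: the chat template reads enable_thinking from these kwargs.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload
```

The returned dictionary can be serialized with json.dumps and posted to /v1/chat/completions in place of the text-only body used in the test above. Leaving enable_thinking unset omits the extra field entirely, so the server falls back to its default behavior.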
