Google

Gemma 4 12B IT

Encoder-free multimodal 12B model supporting text, image, and audio input. Fits on a single GPU.

12B · 256,000 ctx · multimodal · dense
These manifests assume an OpenShift cluster with the NVIDIA GPU Operator and RHOAI or KServe installed; see Prerequisites below.

Full BF16 precision on a single H100 80GB

Deploy with oc apply

Save the manifest below as deploy.yaml, then run oc apply -f deploy.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-gemma-4-12b-it
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "Gemma 4 12B IT (bf16)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "google/gemma-4-12b-it"
        - "--max-num-seqs=16"
        - "--max-model-len=256000"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma-4-12b-it
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-gemma-4-12b-it
      storageUri: hf://google/gemma-4-12b-it

Overview

Gemma 4 12B IT is Google's encoder-free multimodal model. It handles text, image, and audio inputs by projecting raw patches directly into the LLM embedding space, and it is released under the Apache 2.0 license.

Prerequisites

  • OpenShift 4.14+ with the NVIDIA GPU Operator installed
  • At least 1 NVIDIA GPU with 40GB+ VRAM (A100 40GB, A100 80GB, H100, etc.)
  • Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
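
The checks below are one way to confirm these prerequisites from a terminal before applying the manifests. This is a sketch rather than an official procedure: the helper name check_prereqs is hypothetical, and it assumes the GPU Operator's feature discovery has labelled GPU nodes with nvidia.com/gpu.present=true and that the KServe CRDs are registered under serving.kserve.io.

```shell
# Hypothetical helper: sanity-check the cluster before deploying.
# Assumes you are logged in with `oc` and are allowed to list nodes and CRDs.
check_prereqs() {
  # Client and server versions (the recipe expects OpenShift 4.14 or later).
  oc version

  # Nodes labelled as having an NVIDIA GPU (label assumed to be set by
  # the GPU Operator's feature discovery).
  oc get nodes -l nvidia.com/gpu.present=true

  # KServe CRDs used by the manifests (installed by RHOAI or by KServe itself).
  oc get crd inferenceservices.serving.kserve.io servingruntimes.serving.kserve.io
}
```

Run check_prereqs after oc login; an empty node list or a missing CRD suggests the corresponding operator is not installed or not yet ready.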

Quick Start

  1. Save the generated manifests above to deploy.yaml
  2. Create a namespace: oc new-project llm-serving
  3. Apply: oc apply -f deploy.yaml
  4. Wait for the pod to pull the model and start serving
  5. Get the endpoint URL: oc get inferenceservice gemma-4-12b-it -n llm-serving
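
Steps 4 and 5 can take a while because the pod has to download the model weights. The helpers below are a minimal sketch of how to wait for readiness and watch progress. The function names and the 30m default timeout are illustrative choices, and the pod label serving.kserve.io/inferenceservice is assumed to be the one KServe applies to predictor pods.

```shell
# Names taken from the manifests above.
NAMESPACE="llm-serving"
ISVC="gemma-4-12b-it"

# Block until KServe reports the InferenceService as Ready.
# Optional first argument: timeout (defaults to 30m, an arbitrary choice
# meant to leave time for the model download).
wait_for_isvc() {
  oc wait --for=condition=Ready "inferenceservice/${ISVC}" \
    -n "${NAMESPACE}" --timeout="${1:-30m}"
}

# Stream the vLLM container logs to watch the model download and startup.
# Assumes the usual KServe pod label serving.kserve.io/inferenceservice=<name>.
follow_logs() {
  oc logs -f -n "${NAMESPACE}" \
    -l "serving.kserve.io/inferenceservice=${ISVC}" \
    -c kserve-container
}
```

For example, run follow_logs in one terminal and wait_for_isvc 45m in another; once the wait returns successfully, the InferenceService status should include its URL.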

Testing the Endpoint

export URL=$(oc get inferenceservice gemma-4-12b-it -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
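
Before sending chat requests, it can help to confirm that the server is up and to see which model name it expects. The sketch below uses the /health and /v1/models endpoints of vLLM's OpenAI-compatible server; the helper name smoke_test is hypothetical.

```shell
# Hypothetical helper: check that the server is reachable and list its models.
# Expects the endpoint base URL as its first argument.
smoke_test() {
  base_url="$1"

  # /health answers with HTTP 200 when the server is up; -f makes curl
  # return a non-zero status on HTTP errors.
  curl -sf "${base_url}/health" || return 1

  # /v1/models lists the served model ids; the "model" field of a chat
  # request has to match one of them.
  curl -s "${base_url}/v1/models"
}
```

Call it as smoke_test "$URL" with the URL exported above. If the first request fails, the pod is probably still loading the model.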

Notes

  • Supports variable image resolution via configurable token budgets
  • Has built-in thinking/reasoning mode (toggle with enable_thinking)
  • Native function calling for agentic workflows
  • The manifest above sets --max-model-len=256000, the model's full context window. If the GPU runs short of KV-cache memory, lower the value (for example, to 32768) to conserve VRAM
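
The image-input and thinking-mode notes above could be exercised with requests along these lines. This is a sketch under assumptions, not a confirmed recipe: the image URL is a placeholder, the helper name post_chat is hypothetical, and the thinking toggle assumes the model's chat template reads enable_thinking from vLLM's chat_template_kwargs request field.

```shell
# Image input: OpenAI-style content parts with an image_url entry.
# The URL is a placeholder; replace it with an image the pod can reach.
IMAGE_PAYLOAD='{
  "model": "google/gemma-4-12b-it",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this image."},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
    ]
  }]
}'

# Thinking mode: assumes the chat template honors enable_thinking when it is
# passed through chat_template_kwargs.
THINKING_PAYLOAD='{
  "model": "google/gemma-4-12b-it",
  "messages": [{"role": "user", "content": "What is 17 * 23?"}],
  "chat_template_kwargs": {"enable_thinking": true}
}'

# Hypothetical helper: POST a JSON payload to the chat completions endpoint.
# Arguments: base URL, JSON payload.
post_chat() {
  curl -s "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$2"
}
```

For example, post_chat "$URL" "$IMAGE_PAYLOAD" sends the image request to the endpoint from the testing section, and post_chat "$URL" "$THINKING_PAYLOAD" sends the thinking-mode request.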

References