Overview
Gemma 4 12B IT is Google's encoder-free multimodal model. It handles text, image, and audio inputs by projecting raw patches directly into the LLM embedding space. It is released under the Apache 2.0 license.
Prerequisites
- OpenShift 4.14+ with the NVIDIA GPU Operator installed
- At least 1 NVIDIA GPU with 40GB+ VRAM (A100 40GB, A100 80GB, H100, etc.)
- Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
Quick Start
- Save the generated manifests above to deploy.yaml
- Create a namespace: oc new-project llm-serving
- Apply: oc apply -f deploy.yaml
- Wait for the pod to pull the model and start serving
- Get the route: oc get inferenceservice
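The "wait for the pod" step can be scripted instead of polled by hand. The following is a minimal sketch, not part of the original recipe: `wait_for_isvc` is a hypothetical helper name, the default service name `gemma-4-12b-it` is taken from the test command in this recipe, and the 30-minute default timeout is an arbitrary assumption to allow for a large model download. It relies on the InferenceService exposing a `Ready` condition, which KServe normally does.

```shell
# Hypothetical helper: block until the InferenceService reports Ready.
# Assumes the service name used elsewhere in this recipe (gemma-4-12b-it)
# and that the current project is the one the manifests were applied to.
wait_for_isvc() {
  local name="${1:-gemma-4-12b-it}"
  local timeout="${2:-1800s}"   # assumed default; tune for your registry speed
  oc wait --for=condition=Ready "inferenceservice/${name}" --timeout="${timeout}"
}
```

Calling `wait_for_isvc` with no arguments uses the defaults; `wait_for_isvc gemma-4-12b-it 600s` shortens the timeout. The command exits non-zero if the service is not ready in time, so it can gate later steps in a script.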
Testing the Endpoint
export URL=$(oc get inferenceservice gemma-4-12b-it -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
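Before sending a chat request, it can help to confirm that the endpoint answers at all. The sketch below assumes the server exposes the OpenAI-compatible `/v1/models` listing alongside `/v1/chat/completions`; that path is an assumption, not something this recipe states. `smoke_test` is a hypothetical helper name.

```shell
# Hypothetical smoke test: succeed only if the endpoint answers HTTP 200.
# Assumes an OpenAI-compatible server that serves GET /v1/models.
smoke_test() {
  local url="$1"
  local status
  # -o /dev/null discards the body; -w prints only the HTTP status code.
  status=$(curl -s -o /dev/null -w '%{http_code}' "${url}/v1/models")
  if [ "$status" = "200" ]; then
    echo "endpoint ready"
  else
    echo "endpoint returned HTTP ${status}" >&2
    return 1
  fi
}
```

With `URL` exported as above, `smoke_test "$URL"` prints `endpoint ready` on success and returns a non-zero status otherwise.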
Notes
- Supports variable image resolution via configurable token budgets
- Has built-in thinking/reasoning mode (toggle with enable_thinking)
- Native function calling for agentic workflows
- Context length set to 32K in this recipe to conserve VRAM; increase --max-model-len if your GPU has headroom
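As an illustration of the thinking toggle, the sketch below builds a request body that sets `enable_thinking` per request. It assumes the serving runtime is vLLM (suggested by the `--max-model-len` flag) and that the flag is passed through vLLM's `chat_template_kwargs` request field; whether this model's chat template reads it that way is an assumption to verify against your deployment. `thinking_payload` and the sample prompt are hypothetical.

```shell
# Hypothetical helper: print a chat request body with thinking on or off.
# Assumes enable_thinking is forwarded via chat_template_kwargs (a vLLM
# request extension); confirm against your runtime before relying on it.
thinking_payload() {
  local enabled="${1:-true}"   # pass "true" or "false"
  cat <<EOF
{
  "model": "google/gemma-4-12b-it",
  "messages": [{"role": "user", "content": "What is 17 * 23?"}],
  "chat_template_kwargs": {"enable_thinking": ${enabled}}
}
EOF
}
```

The payload can be piped straight into curl, which reads the body from standard input with `-d @-`: `thinking_payload false | curl -s "$URL/v1/chat/completions" -H "Content-Type: application/json" -d @-`.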