Prerequisites

Everything you need before deploying a model recipe.

1. OpenShift Cluster

You need an OpenShift 4.14+ cluster and cluster-admin access to it. The recipes generate Kubernetes custom resources (ServingRuntime and InferenceService), which the cluster accepts only after an inference platform is installed (see step 3).
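
Both requirements can be checked from the command line. The sketch below is one way to do it; the helper names version_at_least and check_cluster are illustrative and not part of the recipes, and the wildcard oc auth can-i query is only an approximation of cluster-admin access.

```shell
# Hypothetical helpers, not shipped with the recipes.

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), as provided by GNU coreutils.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Reads the cluster version, then checks for wildcard permissions.
check_cluster() {
  version="$(oc get clusterversion version -o jsonpath='{.status.desired.version}')" || return 1
  if ! version_at_least "$version" 4.14; then
    echo "OpenShift $version is older than 4.14" >&2
    return 1
  fi
  # Assumption: being allowed every verb on every resource in all
  # namespaces stands in for cluster-admin.
  if ! oc auth can-i '*' '*' --all-namespaces >/dev/null; then
    echo "current user does not appear to have cluster-admin access" >&2
    return 1
  fi
  echo "OpenShift $version with cluster-admin access"
}
```

Run check_cluster after oc login; it returns a non-zero status and prints a reason on stderr if either check fails.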

2. NVIDIA GPU Operator

Your cluster must have GPU-enabled worker nodes with the NVIDIA GPU Operator installed. The operator exposes nvidia.com/gpu as a schedulable resource. Each recipe specifies how many GPUs and how much VRAM it requires.

3. Red Hat OpenShift AI (RHOAI)

Install the Red Hat OpenShift AI operator from OperatorHub. This provides KServe and the model serving stack that the ServingRuntime and InferenceService manifests target. Alternatively, you can install KServe standalone.
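
Whichever route you take, both custom resource definitions must be registered in the cluster before a recipe can be applied. A minimal sketch of that check follows; the function name check_kserve_crds is illustrative, while the CRD names are the ones KServe registers under the serving.kserve.io API group.

```shell
# Hypothetical helper: succeeds only if both KServe CRDs are registered.
check_kserve_crds() {
  oc get crd \
    inferenceservices.serving.kserve.io \
    servingruntimes.serving.kserve.io \
    >/dev/null
}

# Usage: check_kserve_crds && echo "KServe CRDs found"
```

Unlike a check against the operator's ClusterServiceVersion, this also works for a standalone KServe install.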

4. vLLM Serving Runtime Image

The recipes use quay.io/modh/vllm:latest as the default container image. This image ships vLLM pre-installed and is maintained by Red Hat. Your cluster nodes need to be able to pull from quay.io.

5. Hugging Face Access (for gated models)

Some models (for example, Llama and certain Mistral models) are gated on Hugging Face: you must accept the license on the model page and supply an access token. Create a Kubernetes secret with your token in the project (namespace) where the InferenceService will run:

oc create secret generic hf-token \
  --from-literal=HF_TOKEN=hf_your_token_here

Then reference the secret from the InferenceService, for example as an environment variable on the model container. Ungated models such as Gemma 4 and Mistral Small don't need a token.
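
For a gated model, the reference step might look like the sketch below, which builds a JSON merge patch that injects the secret into the model container as an environment variable. It rests on several assumptions that the recipes do not confirm: an existing InferenceService named my-model (a placeholder) in the same project as the secret, a manifest that uses spec.predictor.model, and a serving runtime that reads the token from HF_TOKEN. A merge patch replaces the whole env list, so include any variables already defined there.

```shell
# Hypothetical helper: prints a JSON merge patch that maps key HF_TOKEN of
# the secret named in $1 to an HF_TOKEN environment variable.
hf_token_patch() {
  printf '{"spec":{"predictor":{"model":{"env":[{"name":"HF_TOKEN","valueFrom":{"secretKeyRef":{"name":"%s","key":"HF_TOKEN"}}}]}}}}\n' "$1"
}

# Usage (my-model is a placeholder for your InferenceService name):
#   oc patch inferenceservice my-model --type=merge -p "$(hf_token_patch hf-token)"
```

Referencing the secret through secretKeyRef keeps the token out of the manifest itself, so the recipe files stay safe to commit.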

Quick Check

Verify your cluster is ready:

# GPUs visible?
oc get nodes -l nvidia.com/gpu.present=true

# RHOAI installed?
oc get csv -n redhat-ods-operator | grep rhods

# Can schedule GPUs?
oc describe node <gpu-node> | grep nvidia.com/gpu
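
The last check needs a node name. To summarize every GPU node at once, a sketch like the following lists allocatable GPUs per node and totals them. The function name gpu_summary is illustrative; the jsonpath expression escapes the dots in the nvidia.com/gpu resource name.

```shell
# Hypothetical helper: prints "<node> <gpus>" for each GPU node, then a total.
gpu_summary() {
  oc get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
    awk '{ print; total += $2 } END { print "total", total + 0 }'
}
```

Compare the total against the GPU count the recipe asks for. A total of 0 means no node is labeled as having a GPU or none currently advertises the resource.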