Deploy LLMs on OpenShift

Copy-paste deployment manifests for running open-weight models with vLLM on OpenShift. Pick a model, choose a variant, and apply the manifest with oc apply -f.

Latest recipes

Red Hat AI

FP8-quantized, instruction-tuned Llama 3.2 model with 1.5B parameters. Reduces GPU memory use by ~50% vs. BF16 with minimal accuracy loss.
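
The ~50% figure follows from storage width: BF16 uses 2 bytes per parameter and FP8 uses 1. The sketch below is a rough, weights-only estimate using the parameter counts quoted in these listings. It ignores KV cache, activations, and runtime overhead, so it is a lower bound and not a sizing guarantee. The helper name weight_memory_gb is hypothetical and is not part of vLLM or OpenShift.

```python
# Weights-only memory estimate: a lower bound, not a sizing guarantee.
# Assumption: every parameter is stored at the stated width.
BF16_BYTES = 2  # bytes per parameter in BF16
FP8_BYTES = 1   # bytes per parameter in FP8


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    params_billion * 1e9 parameters * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billion * bytes_per_param.
    """
    return params_billion * bytes_per_param


# Total parameter counts taken from the listings on this page.
for label, params in [("1.5B", 1.5), ("26B", 26.0), ("30B", 30.0)]:
    bf16 = weight_memory_gb(params, BF16_BYTES)
    fp8 = weight_memory_gb(params, FP8_BYTES)
    saving = 1 - fp8 / bf16
    print(f"{label}: BF16 ~{bf16:.1f} GB, FP8 ~{fp8:.1f} GB, saving {saving:.0%}")
```

For a mixture-of-experts model, the total parameter count (not the active count) determines weight memory, because all experts must be resident on the GPU.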

Nvidia

FP8-quantized 30B hybrid Mamba-2/Transformer MoE reasoning model (3.5B active). Supports togglable chain-of-thought and fits on a single H100 GPU.

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models...

Red Hat AI

FP8-quantized 26B mixture-of-experts diffusion language model (4B active). Fits on a single H100 GPU.

Google

Encoder-free multimodal 12B model supporting text, image, and audio input. Fits on a single GPU.