Overview
NVIDIA Nemotron-3-Nano-30B-A3B is a hybrid Mamba-2/Transformer mixture-of-experts model with 30B total parameters and 3.5B active parameters per token. It uses 128 routed experts plus 1 shared expert per MoE layer. This recipe deploys the FP8-quantized variant on a single H100 80GB GPU.
The model supports togglable reasoning — it can produce chain-of-thought traces
within <think> tags before answering, or skip reasoning for faster responses.
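When reasoning is enabled and the serving runtime leaves the trace inline in the response text, a client may want to keep only the final answer. The following is a minimal sketch of that post-processing, assuming the trace arrives as a single <think>...</think> block ahead of the answer; the helper name strip_think and the sample text are hypothetical.

```shell
# Hypothetical helper: drop a <think>...</think> trace from model output.
# Reads the completion text on stdin and writes only the final answer.
strip_think() {
  # Temporarily turn newlines into carriage returns so the trace can span
  # several lines, delete it, restore newlines, then drop leading blank lines.
  tr '\n' '\r' | sed -e 's/<think>.*<\/think>//' | tr '\r' '\n' | sed -e '/./,$!d'
}

# Made-up sample output, for illustration only.
sample='<think>
The user greets me.
</think>
Hello! How can I help?'

answer=$(printf '%s\n' "$sample" | strip_think)
printf '%s\n' "$answer"
```

If the runtime is configured with a reasoning parser that already separates the trace into its own response field, this step is unnecessary.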
Prerequisites
- OpenShift 4.14+ with the NVIDIA GPU Operator installed
- At least 1 NVIDIA H100 80GB GPU
- Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
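The prerequisites above can be checked from a logged-in oc session with something like the sketch below. It assumes the GPU Operator has applied its usual nvidia.com/gpu.present and nvidia.com/gpu.product node labels and that KServe registers the inferenceservices.serving.kserve.io CRD; the function names version_ge and check_prereqs are hypothetical.

```shell
# True when version $1 is greater than or equal to version $2 (uses sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: verify cluster version, GPU nodes, and the KServe CRD.
check_prereqs() {
  server=$(oc get clusterversion version -o jsonpath='{.status.desired.version}') || return 1
  version_ge "$server" 4.14 || { echo "OpenShift $server is older than 4.14" >&2; return 1; }

  # List GPU nodes with their product label; look for an H100 80GB entry.
  oc get nodes -l nvidia.com/gpu.present=true -L nvidia.com/gpu.product || return 1

  # The InferenceService CRD is present when RHOAI or KServe is installed.
  oc get crd inferenceservices.serving.kserve.io >/dev/null ||
    { echo "KServe InferenceService CRD not found" >&2; return 1; }
}
```

Running check_prereqs after oc login prints the GPU nodes and returns non-zero if any check fails.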
Quick Start
- Save the generated manifests above to deploy.yaml
- Create a namespace:
  oc new-project llm-serving
- Apply:
  oc apply -f deploy.yaml
- Wait for the pod to pull the model and start serving
- Get the route:
  oc get inferenceservice
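Rather than polling by hand while the model downloads, the wait step can be scripted. This is a sketch assuming the InferenceService is named nvidia-nemotron-3-nano (the name used in the test command in this recipe) and that it reports a Ready condition; the function name wait_for_isvc and the 30-minute default timeout are arbitrary choices.

```shell
# Hypothetical helper: block until the InferenceService reports Ready.
# $1 = InferenceService name, $2 = optional timeout (default 30m).
wait_for_isvc() {
  name="$1"
  timeout="${2:-30m}"
  oc wait --for=condition=Ready "inferenceservice/$name" --timeout="$timeout"
}
```

For example, wait_for_isvc nvidia-nemotron-3-nano 45m returns once the service is ready, or exits non-zero when the timeout expires.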
Testing the Endpoint
export URL=$(oc get inferenceservice nvidia-nemotron-3-nano -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
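The request above returns a JSON document. To print only the assistant's text, the response can be piped through jq, as in this sketch. It assumes jq is installed and that the response follows the OpenAI-style chat completion layout (choices[0].message.content); the function name extract_reply and the sample response are made up for illustration.

```shell
# Hypothetical helper: print the assistant message from a chat completion response.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Made-up sample response, trimmed to the fields the filter reads.
sample='{"choices":[{"message":{"role":"assistant","content":"Hello there!"}}]}'
reply=$(printf '%s' "$sample" | extract_reply)
printf '%s\n' "$reply"
```

Against the live endpoint, append | extract_reply to the curl command.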
Notes
- --trust-remote-code is required for the hybrid Mamba-2/Transformer architecture
- --kv-cache-dtype=fp8 reduces KV cache memory for longer sequences
- --max-model-len=32768 limits context to 32K to fit in VRAM; the model natively supports up to 1M tokens
- --max-num-seqs=8 limits concurrent sequences to manage memory with the MoE routing
- Supports six languages: English, German, Spanish, French, Italian, and Japanese
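Put together, the flags in these notes would form a launch line like the one printed below. This sketch assumes the serving runtime is vLLM started through its vllm serve entry point, which the flags suggest but the notes do not state. It only prints the command, since running it requires vLLM and the GPU.

```shell
MODEL="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"

# Flags taken from the notes; raise --max-model-len only if VRAM allows.
VLLM_ARGS="--trust-remote-code --kv-cache-dtype=fp8 --max-model-len=32768 --max-num-seqs=8"

# Print rather than execute the assumed launch command.
vllm_cmd="vllm serve $MODEL $VLLM_ARGS"
printf '%s\n' "$vllm_cmd"
```

In the KServe deployment these values normally live in the container args of the manifest rather than on a command line.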