NVIDIA Nemotron 3 Nano 30B A3B FP8

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-nvidia-nemotron-3-nano-30b-a3b-fp8
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "NVIDIA Nemotron 3 Nano 30B A3B FP8 (fp8)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"
        - "--trust-remote-code"
        - "--max-num-seqs=8"
        - "--kv-cache-dtype=fp8"
        - "--max-model-len=32768"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: nvidia-nemotron-3-nano-30b-a3b-fp8
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-nvidia-nemotron-3-nano-30b-a3b-fp8
      storageUri: hf://nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8

Overview

NVIDIA Nemotron-3-Nano-30B-A3B is a hybrid Mamba-2/Transformer mixture-of-experts model with 30B total parameters and 3.5B active parameters per token. It uses 128 routed experts plus 1 shared expert per MoE layer. This recipe deploys the FP8-quantized variant on a single H100 80GB GPU.

The model supports togglable reasoning — it can produce chain-of-thought traces within <think> tags before answering, or skip reasoning for faster responses.
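When reasoning is enabled and the server is not configured to separate it, the chain-of-thought trace arrives inline with the answer. The sketch below shows one way a client might split the two. It assumes the trace ends with a closing </think> tag and that the opening <think> tag may or may not be present in the returned text; the helper name split_reasoning is hypothetical and not part of vLLM or the model's tooling.

```python
import re

# Optional opening <think> tag, then the reasoning text, then the closing tag.
_THINK_RE = re.compile(r"^\s*(?:<think>)?(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output.

    If no closing </think> tag is found, the whole text is treated as the
    answer and the reasoning is an empty string.
    """
    match = _THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>2 + 2 is 4.</think>The answer is 4."
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Keeping the trace separate lets an application log or hide the reasoning while showing only the final answer to users.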

Prerequisites

OpenShift 4.14+ with the NVIDIA GPU Operator installed
At least 1 NVIDIA H100 80GB GPU
Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually

Quick Start

Save the generated manifests above to deploy.yaml
Create a namespace: oc new-project llm-serving
Apply: oc apply -f deploy.yaml
Wait for the pod to pull the model and start serving
Get the endpoint URL: oc get inferenceservice nvidia-nemotron-3-nano-30b-a3b-fp8 -n llm-serving
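The "wait for the pod" step can be automated by polling the InferenceService status. The following is a minimal sketch, assuming the oc CLI is on the PATH and logged in, and that readiness is reported through a condition of type Ready with status "True" in the resource's status. The function names, the 15-second poll interval, and the 30-minute timeout are illustrative choices, not values from this recipe.

```python
import json
import subprocess
import time


def is_ready(isvc: dict) -> bool:
    """Return True if the InferenceService has a Ready condition with status "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def fetch_isvc(name: str, namespace: str) -> dict:
    """Fetch the InferenceService as a dict by calling `oc get ... -o json`."""
    result = subprocess.run(
        ["oc", "get", "inferenceservice", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def wait_until_ready(
    name: str,
    namespace: str,
    poll_seconds: float = 15.0,      # assumed interval
    timeout_seconds: float = 1800.0,  # assumed upper bound for the model download
) -> bool:
    """Poll until the service is Ready; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if is_ready(fetch_isvc(name, namespace)):
            return True
        time.sleep(poll_seconds)
    return False
```

A call such as wait_until_ready("nvidia-nemotron-3-nano-30b-a3b-fp8", "llm-serving") would block until the service reports Ready or the timeout elapses.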

Testing the Endpoint

export URL=$(oc get inferenceservice nvidia-nemotron-3-nano-30b-a3b-fp8 -n llm-serving -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
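The same request can be issued from Python using only the standard library. This is a sketch under the assumption that the endpoint accepts unauthenticated requests and returns the usual OpenAI-style choices[0].message.content field; build_chat_request and chat are hypothetical helper names.

```python
import json
import urllib.request

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"


def build_chat_request(
    base_url: str, prompt: str, model: str = MODEL_ID
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(base_url, prompt)) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

Calling chat(url, "Hello!") with the URL obtained from the InferenceService status would mirror the curl command above.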

Notes

--trust-remote-code is required for the hybrid Mamba-2/Transformer architecture
--kv-cache-dtype=fp8 reduces KV cache memory for longer sequences
--max-model-len=32768 caps the context window at 32K tokens so the KV cache fits in VRAM; the model natively supports up to 1M tokens
--max-num-seqs=8 caps the number of sequences processed concurrently, which bounds per-sequence cache memory
Supports six languages: English, German, Spanish, French, Italian, and Japanese
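To see how the --kv-cache-dtype, --max-model-len, and --max-num-seqs flags interact, a rough worst-case estimate of attention KV cache size can help. The sketch below uses the standard formula (keys plus values, per attention layer, per KV head, per token). The layer count, KV head count, and head dimension passed in the example are placeholders chosen for illustration, not the model's real configuration; read the actual values from the model's config.json. The estimate covers only the attention layers, since Mamba-2 layers keep a fixed-size state that does not grow with sequence length.

```python
def kv_cache_bytes(
    seq_len: int,
    num_attention_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int,
    num_seqs: int = 1,
) -> int:
    """Estimate attention KV cache size in bytes for fully filled sequences."""
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * num_attention_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * num_seqs


if __name__ == "__main__":
    # Placeholder architecture values (assumptions, NOT taken from the model card).
    arch = dict(num_attention_layers=6, num_kv_heads=2, head_dim=128)
    for label, width in (("16-bit", 2), ("fp8", 1)):
        total = kv_cache_bytes(32768, bytes_per_element=width, num_seqs=8, **arch)
        print(f"{label}: {total / 2**20:.0f} MiB for 8 sequences of 32768 tokens")
```

Whatever the real architecture values are, the estimate scales linearly with each flag: an 8-bit cache takes half the space of a 16-bit one, and doubling either the context length or the number of concurrent sequences doubles it.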