
Nvidia

NVIDIA Nemotron 3 Nano 30B A3B FP8

FP8-quantized 30B hybrid Mamba-2/Transformer MoE reasoning model (3.5B active). Supports togglable chain-of-thought and fits on a single H100 GPU.

30B · 3.5B active · 32,768 ctx · text · moe
These manifests assume an OpenShift cluster with the GPU Operator and RHOAI/KServe installed. See Prerequisites below.

FP8 quantization on a single H100 80GB with 32K context

Deploy with oc apply

Save the manifests below as deploy.yaml, then run oc apply -f deploy.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-nvidia-nemotron-3-nano-30b-a3b-fp8
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "NVIDIA Nemotron 3 Nano 30B A3B FP8 (fp8)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"
        - "--trust-remote-code"
        - "--max-num-seqs=8"
        - "--kv-cache-dtype=fp8"
        - "--max-model-len=32768"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: nvidia-nemotron-3-nano-30b-a3b-fp8
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-nvidia-nemotron-3-nano-30b-a3b-fp8
      storageUri: hf://nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8

Overview

NVIDIA Nemotron-3-Nano-30B-A3B is a hybrid Mamba-2/Transformer mixture-of-experts model with 30B total parameters and 3.5B active parameters per token. It uses 128 routed experts plus 1 shared expert per MoE layer. This recipe deploys the FP8-quantized variant on a single H100 80GB GPU.

Reasoning is togglable: the model can emit a chain-of-thought trace inside <think> tags before answering, or skip the trace for faster responses.

Prerequisites

  • OpenShift 4.14+ with the NVIDIA GPU Operator installed
  • At least 1 NVIDIA H100 80GB GPU
  • Red Hat OpenShift AI (RHOAI) operator installed, or KServe configured manually
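
The prerequisites above can be checked from a logged-in oc session before applying anything. The following is a minimal sketch, not an official check: check_prereqs is a hypothetical helper name, and the nvidia.com/gpu.present=true node label is an assumption about how the GPU Operator labels nodes on your cluster. It confirms that the KServe CRDs exist and that at least one GPU node is visible; it does not verify that the GPU is an H100 80GB.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; assumes `oc` is logged in to the cluster.
check_prereqs() {
  local missing=0 crd gpu_nodes

  # KServe (installed directly or through RHOAI) provides these CRDs.
  for crd in inferenceservices.serving.kserve.io servingruntimes.serving.kserve.io; do
    if oc get crd "$crd" >/dev/null 2>&1; then
      echo "ok: CRD $crd"
    else
      echo "missing: CRD $crd"
      missing=1
    fi
  done

  # ASSUMPTION: GPU nodes carry the nvidia.com/gpu.present=true label.
  # Count the matching nodes; an empty or failed listing counts as 0.
  gpu_nodes=$(oc get nodes -l nvidia.com/gpu.present=true -o name 2>/dev/null | grep -c . || true)
  echo "GPU nodes found: $gpu_nodes"
  [ "$gpu_nodes" -gt 0 ] || missing=1

  # Exit status 0 means every check passed.
  return "$missing"
}
```

After sourcing the snippet, run check_prereqs; a non-zero exit status means at least one check failed. Confirming the GPU model itself is left to a manual look at the node labels or oc describe node.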

Quick Start

  1. Save the generated manifests above to deploy.yaml
  2. Create a namespace: oc new-project llm-serving
  3. Apply: oc apply -f deploy.yaml
  4. Wait for the pod to pull the model and start serving
  5. Get the route: oc get inferenceservice
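
Steps 4 and 5 can be scripted rather than polled by hand. This is a sketch under stated assumptions: wait_for_isvc is a hypothetical helper name, its defaults mirror the names used in the manifests above, and the 30-minute timeout is an arbitrary guess at how long the image pull and model download might take.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; defaults match the manifests above.
wait_for_isvc() {
  local name="${1:-nvidia-nemotron-3-nano-30b-a3b-fp8}"
  local ns="${2:-llm-serving}"
  local timeout="${3:-30m}"   # ASSUMPTION: arbitrary upper bound, tune as needed

  # Block until KServe reports the Ready condition, or the timeout expires.
  oc wait --for=condition=Ready "inferenceservice/$name" -n "$ns" --timeout="$timeout" || return 1

  # Print the URL that KServe recorded in the InferenceService status.
  oc get inferenceservice "$name" -n "$ns" -o jsonpath='{.status.url}'
}
```

Calling wait_for_isvc with no arguments waits on the InferenceService defined in this recipe. If it times out, oc describe inferenceservice and the predictor pod logs are the usual places to look.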

Testing the Endpoint

export URL=$(oc get inferenceservice nvidia-nemotron-3-nano-30b-a3b-fp8 -n llm-serving -o jsonpath='{.status.url}')
curl -s "$URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
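
The togglable reasoning described in the Overview can be exercised against the same endpoint. The sketch below rests on two assumptions that should be checked against the model card: that the model's chat template reads an enable_thinking keyword, and that passing it through vLLM's chat_template_kwargs request field is the right way to set it. build_request and strip_reasoning are hypothetical helper names. Because the manifests above do not configure a reasoning parser, any trace is expected to arrive inline in the message content, so strip_reasoning simply drops everything up to the first closing </think> tag.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of vLLM, KServe, or oc.

MODEL="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"

# Print a chat-completions request body.
#   $1 = user prompt (assumed to contain no double quotes or backslashes)
#   $2 = "true" or "false": whether to request a reasoning trace (default true)
# ASSUMPTION: the model's chat template reads an `enable_thinking` kwarg.
build_request() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "chat_template_kwargs": {"enable_thinking": %s}}\n' \
    "$MODEL" "$1" "${2:-true}"
}

# Print only the final answer: drop everything up to and including the
# first </think> tag (if present), then drop leading blank lines.
strip_reasoning() {
  local text="$1"
  case "$text" in
    *"</think>"*) text=${text#*"</think>"} ;;
  esac
  printf '%s\n' "$text" | sed '/./,$!d'
}
```

With URL set as in the example above, a request without reasoning would look like curl -s "$URL/v1/chat/completions" -H "Content-Type: application/json" -d "$(build_request 'What is 2+2?' false)". Passing the returned message content to strip_reasoning leaves just the answer when a trace is present and is a no-op otherwise.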

Notes

  • --trust-remote-code lets vLLM load the custom model code shipped in the Hugging Face repository, which the hybrid Mamba-2/Transformer architecture requires
  • --kv-cache-dtype=fp8 stores the KV cache in FP8, reducing its memory footprint so longer sequences fit
  • --max-model-len=32768 limits context to 32K tokens to fit in VRAM; the model natively supports up to 1M tokens
  • --max-num-seqs=8 caps the number of sequences processed concurrently, which bounds KV cache and activation memory
  • Supports six languages: English, German, Spanish, French, Italian, and Japanese

References