Llama 3.1 8B Instruct

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models...

8B8,192 ctxtextdense

These manifests assume you have OpenShift with the GPU Operator and RHOAI/KServe installed. Check prerequisites →

Precision

TODO: Add variant description

GPUs

Max Context Length

Namespace

Deploy with oc apply

Save to a file and run oc apply -f deploy.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-llama-3-1-8b-instruct
  namespace: llm-serving
  annotations:
    openshift.io/display-name: "Llama 3.1 8B Instruct (bf16)"
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:nightly
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--max-num-seqs=32"
        - "--max-model-len=8192"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
      ports:
        - containerPort: 8000
          protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct
  namespace: llm-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-llama-3-1-8b-instruct
      storageUri: hf://meta-llama/Llama-3.1-8B-Instruct