Meta
Llama 3.1 8B Instruct
The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models...
8B8,192 ctxtextdense
These manifests assume you have OpenShift with the GPU Operator and RHOAI/KServe installed. Check prerequisites →
TODO: Add variant description
Deploy with oc apply
Save to a file and run oc apply -f deploy.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: vllm-llama-3-1-8b-instruct
namespace: llm-serving
annotations:
openshift.io/display-name: "Llama 3.1 8B Instruct (bf16)"
spec:
supportedModelFormats:
- name: vLLM
autoSelect: true
containers:
- name: kserve-container
image: vllm/vllm-openai:nightly
command:
- python
- -m
- vllm.entrypoints.openai.api_server
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--max-num-seqs=32"
- "--max-model-len=8192"
resources:
requests:
cpu: "2"
memory: 8Gi
nvidia.com/gpu: "1"
limits:
cpu: "8"
memory: 24Gi
nvidia.com/gpu: "1"
ports:
- containerPort: 8000
protocol: TCP
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-3-1-8b-instruct
namespace: llm-serving
annotations:
serving.kserve.io/deploymentMode: RawDeployment
spec:
predictor:
model:
modelFormat:
name: vLLM
runtime: vllm-llama-3-1-8b-instruct
storageUri: hf://meta-llama/Llama-3.1-8B-Instruct