vllm-vcr

GPU-free vLLM engine-core replay

Record a real vLLM engine once, then replay its protocol behavior, timing, and optional output tokens behind the real vLLM frontend on ordinary CPU hosts.

vllm-vcr is a single binary with three subcommands:

record taps a live vLLM frontend ↔ engine-core link as a transparent ZMQ proxy and writes a JSONL trace.
play runs a mock engine-core backend that speaks the real ZMQ + msgpack protocol. It can generate synthetic tokens, sample timing from a fitted trace, replay recorded step timing, or serve recorded token ids.
inspect converts benchmark reports, summarizes traces, renders Perfetto timelines, and runs calibration checks.

With the optional nixl feature and a working libnixl/UCX runtime, play can also move simulated KV-cache bytes between prefill and decode instances over NIXL.

What it is for

Testing the software around a vLLM engine usually means provisioning GPUs and model weights before you can exercise frontends, cache-aware routers, schedulers, autoscalers, or CI compatibility matrices. vllm-vcr keeps the real frontend and wire protocol in the loop, but replaces the model backend with a CPU simulator.

Use it when you need to:

replay captured TTFT and inter-token behavior without a GPU;
run OpenAI-compatible frontend, streaming, LoRA, scheduler, and router tests against the real engine-core protocol;
validate trace fidelity and version compatibility in CI;
test prefill/decode control-plane behavior, and optionally the NIXL data plane, without model weights.

It is not a model-quality simulator: generated tokens are random unless you record and replay token ids, and latency fidelity depends on traces captured from the engine/configuration you care about.

How it fits

The vLLM frontend remains responsible for tokenization, chat templates, tool calling, streaming, metrics, and OpenAI-compatible HTTP handling. vllm-vcr play only replaces the engine-core process behind that frontend. For prefill/decode work, the default data plane is a no-op; the NIXL path is opt-in and still runs without CUDA or model weights.

vllm-vcr

vllm-vcr

What it is for

How it fits

New setup

Trace replay

Operations

Keyboard shortcuts

vllm-vcr

vllm-vcr

What it is for

How it fits

New setup

Trace replay

Operations