vllm-vcr
GPU-free vLLM engine-core replay
Record a real vLLM engine once, then replay its protocol behavior, timing, and optional output tokens behind the real vLLM frontend on ordinary CPU hosts.
vllm-vcr is a single binary with three subcommands:
- record taps a live vLLM frontend ↔ engine-core link as a transparent ZMQ proxy and writes a JSONL trace.
- play runs a mock engine-core backend that speaks the real ZMQ + msgpack protocol. It can generate synthetic tokens, sample timing from a fitted trace, replay recorded step timing, or serve recorded token ids.
- inspect converts benchmark reports, summarizes traces, renders Perfetto timelines, and runs calibration checks.
With the optional nixl feature and a working libnixl/UCX runtime, play can also
move simulated KV-cache bytes between prefill and decode instances over NIXL.
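To make the difference between the play modes concrete, here is a minimal Python sketch of the idea, not vllm-vcr's actual implementation. The function names, the cycling behavior when a trace is shorter than the request, and the use of plain empirical resampling as a stand-in for a "fitted" timing model are all assumptions made for illustration.

```python
import random
from typing import Iterator, Optional, Sequence


def step_delays(
    recorded: Sequence[float],
    n_steps: int,
    resample: bool = False,
    seed: int = 0,
) -> Iterator[float]:
    """Yield n_steps per-step delays in seconds (hypothetical helper).

    resample=False: replay delays in recorded order, cycling if the
                    recording is shorter than n_steps (an assumption).
    resample=True:  draw each delay from the empirical distribution of the
                    recording, a simple stand-in for a fitted timing model.
    """
    if not recorded:
        raise ValueError("need at least one recorded delay")
    rng = random.Random(seed)
    for i in range(n_steps):
        if resample:
            yield rng.choice(recorded)
        else:
            yield recorded[i % len(recorded)]


def token_ids(
    n_tokens: int,
    vocab_size: int,
    recorded: Optional[Sequence[int]] = None,
    seed: int = 0,
) -> Iterator[int]:
    """Yield up to n_tokens token ids (hypothetical helper).

    With a recording, serve its ids in order and stop when they run out.
    Without one, emit uniformly random ids in [0, vocab_size).
    """
    if recorded is not None:
        yield from recorded[:n_tokens]
        return
    rng = random.Random(seed)
    for _ in range(n_tokens):
        yield rng.randrange(vocab_size)
```

In this sketch, timing and token content are independent choices: a run can replay recorded step timing while emitting random tokens, or serve recorded token ids with sampled timing.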
What it is for
Testing the software around a vLLM engine usually means provisioning GPUs and model
weights before you can exercise frontends, cache-aware routers, schedulers,
autoscalers, or CI compatibility matrices. vllm-vcr keeps the real frontend and
wire protocol in the loop, but replaces the model backend with a CPU simulator.
Use it when you need to:
- replay captured TTFT and inter-token behavior without a GPU;
- run OpenAI-compatible frontend, streaming, LoRA, scheduler, and router tests against the real engine-core protocol;
- validate trace fidelity and version compatibility in CI;
- test prefill/decode control-plane behavior, and optionally the NIXL data plane, without model weights.
It is not a model-quality simulator: generated tokens are random unless you record and replay token ids, and latency fidelity depends on traces captured from the engine/configuration you care about.
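As a rough illustration of what "TTFT and inter-token behavior" means when read off a trace, the following Python sketch computes both per request from JSONL lines. The record shape used here (request_id, event, t) is hypothetical and chosen only for the example; the actual vllm-vcr trace schema is not described in this section and may differ.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize_trace(lines):
    """Return {request_id: (ttft_seconds, mean_inter_token_gap_or_None)}.

    Assumed (hypothetical) record shape, one JSON object per line:
      {"request_id": str, "event": "arrival" | "token", "t": float seconds}
    """
    arrivals = {}
    token_times = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        rec = json.loads(line)
        if rec["event"] == "arrival":
            arrivals[rec["request_id"]] = rec["t"]
        elif rec["event"] == "token":
            token_times[rec["request_id"]].append(rec["t"])

    summary = {}
    for rid, times in token_times.items():
        if rid not in arrivals:
            continue  # no arrival timestamp, so TTFT is undefined
        times.sort()
        ttft = times[0] - arrivals[rid]
        gaps = [b - a for a, b in zip(times, times[1:])]
        summary[rid] = (ttft, mean(gaps) if gaps else None)
    return summary
```

Comparing these two numbers between a recorded run and a replayed run is one simple way to sanity-check that a replay reproduces the latency profile of the capture.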
How it fits
The vLLM frontend remains responsible for tokenization, chat templates, tool calling,
streaming, metrics, and OpenAI-compatible HTTP handling. vllm-vcr play only replaces
the engine-core process behind that frontend. For prefill/decode work, the default
data plane is a no-op; the NIXL path is opt-in and still runs without CUDA or model
weights.
New setup
Read Architecture, then install the binary and run the quick start.
Trace replay
Start with Trace replay and calibration for capture, model fit, and replay modes.
Operations
Use Versioning and Conformance for supporting multiple vLLM release lines.