Architecture
vllm-vcr sits at the engine-core boundary. The frontend is still a normal vLLM
frontend; the simulator connects where a headless engine would connect and speaks the
same ZMQ + msgpack protocol. The protocol types come from vLLM's in-tree
vllm-engine-core-client crate, pinned per supported vLLM line.
[Diagram: frontend (vLLM frontend, Rust or Python) connected over ZMQ + msgpack to the backend (vllm-vcr play, mock engine-core). EngineInput feeds the sim generation loop (tokens, latency, scheduler). KV hooks go to NoopDataPlane by default (control plane only), or to NixlDataPlane with feature = nixl (CPU KV transfers).]
The main pieces are:
- connect_to_frontend joins the frontend-owned handshake, reports ready, and opens the DEALER/PUSH data sockets.
- src/io.rs decodes incoming frames into EngineInput and writes EngineOutput messages back to the frontend.
- src/engine.rs owns scheduling, latency, token emission, LoRA accounting, prefix-cache state, and failure injection.
- src/dataplane.rs is the prefill/decode integration point. Prefill advertises KV metadata through kv_transfer_params; decode pulls those blocks. The default NoopDataPlane only exercises the control plane, while NixlDataPlane performs real NIXL reads when the nixl feature is enabled.
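The prefill/decode split in src/dataplane.rs can be pictured as a small trait with a no-op implementation. The sketch below is illustrative only: the trait name DataPlane, its advertise and pull methods, and the KvTransferParams fields are assumptions made for this example, not the crate's actual interface. Only the NoopDataPlane name and its control-plane-only behavior come from the description above.

```rust
/// Hypothetical stand-in for the metadata carried in kv_transfer_params.
/// The real field layout is defined by vLLM and may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct KvTransferParams {
    pub request_id: String,
    pub block_ids: Vec<u32>,
}

/// Hypothetical trait: the actual interface in src/dataplane.rs may differ.
pub trait DataPlane {
    /// Prefill side: advertise the KV blocks produced for a request.
    fn advertise(&mut self, request_id: &str, block_ids: &[u32]) -> KvTransferParams;

    /// Decode side: pull the advertised blocks and return the bytes moved.
    fn pull(&mut self, params: &KvTransferParams) -> usize;
}

/// Control plane only: metadata flows end to end, but no KV bytes move.
#[derive(Debug, Default)]
pub struct NoopDataPlane {
    /// Number of pull requests seen, kept so tests can observe the call.
    pub pulls_seen: usize,
}

impl DataPlane for NoopDataPlane {
    fn advertise(&mut self, request_id: &str, block_ids: &[u32]) -> KvTransferParams {
        KvTransferParams {
            request_id: request_id.to_string(),
            block_ids: block_ids.to_vec(),
        }
    }

    fn pull(&mut self, _params: &KvTransferParams) -> usize {
        self.pulls_seen += 1;
        0 // nothing is transferred in the no-op data plane
    }
}

/// Returns the default backend. A real-transfer implementation such as
/// NixlDataPlane would be selected here behind the nixl feature instead.
pub fn default_data_plane() -> Box<dyn DataPlane> {
    Box::new(NoopDataPlane::default())
}
```

Keeping both backends behind one trait would let the engine call the same hooks whether or not real transfers happen, which is what allows the default build to exercise the control plane alone.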
record uses the same boundary in proxy form: it presents as an engine to the
frontend, presents as a frontend to the real engine, relays frames unchanged, and
records timing/token metadata from decoded copies.
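The relay-and-record behavior can be sketched without any networking: a frame passes through untouched while metadata is taken from a copy. Everything named here (Recorder, Direction, RecordedEvent, and the decoder closure) is hypothetical and chosen for illustration. The real proxy works on ZMQ frames and decodes msgpack, which this sketch replaces with a caller-supplied function.

```rust
use std::time::{Duration, Instant};

/// Which way a frame is travelling through the proxy.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Direction {
    FrontendToEngine,
    EngineToFrontend,
}

/// Metadata captured for one relayed frame (hypothetical layout).
#[derive(Debug)]
pub struct RecordedEvent {
    pub direction: Direction,
    pub elapsed: Duration,
    pub frame_len: usize,
    pub token_count: Option<usize>,
}

/// Relays frames unchanged and records metadata from a decoded copy.
/// `D` stands in for the real msgpack decoder.
pub struct Recorder<D> {
    start: Instant,
    decode_tokens: D,
    pub events: Vec<RecordedEvent>,
}

impl<D: Fn(&[u8]) -> Option<usize>> Recorder<D> {
    pub fn new(decode_tokens: D) -> Self {
        Recorder {
            start: Instant::now(),
            decode_tokens,
            events: Vec::new(),
        }
    }

    /// Returns the original frame byte for byte; only a copy is decoded.
    pub fn relay(&mut self, direction: Direction, frame: Vec<u8>) -> Vec<u8> {
        let copy = frame.clone();
        let token_count = (self.decode_tokens)(&copy);
        self.events.push(RecordedEvent {
            direction,
            elapsed: self.start.elapsed(),
            frame_len: frame.len(),
            token_count,
        });
        frame
    }
}
```

Because decoding happens on a copy, a frame the decoder cannot parse is still forwarded intact; in this sketch it is simply recorded with a token_count of None.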