Architecture

vllm-vcr sits at the engine-core boundary. The frontend is still a normal vLLM frontend; the simulator connects where a headless engine would connect and speaks the same ZMQ + msgpack protocol. The protocol types come from vLLM's in-tree vllm-engine-core-client crate, pinned per supported vLLM line.

Diagram (top to bottom):

frontend: vLLM frontend, Rust or Python
  -> ZMQ + msgpack ->
backend: vllm-vcr play, mock engine-core
  -> EngineInput ->
sim: generation loop (tokens, latency, scheduler)
  -> KV hooks ->
default: NoopDataPlane, control plane only
feature = nixl: NixlDataPlane, CPU KV transfers

The main pieces are:

connect_to_frontend joins the frontend-owned handshake, reports ready, and opens the DEALER/PUSH data sockets.
src/io.rs decodes incoming frames into EngineInput and writes EngineOutput messages back to the frontend.
src/engine.rs owns scheduling, latency, token emission, LoRA accounting, prefix-cache state, and failure injection.
src/dataplane.rs is the prefill/decode integration point. Prefill advertises KV metadata through kv_transfer_params; decode pulls those blocks. The default NoopDataPlane only exercises the control plane, while NixlDataPlane performs real NIXL reads when the nixl feature is enabled.
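The split between a control-plane-only default and a byte-moving implementation can be sketched as a small trait. This is a minimal illustration, not the code in src/dataplane.rs: the DataPlane trait, the KvTransferParams struct, the advertise and handoff functions, and CountingDataPlane are all hypothetical names. CountingDataPlane only counts bytes and stands in for a real transfer; it does not use NIXL.

```rust
/// Hypothetical stand-in for the metadata carried in kv_transfer_params.
#[derive(Debug, Clone, PartialEq)]
pub struct KvTransferParams {
    pub request_id: String,
    pub block_ids: Vec<u32>,
}

/// Hypothetical trait; the real interface in src/dataplane.rs may differ.
pub trait DataPlane {
    /// Decode side: pull the advertised blocks and return the bytes moved.
    fn pull(&mut self, params: &KvTransferParams) -> usize;
}

/// Control plane only: the metadata round-trips, but no bytes move.
#[derive(Default)]
pub struct NoopDataPlane {
    pub pulls: usize,
}

impl DataPlane for NoopDataPlane {
    fn pull(&mut self, _params: &KvTransferParams) -> usize {
        self.pulls += 1;
        0
    }
}

/// Stand-in for a byte-moving plane. It only counts bytes; it is not NIXL.
pub struct CountingDataPlane {
    pub block_bytes: usize,
}

impl DataPlane for CountingDataPlane {
    fn pull(&mut self, params: &KvTransferParams) -> usize {
        params.block_ids.len() * self.block_bytes
    }
}

/// Prefill side: advertise which blocks a finished request produced.
pub fn advertise(request_id: &str, block_ids: &[u32]) -> KvTransferParams {
    KvTransferParams {
        request_id: request_id.to_string(),
        block_ids: block_ids.to_vec(),
    }
}

/// One prefill-to-decode handoff through whichever plane is configured.
pub fn handoff(plane: &mut dyn DataPlane, request_id: &str, block_ids: &[u32]) -> usize {
    let params = advertise(request_id, block_ids);
    plane.pull(&params)
}
```

Because the engine side would only see the trait, the same scheduling path can run against either implementation, which is the property the default NoopDataPlane relies on.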

record uses the same boundary in proxy form: it presents as an engine to the frontend, presents as a frontend to the real engine, relays frames unchanged, and records timing/token metadata from decoded copies.
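The relay-and-observe idea can be sketched without any sockets. The Recorder, RecordEntry, Direction, and Frames names below are hypothetical, and the sketch records only frame counts and byte sizes from a borrowed view; the actual tool is described as decoding copies of the frames to extract timing and token metadata, which is omitted here.

```rust
use std::time::{Duration, Instant};

/// A multipart message as raw frames, as a ZMQ socket would deliver it.
pub type Frames = Vec<Vec<u8>>;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Direction {
    FrontendToEngine,
    EngineToFrontend,
}

/// Hypothetical record entry; a real recording would hold decoded metadata.
#[derive(Debug)]
pub struct RecordEntry {
    pub direction: Direction,
    pub elapsed: Duration,
    pub frame_count: usize,
    pub payload_bytes: usize,
}

pub struct Recorder {
    started: Instant,
    pub entries: Vec<RecordEntry>,
}

impl Recorder {
    pub fn new() -> Self {
        Recorder {
            started: Instant::now(),
            entries: Vec::new(),
        }
    }

    /// Observe the frames read-only, then hand back the same bytes so the
    /// caller can forward them to the other side unchanged.
    pub fn relay(&mut self, direction: Direction, frames: Frames) -> Frames {
        self.entries.push(RecordEntry {
            direction,
            elapsed: self.started.elapsed(),
            frame_count: frames.len(),
            payload_bytes: frames.iter().map(|f| f.len()).sum(),
        });
        frames
    }
}
```

Taking the frames by value and returning them untouched keeps the forwarding path free of any re-encoding step, so the recording cannot alter what the real engine or the frontend receives.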
