Architecture

vllm-vcr sits at the engine-core boundary. The frontend is still a normal vLLM frontend; the simulator connects where a headless engine would connect and speaks the same ZMQ + msgpack protocol. The protocol types come from vLLM's in-tree vllm-engine-core-client crate, pinned per supported vLLM release line.

The main pieces are:

  • connect_to_frontend joins the frontend-owned handshake, reports ready, and opens the DEALER/PUSH data sockets.
  • src/io.rs decodes incoming frames into EngineInput and writes EngineOutput messages back to the frontend.
  • src/engine.rs owns scheduling, latency, token emission, LoRA accounting, prefix-cache state, and failure injection.
  • src/dataplane.rs is the prefill/decode integration point. Prefill advertises KV metadata through kv_transfer_params; decode pulls those blocks. The default NoopDataPlane only exercises the control plane, while NixlDataPlane performs real NIXL reads when the nixl feature is enabled.
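The split between a control-plane-only data plane and one that moves real bytes can be pictured with a small trait. The sketch below is illustrative only: the trait name DataPlane, its advertise and pull methods, the KvTransferParams fields, and the NoopSketch type are assumptions made for this example and are not taken from src/dataplane.rs. It shows how a no-op implementation can still exercise the prefill/decode handshake while transferring nothing.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the metadata carried in kv_transfer_params.
/// The real field layout is not described in this chapter.
#[derive(Debug, Clone, PartialEq)]
pub struct KvTransferParams {
    pub request_id: String,
    pub block_ids: Vec<u32>,
}

/// Hypothetical data-plane interface; the actual trait may look different.
pub trait DataPlane {
    /// Prefill side: publish metadata for the blocks produced for a request.
    fn advertise(&mut self, request_id: &str, block_ids: &[u32]) -> KvTransferParams;

    /// Decode side: pull the advertised blocks and report how many bytes moved.
    fn pull(&mut self, params: &KvTransferParams) -> Result<usize, String>;
}

/// Control-plane-only implementation: remembers what was advertised and
/// validates pulls against it, but never transfers any bytes.
#[derive(Default)]
pub struct NoopSketch {
    advertised: HashMap<String, Vec<u32>>,
}

impl DataPlane for NoopSketch {
    fn advertise(&mut self, request_id: &str, block_ids: &[u32]) -> KvTransferParams {
        self.advertised
            .insert(request_id.to_string(), block_ids.to_vec());
        KvTransferParams {
            request_id: request_id.to_string(),
            block_ids: block_ids.to_vec(),
        }
    }

    fn pull(&mut self, params: &KvTransferParams) -> Result<usize, String> {
        match self.advertised.get(&params.request_id) {
            // Metadata matches what prefill advertised: succeed with zero bytes moved.
            Some(blocks) if *blocks == params.block_ids => Ok(0),
            Some(_) => Err(format!("block list mismatch for {}", params.request_id)),
            None => Err(format!("unknown request {}", params.request_id)),
        }
    }
}
```

Under these assumptions, a feature-gated implementation would satisfy the same trait and replace the Ok(0) branch with a real read, so the engine code calling advertise and pull would not need to know which variant is active.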

record uses the same boundary in proxy form: it presents as an engine to the frontend, presents as a frontend to the real engine, relays frames unchanged, and records timing/token metadata from decoded copies.
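The relay-and-record idea can be sketched without any sockets. In the example below, Recorder, FrameRecord, Direction, and the forward callback are hypothetical names chosen for illustration, and the forward closure stands in for whichever ZMQ socket the frame would be written to. The point it illustrates is the ordering: metadata is taken from a copy of the frame, and the original bytes are then passed on unmodified.

```rust
use std::time::{Duration, Instant};

/// Which way a relayed frame is travelling (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Direction {
    ToEngine,
    ToFrontend,
}

/// Hypothetical per-frame metadata; a real recorder would also decode
/// token information from the copied payload.
#[derive(Debug)]
pub struct FrameRecord {
    pub direction: Direction,
    pub elapsed: Duration,
    pub len: usize,
}

pub struct Recorder {
    started: Instant,
    pub records: Vec<FrameRecord>,
}

impl Recorder {
    pub fn new() -> Self {
        Recorder {
            started: Instant::now(),
            records: Vec::new(),
        }
    }

    /// Relay one frame: record metadata from a copy, then forward the
    /// original bytes untouched through the supplied callback.
    pub fn relay(
        &mut self,
        direction: Direction,
        frame: Vec<u8>,
        mut forward: impl FnMut(Vec<u8>),
    ) {
        // Any decoding happens on this copy, never on the relayed bytes.
        let copy = frame.clone();
        self.records.push(FrameRecord {
            direction,
            elapsed: self.started.elapsed(),
            len: copy.len(),
        });
        forward(frame);
    }
}
```

Keeping inspection on a copy means a decoding error in the recorder cannot alter what the frontend or the real engine receives, which is what lets the proxy stay transparent to both sides.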