vllm-vcr

GPU-free vLLM engine-core replay

Record a real vLLM engine once, then replay its protocol behavior, timing, and optional output tokens behind the real vLLM frontend on ordinary CPU hosts.

vllm-vcr is a single binary with three subcommands:

record taps a live vLLM frontend ↔ engine-core link as a transparent ZMQ proxy and writes a JSONL trace.
play runs a mock engine-core backend that speaks the real ZMQ + msgpack protocol. It can generate synthetic tokens, sample timing from a fitted trace, replay recorded step timing, or serve recorded token ids.
inspect converts benchmark reports, summarizes traces, renders Perfetto timelines, and runs calibration checks.

With the optional nixl feature and a working libnixl/UCX runtime, play can also move simulated KV-cache bytes between prefill and decode instances over NIXL.

What it is for

Testing the software around a vLLM engine usually means provisioning GPUs and model weights before you can exercise frontends, cache-aware routers, schedulers, autoscalers, or CI compatibility matrices. vllm-vcr keeps the real frontend and wire protocol in the loop, but replaces the model backend with a CPU simulator.

Use it when you need to:

replay captured TTFT and inter-token behavior without a GPU;
run OpenAI-compatible frontend, streaming, LoRA, scheduler, and router tests against the real engine-core protocol;
validate trace fidelity and version compatibility in CI;
test prefill/decode control-plane behavior, and optionally the NIXL data plane, without model weights.

It is not a model-quality simulator: generated tokens are random unless you record and replay token ids, and latency fidelity depends on traces captured from the engine/configuration you care about.

How it fits

The vLLM frontend remains responsible for tokenization, chat templates, tool calling, streaming, metrics, and OpenAI-compatible HTTP handling. vllm-vcr play only replaces the engine-core process behind that frontend. For prefill/decode work, the default data plane is a no-op; the NIXL path is opt-in and still runs without CUDA or model weights.

New setup

Read Architecture, then install the binary and run the quick start.

Trace replay

Start with Trace replay and calibration for capture, model fit, and replay modes.

Operations

Use Versioning and Conformance for multi-line vLLM support.

Architecture

vllm-vcr sits at the engine-core boundary. The frontend is still a normal vLLM frontend; the simulator connects where a headless engine would connect and speaks the same ZMQ + msgpack protocol. The protocol types come from vLLM's in-tree vllm-engine-core-client crate, pinned per supported vLLM line.

frontend vLLM frontend Rust or Python

ZMQ + msgpack

backend vllm-vcr play mock engine-core

EngineInput

sim generation loop tokens, latency, scheduler

KV hooks

default NoopDataPlane control plane only

feature = nixl NixlDataPlane CPU KV transfers

The main pieces are:

connect_to_frontend joins the frontend-owned handshake, reports ready, and opens the DEALER/PUSH data sockets.
src/io.rs decodes incoming frames into EngineInput and writes EngineOutput messages back to the frontend.
src/engine.rs owns scheduling, latency, token emission, LoRA accounting, prefix-cache state, and failure injection.
src/dataplane.rs is the prefill/decode integration point. Prefill advertises KV metadata through kv_transfer_params; decode pulls those blocks. The default NoopDataPlane only exercises the control plane, while NixlDataPlane performs real NIXL reads when the nixl feature is enabled.

record uses the same boundary in proxy form: it presents as an engine to the frontend, presents as a frontend to the real engine, relays frames unchanged, and records timing/token metadata from decoded copies.

Status

vllm-vcr is usable today for protocol-level frontend testing, trace replay, calibration, and GPU-free prefill/decode control-plane experiments. The NIXL data plane is implemented behind an optional feature and needs a Linux host with libnixl and UCX.

Area	State	Validation
Engine-core protocol	Streaming and non-streaming OpenAI flows work through the vLLM Rust frontend over ZMQ/msgpack, with tokenizer, detokenizer, chat template, and frontend metrics intact.	`./scripts/e2e.sh`
Trace timing	TTFT, inter-token gaps, multi-token chunks, prefix-cache structure, and arrival/session pacing can be captured, modeled, and replayed.	`inspect calibrate`, `inspect calibrate-e2e`, trace replay tests
Content replay	`record --record-tokens` plus `play --replay-tokens` can serve recorded token ids and finish reasons.	`tests/engine_core_e2e.rs`, `tests/closed_loop_prefix_replay.rs`
P/D control plane	The simulator produces and consumes vLLM NixlConnector `kv_transfer_params` per request.	`scripts/pd_control.sh`
NIXL data plane	Prefill registers a paged KV pool and serves metadata; decode fetches metadata and posts paged NIXL reads.	`tests/nixl_loopback.rs` on Linux + libnixl
Multi-version support	The build matrix pins one `vllm-engine-core-client` rev per supported line and uses conformance goldens when available.	CI matrix + `tests/conformance.rs`

If NIXL initialization fails at runtime, the engine logs a warning and falls back to NoopDataPlane, so protocol tests can still run.

./scripts/pd_control.sh              # macOS: control-plane schema round trip
cargo check --features nixl-stub     # macOS gate: typecheck the NIXL path
cargo test  --features nixl          # Linux: NIXL transfer

Install

vllm-vcr requires Rust 1.85 or newer. The default build is pure Rust and does not include NIXL.

From a checkout:

cargo install --path . --locked

That installs the single vllm-vcr binary with record, play, inspect, and completions subcommands.

To install the default no-NIXL build directly from Git:

cargo install --git https://github.com/neuralmagic/vllm-vcr \
  --locked vllm-vcr

For a NIXL-enabled binary, build on Linux with libnixl and UCX available:

cargo install --path . --locked --features nixl

For local development on a machine without libnixl, use the stub feature to typecheck the NIXL code path without enabling real transfers:

cargo check --features nixl-stub

For Kubernetes deployments, build the container image instead. The image includes the vLLM Rust frontend, the simulator, libnixl, and UCX:

podman build -t ghcr.io/neuralmagic/vllm-vcr:dev .

Quick start

This smoke test runs the real vLLM frontend and replaces only the engine-core backend. It does not need a GPU or model weights, but the frontend still downloads the tokenizer on first use.

Start the simulator:

vllm-vcr play --handshake-address tcp://127.0.0.1:29550 --log-requests

In another shell, start a vLLM frontend with the same handshake port and no local engine rank. The exact command depends on the vLLM line and frontend you are testing; vllm-vcr uses the same external-engine role as vLLM's mock-engine harness. The repo script wraps the common Rust-frontend path:

./scripts/e2e.sh

Send a streaming chat request once the frontend is up:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"hello"}],"max_tokens":16,"stream":true}'

Prefill/decode control-plane smoke

Run the routing-sidecar schema round trip without NIXL:

./scripts/pd_control.sh

For the Kubernetes P/D deployment, build the image from the install section and use deploy/llm-d-pd/README.md.

Replay a trace

Once you have a tap trace, use it as the latency source:

vllm-vcr play --handshake-address tcp://127.0.0.1:29550 \
  --latency-trace trace.jsonl.gz --log-requests

If the trace was captured with record --record-tokens, add --replay-tokens trace.jsonl.gz to serve the recorded token ids instead of random tokens.

Verifying release artifacts

Every GitHub Release tarball ships with four supply-chain artifacts:

*.sha256 — a plain checksum, no tooling required (shasum -a 256 -c <file>.sha256).
*.cdx.json — a CycloneDX SBOM of the build's dependency graph.
*.sig + *.pem — a cosign keyless signature and its Fulcio certificate, for offline verification.
a SLSA build provenance attestation recorded in GitHub, binding the tarball's digest to the workflow run that produced it.

Verify provenance (proves it was built by this repo's release workflow):

gh attestation verify vllm-vcr-vllm0.23-x86_64-unknown-linux-musl.tar.gz \
  --repo neuralmagic/vllm-vcr

Verify the cosign signature without GitHub:

cosign verify-blob \
  --certificate vllm-vcr-vllm0.23-x86_64-unknown-linux-musl.tar.gz.pem \
  --signature  vllm-vcr-vllm0.23-x86_64-unknown-linux-musl.tar.gz.sig \
  --certificate-identity-regexp '^https://github.com/neuralmagic/vllm-vcr/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  vllm-vcr-vllm0.23-x86_64-unknown-linux-musl.tar.gz

Shell completions

vllm-vcr completions <shell> prints a completion script to stdout for bash, zsh, fish, powershell, or elvish. Install it wherever your shell loads completions, for example:

# fish
vllm-vcr completions fish > ~/.config/fish/completions/vllm-vcr.fish

# bash (current shell)
source <(vllm-vcr completions bash)

# zsh (into a directory on $fpath)
vllm-vcr completions zsh > ~/.zfunc/_vllm-vcr

The script is generated from the live command tree, so it always matches the subcommands and flags of the binary you ran it from.

Testing

The main local gate is:

cargo fmt --all --check
cargo clippy --workspace --all-targets --locked --no-deps -- -D warnings
cargo test --workspace --locked

The full smoke scripts also boot a real vLLM frontend:

./scripts/e2e.sh        # boots vllm-rs + this engine, asserts streaming + non-streaming flows
./scripts/e2e_lora.sh   # loads a LoRA adapter, asserts vllm:lora_requests_info names it
./scripts/e2e_generate.sh # exercises /inference/v1/generate token-in/token-out

These scripts need vllm-rs built once (cargo build --bin vllm-rs in the vLLM rust/ workspace). Override its path with FRONTEND_BIN=.... The first run fetches the tokenizer from Hugging Face.

e2e_lora.sh needs a frontend that exports vllm:lora_requests_info from the frontend metrics path. The image and current default protocol pin qualify; if you use your own checkout, point FRONTEND_BIN at a compatible build.

LoRA simulation

The simulator tracks LoRA adapters loaded by the frontend through the engine utility path (add_lora / remove_lora) and uses that registry while scheduling requests.

--max-loras mirrors vLLM's running-batch diversity cap: it limits how many distinct adapters may be resident in the running batch at once. 0 disables the cap, but LoRA accounting still runs so frontend metrics can report running and waiting adapters.

In the container image, set MOCK_MAX_LORAS; the entrypoint maps it to vllm-vcr play --max-loras.

The vllm:lora_requests_info metric is frontend-derived on current supported vLLM lines. The simulator's job is to keep scheduler stats and adapter state consistent enough for that frontend metric to reflect the request mix.

NIXL data plane

The NIXL path is optional. It needs libnixl and UCX on Linux, using either RDMA NICs or shared-memory transports. On a development machine without that runtime, typecheck the code path against stubs:

cargo check --features nixl-stub

On Linux with NIXL installed, split a prefill and a decode engine:

# prefill
cargo run --features nixl -- play --pd-role prefill \
  --engine-id mock-prefill --side-channel-host 127.0.0.1 --side-channel-port 5600 ...

# decode
cargo run --features nixl -- play --pd-role decode \
  --engine-id mock-decode --side-channel-port 5601 ...

The transfer path uses remote_host and remote_port from kv_transfer_params to fetch the prefill's PoolDescriptor over a TCP metadata side channel, then issues NIXL READs for the advertised block ids. Decode receives those remote_* fields per request; the prefill address is not a decode CLI argument.

tests/nixl_loopback.rs validates the byte-transfer path with distinct prefill and decode agents in one process. Kubernetes deployment validation is separate.

Engine internals

The engine separates loop orchestration from request behavior.

EngineCore (src/engine_core.rs) is the top-level contract. The generic run_loop owns the tokio select! over inputs, internal events, and deadline ticks. Any struct implementing EngineCore can use the loop. SimEngine is the production implementation; ConstantEngine (test-only, same file) is a minimal engine used by loop tests.

Three strategy traits on SimEngine control request behavior:

Trait	File	Default	What it controls
`TokenSource`	`src/tokens.rs`	`RandomTokens`	Which token ids each request emits. `EchoTokens` replays the prompt.
`LatencyModel`	`crates/sim-trace/src/latency.rs`	`KnobLatency`	TTFT and inter-token pacing. `FixedLatency` gives constant delays with no rng draws.
`Scheduler`	`src/sched.rs`	`Fcfs`	Waiting-queue admission order. `Priority` uses `(priority, arrival_time)`. `ShortestPromptFirst` picks the smallest prompt.

Defaults are wired in SimEngine::new (from CLI flags) and in run().

Contract tests live in tests/engine_core_e2e.rs. They drive ZMQ, protocol framing, and channels, then assert wire-level behavior. Unit tests in src/engine.rs cover engine internals.

Trace replay and calibration

Trace replay has two separate axes:

Workload replay: recorded arrivals, sessions, prompt prefix structure, and optionally output token ids.
Timing replay: either sampled from a fitted latency model, replayed verbatim from recorded step gaps, or replaced with explicit timing knobs.

Keeping those axes separate is important. You can replay a captured workload while testing a latency model fit from a different capture, or you can serve the same recorded token stream with synthetic timing for fast client tests.

The trace files used to build the committed figures live under traces/, which is gitignored. See traces/README.md for the local inventory and which captures are fitting inputs versus gate seeds.

Start with Concepts for terminology, then use the scenario pages for arrival replay, prefix-cache workloads, content-identical replay, and multi-token step replay.

Concepts

The trace docs use three terms consistently:

Captured — per-token tap recordings from a vLLM engine, taken server-side on the engine-core protocol. Figures label these as "real" or "source".
Modeled — latency the simulator emits. TTFT and per-token gaps are drawn from a statistical model fitted to a captured trace (conditioned on concurrency, context depth, and uncached prompt size). Captured timings are not played back verbatim, so a model fitted on one workload can be evaluated on another.
Direct replay — recorded values used verbatim, no statistics: arrival timestamps (--replay-arrivals), session pacing (--replay-sessions), prefix structure (block hashes), per-step gaps (--replay-steps), and opt-in output token ids (--replay-tokens).

"Replay" in a figure or flag name refers to the workload side (the schedule being replayed), not to the timing. Counterfactual gates fit on workload A, directly replay workload B's schedule, and check the modeled timing against B's capture.

just figures rebuilds the figures from local trace files listed in traces/README.md (scripts/make_figures.sh; ~30 minutes, the arrival replays run in real time). Those trace files are not committed. The head-to-head comparison is the exception; it needs live serving stacks (commands in that section).

Calibration demo

The vllm-vcr inspect subcommands include a calibration harness for quick local checks. The synthetic demo trace is deliberately heavy-tailed, so it shows the difference between the trace-fitted model and simple mean/std-dev knobs.

The harness checks two properties:

TraceLatency replay reproduces source-trace quantiles within tolerance.
KnobLatency cannot reproduce heavy tails: its [0.3*mean, 1.7*mean] clamp caps p99/p50 at roughly 1.7x for any knob settings.

This model-level check applies to ITL and to TTFT on unloaded traces. On loaded captures, the TTFT marginal comes from queueing and chunk interference rather than a sampled distribution, so this check can fail by design. Loaded TTFT is checked by the arrival-replay scenarios below.

# 1. Generate a synthetic heavy-tailed trace (lognormal TTFT/ITL).
cargo run --bin vllm-vcr -- inspect gen-demo -o /tmp/demo.jsonl

# 2. Model-level calibration (no transport).
cargo run --bin vllm-vcr -- inspect calibrate /tmp/demo.jsonl

# 3. Wire-level: start the simulator and measure client-side.
cargo run --bin vllm-vcr -- inspect calibrate-e2e /tmp/demo.jsonl --requests 60

--fast on gen-demo produces a small-magnitude trace for quick e2e testing (TTFT ~15-40ms, ITL ~3-10ms). All subcommands accept --json for machine-readable output and --seed for determinism.

Calibration with engine captures

The recording tap (vllm-vcr record, deployment manifests in deploy/trace-capture/) sits between the vLLM Rust frontend and a headless vLLM engine (Qwen3-8B, TP=1, H200), recording per-token inter-token gaps server-side over in-pod localhost ZMQ.

This page explains the figures used to validate the trace-fitted latency model. The important point: the model is not replaying a single request's gaps. It samples from captured observations conditioned on request shape and concurrency, so it can be tested out of sample.

The figures below plot captured vs TraceLatency vs best-fit KnobLatency per-token ITL (survival curve and Q-Q plot), and the same trace as pooled per-token ITLs vs per-request mean ITLs. Client-side benchmark reports such as guidellm usually expose per-request means because they record first/last token timestamps. The knob model's [0.3*mean, 1.7*mean] clamp appears as a vertical cutoff before the captured tail.

Source vs replay vs knob-fit

Per-token vs per-request-mean ITL

To regenerate from any trace with per-token itl_ms arrays:

cargo run --bin vllm-vcr -- inspect calibrate trace.jsonl --dump-samples samples.json
uv run scripts/plot_calibration.py --samples samples.json --trace trace.jsonl --out-dir docs/images

Comparison with llm-d-inference-sim

Same workload (deploy/trace-capture/loadgen.py, concurrency 1 and 16, 512/128 tokens) against three targets: the H200 engine (tap-recorded), this simulator with its latency model fit from the canonical fitting set (a different workload, the counterfactual setting), and the Go llm-d-inference-sim (v0.9.1) with its latency knobs fit to the same trace (the in-sample setting). Both simulators ran on the same host and were measured client-side by the same load generator. The engine curves are the tap recording. Both simulators' timing is modeled.

Real engine vs both simulators

The step model over-predicts TTFT for this saturated fixed-concurrency workload by about 70ms at the median, a known calibration gap in the out-of-sample fit. The knob model clamps both tails by construction.

Note: the trace's std-devs (TTFT 80ms, ITL 8ms) exceed llm-d-inference-sim's config validation, which caps std-dev at 30% of the mean, so it runs with the largest spread it accepts (39ms / 3.3ms).

# llm-d-inference-sim invocation used above
llm-d-inference-sim --port 8001 --model Qwen/Qwen3-8B --mode random \
  --force-dummy-tokenizer --max-model-len 16384 --max-num-seqs 128 \
  --time-to-first-token 132ms --time-to-first-token-std-dev 39ms \
  --inter-token-latency 11ms --inter-token-latency-std-dev 3300us

# this simulator: vllm-rs frontend + trace-fitted model, vLLM-default scheduler
# limits; the fit is the canonical set (sweep + warm multiturn + cold multiturn)
cat traces/h200-qwen3-8b/h200-sweep-full.jsonl \
    <(grep -v '"meta"' traces/h200-qwen3-8b/h200-multiturn-mtfit2.jsonl) \
    <(grep -v '"meta"' traces/h200-qwen3-8b/h200-multiturn-nocache4.jsonl) > /tmp/fit.jsonl
vllm-vcr play --handshake-address tcp://127.0.0.1:5571 \
  --latency-trace /tmp/fit.jsonl \
  --max-num-seqs 1024 --max-num-batched-tokens 8192

Step-granular interference

The engine paces emission with a step clock that mirrors vLLM's per-step schedule: decodes claim the shared token budget first, prefills chunk into whatever remains (in admission order), and every co-running decode's gap is the composed step's duration. Chunk compute is fitted from the trace as a depth-dependent function (attention makes deep chunks cost more per token) plus a max-shape premium for budget-saturated steps; small chunks hide under the batch's decode compute. Queueing, chunk serialization, and decode elongation are produced by the step composer rather than by interference knobs.

The gate is counterfactual: fit on one workload (a constant-load sweep plus a warm multiturn capture), then predict a cold-cache multiturn (~11k-token prompts, prefix caching disabled) the model never saw, whose prefill chunks continuously interfere with running decodes. The capture shows a two-shelf ITL band; the replay reproduces the band's shape, mass (13.9% vs 14.1%), and tail.

Counterfactual cold-multiturn replay

The warm-multiturn factual leg (99%+ prefix-cache hits) and a low-rate cold leg stay calibrated under the same model:

Factual warm-multiturn replay

Low-rate cold-multiturn replay

The same fit procedure refits from a Qwen3-30B-A3B MoE sweep without constant changes and reproduces its counterfactual band.

Open-loop arrival replay

The calibrations above sample the latency model closed-loop. That validates distributions, but it does not cover TTFT queueing, prefill stalls, or concurrency mixing from an external arrival process. calibrate-e2e --replay-arrivals direct-replays a captured arrival schedule in real time (each request sent at its recorded offset, open loop) and compares client-side TTFT/ITL/request-total quantiles against the capture. The arrivals are verbatim; every latency is still modeled. --latency-trace fits the sim's model from a different trace, so the gate runs on an arrival process outside the fitting set.

Setup: use the same frontend → tap → engine stack as the capture rig, then run the replay locally with vllm-vcr play as the engine. The latency model is fit from the canonical H200 fitting set, while deploy/trace-capture/loadgen.py --pattern poisson|burst drives arrival processes the fitting set never contained.

scenario	requests	concurrency seen	TTFT max err	ITL max err	req-total err
poisson, 4 req/s	464	1-15, median 6	36.1%*	1.1%	0.2%
burst, 24 per 10s	288	0 -> 24 spikes	0.4%	0.05%	0.5%
multiturn agentic (see below)	495	1-13	26.0%*	0.9%	2.5%

The max-err columns are the worst single quantile across all concurrency buckets. The starred cells are small-n tail artifacts: poisson's worst cell is its n=2 concurrency-1 bucket, multiturn's is a warm-TTFT p99 where captured 103ms vs modeled 76ms differ by transport jitter the in-process replay does not model. Medians and p90s agree within ~1-2%, and request totals stay within 2.5%.

The burst scenario sends 24 simultaneous 512-token prefills to an idle engine, so TTFT is queueing-dominated (burst TTFT p50 1.2s / p99 2.0s vs poisson's 58ms / 150ms on the same config).

Burst arrival replay

Poisson arrival replay

Per-concurrency-bucket rows shuffle under bursts (admission order inside a burst is not deterministic), which is why the gate compares pooled quantiles plus per-request decode totals.

Replayed prompts are unique-token synthetics: the captured workloads carry cached_tokens: 0, and identical fill tokens would silently turn every replayed request into a prefix-cache hit. Workloads with prefix reuse, such as multiturn and agentic traces, also need prefix structure replayed; that is the next scenario.

To reproduce against any trace with arrival_ms:

# capture: any OpenAI-compatible target
uv run --with httpx deploy/trace-capture/loadgen.py --url http://127.0.0.1:8000 \
  --model Qwen/Qwen3-8B --pattern poisson --rate 4 --duration 120 \
  --prompt-tokens 512 --output-tokens 128 --out run.json --trace-out client.jsonl

# replay the schedule, fitting the model from a different capture
just replay tap-poisson.jsonl /tmp/fit.jsonl

# real-vs-replay survival curves (replay measurements via --dump-trace)
just compare "real=tap-poisson.jsonl" "replay=replay-measured.jsonl"

Prefix cache and agentic multiturn

The agentic scenario (loadgen.py --pattern multiturn) models sessions, not isolated requests. Sessions arrive poisson at --rate; each session runs --turns closed-loop turns whose context grows by the turn's prompt plus the model's response, on top of one of --prefix-count shared --prefix-tokens prefixes. The validation run below is about 100 sessions x 5 turns over two ~10k-token shared prefixes; 493 of 495 requests were prefix-cache hits.

Prefix caching is not a latency knob. The engine runs a block-pool prefix cache; admission computes each request's cached-token count, the trace-fitted TTFT model conditions on the uncached prompt size, and a prefill admission stalls concurrent decodes by its uncached tokens. Replaying prefix-cache workloads requires the workload's sharing structure. The tap fingerprints every prompt with chained per-block hashes (block_hashes), and replay expands each distinct hash to one deterministic token block. Replayed prompts therefore share prefixes at the same block boundaries as the capture.

Two replay modes apply. Pure open-loop replay fires every turn at its recorded offset. --replay-sessions restores the generator's semantics: turn N+1 fires when turn N completes plus the recorded think gap, with sessions inferred from the hash chains. Session pacing matches closed-loop client behavior. Cold turns take seconds, so later turns are delayed by prior responses; open-loop replay would fire every turn on the original warm schedule.

The figure shows captured vs modeled TTFT survival per turn cohort (turn-1 requests: shared prefix hit only; turns 2+: growing context), plus the same schedule replayed with --cold-prompts (prefix reuse disabled). Without the cache, every turn re-prefills ~11k tokens and offered prefill load exceeds engine capacity. On turns 2+, TTFT p50 changes from 36ms to ~24s and p99 from 87ms to ~59s, with closed-loop sessions enabled.

Multiturn cache effect

# capture an agentic workload (10k-token shared prefixes at ~1.5 tokens/word)
uv run --with httpx deploy/trace-capture/loadgen.py --url http://127.0.0.1:8000 \
  --model Qwen/Qwen3-8B --pattern multiturn --rate 1 --turns 5 \
  --prefix-tokens 6500 --prompt-tokens 128 --output-tokens 128 --duration 120 \
  --out run.json

# session-paced replay, then the cache-off replay
cargo run --release --bin vllm-vcr -- inspect calibrate-e2e tap-multiturn.jsonl \
  --replay-arrivals --replay-sessions --latency-trace /tmp/fit.jsonl \
  --sim-arg=--kv-cache-size --sim-arg=65536 --dump-trace replay-measured.jsonl
cargo run --release --bin vllm-vcr -- inspect calibrate-e2e tap-multiturn.jsonl \
  --replay-arrivals --replay-sessions --cold-prompts ... --dump-trace nocache-measured.jsonl

# per-cohort figure
uv run scripts/plot_calibration.py --cache-effect real=tap-multiturn.jsonl \
  --cache-effect replay=replay-measured.jsonl --cache-effect nocache=nocache-measured.jsonl \
  --out-dir docs/images

The cold replay uses the chunk-cost model validated by the cold-multiturn counterfactual gate at this prompt scale. Workload traces with block-hash ids, lengths, and timestamps map onto this schema.

Content-identical replay

By default, traces include timing, shapes, and prefix structure (block hashes), but not tokens. The tap's --record-tokens option adds each request's output_token_ids to the trace. finish_reason is always recorded. With the same tokenizer, recorded token ids decode back to generated text, so token-recording traces can contain user content.

On the replay side, vllm-vcr play --replay-tokens <trace> serves the recorded ids verbatim instead of random tokens and ends each stream with the recorded finish reason. --replay-match controls request-to-record matching:

index (default): the trailing -<index> of the request id, where the index is the record's position in the arrival-ordered schedule (the replay harness names requests replay-{i}). This requires replay-generated request ids. Combined with arrival replay, it reproduces the captured token stream on the wire.
prefix: the incoming prompt's chained block hashes are matched against the records' block_hashes, longest shared prefix wins, ties go to arrival order, and each record is consumed by its first match (a duplicate prompt takes the next duplicate record; once all are consumed, retries re-serve the best match). The matched stream ends where the capture did: the engine clamps the live request's max_tokens to the recorded length. This supports closed-loop clients with their own request ids, such as an agent loop re-run against the simulator. Because block hashes are chained, a tail change in a prompt shortens the match depth without changing earlier block matches.

Unmatched requests fall back to random tokens in both modes. These modes provide deterministic streams for testing routers, EPPs, guardrails, and client SDK streaming behavior without a GPU. Prefix mode can replay a closed-loop agentic workload offline when the agent is deterministic; tests/closed_loop_prefix_replay.rs covers that case.

Every trace touchpoint (--trace-out, --latency-trace, --replay-tokens, trace conversion and replay harnesses) reads and writes gzip transparently when the path ends in .gz; token-recording traces grow by one integer per generated token, so compressing them is recommended.

Replay pacing

Content replay (--replay-tokens) and timing are independent. Pick the content mode first, then choose how quickly and in what shape the engine should emit chunks.

Mode	Invocation
Timing-modeled	`--replay-tokens trace.gz --latency-trace trace.gz` plus scheduler args matching the capture (`--max-num-seqs`, `--max-num-batched-tokens`, ...): gaps and burst sizes sampled from a model fitted to the trace
Timing-verbatim	`--replay-tokens trace.gz --replay-steps trace.gz`: each request replays its recorded per-chunk sizes and gaps
As fast as possible	`--replay-tokens trace.gz` and nothing else: all timing knobs default to 0, the instant model
Compressed but shaped	`--replay-tokens trace.gz --latency-trace trace.gz --time-scale 100`: same interleavings and relative ordering, 100x faster wall clock
Synthetic timing	`--replay-tokens trace.gz --time-to-first-token 50 --inter-token-latency 10`

For the fast path, scheduler limits still apply at zero delay. --max-num-seqs and the token budget control queueing and backpressure; increase them for pass-through replay. --output-token-chunk-size controls output framing.

Speculative decoding and diffusion

Speculative decoding and diffusion can emit multiple tokens from one engine step. A step can deliver one verified token plus accepted drafts, or a diffusion block. Capture and replay preserve this chunk structure.

On the capture side the tap records, per output chunk, one itl_ms gap and the number of tokens that chunk delivered in a parallel itl_tokens array (omitted for plain autoregressive captures, so old traces are unchanged). The first chunk has no gap; its size is output_tokens - sum(itl_tokens). With --step-stats-out the tap also writes a per-step SchedulerStats sidecar, which under speculative decoding carries spec_decoding_stats (per-position acceptance) from the vLLM engine.

There are two replay modes for this structure:

Modeled (--latency-trace): the latency model draws the recorded (gap, tokens) pairs jointly from donor pools fitted to the capture. A step the capture saw deliver four tokens replays as one four-token message after a sampled gap, never four messages at gap/4. It reproduces the burst distribution, not an individual request's recorded sequence, so sampling drift is expected.
Replay (--replay-steps): each matched request emits its recorded chunk sizes at its recorded gaps (the timing analogue of --replay-tokens; requests resolve to records via --replay-match). This mode replays recorded timing but does not transfer to a different workload.

Either way the simulator re-derives spec_decoding_stats from the bursts it emits: speculative budget K = max(itl_tokens) - 1, with a burst of N tokens reported as 1 target token plus N-1 accepted drafts. Scheduler stats therefore use the same structure as the capture. Autoregressive traces report no spec stats, matching a vLLM engine with speculation off.

# 1. Capture: vLLM engine with ngram spec decode behind the tap (writes
#    tap-trace.jsonl + step-stats.jsonl). See deploy/trace-capture/ for manifests.
just capture-up && bash deploy/trace-capture/run-capture.sh && just capture-down

# 2. Replay the recorded schedule with verbatim per-request bursts and gaps.
cargo run --release --bin vllm-vcr -- inspect calibrate-e2e \
    /tmp/trace-capture-h200/tap-trace.jsonl --replay-arrivals \
    --sim-arg=--replay-steps=/tmp/trace-capture-h200/tap-trace.jsonl \
    --dump-trace /tmp/spec-replay.jsonl

# 3. Plot capture vs replay: burst sizes, per-chunk pacing, acceptance.
uv run scripts/plot_calibration.py \
    --spec-fidelity real=/tmp/trace-capture-h200/tap-trace.jsonl \
    --spec-fidelity replay=/tmp/spec-replay.jsonl \
    --spec-steps real=/tmp/trace-capture-h200/step-stats.jsonl \
    --out-dir docs/images

Speculative decoding replay fidelity

The figure is the verbatim --replay-steps path (a 4096-record Qwen3-8B run; ngram on this workload accepts often, so ~45% of steps deliver the full 5 tokens). Left: tokens delivered per decode step, captured vs replayed. Middle: step time vs burst size; speculation verifies all K drafts in one target forward pass, so median step time is ~flat in the burst size (~12ms whether the step delivered 1 or 5 tokens). The dashed line is the ~gap/N result that would appear if one chunk were split into N equal gaps. Right: per-position draft acceptance read back from the SchedulerStats sidecar (pass a second --spec-steps replay=... to overlay the simulator's own emitted stats). Covered without a GPU by tests/spec_replay_fidelity.rs and replay_steps/engine unit tests.

Perfetto trace viewer

Convert a JSONL engine trace into the Chrome Trace Event Format and view it on https://ui.perfetto.dev. The converter is the perfetto subcommand of vllm-vcr inspect; it reads the same trace files the replay and calibration paths use (.gz transparent), and optionally overlays the tap's step-stats sidecar.

For the trace schema itself, see crates/sim-trace/src/trace.rs; for the sidecar, crates/sim-protocol/src/step_stats.rs.

Quick start

Write a Perfetto JSON file and drag it onto https://ui.perfetto.dev:

cargo run --bin vllm-vcr -- inspect perfetto trace.jsonl -o trace.perfetto.json

Or let it serve the trace and open the UI for you (blocks until Ctrl-C, since the hosted UI fetches the file from this process):

cargo run --bin vllm-vcr -- inspect perfetto trace.jsonl --open

Overlay the step-stats sidecar (vllm-vcr record --step-stats-out) for the batch-level counters and the per-step scheduler track:

cargo run --bin vllm-vcr -- inspect perfetto trace.jsonl \
  --step-stats trace-step-stats.jsonl --open

What you see

Two process groups, both on one shared clock (milliseconds since capture start; arrival_ms and the sidecar's ts_ms use the same zero).

inference trace — the per-request shapes. Each request is a prefill span (arrival → first token) followed by one decode span per inter-token gap (itl_ms), with multi-token chunks (spec decode, diffusion blocks) named decode xN. Requests are packed into reusable lanes: a lane frees the moment its request finishes, so the row count is the peak concurrency, not the request count (a 2500-request trace becomes ~16 lanes). Under it sit counter tracks:

Counter	Source	Meaning
`active_requests`	swept from the spans	in-flight request depth over time
`engine_running`	`itl_ctx.num_running`	engine-reported running count per decode gap
`prefill_tokens`	`itl_ctx.prefill_tokens`	prompt tokens that finished prefill in the gap's step
`sched_running_reqs` / `sched_waiting_reqs`	sidecar	scheduler batch and queue depth
`sched_kv_cache_usage`	sidecar	KV-cache pressure (0–1)
`sched_accept_rate`	sidecar	spec-decode acceptance rate (only when drafting)

scheduler steps (with --step-stats) — one span per executed scheduler step, back to back on a single row (steps are sequential, so they never overlap). Each step is classified and colored by what it ran: decode B<n>, prefill, or prefill+decode B<n> (+<r>r <t>t). A prefill step is visibly wider (it costs more), and the args carry running / waiting / prefill_requests / prefill_tokens / kv_cache_usage / step_ms and spec accepted/draft.

Spans are colored by phase so the language is consistent across both groups: orange = prefill, green = decode, with a distinct shade for a mixed prefill+decode step. Recorded output token ids (tap --record-tokens) ride each span's token_ids arg, the hook for a future detokenize-to-text mode.

Options

Flag	Effect
`-o, --output <path>`	Write the JSON here (default: stdout, or nothing with `--open`)
`--step-stats <path>`	Overlay the step-stats sidecar (`.gz` ok): counters + the step track
`--name <label>`	Override the process-row label (default: the trace's model)
`--track-per-request`	One labelled row per request instead of packed lanes (good for small traces)
`--open`	Serve over localhost and open the Perfetto UI; blocks until Ctrl-C
`--port <n>`	Port for `--open`; default `0` lets the OS choose a free ephemeral port

Records without an arrival_ms cannot be placed on a timeline and are dropped (the command prints how many); guidellm-converted and gen-demo traces have none, real tap captures do.

Fidelity: request shape vs scheduler steps

The two views answer different questions, and the difference matters when reading overlap.

The request shapes are a reconstruction of each request's client-observed latency envelope: the prefill span is arrival → first token, which fuses queue-wait and prefill compute into one contiguous bar, and the decode bar is the observed inter-token cadence. Under load these bars overlap heavily, but that does not mean the engine ran that many prefills at once. With chunked prefill, the engine runs roughly one prefill chunk per step, interleaved with decodes, and much of a prefill bar under saturation is queue-wait, not compute.

The scheduler steps track is the truthful counterpart: it shows what the engine actually executed each step, sequential and non-overlapping, including where a prefill chunk genuinely co-occurred with decodes in one step (prefill+decode). Reach for the request shapes to see per-request experience (what replay cares about), and the step track to see scheduler occupancy.

Notes and limits

Times are emitted in microseconds (the Chrome format's unit); displayTimeUnit is set to ms for the axis.
Timestamps are relative to capture start, not wall-clock epoch, so a trace is self-contained but not directly correlatable with an external profiler.
--open runs a minimal localhost HTTP server with permissive CORS so the hosted UI can fetch the file. The trace stays on your machine; the browser fetches it from 127.0.0.1.
Large traces produce large JSON (a ~2500-request capture is ~50 MB / ~370k events); the UI loads it fine but the default zoom fits the whole capture.

vLLM version mapping and release automation

Goal: support a rolling N-3 window (latest vLLM release plus the three before it) from a single repo, with automation that catches protocol drift the day a new vLLM lands and ships one clearly-labelled artifact per supported line.

Where vLLM version actually bites

Only one axis is a hard compile-time coupling. The rest are behavioral and get caught by tests, not the compiler.

Axis	Where it lives	Breaks on version bump?
Wire protocol	`vllm-engine-core-client` git rev → the ~6 imported types (`HandshakeInitMessage`, `ReadyMessage`, `EngineCoreFinishReason`, `ModelDtype`, `encode/decode_msgpack`)	Hard. Cargo allows exactly one rev of a git dep per build. This is the whole problem.
Registration schema	`SimReadyResponse` in `crates/sim-protocol/src/frontend_connect.rs`	Soft, but already drifted (see `block_size` note below). We own a superset, so we absorb adds.
Metrics surface	`vllm:*` Prometheus gauges (e.g. the #45030 `lora_requests_info` move)	Soft. Behavioral, caught by `e2e*.sh`.
Frontend (`vllm-rs` / python) for e2e	`scripts/e2e*.sh`	Must match the protocol rev under test.
Scheduler / step model	`src/` step engine	Soft. Doesn't break the build, shifts replay error (chunked-prefill defaults etc.). Re-validated per line via the figure/gate harness.

The trace and modeling crates (sim-trace, src/ model) are already vLLM-protocol-free. So "support N-3" is almost entirely a sim-protocol + build-matrix problem, which is what makes the matrix approach cheap.

Decision: build matrix, one artifact per line

Cargo cannot hold two revs of the same git dep in one build. A single binary speaking four protocols would mean vendoring the wire structs into per-version modules behind a trait. The surface is small (~6 types) so it isn't insane, but msgpack shapes can diverge in ways a trait can't paper over, and it is real standing maintenance.

We commit to the build matrix: one image per supported vLLM line, a manifest-driven CI matrix, and the handshake vllm_version field used to reject mismatches loudly rather than silently mis-speak the wire. Revisit the single-binary path only if the protocol surface stays this small and we have a concrete reason to ship one image.

Source of truth: `compat.toml`

A manifest at repo root defines the support window. The manifest diff is the release.

# compat.toml — the rolling support window. Oldest release line drops as N advances.

[[vllm]]
line = "nightly"              # tracks vLLM main for drift detection
tag  = "nightly"
protocol_rev = "9c7c74bf..."
fidelity_validated = false

[[vllm]]
line = "0.23"                 # current default line
tag  = "v0.23.0"              # vLLM release tag; also the e2e frontend version
protocol_rev = "17bc1445..."  # rev for vllm-engine-core-client at this line
fidelity_validated = false    # flips true once replay gates validate goldens
default = true                # what :latest / unsuffixed builds point at

[[vllm]]
line = "0.22"                 # older supported release line
tag  = "v0.22.1"
protocol_rev = "0decac0d..."
patch_repo = "https://github.com/wseaton/vllm.git"
patch_rev = "b48f2434..."
fidelity_validated = false

Rules:

Pin release lines to release tags, not arbitrary labels. tag is the vLLM release label for the line. protocol_rev is the vLLM git rev whose in-tree vllm-engine-core-client crate the simulator builds against for that line.
Exactly one default = true. That line is :latest and the unsuffixed build.
A line enters the window only when fidelity_validated = true. New lines land as false, get capture+replay validation, then flip.

Versioning: keep the two axes orthogonal

The sim's own semver (0.1.0 → ...) tracks its features. vLLM compatibility is build metadata, expressed in the image tag, never baked into the sim semver. Conflating them is the classic mistake.

Image tags:

vllm-vcr:0.3.0-vllm0.23 — immutable, the real artifact (sim version × vLLM line).
vllm-vcr:vllm0.23 — floating, latest sim for that line.
vllm-vcr:latest — sim-head × the default = true line.

CI matrix mechanics

The rev is swapped in Cargo.toml, NOT via --config patching. Cargo rejects a [patch] that points a git dependency at a different rev of the same source ("patches must point to different sources"), so --config 'patch...rev=...' fails for every line (including the head). The per-line build instead rewrites the manifest in a throwaway checkout:

cargo xtask pin-vllm "<line>"   # reads compat.toml, edits Cargo.toml
cargo build --workspace               # no --locked: the rev changed

cargo xtask pin-vllm sets the rev in [workspace.dependencies] and rewrites or removes the fork [patch] from the line's patch_repo/patch_rev (the head line carries the #45848 fork; lines without a fork build against protocol_rev upstream). compat.toml stays the single source of truth.

On main and tags the matrix builds + runs the replay gates against every line. That is the payoff: the day vLLM N+1 lands, the matrix tells us whether the wire still parses before we promote anything. Lines that are not yet fidelity_validated run non-gating (job-level continue-on-error), so a line with real API drift (e.g. a removed mock_engine module) surfaces as a non-blocking annotation rather than blocking the merge.

Handshake version guard (do regardless of path)

The frontend registration already carries vllm_version (frontend_connect.rs:43). On connect, assert the peer's version is in the artifact's supported set and refuse with a clear error otherwise. That turns "silent msgpack corruption" into "this image speaks vLLM 0.23, peer is 0.22, abort." Cheap, high value, independent of the matrix.

Rotation when N advances

When vLLM cuts N+1:

Add it to compat.toml with fidelity_validated = false.
Matrix builds it; run capture + replay to validate fidelity.
Flip fidelity_validated = true, move default to the new line.
Drop the now-N-4 line from the manifest.

Build order

compat.toml + handshake version guard. Small, immediately useful, makes mismatches loud. Done.
Manifest-driven CI matrix + conformance runner. The real automation payoff. Done (per-line build/unit/conformance; see conformance.md).
Compatibility shim for the protocol crate's per-line API drift. Done for the 0.22 line; see multi-version-shim.md (capability cfgs + owned/tolerant decodes, cargo xtask pin-vllm). This is what lets one main build against multiple lines without a single multi-version binary.
Nightly canary (.github/workflows/nightly-canary.yml). The nightly line is pinned and only moves when bumped; the canary instead pins to the LIVE upstream main HEAD each night (cargo xtask pin-vllm nightly --rev <sha>), builds, runs unit tests, runs the HEAD-client protocol e2e tests, runs the conformance runner, and publishes a rolling nightly prerelease with the sha in its notes. A red scheduled run is the early warning that upstream moved the engine-core protocol. Done.

Open coupling note: the `block_size` / registration drift

The python frontend's EngineCoreReadyResponse requires six fields including block_size (tokens per KV block). The upstream Rust vllm-engine-core-client::EngineCoreReadyResponse still has only five and is missing block_size — confirmed at both the pinned c9340e6 rev and upstream main as of 2026-06-13. Re-encoding through the crate's struct silently drops the field and the python frontend rejects the registration.

Our workaround is the forward-compatible path that already exists: the sim emits its own complete SimReadyResponse superset, and the tap relays the real engine's response bytes verbatim (immune to future adds). This is the kind of schema drift the matrix is meant to track; if upstream ever adds block_size to the Rust struct we can drop SimReadyResponse for that line, but until then the superset stays. Keep the sim_ready_response_carries_all_python_required_fields test as the canary.

Conformance testing

The matrix above answers "does this line still build and parse the wire." Fidelity (does the sim still reproduce the engine's behavior for a line) is answered by conformance testing: a profile-once/replay-many loop that pairs a manifest of golden captures with a GPU-free replay in CI. The capture runbook is conformance.md; this section ties the pieces together.

Three artifacts cooperate:

compat.toml — the N-3 window (above). Each line carries fidelity_validated, which gates whether its conformance failures block promotion.
conformance/manifest.toml — one [[golden]] entry per captured trace, with line, bucket_path, sha256, config_hash, workload, and role (schema or fidelity). The captures themselves are NOT in the repo; they live in a private bucket and CI fetches them by sha.
The capture runbook (conformance.md) — how a golden gets captured on the GPU cluster, uploaded, and registered.

The CI flow (.github/workflows/ci.yml):

compat-matrix parses compat.toml into a per-line build matrix.
The conformance matrix job builds + tests each line against its own protocol_rev (the --config patch from "CI matrix mechanics"), then fetches that line's goldens by sha, verifies the sha256, and replays them GPU-free, asserting the trace's config_hash (the profile-once/replay-many cache key, --expect-config-hash).
Lines with fidelity_validated = false build and run conformance, but their fidelity failures are continue-on-error (see "Rotation when N advances"): a freshly added line gets signal without blocking the merge. Flip fidelity_validated = true once the golden validates, and the leg becomes a hard gate.

The replay-many half needs no GPU and is the same mechanism as the offline replay rig (deploy/trace-capture/base/offline-replay.yaml): the python frontend talks to vllm-vcr play serving the captured trace, with no real engine behind it. CI runs it headlessly; the rig serves a live agent the same byte-identical streams.

Multi-version vLLM support: the compatibility shim

How the simulator builds against more than one vLLM line from a single main branch. Read versioning.md first for the strategy (build matrix, one image per line, compat.toml as source of truth) and conformance.md for capture/replay. This doc covers the code side: how we absorb the protocol crate's API drift.

The shape of the problem

The wire protocol comes from one git dependency, vllm-engine-core-client, which lives in the vLLM repo (rust/src/engine-core-client/). Its API drifts across releases. Cargo can hold only one rev of a git dep per build, so each line is a separate build (the matrix). The job of the shim is to let the same source compile against each line's crate, isolating the divergences in one place.

Per-line builds: pin, don't patch

Cargo rejects a [patch] that redirects a git dependency to a different rev of the same source ("patches must point to different sources"). So the per-line rev is swapped in [workspace.dependencies], not via --config patch. build.rs cannot do it either: dependency resolution happens before any build script runs.

cargo xtask pin-vllm <line> reads compat.toml and rewrites Cargo.toml: it sets the vllm-engine-core-client rev to the line's protocol_rev, and inserts, rewrites, or removes the fork [patch] to match the line's patch_repo/patch_rev (a fork is a different source, so it is allowed to [patch]). The committed Cargo.toml carries no [patch] block (the default line builds upstream), so a forked line's block is inserted, not rewritten; the script strips any existing block first, so it's idempotent. After the rewrite the rev no longer matches Cargo.lock, so per-line builds omit --locked.

Gotcha: the script changes the manifest but not the environment, and build.rs reads the line from VLLM_TARGET_VERSION (falling back to the compat.toml default). So a per-line build must set both: run the script and export VLLM_TARGET_VERSION=<tag>. The CI matrix does both; a local older-line build must too:

cargo xtask pin-vllm 0.22
VLLM_TARGET_VERSION=v0.22.1 cargo build --workspace   # no --locked

Capability cfgs

Where the crate's API genuinely diverges in a way owning a type can't hide (a field whose type differs per line), the engine gates on a discrete capability, not a version number. build.rs maps the target line to cfgs and declares them with cargo::rustc-check-cfg:

vllm_lora_typed — the crate exposes a typed protocol::lora module and EngineCoreRequest.lora_request: Option<LoraRequest>. On 0.23+. On 0.22 lora is opaque rmpv::Value.

cfgs from build.rs only reach the crate that owns the build script (the root crate) and its targets (incl. its tests/). Keep cfg-gated code in the root crate. rust-analyzer does not run build.rs, so it shows false-positive errors on the inactive cfg branch; cargo build is the truth.

What the shim owns

The principle: own a tolerant decode wherever possible (no cfg), and reach for a capability cfg only when a field's type differs per line. Owned types deserialize the same wire on every line (serde ignores unknown fields).

Concern	Divergence	Shim
Handshake harness types	`mock_engine` module absent before 0.23 (we never used its behavior, only structs)	`sim-protocol::mock_engine` owns `MockEngineSockets`/`MockEngineDataSockets`/`MockCoordinatorSockets` + `DEFAULT_MOCK_MAX_MODEL_LEN` + `default_dtype()`
Request-type frame	`EngineCoreRequestType::from_frame` is head-only	`sim-protocol::wire::request_type_from_frame` (1-byte decode)
Lora request	typed `LoraRequest` (0.23+) vs opaque rmpv (0.22)	`LoraSpec{lora_int_id,lora_name}` (own, decodes both) for the add_lora call + registry; `request_lora_name()` is the one `vllm_lora_typed`-gated fn for the `lora_request` field
Ready response	`EngineCoreReadyResponse.vllm_version` absent before 0.23	tap decodes its own tolerant `CapturedReadyInfo{vllm_version:Option<String>}`
Utility request	`EngineCoreUtilityRequest` derives `Deserialize` only on 0.23+ (crate was client-only)	`engine_core::UtilityRequestSpec` (`Deserialize_tuple`, matches the wire tuple)

The wire types still come from the crate (the matrix's whole point: catch drift at compile time). The shim only covers the spots where our decoding/server role needs something the client-oriented crate lacks on an older line.

Testing across lines

The matrix runs per line (see ci.yml):

cargo build --workspace — the "does the wire still compile" gate.
cargo test --workspace --lib — unit tests (compile + pass on every line).
cargo test --test conformance — the conformance runner (skips until goldens).

The full-stack e2e integration tests (tests/engine_core_e2e.rs, tests/tap_e2e.rs) drive the real EngineCoreClient, whose API is incomplete on older lines, so they target the default line via the build-and-test job, not each matrix leg. The lora lifecycle e2e test is #[cfg(vllm_lora_typed)] so the workspace still compiles tests on lines that have the typed client.

Current window

nightly (nightly): tracks vLLM main, protocol_rev is the latest post-merge commit (bumped regularly). No fork (main carries everything). build.rs treats the non-vX.Y tag as the newest line, so all capability cfgs are on. It exists to catch wire drift before a release lands: the live-HEAD nightly canary pins to upstream main, builds, runs unit tests, runs the HEAD-client protocol e2e suite, and runs the conformance runner. fidelity_validated = false because nightly has no golden capture set yet.
0.23 (v0.23.0, default): builds against upstream 17bc1445, unit tests + conformance green. No fork (#45848 is upstream here).
0.22 (v0.22.1): library + bins + unit tests + conformance all build; the shim absorbs the build-time drift. Builds against the 0.22 crate (0decac0d) [patch]ed to the wseaton/vllm serde-defaults fork (see below). fidelity_validated = false (no captures yet).
0.21 (v0.21.0): the Rust crate did not exist at the v0.21.0 tag, so there is no 0.21 rev to build against. It builds against the same 0.22 crate + fork as the 0.22 line; only its tag label and goldens differ. The capture is what proves the 0.21 engine's wire actually matches the 0.22 client types; if it diverges, 0.21 grows its own shim. fidelity_validated = false.

The 0.22/0.21 serde-defaults fork

EngineCoreSamplingParams has no #[serde(default)] on its omittable fields at 0decac0d (the 0.22 crate), so the tap can't decode a real Python frontend's msgspec omit_defaults request at capture time. This is a runtime decode issue, not a build one, so it's a [patch] (a different source), not a protocol_rev bump. The fix is vllm-project/vllm#45848 backported onto the 0.22 base:

branch wseaton/serde-defaults-0.22 on wseaton/vllm, rev b48f2434.
it adds the field defaults (temperature/top_p/repetition_penalty=1.0, max_tokens=16, the rest zero/None/empty) and nothing else.
both the 0.21 and 0.22 lines carry it via patch_repo/patch_rev in compat.toml; cargo xtask pin-vllm inserts the [patch] block per leg (the committed Cargo.toml has none, since the default line builds upstream).

Follow-ups

Capture a 0.22 golden and flip fidelity_validated = true once it's uploaded, registered in conformance/manifest.toml, and the replay gate passes.
Capture a 0.21 golden against the 0.22-crate frontend image to confirm the wire matches; flip fidelity_validated = true (or add a 0.21 shim if it diverges).

Conformance capture runbook

How to capture a golden trace for one vLLM line, upload it to the private golden bucket, register it in conformance/manifest.toml, and let CI flip that line to fidelity_validated = true. This is the "profile-once" half of the profile-once/replay-many model. The "replay-many" half is GPU-free and runs in CI and on the offline replay rig.

For the version-mapping strategy this runbook serves (the N-3 window, compat.toml, the build matrix, image tagging), see versioning.md. For the trace schema, see crates/sim-trace/src/trace.rs.

The golden captures live in an object-store bucket that the CI workflow fetches from by sha; the concrete bucket, account, and fetch-role identifiers live only in .github/workflows/ci.yml (CONFORMANCE_BUCKET). None of it is needed to build, test, or use the simulator: the GPU-free replay half runs from any trace you have. To run conformance against your own captures, point CONFORMANCE_BUCKET and the bucket_path keys in conformance/manifest.toml at a bucket you control.

Capture topology

The capture stack records the real engine's wire traffic without changing it. Three processes run as sidecars, the load generator drives them, and the tap writes the trace:

client loadgen HTTP workload

OpenAI /v1

frontend vLLM frontend vllm-rs serve

ZMQ :5570

tap vllm-vcr record relays frames verbatim

ZMQ :5580

engine real vLLM --headless, owns GPU

output trace.jsonl plus step-stats sidecar

The engine runs --headless and binds the GPU. The tap dials the engine's handshake, the frontend dials the tap's handshake, so the tap sits on the wire between frontend and engine and copies every frame to trace.jsonl (and, with --step-stats-out, the per-step SchedulerStats sidecar). Load comes from deploy/trace-capture/loadgen.py, driven in-pod by validation-runner.sh.

This is the same topology described in traces/README.md (the local-sim self-captures swap the real engine for the sim, no GPU). For conformance goldens we always use the real engine, because the point is to measure the engine vLLM actually ships for a line.

What pins the vLLM version

The vLLM version under capture is pinned by the engine container image digest, not by a tag. The per-line engine + tap/frontend images live in models.toml under [lines.<vllm_tag>], e.g.:

[lines."v0.23.0"]
engine_image = "docker.io/vllm/vllm-openai:v0.23.0@sha256:6d8429e3..."  # release, by digest
tap_image    = "ghcr.io/neuralmagic/vllm-vcr:vllm0.23"     # built for this line

For a conformance capture, pin the engine to the release tag's published image for the line you are validating (the tag field in compat.toml, e.g. v0.23.0), by its digest, that digest is the ground truth for "which vLLM this golden measures" (record it in the manifest entry's provenance). The nightly line instead points at the post-merge image at its protocol_rev. The tap and frontend stay on the per-line capture image (vllm-vcr record + vllm-rs), built against that line's protocol_rev so the wire parses, which is exactly what CI's docker.yml publishes as :vllm<line>.

Capture hygiene

These rules are carried from traces/README.md; they are the difference between a golden you can gate on and noise:

One workload per pod lifetime. A half-rewired pod records garbage. Each capture pod runs exactly one workload and is torn down.
Never fit and gate on the same trace. A trace used to fit the latency model cannot also be the fidelity gate; that is grading your own homework. Keep fitting captures and gate captures separate (the manifest role field: schema vs fidelity).
Never trust a single seed. 3 of ~10 multiturn captures turned out anomalous and only multi-seed averaging exposed them. Capture multiple seeds for any gate; a lone capture is an anomaly waiting to happen.
Fetch promptly. The cluster reaps idle GPU pods (~35 min) and emptyDir traces die with the pod. The loadgen container self-terminates after 2h if never fetched, so an abandoned run cannot squat on the GPU.

Stand up a capture per line

Each capture is a Kueue-admitted Job on the GPU cluster (waldorf), so Kueue holds the Job until GPU quota admits it and releases the GPU the moment the capture completes. The capture matrix lives in deploy/trace-capture/models.toml: one [[capture]] per model × scenario (baseline / nocache / specdecode / multimodal), each pinned to a line's engine + tap/frontend images. gen-capture-jobs.py turns a target into a Job; the scenario drives the engine and tap flags so the config_hash matches the engine config (prefix-cache, spec-decode). Generated conformance Jobs enable --record-tokens and --step-stats-out, so the trace is usable as a fidelity golden and the sidecar stats are available for timing inspection. To add a model or scenario, edit models.toml, no YAML by hand.

All capture Jobs target conformance-queue (base/conformance-queue.yaml), a dedicated Kueue queue with a one-GPU quota, so they run one at a time: submit as many as you like and they serialize, the rest wait pending (capture hygiene: one workload per pod lifetime, no cross-capture interference). Each pod runs engine/tap/frontend as sidecars and validation-runner.sh as the loadgen container, which runs $PHASES, marks the trace line count at each phase boundary, then idles until the trace is fetched.

The flow (wrapped by the justfile):

just conformance-queue                  # apply the one-GPU queue (once)
just conformance-list                   # see the targets in models.toml
just conformance-capture qwen3-8b       # ship loadgen scripts + submit the Job
just conformance-capture \
  nightly-qwen3-8b-mt-s7 \
  nightly-qwen3-8b-mt-s13 \
  nightly-qwen3-8b-nocache-s7           # rolling vLLM main goldens

# raw equivalent:
python3 deploy/trace-capture/gen-capture-jobs.py qwen3-8b | kubectl apply -f -

# wait for the loadgen to finish (it logs "waiting for fetch"):
kubectl logs -f job/trace-qwen3-8b -c loadgen

To capture a new line (e.g. a new release or nightly), add a [lines.<vllm_tag>] entry (engine image digest + the tap/frontend image built for that line) and point captures at it. The built-in nightly set is deliberately small: two multiturn seeds with prefix caching enabled plus one nocache multiturn capture, all against Qwen3-8B.

Fetch the trace + step stats

Fetch before the reaper window closes. The marker is the loadgen log line "waiting for fetch":

NAMESPACE=${NAMESPACE:-inference-sim}
JOB=trace-nightly-qwen3-8b-mt-s7

# The trace, and the per-step SchedulerStats sidecar if captured.
kubectl exec -n "$NAMESPACE" "job/$JOB" -c loadgen -- cat /trace/trace.jsonl > "$JOB.jsonl"
kubectl exec -n "$NAMESPACE" "job/$JOB" -c loadgen -- cat /trace/step-stats.jsonl > "$JOB.step-stats.jsonl"

# Let the Job complete and release the GPU.
kubectl exec -n "$NAMESPACE" "job/$JOB" -c loadgen -- touch /trace/fetched

Compress before upload (.jsonl.gz); the trace tooling and the sim read gzip transparently.

Compute the config hash

The config_hash is the profile-once/replay-many cache key. It fingerprints the capture config (model, GPU, TP, scheduler flags) so a trace cannot be replayed against a config it was not captured for. The tap stamps it into the trace metadata line via --config-hash, and the sim asserts it at replay via --expect-config-hash (see src/record.rs and crates/sim-trace/src/trace.rs).

The recipe is ConfigFingerprint in crates/sim-trace/src/config_hash.rs: a lowercase-hex SHA-256 over a versioned, order-fixed canonical form (scheme tag config-fingerprint-v3) of three inputs:

gpu (hardware, e.g. H200)
vllm_tag (the line tag, e.g. v0.23.0)
engine_config — a canonical, sorted digest of the deployed behavioral engine flags (model, tensor-parallel, gpu-mem-util, max-model-len, max-num-seqs, block-size, enforce-eager, prefix-caching, speculative config, ...), built by the capture driver.

The design rule is hardware + version + deployed behavioral flags, deliberately not a hand-curated field list (the per-knob trap, you'd forever be adding the next one) and not the entire resolved config: engine defaults aren't enumerated because vllm_tag pins them, and transport/addressing flags are excluded. vllm_tag is the line tag, NOT the engine's raw reported version (a dev build 0.23.0.dev1+g..., not reproducible across rebuilds); the reported version is recorded separately in the trace meta (vllm_version). engine_config is what keeps a cache-off, spec-decode, or eager-vs-graph capture from sharing a fingerprint with another deployment of the same model/hardware. If the input set ever changes, bump the scheme so old hashes deliberately stop matching. (Older goldens keep their own scheme's hashes and stay valid: the sim compares the stamped hash, never recomputes.)

Two ways to get the hash:

Run the tap with --vllm-version <tag>, --gpu <type>, and --engine-config <canonical> (plus --model/--tp/--block-size/--max-num-seqs for the readable trace meta) and let it compute the fingerprint, stamping config_hash into the trace meta. This is the default; gen-capture-jobs.py builds the --engine-config string (it owns the engine flags, the tap can't observe them on the wire), and the manifest entry just copies the hash.
Or pass --config-hash <hash> explicitly to override the computed value.

Upload + register the golden

Goldens are NOT committed to the repo: they are measurement data, some are large, and token-recording captures carry model content. They live under the conformance/ prefix of the bucket CI fetches by sha ($CONFORMANCE_BUCKET, set in .github/workflows/ci.yml). The GHA runner assumes a least-privilege fetch role over GitHub OIDC, scoped to s3:GetObject on conformance/* only.

# 1. Upload to the conformance/ prefix (use credentials with write access; the CI
#    role is read-only). $CONFORMANCE_BUCKET is the bucket defined in ci.yml.
# Path convention: conformance/<vllm_tag>/<gpu>/<model>/<workload>[-<seed>].jsonl.gz
# where <vllm_tag> is the release tag (v0.23.0) or "nightly" (tracks main). These
# mirror config_hash inputs (vllm_tag/gpu/model), so captures across builds, hardware,
# and models don't collide.
aws s3 cp trace.jsonl.gz \
  "$CONFORMANCE_BUCKET/conformance/<vllm_tag>/<gpu>/<model>/<workload>-<seed>.jsonl.gz"

# 2. Record the sha256 (the conformance runner verifies it after fetch).
sha256sum trace.jsonl.gz

Add a [[golden]] entry to conformance/manifest.toml:

[[golden]]
line        = "0.23"                                  # matches a compat.toml line
bucket_path = "conformance/v0.23.0/H200/Qwen/Qwen3-8B/multiturn-seed7.jsonl.gz" # key under $CONFORMANCE_BUCKET (or a full s3:// URI)
sha256      = "<sha256 of the uploaded .gz>"          # runner verifies this after fetch
config_hash = "<the trace's config_hash>"             # replay asserts --expect-config-hash
workload    = "multiturn"                              # human-readable workload label
role        = "fidelity"                              # "schema" or "fidelity"

Nightly entries use line = "nightly" and live under the conformance/nightly/... prefix, for example conformance/nightly/H200/Qwen/Qwen3-8B/multiturn-seed7.jsonl.gz. Once any nightly entry lands, the nightly canary detects it and fetches/replays it before refreshing the rolling prerelease.

The Nightly Golden Capture workflow automates this operator loop. It runs on a nightly schedule and can also be manually dispatched with an alternate target list. It authenticates to the conformance cluster with GitHub OIDC, submits the configured Kueue capture targets, uploads the token-recording traces, and opens or updates a PR with the generated nightly manifest block. It expects these repository variables:

CONFORMANCE_CLUSTER_URL — Kubernetes API server URL for the capture cluster.
CONFORMANCE_CAPTURE_ROLE_ARN — AWS role allowed to write conformance/* objects.

A line gains has_goldens = true in the CI matrix automatically once it has a [[golden]] entry; that is what turns on the AWS fetch leg for it (lines without goldens skip credentials entirely).

role = "schema" captures gate the wire protocol parsing (cheap, never a fidelity gate); role = "fidelity" captures gate replay accuracy. Per the hygiene rules, fitting captures and fidelity-gate captures must be different traces, and a fidelity gate should reference multiple seeds.

Flip the line to validated

A new line lands in compat.toml with fidelity_validated = false. The matrix builds it and runs conformance, but a fidelity failure does not block promotion (continue-on-error in the non-gating conformance step, see .github/workflows/ci.yml).

Once the golden(s) are uploaded, registered, and the replay gates green for that line:

Flip fidelity_validated = true for the line in compat.toml.
On the next run the conformance leg for that line becomes a hard gate.
When the line becomes the head, move default = true to it (:latest follows the default line) and drop the now-N-4 line per versioning.md.

The conformance runner

The runner is the tests/conformance.rs integration test. It is manifest-driven and runs entirely on CPU, so the CI matrix invokes it per line after building against that line's protocol_rev:

CONFORMANCE_BUCKET=s3://your-bucket cargo test --test conformance -- --nocapture

For the line this build targets (VLLM_TARGET_VERSION, stamped from compat.toml), it reads conformance/manifest.toml, and for each [[golden]] on that line fetches the golden straight from S3 via sim-s3 (a full s3:// bucket_path as-is, else the key joined under $CONFORMANCE_BUCKET). It then asserts:

integrity: the fetched bytes hash to the manifest sha256.
line: the golden's recorded vllm_version is on the same major.minor line as the build (assert_same_line).
provenance: the trace's config_hash equals the manifest entry's config_hash.
schema (role = "schema"): the sim's SimReadyResponse carries every field the captured engine emitted (assert_ready_response_schema, decoded from the meta's ready_response_hex). This is the automated, per-line generalization of the block_size canary: a new line that grows a registration field fails here.
fidelity (role = "fidelity"): boots the sim on the golden under the config_hash gate and asserts every recorded token stream replays byte-identically.

It skips cleanly (passes without asserting) when there are no goldens for the line, which is the normal state until captures exist. Set $CONFORMANCE_MANIFEST to point at an alternate manifest. The pure assertions live in src/conformance.rs and are unit-tested independently of any real capture.

The nightly canary is therefore a protocol-drift smoke until nightly captures exist, not a true fidelity gate: after pinning to live vLLM main, it runs the manifest runner for schema/provenance plumbing and also runs the HEAD-client protocol e2e tests (engine_core_e2e and tap_e2e) so a client-wire break still turns the run red. When a line = "nightly" golden is added, the canary detects it, assumes the same golden-fetch role as CI, and replays it before publishing the rolling prerelease.

The GPU-free replay half

The replay-many half needs no GPU anywhere. deploy/trace-capture/base/offline-replay.yaml runs the same python frontend with vllm-vcr play (not a real engine) in the engine slot, serving a captured trace with content-keyed matching:

client agent port-forward :8000

HTTP

frontend vLLM frontend no local engine rank

ZMQ

sim vllm-vcr play prefix-matched replay

reads

input captured trace JSONL or gzip

The frontend MUST run the same model/tokenizer as the capture (prefix matching is on token ids), and stays on the protocol-pin image for the line. This is the same mechanism CI's conformance step uses headlessly: fetch the golden by sha, then replay it against the sim built for that line and assert --expect-config-hash.

Building the capture image on waldorf

The tap + frontend capture image is linux/amd64; cross-building it on Apple Silicon under QEMU is unreliable (rustc SIGSEGVs under emulation). Build natively:

just image-build && just image-push builds and pushes the linux/amd64 image for the compat.toml default line (slow under emulation; run on an amd64 host).
just image-build-line <line> builds the image for an older line, e.g. just image-build-line 0.22: it pins Cargo.toml to that line's rev/fork with cargo xtask pin-vllm, stamps VLLM_TARGET_VERSION, and builds the vllm-rs frontend from the same source as the tap. (It leaves Cargo.toml/Cargo.lock rewritten; restore those files after the build if you do not want to keep the local pin.)
On Apple Silicon, use the build-on-waldorf flow to build natively on the cluster with an unprivileged kaniko pod instead of cross-building locally.

CI publishes these per line automatically (.github/workflows/docker.yml): the same pin step, VLLM_TARGET_VERSION, and frontend-source wiring run per matrix leg, so the floating vllm<line> image is built against that line's wire. Build locally only when you need an image ahead of a CI publish.

Closed-loop agentic replay: SWE-bench offline

Goal: capture one real SWE-bench agent rollout against a live vLLM, then re-run the same agent fully offline against the sim and get the same patches, the same eval results, zero GPU, zero API spend.

The repo side is done: --replay-tokens <trace> --replay-match prefix turns the sim into a content-keyed cassette player. See Content-identical replay for the replay mode; this page is the runbook for the SWE-bench demo around it.

Why this works

SWE-bench splits into rollout (the agent loop, the only LLM-dependent phase) and evaluation (swebench.harness.run_evaluation, which applies patches in per-instance Docker images and runs tests, no model calls). So only the rollout needs capture/replay.

The agent's message history is append-only: turn N+1's prompt = turn N's prompt + turn N's response + new tool output. Replay serves turn N's response byte-identically, so a deterministic agent reconstructs turn N+1's prompt exactly as captured, and the loop closes. Prefix matching keys on the chained block hashes the tap already records; tail noise in tool output (timing strings like pytest's in 0.42s) only shortens the match depth, it cannot change which record wins.

Agent harness choice

What decides replayability: volatile content must not appear early in the prompt (chained block hashes make an early difference fatal), and history must be append-only (compaction rewrites the prefix mid-session). One hazard is universal: a harness that stamps today's date into the system prompt breaks matching when the replay happens on a different day; pin the clock in the rollout container with libfaketime to the capture date.

mini-swe-agent (mini-extra swebench), the v1 and determinism control: strictly linear append-only history, deterministic truncation (pure char-count slicing at 10k chars), no timestamps/hostnames in the default swebench templates, talks OpenAI chat completions via LiteLLM so it works against the vllm-rs frontend as-is, and writes preds.json directly consumable by the official eval harness.
pi, the "real agent" driver: native tool calling, append-only history, no compaction, and its only volatile prompt content is a day-precision date + cwd (cwd is constant inside a SWE-bench container, the date is faketime's job). Point models.yml at the frontend (api: openai-completions, baseUrl), set temperature: 0, run headless per instance, git diff for the patch (model the glue on badlogic/pi-terminal-bench).
Claude Code, the headline demo: set CLAUDE_CODE_ATTRIBUTION_HEADER=0 to disable the attestation line it otherwise prepends to the system prompt (x-anthropic-billing-header: ..., per-conversation hash; vLLM > 0.17.1 also strips it server-side). Its gitStatus snapshot is reproducible when replay starts from the same SWE-bench image; the date needs faketime. The remaining work is plumbing: it speaks the Anthropic /v1/messages API, so capture needs either the native vLLM frontend (>= 0.11.2) or a LiteLLM translation proxy in front of the Rust one.

Avoid opencode for this: it puts a live git-status block early in the system prompt (changes as the agent edits files, poisoning the prefix on most steps) and auto-compacts history by default. Codex CLI can't target vLLM at all (Responses-API-only wire protocol).

Capture (one GPU run)

Standard tap sidecar setup (manifests in deploy/trace-capture/), with token recording on and the python vLLM frontend, which serves /v1/messages for Claude Code (vllm-rs doesn't yet; the protocol-pin image c9340e6f3 already ships the anthropic entrypoint including the billing-header strip):

agent SWE-bench rollout mini-swe-agent or Claude Code

HTTP

frontend vLLM Python data-parallel-size-local 0

ZMQ :5570

tap vllm-vcr record --record-tokens

ZMQ :5580

engine vLLM engine GPU-backed capture

output token trace content-sensitive JSONL

just agentic-capture-up deploys deploy/trace-capture/base/h200-capture-agentic.yaml (tap has --record-tokens on; the trace carries user content, token ids decode back to text, treat it accordingly). First deploy: verify the API-only frontend boots without a GPU; vLLM's platform probe has historically wanted CUDA, and the fallback is a CPU-target build of the same commit.
Point mini-swe-agent at the frontend and run a few instances:
```
mini-extra swebench --subset verified --split test --slice 0:5 \
  --model hosted_vllm/Qwen/Qwen3-8B --workers 1
# model base URL via the usual LiteLLM env (HOSTED_VLLM_API_BASE=http://<frontend>:8000/v1)
```
--workers 1 for the first capture: concurrent instances interleave fine (prompts don't share prefixes across instances past the system prompt), but single-stream makes the trace easy to eyeball.
Keep the agent's sampling deterministic-ish (temperature: 0 in the model kwargs). Not strictly required for replay (the recorded tokens are served regardless of what sampling the replayed client asks for), but it makes capture-vs-replay diffs meaningful.
Save preds.json from the capture run; it's the ground truth the replay must reproduce.

Replay (no GPU)

On the cluster: just replay-up deploys deploy/trace-capture/base/offline-replay.yaml (the same python frontend in front of vllm-vcr play --replay-match prefix, zero GPU), then just replay-load-trace <trace> copies the capture in; the sim starts as soon as the file appears. Or run the same pair locally:

vllm-vcr play --handshake-address tcp://127.0.0.1:5570 \
  --replay-tokens swebench-capture.jsonl.gz --replay-match prefix \
  --latency-trace swebench-capture.jsonl.gz   # optional: replay timing too

Same tokenizer is load-bearing: matching happens on token ids, so the frontend must run the same model name and tokenize identical text to identical ids (keep both rigs on the protocol-pin image).

WARNING: keep --latency-trace on when replaying tool-calling models through the python frontend. Unpaced replay delivers entire responses faster than the frontend's streaming consumer drains them; the per-request RequestOutputCollector then merges deltas, and vLLM's qwen3_coder streaming tool parser corrupts (>= ~8 tokens/delta) or silently drops (whole response in one delta) tool calls, which kills the agent loop after one turn. Paced replay reproduces the captured ~1 token/delta layout and sidesteps the bug (upstream: vllm#45256; local repros in demo/repro-qwen3coder-burst.py and demo/bench-parser-quadratic.py; the parser's full-text rescans are also O(n^2) in response length). The Rust frontend's tool parsers are delta-layout invariant and don't need this.

Then re-run the exact same agent command against it and compare: preds.json (replay) == preds.json (capture), then run the official eval on the replayed predictions:

python -m swebench.harness.run_evaluation \
  --predictions_path preds.json --run_id offline-replay ...

Watch the sim logs: every "no trace record shares a prompt prefix" or "prompt shorter than one block" warning is a request that fell back to random tokens, i.e. a divergence to explain (usually nondeterministic tool output deep enough in the prompt to flip an entire turn, or a tokenizer mismatch).

Known hazards

Tool-output nondeterminism inside the SWE-bench containers (timestamps, partial output on command timeouts, unsorted find/ls). Tail noise is absorbed by longest-prefix matching; noise early in a turn's tool output shifts every later block and degrades the match to the turns before it.
Aborted requests are dropped by the tap, so a client-side timeout during capture leaves a hole in the trace and the replayed agent gets random tokens for that turn.
Concurrent identical prompts (agent retries) consume duplicate records in arrival order; once records run dry, retries re-serve the best match.

Dependencies of note

The dependency graph is intentionally split so offline trace tooling stays light while protocol-facing binaries remain pinned to the vLLM line they speak.

vllm-engine-core-client is a pinned git dependency on vllm-project/vllm (rev in Cargo.toml). The supported-line matrix rewrites this rev from compat.toml; do not let routine dependency tooling bump it.
nixl-sys is a pinned git dependency on ai-dynamo/nixl, matching the source used when the image builds libnixl. The default build does not compile it; enable nixl for real transfers or nixl-stub for typechecking without libnixl.
sim-trace is deliberately vLLM-protocol-free. It owns the trace schema, calibration models, Perfetto conversion, and guidellm conversion so analysis tools can build without compiling the protocol stack.

Keyboard shortcuts

vllm-vcr