Replay pacing

Content replay (`--replay-tokens`) and timing are independent. Pick the content mode first, then choose how quickly and in what shape the engine should emit chunks.

| Mode | Invocation |
| --- | --- |
| Timing-modeled | `--replay-tokens trace.gz --latency-trace trace.gz` plus scheduler args matching the capture (`--max-num-seqs`, `--max-num-batched-tokens`, ...): gaps and burst sizes sampled from a model fitted to the trace |
| Timing-verbatim | `--replay-tokens trace.gz --replay-steps trace.gz`: each request replays its recorded per-chunk sizes and gaps |
| As fast as possible | `--replay-tokens trace.gz` and nothing else: all timing knobs default to 0, the instant model |
| Compressed but shaped | `--replay-tokens trace.gz --latency-trace trace.gz --time-scale 100`: same interleavings and relative ordering, 100x faster wall clock |
| Synthetic timing | `--replay-tokens trace.gz --time-to-first-token 50 --inter-token-latency 10` |
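
To make the compressed and synthetic rows concrete, here is a minimal Python sketch of the arithmetic they imply. It is not the tool's implementation. It assumes that `--time-scale` divides every gap by the given factor, and that synthetic timing emits the first chunk after the time-to-first-token value and each later chunk one inter-token latency after the previous one. The function names are hypothetical, and the time unit is left unspecified because the table does not state it.

```python
from itertools import accumulate


def scaled_emit_times(gaps, time_scale=1.0):
    """Cumulative emit times for a list of gaps, compressed by time_scale.

    Assumption: scaling divides each gap by the same factor, so relative
    ordering between chunks is preserved.
    """
    if time_scale <= 0:
        raise ValueError("time_scale must be positive")
    return list(accumulate(gap / time_scale for gap in gaps))


def synthetic_emit_times(num_chunks, time_to_first_token, inter_token_latency):
    """Emit times for a fixed first-token delay and a fixed per-chunk gap."""
    return [
        time_to_first_token + i * inter_token_latency
        for i in range(num_chunks)
    ]


if __name__ == "__main__":
    print(scaled_emit_times([100, 200, 300], time_scale=100))
    print(synthetic_emit_times(3, 50, 10))
```

Because every gap is divided by the same positive factor, two chunks that were emitted in a given order before scaling keep that order afterwards, which is the property the "Compressed but shaped" row relies on.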

On the fast path, scheduler limits still apply at zero delay. `--max-num-seqs` and the token budget (`--max-num-batched-tokens`) still control queueing and backpressure, so raise them for pass-through replay. `--output-token-chunk-size` controls output framing.
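
The effect of those limits at zero delay can be seen in a toy model. The Python sketch below is an illustration only and does not reproduce the engine's scheduler. It assumes that each scheduler step admits waiting requests until `max_num_seqs` are running, and that at most `max_num_batched_tokens` of the running requests emit one token in that step. The function name `steps_to_drain` is hypothetical.

```python
from collections import deque


def steps_to_drain(request_lengths, max_num_seqs, max_num_batched_tokens):
    """Count steps a toy scheduler needs to emit every token at zero delay.

    Toy assumptions: one token per running request per step, at most
    max_num_seqs requests running, at most max_num_batched_tokens tokens
    emitted per step (earlier-admitted requests go first).
    """
    if max_num_seqs < 1 or max_num_batched_tokens < 1:
        raise ValueError("limits must be at least 1")
    waiting = deque(request_lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests until the concurrency limit is reached.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        # Only as many requests as the token budget allows make progress.
        budget = min(max_num_batched_tokens, len(running))
        running = [
            left - 1 if i < budget else left
            for i, left in enumerate(running)
        ]
        # Drop finished requests so new ones can be admitted next step.
        running = [left for left in running if left > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    trace = [4] * 8  # eight requests of four tokens each
    print(steps_to_drain(trace, max_num_seqs=2, max_num_batched_tokens=2))
    print(steps_to_drain(trace, max_num_seqs=8, max_num_batched_tokens=8))
```

In this model the tight limits force requests to run in waves even though no delay is injected, while limits at or above the trace's concurrency let everything through in as many steps as the longest request has tokens.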
