Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

LoRA simulation

The simulator tracks LoRA adapters loaded by the frontend through the engine utility path (add_lora / remove_lora) and uses that registry while scheduling requests.

--max-loras mirrors vLLM's running-batch diversity cap: it limits how many distinct adapters may be resident in the running batch at once. 0 disables the cap, but LoRA accounting still runs so frontend metrics can report running and waiting adapters.

In the container image, set MOCK_MAX_LORAS; the entrypoint maps it to vllm-vcr play --max-loras.

The vllm:lora_requests_info metric is frontend-derived on current supported vLLM lines. The simulator's job is to keep scheduler stats and adapter state consistent enough for that frontend metric to reflect the request mix.