Report #60913
[tooling] llama.cpp server serializing parallel requests instead of batching, causing throughput collapse under load
Start the server with --parallel N (e.g., 4) and --cont-batching, then monitor the /slots endpoint to confirm that concurrent requests occupy separate slots. Together these flags enable parallel processing with continuous batching.
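A minimal sketch of the launch-and-check step, under stated assumptions. The binary name `llama-server`, the default address `http://127.0.0.1:8080`, and the per-slot `is_processing` field are assumptions that vary between llama.cpp builds (older builds name the binary `server` and report slot state differently, and some builds require the /slots endpoint to be enabled explicitly). The helper names are hypothetical.

```python
import json
import urllib.request


def build_server_command(model_path, parallel=4, binary="llama-server"):
    """Build the argv list for a server with N slots and continuous batching.

    The binary name is an assumption; only --parallel and --cont-batching
    come from the report itself.
    """
    return [binary, "-m", model_path, "--parallel", str(parallel), "--cont-batching"]


def fetch_slots(base_url="http://127.0.0.1:8080"):
    """Fetch the /slots endpoint and return the parsed JSON list."""
    with urllib.request.urlopen(base_url + "/slots", timeout=5) as resp:
        return json.load(resp)


def summarize_slots(slots, busy_key="is_processing"):
    """Count busy and idle slots.

    busy_key is an assumed field name; check the JSON your build returns.
    """
    busy = sum(1 for slot in slots if slot.get(busy_key))
    return {"total": len(slots), "busy": busy, "idle": len(slots) - busy}
```

While several clients are sending requests, `summarize_slots(fetch_slots())` should report more than one busy slot. If it never exceeds one, requests are probably still being serialized.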
Journey Context:
By default, the llama.cpp server may process requests sequentially or reuse slots inefficiently, so each client waits for another client's generation to finish. The --parallel flag allocates N independent slots, each with its own KV cache, allowing N sequences to run concurrently. Continuous batching (--cont-batching) combines the decode steps of all active sequences into a single batched forward pass, which keeps the GPU busier than decoding each sequence separately. Common mistakes: running --parallel without --cont-batching, or not checking /slots to confirm that concurrent requests land in separate slots (status 'processing' vs 'idle'). Tradeoff: each slot consumes VRAM for its KV cache, roughly 2 (K and V) × layers × context length × KV heads × head dimension × bytes per element, which reduces the memory left for model weights or longer contexts.
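The VRAM tradeoff can be estimated with a rough back-of-the-envelope calculation. The sketch below assumes an unquantized f16 KV cache (2 bytes per element) and ignores allocator overhead. The model dimensions in the usage note are illustrative, not taken from any specific model, and the function names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence of n_ctx tokens.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an f16 cache; a quantized cache would be smaller.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def total_kv_cache_bytes(parallel, per_slot_ctx, n_layers, n_kv_heads, head_dim,
                         bytes_per_elem=2):
    """Approximate KV cache size for `parallel` slots of per_slot_ctx tokens each."""
    per_slot = kv_cache_bytes(n_layers, per_slot_ctx, n_kv_heads, head_dim,
                              bytes_per_elem)
    return parallel * per_slot
```

For example, with 32 layers, 8 KV heads, and a head dimension of 128, one slot with 8192 tokens of context needs `kv_cache_bytes(32, 8192, 8, 128)` bytes, which is 1 GiB, and four such slots need about 4 GiB. Some llama.cpp builds instead divide the configured total context across the slots, in which case adding slots shrinks each slot's context rather than growing total VRAM use; check which behavior your build has.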
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:43:51.255703+00:00— report_created — created