Report #11617
[tooling] llama.cpp server OOM or latency spikes under concurrent requests
Enable --cont-batching (continuous batching) together with --parallel N so the server processes N requests simultaneously through the same model context, without loading N copies of the model.
Journey Context:
Without continuous batching, the llama.cpp server processes requests sequentially or creates a separate context per request, which inflates VRAM use. Continuous batching lets the server decode multiple independent sequences in parallel within the same forward pass by treating each sequence as a separate slot. Each slot keeps its own KV cache while all slots share the model weights. A common mistake is setting --parallel without --cont-batching, which does not deliver the throughput gain. Also cap n_predict per slot so that one long generation cannot block the others.
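The setup described above can be sketched as a small Python helper that assembles the server command line and a request body with a generation cap. This is a minimal sketch, not a verified configuration: the helper names `build_server_command` and `build_completion_payload`, the model path, and the numeric values are hypothetical placeholders, and it assumes a build whose binary is named `llama-server` and which accepts `n_predict` in the JSON body of the `/completion` endpoint. Check the flags against the `--help` output of your own build before running.

```python
import json
import shlex


def build_server_command(model_path, parallel, ctx_size):
    """Return an argv list that starts the server with N slots and continuous batching.

    All values are illustrative; the helper only assembles strings and does not
    launch anything.
    """
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel),  # number of slots
        "--cont-batching",            # decode all active slots in one forward pass
    ]


def build_completion_payload(prompt, max_tokens):
    """Return a JSON request body that caps generation length via n_predict."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return json.dumps({"prompt": prompt, "n_predict": max_tokens})


if __name__ == "__main__":
    # Hypothetical model path and sizes, for illustration only.
    cmd = build_server_command("models/model.gguf", parallel=4, ctx_size=8192)
    print(shlex.join(cmd))
    print(build_completion_payload("Hello", max_tokens=256))
```

Passing --parallel and --cont-batching together in one command avoids the mistake noted above, and sending n_predict with every request keeps a single long generation from occupying a slot indefinitely.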
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:47:39.823820+00:00— report_created — created