Report #13660
[tooling] llama.cpp server crashes or OOMs with multiple concurrent requests, or throughput drops to single-threaded levels
Configure --parallel N (slots) together with --cont-batching to enable continuous batching; size the KV cache knowing that --ctx-size is divided among the slots, and watch for deferred requests and KV cache pressure via the server metrics endpoint (enabled with --metrics).
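A minimal sketch of the launch configuration described above, written as a Python helper that builds the argument list. The helper name build_server_args, the model path, and the port are hypothetical; the flags are the ones named in this report plus -m, --port and --metrics. It assumes --ctx-size is the total context shared by all slots, so it passes slots * ctx_per_slot.

```python
from typing import List


def build_server_args(model_path: str, slots: int, ctx_per_slot: int,
                      port: int = 8080) -> List[str]:
    """Build a llama-server command line for N slots (illustrative helper)."""
    if slots < 1 or ctx_per_slot < 1:
        raise ValueError("slots and ctx_per_slot must be >= 1")
    # Assumption: --ctx-size is the total budget, split evenly across slots.
    total_ctx = slots * ctx_per_slot
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(slots),
        "--cont-batching",
        "--ctx-size", str(total_ctx),
        "--metrics",
        "--port", str(port),
    ]


if __name__ == "__main__":
    # model.gguf is a placeholder path, not a file from the report.
    print(" ".join(build_server_args("model.gguf", slots=4, ctx_per_slot=2048)))
```

The list can be passed to subprocess.Popen as-is; check llama-server --help first, since flag defaults differ between builds.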
Journey Context:
By default, the llama.cpp server runs with --parallel 1 and processes one sequence at a time, so concurrent requests are queued rather than served in parallel. Setting --parallel N creates N slots, each with its own share of the KV cache, but without continuous batching the server still finishes the current batch before admitting new sequences. Continuous batching (--cont-batching) lets the server batch tokens from sequences that are at different generation steps, which keeps the GPU busy. Common error: setting --parallel 4 and expecting every slot to keep a full 8192-token context. Because --ctx-size is the total that is divided among the slots, --ctx-size 8192 leaves each slot only 2048 tokens, and raising it to 4*8192 = 32768 so that each slot keeps 8192 can OOM on a 24GB card. Fix: set --ctx-size to the number of slots times the per-slot context you actually need (e.g., --ctx-size 8192 for 4 slots of 2048 tokens each), or use KV cache quantization; confirm the per-slot context in the server startup log. Running multiple server instances on different ports is an alternative, but it complicates load balancing; a single server with slots is more efficient. The --cont-batching flag is easy to miss in builds where it is not the default; newer builds appear to enable it by default, so check llama-server --help for your version.
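The sizing arithmetic can be sketched as follows. This is a rough estimate, not the server's own accounting: it assumes --ctx-size is split evenly across slots and that the KV cache stores one K and one V vector per layer per token at f16. The function names and the model shape (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values from this report.

```python
def ctx_per_slot(total_ctx: int, slots: int) -> int:
    """Tokens available to each slot when --ctx-size is split evenly (assumed)."""
    if slots < 1:
        raise ValueError("slots must be >= 1")
    return total_ctx // slots


def kv_cache_bytes(total_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2.0) -> int:
    """Rough KV cache size: one K and one V vector per layer per token.

    bytes_per_elem=2.0 models an f16 cache; a quantized cache would use less.
    Ignores model weights, compute buffers, and allocator overhead.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(per_token * total_ctx)


if __name__ == "__main__":
    # Illustrative model shape (an assumption, not taken from the report).
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for total_ctx in (8192, 32768):
        gib = kv_cache_bytes(total_ctx, **shape) / 1024**3
        print(f"--parallel 4 --ctx-size {total_ctx}: "
              f"{ctx_per_slot(total_ctx, 4)} tokens/slot, ~{gib:.2f} GiB KV cache")
```

For this assumed shape the estimate is about 1 GiB at 8192 total tokens and about 4 GiB at 32768, on top of the model weights, which is why quadrupling --ctx-size to preserve per-slot context can exhaust a 24GB card.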
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:19:39.574569+00:00— report_created — created