Report #26573
[tooling] llama.cpp server latency spikes and throughput collapse under concurrent client load
Enable continuous batching (in-flight batching) with --cont-batching (or -cb) and raise -np (the number of parallel sequences, i.e. server slots) to match expected concurrency, so the GPU can process tokens from multiple sequences in a single batched forward pass.
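As a rough illustration of the fix, the sketch below assembles a server command line with the flags named above. Several things here are assumptions rather than part of the report: the binary name `llama-server`, the model path `models/example.gguf`, the helper name `build_server_command`, and the idea that the total context passed with -c is divided among the -np slots (so -c is scaled with the slot count). Check `--help` on your build before relying on any of it.

```python
import shlex


def build_server_command(model_path, n_parallel, ctx_per_slot):
    """Return an argv list for a llama.cpp server with continuous batching.

    Assumption: the -c value is the total context shared by all slots,
    so it is scaled by the number of parallel sequences.
    """
    if n_parallel < 1 or ctx_per_slot < 1:
        raise ValueError("n_parallel and ctx_per_slot must be positive")
    total_ctx = n_parallel * ctx_per_slot
    return [
        "llama-server",          # assumed binary name
        "-m", model_path,        # hypothetical model path
        "-c", str(total_ctx),    # total context across all slots
        "-np", str(n_parallel),  # number of parallel sequences (slots)
        "-cb",                   # continuous batching
    ]


cmd = build_server_command("models/example.gguf", n_parallel=4, ctx_per_slot=2048)
print(shlex.join(cmd))
```

Building the argv as a list keeps the slot count and the context size tied together in one place, which makes it harder to raise one without the other.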
Journey Context:
Without continuous batching, the server runs one batch of sequences to completion before starting the next, which causes head-of-line blocking: if one sequence generates 1000 tokens, the requests queued behind it stall until it finishes. With continuous batching, the scheduler can add or remove sequences at every token-generation step, so when one sequence finishes or hits a stop token, a waiting request immediately takes its slot. This keeps the GPU busy and often increases throughput 3-5x on concurrent workloads, but -np (max parallel sequences) needs careful tuning to avoid running out of memory as KV caches accumulate.
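The head-of-line effect can be seen in a toy model. The sketch below is not llama.cpp's scheduler. It is a simplified simulation that assumes every active sequence produces exactly one token per step, ignores prompt processing, and treats a step as costing the same regardless of batch size. The function names and the workload are made up for illustration.

```python
from collections import deque


def static_batching(lengths, slots):
    """Finish step of each request when a batch must drain before the next starts."""
    finish = []
    clock = 0
    for start in range(0, len(lengths), slots):
        batch = lengths[start:start + slots]
        # Each request completes after its own token count...
        finish.extend(clock + n for n in batch)
        # ...but the next batch waits for the longest request in this one.
        clock += max(batch)
    return finish


def continuous_batching(lengths, slots):
    """Finish step of each request when a freed slot is refilled at the next step."""
    finish = [0] * len(lengths)
    waiting = deque(range(len(lengths)))
    active = {}  # request index -> tokens still to generate
    clock = 0
    while waiting or active:
        # Refill any free slots from the queue before the next decode step.
        while waiting and len(active) < slots:
            idx = waiting.popleft()
            active[idx] = lengths[idx]
        clock += 1  # one decode step: every active sequence emits one token
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                finish[idx] = clock
                del active[idx]
    return finish


# One long request followed by seven short ones, served with 4 slots.
lengths = [1000, 10, 10, 10, 10, 10, 10, 10]
static_finish = static_batching(lengths, slots=4)
continuous_finish = continuous_batching(lengths, slots=4)
print("static:    ", static_finish)
print("continuous:", continuous_finish)
```

In this model the short requests in the second group wait behind the 1000-token request under static batching, while under continuous batching they take over slots as soon as earlier short requests finish. Real gains depend on the model, hardware, and prompt lengths.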
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:00:09.983620+00:00— report_created — created