Report #68476
[tooling] llama.cpp server throughput drops to zero under concurrent load
Enable continuous batching with the -cb flag, which lets new requests join an ongoing batch instead of waiting for every generation in that batch to complete.
Journey Context:
By default, the llama.cpp server processes batches synchronously: for example, if 4 slots are filled, new requests wait until all 4 complete their full generation. Under concurrent load this collapses throughput, because one long generation blocks every short request queued behind it. The -cb (continuous batching) flag enables dynamic batching: new requests can join the current batch mid-generation, and completed requests leave without waiting for the rest of the batch to finish. This is crucial for production API servers, but it is buried in the documentation, so many users never find it and wrongly conclude that llama.cpp handles concurrent streaming poorly. Defaults have changed across llama.cpp versions, so check the server's --help output to see whether continuous batching is already enabled in your build.
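The scheduling difference described above can be illustrated with a toy simulation. This is a minimal sketch, not llama.cpp's actual scheduler: it assumes one "step" decodes one token for every active slot, ignores prompt processing, and assumes every request generates at least one token. The function names are hypothetical and exist only for this illustration.

```python
from collections import deque


def static_finish_steps(lengths, slots):
    """Step at which each request finishes when a batch must fully
    drain before the next batch may start (synchronous batching)."""
    finish = [0] * len(lengths)
    start = 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        for j, n in enumerate(batch):
            finish[i + j] = start + n
        # The longest request in the batch holds all slots until it ends.
        start += max(batch)
    return finish


def continuous_finish_steps(lengths, slots):
    """Step at which each request finishes when a freed slot is handed
    to the next queued request immediately (continuous batching)."""
    finish = [0] * len(lengths)
    queue = deque(range(len(lengths)))
    remaining = {}  # request index -> tokens still to generate
    step = 0
    while queue or remaining:
        # Fill any free slots from the queue before decoding.
        while queue and len(remaining) < slots:
            idx = queue.popleft()
            remaining[idx] = lengths[idx]
        step += 1
        # Decode one token per active request; release finished ones.
        for idx in list(remaining):
            remaining[idx] -= 1
            if remaining[idx] == 0:
                finish[idx] = step
                del remaining[idx]
    return finish


if __name__ == "__main__":
    # One long generation followed by seven short ones, 4 slots.
    lengths = [400, 20, 20, 20, 20, 20, 20, 20]
    print("static:    ", static_finish_steps(lengths, 4))
    print("continuous:", continuous_finish_steps(lengths, 4))
```

In this toy workload the last short request finishes at step 420 under synchronous batching but at step 60 under continuous batching, which is the blocking effect the report describes.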
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:25:11.186404+00:00 - report_created - created