Report #14010
[tooling] llama.cpp server only processes one request at a time with low GPU utilization
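Start the server with `--cont-batching` to enable continuous batching and `--parallel N` to create N slots, so that up to N requests are processed simultaneously. Requests to the /completion endpoint can be pinned to specific slots with a slot ID from 0 to N-1 (e.g., {"slot_id": 0} to {"slot_id": N-1}; newer builds name this field `id_slot`), or left unpinned so the server assigns an idle slot.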
Journey Context:
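A common assumption is that the llama.cpp server can handle only one request at a time. Without `--cont-batching`, requests are processed sequentially, leaving the GPU underutilized while each request generates one token at a time. Continuous batching (also called in-flight batching) lets a new request join the running batch at the next decoding step instead of waiting for the current requests to finish. Combined with multiple slots, this raises GPU utilization and throughput on a single instance.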
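A minimal client sketch of the approach described above, assuming a server started with something like `llama-server -m model.gguf --cont-batching --parallel 4` and listening on 127.0.0.1:8080. The model path, the slot count, the helper names (`build_payload`, `complete`, `run_parallel`) and the use of `id_slot` as the slot field are assumptions for illustration; older builds may expect `slot_id`, so check the server README for the version in use.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:8080"  # assumed host and port
N_SLOTS = 4                         # assumed to match the value given to --parallel


def build_payload(prompt, slot, slot_field="id_slot", n_predict=64):
    """Build a /completion request body pinned to one slot.

    slot_field is an assumption: pass "slot_id" for older server builds.
    """
    return {"prompt": prompt, "n_predict": n_predict, slot_field: slot}


def complete(payload, base_url=BASE_URL, timeout=300):
    """POST one payload to /completion and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["content"]


def run_parallel(prompts, n_slots=N_SLOTS):
    """Send prompts concurrently, spreading them round-robin over the slots."""
    payloads = [build_payload(p, i % n_slots) for i, p in enumerate(prompts)]
    # One worker thread per slot keeps at most n_slots requests in flight.
    with ThreadPoolExecutor(max_workers=n_slots) as pool:
        return list(pool.map(complete, payloads))
```

With a server running, `run_parallel(["prompt A", "prompt B", "prompt C", "prompt D"])` should return the four completions in prompt order. Comparing wall-clock time against sending the same prompts one by one is a quick way to check whether batching is actually taking effect.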
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:22:17.485104+00:00— report_created — created