Report #25185
[tooling] llama.cpp server OOM or low throughput with multiple concurrent requests
Enable continuous batching with the `--cont-batching` (or `-cb`) flag to process multiple sequences in flight simultaneously without padding waste.
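A minimal sketch of assembling a launch command that includes the flag. The binary name `llama-server`, the model path, and the default values are assumptions for illustration and are not taken from this report. `-np` sets the number of parallel slots. Some recent llama.cpp builds may already enable continuous batching by default, so check `--help` for your version before relying on this.

```python
import shlex


def build_server_cmd(model_path, parallel=4, ctx_size=8192, port=8080):
    """Return an argv list for a llama.cpp server with continuous batching.

    The binary name and the default values are assumptions for illustration.
    """
    return [
        "llama-server",        # assumed binary name; older builds call it `server`
        "-m", model_path,      # hypothetical model path supplied by the caller
        "-c", str(ctx_size),   # context size in tokens
        "-np", str(parallel),  # number of sequences decoded in parallel (slots)
        "--cont-batching",     # the flag this report recommends
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_server_cmd("models/model.gguf")
    # Print a copy-pasteable command line; pass `cmd` to subprocess.Popen to launch.
    print(shlex.join(cmd))
```

In many llama.cpp versions the context given by `-c` is divided among the `-np` slots, so raising the slot count without raising `-c` may shrink the window available to each request. Verify this against your build.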
Journey Context:
Without this flag, the llama.cpp server handles sequences one at a time or in static batches, causing VRAM fragmentation and poor throughput for API servers under concurrent load. Continuous batching (also called in-flight batching) schedules each decoding step across whichever sequences are currently active, so a finished sequence frees its slot for the next request immediately and GPU utilization stays high. This is essential for any production server scenario but is often omitted from basic setup guides.
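The scheduling difference can be illustrated with a toy model. This is not llama.cpp's actual scheduler: it assumes every decode step costs the same, ignores prompt processing, and uses hypothetical function names and made-up sequence lengths. It only counts how many decode steps are needed to finish a set of requests with a fixed number of slots.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch must finish before the next one starts."""
    steps = 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        steps += max(batch)  # short sequences idle until the longest one ends
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled immediately.

    Assumes every length is a positive number of tokens to generate.
    """
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1  # one step yields one token per active sequence
        # Drop sequences that just produced their last token.
        active = [r - 1 for r in active if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 7]  # made-up output lengths, in tokens
    print(static_batching_steps(lengths, slots=2))      # 15
    print(continuous_batching_steps(lengths, slots=2))  # 12
```

With two slots, the static schedule waits for the 8-token sequence before starting the second batch, while the continuous schedule starts the third request as soon as the 2-token sequence completes.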
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:40:44.259726+00:00— report_created — created