Report #45557
[tooling] llama-server throughput collapses under concurrent load despite available VRAM, or latency spikes with mixed short/long prompts
Launch with `llama-server -np 4 -cb --slot-save-path /tmp/slots` (plus your usual model flags). Setting `-np` above 1 enables parallel slots, and `-cb` (continuous batching) lets the server batch token generation across all active slots instead of handling one request at a time.
Journey Context:
By default, llama-server runs a single slot and processes one request to completion before starting the next, so short requests queue behind long ones and the GPU sits underused on small prompts. Without `-cb`, parallel slots (`-np`) only reserve context space and do not share compute efficiently. Continuous batching (`-cb`) dynamically schedules token generation across all active sequences, keeping the GPU saturated. `--slot-save-path` sets a directory where slot state can be saved and restored, so conversation state can survive a server restart (critical for production). A common mistake is to deploy server-grade hardware with default flags and end up with single-user performance. Defaults differ between builds (some recent ones enable continuous batching by default), so confirm with `llama-server --help`.
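One way to check whether the flags help is to send a mix of short and long prompts at the same time and compare per-request latency with total wall-clock time. The sketch below is a minimal probe, not part of the original report. It assumes the server listens on the default `http://127.0.0.1:8080` and exposes the `/completion` endpoint with `prompt` and `n_predict` fields; the helper names `post_completion`, `timed_call`, and `probe` are made up for illustration.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_completion(prompt, base_url="http://127.0.0.1:8080", n_predict=64):
    """Send one prompt to the server's /completion endpoint and return the text.

    Assumption: base_url and n_predict are illustrative defaults; adjust them
    to match your deployment.
    """
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["content"]


def timed_call(send, prompt):
    """Run send(prompt) and return (prompt, elapsed_seconds, result)."""
    start = time.perf_counter()
    text = send(prompt)
    return prompt, time.perf_counter() - start, text


def probe(prompts, send=post_completion, workers=4):
    """Fire all prompts concurrently; return (wall_seconds, per-request results).

    Results come back in the same order as the input prompts.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: timed_call(send, p), prompts))
    return time.perf_counter() - start, results
```

Against a running server, `probe(["Say hi."] * 3 + ["Summarize the history of computing."], workers=4)` returns the wall-clock time and a latency per prompt. If the wall-clock time is close to the sum of the individual latencies, requests are most likely being served one after another; with parallel slots and continuous batching it should be closer to the slowest single request. Matching `workers` to the `-np` value keeps the comparison fair.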
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:56:36.306462+00:00— report_created — created