Report #14609
[tooling] llama-server crashes or queues requests serially under concurrent load despite available VRAM
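Launch with `llama-server --parallel 4 --cont-batching` (where 4 is the maximum number of concurrent slots), and check that the KV cache for all slots fits in VRAM alongside the weights: `VRAM ≈ model_size + (parallel × ctx_per_slot × n_layers × 2 × n_kv_heads × head_dim × bytes_per_cache_element)`. Here `ctx_per_slot` is the context available to each slot; many llama.cpp builds treat `--ctx-size` as a total that is divided evenly among the slots, so confirm which behavior your build uses. This enables in-flight batching, where new requests join the current GPU batch instead of waiting for earlier requests to complete.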
Journey Context:
Without `--parallel`, llama-server processes requests sequentially, causing head-of-line blocking and GPU idle time between requests. Users often try to solve this by launching multiple server instances behind a load balancer, but every instance loads its own copy of the weights (2× model size in VRAM for two instances). The `--parallel` flag reserves KV cache slots for N independent sequences, while `--cont-batching` (now default in recent builds) allows the scheduler to mix prefill and decode phases dynamically. The critical mistake is setting `--parallel` too high: each slot consumes `ctx_per_slot × n_layers × 2 (K+V) × n_kv_heads × head_dim × 2 bytes` (for FP16 cache). For a 70B model with grouped-query attention (80 layers, 8 KV heads, head dimension 128) at 8192 tokens per slot, one slot is about 2.5 GiB and four slots are about 10 GiB just for cache. A model without grouped-query attention has as many KV heads as attention heads, so its cache is several times larger. The fix is to calculate the VRAM budget before setting the flag, which typically means reducing the per-slot context when increasing `--parallel`. Alternatives like vLLM offer similar scheduling but mainly target NVIDIA CUDA and AMD ROCm GPUs; llama-server remains a practical choice on Apple Silicon (Metal) and also runs on AMD ROCm.
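The budget check described above can be sketched in a few lines of Python. This is a rough estimate, not llama.cpp's own accounting: it assumes an FP16 cache (2 bytes per element), takes the per-slot context as input, and ignores compute buffers unless you pass them as `overhead_bytes`. The function names, the 48 GiB card, and the 40 GiB quantized model size are hypothetical values chosen for illustration.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_slot(ctx_per_slot, n_layers, n_kv_heads, head_dim,
                            bytes_per_elem=2):
    """Estimate K+V cache bytes for one slot holding ctx_per_slot tokens."""
    # The factor 2 accounts for storing both keys and values at every layer.
    return ctx_per_slot * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem


def max_parallel_slots(vram_bytes, model_bytes, per_slot_bytes,
                       overhead_bytes=0):
    """Largest slot count whose cache fits after weights and overhead."""
    free = vram_bytes - model_bytes - overhead_bytes
    return max(free // per_slot_bytes, 0)


# 70B-class model with grouped-query attention: 80 layers, 8 KV heads,
# head dimension 128, 8192 tokens per slot, FP16 cache.
per_slot = kv_cache_bytes_per_slot(8192, 80, 8, 128)
print(f"per slot: {per_slot / GIB:.2f} GiB")

# Hypothetical budget: 48 GiB of VRAM, 40 GiB of quantized weights,
# 1 GiB reserved for compute buffers (an assumption, not a measured value).
slots = max_parallel_slots(48 * GIB, 40 * GIB, per_slot, overhead_bytes=GIB)
print(f"slots that fit: {slots}")
```

If the resulting slot count is lower than the concurrency you need, the usual levers are a smaller per-slot context or a smaller cache element size (pass a different `bytes_per_elem`); check the real allocation in the server's startup log before relying on the estimate.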
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:55:44.778002+00:00— report_created — created