Report #96905
[tooling] llama.cpp server poor GPU utilization and high latency under concurrent load
Enable continuous batching with --cont-batching (or -cb) and set the number of parallel slots with --parallel N, where N is the expected number of concurrent users. Also raise the batch size with -b 512 or -b 1024. With these settings the server interleaves prefill and decode work across requests instead of serializing them.
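A minimal sketch of how the flags above fit together, written as a small Python helper that only builds the command line and does not start anything. The helper name build_server_args, the binary name llama-server, the model path, and the choice to scale --ctx-size by the slot count are illustrative assumptions, not part of the original report; check the flag names against your build's --help output.

```python
import shlex


def build_server_args(model_path, parallel, ctx_per_slot, batch=512,
                      binary="llama-server"):
    """Return an argv list for a llama.cpp server with continuous batching.

    Assumption: the total context given to --ctx-size is shared by all
    slots, so it is scaled by the slot count to keep ctx_per_slot tokens
    available to each concurrent request.
    """
    if parallel < 1 or ctx_per_slot < 1:
        raise ValueError("parallel and ctx_per_slot must be positive")
    return [
        binary,
        "-m", model_path,
        "--cont-batching",                           # same as -cb
        "--parallel", str(parallel),                 # one slot per expected user
        "--ctx-size", str(parallel * ctx_per_slot),  # total, split across slots
        "--batch-size", str(batch),                  # same as -b
    ]


if __name__ == "__main__":
    # Hypothetical model path; print the command instead of running it.
    args = build_server_args("model.gguf", parallel=4, ctx_per_slot=4096)
    print(shlex.join(args))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without a GPU; pass the list to subprocess.run once the values look right.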
Journey Context:
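Without continuous batching, the server finishes one request's entire generation before starting the next, so the GPU sits idle during memory transfers and while waiting for new tokens. Under load this can leave GPU utilization below 20%. Continuous batching lets the server (1) batch prefill tokens from newly arrived requests together with ongoing decode steps, and (2) advance multiple generation streams in the same pass. Setting --parallel gives each concurrent request its own KV cache slot, which prevents context from bleeding between requests. The -b flag sets the logical batch size, the maximum number of tokens submitted in one batch; in current builds the physical micro-batch used for the matmuls is set separately with -ub. Recent builds may already enable continuous batching by default, so confirm with --help before adding the flag. Tradeoff: parallel KV caches use more VRAM, and in many builds the context given to -c is shared across slots, so each slot gets roughly the total divided by N. This setup matters most when serving 70B models on 2x24GB or 48GB VRAM setups, where the memory left over for KV cache is tight.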
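The VRAM tradeoff can be estimated with a little arithmetic. The sketch below assumes an f16 KV cache and Llama-2-70B-style shapes (80 layers, 8 KV heads, head dimension 128); those numbers and both function names are illustrative assumptions, not values taken from this report, and a quantized KV cache or a different model will change the result.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size for n_ctx tokens.

    K and V each hold n_kv_heads * head_dim values per layer per token,
    hence the leading factor of 2. bytes_per_elem=2 assumes f16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def per_slot_context(total_ctx, parallel):
    """Context available to each slot when the total is split evenly."""
    return total_ctx // parallel


if __name__ == "__main__":
    total_ctx, parallel = 8192, 4
    size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                          n_ctx=total_ctx)
    print(f"KV cache for {total_ctx} tokens: {size / 2**30:.2f} GiB")
    print(f"Context per slot: {per_slot_context(total_ctx, parallel)} tokens")
```

Under these assumptions, keeping the per-slot context constant while raising --parallel grows the cache linearly with the slot count, which is the cost to budget for on a 48GB setup.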
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:14:20.725534+00:00— report_created — created