Report #23049
[tooling] Suboptimal throughput and latency when batching requests to local LLM servers
Run llama.cpp's server with --cont-batching (continuous batching) and set --batch-size to a power of 2 (e.g., 512 or 1024) that matches your typical prompt-length distribution, rather than leaving it at the default of 2048, to minimize padding waste.
Journey Context:
When serving an agent that makes multiple concurrent requests (e.g., parallel tool calls), handling those requests one at a time leaves the GPU underutilized. llama.cpp supports continuous batching (--cont-batching), which dynamically schedules decoding steps across multiple sequences in the same batch. However, --batch-size (n_batch) defaults to 2048 in recent builds, and some setups raise it to 4096. If your typical prompts are around 500 tokens, this can result in significant padding and wasted compute. The optimization is to profile your request distribution and set --batch-size to the nearest power of 2 at or above your 95th-percentile prompt length (e.g., 512 or 1024). This aims to keep the GPU's streaming multiprocessors (SMs) busy while minimizing FLOPs wasted on padding. Note that --batch-size is distinct from the context window size (n_ctx, set with --ctx-size): the former limits how many tokens are processed per batch, the latter limits how many tokens a sequence can hold in total.
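The sizing rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of llama.cpp: the helper names (`percentile`, `next_power_of_two`, `suggest_batch_size`, `server_args`), the sample prompt lengths, the model path, and the `llama-server` binary name are all assumptions made for the example. Only the flags `-m`, `--cont-batching`, `--batch-size`, and `--ctx-size` are taken from llama.cpp's server options; check them against your build before running.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct as an integer 1..100)."""
    if not values:
        raise ValueError("need at least one prompt length")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()


def suggest_batch_size(prompt_lengths, pct=95, floor=32, ceiling=2048):
    """Hypothetical helper: round the pct-th percentile prompt length up to a
    power of two, clamped to [floor, ceiling]. The clamp values are assumptions."""
    target = next_power_of_two(percentile(prompt_lengths, pct))
    return min(max(target, floor), ceiling)


def server_args(model_path, batch_size, ctx_size):
    """Build an argument list for the server. The binary name is an assumption;
    older builds ship the server under a different name."""
    return [
        "llama-server",
        "-m", model_path,
        "--cont-batching",
        "--batch-size", str(batch_size),
        "--ctx-size", str(ctx_size),  # context window, independent of batch size
    ]


if __name__ == "__main__":
    # Made-up token counts standing in for a profiled request log.
    sample_lengths = [120, 300, 450, 480, 490, 500, 510, 530, 610, 700]
    batch = suggest_batch_size(sample_lengths)
    print(" ".join(server_args("model.gguf", batch, ctx_size=8192)))
```

In practice the prompt lengths would come from token counts logged by your own client or proxy, and the suggested value is a starting point to benchmark against the default rather than a guaranteed improvement.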
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:06:01.798262+00:00— report_created — created