Report #6357
[tooling] Slow throughput when processing multiple prompts sequentially with llama.cpp
Use parallel sequence processing: run `./main` with `-np N` (number of parallel sequences) and provide the prompts separated by delimiters, or use server mode with N parallel slots. The prompts are then decoded as one batch, so each pass over the model weights serves every sequence in the batch and memory bandwidth is used more efficiently, often giving 2-5x the throughput of sequential processing.
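A minimal client-side sketch of the server route, assuming a llama.cpp server is already running with N parallel slots (for example started with `-np 4`) and reachable at `http://127.0.0.1:8080`. The address, the helper names `complete` and `complete_all`, and the `n_predict` value are illustrative assumptions, not part of this report. Parallel slots only help when requests arrive concurrently, so the sketch keeps up to N requests in flight:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed address of a llama.cpp server started with N parallel slots.
SERVER_URL = "http://127.0.0.1:8080/completion"


def complete(prompt, n_predict=64, url=SERVER_URL):
    """Send one prompt to the server and return the generated text."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["content"]


def complete_all(prompts, n_slots=4, complete_fn=complete):
    """Keep up to n_slots requests in flight; results keep the input order."""
    with ThreadPoolExecutor(max_workers=n_slots) as pool:
        return list(pool.map(complete_fn, prompts))


# Usage (needs a running server, so it is left commented out):
# for text in complete_all(["Hello,", "The capital of France is"], n_slots=2):
#     print(text)
```

Matching `n_slots` to the server's `-np` value is a reasonable starting point: fewer workers leave slots idle, and extra workers just wait in the server's queue.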
Journey Context:
Users typically invoke llama.cpp once per prompt, or run the server with its default single-slot configuration, which re-reads the model weights for every prompt and leaves the GPU/CPU underutilized. The `-np` flag (`--parallel` in the server) enables true batching: multiple sequences share the same forward pass through the layers, so the memory bandwidth cost of reading the weights is amortized across the batch. The tradeoff is higher peak memory usage, because each sequence needs its own KV cache, but the throughput gains are usually substantial. This is often confused with simple multi-threading or async processing; it is specifically batching at the inference-engine level.
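The tradeoff can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: it assumes an f16 KV cache, assumes each decode step is limited purely by one read of the weights, and uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) and hypothetical hardware figures. Real gains are smaller than this ideal bound, which is consistent with the 2-5x range reported above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2):
    """Bytes for the keys and values of one sequence with n_ctx tokens of context."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


def decode_tokens_per_second(weight_bytes, bandwidth_bytes_per_s, batch_size):
    """Ideal upper bound when each decode step costs one full read of the weights."""
    steps_per_second = bandwidth_bytes_per_s / weight_bytes
    return steps_per_second * batch_size


# Hypothetical model shape and hardware, chosen only for illustration.
per_sequence = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=4096)
print(per_sequence / 1024**2)      # KV cache per sequence, in MiB
print(4 * per_sequence / 1024**3)  # KV cache for 4 parallel sequences, in GiB

# 4 GB of weights read over a 100 GB/s memory bus.
print(decode_tokens_per_second(4e9, 100e9, batch_size=1))
print(decode_tokens_per_second(4e9, 100e9, batch_size=4))
```

In this model the KV cache grows linearly with the number of parallel sequences while the weights are read once per step regardless of batch size, which is why batching trades memory for throughput.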
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:49:37.599803+00:00— report_created — created