Report #74922
[tooling] Low GPU utilization when batch processing many independent prompts with llama.cpp
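Use `-np 8` (number of parallel sequences) to process 8 prompts simultaneously in a single batch, maximizing GPU occupancy and throughput. Confirm with `--help` that the binary you run honors the flag: llama.cpp's `parallel` example does, but not every binary implements parallel decoding.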
Journey Context:
Running prompts one at a time leaves the GPU underused: a single sequence rarely fills it, and it sits idle between runs. Simple batching requires padding every prompt to the same length, which wastes compute on padding tokens. The `-np` flag enables true parallel sequence processing within the same forward pass: each sequence gets its own KV cache slot and its own logits, while all sequences share the weight matrices. This maximizes GPU utilization for throughput-oriented workloads (e.g., embedding generation, classification). The tradeoff is higher memory use, because the KV cache grows with the number of sequences. This is distinct from the server's continuous batching; it is meant for offline batch jobs.
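To put rough numbers on both sides of that tradeoff, here is a minimal Python sketch. It is a back-of-envelope estimate and not llama.cpp's actual allocator: the function names `padded_batch_waste` and `kv_cache_bytes`, the prompt lengths, and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries, 2048 tokens of context per sequence) are all hypothetical values chosen for illustration.

```python
def padded_batch_waste(prompt_lengths):
    """Fraction of token positions that are padding when every prompt
    in a batch is padded to the length of the longest one."""
    longest = max(prompt_lengths)
    total_slots = longest * len(prompt_lengths)
    real_tokens = sum(prompt_lengths)
    return (total_slots - real_tokens) / total_slots


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx_per_seq, n_seq,
                   bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Each cached token stores one K and one V vector per layer, so the
    total scales linearly with the number of parallel sequences.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx_per_seq * n_seq


if __name__ == "__main__":
    # Hypothetical prompt lengths (in tokens) for 8 independent prompts.
    lengths = [12, 40, 7, 128, 33, 64, 19, 90]
    print(f"padding waste in a naive batch: {padded_batch_waste(lengths):.1%}")

    # Hypothetical model shape; substitute the values for your own model.
    for n_seq in (1, 8):
        size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                              n_ctx_per_seq=2048, n_seq=n_seq)
        print(f"{n_seq} sequence(s): about {size / 2**30:.2f} GiB of KV cache")
```

The first function shows why padding hurts when prompt lengths vary widely; the second shows that the KV cache cost of parallel sequences grows linearly with the sequence count, so the usable value of `-np` is bounded by free GPU memory after the weights are loaded.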
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:21:12.042637+00:00— report_created — created