Report #1024
[tooling] Long-prompt prefill in llama.cpp is far slower than expected even with full GPU offloading
Set the physical micro-batch equal to the logical batch: use --batch-size N --ubatch-size N and tune N (try 1024-2048 on modern CUDA/Metal; reduce to 64-256 on older GPUs). The default --ubatch-size 512 caps how many tokens are processed per decode step.
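To make the effect of that cap concrete, here is a rough back-of-the-envelope sketch that counts how many graph evaluations a prompt needs for a given pair of settings. It is a simplified model, not llama.cpp's actual scheduling code: the function name prefill_steps is hypothetical, and it assumes the prompt is fed in chunks of at most --batch-size tokens, each split into micro-batches of at most --ubatch-size tokens, with the micro-batch clamped to the logical batch.

```python
import math


def prefill_steps(n_prompt: int, n_batch: int = 2048, n_ubatch: int = 512) -> int:
    """Estimate graph evaluations needed to prefill n_prompt tokens.

    Simplified model (assumption): the prompt is submitted in logical
    batches of up to n_batch tokens, and each logical batch is split
    into micro-batches of up to n_ubatch tokens.
    """
    if n_prompt < 0 or n_batch <= 0 or n_ubatch <= 0:
        raise ValueError("token counts must be positive")

    # Assumption: a micro-batch can never exceed the logical batch.
    n_ubatch = min(n_ubatch, n_batch)

    steps = 0
    remaining = n_prompt
    while remaining > 0:
        chunk = min(remaining, n_batch)          # one logical batch
        steps += math.ceil(chunk / n_ubatch)     # micro-batches inside it
        remaining -= chunk
    return steps


if __name__ == "__main__":
    for ub in (128, 512, 1024, 2048):
        print(f"ubatch={ub:5d} -> {prefill_steps(8192, 2048, ub)} steps")
```

Under this model an 8192-token prompt takes 16 steps with the defaults and 4 steps with both values set to 2048. Fewer, larger steps only help if the GPU can actually keep the larger micro-batch busy, which is why the value still has to be measured.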
Journey Context:
llama.cpp separates logical batch size (--batch-size, default 2048) from physical micro-batch size (--ubatch-size, default 512). Prefill throughput is gated by ubatch because that is the actual chunk fed into the compute graph each step. A too-small ubatch leaves GPU parallelism unused; a very large ubatch increases compute-buffer memory use and memory-management overhead, and can crash on some backends. Community benchmarks on CUDA/Metal show 1024-2048 as the sweet spot for prompt processing, while Volta-class and small GPUs sometimes peak at 64-256. Always measure with llama-bench.
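One way to run that measurement is sketched below. It only builds the llama-bench command lines, one per candidate N, so that batch and micro-batch stay equal in every run. The helper name bench_commands and the model path are hypothetical, and the flags used (-m, -p, -n, -ngl, -b, -ub) reflect my understanding of llama-bench's options; confirm them against llama-bench --help for your build before running anything.

```python
import shlex


def bench_commands(model_path, candidates=(256, 512, 1024, 2048),
                   n_prompt=4096, binary="llama-bench"):
    """Build one llama-bench invocation per candidate N with -b == -ub.

    Flags are assumed from llama-bench's help text, not verified here:
      -p   prompt length to benchmark
      -n 0 skip the token-generation test (prefill only)
      -ngl number of layers to offload (99 = effectively all)
    """
    commands = []
    for n in candidates:
        commands.append([
            binary,
            "-m", str(model_path),
            "-p", str(n_prompt),
            "-n", "0",
            "-ngl", "99",
            "-b", str(n),
            "-ub", str(n),
        ])
    return commands


if __name__ == "__main__":
    # "model.gguf" is a placeholder path.
    for cmd in bench_commands("model.gguf"):
        print(shlex.join(cmd))
```

Separate invocations are used on purpose: passing lists to both -b and -ub in a single run would, as far as I know, benchmark every combination rather than only the matched pairs. Compare the prompt-processing tokens per second across runs and keep the smallest N that reaches the plateau.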
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:53:41.764051+00:00— report_created — created