Report #77623
[tooling] Prompt processing \(prefill\) is CPU-bound despite GPU layers being set high
Increase \`n\_batch\` from 512 \(the default in older llama.cpp builds\) to 1024 or 2048. In newer builds that also expose \`n\_ubatch\`, set \`n\_ubatch\` to 512 or higher as well, since it determines how many tokens each GPU pass handles during prompt ingestion.
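As a rough sketch of how these settings could be passed on the command line: the helper below only assembles an argument list and does not run anything. The binary name \`llama-cli\`, the file names, and the helper \`build\_llama\_args\` are illustrative assumptions. \`-b\`, \`-ub\`, and \`-ngl\` are the short forms of \`--batch-size\`, \`--ubatch-size\`, and \`--n-gpu-layers\`; check \`--help\` on your build, since older builds have no \`-ub\`.

```python
def build_llama_args(model_path, prompt_file, n_gpu_layers=99,
                     n_batch=2048, n_ubatch=512):
    """Assemble an argv list for a llama.cpp CLI run with explicit batch sizes.

    Hypothetical helper for illustration; it does not launch the process.
    """
    # Assumption: a micro-batch larger than the logical batch is not useful,
    # so clamp it here rather than pass an inconsistent pair.
    n_ubatch = min(n_ubatch, n_batch)
    return [
        "llama-cli",                 # assumed binary name (older builds used a different one)
        "-m", model_path,            # model file
        "-f", prompt_file,           # read the prompt from a file
        "-ngl", str(n_gpu_layers),   # layers to offload to the GPU
        "-b", str(n_batch),          # logical batch size
        "-ub", str(n_ubatch),        # physical micro-batch size
    ]


args = build_llama_args("model.gguf", "prompt.txt")
print(" ".join(args))
```

Benchmark prompt-processing speed before and after the change on your own hardware, since larger batches also use more GPU memory.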
Journey Context:
With \`n\_batch=512\`, llama.cpp splits a long prompt into 512-token chunks and processes them sequentially. Each chunk boundary adds a CPU-GPU synchronization point \(cudaDeviceSynchronize\), during which the GPU sits idle. Raising \`n\_batch\` to match your typical prompt length \(e.g., 2048\) lets the whole prompt go through in one forward pass instead of several, which keeps the GPU's compute units busy. Newer versions split the setting into \`n\_batch\` \(logical batch\) and \`n\_ubatch\` \(physical micro-batch\); \`n\_ubatch\` determines how many tokens each GPU pass actually handles, so on those builds it is the value that sets the kernel size. Setting \`n\_ubatch=512\` \(or 1024\) keeps the kernels large enough to amortize launch overhead, while \`n\_batch\` can be larger for prompt packing.
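The chunking arithmetic can be sketched as follows. This is a simplified model, not llama.cpp's actual scheduler: it assumes the prompt is cut into logical batches of at most \`n\_batch\` tokens, and each logical batch into micro-batches of at most \`n\_ubatch\` tokens. The function name \`prefill\_passes\` is made up for illustration.

```python
import math


def prefill_passes(prompt_tokens, n_batch, n_ubatch=None):
    """Return (logical_batches, physical_passes) for one prompt.

    Simplified model: builds without a separate n_ubatch are treated as
    n_ubatch == n_batch.
    """
    if n_ubatch is None:
        n_ubatch = n_batch
    n_ubatch = min(n_ubatch, n_batch)  # micro-batch cannot exceed the logical batch

    logical = math.ceil(prompt_tokens / n_batch)
    physical = 0
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, n_batch)            # tokens in this logical batch
        physical += math.ceil(chunk / n_ubatch)    # GPU passes needed for it
        remaining -= chunk
    return logical, physical


# A 2048-token prompt under three configurations.
print(prefill_passes(2048, 512))         # four logical batches, four passes
print(prefill_passes(2048, 2048, 512))   # one logical batch, still four passes
print(prefill_passes(2048, 2048, 2048))  # one logical batch, one pass
```

Under this model, raising \`n\_batch\` alone does not reduce the number of GPU passes on builds that have \`n\_ubatch\`; the micro-batch size has to grow as well.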
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:53:38.351441+00:00— report_created — created