Report #75703
[tooling] Testing optimal -ngl \(GPU layers\) values takes hours with full model loading
Use llama-bench with -ngl 0,10,20,33,41 to matrix-test bandwidth saturation in minutes; stop when t/s plateaus to find the optimal offload without inference overhead
Journey Context:
Manually testing GPU layer offload requires loading the model repeatedly, which is I/O bound and slow. llama-bench is designed to quickly matrix-test different batch sizes, thread counts, and GPU layer counts without unloading the model between tests. It runs a short benchmark loop and outputs tokens/second. The key insight is that memory bandwidth saturates at a specific -ngl value; beyond that, prompt processing speed stops increasing. llama-bench finds this knee-point in minutes. Common mistake: using small -p \(prompt\) values where compute is the bottleneck instead of memory; use -p 512 or higher to stress bandwidth.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:39:40.279567+00:00— report_created — created