Report #22681
[tooling] llama.cpp inference slower than expected on multi-core CPU or hybrid architecture (P-cores vs E-cores)
Use the llama-bench example binary to empirically determine optimal -t (thread count) and -b (batch size) values for your specific CPU topology, rather than defaulting to physical core count or guessing.
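A minimal sketch of how such a sweep could be scripted, assuming a llama-bench build that accepts comma-separated value lists for -t and -b and can emit JSON via -o json. The binary path, model path, and helper names (build_bench_command, run_bench) are hypothetical and only for illustration; check `llama-bench --help` on your build before relying on any flag.

```python
import json
import shlex
import subprocess


def build_bench_command(binary, model_path, thread_counts, batch_sizes):
    """Build a llama-bench command line that sweeps thread counts and batch sizes.

    Assumes -t and -b accept comma-separated lists and -o json selects JSON output.
    """
    return [
        binary,
        "-m", model_path,
        "-t", ",".join(str(t) for t in thread_counts),
        "-b", ",".join(str(b) for b in batch_sizes),
        "-o", "json",
    ]


def run_bench(cmd):
    """Run the sweep and return the parsed JSON printed on stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


# Dry run: print the command instead of executing it.
# "./llama-bench" and "model.gguf" are placeholder paths.
cmd = build_bench_command("./llama-bench", "model.gguf", [4, 8, 16, 24], [256, 512])
print(shlex.join(cmd))
```

Printing the command first makes it easy to review before running it; call run_bench(cmd) only once the paths point at a real binary and model.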
Journey Context:
llama.cpp defaults to n_threads = n_physical_cores, but this is often suboptimal on modern CPUs with SMT (Hyper-Threading) or heterogeneous cores (Intel P-cores/E-cores, ARM big.LITTLE). Memory bandwidth often saturates before compute does, so fewer threads can actually yield higher tokens/sec. llama-bench tests permutations of thread counts and batch sizes, revealing the 'knee' beyond which adding threads no longer improves throughput. For example, on a 13900K (8 P-cores, 16 E-cores, 32 hardware threads), running 8 threads on the P-cores and ignoring the E-cores often beats using all 24 cores, due to memory controller contention. Running the sweep once saves hours of manual trial-and-error.
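One way to locate the 'knee' from sweep results is to take the smallest thread count whose throughput is within a small tolerance of the best observed value. The sketch below assumes each result row is a dict with n_threads and avg_ts (average tokens/sec) keys, which is how llama-bench JSON output is commonly shaped; verify the field names against your build. The pick_knee helper and the sample numbers are hypothetical placeholders, not measurements.

```python
def pick_knee(rows, tolerance=0.03):
    """Return the row with the fewest threads whose throughput is within
    `tolerance` (as a fraction) of the best throughput in `rows`."""
    if not rows:
        raise ValueError("no benchmark rows")
    best = max(r["avg_ts"] for r in rows)
    near_best = [r for r in rows if r["avg_ts"] >= best * (1 - tolerance)]
    return min(near_best, key=lambda r: r["n_threads"])


# Placeholder values for illustration only, not real benchmark results.
sample = [
    {"n_threads": 4, "avg_ts": 9.0},
    {"n_threads": 8, "avg_ts": 12.0},
    {"n_threads": 16, "avg_ts": 11.8},
    {"n_threads": 24, "avg_ts": 10.5},
]

knee = pick_knee(sample)
print(knee["n_threads"])  # prints 8
```

Preferring the smallest near-best thread count leaves the remaining cores free for other work and avoids chasing differences that are within run-to-run noise; the 3% tolerance is an arbitrary starting point.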
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:28:57.383365+00:00— report_created — created