Report #1157
[tooling] llama.cpp CPU inference has high variance or poor scaling on multi-socket/NUMA servers
Pin threads with `-C` (CPU affinity mask) or `--cpu-range lo-hi`, set `-t` to the number of physical cores on one NUMA node, and on Linux wrap the command with `numactl --cpunodebind=0 --membind=0`. If your build supports it, also pass `--numa isolate`, or `--numa numactl` when launching under numactl.
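A minimal sketch of assembling that command line, assuming the binary is `llama-cli`, the model file is `model.gguf` (a placeholder), and logical CPUs 0-15 are the physical cores of NUMA node 0. The helper name `build_pinned_command` is hypothetical, and CPU numbering varies by machine, so check it with `lscpu` or `numactl --hardware` first.

```python
import shlex

def build_pinned_command(binary, model, node, cpu_lo, cpu_hi, use_numactl=True):
    """Return an argv list that keeps llama.cpp threads and memory on one NUMA node.

    Assumes cpu_lo..cpu_hi are all physical cores that belong to `node`.
    """
    threads = cpu_hi - cpu_lo + 1  # one thread per pinned CPU
    argv = []
    if use_numactl:
        # Bind both CPU placement and memory allocation to the same node.
        argv += ["numactl", f"--cpunodebind={node}", f"--membind={node}"]
    argv += [binary, "-m", model, "-t", str(threads),
             "--cpu-range", f"{cpu_lo}-{cpu_hi}"]
    if use_numactl:
        # Only meaningful if the build supports --numa.
        argv += ["--numa", "numactl"]
    return argv

cmd = build_pinned_command("llama-cli", "model.gguf", node=0, cpu_lo=0, cpu_hi=15)
print(shlex.join(cmd))
# numactl --cpunodebind=0 --membind=0 llama-cli -m model.gguf -t 16 --cpu-range 0-15 --numa numactl
```

Building the argv as a list keeps quoting out of the picture; pass it to `subprocess.run(cmd)` to launch, or drop the numactl parts with `use_numactl=False` on systems without it.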
Journey Context:
By default the OS scheduler is free to migrate llama.cpp threads between cores, which causes cache misses and cross-NUMA remote memory accesses that can be 2–3× slower than local ones. Explicit CPU affinity makes performance predictable and often increases throughput on server hardware. With the default `-t`, threads may also land on efficiency cores or span sockets. The cost is reduced flexibility: if the model does not fit in one socket's RAM, pinning to that socket hurts unless the memory the threads touch is also kept local. On heterogeneous CPUs (P/E cores), pin to the performance cores only.
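To choose a value for `-t` and the CPUs to pin, one option is to read the Linux sysfs topology and keep one logical CPU per physical core on the target node. The sketch below assumes the standard `/sys/devices/system/node/nodeN/cpulist` and `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` files are present; the function names are hypothetical, and it does not tell P cores from E cores.

```python
from pathlib import Path

def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def read_sysfs(path):
    """Default reader; swap in a fake for testing or non-Linux hosts."""
    return Path(path).read_text()

def physical_cores_on_node(node, read=read_sysfs):
    """Return one logical CPU per physical core on the given NUMA node."""
    node_cpus = parse_cpulist(read(f"/sys/devices/system/node/node{node}/cpulist"))
    chosen, seen = [], set()
    for cpu in node_cpus:
        if cpu in seen:
            continue  # hyperthread sibling of a core we already picked
        siblings = parse_cpulist(
            read(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
        )
        seen.update(siblings)
        chosen.append(cpu)
    return chosen
```

On a Linux host, `len(physical_cores_on_node(0))` gives a candidate value for `-t`. If the returned CPUs are not contiguous, a single `--cpu-range lo-hi` will not describe them, and an affinity mask is the better fit.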
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:54:09.576826+00:00— report_created — created