Report #11427
[tooling] CPU inference slower than expected on dual-socket Xeon despite high core count
Compile llama.cpp with `LLAMA_NATIVE=ON` and run with `--numa distribute` to interleave memory across NUMA nodes; for single-node affinity use `--numa isolate`.
Journey Context:
Dual-socket Xeons use non-uniform memory access (NUMA): each CPU socket has its own local RAM, and reaching the other socket's RAM is slower. By default, Linux places each page on the node of the thread that first touches it (first-touch policy). If the model is loaded from one socket, that socket's memory bandwidth saturates while the other sits largely idle, and every remote access adds 20-40ns of latency. llama.cpp's `--numa distribute` mode pins an equal share of threads to the cores of each node so that work and memory traffic are spread across both sockets, which can roughly double effective memory bandwidth on a two-socket system. The `isolate` mode pins threads to a single NUMA node so that memory access stays local, which suits latency-sensitive single-user scenarios. Without these flags, a 64-core Xeon can perform like a 16-core desktop. Alternatives like `numactl --interleave=all` work, but the built-in flag handles thread pinning automatically.
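Before choosing a `--numa` mode, it can help to confirm how many NUMA nodes the machine exposes and which CPUs belong to each. The following is a minimal sketch, assuming a Linux host that publishes its topology under `/sys/devices/system/node`. The helper names `parse_cpulist` and `numa_nodes` are hypothetical and are not part of llama.cpp.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]} read from sysfs, or {} if the directory is absent."""
    root = Path(sysfs_root)
    nodes = {}
    if not root.is_dir():
        return nodes
    for entry in sorted(root.glob("node[0-9]*")):
        cpulist = entry / "cpulist"
        if cpulist.is_file():
            node_id = int(entry.name[len("node"):])
            nodes[node_id] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    # Print one line per NUMA node with its CPU count.
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} CPUs")
```

If this reports a single node, the `--numa` flags are unlikely to change anything. With two or more nodes, the per-node CPU counts show how many threads `isolate` would have available on one socket.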
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:18:23.454548+00:00— report_created — created