Report #10150
[tooling] llama.cpp CPU inference slow on dual-socket Xeon/EPYC despite high core count
Enable NUMA-aware thread pinning: build with NUMA support (reported here as `LLAMA_NUMA=ON`; verify the option name against your llama.cpp version) and run with `--numa distribute` to spread threads evenly across sockets and pin them there, reducing cross-socket bandwidth penalties.
Journey Context:
Dual-socket Intel Xeon and AMD EPYC servers have a separate memory controller per CPU, and bandwidth between sockets is limited. By default, the Linux scheduler is free to move threads between sockets, so threads frequently read memory attached to the other socket (remote NUMA accesses), which hurts bandwidth-bound inference. The `--numa distribute` flag in llama.cpp spreads threads evenly across NUMA nodes and pins each thread to the CPUs of one node, so a thread keeps working from the same socket and the weights it touches are more likely to stay in that socket's RAM. This often yields a 2-3x speedup on dual-socket systems compared to default scheduling. Alternatives like `numactl --interleave=all` are suboptimal because they ignore locality; `distribute` is aimed at inference workloads where model weights should stay local to the compute cores. If the model was previously run without `--numa`, drop the system page cache first so the weights are not left where the earlier run placed them. Note that this requires a build with NUMA support enabled.
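Before applying the workaround, it can help to confirm that the machine actually exposes more than one NUMA node. The following is a minimal Python sketch, assuming a Linux host that publishes its topology under `/sys/devices/system/node`. The helper names (`parse_cpulist`, `read_numa_nodes`, `assign_threads`) are hypothetical and not part of llama.cpp, and the round-robin assignment only illustrates the idea of spreading threads evenly across nodes; it is not llama.cpp's implementation.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from sysfs, or {} if no node entries exist."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        cpulist = os.path.join(path, "cpulist")
        if match and os.path.isfile(cpulist):
            with open(cpulist) as f:
                nodes[int(match.group(1))] = parse_cpulist(f.read())
    return nodes


def assign_threads(n_threads, nodes):
    """Illustrative round-robin: thread i is assigned to node i % node_count."""
    node_ids = sorted(nodes)
    if not node_ids:
        raise ValueError("no NUMA nodes found")
    return {i: node_ids[i % len(node_ids)] for i in range(n_threads)}


if __name__ == "__main__":
    nodes = read_numa_nodes()
    for node_id in sorted(nodes):
        print(f"node {node_id}: {len(nodes[node_id])} CPUs")
    if len(nodes) < 2:
        print("fewer than two NUMA nodes visible; NUMA pinning is unlikely to help")
```

If the script reports a single node, the slowdown probably has another cause and `--numa distribute` is unlikely to change much.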
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:54:13.257352+00:00— report_created — created