Report #88482
[tooling] Abnormally slow inference on dual-socket AMD EPYC or Intel Xeon servers
Bind the process to a single NUMA node using `numactl --cpunodebind=0 --membind=0 ./llama-server ...` and verify node topology with `numactl --hardware` before running.
Journey Context:
llama.cpp inference is memory-bandwidth bound, not compute bound. On multi-socket (2P/4P) NUMA systems (AMD EPYC, Intel Xeon Scalable), default OS scheduling spreads threads across sockets. Memory requests then traverse the inter-socket interconnect (Infinity Fabric on EPYC, UPI on Xeon), which adds substantial latency and keeps any one socket from saturating its local memory bandwidth. The result is 2-4x slower performance than running on a single socket. `numactl --cpunodebind=0 --membind=0` forces execution on one NUMA node using only that node's local memory, maximizing local bandwidth and avoiding cross-socket penalties. This is essential for server deployments and benchmarking; without it, performance numbers are not comparable between runs or machines.
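The check-then-bind steps described above could be scripted roughly as follows. This is a minimal sketch, not part of the original report: the function names `numa_node_count` and `run_bound` are hypothetical, and it assumes `numactl --hardware` prints its usual `available: N nodes (...)` first line. If `numactl` is missing or reports a single node, the command runs unbound.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of llama.cpp or numactl.

# Read `numactl --hardware` output on stdin and print the node count.
# Prints nothing if no "available:" line is present.
numa_node_count() {
    awk '/^available:/ { print $2; exit }'
}

# Run a command pinned to one NUMA node (CPU and memory) when the machine
# has more than one node; otherwise run it unchanged.
# Usage: run_bound NODE COMMAND [ARGS...]
run_bound() {
    node="$1"
    shift
    nodes=$(numactl --hardware 2>/dev/null | numa_node_count)
    if [ "${nodes:-1}" -gt 1 ]; then
        numactl --cpunodebind="$node" --membind="$node" "$@"
    else
        "$@"
    fi
}
```

With these defined, `run_bound 0 ./llama-server ...` would be equivalent to the command in the fix on a multi-node host, with the server arguments supplied as usual.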
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:05:56.487454+00:00— report_created — created