Report #83250
[tooling] 70B model inference on dual-socket Xeon/EPYC slower than single-socket due to NUMA remote memory access
Compile llama.cpp with `GGML_NUMA=1` (CMake), if your llama.cpp version exposes that option, and run with `--numa distribute` to spread threads across the NUMA nodes so that they work from local memory. Alternatively, use `numactl --cpunodebind=0 --membind=0` to force single-socket execution if the model fits in one socket's RAM.
Journey Context:
The default OS scheduler spreads threads across both sockets. Because the OS decides where the pages of the GGUF model are placed, roughly half of the memory accesses traverse the inter-socket link (UPI on Xeon, Infinity Fabric on EPYC), which cuts effective memory bandwidth, the actual bottleneck for LLM inference. Users buy expensive dual-socket servers and get about 40% of expected performance. The `distribute` mode spreads execution evenly over all NUMA nodes so that threads work mostly from local RAM, maximizing aggregate bandwidth. The alternative `isolate` mode keeps threads on the node where execution started, which suits a model that fits in one socket's RAM. If the model was previously run without `--numa`, drop the system page cache first, since already-cached pages keep their old placement. Per this report, the fix requires a NUMA-aware build plus the runtime flags.
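The choice between the two approaches above comes down to whether the model fits in one node's free memory. Below is a minimal Python sketch of that decision, not part of llama.cpp. The helper names (`node_free_bytes`, `build_command`), the `headroom` factor of 0.9, and the `llama-cli` binary name are assumptions for illustration. The meminfo text is expected in the format of `/sys/devices/system/node/node0/meminfo` on Linux.

```python
import re


def node_free_bytes(meminfo_text: str) -> int:
    """Return free bytes from the text of a per-node meminfo file.

    Expects a line such as 'Node 0 MemFree:  100000000 kB'.
    """
    match = re.search(r"MemFree:\s+(\d+)\s+kB", meminfo_text)
    if match is None:
        raise ValueError("no MemFree line found")
    return int(match.group(1)) * 1024


def build_command(model_path: str, model_bytes: int,
                  node0_free_bytes: int, headroom: float = 0.9) -> list:
    """Pick a launch command (hypothetical helper, not a llama.cpp API).

    headroom is an assumed safety margin: the model must fit in that
    fraction of node 0's free memory to be pinned to a single socket.
    """
    base = ["llama-cli", "-m", model_path]  # binary name is an assumption
    if model_bytes <= node0_free_bytes * headroom:
        # Model fits on one socket: bind CPUs and memory to node 0.
        return ["numactl", "--cpunodebind=0", "--membind=0"] + base
    # Model needs both sockets: let llama.cpp spread execution over nodes.
    return base + ["--numa", "distribute"]
```

On a real machine, the inputs could come from `os.path.getsize(model_path)` and the contents of the node's meminfo file. The file size is only a rough lower bound on memory use, since the KV cache and buffers add to it.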
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:19:25.369318+00:00— report_created — created