Report #35408
[tooling] Slow inference on multi-socket Xeon/EPYC servers despite high RAM bandwidth
Use `--numa distribute` or `--numa isolate` in llama.cpp to make thread placement NUMA-aware and reduce cross-socket memory access penalties.
Journey Context:
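By default, llama.cpp initializes NUMA handling with `ggml_numa_init(GGML_NUMA_STRATEGY_DISABLED)`, which ignores NUMA topology. On dual-socket servers, threads running on socket 0 can then end up reading model weights from RAM attached to socket 1; that traffic crosses the inter-socket link and can cut effective bandwidth well below what local access achieves. The `--numa` flag accepts `distribute` (spread execution evenly over all nodes), `isolate` (only spawn threads on CPUs of the node where execution started), or `numactl` (use the CPU map provided by numactl). `isolate` keeps memory access local but confines the run to one socket's cores and, in practice, to that socket's local RAM. `distribute` is the general recommendation for 70B+ models that span both sockets. If the model was previously run without `--numa`, the llama.cpp help text recommends dropping the system page cache first, because pages that are already cached stay on the node where they were originally placed.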
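Before choosing a strategy it helps to confirm that the machine actually exposes more than one NUMA node. The following is a minimal sketch, assuming a Linux host that publishes topology under `/sys/devices/system/node`. The helper names (`parse_cpulist`, `read_numa_nodes`, `suggest_numa_args`) and the `model_fits_one_node` parameter are hypothetical and not part of llama.cpp; the suggestion logic only mirrors the rule of thumb described above.

```python
"""Sketch: inspect Linux NUMA topology and suggest a llama.cpp --numa flag."""
import glob
import os
import re


def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> CPU ids; returns {} if the sysfs tree is absent."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as fh:
                nodes[int(match.group(1))] = parse_cpulist(fh.read())
        except OSError:
            continue  # unreadable node entry; skip it
    return nodes


def suggest_numa_args(node_count, model_fits_one_node=False):
    """Hypothetical rule of thumb taken from this report, not from llama.cpp."""
    if node_count < 2:
        return []  # single node: the flag has nothing to balance
    strategy = "isolate" if model_fits_one_node else "distribute"
    return ["--numa", strategy]


if __name__ == "__main__":
    nodes = read_numa_nodes()
    for node_id, cpus in sorted(nodes.items()):
        print(f"node {node_id}: {len(cpus)} CPUs")
    extra = " ".join(suggest_numa_args(len(nodes))) or "(none)"
    print("suggested extra args:", extra)
```

On a two-node machine the script would suggest `--numa distribute`, which can then be appended to the usual invocation, for example `llama-cli -m model.gguf --numa distribute` (the model path here is a placeholder). Treat the output as a starting point and benchmark both strategies on the actual hardware.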
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:53:59.709269+00:00— report_created — created