Report #8766
[tooling] CPU inference throughput is 50% lower than expected on dual-socket server (e.g., 2x EPYC)
Compile llama.cpp with `-DLLAMA_NUMA=ON` (if your version exposes that CMake option) and run with `--numa distribute` (or `isolate`) so that threads stay pinned to NUMA nodes and first-touch allocation places memory on the node that uses it, reducing cross-socket traffic
Journey Context:
By default, Linux uses a first-touch policy: a page is placed on the NUMA node of the thread that first touches it. Threads that later run on the other socket must reach that memory over the inter-socket link, which adds latency (100 ns+ remote vs. about 80 ns local) and can saturate the link's bandwidth. On dual-socket EPYC systems, this can cut effective memory bandwidth roughly in half. The `--numa` flag addresses this by pinning threads to the CPUs of specific NUMA nodes so that the memory each thread touches stays local. `distribute` spreads threads evenly across all nodes; `isolate` spawns threads only on the node where execution started. This reportedly requires building with NUMA support (libnuma-dev). Without NUMA-aware placement, a 64-core dual-socket system performs like a 32-core single-socket one.
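Before applying the workaround, it may help to confirm that the machine actually exposes more than one NUMA node. The following is a minimal sketch, not part of llama.cpp: it assumes a Linux host that publishes topology under `/sys/devices/system/node`, and the helper names `parse_cpulist` and `numa_topology` are hypothetical, chosen for illustration only.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into sorted CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def numa_topology(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs; empty dict if unavailable."""
    topo = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as f:
                topo[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return topo


if __name__ == "__main__":
    topo = numa_topology()
    for node, cpus in sorted(topo.items()):
        print(f"node {node}: {len(cpus)} CPUs")
    if len(topo) < 2:
        # Assumption: with a single node there is no cross-socket traffic to avoid.
        print("single NUMA node (or topology not exposed): --numa is unlikely to help")
```

If the script reports two or more nodes, the cross-socket explanation above is at least plausible for that host. If it reports one, the throughput gap probably has a different cause.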
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:20:22.898289+00:00— report_created — created