Report #73684
[tooling] llama.cpp terrible CPU performance on dual-socket AMD EPYC/Intel Xeon servers despite high core count
Compile with `LLAMA_NUMA=1` and run with `numactl --cpunodebind=0 --membind=0 ./main ...` to pin the process to a single NUMA node. This prevents cross-socket memory access penalties and, per this report, improves throughput by 3-5x.
Journey Context:
When deploying 70B models on server hardware (dual-socket EPYC or Xeon), users observe 50-70% CPU utilization and very low tokens/sec despite high core counts and hundreds of GB of RAM. This occurs because llama.cpp is NUMA-naive by default and allocates memory across all nodes. Accessing 'foreign' memory over the inter-socket link (Infinity Fabric on EPYC, UPI on Xeon) incurs large latency penalties. The `LLAMA_NUMA=1` compile flag enables NUMA-aware allocation (first-touch policy), but this is often insufficient on dual-socket systems because the OS may migrate threads away from the node that holds their memory. The correct approach is strict binding with `numactl --cpunodebind=0 --membind=0`, which forces all memory allocation and CPU execution onto one socket and eliminates cross-NUMA traffic entirely. Because `--membind` is strict, the model must fit in that node's local memory. This can improve throughput by 3-5x on dual-socket servers. Most documentation ignores this because it assumes consumer single-socket hardware.
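The binding described above can be sketched as a small bash helper that assembles the `numactl` command line for a chosen node. This is a minimal sketch, not a verified recipe: the function name `build_bound_cmd`, the model path `model.gguf`, and the thread count `32` are hypothetical placeholders. The helper only prints the command, so it can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Sketch: print a launch command bound to a single NUMA node.
# build_bound_cmd, model.gguf and the thread count are illustrative
# assumptions, not values taken from the report.

# Usage: build_bound_cmd <node> <program> [args...]
# Prints the numactl-prefixed command; does not execute it.
build_bound_cmd() {
  local node="$1"
  shift
  # Bind both CPU scheduling and memory allocation to the same node.
  printf 'numactl --cpunodebind=%s --membind=%s' "$node" "$node"
  # Append the program and its arguments, shell-quoted.
  printf ' %q' "$@"
  printf '\n'
}

# Inspect the topology first (requires numactl to be installed):
#   numactl --hardware     # nodes, their CPUs and free memory
#   lscpu | grep -i numa   # CPU-to-node mapping

# Example: node 0, with threads set to an assumed per-node core count.
build_bound_cmd 0 ./main -m model.gguf -t 32
```

Keeping the thread count at or below the number of physical cores on the bound node is likely sensible, since `--cpunodebind` confines all threads to that node's CPUs.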
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:16:29.159539+00:00— report_created — created