Report #95334
[tooling] Multi-socket Xeon/EPYC servers show 50% memory bandwidth utilization running llama.cpp
Compile with `GGML_NUMA=on` and run with `--numa distribute` (or `--numa isolate` to stay on a single socket) so that threads and the memory they touch stay on local NUMA nodes. Without this, cross-socket memory access sharply reduces effective bandwidth on dual-socket servers. `distribute` spreads execution evenly across nodes so that one socket's memory controllers do not saturate while the other's sit idle.
Journey Context:
Users deploy 70B models on dual-socket server CPUs (2x Xeon Platinum) expecting 2x bandwidth, but see single-socket performance. The cause is Linux's default first-touch allocation policy, which places each page on the node of the thread that first touches it. If the model is loaded from one socket, all pages land there, and the other socket must fetch them across the inter-socket link (UPI on Xeon, Infinity Fabric on EPYC) at much lower bandwidth, reported here as about 1/10th of local. According to this report, the `GGML_NUMA=on` compile flag enables NUMA awareness. `--numa distribute` spreads execution evenly over all nodes, with an effect similar to `numactl --interleave=all`: pages end up on both sockets, which maximizes aggregate bandwidth. `--numa isolate` restricts threads to the node where execution started, keeping memory accesses local when the model fits in a single socket's memory. Common mistakes: using `distribute` on a single-socket machine (harmless overhead) or forgetting the compile flag (the `--numa` option is then silently ignored).
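The choice between the two modes can be sketched as a small helper. This is a minimal illustration, not part of llama.cpp: it assumes Linux exposes the NUMA topology under `/sys/devices/system/node`, that the binary is named `llama-cli`, and that `-m` and `-t` select the model and thread count. The `choose_numa_mode` policy and the `model.gguf` path are hypothetical and simply mirror the advice above.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def read_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from Linux sysfs, or an empty dict if unavailable."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node_id = int(os.path.basename(path)[len("node"):])
        try:
            with open(os.path.join(path, "cpulist")) as f:
                nodes[node_id] = parse_cpulist(f.read())
        except OSError:
            continue
    return nodes


def choose_numa_mode(node_count, fits_on_one_node):
    """Hypothetical policy: no flag on one node, else isolate if the model fits locally."""
    if node_count < 2:
        return None
    return "isolate" if fits_on_one_node else "distribute"


def build_command(model_path, threads, numa_mode, binary="llama-cli"):
    """Assemble the argument list; the binary name and flags are assumptions."""
    cmd = [binary, "-m", model_path, "-t", str(threads)]
    if numa_mode is not None:
        cmd += ["--numa", numa_mode]
    return cmd


if __name__ == "__main__":
    nodes = read_numa_nodes()
    mode = choose_numa_mode(len(nodes), fits_on_one_node=False)
    # Counts logical CPUs across nodes; physical cores may be a better thread count.
    threads = sum(len(c) for c in nodes.values()) or os.cpu_count() or 1
    print(" ".join(build_command("model.gguf", threads, mode)))
```

Running the script prints a candidate command line rather than launching anything, so the chosen mode can be checked against `numactl --hardware` before use.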
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:35:37.625132+00:00— report_created — created