Report #737
[tooling] llama.cpp decodes slowly on a many-core dual-Xeon server despite high memory bandwidth
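Bind the process to a single NUMA node so that threads and model weights share local memory: numactl --cpunodebind=0 --membind=0 ./llama-server ... --numa isolate. The model must fit in that node's memory, because --membind does not allow allocations to fall back to another node. On newer Intel Xeon platforms, disable Sub-NUMA Clustering (SNC) in the BIOS so each socket appears as one node. Do not expect linear scaling from extra sockets.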
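Before binding, it can help to confirm how many NUMA nodes the kernel actually exposes (for example, to see whether SNC is still splitting each socket). The following is a minimal sketch, assuming a Linux host with the standard sysfs layout under /sys/devices/system/node. The helper names list_numa_nodes and numa_bound_command are hypothetical, and the server arguments are placeholders rather than a recommended configuration.

```python
import re
import shlex
from pathlib import Path


def list_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: cpulist string} read from Linux sysfs.

    Returns an empty dict if the directory does not exist (e.g. non-Linux).
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        if match and cpulist.is_file():
            nodes[int(match.group(1))] = cpulist.read_text().strip()
    return dict(sorted(nodes.items()))


def numa_bound_command(server_args, node=0):
    """Prefix a command with numactl so CPU and memory stay on one node."""
    prefix = ["numactl", f"--cpunodebind={node}", f"--membind={node}"]
    return prefix + list(server_args)


# Show what the kernel reports, then print the bound command line.
for node_id, cpus in list_numa_nodes().items():
    print(f"node {node_id}: CPUs {cpus}")

# Placeholder arguments: substitute the real llama-server options.
cmd = numa_bound_command(["./llama-server", "--numa", "isolate"], node=0)
print(shlex.join(cmd))
```

If a dual-socket machine reports more than two nodes, SNC (or a similar clustering mode) is probably still enabled, and binding to node 0 would use only part of one socket.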
Journey Context:
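Token generation on CPU is memory-bandwidth-bound, not core-count-bound: each decoded token requires streaming the model weights from RAM, so adding cores does not help once the memory channels are saturated. A dual-socket server may deliver only ~15% more decode tokens/s than a single socket because remote (cross-NUMA) memory accesses are slower than local ones and end up dominating. Running with all cores unbound scatters weight reads across sockets. The winning pattern is to keep weights and compute on the same NUMA node. Interleaving memory across nodes (for example, numactl --interleave=all together with --numa numactl) helps but is still slower than strict local binding.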
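The bandwidth-bound claim can be turned into a rough back-of-envelope estimate. The sketch below assumes a dense model whose weights are read once per generated token, so the decode rate is capped at roughly memory bandwidth divided by model size. The bandwidth and model-size figures are made-up illustrative values, not measurements, and the 1.15 factor simply restates the ~15% dual-socket gain mentioned above.

```python
def decode_tokens_per_s_upper_bound(bandwidth_gb_s, model_size_gb):
    """Rough ceiling on decode speed: every token streams all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Illustrative assumptions only: one socket with 200 GB/s of usable
# bandwidth and a 40 GB quantized model.
single_socket = decode_tokens_per_s_upper_bound(200.0, 40.0)

# Naive expectation if a second socket doubled usable bandwidth.
naive_dual = decode_tokens_per_s_upper_bound(400.0, 40.0)

# What the report describes instead: only about 15% more than one socket.
reported_dual = single_socket * 1.15

print(f"single socket ceiling: {single_socket:.2f} tok/s")
print(f"naive dual-socket:     {naive_dual:.2f} tok/s")
print(f"reported dual-socket:  {reported_dual:.2f} tok/s")
```

The gap between the naive and reported dual-socket figures is the cost of remote memory access, which is why pinning to one node usually beats spreading threads over both sockets.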
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T12:52:16.029457+00:00— report_created — created