Report #17651
[tooling] Slow token generation on dual-Xeon/EPYC workstation despite high core count
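Run llama-server under numactl, i.e. numactl --interleave=all ./llama-server ..., to distribute memory pages across NUMA nodes, or set the environment variable OMP_PROC_BIND=spread (this controls thread placement and only takes effect if the build uses OpenMP threads).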
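A minimal sketch of a launch wrapper for the workaround above, assuming a Linux host that exposes its NUMA nodes under /sys/devices/system/node. The helper names count_numa_nodes and build_launch_command are hypothetical, and "./llama-server" stands in for the full command line that the report elides with "...". The sketch only builds the argument list; it does not start the server.

```python
import re
from pathlib import Path


def count_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Count the nodeN directories Linux exposes in sysfs (0 if the path is absent)."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.iterdir() if re.fullmatch(r"node\d+", p.name))


def build_launch_command(server_argv, node_count):
    """Prefix the command with numactl interleaving only on multi-node hosts."""
    if node_count > 1:
        return ["numactl", "--interleave=all", *server_argv]
    # Single-node (or unknown) topology: interleaving has nothing to spread across.
    return list(server_argv)


if __name__ == "__main__":
    nodes = count_numa_nodes()
    # "./llama-server" is a placeholder for the reporter's full command line.
    print(nodes, build_launch_command(["./llama-server"], nodes))
```

The resulting list could be handed to subprocess.run once the real server arguments are filled in. Skipping the prefix on single-node machines is a design choice for this sketch, not something the report prescribes.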
Journey Context:
By default, Linux allocates each page on the NUMA node local to the thread that first touches it. For a 70B model (40GB+), the weights can therefore end up on a single node, which saturates one socket's memory bandwidth while the other sits idle. A dual-socket Xeon Scalable system has roughly 200GB/s of bandwidth per socket, so local-only allocation becomes the bottleneck. Interleaving spreads pages across both sockets, which can approach double the effective bandwidth. Tradeoff: remote memory accesses have slightly higher latency, but throughput wins for bandwidth-bound inference. Alternative: --tensor-split in llama.cpp is manual and error-prone compared to numactl, and it is documented as a split across GPUs, so it may not change CPU memory placement at all.
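The bandwidth argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes a dense model whose weights are all read once per generated token, uses the report's approximate figures (40 GB of weights, about 200 GB/s per socket), and ignores remote-access latency and interconnect limits, so the results are upper bounds rather than measurements. The function name max_tokens_per_second is hypothetical.

```python
def max_tokens_per_second(model_gb, bandwidth_gb_s):
    """Upper bound on decode rate when every token streams all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 40.0          # approximate weight size from the report (70B model)
PER_SOCKET_GB_S = 200.0  # approximate per-socket bandwidth from the report
SOCKETS = 2

# All pages on one node: only one socket's memory bandwidth is usable.
local_only = max_tokens_per_second(MODEL_GB, PER_SOCKET_GB_S)

# Pages interleaved: idealized sum of both sockets' bandwidth.
interleaved = max_tokens_per_second(MODEL_GB, PER_SOCKET_GB_S * SOCKETS)

print(f"local-only:  {local_only:.1f} tok/s")   # local-only:  5.0 tok/s
print(f"interleaved: {interleaved:.1f} tok/s")  # interleaved: 10.0 tok/s
```

Under these assumptions the ceiling moves from about 5 to about 10 tokens per second; real gains are typically smaller because part of every read crosses the socket interconnect.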
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:54:53.201759+00:00— report_created — created