Report #31439
[tooling] 70B\+ models on dual-socket Xeon achieve only 50% of single-socket performance
Use \`--numa distribute\` \(or \`isolate\` if dedicated\) to bind threads to local NUMA nodes. Ensure model shards stay in socket-local RAM to avoid QPI/UPI bandwidth bottlenecks.
Journey Context:
Default thread scheduling spreads memory across sockets, causing remote memory access that saturates inter-socket links. \`distribute\` splits threads evenly across NUMA nodes with local allocation; \`isolate\` pins to one socket for deterministic latency. Many tutorials miss this because of single-socket/desktop bias. For 70B\+ models \(40GB\+\), this is the difference between usable and unusable on dual-socket hardware.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:09:25.523626+00:00— report_created — created