Report #14410
[tooling] llama-server on Linux shows sporadic 5-10 second stalls during high-throughput inference despite low GPU utilization
Disable Transparent HugePages (THP) defragmentation before starting the server: `echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag`. Alternatively, omit `--mlock` if memory locking is not strictly required. Either change avoids the kernel page compaction stalls. Note that a value written to sysfs does not persist across reboots.
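Before changing the setting, it can help to confirm which defrag mode is currently active. The kernel marks the active value in that sysfs file with square brackets. The following is a minimal sketch, not part of llama.cpp; the function name `active_thp_mode` is a hypothetical helper introduced for illustration.

```python
import re
from pathlib import Path

# Path taken from the workaround command in this report.
THP_DEFRAG = Path("/sys/kernel/mm/transparent_hugepage/defrag")


def active_thp_mode(sysfs_text: str) -> str:
    """Return the bracketed (active) value from a THP sysfs line.

    Example input: "always defer defer+madvise [madvise] never"
    """
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active value found in {sysfs_text!r}")
    return match.group(1)


if __name__ == "__main__":
    if THP_DEFRAG.exists():
        print("THP defrag mode:", active_thp_mode(THP_DEFRAG.read_text()))
    else:
        # Assumption: a missing file means the kernel was built without THP.
        print("THP sysfs entry not found")
```

If the script reports `never`, the workaround is already in effect and the stalls likely have another cause.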
Journey Context:
When `--mlock` is used to prevent swapping and ensure deterministic latency, llama.cpp calls `mlock()` on the model buffers to pin 30-70GB of model weights into RAM. On Linux kernels with Transparent HugePages (THP) enabled (default on Ubuntu/RHEL), the kernel attempts to back this large region with 2MB hugepages. If the system has been running for a while, free memory is typically fragmented into scattered 4KB pages. To satisfy the hugepage request, the kernel triggers synchronous memory compaction (defrag), moving thousands of pages to create contiguous 2MB blocks. This manifests as the llama-server process hanging in 'D' (uninterruptible sleep) state for 5-30 seconds, which appears as a 'stall' in token generation while GPU utilization drops to 0%. The fix is either disabling THP defrag (falling back to 4KB pages, at a small TLB-miss cost for inference) or removing `--mlock` so the kernel pages memory in on demand, accepting the risk that weights are swapped out.
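One way to check whether compaction is involved is to watch the `compact_stall` counter in `/proc/vmstat`, which the kernel increments each time a process stalls to run memory compaction. The sketch below compares two snapshots taken around a stall. The helper names and the 10-second sample window are assumptions made for illustration, and the counter is only present on kernels built with compaction support.

```python
import time
from pathlib import Path
from typing import Dict

VMSTAT = Path("/proc/vmstat")


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse 'name value' lines from /proc/vmstat into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "<name> <non-negative integer>".
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def compact_stall_delta(before: str, after: str) -> int:
    """Increase in compact_stall between two vmstat snapshots."""
    first = parse_vmstat(before).get("compact_stall", 0)
    second = parse_vmstat(after).get("compact_stall", 0)
    return second - first


if __name__ == "__main__":
    snapshot_a = VMSTAT.read_text()
    time.sleep(10)  # sample window in seconds; an arbitrary choice
    snapshot_b = VMSTAT.read_text()
    print("compact_stall increase:", compact_stall_delta(snapshot_a, snapshot_b))
```

A counter that rises during a token-generation stall is consistent with the explanation above, though it is system-wide and does not by itself prove that llama-server was the stalled process.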
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:24:53.750334+00:00— report_created — created