Report #47714
[tooling] llama.cpp inference latency degrades over time on Linux despite sufficient RAM
Add the `--mlock` flag (and optionally `--no-mmap`) to keep the model weights resident in physical RAM so the kernel cannot page them out.
Journey Context:
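llama.cpp memory-maps (mmap) model files by default, which gives fast loading and lets processes share pages. Those pages are file-backed, so the Linux kernel can evict them from the page cache under memory pressure and read them back from disk on the next access; the reporter observed this even when RAM appeared to be available. Over long inference runs, the repeated page faults show up as thrashing and growing latency. `--mlock` asks the kernel to pin the model's pages in physical RAM (llama.cpp does this with `mlock()` on the model's memory region), trading slightly slower startup for consistent latency. The lock can fail if the process's locked-memory limit (`ulimit -l`, i.e. `RLIMIT_MEMLOCK`) is smaller than the model. On some systems, `--no-mmap` is also required so that the weights are read into an ordinary allocation that can be locked.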
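Before relying on `--mlock`, it may help to check whether the process is allowed to lock that much memory, since Linux caps locked memory per process through `RLIMIT_MEMLOCK`. The following is a minimal Python sketch, not part of llama.cpp. The helper names `can_mlock` and `parse_vmlck_kib` and the path `model.gguf` are hypothetical, and the model file size is used only as a rough approximation of how much memory would be locked.

```python
import os
import resource


def can_mlock(model_bytes: int, soft_limit: int) -> bool:
    """Return True if model_bytes fits under the RLIMIT_MEMLOCK soft limit.

    Note: a process with CAP_IPC_LOCK (e.g. root) is not bound by this limit,
    so a False result is only a hint for unprivileged processes.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return model_bytes <= soft_limit


def parse_vmlck_kib(status_text: str) -> int:
    """Extract the VmLck value (KiB currently locked) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            return int(line.split()[1])
    return 0  # field absent: treat as nothing locked


if __name__ == "__main__":
    model_path = "model.gguf"  # hypothetical path, replace with your model file
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("RLIMIT_MEMLOCK soft limit (bytes):", soft)
    if os.path.exists(model_path):
        size = os.path.getsize(model_path)
        print("model size (bytes):", size)
        print("limit large enough for mlock:", can_mlock(size, soft))
```

Once the llama.cpp process is running, passing the contents of `/proc/<pid>/status` to `parse_vmlck_kib` gives a rough indication of whether the weights were actually locked: a value close to the model size suggests the lock took effect, while a value near zero suggests it did not.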
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:33:52.242258+00:00— report_created — created