Report #61469
[tooling] llama.cpp slow on Linux after initial load or during long generations
Add the `--mlock` flag and make sure the locked-memory limit reported by `ulimit -l` is large enough (e.g., `ulimit -l unlimited`) to lock the model pages in RAM, so the kernel cannot page the weights out.
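A minimal sketch of a pre-flight check for this fix, assuming bash and GNU `stat`. The helper name `memlock_ok`, the `model.gguf` path, and the `llama-cli` binary name are illustrative assumptions, so adjust them for your build. The model file size is only a lower bound on what needs to be locked.

```shell
#!/usr/bin/env bash
# Sketch: check that the locked-memory limit can hold the model before using --mlock.

# memlock_ok LIMIT_KIB MODEL_BYTES
# LIMIT_KIB is the value printed by `ulimit -l` (KiB in bash, or "unlimited").
# Returns 0 if that limit is large enough to lock MODEL_BYTES bytes.
memlock_ok() {
  local limit_kib="$1" model_bytes="$2"
  [ "$limit_kib" = "unlimited" ] && return 0
  [ $(( limit_kib * 1024 )) -ge "$model_bytes" ]
}

MODEL="${MODEL:-model.gguf}"   # hypothetical path, override with MODEL=...
if [ -f "$MODEL" ]; then
  size=$(stat -c %s "$MODEL")  # GNU stat: file size in bytes
  if memlock_ok "$(ulimit -l)" "$size"; then
    echo "memlock limit OK; try: ./llama-cli -m $MODEL --mlock"
  else
    echo "memlock limit too low; raise it first (e.g. ulimit -l unlimited)"
  fi
fi
```

Raising the limit with `ulimit -l unlimited` only affects the current shell and may require root or an entry in `/etc/security/limits.conf`, depending on the hard limit.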
Journey Context:
People often blame quantization or batch size for slow generation, but a frequent culprit is the kernel evicting the model's pages from RAM after loading. llama.cpp memory-maps the GGUF file by default, so without `--mlock` memory pressure from other processes can push those pages out of RAM. Every later access then has to read them back from disk, which causes severe slowdowns after the first few tokens. The tradeoff is that you must have enough physical RAM to hold the entire model, and you may need to raise the locked-memory ulimit, but this is often necessary for consistent low-latency inference on server-class or desktop Linux.
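One way to check whether the lock actually took effect is the `VmLck` field in `/proc/<pid>/status`, which reports how much of a process's memory is locked. The sketch below assumes a running process named `llama-server`; that name and the `locked_kib` helper are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: report how much memory a running llama.cpp process has locked.

# locked_kib STATUS_FILE
# Prints the VmLck value (in KiB) from a /proc/<pid>/status style file.
locked_kib() {
  awk '/^VmLck:/ {print $2}' "$1"
}

# Assumed process name; change it to match the binary you run.
pid=$(pgrep -n llama-server || true)
if [ -n "$pid" ]; then
  echo "VmLck: $(locked_kib "/proc/$pid/status") KiB"
fi
```

A value near 0 KiB while the model is loaded suggests the lock did not apply, for example because the memlock limit was too low.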
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:39:47.080676+00:00— report_created — created