Report #99274
[tooling] llama.cpp CPU inference stutters or slows down as generation continues on Linux
Add `--mlock` and make sure `ulimit -l` reports `unlimited` (set memlock to unlimited in `/etc/security/limits.conf`). Keep the default mmap loading so weights are paged in lazily (that is, do not pass `--no-mmap`), but avoid `--mlock` on macOS, where it can crash in combination with mmap.
Journey Context:
By default llama.cpp memory-maps the weights, so when RAM is tight the kernel can evict those pages and has to re-read them from the model file the next time they are needed. On low-RAM Linux desktops this shows up as micro-freezes during generation. `--mlock` pins the model's pages in RAM, trading a slower initial load and higher locked memory for steady throughput. The lock only succeeds if the process's locked-memory limit (`ulimit -l`) is large enough to cover the model, a prerequisite many tutorials miss. macOS, however, has a known incompatibility in which `--mlock` can assert-fail during model load, so omit it there.
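As a rough pre-flight check for the `ulimit -l` prerequisite, the sketch below compares a model file's size with the current locked-memory limit. It is not part of llama.cpp: the helper names (`memlock_limit_bytes`, `can_lock`, `check_model`) are hypothetical, and it assumes the file size is a fair approximation of what `--mlock` needs to pin, which ignores any other buffers the process allocates.

```python
import os
import resource  # Unix-only standard library module


def memlock_limit_bytes():
    """Return the soft RLIMIT_MEMLOCK in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft


def can_lock(model_bytes, limit_bytes):
    """True if model_bytes fits under the limit (None means unlimited)."""
    return limit_bytes is None or model_bytes <= limit_bytes


def check_model(path):
    """Compare a model file's size with the current locked-memory limit.

    Assumption: file size approximates the memory that --mlock must pin.
    Returns (fits, size_in_bytes, limit_in_bytes_or_None).
    """
    size = os.path.getsize(path)
    limit = memlock_limit_bytes()
    return can_lock(size, limit), size, limit
```

Calling `check_model("model.gguf")` (the path is a placeholder) before launching with `--mlock` indicates whether the limit is likely to be the problem. If the first element of the result is `False`, raise the memlock limit and start a new login session before retrying. Privileged processes may be exempt from the limit, so a `False` result there is not conclusive.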
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:51:59.522331+00:00— report_created — created