Report #100676
[tooling] llama.cpp stutters or slows down after model load
Pin the model in RAM with `--mlock` for deterministic latency. If mlock is unavailable or the model is too large, use `--no-mmap` to load the weights eagerly at startup; this costs a slower initial load and higher idle RAM, but removes page-fault stalls during generation.
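The choice between the two flags can be sketched as a small launcher helper. This is an illustration, not part of llama.cpp. It assumes a Unix system, reads "too large" as "larger than the process's locked-memory limit (`RLIMIT_MEMLOCK`)", and assumes the binary is called `llama-cli`; the function names are hypothetical.

```python
import os
import resource  # Unix-only standard-library module


def pick_flag(model_bytes, memlock_limit):
    """Return the flag suggested above for a model of this size.

    Assumption: mlock is usable when the locked-memory limit is
    unlimited or at least as large as the model file.
    """
    if memlock_limit == resource.RLIM_INFINITY or model_bytes <= memlock_limit:
        return "--mlock"
    # Fallback: give up lazy loading and read everything at startup.
    return "--no-mmap"


def build_command(model_path, binary="llama-cli"):
    """Build an argument list; the binary name is an assumption."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    flag = pick_flag(os.path.getsize(model_path), soft_limit)
    return [binary, "-m", model_path, flag]
```

For example, `build_command("model.gguf")` returns a list that can be passed to `subprocess.run`. The helper does not check free RAM, so a model that passes the limit check can still be too large for the machine.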
Journey Context:
By default llama.cpp memory-maps the GGUF file, so the OS reads pages lazily on first access and can evict them under memory pressure. Inference then slows down or stutters unpredictably whenever weights that are not resident (never touched, or evicted) have to be faulted in from disk. `--mlock` tells the OS to keep the mapped pages resident, which is ideal when RAM is sufficient. `--no-mmap` is the fallback: it reads the whole file into allocated memory up front, increasing startup time and resident set size but avoiding lazy reads from the model file during generation. Running both together is usually redundant.
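The difference between lazy mapping and eager loading can be illustrated outside llama.cpp with Python's standard `mmap` module. This is a minimal sketch of the general mechanism, not llama.cpp's loader; the function names and the tiny stand-in file are hypothetical.

```python
import mmap
import os
import tempfile


def open_lazy(path):
    """Map the file; the OS reads pages from disk only when they are touched."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file (the file must not be empty).
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def load_eager(path):
    """Read the whole file into process memory before returning."""
    with open(path, "rb") as f:
        return f.read()


def demo():
    """Check that both strategies expose the same bytes for a small file."""
    # A small stand-in file; a real model file would be gigabytes.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"GGUF" + bytes(4096))
        path = tmp.name
    try:
        mapped = open_lazy(path)
        eager = load_eager(path)
        same = mapped[:4] == eager[:4] and len(mapped) == len(eager)
        mapped.close()
        return same
    finally:
        os.unlink(path)
```

Both paths return identical data. What differs is when the disk reads happen: on first access for the mapping, up front for the eager read. That timing is the source of the stutter described above.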
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:54:28.607642+00:00— report_created — created