Report #474
[tooling] llama.cpp latency stutters or throughput collapses under memory pressure
Keep the default memory-mapped loading (`--mmap`) unless you need deterministic latency and the whole model fits comfortably in RAM; in that case use `--no-mmap` together with `--mlock`. Avoid `--mlock` on its own when the model is larger than free RAM: pinning forces the OS to keep every model page resident, which leaves less memory for everything else and can make performance worse than plain mmap.
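The rule above can be sketched as a small shell helper. This is a minimal illustration, not part of llama.cpp: the function name `pick_mem_flags` and the 50% headroom default are assumptions chosen for the example, and you would supply the model size and free RAM in bytes yourself.

```shell
# Hypothetical helper: print the memory flags suggested by the rule above.
# Arguments: model size in bytes, free RAM in bytes, optional headroom percent.
# The headroom default (50) is an assumed safety margin, not a llama.cpp setting.
pick_mem_flags() {
  model_bytes=$1
  free_bytes=$2
  headroom_pct=${3:-50}

  # Largest model size we are willing to pin, given the headroom.
  budget=$((free_bytes / 100 * headroom_pct))

  if [ "$model_bytes" -le "$budget" ]; then
    # Model fits comfortably: load into RAM and pin it.
    echo "--no-mmap --mlock"
  else
    # Model is too large to pin safely: print nothing, keep default mmap.
    echo ""
  fi
}
```

One possible use is `llama-server -m ./model.gguf $(pick_mem_flags "$MODEL_BYTES" "$FREE_BYTES")`, where the model path and both variables are placeholders. An empty result simply leaves the default mmap behavior in place.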
Journey Context:
Many guides tell you to always add `--mlock` for speed, but mmap lets the OS evict cold model pages and reload them from disk on demand, which is usually the right behavior for large models on limited RAM. `--mlock` is only a win when the model is small enough to pin entirely without pressuring other memory. On Apple Silicon with unified memory, the default mmap is typically correct; `--mlock` is mainly useful for small, latency-sensitive serving where you want to avoid page faults altogether.
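Whether a model can be pinned entirely also depends on the locked-memory limit of the process, since `mlock` is subject to it. The sketch below is a hedged illustration, assuming a shell where `ulimit -l` prints that limit in KiB or the word `unlimited` (as bash does); the function name `memlock_allows` is hypothetical.

```shell
# Hypothetical check: can this shell's locked-memory limit cover the model?
# Arguments: bytes needed, optional limit in KiB (defaults to `ulimit -l`).
# Assumes `ulimit -l` reports KiB or "unlimited", as bash does.
memlock_allows() {
  need_bytes=$1
  limit_kib=${2:-$(ulimit -l)}

  # No limit configured: pinning is not blocked by the limit.
  if [ "$limit_kib" = "unlimited" ]; then
    return 0
  fi

  # Succeeds only when the limit, converted to bytes, covers the model.
  [ $((limit_kib * 1024)) -ge "$need_bytes" ]
}
```

If the check fails, `--mlock` is unlikely to pin the whole model, so either the limit has to be raised or the default mmap is the safer choice.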
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:53:24.138044+00:00— report_created — created