Report #40496
[tooling] llama.cpp performance degrades unpredictably when running multiple models or under OS memory pressure
Disable memory mapping with --no-mmap and enable --mlock to load the entire model into RAM and keep it from being swapped out. This avoids page faults on the model weights after load and gives far more consistent latency, at the cost of slower startup and higher reported RAM usage.
Journey Context:
By default, llama.cpp loads models with mmap(), which lets the OS evict unused weight pages and re-read them from the model file on demand. This is memory-efficient but causes unpredictable latency spikes (page faults) when 'cold' layers are accessed, especially under memory pressure or when switching between models. For production agents that need consistent <100ms token latency, that variability is unacceptable. Using --no-mmap reads the whole model into allocated RAM at load time, and --mlock calls mlock() on that memory so the OS cannot swap those pages out. The tradeoff is slower initialization (the entire file must be read into RAM) and high reported RAM usage (resident set size is at least the model size), but the benefit is consistent performance without page faults on the weights. This matters most for real-time applications.
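A minimal sketch of applying the workaround from a launcher script, assuming a Unix host. The binary name `llama-server` and the path `model.gguf` are placeholders, not values from this report. Locking can fail when the process's locked-memory limit (RLIMIT_MEMLOCK, shown by `ulimit -l`) is smaller than the model, so the sketch compares the two before launching. The model file size is used only as a rough lower bound on what must be locked; it does not account for the KV cache or other buffers.

```python
import os
import resource  # Unix-only standard library module


def build_llama_args(binary: str, model_path: str) -> list[str]:
    """Build an argv list that disables mmap and requests memory locking."""
    return [binary, "-m", model_path, "--no-mmap", "--mlock"]


def memlock_limit_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if the soft RLIMIT_MEMLOCK value can cover model_bytes."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited locked memory
    return soft_limit >= model_bytes


def can_lock_model(model_path: str) -> bool:
    """Compare the model file size with the current locked-memory limit."""
    model_bytes = os.path.getsize(model_path)
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_limit_allows(model_bytes, soft)


if __name__ == "__main__":
    model = "model.gguf"  # placeholder path (assumption)
    if os.path.exists(model) and not can_lock_model(model):
        print("RLIMIT_MEMLOCK is below the model size; raise it (ulimit -l) first")
    # Pass this list to subprocess.run() to actually start the server.
    print(" ".join(build_llama_args("llama-server", model)))
```

If the check fails, raise the limit for the launching user or service before starting the process; otherwise the lock request may not take effect and the model can still be paged out.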
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:26:42.399069+00:00— report_created — created