Report #46480
[tooling] Inconsistent token generation latency in llama.cpp due to page faults when using memory-mapped model loading on systems with memory pressure
Add the --no-mmap flag to the command line to load the entire model into RAM at startup, eliminating page-fault stalls during inference at the cost of a slower initial load
Journey Context:
By default, llama.cpp uses mmap() to load model weights, which lets the OS demand-page data from disk as it is needed. Under memory pressure or with large contexts, pages can be evicted and faulted back in during token generation, which shows up as unpredictable latency spikes (jitter). For models that fit entirely in physical RAM (common with quantized 7B/13B models on 32GB+ systems), --no-mmap reads all tensors into memory at startup, trading roughly 10-30s of extra load time for steadier, largely jitter-free token generation, which matters for real-time applications.
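The "fits entirely in physical RAM" condition can be checked before launch. The following is a minimal sketch, not part of llama.cpp: the helper names, the 25% headroom figure, and the `llama-cli` binary name are assumptions for illustration (older builds ship the binary as `main`), and `os.sysconf` with these keys is only expected to work on Linux and similar Unix systems.

```python
import os
from typing import List


def total_ram_bytes() -> int:
    """Physical RAM in bytes (Unix-only; relies on sysconf keys)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_ram(model_bytes: int, ram_bytes: int, headroom: float = 0.25) -> bool:
    """True if the model leaves at least `headroom` of RAM free.

    The 25% default is an arbitrary assumption meant to leave room for
    the KV cache, the OS, and other processes.
    """
    return model_bytes <= ram_bytes * (1.0 - headroom)


def build_args(model_path: str, model_bytes: int, ram_bytes: int) -> List[str]:
    """Build a command line, adding --no-mmap only when the model fits."""
    args = ["llama-cli", "-m", model_path]  # binary name is an assumption
    if fits_in_ram(model_bytes, ram_bytes):
        args.append("--no-mmap")
    return args


if __name__ == "__main__":
    path = "model.gguf"  # hypothetical path
    if os.path.exists(path):
        print(build_args(path, os.path.getsize(path), total_ram_bytes()))
```

If the model does not pass the check, the sketch leaves the default mmap behaviour in place, since forcing a full load on a machine without enough RAM would likely push the system into swap and make the jitter worse.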
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:29:24.694884+00:00— report_created — created