Report #43562
[tooling] Inference latency spikes or jitter during the first few tokens or when switching batches, especially on macOS or Linux with memory-mapped models
Add the `--no-mmap` flag when loading the model. This makes llama.cpp read the entire model into RAM at startup with ordinary file reads, instead of memory-mapping the file and letting the OS page it in on demand. Page faults on model weights during inference go away, at the cost of slower startup.
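A minimal sketch of applying the flag from a launcher script. The `--no-mmap` flag comes from this report. The binary name `llama-server`, the `-m` model option, and the model path are assumptions about a typical llama.cpp install, and `build_command` is a hypothetical helper, so adjust them to your setup.

```python
import shlex


def build_command(model_path, binary="llama-server", no_mmap=True, extra_args=()):
    """Assemble an argv list for a llama.cpp binary (hypothetical helper).

    `binary` and `-m` are assumptions about the local install;
    `--no-mmap` is the flag described in this report.
    """
    cmd = [binary, "-m", model_path]
    if no_mmap:
        # Ask llama.cpp to read the whole file up front instead of mmap-ing it.
        cmd.append("--no-mmap")
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    # Placeholder path; replace with a real GGUF file.
    cmd = build_command("/path/to/model.gguf")
    # Print the command rather than running it, so it can be checked first.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run; pass the list to `subprocess.run` once it has been verified against your llama.cpp version.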
Journey Context:
By default, llama.cpp uses mmap() to map the GGUF file into virtual memory. This makes startup near-instant and lets the OS evict pages under memory pressure. During inference, however, the first access to weights that are not yet resident triggers page faults, causing millisecond-scale stalls (latency spikes) that hurt real-time streaming, especially on the first tokens. Users often misattribute this to 'model warming up' or 'GPU issues.' With `--no-mmap`, all tensors are loaded eagerly into memory at startup via standard file reads. The tradeoff is a slower startup (proportional to model size, e.g., 10-20s for 70B from SSD) and higher immediate RAM usage (no lazy loading). In exchange, generation no longer faults on file-backed weight pages, so latency is far more consistent, provided the machine has enough free RAM that the loaded weights are not swapped out. This matters most for production server deployments where consistent latency outweighs startup time, especially on macOS, where mapped pages can reportedly be evicted particularly aggressively.
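The lazy-versus-eager behavior described above can be illustrated outside llama.cpp. The sketch below is not llama.cpp code: it uses a small throwaway file as a stand-in for a model (an assumption, not a real GGUF), and the helper names are hypothetical. It maps the file and touches one byte per page, then reads the whole file up front and touches the same offsets. Because the file has just been written, it is probably still in the page cache, so the gap will be much smaller than with a large model read cold from disk.

```python
import mmap
import os
import tempfile
import time

PAGE = mmap.PAGESIZE


def make_dummy_file(size_bytes):
    """Write a throwaway file standing in for a model file (not a real GGUF)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


def touch_pages(buf, size):
    """Read one byte per page; return (checksum, slowest single access in seconds)."""
    checksum, worst = 0, 0.0
    for off in range(0, size, PAGE):
        t0 = time.perf_counter()
        checksum += buf[off]
        worst = max(worst, time.perf_counter() - t0)
    return checksum, worst


def compare(path):
    size = os.path.getsize(path)

    # Lazy: map the file; each page is faulted in on first access.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            lazy_sum, lazy_worst = touch_pages(m, size)

    # Eager: pay the full read cost up front, then access process memory.
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    load_time = time.perf_counter() - t0
    eager_sum, eager_worst = touch_pages(data, size)

    return {
        "lazy_sum": lazy_sum,
        "eager_sum": eager_sum,
        "lazy_worst": lazy_worst,
        "eager_worst": eager_worst,
        "load_time": load_time,
    }


if __name__ == "__main__":
    path = make_dummy_file(256 * PAGE)
    try:
        result = compare(path)
        print("slowest mmap access :", result["lazy_worst"])
        print("slowest eager access:", result["eager_worst"])
        print("upfront read time   :", result["load_time"])
    finally:
        os.remove(path)
```

Timings vary by OS, filesystem, and cache state, so treat the numbers as illustrative only; the point is where the I/O cost lands, not its exact size.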
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:35:34.549930+00:00— report_created — created