Report #92933
[tooling] llama.cpp server has unpredictable 100-500ms latency spikes under load despite warm GPU
Launch the server with the `--no-mmap` and `--mlock` flags. `--no-mmap` reads the model weights into memory at startup instead of letting the OS page them in lazily from disk, and `--mlock` pins those pages in physical RAM so they cannot be swapped out. Together they remove the page faults that cause latency jitter during concurrent request bursts.
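A minimal sketch of a launcher that always passes both flags. It assumes the server binary is installed as `llama-server` on the PATH (older builds name it differently), and the model path and the helper name `build_server_command` are placeholders for illustration, not part of llama.cpp.

```python
import shlex


def build_server_command(model_path, extra_args=()):
    """Return an argv list that starts the server with mmap disabled and pages locked.

    Assumption: the binary is called `llama-server`; adjust for your build.
    """
    return ["llama-server", "-m", model_path, "--no-mmap", "--mlock", *extra_args]


if __name__ == "__main__":
    # Hypothetical model path; replace with your own GGUF file.
    cmd = build_server_command("/models/model.gguf")
    # Print the command for review instead of running it blindly.
    print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run; pass the list to `subprocess.run` once the output has been checked.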
Journey Context:
By default, llama.cpp memory-maps the GGUF file, which lets the OS page weights in and out as needed. This works for single-user chat but can fail in production with concurrent users: under memory pressure during a traffic spike, the OS may evict mapped weight pages, or may still be faulting in sections it has not touched yet, and each page fault that goes to disk can stall a request for 100-500 ms even though the model is nominally 'loaded.' `--no-mmap` forces eager loading into RAM at startup, and `--mlock` (which requires elevated privileges or a sufficiently high `ulimit -l`) prevents the OS from paging that RAM out to disk. Tradeoff: startup time increases by 10-30 seconds for 70B models, and you need enough physical RAM (not just VRAM) to hold every part of the model that is kept in host memory, which is the entire model when no layers are offloaded to the GPU. Without these flags, p99 latency SLAs are hard to meet under load; with them, latency is bounded mainly by GPU computation rather than by disk I/O.
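Because `--mlock` depends on the locked-memory limit and on available RAM, a preflight check before launch can catch both problems early. The sketch below assumes a Linux-like system, and it uses the GGUF file size as a rough upper bound on the memory to be locked (an assumption; less is needed when layers are offloaded to the GPU). The function names are hypothetical. It does not detect root or `CAP_IPC_LOCK`, which bypass the limit on Linux.

```python
import os
import resource


def can_lock(model_bytes, memlock_soft_limit):
    """True if the soft RLIMIT_MEMLOCK is unlimited or at least model_bytes."""
    if memlock_soft_limit == resource.RLIM_INFINITY:
        return True
    return memlock_soft_limit >= model_bytes


def physical_ram_bytes():
    """Total physical RAM, computed from the page size and the page count."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def preflight(model_path):
    """Return a list of human-readable problems; an empty list means none found."""
    size = os.path.getsize(model_path)
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    problems = []
    if not can_lock(size, soft):
        problems.append(
            f"memlock limit ({soft} bytes) is below model size ({size} bytes); "
            "raise it with ulimit -l or run with elevated privileges"
        )
    if physical_ram_bytes() < size:
        problems.append("physical RAM is smaller than the model file")
    return problems
```

Calling `preflight()` with the path to the GGUF file and refusing to start when the returned list is non-empty turns a silent locking failure into an explicit startup error.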
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:34:30.923748+00:00— report_created — created