Report #51984
[tooling] Non-deterministic latency spikes (jitter) in local LLM inference on Linux despite sufficient RAM
Launch llama.cpp with --no-mmap --mlock to load the entire model into physical RAM and prevent the kernel from swapping it out. With mmap (the default), --mlock alone only locks pages that have already been touched, leaving the rest vulnerable to memory pressure.
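A minimal sketch of a launch helper, assuming Python is available on the host. The binary name and model path passed in are placeholders, not part of this report; -m, --no-mmap and --mlock are the llama.cpp flags. The RLIMIT_MEMLOCK check is an added precaution: for an unprivileged process, mlock can fail when the locked-memory limit is smaller than the model, so it is worth checking before launch.

```python
import resource
from typing import List


def memlock_allows(model_bytes: int) -> bool:
    """Return True if the soft RLIMIT_MEMLOCK is unlimited or at least model_bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY or soft >= model_bytes


def build_command(binary: str, model_path: str) -> List[str]:
    """Build an argv that disables mmap and asks llama.cpp to lock the model in RAM."""
    # binary and model_path are supplied by the caller (hypothetical values in the usage note).
    return [binary, "-m", model_path, "--no-mmap", "--mlock"]
```

Usage, with placeholder names: compare os.path.getsize("model.gguf") against memlock_allows(...) and, if it passes, hand build_command("llama-cli", "model.gguf") to subprocess.run. If the check fails, raising the limit (for example with ulimit -l in the launching shell) is the usual remedy.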
Journey Context:
Users often assume --mlock alone pins the entire model, but with mmap (the default) the kernel maps the file on demand. --mlock then only affects pages that are already loaded; later page faults can still trigger disk I/O when the system is under memory pressure. --no-mmap reads the model through standard file I/O into malloc'd memory, so --mlock can pin the full allocation. This matters for real-time voice assistants or robotics, where GC-like pauses of 100 ms or more caused by page faults are unacceptable. Tradeoff: slower startup (the full model is read from disk) and higher initial RSS.
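One way to check whether the model is actually pinned is to compare the locked and resident sizes that Linux reports in /proc/<pid>/status. The sketch below is an illustration only and is not part of the original report; it assumes the inference process ID is known. If the whole allocation is locked, VmLck should be in the same range as the model size rather than near zero.

```python
from typing import Optional, Tuple


def parse_status_kb(status_text: str, key: str) -> Optional[int]:
    """Extract a kB-valued field such as VmLck or VmRSS from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith(key + ":"):
            # Lines look like "VmLck:      4096 kB"; the second token is the number.
            return int(line.split()[1])
    return None


def locked_and_resident_kb(pid: int) -> Tuple[Optional[int], Optional[int]]:
    """Read VmLck (locked) and VmRSS (resident) in kB for a running process."""
    with open(f"/proc/{pid}/status") as f:
        text = f.read()
    return parse_status_kb(text, "VmLck"), parse_status_kb(text, "VmRSS")
```

Calling locked_and_resident_kb(pid) once after model load and again under memory pressure gives a rough before/after comparison; a VmLck value that stays far below the model size suggests the pages are not fully pinned.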
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:45:03.158977+00:00— report_created — created