Report #24760
[tooling] Inconsistent inference latency or random slowdowns in production local deployment
Compile with LLAMA_SCHED=1 (default in recent builds) and run with --mlock --no-mmap -c 4096 --batch-size 1024. --mlock pins the model in physical RAM so it cannot be swapped out, --no-mmap loads the weights into anonymous memory instead of the file-backed page cache that the kernel evicts under memory pressure, and an explicit --batch-size tunes compute-graph chunking for your CPU cache size.
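The flag combination above can be assembled programmatically, for example from a service launcher. This is a minimal sketch, not a definitive launcher: the binary name ./llama-cli and the model path models/model.gguf are placeholders (assumptions, not from this report), and build_llama_command is a hypothetical helper. Only the flags themselves come from the workaround.

```python
import shlex


def build_llama_command(binary, model_path, ctx_size=4096, batch_size=1024):
    """Build an argv list using the flags from the workaround above.

    `binary` and `model_path` are caller-supplied placeholders; this
    helper is illustrative and not part of llama.cpp.
    """
    return [
        binary,
        "-m", model_path,
        "--mlock",      # pin weights in physical RAM so they cannot be swapped
        "--no-mmap",    # read weights into anonymous memory, not the page cache
        "-c", str(ctx_size),
        "--batch-size", str(batch_size),
    ]


# Hypothetical binary and model path, for illustration only.
cmd = build_llama_command("./llama-cli", "models/model.gguf")
print(shlex.join(cmd))
```

Passing the list straight to subprocess.run(cmd) avoids shell quoting problems when the model path contains spaces.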
Journey Context:
Users blame the model for latency spikes, but the culprit is usually OS memory management. With mmap (the default), the OS treats model weights as file cache; under memory pressure it evicts those pages, and inference then stalls on disk I/O while they are read back. --no-mmap forces malloc + read, keeping the weights in anonymous pages. However, without --mlock, the OS can still swap these pages to disk. The hard insight is the interaction: --no-mmap alone doesn't prevent swapping; --mlock alone doesn't prevent mmap cache eviction. You need both for predictable latency. The batch-size tuning prevents L3 cache thrashing on CPUs when processing multiple tokens.
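One practical caveat when relying on --mlock: on Linux, locking memory is subject to the RLIMIT_MEMLOCK resource limit, and a limit smaller than the model can cause the lock to fail. The sketch below is one way to check this before launch, under stated assumptions: it treats the model file size as a lower bound on what must be locked (runtime buffers add to it), it does not account for privileged processes that bypass the limit, and memlock_allows and check_model are hypothetical helper names. It uses Python's standard resource module, which is available on Unix only.

```python
import os
import resource


def memlock_allows(required_bytes, soft_limit):
    """Return True if a soft RLIMIT_MEMLOCK value permits locking required_bytes."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited: any size can be locked
    return soft_limit >= required_bytes


def check_model(model_path):
    """Compare the current process's memlock limit with the model file size.

    The file size is only a lower bound on the memory that needs to be
    locked; treat a True result as necessary, not sufficient.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    model_bytes = os.path.getsize(model_path)
    return memlock_allows(model_bytes, soft_limit)
```

Calling check_model("models/model.gguf") (a placeholder path) from the same user and environment that starts the server gives an early warning; if it returns False, raising the limit (for example with ulimit -l or the service manager's memlock setting) is the usual remedy.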
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:58:19.566503+00:00— report_created — created