Report #59930
[tooling] Running GGUF models from NFS/SMB network storage causes severe token-generation stuttering
Disable memory mapping with `--no-mmap` so the model is read sequentially into RAM at startup, then offload layers to the GPU with `-ngl N`. When RAM is tight, add `--mlock` alongside `--no-mmap` so the weights cross the network once at load time and stay pinned in memory, avoiding page faults during generation. The portion of the model that is not offloaded to the GPU still has to fit in RAM.
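The flag combination above can be assembled programmatically. The following is a minimal sketch, not part of the original report: the helper name `build_llama_args`, the binary name `llama-cli`, and the example model path are assumptions for illustration, so substitute whatever your llama.cpp build and mount point actually use.

```python
import shlex


def build_llama_args(model_path, n_gpu_layers, pin_in_ram=False, binary="llama-cli"):
    """Assemble an argv list that avoids mmap when the model sits on network storage.

    `binary` and `model_path` are placeholders (assumptions); only the flags
    -m, --no-mmap, -ngl and --mlock come from the report.
    """
    args = [binary, "-m", model_path, "--no-mmap", "-ngl", str(n_gpu_layers)]
    if pin_in_ram:
        # Ask the OS to keep the loaded weights resident in RAM.
        args.append("--mlock")
    return args


if __name__ == "__main__":
    # Hypothetical NFS path and layer count, for illustration only.
    argv = build_llama_args("/mnt/nfs/models/example.gguf", 32, pin_in_ram=True)
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)`. Note that `--mlock` may fail or be ignored if the process's locked-memory limit is lower than the model size, so check that limit before relying on it.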
Journey Context:
llama.cpp defaults to mmap() for zero-copy loading, which works well on local SSDs. Over network filesystems (NFS, SMB), mmap turns weight accesses during inference into scattered 4 KB page faults that each have to cross the wire, causing stalls of around 10 seconds per token. `--no-mmap` instead reads the file sequentially into RAM at startup, paying the network cost once upfront. This requires enough RAM to hold the portion of the model that is not offloaded to the GPU. If RAM is limited, using `--mlock` with `--no-mmap` asks the OS to fault in all pages immediately and pin them, so generation triggers no further network I/O. This turns networked model storage from unusable into viable.
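The difference between the two access patterns described above can be sketched in a few lines. This is an illustration under stated assumptions, not llama.cpp's actual loader: the function names, the 1 MiB chunk size, and the use of a throwaway temp file instead of a real GGUF model are all hypothetical.

```python
import mmap
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB read size; an arbitrary choice for this sketch


def load_sequential(path):
    """Read the file front to back into one in-memory buffer (the --no-mmap pattern)."""
    buf = bytearray()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            buf += chunk
    return bytes(buf)


def read_via_mmap(path, offsets, length=16):
    """Map the file and read small slices at scattered offsets (the default mmap pattern).

    The first touch of each page may raise a page fault that the filesystem
    has to service, which on NFS/SMB means a network round trip.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return [m[o:o + length] for o in offsets]


if __name__ == "__main__":
    # Stand-in for a model file; a real GGUF would live on the network mount.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * CHUNK))
        path = tmp.name
    try:
        data = load_sequential(path)
        offsets = [0, 3 * CHUNK, CHUNK + 4096]
        slices = read_via_mmap(path, offsets)
        # Both approaches see the same bytes; only the I/O pattern differs.
        assert all(s == data[o:o + 16] for s, o in zip(slices, offsets))
    finally:
        os.remove(path)
```

Both functions return identical bytes. What differs is when the I/O happens: the sequential read pays for the whole file upfront in large requests, while the mapped read defers the cost to whenever each page is first touched.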
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T07:04:41.322784+00:00— report_created — created