Report #15562
[tooling] llama.cpp server re-processes the entire 32k context on every new connection, causing 30s+ latency for stateful agents
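Start the server with `--slot-save-path /var/cache/llama/slots` and reuse the same `id_slot` across connections. Save the slot's KV cache to disk with `POST /slots/{id_slot}?action=save` and reload it with `POST /slots/{id_slot}?action=restore` instead of recomputing the prefill.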
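A minimal client-side sketch of this workaround, assuming the slot endpoints and the `/completion` options `id_slot` and `cache_prompt` as described in the llama.cpp server README. The base URL, the file name `agent-42.bin`, and the helper names are hypothetical and chosen only for illustration; check the README for the server version you run before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: server listening locally


def slot_action_request(base_url, id_slot, action, filename=None):
    """Build (url, body) for POST /slots/{id_slot}?action=save|restore|erase."""
    if action not in ("save", "restore", "erase"):
        raise ValueError(f"unsupported slot action: {action}")
    if action != "erase" and not filename:
        raise ValueError("save and restore need a filename")
    url = f"{base_url}/slots/{id_slot}?action={action}"
    # The file name is resolved by the server inside the --slot-save-path directory.
    body = {} if action == "erase" else {"filename": filename}
    return url, body


def completion_request(base_url, id_slot, prompt, n_predict=128):
    """Build (url, body) for a completion pinned to one slot."""
    body = {
        "prompt": prompt,
        "n_predict": n_predict,
        "id_slot": id_slot,     # always target the same slot for this session
        "cache_prompt": True,   # reuse the prefix already held in the slot's KV cache
    }
    return f"{base_url}/completion", body


def post_json(url, body, timeout=600):
    """Send a JSON POST and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Typical session flow (requires a running server, so it is left commented out):
# post_json(*slot_action_request(BASE_URL, 0, "restore", "agent-42.bin"))
# reply = post_json(*completion_request(BASE_URL, 0, long_prompt))
# post_json(*slot_action_request(BASE_URL, 0, "save", "agent-42.bin"))
```

The request builders are kept separate from the network call so the payloads can be inspected or logged before anything is sent.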
Journey Context:
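As reported, the server runs the full prompt through the model to populate the KV cache on every new HTTP connection. For a 70B model with a 32k context, this prefill takes 30-60s before token generation starts, so agents that maintain long sessions (e.g., coding copilots) pay that latency on every reconnect. The `--slot-save-path` flag (added in 2024) sets a directory for slot cache files and enables the slot save/restore endpoints: `POST /slots/{id_slot}?action=save` writes the slot's KV cache to a named file in that directory, and `POST /slots/{id_slot}?action=restore` loads it back. Saving is an explicit client request; the server documentation does not describe an automatic save on slot release or on a timer. After a restore, a completion request with the matching `id_slot` and `cache_prompt` enabled only has to process the tokens that are not already in the cache, not the whole prompt. Tradeoffs: disk space (roughly the slot's KV cache size per saved file) and restore time (disk I/O instead of compute). This matters most for production stateful APIs.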
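To put a rough number on the disk-space tradeoff, the sketch below estimates the KV cache size for one slot. The model dimensions are an assumption (a Llama-2-70B-style layout with 80 layers, 8 KV heads, and head dimension 128, stored as 16-bit values); the report does not name the model, and the saved file also carries token ids and metadata, so treat the result as an order-of-magnitude estimate.

```python
def kv_cache_bytes(n_layer, n_head_kv, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layer * n_head_kv * head_dim * n_ctx * bytes_per_elem


# Assumed dimensions for a 70B model with grouped-query attention, f16 cache.
size = kv_cache_bytes(n_layer=80, n_head_kv=8, head_dim=128, n_ctx=32768)
print(f"{size / 2**30:.1f} GiB per full 32k slot")  # prints "10.0 GiB per full 32k slot"
```

Under these assumptions, each fully populated 32k slot takes about 10 GiB on disk, so restore time depends heavily on storage throughput.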
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:24:21.412546+00:00— report_created — created