Report #75171
[tooling] Re-processing the same long system prompt on every request wastes compute and adds latency
Use llama.cpp's disk-based prompt cache: pass `--prompt-cache file.bin` to write the prompt's computed KV cache to disk, then pass the same flag on the next run to load it. Add `--prompt-cache-all` to also cache user input and generated tokens, not just the initial prompt.
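As a minimal sketch of the invocation described above, the following Python helper assembles the command line. The binary name `llama-cli`, the file names, and the helper `build_llama_cmd` are assumptions for illustration (older llama.cpp builds name the binary `main`); only `--prompt-cache` and `--prompt-cache-all` come from this report, while `-m` and `-f` are the usual model and prompt-file options.

```python
import shlex


def build_llama_cmd(model_path, prompt_file, cache_file,
                    cache_all=False, binary="llama-cli"):
    """Build an argv list that runs llama.cpp with a disk prompt cache.

    The same command is used for the first (cold) run, which writes
    cache_file, and for later runs, which load it if it exists.
    """
    cmd = [binary, "-m", model_path, "-f", prompt_file,
           "--prompt-cache", cache_file]
    if cache_all:
        # Also persist user input and generations, not only the prompt.
        cmd.append("--prompt-cache-all")
    return cmd


# Hypothetical file names, chosen only for this example.
cmd = build_llama_cmd("model.gguf", "system_prompt.txt",
                      "system_prompt.cache.bin")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd)` unchanged; running it twice with the same prompt file should make the second start skip most of the prompt evaluation, provided the cache file was written successfully.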
Journey Context:
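Agents often send a massive system prompt (RAG context, code definitions) repeatedly. llama.cpp can serialize the computed KV cache to a binary file, so a later run skips the forward pass over the cached part of the prompt during warm-up. The prompt is still tokenized, since the tokens are what get compared against the cache. The feature is underused because it is distinct from the server's in-memory slot management. Critical detail: the cache is matched against the exact token sequence, and reuse stops at the first token that differs. Changing even one token near the start of the prompt therefore discards almost everything cached after it and falls back to near-full processing, so keep the stable content first and the variable content last. Use `--prompt-cache-all` to persist multi-turn conversation state across process restarts.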
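To make the token-exact matching concrete, here is a toy sketch of how much cached work survives an edit. It assumes the cache is reused up to the first differing token, which matches how the llama.cpp CLI reports partial session matches; the function `reusable_prefix_len` and the token IDs are made up for illustration and are not llama.cpp code.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count leading tokens shared by the cached and the new prompt.

    Only this prefix can be served from the cache; every token after
    the first mismatch has to be evaluated again.
    """
    n = 0
    for cached_tok, new_tok in zip(cached_tokens, new_tokens):
        if cached_tok != new_tok:
            break
        n += 1
    return n


# Made-up token IDs standing in for a tokenized system prompt.
cached = [1, 15, 27, 99, 4, 8]
edited = [1, 15, 27, 42, 4, 8]   # one token changed in the middle
appended = cached + [7, 3]       # same prompt plus a new user turn

print(reusable_prefix_len(cached, edited))    # 3
print(reusable_prefix_len(cached, appended))  # 6
```

Appending to a cached prompt keeps the whole cache usable, while an edit in the middle keeps only what precedes it. This is why stable material such as instructions and shared context is best placed at the front of the prompt.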
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:46:21.717932+00:00 - report_created - created