Report #48860
[tooling] Cold start latency when restarting llama.cpp main with the same long system prompt
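Pass --prompt-cache <file> on every run. If the file does not exist yet, llama.cpp evaluates the prompt and saves the resulting state (including the KV cache) to it; on later runs it loads the file and skips evaluation for the part of the prompt that matches the cached tokens. Add --prompt-cache-all to also save user input and generated tokens, or --prompt-cache-ro to reuse the cache without updating it. The prompt is still tokenized so it can be compared with the cached tokens; only the expensive context processing is skipped.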
Journey Context:
When running llama.cpp main in a script or REPL that restarts the process frequently, the model reprocesses the system prompt on every launch, causing multi-second delays. The --prompt-cache flag writes the computed KV cache (the internal state after processing the prompt), together with the prompt tokens, to a binary file. On the next launch with the same model and the same prompt prefix, loading this file restores that state, so only tokens that differ from the cached prefix need to be evaluated. This is distinct from server slots (which keep the process alive) and suits stateless CLI workflows or crash recovery. The file size grows with the cached context: the KV portion is roughly 2 (K and V) × layers × cached tokens × KV embedding width × bytes per element, where the element size is 2 bytes for the default f16 cache rather than sizeof(float).
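A minimal sketch of how a restart-heavy wrapper script might apply this, assuming the binary is invoked as ./main and that the file names model.gguf, system.txt and system.cache are placeholders. The helper names build_command, run_with_cache and estimate_kv_cache_bytes are hypothetical and not part of llama.cpp; only the -m, -f, --prompt-cache and --prompt-cache-ro flags come from the llama.cpp CLI. The size estimate assumes an f16 KV cache and ignores the other data stored in the cache file, so treat it as an order-of-magnitude figure.

```python
import subprocess


def build_command(binary, model, prompt_file, cache_file, read_only=False):
    """Build an argv list that reuses one prompt cache file across runs.

    All paths are placeholders chosen for illustration.
    """
    cmd = [
        str(binary),
        "-m", str(model),
        "-f", str(prompt_file),             # long system prompt, unchanged between runs
        "--prompt-cache", str(cache_file),  # loaded if present, created otherwise
    ]
    if read_only:
        # Reuse the cached state without updating the file.
        cmd.append("--prompt-cache-ro")
    return cmd


def run_with_cache(binary, model, prompt_file, cache_file, read_only=False):
    """Launch llama.cpp with the prompt cache enabled (hypothetical wrapper)."""
    cmd = build_command(binary, model, prompt_file, cache_file, read_only)
    return subprocess.run(cmd, check=True)


def estimate_kv_cache_bytes(n_layer, n_tokens, n_embd_kv, bytes_per_elem=2):
    """Rough KV cache size: K and V tensors, per layer, per cached token.

    Assumes f16 entries (2 bytes each) unless bytes_per_elem says otherwise.
    """
    return 2 * n_layer * n_tokens * n_embd_kv * bytes_per_elem
```

The first call to run_with_cache("./main", "model.gguf", "system.txt", "system.cache") would pay the full prompt-processing cost and create the cache file; later calls with the same model and prompt should start faster. Passing read_only=True keeps a shared cache file from being modified by individual runs.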
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:29:19.489924+00:00— report_created — created