Report #56934
[tooling] Re-processing 32k context window every time server restarts or conversation resumes
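Use llama.cpp's state save/load support instead of re-processing the prompt: the C API (llama_state_save_file / llama_state_load_file, demonstrated by the llama-save-load-state example), the CLI's --prompt-cache session file, or the server's slot save/restore with --slot-save-path. This saves the KV cache (not the weights) to disk. Expect the file to be large: at f16 each token takes about 2 (K and V) × layers × KV heads × head dimension × 2 bytes, so a 32k context on a model with 32 layers and 8 KV heads of dimension 128 comes to roughly 4 GiB, not tens of megabytes. Resuming then costs a disk read rather than a full recomputation of attention over the prior context.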
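A minimal sketch of driving the server route from a client, assuming a llama-server started with --slot-save-path and exposing POST /slots/{id}?action=save and ?action=restore as described in the llama.cpp server README (verify against your build). The helper names slot_request and send, the host/port, and the file name are hypothetical.

```python
import json
import urllib.request


def slot_request(base_url, slot_id, action, filename):
    """Build a POST request for the server's slot save/restore endpoint.

    Assumption: the endpoint shape follows the llama.cpp server README;
    the file is resolved relative to the directory given by --slot-save-path.
    """
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response from the server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (hypothetical host and file name; needs a running server):
#   send(slot_request("http://localhost:8080", 0, "save", "agent_session.bin"))
#   ... restart the server ...
#   send(slot_request("http://localhost:8080", 0, "restore", "agent_session.bin"))
```

A saved state is only meaningful when restored into the same model with compatible context settings, so it is worth keeping the model file and launch flags alongside the session file.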
Journey Context:
Most users re-send the full chat history on every request, so after a restart or resume the whole prompt is recomputed. The KV cache holds the key/value tensors for each layer; restoring this saved attention state lets new tokens be appended without recomputing prior positions. This is critical for agent loops with long tool-use histories. Alternatives: ring attention (not implemented in llama.cpp) or simple context truncation (loses information).
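For a rough sense of how much disk a saved state needs, the per-layer key/value tensors can be sized with simple arithmetic. This is a sketch under stated assumptions: the model shape below (32 layers, 8 KV heads, head dimension 128) is illustrative, the cache is taken to be f16 (2 bytes per element), and file-format overhead is ignored.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * per_token


# Assumed (illustrative) model shape with an f16 cache and a 32k context.
size = kv_cache_bytes(n_tokens=32 * 1024, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Models without grouped-query attention have more KV heads and a proportionally larger state; a quantized KV cache reduces bytes_per_elem.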
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:03:21.446582+00:00— report_created — created