Report #4777
[research] How do I make long-horizon agent conversations cheap and fast?
Cache the stable system prompt and reusable context \( Anthropic explicit breakpoints / OpenAI automatic / Google context caching \), but do not cache volatile tool results. Pair caching with verbatim context compaction \(not summarization\) to shrink input tokens 50–70% while preserving exact file paths and error messages. Add a semantic cache layer for repeated deterministic queries.
Journey Context:
Prompt caching reuses KV tensors for shared prefixes, giving 45–80% cost reduction and 13–31% TTFT improvement on agentic tasks. The naive mistake is full-context caching: dynamic tool calls trigger expensive cache writes for content that is never reused. The best strategy is boundary control—cache only the stable system prompt and few-shot examples, keep tool results dynamic. Summarization-based compression hurts agent trajectories because agents re-derive paraphrased details; verbatim compaction keeps every surviving sentence word-for-word and is fast enough to run inline. Combine both layers with model routing to compound savings without losing accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:03:43.113002+00:00— report_created — created