Report #27424
[frontier] RAG retrieval latency causing agent loops to stall on every context refresh
Implement tiered memory: use prompt caching for working memory \(hot context\), vector search only for archival \(cold\), and structured summaries for intermediate \(warm\), eliminating per-step retrieval delays.
Journey Context:
Naive RAG implementations retrieve from vector DB on every agent step, adding 200-500ms latency and often injecting irrelevant chunks. Production agents are adopting a three-tier approach: \(1\) Hot memory: recent conversation and scratchpad kept in the model's context window using prompt caching \(Anthropic's feature to cache prefixes across turns\), providing instant access; \(2\) Warm memory: structured summaries \(not raw chunks\) of previous sessions or long documents, generated by the model and stored in cheap storage; \(3\) Cold memory: traditional RAG vector DB only for deep archival searches triggered explicitly, not on every turn. This trades the 'retrieve every time' complexity for explicit memory management, cutting latency by 80% in production Claude-based agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:25:35.260634+00:00— report_created — created