Report #47011
[frontier] How do I eliminate retrieval latency and context contamination in knowledge-intensive agents?
Pre-load the model's KV cache with the relevant documents ahead of request time (Cache-Augmented Generation, CAG): precompute the key-value pairs for the selected documents and store them in a hot-cache tier. Serve each generation request by appending the query to the cached prefix, so inference skips the RAG retrieval step and its latency.
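A minimal sketch of the hot-cache tier described above, assuming a small LRU store keyed by a hash of the document set. The names HotPrefixCache, CachedPrefix, and toy_encode are hypothetical and do not come from any library; toy_encode only stands in for the model's prefill pass, so this shows the control flow rather than a real KV cache.

```python
import hashlib
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedPrefix:
    """Stand-in for one document set's precomputed key-value pairs."""
    doc_key: str
    kv: tuple  # toy "KV pairs": one (token, length) entry per token


def toy_encode(texts):
    # Hypothetical placeholder for the model's prefill pass over some text.
    return tuple((tok, len(tok)) for text in texts for tok in text.split())


class HotPrefixCache:
    """LRU hot tier of precomputed prefixes, keyed by document-set hash."""

    def __init__(self, capacity=2, encode=toy_encode):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.encode = encode
        self.entries = OrderedDict()
        self.encode_calls = 0  # counts how often documents were encoded

    @staticmethod
    def key_for(docs):
        digest = hashlib.sha256()
        for doc in docs:
            digest.update(doc.encode("utf-8"))
            digest.update(b"\x00")  # separator so ["ab"] and ["a", "b"] differ
        return digest.hexdigest()

    def preload(self, docs):
        """Encode the documents once, outside the request path."""
        key = self.key_for(docs)
        if key not in self.entries:
            self.encode_calls += 1
            self.entries[key] = CachedPrefix(key, self.encode(docs))
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        self.entries.move_to_end(key)
        return key

    def serve(self, key, query):
        """Request path: reuse the cached prefix and encode only the query."""
        prefix = self.entries[key]  # KeyError if never preloaded or evicted
        self.entries.move_to_end(key)
        return prefix.kv + self.encode([query])
```

Calling preload once for a fixed knowledge base and then serve for each incoming query keeps document encoding off the request path; a KeyError from serve signals that the prefix was evicted and has to be preloaded again.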
Journey Context:
Naive RAG retrieves documents and then encodes them during generation, which adds roughly 100-500 ms of latency per request and lets retriever errors contaminate the context. CAG treats the knowledge as a 'warmup prefix' that is pre-encoded into the KV cache. This moves the encoding work out of the request path into a preload step (an upfront cost that is acceptable for high-value queries) and makes the included context deterministic. The pattern is being adopted in place of RAG in latency-sensitive production agents where the needed knowledge is predictable (e.g., customer support with fixed KBs).
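Pre-encoding the prefix is safe because, under causal attention, the keys and values of the document tokens do not depend on the query that follows them. The sketch below illustrates this with a toy single-head, single-layer attention in pure Python. The weight matrices, embeddings, and function names are made up for illustration, and positional encodings and multiple layers are left out.

```python
import math


def project(x, w):
    """Multiply token vector x by weight matrix w (given as a list of rows)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Softmax attention of a single query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(dim)]


# Toy parameters and embeddings (illustrative values only).
W_Q = [[0.4, 0.1], [-0.3, 0.6]]
W_K = [[0.5, -0.2], [0.1, 0.3]]
W_V = [[1.0, 0.0], [0.2, 0.7]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # knowledge prefix
query_token = [0.2, 0.9]

# Preload step: prefix keys and values are computed once and kept.
cached_k = [project(t, W_K) for t in doc_tokens]
cached_v = [project(t, W_V) for t in doc_tokens]


def answer_with_cache(query):
    """Request path: project only the query and reuse the cached prefix."""
    keys = cached_k + [project(query, W_K)]
    values = cached_v + [project(query, W_V)]
    return attend(project(query, W_Q), keys, values)


def answer_from_scratch(query):
    """Baseline: re-encode every document token on each request."""
    tokens = doc_tokens + [query]
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(query, W_Q), keys, values)
```

Both functions return the same attention output for the same query; the cached path simply skips the per-request projections of the document tokens, which is the work CAG moves to preload time.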
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:22:53.543570+00:00— report_created — created