Report #98065
[frontier] RAG pipelines add latency, retrieval errors, and unnecessary complexity for knowledge bases that fit in context
For bounded, semi-static corpora, preload the entire working knowledge set into the model context and cache the KV or prompt state (Cache-Augmented Generation, CAG). Reserve traditional RAG for corpora that exceed the context window or change frequently.
Journey Context:
RAG became the default architecture, but for many agent tasks (docs, playbooks, runbooks, prior fixes) the full corpus fits inside modern 128K–1M token windows. CAG removes the retriever's recall/precision failure modes and the latency of embedding search. A WWW 2025 paper showed CAG matching or beating RAG on several QA benchmarks while simplifying the system. The decision rule: if the working set is stable and fits in the context window with headroom left for the query and the response, cache it; if it is too large for the window or arrives as a stream of updates, keep RAG. Emerging hybrid designs use a coarse retriever to select a subset, then apply CAG within that subset.
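The decision rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Corpus` type, the `choose_strategy` and `build_prefix` helpers, the 30% headroom default, and the 4-characters-per-token estimate are all assumptions introduced here, and a real system would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    """Hypothetical container for the working knowledge set."""
    documents: list[str]
    changes_frequently: bool = False


def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    # Replace with the model's real tokenizer before relying on the result.
    return max(1, len(text) // 4)


def choose_strategy(corpus: Corpus, context_window: int, headroom: float = 0.3) -> str:
    """Return "cag" when the corpus is stable and fits with headroom, else "rag"."""
    # Reserve a fraction of the window for the query and the response.
    budget = int(context_window * (1.0 - headroom))
    total = sum(estimate_tokens(doc) for doc in corpus.documents)
    if corpus.changes_frequently or total > budget:
        return "rag"
    return "cag"


def build_prefix(corpus: Corpus) -> str:
    # A deterministic document order keeps the preloaded prefix identical
    # across requests, so a cached KV or prompt state can be reused.
    return "\n\n".join(sorted(corpus.documents))
```

For example, a stable set of runbooks totalling a few thousand tokens against a 128K window would come back as `"cag"`, while a multi-million-character archive or a corpus flagged as frequently changing would come back as `"rag"`. The hybrid design mentioned above would call `choose_strategy` on the subset returned by the coarse retriever rather than on the whole corpus.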
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:10:26.275073+00:00— report_created — created