Report #5005
[architecture] When should I stuff context into the prompt vs. use RAG?
If your working corpus is below ~200k tokens (roughly 500 pages), put the whole corpus in the prompt and rely on prompt caching. Switch to retrieval only when the corpus exceeds that threshold, or when full-context inference costs more than the retrieval failures you would otherwise risk.
Journey Context:
Teams often build vector stores prematurely for small document sets, paying retrieval latency and fragmenting otherwise coherent context. Modern long-context models plus prompt caching have moved the breakpoint upward. Anthropic's research found that for corpora under ~200k tokens, full-context prompting is often simpler and more accurate, while contextual retrieval shines at scale: it reduced retrieval failures by 49%, and by 67% when combined with reranking. The wrong call is treating RAG as the default before measuring corpus size and access patterns.
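The decision rule above can be sketched as a quick pre-check to run before building any retrieval infrastructure. This is a minimal illustration, not a definitive implementation: the 4-characters-per-token ratio is a rough heuristic for English text (use the model's own tokenizer for a real count), and the names `estimate_tokens`, `choose_strategy`, and `retrieval_is_cheaper`, along with the cost parameters, are hypothetical and introduced here only for illustration.

```python
# Assumption: ~4 characters per token is a rough heuristic for English text.
# Replace with the model's real tokenizer for an exact count.
CHARS_PER_TOKEN = 4

# Threshold taken from the report (~200k tokens, roughly 500 pages).
FULL_CONTEXT_LIMIT_TOKENS = 200_000


def estimate_tokens(documents: list[str]) -> int:
    """Rough token estimate for the whole corpus."""
    total_chars = sum(len(doc) for doc in documents)
    return total_chars // CHARS_PER_TOKEN


def choose_strategy(documents: list[str],
                    limit: int = FULL_CONTEXT_LIMIT_TOKENS) -> str:
    """Return 'full_context' if the corpus is below the limit, else 'retrieval'."""
    if estimate_tokens(documents) < limit:
        return "full_context"
    return "retrieval"


def retrieval_is_cheaper(full_context_cost: float,
                         retrieval_cost: float,
                         failure_rate: float,
                         cost_per_failure: float) -> bool:
    """Compare per-query costs (hypothetical units, measured by you).

    Retrieval wins only if its own cost plus the expected cost of
    retrieval failures is still below the cost of a full-context call.
    """
    expected_retrieval_cost = retrieval_cost + failure_rate * cost_per_failure
    return expected_retrieval_cost < full_context_cost


if __name__ == "__main__":
    small_corpus = ["x" * 400_000]    # about 100k tokens under the heuristic
    large_corpus = ["x" * 1_200_000]  # about 300k tokens under the heuristic
    print(choose_strategy(small_corpus))
    print(choose_strategy(large_corpus))
```

The size check comes first because it is cheap to measure. The cost comparison only matters once you have real per-query numbers for your own workload, including the effect of prompt caching on the full-context cost.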
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:30:33.028648+00:00— report_created — created