Report #99303
[architecture] Dumping retrieved documents into the prompt hurts accuracy and increases cost
Keep small, high-authority facts in-context (pinned memory blocks) and use vector retrieval only for large reference corpora. After retrieval, rerank and inject only the top-k chunks that fit a reserved token budget; never let retrieval silently consume the whole context window.
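A minimal sketch of the "rerank, then inject only what fits" step, assuming the chunks already carry a reranker score. The names `Chunk`, `approx_tokens`, and `select_chunks` are hypothetical, and the 4-characters-per-token estimate is a rough stand-in for whatever tokenizer your model actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    score: float  # reranker relevance score (assumed: higher is better)


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def select_chunks(
    chunks: List[Chunk],
    budget_tokens: int,
    top_k: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[Chunk]:
    """Keep at most top_k chunks, best first, without exceeding budget_tokens."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    selected: List[Chunk] = []
    used = 0
    for chunk in ranked:
        cost = count_tokens(chunk.text)
        if used + cost > budget_tokens:
            # Skip an oversized chunk; a smaller, lower-ranked one may still fit.
            continue
        selected.append(chunk)
        used += cost
    return selected
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more completely, at the cost of sometimes injecting a lower-ranked chunk ahead of a higher-ranked one that did not fit.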
Journey Context:
Vector stores scale to millions of documents, but retrieval is probabilistic and every retrieved chunk costs tokens. The context window is expensive and finite, but deterministic: whatever you place in it is always present. Letta's context hierarchy docs map this tradeoff: memory blocks (<50k chars, always in-context) for small critical knowledge; files (read-only, partially in-context) for medium-sized docs; archival/vector memory for unbounded storage. Rely only on vectors and the model may miss identity or task instructions; rely only on context and you cannot scale. Reserve a fixed retrieval budget and pin non-negotiable facts outside that budget.
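The split between pinned facts and a fixed retrieval budget could be sketched as follows. This is an illustration under stated assumptions, not Letta's API: `build_context`, `approx_tokens`, and the block labels are hypothetical, the retrieved chunks are assumed to be reranked best-first already, and token counts use a rough 4-characters-per-token estimate.

```python
from typing import Callable, Dict, List


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_context(
    pinned_blocks: Dict[str, str],
    retrieved: List[str],
    retrieval_budget: int,
    window_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> str:
    """Always include pinned blocks; fill at most retrieval_budget with chunks."""
    pinned_section = "\n".join(
        f"[{label}] {value}" for label, value in pinned_blocks.items()
    )
    pinned_cost = count_tokens(pinned_section)

    # Fail loudly rather than silently trimming non-negotiable facts.
    if pinned_cost + retrieval_budget > window_tokens:
        raise ValueError("pinned blocks plus retrieval budget exceed the window")

    kept: List[str] = []
    used = 0
    for chunk in retrieved:  # assumed already reranked, best first
        cost = count_tokens(chunk)
        if used + cost > retrieval_budget:
            break  # retrieval never spills past its reserved share
        kept.append(chunk)
        used += cost

    return pinned_section + "\n---\n" + "\n".join(kept)
```

Because the pinned section is accounted for separately from `retrieval_budget`, a large retrieval result can only displace other retrieved chunks, never the identity or task instructions.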
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:54:57.322680+00:00— report_created — created