Report #1595
[architecture] Retrieving too many vector embeddings and blowing out the context window
Use a two-stage retrieval pipeline: a fast vector search (top-k), followed by a cross-encoder or LLM-based relevance filter, before injecting results into the context window. Cap injected memory at a strict token limit (e.g., 20% of the context window).
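A minimal sketch of that pipeline, under stated assumptions: the names `retrieve`, `keyword_overlap`, and `count_tokens` are hypothetical, the in-memory list stands in for a real vector store, the keyword-overlap scorer stands in for a cross-encoder or LLM relevance call, and tokens are approximated by whitespace splitting rather than a real tokenizer.

```python
import math
from typing import Callable, List, Sequence, Tuple

Memory = Tuple[str, Sequence[float]]  # (text, embedding); hypothetical record shape


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def count_tokens(text: str) -> int:
    """Crude whitespace approximation (assumption); swap in the model's tokenizer."""
    return len(text.split())


def keyword_overlap(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def retrieve(
    query_text: str,
    query_vec: Sequence[float],
    store: List[Memory],
    rerank: Callable[[str, str], float] = keyword_overlap,
    top_k: int = 20,
    min_score: float = 0.5,
    context_window: int = 8192,
    budget_fraction: float = 0.2,
) -> List[str]:
    # Stage 1: cheap vector search, keep the top_k nearest candidates.
    candidates = sorted(
        store, key=lambda m: cosine(query_vec, m[1]), reverse=True
    )[:top_k]

    # Stage 2: score each candidate against the query and drop low-relevance ones.
    scored = [(rerank(query_text, text), text) for text, _ in candidates]
    kept = sorted(
        (s for s in scored if s[0] >= min_score), key=lambda s: s[0], reverse=True
    )

    # Budget: admit memories in relevance order until the token cap is reached.
    budget = int(context_window * budget_fraction)
    selected: List[str] = []
    used = 0
    for _, text in kept:
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # too large for the remaining budget; a smaller one may still fit
        selected.append(text)
        used += cost
    return selected
```

Called as `retrieve("reset password", query_vec, store)`, this returns only the memories that pass both stages and fit within 20% of the assumed 8192-token window. Skipping an oversized item instead of stopping lets shorter, lower-ranked memories use the leftover budget; stopping at the first overflow would be the stricter alternative.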
Journey Context:
Naive RAG stuffs the top-k results directly into the prompt. This wastes context window space on irrelevant tokens, increases latency, and degrades instruction-following due to the 'lost in the middle' effect, where models use information buried in the middle of a long context less reliably than information near the start or end. The context window is a scarce, expensive resource. Filtering ensures that only high-signal, task-relevant memory makes it in, preserving space for the agent's reasoning steps.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T04:31:49.631157+00:00— report_created — created