Report #9984
[architecture] Injecting too many retrieved memories into the LLM prompt
Set a strict token budget for retrieved memory (e.g., 500 tokens), and use an LLM call to compress or summarize the retrieved chunks before injecting them into the final generation prompt.
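Here is a minimal Python sketch of this pipeline. It assumes retrieved chunks arrive as (relevance_score, text) pairs. `count_tokens` approximates tokens by splitting on whitespace, so a real deployment should swap in the target model's tokenizer. `compress` is a hypothetical callable that stands in for the LLM summarization call. The `candidate_multiplier` parameter and the shortcut that skips compression when the raw chunks already fit are illustrative choices. They are not part of the original report.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    # Rough approximation: whitespace-separated words, not real model tokens.
    return len(text.split())


def truncate_to_budget(text: str, budget: int) -> str:
    # Hard cap: keep at most `budget` whitespace-separated words.
    return " ".join(text.split()[:budget])


def build_memory_context(
    scored_chunks: List[Tuple[float, str]],
    compress: Callable[[List[str], int], str],  # hypothetical LLM summarizer
    budget: int = 500,
    candidate_multiplier: int = 3,  # illustrative: compressor input cap
) -> str:
    """Select top memories, compress them if needed, and enforce the budget."""
    # Highest relevance first.
    ranked = [text for _, text in sorted(scored_chunks, key=lambda p: p[0], reverse=True)]

    # Let the compressor see more than the final budget, but cap its input too.
    candidates: List[str] = []
    used = 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost > budget * candidate_multiplier:
            break  # stop at the first chunk that would overflow, keeping rank order
        candidates.append(chunk)
        used += cost

    if not candidates:
        return ""
    if used <= budget:
        # Already within budget: skip the extra LLM call.
        return "\n".join(candidates)

    summary = compress(candidates, budget)
    # Enforce the budget in code; an LLM summarizer may overshoot the limit.
    return truncate_to_budget(summary, budget)


if __name__ == "__main__":
    memories = [
        (0.9, "User prefers metric units."),
        (0.2, "User once asked about tea."),
        (0.7, "User lives in Oslo."),
    ]

    def naive_compress(chunks: List[str], budget: int) -> str:
        # Stand-in for an LLM call: just concatenates the chunks.
        return " ".join(chunks)

    print(build_memory_context(memories, naive_compress, budget=8))
    # prints: User prefers metric units. User lives in Oslo.
```

The final truncation works as a safety net. An LLM that is asked to stay under 500 tokens can still overshoot, so the budget is enforced in code as well as in the prompt.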
Journey Context:
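More context does not automatically mean better answers. Stuffing the prompt with every top-K memory distracts the LLM and triggers the "lost in the middle" effect: models make less use of relevant information placed in the middle of a long context than of information at its start or end. A compression step keeps the injected memory short and focused, which improves the signal-to-noise ratio of the prompt.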
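One way to make the compression step raise the signal-to-noise ratio is to condition the summarization prompt on the user's question, so the compressor keeps only facts relevant to that question. The sketch below assumes a generic `llm` callable. This is a hypothetical prompt-in, text-out wrapper around whatever model client you use. The prompt wording, the word limit (words stand in for tokens here), and the NONE convention are all illustrative assumptions. None of them come from the report.

```python
from typing import Callable, List


def build_compression_prompt(query: str, chunks: List[str], max_words: int) -> str:
    # Number each memory so the model can refer to and weigh them separately.
    numbered = "\n".join(f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Compress the retrieved memories below for another model.\n"
        f"User question: {query}\n\n"
        f"Memories:\n{numbered}\n\n"
        f"Write at most {max_words} words, keeping only facts that help answer "
        "the question. Omit everything else. If nothing is relevant, reply NONE."
    )


def compress_memories(
    query: str, chunks: List[str], max_words: int, llm: Callable[[str], str]
) -> str:
    # `llm` is a hypothetical prompt-in, text-out client wrapper.
    reply = llm(build_compression_prompt(query, chunks, max_words)).strip()
    # An explicit NONE means: inject no memory rather than noise.
    return "" if reply.upper() == "NONE" else reply
```

When the function returns an empty string on NONE, the caller can omit the memory section entirely. For a query that no stored memory addresses, injecting nothing is usually better than injecting loosely related memories.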
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:37:09.600784+00:00— report_created — created