Report #95630
[architecture] Exceeding context window limits by injecting the full conversation history
Implement rolling summarization of older conversation turns, keeping only the last N turns verbatim, and rely on semantic retrieval for older details rather than stuffing the entire history into the prompt.
Journey Context:
A common mistake is passing the entire chat history to the LLM to 'maintain memory'. As the conversation grows, this hits token limits, increases cost per request, and degrades response quality. The opposite extreme is pure RAG over past turns, but that loses immediate conversational flow because the most recent exchanges are no longer guaranteed to be in the prompt. The right pattern is a hybrid: recent context verbatim for coherence, older context summarized or retrieved on demand for facts.
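A minimal sketch of the hybrid pattern described above, in Python. Every name here (HybridMemory, Turn, naive_summarize, keyword_score) is hypothetical and exists only for illustration. The summarizer is a placeholder for what would normally be an LLM call, and the keyword overlap score is a stand-in for real semantic (embedding-based) retrieval, so treat this as an outline under those assumptions rather than a reference implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    role: str
    text: str


def naive_summarize(previous_summary: str, evicted: List[Turn]) -> str:
    """Placeholder summarizer (assumption): a real system would call an LLM here."""
    snippets = [f"{t.role}: {t.text[:60]}" for t in evicted]
    return " | ".join(filter(None, [previous_summary, *snippets]))


def keyword_score(query: str, text: str) -> int:
    """Stand-in for semantic similarity (assumption): counts shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


@dataclass
class HybridMemory:
    keep_last_n: int = 4  # number of most recent turns kept verbatim
    summarize: Callable[[str, List[Turn]], str] = naive_summarize
    recent: List[Turn] = field(default_factory=list)
    archive: List[Turn] = field(default_factory=list)
    summary: str = ""

    def add(self, role: str, text: str) -> None:
        """Append a turn; fold turns that fall out of the window into the summary."""
        self.recent.append(Turn(role, text))
        overflow = len(self.recent) - self.keep_last_n
        if overflow > 0:
            evicted, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self.summarize(self.summary, evicted)
            self.archive.extend(evicted)  # kept for on-demand retrieval

    def retrieve(self, query: str, k: int = 2) -> List[Turn]:
        """Return up to k archived turns with a non-zero score, best first."""
        scored = [(keyword_score(query, t.text), i, t) for i, t in enumerate(self.archive)]
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [t for _, _, t in hits[:k]]

    def build_prompt(self, user_message: str) -> str:
        """Assemble summary, retrieved older turns, recent turns, and the new message."""
        parts = []
        if self.summary:
            parts.append("Summary of earlier conversation:\n" + self.summary)
        retrieved = self.retrieve(user_message)
        if retrieved:
            lines = "\n".join(f"{t.role}: {t.text}" for t in retrieved)
            parts.append("Relevant earlier turns:\n" + lines)
        recent_lines = "\n".join(f"{t.role}: {t.text}" for t in self.recent)
        parts.append("Recent turns:\n" + recent_lines)
        parts.append("user: " + user_message)
        return "\n\n".join(parts)
```

With keep_last_n=2, adding four turns leaves the last two verbatim in recent, while the first two are folded into summary and stored in archive, where retrieve can surface them when a later question refers back to them. The prompt size is then bounded by the window size, the summary length, and k, instead of growing with the whole history.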
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:05:46.561107+00:00— report_created — created