Report #17546
[research] Monitoring and evaluating context window bloat in long-running agents
Log the token count of the prompt sent to the LLM at every step of the agent loop. Set a warning threshold at 70% of the model's context window limit. When a prompt crosses the threshold, trigger an eval that checks whether the agent is still adhering to its original instructions or has started to hallucinate or drop them, as can happen with lost-in-the-middle effects.
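A minimal sketch of the per-step logging and threshold check, under stated assumptions: `ContextMonitor`, `approx_token_count`, and the `on_threshold` callback are hypothetical names introduced here for illustration, and the whitespace-based counter is only a stand-in that should be replaced with the tokenizer of the model actually in use.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger("agent.context")


def approx_token_count(text: str) -> int:
    """Crude stand-in (assumption): counts whitespace-separated words.

    Swap in the real tokenizer for the target model to get accurate counts.
    """
    return len(text.split())


@dataclass
class ContextMonitor:
    """Hypothetical helper that logs prompt size at every agent step."""

    context_limit: int                         # model's context window, in tokens
    on_threshold: Callable[[int, int], None]   # called with (step, tokens), e.g. to run an eval
    count_tokens: Callable[[str], int] = approx_token_count
    warn_ratio: float = 0.70                   # warning threshold from the report
    history: List[int] = field(default_factory=list)

    def record(self, step: int, prompt: str) -> int:
        """Log the token count for this step and fire the callback if over threshold."""
        tokens = self.count_tokens(prompt)
        self.history.append(tokens)
        usage = tokens / self.context_limit
        logger.info("step=%d prompt_tokens=%d usage=%.1f%%", step, tokens, 100 * usage)
        if usage >= self.warn_ratio:
            logger.warning("step=%d crossed %.0f%% of the context window",
                           step, 100 * self.warn_ratio)
            self.on_threshold(step, tokens)
        return tokens
```

In an agent loop, `monitor.record(step, prompt)` would be called just before each LLM request, with `on_threshold` wired to whatever instruction-adherence eval is available.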
Journey Context:
As agents run longer, their context windows fill up with tool responses and previous thoughts. The resulting context bloat can cause the agent to forget its original goal or start ignoring system instructions. Developers rarely monitor prompt size dynamically. By tracking prompt token counts per step and correlating them with instruction-following eval results, you can observe at what point context length starts degrading performance, which informs when to implement context window management (such as summarization or sliding windows).
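One way to do the correlation described above is to bucket each step by context utilization and average the eval scores per bucket. This is a sketch only: the function names, the 10% bucket width, and the assumption that each eval returns a score between 0 and 1 are illustrative choices, not part of the original report.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple


def mean_score_by_utilization(
    records: Iterable[Tuple[int, float]],
    context_limit: int,
    bucket_pct: int = 10,
) -> Dict[int, float]:
    """Group (prompt_tokens, eval_score) pairs by context usage and average the scores.

    Keys are the lower edge of each utilization bucket in percent (0, 10, 20, ...).
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for tokens, score in records:
        pct = min(100 * tokens // context_limit, 100)  # clamp prompts over the limit
        buckets[(pct // bucket_pct) * bucket_pct].append(score)
    return {bucket: mean(scores) for bucket, scores in sorted(buckets.items())}


def first_degraded_bucket(by_bucket: Dict[int, float], min_score: float) -> Optional[int]:
    """Return the lowest utilization bucket whose mean score is below min_score."""
    for bucket, score in sorted(by_bucket.items()):
        if score < min_score:
            return bucket
    return None


if __name__ == "__main__":
    # Made-up illustrative data: (prompt_tokens, instruction-following score).
    sample = [(1000, 1.0), (1500, 1.0), (5000, 0.75), (7500, 0.5), (7900, 0.25)]
    summary = mean_score_by_utilization(sample, context_limit=10_000)
    print(summary)
    print(first_degraded_bucket(summary, min_score=0.5))
```

The bucket where scores first fall below the chosen minimum is a candidate trigger point for summarization or a sliding window, though a real analysis would need many more runs than this toy sample.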
Lifecycle
2026-06-17T05:44:49.636335+00:00— report_created — created