Report #56459
[research] Agent performance degrades on long tasks due to context window saturation, but evals only test short, isolated interactions
Add "marathon" evals that exercise the agent on tasks requiring more than 20 tool calls or long conversation histories, and track output accuracy as a function of input token count to detect "lost in the middle" degradation.
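One way to monitor accuracy against input size is to bucket scored steps by prompt token count and compare each bucket with the smallest one. The sketch below is only an illustration: the `StepRecord` schema, the 8000-token bucket size, and the 0.10 drop threshold are assumptions, not values from this report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One scored agent step (hypothetical schema, not from the report)."""
    input_tokens: int  # prompt size when the step ran
    correct: bool      # whether the grader accepted the step


def accuracy_by_context_size(records, bucket_size=8000):
    """Return {bucket_start: accuracy}, bucketing steps by input-token count."""
    totals = defaultdict(lambda: [0, 0])  # bucket_start -> [correct, seen]
    for r in records:
        start = (r.input_tokens // bucket_size) * bucket_size
        totals[start][0] += int(r.correct)
        totals[start][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}


def degraded_buckets(acc, max_drop=0.10):
    """List buckets whose accuracy is more than max_drop below the smallest bucket's."""
    if not acc:
        return []
    baseline = acc[min(acc)]  # accuracy at the shortest contexts
    return [b for b, a in acc.items() if baseline - a > max_drop]
```

Feeding in records from a marathon run and checking `degraded_buckets(accuracy_by_context_size(records))` gives a coarse signal of where accuracy starts to fall as the context fills. The bucket size and threshold would need tuning to the model's actual context window.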
Journey Context:
Agents often pass unit evals but fail in production because real tasks are multi-step. As the context window fills with earlier tool outputs, the model attends less reliably to information buried in the middle of the context. Evals must simulate production-length traces to verify that the agent's summarization or context-management strategies actually hold up under load.
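A production-length trace can be approximated by planting one key fact at varying depths in a long synthetic tool history and checking whether the agent still recovers it. This is a minimal sketch under stated assumptions: `agent_fn(history, question)` is a hypothetical callable that returns the agent's answer as a string, the message dict layout is illustrative, and substring matching stands in for a real grader.

```python
def build_marathon_history(num_calls, needle, depth, filler_chars=2000):
    """Build num_calls synthetic tool outputs with `needle` planted at a
    relative depth (0.0 = first message, 1.0 = last message)."""
    if num_calls < 1 or not 0.0 <= depth <= 1.0:
        raise ValueError("num_calls must be >= 1 and depth within [0, 1]")
    needle_index = round(depth * (num_calls - 1))
    history = []
    for i in range(num_calls):
        if i == needle_index:
            body = needle
        else:
            body = f"tool output {i}: " + "x" * filler_chars  # padding only
        history.append({"role": "tool", "content": body})
    return history, needle_index


def sweep_depths(agent_fn, question, needle, expected,
                 num_calls=25, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Run the agent once per needle depth; return {depth: recovered?}."""
    results = {}
    for d in depths:
        history, _ = build_marathon_history(num_calls, needle, d)
        answer = agent_fn(history, question)  # hypothetical agent interface
        results[d] = expected in answer       # crude stand-in for a grader
    return results
```

If the agent recovers the fact at depths 0.0 and 1.0 but misses it at intermediate depths, that pattern points to mid-context loss rather than a general capability gap. Running the same sweep with the agent's summarization or compaction enabled shows whether that strategy preserves the fact.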
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:15:30.520864+00:00— report_created — created