Report #39505
[research] Agent performance degrades on long tasks due to context window saturation, but evals only check final answers
Add context window utilization as a core eval metric. Assert that the agent's context usage stays below a token threshold (e.g., 80% of the context limit) at every point in the trace before the task completes, and evaluate its summarization or sub-tasking behavior as it approaches the limit.
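A minimal sketch of the threshold assertion, assuming the eval harness records the prompt and completion token counts for each model call. The `ModelCall` schema and the names `peak_utilization` and `check_context_budget` are hypothetical and not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class ModelCall:
    """One model invocation in an agent trace (hypothetical schema)."""
    prompt_tokens: int      # tokens sent to the model on this call
    completion_tokens: int  # tokens the model generated on this call


def peak_utilization(trace: list[ModelCall], context_limit: int) -> float:
    """Return the largest fraction of the context window used by any single call."""
    if context_limit <= 0:
        raise ValueError("context_limit must be positive")
    return max(
        ((c.prompt_tokens + c.completion_tokens) / context_limit for c in trace),
        default=0.0,  # an empty trace uses no context
    )


def check_context_budget(
    trace: list[ModelCall], context_limit: int, threshold: float = 0.8
) -> bool:
    """True if the agent stayed at or under the threshold for the whole task."""
    return peak_utilization(trace, context_limit) <= threshold


# Illustrative traces against an assumed 8,000-token context limit.
ok_trace = [ModelCall(1200, 300), ModelCall(5200, 400), ModelCall(6000, 200)]
bad_trace = [ModelCall(1200, 300), ModelCall(7000, 500)]

assert check_context_budget(ok_trace, context_limit=8000)
assert not check_context_budget(bad_trace, context_limit=8000)
```

Checking the peak rather than the final call matters here: an agent that summarizes late can end under the threshold after having exceeded it mid-task.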
Journey Context:
Agents often fail not for lack of capability but because they fill the context window with irrelevant tool responses, which leads to "lost in the middle" degradation, where the model attends poorly to information buried mid-context. Standard evals see only a failure or a bad answer. Tracing the token count per turn and evaluating how the agent manages context catches context-management failures before they surface as truncated inputs, API errors, or degraded reasoning.
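Per-turn tracing could be sketched as follows, assuming each message in the trace carries a role and a token count. The `Message` schema and both helper functions are hypothetical. The first attributes context usage to roles, so a trace dominated by tool responses stands out. The second flags calls where the prompt shrank relative to the previous call, which is treated here as a rough signal that the agent summarized or dropped context.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Message:
    """One message in an agent trace (hypothetical schema)."""
    role: str    # e.g. "user", "assistant", "tool"
    tokens: int  # token count of this message


def token_share_by_role(messages: list[Message]) -> dict[str, float]:
    """Return the fraction of all trace tokens contributed by each role."""
    totals: Counter[str] = Counter()
    for m in messages:
        totals[m.role] += m.tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {role: n / grand_total for role, n in totals.items()}


def compaction_turns(prompt_tokens_per_call: list[int]) -> list[int]:
    """Return indices of calls whose prompt is smaller than the previous call's."""
    p = prompt_tokens_per_call
    return [i for i in range(1, len(p)) if p[i] < p[i - 1]]


# Illustrative data: tool output dominates, and the agent compacts once.
messages = [Message("user", 100), Message("assistant", 200), Message("tool", 1700)]
shares = token_share_by_role(messages)
compactions = compaction_turns([1000, 3000, 6000, 2500, 4000])
```

An eval could then assert, for example, that at least one compaction occurs before utilization crosses the threshold, or that the tool share stays under a budget chosen for the task.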
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:47:09.110243+00:00— report_created — created