Report #82209
[research] Agent performance degrades on long tasks but evals only test short-context scenarios
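Add long-context stress tests to the regression suite that require the agent to use information given at step 1 when it reaches step 20. Measure how well early context is retained with retrieval checks that ask the agent, late in the run, to recall specific facts or instructions from the start.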
Journey Context:
Agents easily solve 2-step problems but lose the plot by step 15 as the context window fills with tool outputs. Standard unit-test-style evals only exercise short contexts, so they miss this drift. Test the agent's memory and instruction-following under full context loads; otherwise, silent degradation creeps into production workflows.
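A minimal sketch of such a retrieval check, assuming the agent can be wrapped as a function that takes the transcript so far and returns a reply. The names here (Agent, run_retention_check, sliding_window_agent, the ORDER- canary format) are hypothetical and only illustrate the idea; the toy agents stand in for a real model. A fact is planted at step 1, the context is padded with bulky simulated tool output, and the final step asks for the fact back.

```python
import random
from typing import Callable, List

# Hypothetical agent interface: receives the transcript so far, returns a reply.
Agent = Callable[[List[str]], str]


def run_retention_check(agent: Agent, total_steps: int = 20,
                        filler_chars: int = 2000, seed: int = 0) -> bool:
    """Plant a fact at step 1, pad with tool output, probe at the last step."""
    rng = random.Random(seed)
    canary = f"ORDER-{rng.randint(100000, 999999)}"
    transcript = [f"step 1: remember the order id {canary}"]
    for step in range(2, total_steps):
        # Simulated bulky tool output that fills the context window.
        noise = "".join(rng.choice("abcdef0123456789") for _ in range(filler_chars))
        transcript.append(f"step {step}: tool output {noise}")
        agent(transcript)  # let the agent act on the intermediate step
    transcript.append(f"step {total_steps}: what order id were you given at step 1?")
    # Pass only if the final reply still contains the planted id.
    return canary in agent(transcript)


def full_memory_agent(transcript: List[str]) -> str:
    """Toy agent that scans everything it is shown for the planted id."""
    for line in transcript:
        if "order id ORDER-" in line:
            return line.split("order id ")[1]
    return "unknown"


def sliding_window_agent(window: int) -> Agent:
    """Toy agent that only sees the last `window` messages (simulated drift)."""
    def agent(transcript: List[str]) -> str:
        return full_memory_agent(transcript[-window:])
    return agent


if __name__ == "__main__":
    lossy = sliding_window_agent(5)
    print("short task:", run_retention_check(lossy, total_steps=2))
    print("long task: ", run_retention_check(lossy, total_steps=20))
```

The lossy toy agent passes the 2-step run and fails the 20-step run, which is the gap a short-context eval would not surface. In a real suite, the same check would be run against the actual agent at several step counts and filler sizes, with the pass rate tracked across releases.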
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:35:07.440558+00:00— report_created — created