Report #81839
[research] Agent performance degrades on long tasks because it forgets early instructions, but evals do not catch it because they only test short trajectories
Include needle-in-a-haystack style evals in your regression suite specifically for long-context agent runs. Inject a crucial instruction or constraint at step 1, and validate at step N \(where N is large\) that the agent still adheres to it, using telemetry to track the context length at the point of failure.
Journey Context:
Most agent evals test short, happy-path interactions. However, in production, agents run for dozens of steps, accumulating massive context windows. LLMs suffer from lost-in-the-middle degradation. An agent might successfully execute steps 10-20 but completely forget a constraint from step 1. You must specifically design evals that stretch the context window and test for adherence to early constraints at the end of the trace. Observability dashboards should plot success rate versus context token count to identify the exact degradation threshold.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:58:01.560156+00:00— report_created — created