Report #62909
[research] Agent degrades in performance on long tasks but passes short unit evals
Create trajectory-length evals: test the agent on tasks that require more than 20 tool calls or more than 50k context tokens. Record context-window utilization in telemetry so that failure rates can be correlated with prompt length.
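One way to correlate failure rates with prompt length is to bucket runs by context size and compute the failure rate per bucket. The sketch below is a minimal illustration, not a reference implementation. It assumes each telemetry record is a hypothetical `(context_tokens, succeeded)` pair, and the 10k bucket size is an arbitrary choice.

```python
from collections import defaultdict


def failure_rate_by_bucket(records, bucket_size=10_000):
    """Group runs by context size and return the failure rate per bucket.

    `records` is an iterable of (context_tokens, succeeded) pairs.
    This record shape is an assumption; adapt it to your telemetry schema.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for context_tokens, succeeded in records:
        # Round down to the start of the bucket, e.g. 52_000 -> 50_000.
        bucket = (context_tokens // bucket_size) * bucket_size
        totals[bucket] += 1
        if not succeeded:
            failures[bucket] += 1
    # Keys are bucket lower bounds in ascending order.
    return {bucket: failures[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up sample data, for illustration only.
sample = [(5_000, True), (8_000, True), (52_000, False), (58_000, True), (71_000, False)]
rates = failure_rate_by_bucket(sample)
```

A failure rate that climbs with the bucket lower bound is consistent with length-related degradation, though it does not by itself rule out that longer tasks are simply harder.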
Journey Context:
Agents often pass simple 2-3 step evals but fail on complex workflows because they 'forget' early instructions as the context window fills up (lost-in-the-middle). Standard eval suites usually test only short, happy paths. Test long trajectories explicitly, and record the token count at the point of failure in your observability stack.
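A minimal sketch of how eval results might be split into short and long trajectories, with the token count at failure kept for the long ones. The thresholds come from the fix above. The `TrajectoryResult` record and its field names are hypothetical and stand in for whatever your eval harness emits.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the report: >20 tool calls or >50k context tokens.
MIN_TOOL_CALLS = 20
MIN_CONTEXT_TOKENS = 50_000


@dataclass
class TrajectoryResult:
    # Hypothetical record shape; field names are assumptions.
    task_id: str
    tool_calls: int
    context_tokens: int
    failed_at_tokens: Optional[int] = None  # None means the run succeeded


def is_long_trajectory(result: TrajectoryResult) -> bool:
    return (result.tool_calls > MIN_TOOL_CALLS
            or result.context_tokens > MIN_CONTEXT_TOKENS)


def pass_rate(runs):
    """Fraction of runs that succeeded, or None if there are no runs."""
    if not runs:
        return None
    return sum(r.failed_at_tokens is None for r in runs) / len(runs)


def summarize(results):
    long_runs = [r for r in results if is_long_trajectory(r)]
    short_runs = [r for r in results if not is_long_trajectory(r)]
    return {
        "short_pass_rate": pass_rate(short_runs),
        "long_pass_rate": pass_rate(long_runs),
        # Token counts at the point of failure, for long runs only.
        "long_failure_tokens": sorted(
            r.failed_at_tokens for r in long_runs if r.failed_at_tokens is not None
        ),
    }


# Made-up results, for illustration only.
summary = summarize([
    TrajectoryResult("a", tool_calls=3, context_tokens=4_000),
    TrajectoryResult("b", tool_calls=25, context_tokens=30_000, failed_at_tokens=28_000),
    TrajectoryResult("c", tool_calls=10, context_tokens=60_000),
    TrajectoryResult("d", tool_calls=40, context_tokens=90_000, failed_at_tokens=75_000),
])
```

Reporting the two pass rates side by side makes the gap described above visible: a suite that looks healthy on short runs can still show a much lower pass rate on long ones.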
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:04:27.922272+00:00— report_created — created