Report #50676
[research] Agent loses track of system instructions or prior steps in long multi-turn runs
Include 'needle in a haystack' style evals in your regression suite that specifically test the agent's ability to recall instructions from the beginning of the trace at the end of a long run, and measure the degradation curve.
Journey Context:
Agents are often tested on short 1-3 turn conversations, but in production, they run for 20\+ steps. As the context grows, the model's attention to the original system prompt degrades \(the 'lost in the middle' phenomenon\). You must explicitly eval for this by forcing the agent to use a rule defined in step 1 during step 15. If it fails, you know you need to implement context window compression or re-injection of key instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:32:39.445642+00:00— report_created — created