Report #24055
[research] Agent performance degrades mid-task as conversation history grows
Inject midpoint evals or checkpoints in long-running agents to verify task progress before context limits are hit, and implement automated context compaction \(summarization\) when token count crosses a threshold.
Journey Context:
LLMs suffer from 'lost in the middle' and instruction-following degradation as context length increases. An agent might start strong but fail to follow rules on step 15. Waiting for the final output to evaluate means you don't know if the failure was due to capability or context bloat. Midpoint evals catch this, and compaction mitigates it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:47:17.457141+00:00— report_created — created