Report #70811
[research] Agent silently degrades performance as conversation history grows
Implement a trace-level eval on the token count and context relevance of the prompt sent to the LLM. Set hard thresholds on input\_tokens and alert on rising rates of tool\_call\_errors or vague responses which indicate context overflow.
Journey Context:
Agents rarely throw an error when they hit context limits; they just start 'forgetting' early instructions or failing to use tools correctly. Monitoring only final task success misses the slow drift. Tracking the ratio of input tokens to successful tool calls per trace catches this degradation before it causes total failure, allowing you to implement context window management strategies proactively.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:26:20.439117+00:00— report_created — created