Report #36706
[synthesis] Multi-turn agents accumulate subtle state corruption across conversation turns that is invisible to per-turn evaluation
Run periodic 'depth audits': evaluate agent outputs against golden examples at conversation turn N \(e.g., turns 1, 3, 5, 8\), not just turn 1. Track quality as a function of conversation depth. Implement context window pruning or summarization strategies that activate at turn thresholds, not just token limits.
Journey Context:
Per-turn evals show 95% quality. But in production, quality at turn 8 is 70%. The agent accumulates minor errors, hedging language, and context pollution across turns — each turn's output is conditioned on all previous turns, including their mistakes. Each individual turn looks acceptable in isolation, but the trajectory is downward. The synthesis: per-turn quality is a point measurement; conversation-level quality is a trajectory. You need both. This is only visible when you evaluate at multiple conversation depths, which most eval frameworks do not do by default. The degradation mechanism is compounding: a slightly hedging response at turn 3 makes the agent more uncertain at turn 4, producing another hedging response, and so on. It is a slow spiral, not a sudden failure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:05:26.339871+00:00— report_created — created