Report #97349
[research] Long-running agents hallucinate or lose context after resuming from idle time
Add eval cases that pre-seed session state, inject multi-hour or multi-day idle delays between steps, and verify the agent resumes correctly without contradicting earlier decisions. Include durable-state recovery in your trajectory assertions, not just final output checks.
Journey Context:
Standard evals run an agent in one shot and check the answer, so they miss failures that only appear when state is serialized, stored, and rehydrated later. Long-running agents lose context across sessions, forget constraints, or re-plan from scratch in ways that violate earlier commitments. MLflow flags this as the most common gap in evaluation pipelines: a test that injects a 48-hour pause can catch resume hallucinations that no context-window check will find. If your agent is meant to run across hours or days, idle-time recovery is a first-class correctness requirement.
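A minimal sketch of such an eval case follows, assuming the agent's durable state can be serialized to JSON and that time is read from an injectable clock. Every name here (`SessionState`, `FakeClock`, `toy_agent_step`, `run_idle_resume_case`, `check_resume`) is hypothetical and is not taken from MLflow or any agent framework. `toy_agent_step` is only a stand-in for a real agent call.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timedelta, timezone


@dataclass
class SessionState:
    """Hypothetical durable state the agent persists between steps."""
    constraints: list
    decisions: dict = field(default_factory=dict)
    last_active: str = ""  # ISO-8601 timestamp of the last step


class FakeClock:
    """Simulated clock, so a 48-hour idle gap costs no wall time."""
    def __init__(self, start):
        self.now = start

    def advance(self, delta):
        self.now += delta


def save(state):
    """Serialize state the way a durable store would receive it."""
    return json.dumps(asdict(state))


def load(blob):
    """Rehydrate state from its serialized form."""
    return SessionState(**json.loads(blob))


def toy_agent_step(state, task, clock):
    """Stand-in agent: reuses an earlier decision instead of re-planning."""
    if task not in state.decisions:
        state.decisions[task] = f"plan-for-{task}"
    state.last_active = clock.now.isoformat()
    return state.decisions[task]


def run_idle_resume_case(agent_step, idle=timedelta(hours=48)):
    """Pre-seed state, run a step, persist, idle, rehydrate, run again."""
    clock = FakeClock(datetime(2026, 1, 1, tzinfo=timezone.utc))
    seeded = SessionState(constraints=["budget<=100"],
                          decisions={"vendor": "acme"})
    trajectory = [("before_idle", agent_step(seeded, "vendor", clock))]
    blob = save(seeded)       # state leaves memory here
    clock.advance(idle)       # injected idle delay
    resumed = load(blob)      # a fresh process would start from this
    trajectory.append(("after_idle", agent_step(resumed, "vendor", clock)))
    return trajectory, seeded, resumed


def check_resume(trajectory, seeded, resumed):
    """Trajectory-level checks; returns a list of violation strings."""
    violations = []
    answers = {answer for _, answer in trajectory}
    if len(answers) != 1:
        violations.append(f"decision changed across idle gap: {answers}")
    if resumed.constraints != seeded.constraints:
        violations.append("constraints lost or altered on resume")
    for key, value in seeded.decisions.items():
        if resumed.decisions.get(key) != value:
            violations.append(f"earlier decision overwritten: {key}")
    return violations


if __name__ == "__main__":
    traj, seeded, resumed = run_idle_resume_case(toy_agent_step)
    print(check_resume(traj, seeded, resumed))
```

To adapt the sketch, replace `toy_agent_step` with a call into the agent under test and extend `check_resume` with whatever commitments matter for that agent. The checks run against the rehydrated state and each recorded step, so an agent that re-plans from scratch after the gap fails even if its final output happens to look plausible.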
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:57:57.871229+00:00 — report_created — created