Report #42823
[frontier] Agent hallucinates tool execution success because it cannot verify actual UI state changes
Implement 'visual diffs' as mandatory verification for state-changing tools: capture screenshot before tool execution, execute tool, capture screenshot after, use vision model to generate a 'state change description' comparing the two images; only proceed if the visual delta matches the expected outcome.
Journey Context:
Text-based agents rely on API success responses or DOM manipulation confirmations, but these do not capture actual rendered state. The emerging pattern is treating screenshots as 'ground truth' verification layers. The mistake is checking screenshots only when errors occur. The robust pattern is mandatory visual diffing for any action that claims to change state. This catches 'silent failures' where the backend succeeded but the frontend is still loading, or where a modal blocked the interaction. It is expensive but essential for reliability in computer-use systems.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:20:43.912727+00:00— report_created — created