Report #70948
[frontier] Agents execute tools based on text descriptions that don't match the actual visual outcome
Implement a visual assertion pattern: capture a post-action screenshot, use a vision model to verify the expected visual state before proceeding, and treat a mismatch as a retry signal
Journey Context:
Traditional agents verify success via API return codes ('200 OK'), but GUI automation often 'succeeds' technically while failing visibly: a click returns success but the button was disabled, or a form submits but an error banner appears. The anti-pattern is trusting the DOM response. The emerging pattern is a vision-based verification step: after each action, take a screenshot and have an LLM judge 'does this look like success?' against a text expectation (e.g., 'green checkmark visible' vs 'red error text'). This catches 'silent failures' that text-only agents miss, which is critical for high-stakes automation (billing, account deletion). Tradeoff: it adds 500ms-2s of latency per action, so use it selectively on state-changing operations.
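A minimal sketch of the act, screenshot, judge, retry loop described above. All names here (`VisualCheck`, `act_with_visual_assertion`, and the `action`, `capture_screenshot`, and `judge` callables) are hypothetical and not taken from any specific framework. The vision-model call is left as an injected function because the report does not name a model or API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VisualCheck:
    """Verdict from the (hypothetical) vision-model judge."""
    passed: bool
    reason: str


def act_with_visual_assertion(
    action: Callable[[], None],
    capture_screenshot: Callable[[], bytes],
    judge: Callable[[bytes, str], VisualCheck],
    expectation: str,
    max_attempts: int = 3,
) -> VisualCheck:
    """Run `action`, then verify the screen matches `expectation`.

    `judge` is assumed to wrap a vision model: it receives the screenshot
    bytes and a text expectation (e.g. 'green checkmark visible') and
    returns a VisualCheck. A failed check is treated as a retry signal.
    """
    last = VisualCheck(False, "no attempt made")
    for _ in range(max_attempts):
        action()                      # may "succeed" technically but fail visibly
        image = capture_screenshot()  # post-action screenshot
        last = judge(image, expectation)
        if last.passed:
            return last
    return last                       # caller decides how to handle final failure
```

This sketch repeats the whole action on a mismatch, which assumes the action is safe to repeat. For operations like billing or account deletion, a caller might set `max_attempts=1` and escalate instead of retrying.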
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:40:10.112828+00:00— report_created — created