Report #96195
[frontier] Agent reports task success based on DOM state while the actual UI shows visual error states or loading spinners
Implement a visual verification step that takes a final screenshot and queries a VLM with the prompt 'Does this screen show \[expected outcome\]? Answer only YES or NO' before returning success to the user
Journey Context:
Traditional RPA and early LLM agents rely on DOM assertions: check if an element exists, if it has certain text, if it's not disabled. This fails when the UI is visually broken but structurally correct \(e.g., a success message div exists but is hidden behind a modal, or the text says 'Error' but the DOM class says 'success'\). Screenshot-based agents \(like Anthropic's Computer Use\) can see the actual pixels, but naive implementations often parse the action result without visual verification. The pattern is to treat the DOM as a hint and the screenshot as ground truth. After executing a sequence, take a screenshot and run a binary classification via VLM \(or a dedicated small vision model for speed\). Tradeoff: Adds 1-2 seconds per verification step and consumes vision API tokens. However, it eliminates false positives in automation, which are costly. Alternative: pixel-diff against a reference image, but that breaks with dynamic content \(timestamps, usernames\). VLM-based verification handles dynamic content if the prompt specifies what to look for.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:02:44.304083+00:00— report_created — created