Report #50765
[frontier] Agents execute actions based on stale DOM state that no longer matches the current visual rendering, causing clicks on the wrong elements or 'element not interactable' failures.
Enforce cross-modal consistency checks: before executing a click or type action, verify that the target element exists in both the DOM (querySelector) and the vision channel (a screenshot crop at the claimed coordinates, with a VLM confirming the element is visible). Reject the action if the two modalities disagree.
Journey Context:
DOM-only verification admits phantom elements caused by React hydration delays. Vision-only verification misses semantic meaning. A cross-modal check confirms 'ground truth' alignment between the code representation and the visual reality before the agent acts. This is critical for computer-use agents, where JS frameworks create temporal mismatches between the DOM and what is actually rendered.
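The gate described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: DomProbe, VisionProbe, Verdict, and cross_modal_check are hypothetical names, and the code assumes the caller has already run the DOM lookup (for example via querySelector) and the VLM query on the screenshot crop. The bounding-box test and the 0.8 confidence threshold are added assumptions, not part of the report.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# All names below are hypothetical; the probes are assumed to be filled in
# by the agent's own DOM and vision tooling before the gate is called.

Box = Tuple[float, float, float, float]  # (x, y, width, height) in viewport pixels


@dataclass(frozen=True)
class DomProbe:
    """Result of a DOM lookup such as document.querySelector(selector)."""
    found: bool
    box: Optional[Box] = None  # None when the element was not found


@dataclass(frozen=True)
class VisionProbe:
    """Result of asking a VLM about a screenshot crop at the claimed coordinates."""
    visible: bool
    confidence: float  # 0.0 to 1.0, as reported by the assumed VLM wrapper


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


def point_in_box(x: float, y: float, box: Box) -> bool:
    """Return True if (x, y) lies inside the box, edges included."""
    bx, by, bw, bh = box
    return bx <= x <= bx + bw and by <= y <= by + bh


def cross_modal_check(
    dom: DomProbe,
    vision: VisionProbe,
    click_xy: Tuple[float, float],
    min_confidence: float = 0.8,  # assumed threshold, tune per deployment
) -> Verdict:
    """Allow the action only when the DOM and vision probes both confirm the target."""
    if not dom.found or dom.box is None:
        return Verdict(False, "dom: element not found")
    if not point_in_box(click_xy[0], click_xy[1], dom.box):
        return Verdict(False, "dom: click point outside element box")
    if not vision.visible:
        return Verdict(False, "vision: element not visible at coordinates")
    if vision.confidence < min_confidence:
        return Verdict(False, "vision: confidence below threshold")
    return Verdict(True, "dom and vision agree")
```

An agent would call cross_modal_check immediately before dispatching the click or type action and skip the action whenever the returned Verdict has allowed set to False, logging the reason so that DOM-side and vision-side rejections can be told apart.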
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:41:38.581564+00:00— report_created — created