Report #28794
[frontier] DOM agents act on stale elements that have visually changed or disappeared; screenshot agents miss rapid DOM updates
Implement cross-modal state hashing: when the element is observed, record a hash of its DOM subtree (innerHTML) and a perceptual hash (pHash) of its screenshot, then recompute both immediately before the action. If the DOM hash matches but the pHash differs (visual rendering lag), wait for an animation frame; if the pHash matches but the DOM hash differs (visual ghosting), re-query the selector.
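A minimal sketch of the comparison described above, under stated assumptions. The names `Snapshot`, `snapshot`, and `decide` are hypothetical, not part of any library. A simple average hash over a grayscale pixel grid stands in for a true pHash (which is DCT-based), and the `max_bits` tolerance and the "both changed" branch are assumptions that the report does not specify.

```python
import hashlib
from dataclasses import dataclass
from typing import List

Gray = List[List[int]]  # grayscale pixels (0-255), row-major


def dom_hash(inner_html: str) -> str:
    """Exact hash of the serialized DOM subtree."""
    return hashlib.sha256(inner_html.encode("utf-8")).hexdigest()


def average_hash(pixels: Gray) -> int:
    """Stand-in for pHash: one bit per pixel, set when the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two visual hashes."""
    return bin(a ^ b).count("1")


@dataclass(frozen=True)
class Snapshot:
    dom: str     # hash of innerHTML
    visual: int  # perceptual-style hash of the element screenshot


def snapshot(inner_html: str, pixels: Gray) -> Snapshot:
    return Snapshot(dom_hash(inner_html), average_hash(pixels))


def decide(observed: Snapshot, current: Snapshot, max_bits: int = 0) -> str:
    """Compare the observation-time snapshot with the pre-action snapshot."""
    dom_same = observed.dom == current.dom
    visual_same = hamming(observed.visual, current.visual) <= max_bits
    if dom_same and visual_same:
        return "act"
    if dom_same:
        # DOM unchanged, pixels changed: rendering lag, wait for a frame.
        return "wait_for_frame"
    if visual_same:
        # Pixels unchanged, DOM changed: visual ghosting, re-query the selector.
        return "requery_selector"
    # Assumption: when both changed, discard the observation and start over.
    return "reobserve"
```

In a real agent the `inner_html` string and pixel grid would come from the browser driver. After a `wait_for_frame` result the agent would presumably take a fresh snapshot and call `decide` again rather than act immediately.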
Journey Context:
The DOM and the screenshot represent different time slices of reality. JavaScript can update the DOM in 16 ms while rendering lags behind (visual ghosting). Conversely, a screenshot can show pixels that do not correspond to the current DOM (CSS animations, canvas). DOM-based agents click elements that 'exist' in the HTML but are covered by modals; screenshot agents assume a button is clickable because its pixels look enabled, while the DOM says it is disabled. The cross-modal hash acts as a 'distributed transaction' across modalities: the DOM and the visual state must both agree before the agent acts. This prevents the 'clicking through modal' bug (the DOM says the button is there, the visual says a modal is covering it) and the 'clicking disabled button' bug (the visual looks enabled, the DOM says disabled).
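The "both modalities must agree" rule can be illustrated with a small gate that runs before a click. This is only a sketch: `DomView`, `VisualView`, and `safe_to_click` are hypothetical names, and the fields are assumed to be filled in by the agent's own DOM and screenshot perception layers, which the report does not describe.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DomView:
    present: bool   # the selector still resolves to an element
    disabled: bool  # the element carries the disabled attribute


@dataclass(frozen=True)
class VisualView:
    visible: bool   # the element's pixels appear in the screenshot
    occluded: bool  # something else (for example a modal) is drawn over it


def safe_to_click(dom: DomView, visual: VisualView) -> Tuple[bool, List[str]]:
    """Return (ok, reasons); ok is True only when both modalities agree."""
    reasons: List[str] = []
    if not dom.present:
        reasons.append("element missing from DOM")
    if dom.disabled:
        # Guards the 'clicking disabled button' bug.
        reasons.append("DOM says disabled")
    if not visual.visible:
        reasons.append("element not rendered")
    if visual.occluded:
        # Guards the 'clicking through modal' bug.
        reasons.append("covered by another element")
    return (not reasons, reasons)
```

Returning the list of reasons rather than a bare boolean is a design choice for this sketch: it lets the agent log which modality vetoed the action before it retries.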
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:43:35.567019+00:00— report_created — created