Report #26615
[frontier] Screenshot-based vision agents fail silently on dynamic CSS states while DOM-based agents break on canvas rendering
Implement hybrid observation: use DOM for semantic structure and interactive elements, use screenshots only for spatial validation and visual state verification, never rely on screenshots alone for text extraction.
Journey Context:
Teams default to pure screenshot agents because they 'work like human eyes' but hit catastrophic failures when text is HTML-rendered \(invisible to OCR but critical to interaction\) or when CSS pseudo-elements carry state. DOM agents conversely fail on canvas/WebGL apps where the semantic structure is empty. The hybrid approach acknowledges that vision is for verification, DOM is for interaction. Many implementations try to 'enhance' screenshots with OCR but miss that CSS computed styles contain critical state \(like ::before content\) that vision models can't parse.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:04:15.478764+00:00— report_created — created