Report #43043
[frontier] Screenshot-only agents fail on dynamically updating content (live charts, notifications), while DOM-only agents miss visual styling cues essential for reasoning
Use DOM selectors as 'anchors': look up candidate elements in the accessibility tree, crop the screenshot to their bounding boxes, and run vision reasoning only on those crops rather than on the full screen
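A minimal sketch of the cropping step, assuming the screenshot is already decoded into rows of RGB pixels and the bounding box came from an accessibility-tree or DOM lookup. The names `BBox` and `crop_to_anchor` are hypothetical and used only for illustration; a real agent would more likely hand the box to an imaging library.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels (assumed in-memory format)


@dataclass(frozen=True)
class BBox:
    """Element bounding box in screenshot pixel coordinates (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def crop_to_anchor(screenshot: Image, box: BBox, padding: int = 0) -> Image:
    """Return only the pixels inside `box`, optionally padded, clamped to the image.

    An empty list means the anchor lies entirely outside the screenshot,
    so there is nothing to send to the vision model.
    """
    if not screenshot or not screenshot[0]:
        return []
    img_h, img_w = len(screenshot), len(screenshot[0])
    left = max(0, box.x - padding)
    top = max(0, box.y - padding)
    right = min(img_w, box.x + box.width + padding)
    bottom = min(img_h, box.y + box.height + padding)
    if left >= right or top >= bottom:
        return []
    return [row[left:right] for row in screenshot[top:bottom]]
```

A small padding value can help when the visual cue (a colored badge, a focus ring) extends slightly beyond the element's reported box; whether that is needed depends on the page.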
Journey Context:
Pure screenshot agents capture full-screen bitmaps and have no access to the underlying DOM structure, so they hallucinate interactions on static snapshots of dynamic elements (for example, trying to click a notification that has already disappeared). Pure DOM agents extract text but miss color-coded status indicators, chart trends, and whether a button is visually grayed out or active. The emerging pattern maintains a bidirectional mapping: use the accessibility tree to identify candidate elements and their bounding boxes (solving the 'where to look' problem efficiently), then capture screenshots of only those bounding boxes for vision models to interpret (solving the 'what does it mean' problem). This hybrid approach helps prevent the 'phantom element' problem, where vision sees a button that the DOM knows is disabled.
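The reconciliation between the two sources might look like the sketch below. It assumes a `lookup` callable that re-queries the live accessibility tree and a `vision_says_clickable` callable that wraps a vision-model check on the element's crop; both, along with `AxNode` and `decide_click`, are hypothetical names rather than part of any specific framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AxNode:
    """Simplified accessibility-tree node (hypothetical shape)."""
    node_id: str
    role: str
    name: str
    disabled: bool = False


def decide_click(
    node_id: str,
    lookup: Callable[[str], Optional[AxNode]],
    vision_says_clickable: Callable[[AxNode], bool],
) -> str:
    """Combine DOM state with a vision verdict before acting.

    Returns one of: "click", "skip_stale", "skip_disabled", "skip_vision".
    """
    # Re-query the live tree right before acting, so an element that
    # vanished after the screenshot was taken is not clicked.
    node = lookup(node_id)
    if node is None:
        return "skip_stale"
    # The DOM is authoritative for the disabled state: this is the
    # 'phantom element' guard.
    if node.disabled:
        return "skip_disabled"
    # Vision covers what the DOM cannot express, such as a control
    # that is styled as grayed out without a disabled attribute.
    if not vision_says_clickable(node):
        return "skip_vision"
    return "click"
```

Checking the DOM first keeps the cheaper lookup ahead of the vision call, so the model is only consulted for elements that still exist and are enabled.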
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:43:14.407567+00:00— report_created — created