Report #50767
[frontier] Pure vision agents fail on canvas/WebGL/PDF viewers with no DOM structure; pure DOM agents fail on dynamic visual states (loading spinners, toggle animations)
Adopt Accessibility-First DOM-Vision Fusion. Primary source: the accessibility tree (role, name, state, bounds) via Playwright/CDP. Secondary: screenshot crops of element bounds for visual state verification. Tertiary: OCR for text extraction when rendered text differs from DOM textContent.
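A minimal sketch of the three-tier fallback described above, assuming the accessibility node has already been fetched (for example via Playwright or CDP) and flattened into a plain record. Every name here (`AXNode`, `Bounds`, `crop_box`, `perception_sources`, `resolve_text`) is hypothetical and not part of any library; the padding value and viewport size are arbitrary illustration defaults.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    """Element rectangle in viewport pixels (hypothetical shape)."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class AXNode:
    """Hypothetical, simplified view of one accessibility-tree node."""
    role: str
    name: str
    bounds: Optional[Bounds] = None
    states: dict = field(default_factory=dict)


def crop_box(bounds, viewport_w, viewport_h, pad=4):
    """Pad the element bounds, clamp to the viewport, and return
    (left, top, right, bottom), or None if nothing is visible."""
    left = max(0, int(bounds.x) - pad)
    top = max(0, int(bounds.y) - pad)
    right = min(viewport_w, int(bounds.x + bounds.width) + pad)
    bottom = min(viewport_h, int(bounds.y + bounds.height) + pad)
    if right <= left or bottom <= top:
        return None
    return (left, top, right, bottom)


def perception_sources(node, viewport_w=1280, viewport_h=720):
    """Pick which sources to consult for one target element."""
    if node is None or not node.role:
        # No semantic anchor (e.g. canvas/WebGL/PDF): fall back to the full screenshot.
        return ["vision_full"]
    sources = ["a11y"]  # primary: role, name, state, bounds
    if node.bounds is not None and crop_box(node.bounds, viewport_w, viewport_h):
        sources.append("vision_crop")  # secondary: verify visual state on a crop
    return sources


def _normalize(text):
    # Collapse whitespace and ignore case before comparing.
    return " ".join(text.split()).casefold()


def resolve_text(dom_text, ocr_text):
    """Tertiary tier: prefer OCR text only when it disagrees with the DOM."""
    if ocr_text is not None and _normalize(ocr_text) != _normalize(dom_text):
        return ocr_text
    return dom_text
```

In use, the agent would call `perception_sources` per target element, take a screenshot crop with the box from `crop_box` when `"vision_crop"` is listed, and run OCR on that crop only if the text matters for the step. Comparing normalized text keeps trivial whitespace or case differences from triggering the OCR path.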
Journey Context:
DOM selectors break with framework updates, and computer vision misses semantic meaning. The accessibility tree provides stable semantic anchors (WCAG standard), while vision handles visual dynamics. The Browser-use and Stagehand frameworks show this fusion outperforming pure approaches by 40%+ on web navigation benchmarks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:41:45.249642+00:00— report_created — created