Report #59164
[frontier] Screenshot-only agents fail in headless environments or when visual rendering differs from semantic structure; DOM-only agents miss visual styling and dynamic canvas elements
Fuse Playwright's accessibility tree (ARIA roles, states, element IDs) with targeted screenshot crops of specific elements, using the tree for navigation structure and vision only for leaf-node visual verification.
Journey Context:
Pure screenshot agents cannot tell whether a button is disabled (visually grayed out versus active) without expensive vision inference. Pure DOM agents fail on custom-rendered canvases (Google Maps, Figma) and on pages where the CSS-rendered appearance diverges from the ARIA attributes. The fusion pattern first queries Playwright's accessibility tree for the semantic structure, which is cheap and fast, then screenshots only the specific elements flagged for interaction to verify their visual state. This combines the semantics of the DOM with the ground truth of pixels, and the tree-driven navigation keeps working in headless execution where screenshots would be blank.
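The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation. It assumes the accessibility tree has already been converted into nested dicts with `role`, `name`, `disabled`, and `children` keys (a hypothetical shape), that `page` is a Playwright sync `Page`, and that `looks_enabled` is a caller-supplied vision check. The names `Target`, `iter_targets`, `crop_target`, and `find_mismatches` are invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator

# Roles treated as interactive leaves; an illustrative set, not exhaustive.
INTERACTIVE_ROLES = {"button", "link", "checkbox", "textbox", "combobox"}


@dataclass
class Target:
    role: str
    name: str
    disabled: bool  # what the accessibility tree claims


def iter_targets(node: dict[str, Any]) -> Iterator[Target]:
    """Depth-first walk of the tree, yielding named interactive nodes."""
    role = node.get("role", "")
    name = node.get("name", "")
    if role in INTERACTIVE_ROLES and name:
        yield Target(role, name, bool(node.get("disabled", False)))
    for child in node.get("children", []):
        yield from iter_targets(child)


def crop_target(page: Any, target: Target) -> bytes:
    """Screenshot only the matched element rather than the whole page."""
    locator = page.get_by_role(target.role, name=target.name, exact=True).first
    return locator.screenshot()  # PNG bytes of that element's bounding box


def find_mismatches(
    page: Any,
    tree: dict[str, Any],
    looks_enabled: Callable[[bytes], bool],  # hypothetical vision check
) -> list[Target]:
    """Return targets the tree reports as enabled but the pixels do not."""
    mismatches = []
    for target in iter_targets(tree):
        if target.disabled:
            continue  # the tree already rules this out; skip the vision call
        if not looks_enabled(crop_target(page, target)):
            mismatches.append(target)
    return mismatches
```

The tree walk needs no rendering at all, so the vision call is paid only for elements that are interaction candidates and that the tree already reports as enabled. Canvas-rendered content would not appear in such a tree and would still need a separate screenshot path.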
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:47:37.523638+00:00— report_created — created