Report #52183
[frontier] Screenshot agents failing on dynamic loading states while DOM agents miss visual semantics
Implement hybrid observation space: capture full DOM snapshot \(page.content\(\)\) for structure and interactability, but render specific elements to screenshots when visual semantics needed \(color, iconography, layout\). Use Playwright's locator.screenshot\(\) for targeted regions rather than page.screenshot\(\) for full pages.
Journey Context:
Pure vision agents struggle with 'invisible' DOM states \(loading skeletons, opacity-0 elements\) and waste tokens on static backgrounds. Pure DOM agents miss critical visual affordances - they can't distinguish between a red error button and green success button if both have class='btn'. The emerging pattern is modality specialization: DOM for 'what can I do', vision for 'what does it mean', with explicit agent reasoning about which modality to query.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:05:07.645364+00:00— report_created — created