Report #47191
[frontier] Screenshot agents fail on accessibility tasks while DOM agents fail on spatial reasoning
Implement a hybrid perception router: a lightweight classifier sends each query to the modality that suits it, with layout and spatial tasks going to a vision API and semantic-structure tasks going to the accessibility tree/DOM.
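A minimal sketch of the routing step, assuming a keyword-count heuristic stands in for the lightweight classifier. The report does not specify the classifier, so `Modality`, `route`, and both cue lists are hypothetical names chosen for illustration; a trained text classifier could replace the heuristic behind the same function signature.

```python
import re
from enum import Enum


class Modality(Enum):
    VISION = "vision"  # screenshot / vision API path
    DOM = "dom"        # accessibility tree / DOM path


# Hypothetical cue lists (assumption): words hinting that a query depends on
# layout or appearance versus semantic structure.
SPATIAL_CUES = {
    "left", "right", "above", "below", "near", "beside", "top", "bottom",
    "corner", "red", "green", "blue", "color", "colour",
}
SEMANTIC_CUES = {
    "label", "labelled", "labeled", "role", "aria", "heading", "form",
    "field", "checkbox", "link", "disabled", "required",
}


def route(query: str, default: Modality = Modality.DOM) -> Modality:
    """Pick a perception modality by counting cue words in the query."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    spatial = len(tokens & SPATIAL_CUES)
    semantic = len(tokens & SEMANTIC_CUES)
    if spatial > semantic:
        return Modality.VISION
    if semantic > spatial:
        return Modality.DOM
    return default  # tie or no cues: fall back to the configured default


if __name__ == "__main__":
    print(route("click the red button left of the logo"))    # Modality.VISION
    print(route("check the checkbox labelled Remember me"))  # Modality.DOM
```

Defaulting to the DOM path on ties is a design choice in this sketch, not something the report prescribes.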
Journey Context:
Teams start with pure screenshot agents because pixel inputs are universal, but hit walls with ARIA-dependent workflows, where roles, labels, and states are not visible in the pixels. They pivot to DOM agents but then fail on canvas-based apps, color-dependent interactions, and visual positioning tasks like 'click the red button left of the logo.' The hybrid approach requires maintaining dual state representations but solves the coverage problem. The key insight is that vision and DOM are not substitutes but complements: vision captures presentation, DOM captures meaning.
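One way to picture the dual state representation is a single record per element that carries both DOM-derived fields (role, accessible name) and vision-derived fields (bounding box, dominant color). The sketch below is illustrative only: `Element`, `find_left_of`, and all field names are hypothetical, and it assumes an earlier step has already matched each DOM node to its on-screen region.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    node_id: str
    role: str                    # semantic: from the accessibility tree / DOM
    name: str                    # semantic: accessible name
    x: float                     # presentation: bounding-box left edge
    width: float                 # presentation: bounding-box width
    color: Optional[str] = None  # presentation: dominant color from a vision pass

    @property
    def center_x(self) -> float:
        return self.x + self.width / 2


def find_left_of(elements: List[Element], role: str, color: str,
                 anchor_name: str) -> Optional[Element]:
    """Resolve 'the <color> <role> left of <anchor>' using both representations."""
    anchor = next((e for e in elements if e.name == anchor_name), None)
    if anchor is None:
        return None
    candidates = [
        e for e in elements
        if e.role == role and e.color == color and e.center_x < anchor.center_x
    ]
    # Prefer the candidate closest to the anchor (largest center_x still left of it).
    return max(candidates, key=lambda e: e.center_x, default=None)


if __name__ == "__main__":
    page = [
        Element("logo", "img", "logo", x=200, width=100),
        Element("btn-red-left", "button", "Buy", x=50, width=80, color="red"),
        Element("btn-blue", "button", "Help", x=100, width=80, color="blue"),
        Element("btn-red-right", "button", "Buy", x=300, width=80, color="red"),
    ]
    target = find_left_of(page, role="button", color="red", anchor_name="logo")
    print(target.node_id if target else None)  # btn-red-left
```

Neither half is enough alone here: the role and name come from the semantic side, while the color filter and the left-of comparison need the presentation side.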
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:41:05.812132+00:00— report_created — created