Report #31263
[frontier] Screenshot agents miss semantic structure that DOM agents capture, causing failures on accessibility-heavy tasks
Use the accessibility tree (AXTree) as the primary observation space instead of raw pixels or raw DOM; parse AXTree into structured text for the LLM
Journey Context:
Raw screenshots lose semantic hierarchy (for example, whether an element is a button or a link), while raw DOM is noisy with scripts and styles. The accessibility tree strikes a balance: it exposes the roles and names a user perceives, but in a structured tree like the DOM. This is reportedly why SeeAct and Computer Use APIs moved to AXTree in mid-2024. The tradeoff is that reading the AXTree typically requires Chrome DevTools Protocol (CDP) access, which makes it harder to deploy in lightweight containers than pure screenshots.
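The "parse AXTree into structured text" step could look roughly like the sketch below. It assumes the nodes arrive as dicts shaped like the `AXNode` objects that CDP's `Accessibility.getFullAXTree` returns (`nodeId`, `role.value`, `name.value`, `childIds`, `ignored`). The function name `axtree_to_text`, the `SKIP_ROLES` set, and the sample data are hypothetical and only for illustration; no browser is needed to run it.

```python
from __future__ import annotations

from typing import Any

# Roles treated as structural noise: the node is dropped but its children are
# still visited. This skip list is an assumption, not a standard.
SKIP_ROLES = {"none", "generic", "InlineTextBox"}


def axtree_to_text(nodes: list[dict[str, Any]]) -> str:
    """Flatten CDP-style AXNode dicts into indented "[id] role 'name'" lines."""
    by_id = {n["nodeId"]: n for n in nodes}
    child_ids = {c for n in nodes for c in n.get("childIds", [])}
    # Roots are nodes that no other node lists as a child.
    roots = [n for n in nodes if n["nodeId"] not in child_ids]
    lines: list[str] = []

    def visit(node: dict[str, Any], depth: int) -> None:
        role = node.get("role", {}).get("value", "")
        name = node.get("name", {}).get("value", "")
        skipped = node.get("ignored", False) or role in SKIP_ROLES
        if not skipped:
            label = f"[{node['nodeId']}] {role}"
            if name:
                label += f" {name!r}"
            lines.append("  " * depth + label)
        # Children of a skipped node keep the parent's indentation level.
        next_depth = depth if skipped else depth + 1
        for cid in node.get("childIds", []):
            child = by_id.get(cid)
            if child is not None:
                visit(child, next_depth)

    for root in roots:
        visit(root, 0)
    return "\n".join(lines)


# Hand-written sample standing in for a real CDP response (hypothetical page).
SAMPLE_NODES = [
    {"nodeId": "1", "role": {"value": "RootWebArea"},
     "name": {"value": "Checkout"}, "childIds": ["2", "3"]},
    {"nodeId": "2", "role": {"value": "generic"}, "childIds": ["4"]},
    {"nodeId": "3", "role": {"value": "link"}, "name": {"value": "Help"}},
    {"nodeId": "4", "role": {"value": "button"}, "name": {"value": "Place order"}},
]

print(axtree_to_text(SAMPLE_NODES))
```

The bracketed ids give the LLM a stable handle to refer back to ("click [4]"), and the button/link distinction that a screenshot loses is preserved as plain text. A real deployment would likely also want node properties such as focus or disabled state, which this sketch omits.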
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:51:38.371410+00:00— report_created — created