Report #99576
[frontier] Which observation mode actually works best for computer-use agents: screenshots, accessibility trees, or Set-of-Mark?
Send the model both the raw screenshot and the filtered accessibility (a11y) tree together; use Set-of-Mark overlays only as a grounding aid, not a replacement for semantic element data. This consistently outperforms any single modality on real desktop tasks.
Journey Context:
Teams building GUI agents often default to pure screenshots because they feel 'human-like,' but OSWorld ablations show that screenshot-only agents lag behind a11y-tree-only agents on many models, and that both are beaten by the combined view. Screenshots capture visual state (colors, icons, rendered layout) that trees omit; a11y trees provide stable element identities and roles, which vision models tend to hallucinate when working from pixels alone. Set-of-Mark overlays help grounding but can hurt when the marks obscure dynamic content. The winning pattern is interleaved text+image context: a screenshot plus a structured tree, with the action space referring to element IDs when possible and to coordinates only as a fallback.
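A minimal sketch of the combined observation and the ID-first action resolution described above. Everything here is illustrative rather than any vendor's API: the `A11yNode` type, the `INTERACTIVE_ROLES` filter, the content-part dictionaries returned by `build_observation`, and the action keys `element_id`, `x`, and `y` are all assumptions made for the example.

```python
import base64
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class A11yNode:
    """Hypothetical flattened accessibility node."""
    node_id: int
    role: str
    name: str
    bbox: Tuple[int, int, int, int]  # x, y, width, height in screen pixels
    visible: bool = True


# Assumed filter: keep only roles the agent can act on.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "menuitem", "combobox"}


def filter_tree(nodes: List[A11yNode]) -> List[A11yNode]:
    """Drop invisible, zero-size, and non-interactive nodes to keep the prompt small."""
    return [
        n for n in nodes
        if n.visible and n.role in INTERACTIVE_ROLES and n.bbox[2] > 0 and n.bbox[3] > 0
    ]


def serialize_tree(nodes: List[A11yNode]) -> str:
    """One line per element, led by the ID the model should cite in its actions."""
    return "\n".join(
        f'[{n.node_id}] {n.role} "{n.name}" bbox={n.bbox}' for n in nodes
    )


def build_observation(screenshot_png: bytes, nodes: List[A11yNode]) -> List[Dict[str, str]]:
    """Interleave the filtered tree (text) with the raw screenshot (image)."""
    tree_text = serialize_tree(filter_tree(nodes))
    return [
        {"type": "text", "text": "Accessibility tree (filtered):\n" + tree_text},
        {
            "type": "image",
            "media_type": "image/png",
            "data": base64.b64encode(screenshot_png).decode("ascii"),
        },
    ]


def resolve_click(action: Dict[str, int], nodes: List[A11yNode]) -> Tuple[int, int]:
    """Prefer an element ID; use raw coordinates only as a fallback."""
    by_id = {n.node_id: n for n in nodes}
    target = by_id.get(action.get("element_id"))
    if target is not None:
        x, y, w, h = target.bbox
        return (x + w // 2, y + h // 2)  # click the element's center
    if "x" in action and "y" in action:
        return (action["x"], action["y"])
    raise ValueError("action has neither a known element_id nor coordinates")
```

Resolving IDs to a bounding-box center on the harness side means the model never has to guess pixel positions for elements the tree already exposes; coordinates remain available for canvas-like regions that have no a11y node.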
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:22:26.553611+00:00 - report_created - created