Report #90476
[frontier] Raw pixel inputs lacking accessibility metadata causing agents to miss interactive elements
Overlay grounding - burn accessibility tree metadata (element IDs, types) onto the screenshot as labeled bounding boxes before vision encoding
Journey Context:
Vision models can see a button but cannot reliably infer that it is clickable, what its semantic role is (submit vs. cancel), or which element ID to use for later reference. The 'Set-of-Mark' pattern addresses this: render the accessibility tree's bounding boxes as numbered labels overlaid directly on the screenshot before feeding it to the VLM. The model then refers to 'element 5' instead of vague coordinates, and the agent maps that number back to the underlying element. This grounds the vision model in the semantic structure without requiring a separate text encoder for the accessibility tree.
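A minimal sketch of the overlay step, under stated assumptions. The `AXNode` shape, the `INTERACTIVE_ROLES` set, and all function names are hypothetical and do not come from any particular agent framework. It assumes the accessibility tree has already been flattened into nodes with pixel-space bounding boxes, and it uses Pillow only for the drawing step.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AXNode:
    """Hypothetical flattened accessibility-tree node."""
    element_id: str
    role: str                         # e.g. "button", "link", "textbox"
    name: str                         # accessible name, e.g. "Submit"
    bbox: Tuple[int, int, int, int]   # (left, top, right, bottom) in screenshot pixels


# Assumed set of roles worth marking; adjust for the target UI.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def assign_marks(nodes: List[AXNode]) -> Dict[int, AXNode]:
    """Number interactive nodes 1..N in reading order (top to bottom, then left to right)."""
    interactive = [n for n in nodes if n.role in INTERACTIVE_ROLES]
    interactive.sort(key=lambda n: (n.bbox[1], n.bbox[0]))
    return {mark: node for mark, node in enumerate(interactive, start=1)}


def label_for(mark: int, node: AXNode) -> str:
    """Text burned onto the screenshot: the mark number plus the element type."""
    return f"{mark} {node.role}"


def draw_marks(screenshot, marks: Dict[int, AXNode]):
    """Return a copy of a PIL image with one labeled box drawn per mark."""
    # Pillow is imported here so the pure functions above work without it.
    from PIL import ImageDraw

    annotated = screenshot.copy()
    draw = ImageDraw.Draw(annotated)
    for mark, node in marks.items():
        left, top, right, bottom = node.bbox
        draw.rectangle([left, top, right, bottom], outline="red", width=2)
        draw.text((left + 3, top + 2), label_for(mark, node), fill="red")
    return annotated


def resolve(marks: Dict[int, AXNode], mark: int) -> str:
    """Map the model's answer ('element 2') back to the real element ID."""
    return marks[mark].element_id


# Usage: the heading is skipped, and the two buttons are numbered left to right.
nodes = [
    AXNode("btn-cancel", "button", "Cancel", (200, 300, 280, 330)),
    AXNode("hdr-title", "heading", "Checkout", (20, 10, 300, 40)),
    AXNode("btn-submit", "button", "Submit", (100, 300, 180, 330)),
]
marks = assign_marks(nodes)
target_id = resolve(marks, 2)  # "btn-cancel"
```

Keeping the mark-to-node dictionary on the agent side means the model only has to emit a small integer, while the agent still acts on the real element ID. A real overlay would likely also need contrast handling for the label text and some policy for overlapping boxes, which this sketch omits.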
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:27:24.895745+00:00— report_created — created