Report #68912
[frontier] Pixel-Only Grounding Ambiguity: natural language commands ('click the blue button') without DOM IDs lead to coordinate prediction errors on responsive or dynamic layouts
Visual Affordance Detection Pipeline: generate Set-of-Marks (SoM) overlay on screenshot → object detection for interactive elements → semantic labeling → coordinate regression with confidence thresholding
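A minimal sketch of the confidence-thresholding stage at the end of the pipeline, assuming each candidate element already carries a score for how well it matches the instruction. The `Candidate` structure, the `pick_target` helper, and both threshold values are hypothetical and chosen for illustration only; they are not taken from any specific implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """One detected interactive element (hypothetical structure)."""
    label: str                       # semantic label, e.g. "button: Submit"
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
    score: float                     # assumed match confidence in [0, 1]


def pick_target(candidates: List[Candidate],
                min_score: float = 0.6,
                min_margin: float = 0.1) -> Optional[Candidate]:
    """Return the best candidate, or None when the match is too uncertain.

    The agent abstains (returns None) in two cases:
      - the top score is below min_score, or
      - the top two scores are closer than min_margin (ambiguous match).
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked or ranked[0].score < min_score:
        return None
    if len(ranked) > 1 and ranked[0].score - ranked[1].score < min_margin:
        return None
    return ranked[0]
```

Returning `None` rather than a low-confidence guess lets the caller re-capture the screenshot or ask for a more specific instruction instead of clicking the wrong element.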
Journey Context:
Pure pixel-based agents (no DOM access) struggle with vague natural language instructions because a phrase like 'blue button' carries no spatial grounding. Simple OCR finds text but misses icons and unlabeled buttons. Wrong fix: direct coordinate regression from pixels, which is imprecise and prone to hallucinated targets. Correct fix: the Set-of-Marks (SoM) pattern. Run object detection for interactive elements (buttons, inputs), overlay a numbered label on each detected element, then have the VLM pick the number that matches the instruction instead of predicting raw coordinates. Provenance ties to Anthropic Computer Use implementation details and Microsoft SoM (Set of Marks) research for visual grounding.
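The bookkeeping around the SoM pattern can be sketched as follows, assuming the detector has already produced bounding boxes and the VLM replies with free text containing a mark number. The helper names (`assign_marks`, `parse_mark`, `click_point`) are hypothetical, and drawing the numbered overlay onto the screenshot is left out.

```python
import re
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Number boxes 1..N in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: box for i, box in enumerate(ordered, start=1)}


def parse_mark(answer: str, marks: Dict[int, Box]) -> Optional[int]:
    """Take the first integer in the model's answer; reject unknown marks."""
    match = re.search(r"\d+", answer)
    if match is None:
        return None
    mark = int(match.group())
    return mark if mark in marks else None


def click_point(box: Box) -> Tuple[int, int]:
    """Center of the box, used as the click coordinate."""
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


# Example: three detected elements, model answers "Mark 2".
marks = assign_marks([(200, 40, 300, 80), (20, 40, 120, 80), (20, 200, 120, 240)])
chosen = parse_mark("Mark 2", marks)
target = click_point(marks[chosen]) if chosen is not None else None
```

Because the model only chooses among labeled candidates, an answer that names a mark not on the screen is rejected by `parse_mark` rather than turned into an arbitrary coordinate.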
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:09:20.688753+00:00— report_created — created