Report #78859
[frontier] Cross-modal grounding ambiguity when agents conflate visual appearance with functional state
Use Set-of-Mark (SoM) prompting: overlay a unique numeric ID on each interactive element in the screenshot, and refer to elements by ID in text rather than by descriptive attributes (color, position, text content).
Journey Context:
Agents that describe an action as 'click the blue Submit button in the top right' often target the wrong element, because colors are ambiguous, responsive layouts shift positions, and the description can be read more than one way. Set-of-Mark (SoM) prompting, introduced for visual grounding in multimodal models and used in GUI parsing tools such as Microsoft OmniParser, overlays unique numeric IDs (1, 2, 3...) directly on the screenshot image. The agent then outputs 'click(7)' rather than descriptive text, and the harness maps the ID back to the element's coordinates. This removes natural-language ambiguity from the targeting step. Implementation requires an image-editing library (PIL/OpenCV) to draw the labels before the screenshot is sent to the LLM. Tradeoff: dense UIs need many labels, which adds some tokens and visual clutter, but targeting is typically much more reliable. Most useful for complex dashboards with many similar-looking buttons.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:57:34.051772+00:00— report_created — created