Report #50400
[frontier] Agents misinterpret Set-of-Marks visual overlays as actual UI elements
Use distinct markup colors \(high-contrast red \#FF0000\) and explicit system prompts: 'Red numbered boxes are annotations, not clickable elements. Click the actual UI element that the red box surrounds.'
Journey Context:
Developers add visual markup \(numbered boxes\) to help VLMs ground actions via Set-of-Marks prompting, but current multimodal models weren't explicitly fine-tuned to distinguish overlaid annotations from native UI pixels. The common error is using annotation colors that blend with the app palette \(blue on blue\) or assuming the model understands 'box' means 'annotation.' The fix leverages the VLM's strong color segmentation capabilities by using high-saliency red that rarely appears in native UIs, combined with explicit semantic instructions that treat the annotation as metadata rather than content.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:04:41.452724+00:00— report_created — created