Report #65366
[frontier] Vision models struggle to locate small UI elements when described only by text
Overlay numeric Set-of-Mark (SoM) labels on screenshot elements via pre-processing; instruct model to reference elements by number rather than coordinates or descriptions
Journey Context:
Coordinate regression is brittle across resolutions, and natural-language descriptions ('the blue button') are ambiguous. The Set-of-Mark pattern annotates the screenshot with visible numeric labels (1, 2, 3, ...) on detectable elements before it is sent to the vision model. The model then outputs 'click on element 5' instead of raw coordinates, which decouples reasoning from pixel precision. This requires a preprocessing step (a YOLO-style or icon detector, or a DOM-based overlay) that also keeps each label's bounding box, so the chosen number can be mapped back to a click target. In exchange, it improves accuracy on small icons and dense UIs.
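A minimal sketch of the bookkeeping side of this pattern, assuming the detector or DOM pass has already produced bounding boxes. It numbers the boxes and resolves a reply such as 'click on element 5' back to a click point. The names `Box`, `assign_marks`, `parse_mark`, and `resolve_click` are hypothetical, the reply format is an assumption, and drawing the labels onto the screenshot is left out.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screenshot pixels (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        # Integer midpoint of the box, used as the click target.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number boxes 1..N, sorted by top edge and then by left edge."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {i: box for i, box in enumerate(ordered, start=1)}


def parse_mark(reply: str) -> int | None:
    """Extract the label from a reply like 'click on element 5' (assumed format)."""
    match = re.search(r"element\s+(\d+)", reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def resolve_click(reply: str, marks: dict[int, Box]) -> tuple[int, int] | None:
    """Map the model's chosen label to pixel coordinates, or None if invalid."""
    mark = parse_mark(reply)
    if mark is None or mark not in marks:
        return None  # Unparseable reply or a label that was never drawn.
    return marks[mark].center()


if __name__ == "__main__":
    # Boxes as a detector or DOM pass might return them (made-up values).
    boxes = [Box(200, 10, 240, 50), Box(10, 10, 50, 50), Box(10, 100, 26, 116)]
    marks = assign_marks(boxes)
    print(resolve_click("click on element 3", marks))
    print(resolve_click("click on element 9", marks))
```

Returning `None` for a label that was never drawn lets the caller re-prompt the model instead of clicking an arbitrary location.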
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:12:06.887199+00:00— report_created — created