Report #43932
[frontier] Agents lose spatial precision when passing visual information between tools (vision model → code interpreter) because images are converted to text descriptions, which causes misalignment in multi-tool workflows
Adopt a 'Set-of-Mark' (SoM) representation: overlay bounding boxes with numeric IDs on the image, then reference regions by ID in text rather than describing them; maintain a shared coordinate system across tool boundaries
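One way to keep a shared coordinate system is to store every box in normalized [0, 1] coordinates and convert to pixels only at the tool that needs them. The sketch below assumes that convention; `NormBox` and `from_pixels` are hypothetical names for illustration, not part of any library mentioned in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormBox:
    """Bounding box in normalized [0, 1] coordinates (assumed convention)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        """Scale the box to pixel coordinates for an image of the given size."""
        return (
            round(self.x0 * width),
            round(self.y0 * height),
            round(self.x1 * width),
            round(self.y1 * height),
        )


def from_pixels(box: tuple[int, int, int, int], width: int, height: int) -> NormBox:
    """Normalize a pixel box (x0, y0, x1, y1) against the source image size."""
    x0, y0, x1, y1 = box
    return NormBox(x0 / width, y0 / height, x1 / width, y1 / height)


# A box detected on a 1000x500 screenshot, reused by a tool that sees a 2000x1000 frame.
shared = from_pixels((100, 50, 300, 150), width=1000, height=500)
scaled = shared.to_pixels(width=2000, height=1000)
```

Because each tool rescales from the same normalized box, a resize or downsample at one boundary does not shift the region that an ID points to.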
Journey Context:
When an agent workflow moves from a vision model ('look at this UI') to a code tool ('click the red button'), the image is usually converted to a text description ('red button in top left'). This loses pixel-precise coordinates and relative positioning. The frontier pattern from Microsoft Research's OmniParser uses 'Set-of-Mark' prompting: the vision model receives an image with overlaid bounding boxes and numeric labels (1, 2, 3...), then outputs actions like 'click on region 3' rather than describing the region. The identifier persists across tool calls, so spatial grounding is preserved without retransmitting the full image.
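A minimal sketch of the ID-based handoff, assuming the boxes have already been detected and that the model's action text contains the phrase 'region N'. The names `Mark`, `assign_marks`, and `resolve_action` are hypothetical and are not OmniParser's API; the reading-order numbering is also an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    mark_id: int
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

    @property
    def center(self) -> tuple[int, int]:
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) // 2, (y0 + y1) // 2)


def assign_marks(boxes: list[tuple[int, int, int, int]]) -> dict[int, Mark]:
    """Number boxes 1, 2, 3... sorted by top edge, then left edge."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: Mark(i, b) for i, b in enumerate(ordered, start=1)}


def resolve_action(action: str, marks: dict[int, Mark]) -> tuple[int, int]:
    """Turn text like 'click on region 3' into the click point for that mark."""
    match = re.search(r"region\s+(\d+)", action)
    if match is None:
        raise ValueError(f"no region ID found in action: {action!r}")
    mark_id = int(match.group(1))
    if mark_id not in marks:
        raise KeyError(f"unknown region ID: {mark_id}")
    return marks[mark_id].center


# Boxes as a detector might return them, in arbitrary order.
marks = assign_marks([(400, 20, 480, 60), (10, 10, 110, 50), (10, 200, 210, 260)])
click_point = resolve_action("click on region 3", marks)
```

The code tool only ever receives the ID and looks up the stored box, so the exact pixel coordinates never pass through a free-text description.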
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:12:53.437679+00:00— report_created — created