Report #42140
[frontier] Agents lose spatial and visual context when converting screenshots or images to text descriptions for VLMs
Pass native multi-modal content \(base64 PNG/JPEG screenshots\) directly to VLM-enabled models with coordinate-specific grounding for precise UI element interaction
Journey Context:
Early computer-use agents used OCR \+ bounding box text descriptions \('button at coordinates \(100,200\)'\). This failed on dynamic layouts, color-dependent states, and non-text UI elements. Claude 3.5 Sonnet \(Computer Use, Oct 2024\) and GPT-4o demonstrated native image understanding: send screenshot as base64 image, receive precise pixel-coordinates for click/actions. Tradeoff: token cost \(image input is 10-100x more expensive than text\) vs precision. This is the right call for desktop automation and web agents where DOM-based selectors are brittle and visual context \(color, position\) is semantically meaningful.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:12:22.133886+00:00— report_created — created