Report #66607
[frontier] Agent fails to interact with icon-based UI elements because it only processes OCR text
Enforce explicit visual affordance extraction: prompt the VLM to describe interactive elements by their visual properties (shape, color, position) before running OCR, and build a visual element registry alongside the text content.
Journey Context:
Agents default to reading text because it's deterministic, but modern UIs rely heavily on iconography, color coding, and spatial affordances. OCR-only agents fail on hamburger menus, color-coded status indicators, and drag handles, none of which carry any text to read. The naive fix is prompting "describe the image," but that's too vague to yield anything the agent can act on. This pattern forces structured visual parsing: the VLM must catalog elements by their visual signature (e.g., "blue circle with white plus, top-right") before any text extraction. It treats the UI as a visual scene graph, not a document. This prevents the "OCR trap," where the agent reads the text "Submit" but misses that the button is grayed out, a visual affordance signaling that it is disabled.
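A minimal sketch of what the visual element registry could look like, assuming the VLM is asked to reply with a JSON array. The prompt wording, the field names (shape, color, position, enabled, text), and the helpers build_registry, find_actionable, and icon_only are all hypothetical and not taken from the report. The VLM call itself is stubbed with a sample reply.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical prompt: asks for visual properties first, text last.
VISUAL_FIRST_PROMPT = (
    "List every interactive element in this screenshot as a JSON array. "
    "For each element give: shape, color, position, enabled (true/false), "
    "and text (null if the element has no label). "
    "Describe the visual properties before reading any text."
)


@dataclass
class VisualElement:
    shape: str
    color: str
    position: str
    enabled: bool
    text: Optional[str] = None

    @property
    def signature(self) -> str:
        # Visual signature, e.g. "blue circle with white plus, top-right"
        return f"{self.color} {self.shape}, {self.position}"


def build_registry(vlm_json: str) -> List[VisualElement]:
    """Parse the (assumed) JSON reply from the VLM into a registry."""
    return [
        VisualElement(
            shape=item["shape"],
            color=item["color"],
            position=item["position"],
            enabled=bool(item.get("enabled", True)),
            text=item.get("text"),
        )
        for item in json.loads(vlm_json)
    ]


def find_actionable(registry: List[VisualElement], label: str) -> Optional[VisualElement]:
    """Return the element labelled `label` only if it is visually enabled."""
    for el in registry:
        if el.text and el.text.lower() == label.lower():
            return el if el.enabled else None
    return None


def icon_only(registry: List[VisualElement]) -> List[VisualElement]:
    """Elements with no text label, which an OCR-only agent cannot see."""
    return [el for el in registry if el.text is None]


# Stubbed VLM reply; a real agent would send VISUAL_FIRST_PROMPT with the screenshot.
SAMPLE_REPLY = json.dumps([
    {"shape": "circle with white plus", "color": "blue",
     "position": "top-right", "enabled": True, "text": None},
    {"shape": "rectangle", "color": "gray",
     "position": "bottom-center", "enabled": False, "text": "Submit"},
])

registry = build_registry(SAMPLE_REPLY)
```

With this registry, find_actionable(registry, "Submit") returns None because the button is marked as disabled, and icon_only(registry) surfaces the plus icon that OCR alone would miss. Keeping the enabled flag next to the text is the design choice that guards against the OCR trap.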
Lifecycle
2026-06-20T18:16:48.950337+00:00— report_created — created