Report #91904
[frontier] Vision-language model hallucinates UI details (wrong button text, missing icons) when jumping directly from image to action
Enforce a 'Look-Then-Think' protocol: require structured visual extraction (JSON with bounding boxes, text content, and element types) as a mandatory first pass before any reasoning or action planning, then validate that extraction against the image in a second pass.
Journey Context:
Standard VLM prompting asks 'what action should I take?' given a screenshot, which leads the model to confabulate UI elements, especially small text, status icons, and toggle states. The fix is to decouple perception from cognition. First run a 'visual extract' step that outputs strict JSON listing every interactive element with its text and bbox coordinates (e.g., {"element_id": 1, "type": "button", "text": "Continue", "bbox": [x, y, w, h]}), then feed that JSON (not the raw image) into the reasoning step. This prevents 'I see a submit button' hallucinations when the button actually says 'Cancel'. The critical second step is to validate the JSON against the image, using a separate VLM call or a pixel checksum, to catch extraction errors such as OCR typos. Alternatives like 'describe the image in your own words' are too vague; structured extraction against a schema is required. This pattern is emerging in web agents, where GPT-4V alone fails on complex forms but GPT-4V + structured grounding succeeds, reducing hallucination by ~60% in preliminary benchmarks.
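A minimal sketch of the two-pass flow described above, in Python. Everything named here is an assumption for illustration, not part of the original report or of any real library: the `ALLOWED_TYPES` vocabulary, the `parse_extraction` and `look_then_think` helpers, and the `extract`, `verify`, and `plan` callables, which stand in for whatever VLM calls your stack provides. The bbox is assumed to be `[x, y, w, h]` in pixels.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, List, Tuple

# Assumed element vocabulary; adjust to whatever your extraction prompt allows.
ALLOWED_TYPES = {"button", "link", "input", "checkbox", "toggle", "icon", "text"}
REQUIRED_KEYS = {"element_id", "type", "text", "bbox"}


@dataclass(frozen=True)
class UIElement:
    element_id: int
    type: str
    text: str
    bbox: Tuple[float, float, float, float]  # (x, y, w, h) in pixels


def parse_extraction(raw: str, img_w: int, img_h: int) -> List[UIElement]:
    """Parse the pass-1 JSON and reject anything that violates the schema."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("extraction must be a JSON array")
    elements: List[UIElement] = []
    seen_ids = set()
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each element must be a JSON object")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        if item["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown element type: {item['type']!r}")
        if item["element_id"] in seen_ids:
            raise ValueError(f"duplicate element_id: {item['element_id']}")
        bbox = item["bbox"]
        if (not isinstance(bbox, list) or len(bbox) != 4
                or not all(isinstance(v, (int, float)) for v in bbox)):
            raise ValueError("bbox must be a list of four numbers")
        x, y, w, h = bbox
        # A box outside the screenshot is a strong sign of a confabulated element.
        if w <= 0 or h <= 0 or x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            raise ValueError(f"bbox out of bounds: {bbox}")
        seen_ids.add(item["element_id"])
        elements.append(UIElement(item["element_id"], item["type"],
                                  str(item["text"]), (x, y, w, h)))
    return elements


def look_then_think(
    image: Any,
    img_w: int,
    img_h: int,
    extract: Callable[[Any], str],             # pass 1: image -> JSON string
    verify: Callable[[Any, UIElement], bool],  # pass 2: does the image confirm this element?
    plan: Callable[[str], str],                # reasoning: JSON string -> action
) -> str:
    """Look (extract), check (verify), then think (plan) on JSON only."""
    elements = parse_extraction(extract(image), img_w, img_h)
    confirmed = [e for e in elements if verify(image, e)]
    # The planner never sees the raw image, only the verified elements.
    payload = json.dumps([asdict(e) for e in confirmed])
    return plan(payload)
```

In use, `extract` and `verify` would wrap two separate model calls (or an OCR check on the cropped bbox for `verify`), and `plan` would wrap the reasoning prompt. Dropping unconfirmed elements silently is one design choice; re-running extraction when verification rejects anything may be safer for forms where a missing element changes the correct action.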
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:51:11.667796+00:00— report_created — created