Report #65984
[frontier] Vision-language agents exhibit a systematic bias toward text labels and ignore visual affordances (e.g., attempting to click grayed-out buttons)
Deploy visual affordance pre-validation using a separate vision-only critic model that validates element interactability (color, opacity, focus state) before the language model commits to an action plan
Journey Context:
When an agent uses a VLM to identify GUI elements but a separate LM to reason about them, a systematic bias appears: the LM overfits to semantic text labels ('Submit') while ignoring visual affordances (the button is grayed out, has 0.3 opacity, or lacks a focus ring). The agent then repeatedly attempts impossible actions because it confuses 'element found' with 'element interactable.' Few-shot prompting ('check if enabled') fails because LMs lack the visual discrimination needed for subtle state changes (distinguishing #808080 from #000000). Frontier implementations use a 'visual affordance critic': a separate vision encoder (fine-tuned on UI element states or few-shot prompted specifically for affordance detection) that receives a crop of the proposed target element and classifies its interactability state (enabled/disabled/loading/hidden). The critic runs before the LM generates the action JSON and acts as a hard gate. If the critic disagrees with the LM's assumption, the agent either aborts or rescans for alternative elements. This decouples semantic understanding from visual state verification and prevents the 'disabled button' failure mode.
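The gating flow described above might be sketched roughly as follows. This is a minimal illustration, not the implementation from any particular agent framework: the names `Affordance`, `Candidate`, `gate_action`, and `toy_opacity_critic` are all hypothetical, and the opacity heuristic with its 0.5 threshold is only a stand-in for the fine-tuned or few-shot prompted vision critic. A real critic would be a model call that takes the cropped image.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int, int]  # (r, g, b, a), each 0-255
Crop = Sequence[Pixel]             # flattened pixels of the cropped target element


class Affordance(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    LOADING = "loading"
    HIDDEN = "hidden"


@dataclass(frozen=True)
class Candidate:
    label: str  # semantic text label the LM reasons about
    crop: Crop  # pixels the critic inspects


# Any callable mapping a crop to an interactability state can act as the critic.
Critic = Callable[[Crop], Affordance]


def toy_opacity_critic(crop: Crop) -> Affordance:
    """Placeholder critic (assumption): classifies by mean alpha only.

    It never returns LOADING; a real vision critic would cover all four states.
    """
    if not crop:
        return Affordance.HIDDEN
    mean_alpha = sum(p[3] for p in crop) / (255 * len(crop))
    if mean_alpha == 0:
        return Affordance.HIDDEN
    if mean_alpha < 0.5:  # assumed threshold, chosen for illustration
        return Affordance.DISABLED
    return Affordance.ENABLED


def gate_action(candidates: Sequence[Candidate], critic: Critic) -> Optional[Candidate]:
    """Hard gate that runs before any action JSON is generated.

    `candidates` is ordered by LM preference. The first candidate the critic
    marks ENABLED is returned (the rescan path); None means abort.
    """
    for candidate in candidates:
        if critic(candidate.crop) is Affordance.ENABLED:
            return candidate
    return None


# Usage: the LM prefers 'Submit', but that button is rendered at ~0.3 opacity.
grayed = Candidate("Submit", ((128, 128, 128, 76),) * 4)
active = Candidate("Submit order", ((0, 0, 0, 255),) * 4)
chosen = gate_action([grayed, active], toy_opacity_critic)
print(chosen.label if chosen else "abort")  # prints "Submit order"
```

Passing the critic in as a plain callable keeps the gate independent of how interactability is judged, so the toy heuristic can be swapped for a vision model without touching the abort/rescan logic.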
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:14:18.584553+00:00— report_created — created