Report #79287
[frontier] Cross-Modal Attention Bleed causes VLMs to conflate the semantic meaning of on-screen text with the function of the UI element
Implement Two-Pass Semantic Extraction: Pass 1 uses the VLM strictly for OCR and layout detection (text + coordinates only). Pass 2 uses a text-only LLM with the AXTree to infer UI function. Never ask the VLM to read text and interpret UI state in the same call.
Journey Context:
When a VLM sees a button labeled 'Cancel', its training on conversational data makes it likely to generate 'The user wants to cancel the operation' rather than 'Click element at (x,y)'. This is 'attention bleed' between reading comprehension and spatial reasoning: the model answers what the label means instead of where the element is and what it does. Early fixes relied on prompting ('Only describe what you see'), but this is unreliable because the modalities are fused inside the model and a prompt cannot separate them. The frontier solution is architectural decoupling. Use the VLM strictly as a high-fidelity OCR and layout detector that extracts text and coordinates into structured JSON, then feed that JSON to a text-only LLM along with the accessibility tree. The text LLM reasons about 'what to do' from the structured representation alone, so it never sees the pixels, which avoids the attention bleed.
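The hand-off between the two passes can be sketched as follows. This is a minimal illustration, not a reference implementation: the JSON shape (`text`, `bbox`), the AXTree node shape (`role`, `bounds`), and the helper names `parse_pass1` and `build_pass2_prompt` are all assumptions made for this example, and the actual VLM and LLM calls are left out. The sketch shows two ideas: rejecting any Pass 1 output that carries more than text and coordinates, and giving Pass 2 a text-only prompt in which each element is paired with an AXTree role.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: the only keys Pass 1 may emit per element.
ALLOWED_KEYS = {"text", "bbox"}


@dataclass(frozen=True)
class OcrElement:
    text: str
    bbox: tuple  # (x, y, width, height) in screen pixels


def parse_pass1(raw_json):
    """Parse Pass 1 (VLM) output, rejecting anything beyond text + coordinates."""
    elements = []
    for item in json.loads(raw_json):
        extra = set(item) - ALLOWED_KEYS
        if extra:
            # An extra key such as "intent" means interpretation leaked into Pass 1.
            raise ValueError(f"unexpected keys in Pass 1 output: {sorted(extra)}")
        elements.append(OcrElement(item["text"], tuple(item["bbox"])))
    return elements


def center(bbox):
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)


def contains(bounds, point):
    x, y, w, h = bounds
    px, py = point
    return x <= px <= x + w and y <= py <= y + h


def build_pass2_prompt(elements, ax_nodes, goal):
    """Build the text-only Pass 2 prompt from OCR elements and AXTree nodes."""
    lines = [f"Goal: {goal}", "Elements:"]
    for i, el in enumerate(elements):
        cx, cy = center(el.bbox)
        # Use the role of the first AXTree node whose bounds contain the element center.
        role = next(
            (n["role"] for n in ax_nodes if contains(n["bounds"], (cx, cy))),
            "unknown",
        )
        lines.append(f"[{i}] text={el.text!r} role={role} center=({cx:.0f},{cy:.0f})")
    lines.append("Reply with the index of the element to act on.")
    return "\n".join(lines)
```

For example, calling `parse_pass1` on `'[{"text": "Cancel", "bbox": [100, 200, 80, 30]}]'` and passing the result to `build_pass2_prompt` with the node `{"role": "button", "bounds": [95, 195, 90, 40]}` yields an element line of `[0] text='Cancel' role=button center=(140,215)`. The text LLM then picks an index, and the caller maps it back to coordinates, so the label 'Cancel' is treated as data about a button and not as a statement of user intent.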
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:40:42.026507+00:00— report_created — created