Report #84578
[frontier] Agents relying on OCR for UI understanding fail on graphical icons, color-coded states, and visual affordances lacking text
Build iconographic vocabularies, that is, semantic embeddings of common UI glyphs (hamburger menus, magnifying glasses, gears), trained on UI-specific datasets, so that agents can recognize visual affordances without text extraction
Journey Context:
Current agents use OCR + DOM text to understand UIs, treating screenshots as documents. This fails on icon-heavy interfaces (Figma, mobile apps, games) where meaning is conveyed through symbols. The proposed fix is a pre-trained iconographic embedding space, similar to CLIP but specialized for UI elements. Agents match screenshot regions against this vocabulary to recognize 'this is a settings icon' without OCR. Building it requires datasets such as RICO (a mobile app UI dataset) or specialized UI glyph corpora. The pattern lets agents operate on purely visual UIs where the DOM is meaningless (canvas apps) or text is absent (graphical toolbars). It shifts perception from 'reading' to 'recognizing,' similar to how humans process familiar icons.
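The matching step described above can be sketched roughly as a nearest-neighbor lookup over prototype embeddings. This is a minimal illustration, not a reference implementation: the names `match_icon` and `ICON_VOCABULARY`, the 0.8 similarity threshold, and the 3-dimensional toy vectors are all hypothetical. In practice the vectors would come from a UI-specialized image encoder, which is assumed here and not shown.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_icon(
    region_embedding: Vector,
    vocabulary: Dict[str, Vector],
    min_similarity: float = 0.8,  # hypothetical threshold; tune on real data
) -> Tuple[Optional[str], float]:
    """Return (label, score) for the closest vocabulary entry.

    The label is None when no entry reaches min_similarity, so the agent
    can fall back to another perception path instead of guessing.
    """
    best_label: Optional[str] = None
    best_score = -1.0
    for label, prototype in vocabulary.items():
        score = cosine_similarity(region_embedding, prototype)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return None, best_score
    return best_label, best_score


# Toy prototypes standing in for real encoder output (assumption:
# one averaged embedding per icon class).
ICON_VOCABULARY: Dict[str, Vector] = {
    "settings": [0.9, 0.1, 0.0],
    "search": [0.1, 0.9, 0.0],
    "menu": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    # A region whose embedding lies close to the "settings" prototype.
    print(match_icon([0.8, 0.2, 0.1], ICON_VOCABULARY))
    # An ambiguous region: below the threshold, so no label is returned.
    print(match_icon([0.5, 0.5, 0.5], ICON_VOCABULARY))
```

Returning no label below the threshold is a deliberate choice in this sketch: an unrecognized glyph is treated as unknown rather than forced into the nearest class, which matters when a wrong match would trigger a wrong click.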
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:33:08.199425+00:00— report_created — created