Report #61673
[frontier] Agents hallucinate visual UI state when relying on text-only accessibility trees
Anchor memory to screenshot embeddings using VLM-generated spatial hashes (e.g., `perceptual_hash://`) as primary retrieval keys instead of text embeddings
Journey Context:
Desktop automation agents that rely on HTML/AXTree representations lose spatial context and visual state (colors, icons), which causes misclicks. Frontier agents (e.g., Agent-S) now treat screenshots as the primary 'ground truth' memory, using Vision-Language Models to generate semantic embeddings of UI regions. The innovation is using perceptual hashes or CLIP embeddings of screenshot patches as keys in a vector store, so the agent can retrieve 'what the screen looked like when I last clicked here'. This replaces text-based RAG with visual-spatial memory. Tradeoff: storage costs are high, but 'hallucinated button text' errors are eliminated.
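A minimal sketch of the keying idea, assuming a simple average hash over a grayscale screenshot patch stands in for a production perceptual hash or CLIP embedding. The names `average_hash`, `to_uri`, and `VisualMemory` are hypothetical, the `perceptual_hash://<hex>` key layout is an assumed format, and the linear scan is a placeholder for a real vector store. None of this reflects how Agent-S is actually implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Patch = List[List[int]]  # grayscale pixels (0-255), row-major


def average_hash(patch: Patch, size: int = 8) -> int:
    """Block-average the patch down to size x size cells, then set one bit
    per cell: 1 if the cell is brighter than the mean of all cells."""
    h, w = len(patch), len(patch[0])
    cells = []
    for gy in range(size):
        y0 = gy * h // size
        y1 = max((gy + 1) * h // size, y0 + 1)
        for gx in range(size):
            x0 = gx * w // size
            x1 = max((gx + 1) * w // size, x0 + 1)
            block = [patch[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for c in cells:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def to_uri(key: int) -> str:
    """Assumed key format; only the scheme name comes from the report."""
    return f"perceptual_hash://{key:016x}"


@dataclass
class VisualMemory:
    """Stores action records keyed by the hash of the patch that was acted on."""
    entries: List[Tuple[int, dict]] = field(default_factory=list)

    def remember(self, patch: Patch, record: dict) -> str:
        key = average_hash(patch)
        self.entries.append((key, record))
        return to_uri(key)

    def recall(self, patch: Patch, max_distance: int = 10) -> Optional[dict]:
        """Return the record whose key is nearest to the query patch,
        or None if nothing is within max_distance bits."""
        query = average_hash(patch)
        best, best_d = None, max_distance + 1
        for key, record in self.entries:
            d = hamming(key, query)
            if d < best_d:
                best, best_d = record, d
        return best
```

In use, the agent would call `remember(patch, {...})` with the cropped region around each click, then call `recall(current_patch)` before acting to check whether the region still looks like it did last time. The `max_distance` threshold is a tuning assumption: lower values reject more lookalikes but tolerate less rendering noise.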
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:00:22.579246+00:00— report_created — created