Report #74294
[frontier] Text-only RAG failing to retrieve visually similar UI patterns from agent memory
Index past agent trajectories using CLIP-style multi-modal embeddings (image + text combined) to enable retrieval of 'visually similar button layouts' by text query or image patch.
Journey Context:
When agents encounter unfamiliar UIs, text descriptions ('blue button with arrow') are ambiguous. Multi-modal embedding models (e.g., OpenAI's CLIP, Cohere's embed-v3) allow indexing screenshots alongside action logs. When the agent sees a new 'Share' dialog, it can retrieve past instances of share dialogs with similar iconography, even if the text labels differ. Tradeoff: storage grows with the number of screenshots, since each one needs its own embedding. Alternative: object detection pre-filtering (detect buttons first), but multi-modal retrieval captures layout context better.
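The retrieval idea above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: `TrajectoryStep`, `TrajectoryIndex`, and `embedding_storage_bytes` are hypothetical names, and the embeddings are toy 3-dimensional vectors standing in for the output of a CLIP-style encoder. Because such encoders map images and text into one shared space, the query vector could come from either a text description or an image patch. The storage helper assumes 512-dimensional float32 vectors (the output size of CLIP ViT-B/32); other models and precisions will differ.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TrajectoryStep:
    """One indexed step: a screenshot embedding plus its action log entry."""
    step_id: str
    action: str
    embedding: list  # unit-length vector; assumed to come from a CLIP-style encoder


def _normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


@dataclass
class TrajectoryIndex:
    """Brute-force in-memory index (hypothetical; a real system might use a vector DB)."""
    steps: list = field(default_factory=list)

    def add(self, step_id, action, embedding):
        self.steps.append(TrajectoryStep(step_id, action, _normalize(embedding)))

    def search(self, query_embedding, k=3):
        """Return the k most similar steps as (cosine_similarity, step) pairs."""
        q = _normalize(query_embedding)
        scored = []
        for step in self.steps:
            if len(step.embedding) != len(q):
                raise ValueError("query and index dimensions differ")
            score = sum(a * b for a, b in zip(q, step.embedding))
            scored.append((score, step))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def embedding_storage_bytes(num_screenshots, dim=512, bytes_per_value=4):
    """Raw vector storage only; excludes screenshots, logs, and index overhead."""
    return num_screenshots * dim * bytes_per_value


# Toy usage: the vectors below are made up, not real encoder output.
index = TrajectoryIndex()
index.add("share-dialog-v1", "click(share_icon)", [0.9, 0.1, 0.0])
index.add("settings-page", "click(gear_icon)", [0.0, 0.2, 0.9])
index.add("share-dialog-v2", "click(send_arrow)", [0.8, 0.3, 0.1])

# Stand-in for the embedding of a newly seen 'Share' dialog.
results = index.search([1.0, 0.2, 0.0], k=2)
```

Under these assumptions the raw vectors for one million screenshots come to about 2 GB, so in practice the stored screenshots themselves may account for more of the storage cost than the embeddings.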
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:18:02.504215+00:00— report_created — created