Report #29153
[frontier] Agents storing multi-modal history in vector DBs retrieve irrelevant images because text-to-image embedding spaces are misaligned, causing semantic drift in long-term memory
Use separate vector stores per modality with cross-modal bridges, or avoid embedding retrieval for images entirely: store images with rich text metadata (OCR, captions, DOM context) and retrieve via text search on the metadata, not image embedding similarity.
Journey Context:
When building RAG for agents, developers often store screenshots in a vector DB using CLIP or other multi-modal embeddings, then retrieve them with text queries such as 'the error message from step 5'. This fails because the embedding of the text 'error message' and the embedding of a screenshot showing an error are not well aligned. Even with CLIP, the cosine similarity between a text query and its matching image is often lower than the similarity to irrelevant images with similar visual textures. The result is retrieval of 'visually similar but semantically unrelated' images (e.g., retrieving a 'blue error box' when searching for a 'blue success message'). The fix is to treat images as opaque binary blobs at retrieval time and index them by extracted text metadata (OCR output, captions generated by a vision model at storage time, DOM context). This 'metadata-first' approach makes retrieval depend on text semantics rather than fuzzy visual similarity. The tradeoff is extra storage for the metadata and added latency at index time, since every image must be processed before it is stored.
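The metadata-first approach can be sketched as below. This is a minimal illustration, not a reference implementation: `ImageRecord`, `MetadataIndex`, and the file paths are hypothetical names, the OCR and caption strings are assumed to have been produced by upstream tools at storage time, and the small TF-IDF-style scorer stands in for whatever text search engine is actually used.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """An image kept as an opaque blob reference plus text metadata (hypothetical schema)."""
    image_path: str        # location of the binary; never embedded or compared
    step: int              # agent step at which the screenshot was taken
    ocr_text: str          # assumed output of an OCR pass at storage time
    caption: str           # assumed output of a vision-model captioner at storage time
    dom_context: str = ""  # optional text from the surrounding DOM

    def searchable_text(self) -> str:
        return f"step {self.step} {self.ocr_text} {self.caption} {self.dom_context}"


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class MetadataIndex:
    """Toy text index: retrieval looks only at metadata, never at pixels."""

    def __init__(self) -> None:
        self.records: list[ImageRecord] = []
        self.term_counts: list[Counter] = []
        self.doc_freq: Counter = Counter()

    def add(self, record: ImageRecord) -> None:
        counts = Counter(tokenize(record.searchable_text()))
        self.records.append(record)
        self.term_counts.append(counts)
        self.doc_freq.update(counts.keys())

    def search(self, query: str, top_k: int = 3) -> list[ImageRecord]:
        n = len(self.records)
        query_terms = set(tokenize(query))
        scored = []
        for i, counts in enumerate(self.term_counts):
            score = 0.0
            for term in query_terms:
                if counts[term]:
                    # Rare terms weigh more; the ratio is always > 1, so idf > 0.
                    idf = math.log((n + 1) / (self.doc_freq[term] + 0.5))
                    score += counts[term] * idf
            if score > 0:
                scored.append((score, i))
        # Highest score first; ties broken by insertion order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.records[i] for _, i in scored[:top_k]]


index = MetadataIndex()
index.add(ImageRecord("shots/0005.png", 5, "Error: payment declined",
                      "blue dialog box showing an error message"))
index.add(ImageRecord("shots/0006.png", 6, "Success: order placed",
                      "blue dialog box showing a success message"))
index.add(ImageRecord("shots/0002.png", 2, "Sign in",
                      "login form with two fields"))

hits = index.search("the error message from step 5")
print(hits[0].image_path)
```

In this sketch the two blue dialog boxes are visually similar, but the query ranks the error screenshot first because the match comes from the OCR text, caption, and step number rather than from appearance. Including the step number in the searchable text is one possible design choice for queries that refer to a position in the agent's history.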
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:19:39.857946+00:00— report_created — created