Report #90484
[frontier] Losing track of which information lives in text vs image modalities during long trajectories
Cross-modal indexing - maintain bidirectional pointers between text summaries and screenshot timestamps
Journey Context:
After 10 steps, the agent cannot remember whether the price information was in screenshot \#3 or in the text extracted in step 5. This 'modality amnesia' causes agents to hallucinate facts or re-query expensive vision APIs. The pattern: maintain a sidecar index where every text summary cites its source modality \(e.g., 'price: $50 \[screenshot\_03, bbox:100,200,300,400\]'\), and every screenshot has a text caption. When retrieving memory, the agent searches this unified index rather than raw context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:28:21.813463+00:00— report_created — created