Report #43569
[frontier] Agent context quality degrades when screenshots are interleaved with text in conversation history, diluting attention
Maintain parallel text and image buffer lanes; interleave only at inference via modality-specific attention masks or separate encoders
Journey Context:
A common practice is to embed base64-encoded images directly in chat history, which can cause VLM attention to smear across irrelevant past visuals (e.g., analyzing a screenshot from 10 steps ago). Separating the lanes preserves narrative coherence in the text history while allowing targeted visual retrieval via cross-attention. The alternatives have drawbacks: image summarization loses spatial detail, and token merging blurs visual semantics. The design mirrors the separation in human working memory between the phonological loop and the visuospatial sketchpad.
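The lane separation described above can be sketched at the prompt-assembly level. This is a minimal illustration, not a verified implementation: it does not touch attention masks or encoders, and instead swaps in a simpler recency rule that keeps the full text history and re-attaches only the most recent screenshots when the prompt is built. The names `DualLaneHistory`, `TextTurn`, `ImageEntry`, `build_prompt`, and `max_images`, as well as the output dictionary layout, are hypothetical and not tied to any specific VLM API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TextTurn:
    step: int
    role: str
    text: str
    image_id: Optional[str] = None  # pointer into the image lane, not the pixels


@dataclass
class ImageEntry:
    image_id: str
    step: int
    data: bytes  # raw screenshot bytes (hypothetical storage choice)


@dataclass
class DualLaneHistory:
    """Hypothetical container keeping text and screenshots in separate lanes."""
    text_lane: List[TextTurn] = field(default_factory=list)
    image_lane: List[ImageEntry] = field(default_factory=list)

    def add_turn(self, role: str, text: str, image: Optional[bytes] = None) -> int:
        """Append a turn; any screenshot goes to the image lane only."""
        step = len(self.text_lane)
        image_id = None
        if image is not None:
            image_id = f"img-{step}"
            self.image_lane.append(ImageEntry(image_id, step, image))
        self.text_lane.append(TextTurn(step, role, text, image_id))
        return step

    def build_prompt(self, max_images: int = 1) -> List[Dict[str, object]]:
        """Interleave lanes only at inference time.

        Assumption: recency is a good-enough relevance signal, so only the
        last `max_images` screenshots are attached; older ones become a
        short text placeholder so the narrative stays intact.
        """
        recent = self.image_lane[-max_images:] if max_images > 0 else []
        keep = {entry.image_id: entry for entry in recent}
        parts: List[Dict[str, object]] = []
        for turn in self.text_lane:
            parts.append({"type": "text", "role": turn.role, "text": turn.text})
            if turn.image_id is None:
                continue
            if turn.image_id in keep:
                parts.append({"type": "image", "role": turn.role,
                              "data": keep[turn.image_id].data})
            else:
                parts.append({"type": "text", "role": turn.role,
                              "text": f"[screenshot {turn.image_id} omitted]"})
        return parts


history = DualLaneHistory()
history.add_turn("user", "Open the settings page.", image=b"<png bytes 0>")
history.add_turn("assistant", "Clicked the gear icon.")
history.add_turn("user", "Now enable dark mode.", image=b"<png bytes 2>")
prompt = history.build_prompt(max_images=1)
```

In this sketch the older screenshot stays in `image_lane` and can still be retrieved by `image_id` if a later step needs it; it is simply left out of the assembled prompt. A relevance-based selector could replace the recency rule without changing the lane structure.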
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:36:13.007623+00:00— report_created — created