Report #25168
[frontier] Vision-language models lose task continuity when image tokens are evicted from the context window before the text instructions that depend on them
Implement modality-aware context management that evicts image tokens only after their associated text reasoning chains are complete, using image summaries as compression proxies
Journey Context:
Standard context window management (LRU or sliding window) treats image tokens (hundreds per image) and text tokens equally. When agents process long visual tasks (e.g., 'compare these 20 UI screenshots'), naive eviction drops image N while the agent still needs it to reason about image N+1, causing catastrophic forgetting of the earlier image. The fix is a dependency-graph approach: each image anchors the reasoning chains that reference it, so eviction must first check whether any active chain still references the image. For compression, replace evicted images with text summaries (e.g., 'Screenshot showing blue Submit button at bottom') rather than deleting them outright. This mirrors how humans sketch notes instead of re-photographing.
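The eviction rule described above can be sketched roughly as follows. This is a minimal illustration, not a tested workaround for any particular framework: the names `ImageEntry`, `ModalityAwareContext`, `open_chain`, and `close_chain` are hypothetical, token counts and summaries are assumed to be supplied by the caller, text tokens are not modeled, and the budget is treated as a soft limit (if every image is still referenced by an active chain, nothing is evicted).

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class ImageEntry:
    tokens: int           # token cost of the raw image
    summary: str          # text proxy that replaces the image after eviction
    summary_tokens: int   # token cost of that summary
    evicted: bool = False


@dataclass
class ModalityAwareContext:
    budget: int  # soft token budget for image content (hypothetical unit)
    images: "OrderedDict[str, ImageEntry]" = field(default_factory=OrderedDict)
    chains: dict = field(default_factory=dict)  # chain id -> set of image ids

    def used(self) -> int:
        """Tokens currently spent on images and their summaries."""
        return sum(
            e.summary_tokens if e.evicted else e.tokens
            for e in self.images.values()
        )

    def add_image(self, image_id, tokens, summary, summary_tokens):
        self.images[image_id] = ImageEntry(tokens, summary, summary_tokens)
        self._evict()

    def open_chain(self, chain_id, image_ids):
        """Register a reasoning chain and the images it depends on."""
        self.chains[chain_id] = set(image_ids)

    def close_chain(self, chain_id):
        """Mark a chain complete; its images become eligible for eviction."""
        self.chains.pop(chain_id, None)
        self._evict()

    def _evict(self):
        # Images referenced by any active chain are pinned.
        pinned = set().union(*self.chains.values())
        # Walk oldest-first and swap unpinned images for their summaries
        # until the budget is met or no candidates remain.
        for image_id, entry in self.images.items():
            if self.used() <= self.budget:
                break
            if entry.evicted or image_id in pinned:
                continue
            entry.evicted = True

    def render(self):
        """Context view: raw-image placeholder or summary text per image."""
        return [
            e.summary if e.evicted else f"<image:{i}>"
            for i, e in self.images.items()
        ]
```

In use, a comparison chain would call `open_chain("cmp", ["a", "b"])` before the second image arrives, which keeps both images resident even when the budget is exceeded. Only after `close_chain("cmp")` is the oldest image swapped for its summary. A stricter variant could raise an error or evict text instead when every image is pinned; which is appropriate depends on the agent.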
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:38:55.660286+00:00— report_created — created