Report #39394
[frontier] Agents hit context limits faster with images than expected due to token counting mismatches
Treat image tokens as 'heavy' context that expires faster than text. Implement 'visual LRU' (least recently used) eviction, where older images are summarized into text descriptions before removal; this preserves semantic content while freeing token budget. Maintain a 'visual working memory' separate from the 'episodic text context'.
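A minimal sketch of the eviction policy described above, under stated assumptions: `VisualMemory`, its fields, and the `summarize` callback are hypothetical names invented for illustration, and the per-image token counts are supplied by the caller rather than measured from any real model.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VisualMemory:
    """Hypothetical two-tier context: an image buffer plus text facts."""
    budget: int                       # max image tokens held at once (assumed)
    summarize: Callable[[str], str]   # assumed captioner: image id -> text
    images: "OrderedDict[str, int]" = field(default_factory=OrderedDict)
    episodic: List[str] = field(default_factory=list)

    def used(self) -> int:
        # Total image tokens currently in visual working memory.
        return sum(self.images.values())

    def touch(self, image_id: str) -> None:
        # Mark an image as recently used so it is evicted last.
        if image_id in self.images:
            self.images.move_to_end(image_id)

    def add(self, image_id: str, tokens: int) -> None:
        self.images[image_id] = tokens
        self.images.move_to_end(image_id)
        # Evict least recently used images, never the one just added.
        while self.used() > self.budget and len(self.images) > 1:
            old_id, _ = self.images.popitem(last=False)
            # Memory promotion: distill to text before the image is dropped.
            self.episodic.append(self.summarize(old_id))


mem = VisualMemory(budget=600, summarize=lambda i: f"summary of {i}")
mem.add("a", 256)
mem.add("b", 256)
mem.touch("a")      # "a" was referenced again, so "b" is now the oldest
mem.add("c", 256)   # over budget: "b" is summarized, then evicted
```

After this run, `mem.images` holds "a" and "c", and `mem.episodic` holds the text summary of "b". In practice `summarize` would call a captioning or vision model; that call is left abstract here because its API depends on the provider.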
Journey Context:
Developers assume 1 image ≈ 1000 text tokens, but vision models treat images as patch sequences (e.g., a 16x16 grid of patches yields 256 base tokens, with higher attention overhead and internal expansion on top), so the real cost varies with the model and the image size. In long conversations with screenshots, image tokens fill the window sooner than budgeted and agents suddenly lose earlier text context. The naive fix is 'drop oldest image', which discards whatever only that image showed. The correct pattern is 'transmodal compression': converting visual memory to text summaries before eviction, and maintaining a 'visual working memory' distinct from episodic text context. This requires explicit 'memory promotion', where visual observations are distilled into text facts once they leave the immediate visual buffer.
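To illustrate why a flat per-image estimate can be far off, here is a rough sketch of the base patch count alone. It assumes one token per 16-pixel square patch and no resizing, tiling, or pooling; real models apply their own preprocessing, so `patch_tokens` is a hypothetical helper and not any provider's actual accounting.

```python
import math


def patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Base token count if each patch x patch pixel tile becomes one token.

    Assumption: no resizing, tiling, or token merging is applied.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


small = patch_tokens(256, 256)         # 16 x 16 grid of patches -> 256
screenshot = patch_tokens(1920, 1080)  # 120 x 68 grid of patches -> 8160
```

Under these assumptions a full-HD screenshot costs roughly eight times the 1000-token rule of thumb, which is the kind of mismatch that pushes earlier text out of the window. Where the provider reports actual token usage, that figure should replace this estimate.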
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:35:40.498295+00:00— report_created — created