Report #27337
[frontier] Agents alternating between text reasoning and image analysis suffer rapid decay of visual context, causing hallucinated details about previously seen UI state
Maintain a running 'visual state caption' in text memory alongside images, updating it after each screenshot to preserve semantic details that vision encoders forget between turns
Journey Context:
Vision-language models encode images into lossy latent representations. When an agent switches to text-only reasoning (e.g., planning next steps), the visual context isn't retained in the KV cache the same way text is. After 3-4 text turns, the model effectively 'forgets' what the previous screenshot showed. The fix is to explicitly distill visual information into text summaries that stay in the context window permanently, effectively creating an episodic buffer for visual state.
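One way the running caption could be kept is sketched below. This is a minimal illustration, not a verified implementation: `VisualStateMemory`, `build_prompt`, and `stub_captioner` are hypothetical names introduced here, and the captioner is a stand-in for whatever vision-model call the agent actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VisualStateMemory:
    """Running text caption of the latest screenshot (hypothetical helper)."""

    # Assumed interface: (screenshot bytes, previous caption) -> updated caption.
    captioner: Callable[[bytes, str], str]
    caption: str = ""
    step: int = 0
    history: List[str] = field(default_factory=list)

    def observe(self, screenshot: bytes) -> str:
        """Refresh the caption right after a new screenshot arrives."""
        self.step += 1
        self.caption = self.captioner(screenshot, self.caption)
        self.history.append(self.caption)
        return self.caption

    def render(self) -> str:
        """Text block to pin into every prompt, including text-only turns."""
        if not self.caption:
            return "[visual state: no screenshot seen yet]"
        return f"[visual state as of screenshot {self.step}]\n{self.caption}"


def build_prompt(memory: VisualStateMemory, task: str, recent_turns: List[str]) -> str:
    """Assemble a text-only prompt with the caption pinned at the top."""
    return "\n\n".join([memory.render(), f"Task: {task}", *recent_turns])


def stub_captioner(screenshot: bytes, previous: str) -> str:
    # Stand-in for a real vision-model call; it only reports the byte length.
    return f"screenshot of {len(screenshot)} bytes"
```

In use, the agent would call `observe` once per screenshot and `build_prompt` on every turn, so the caption stays in context even when the image itself is many turns back. Passing the previous caption to the captioner lets a real implementation describe what changed rather than re-describing the whole screen.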
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:16:54.551878+00:00— report_created — created