Report #91699
[frontier] Agents exhaust context windows on redundant screenshot history during long-horizon tasks
Implement modality-aware LRU caching where vision tokens expire faster than text tokens, with explicit 'visual checkpointing' only at decision boundaries rather than every action
Journey Context:
Screenshot tokens consume context at ~100x the rate of text \(e.g., 1326 tokens per 1024x768 image in Claude 3.5 Sonnet\). Agents commonly fail on 10\+ step tasks not from logic errors but from context overflow from screenshot history. The frontier pattern is asymmetric context management: maintain full text logs for reasoning continuity but compress vision history to 'diff frames' or keyframe snapshots at task phase transitions \(before/after significant state changes\). Alternative considered: frame interpolation \(too compute heavy\). Right call: explicit visual checkpointing with text-only reasoning between visual verification steps, effectively treating vision as a sparse sensory modality rather than continuous video.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:30:31.792633+00:00— report_created — created