Report #53446
[frontier] Agent forgets earlier UI state after long task sequences due to image eviction from multimodal context window
Convert key screenshots to structured text representations (pseudo-HTML state descriptors) at checkpoint intervals; preserve these text descriptions during context compression instead of raw images
Journey Context:
Standard context management drops images first when summarizing, but GUI agents then lose critical state information (e.g., 'was the checkbox checked in step 3?'). Keeping every screenshot exhausts token limits. Converting screenshots to structured text (DOM-like representations produced by vision-to-text models) preserves state with roughly 10x better token efficiency than raw images. This beats simple captioning, which loses spatial relationships. The tradeoff is compute cost for conversion vs. context retention.
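
A minimal sketch of the checkpoint-and-compress idea described above, not a reference implementation. Every name here (`Turn`, `render_state`, `compress`, `checkpoint_every`, `keep_recent`) is hypothetical, and `describe` stands in for whatever vision-to-text model produces the descriptor; the report does not specify one. The element attributes (`id`, `checked`, `x`, `y`) are assumed for illustration only.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, List, Optional


@dataclass
class Turn:
    step: int
    text: str
    image: Optional[bytes] = None       # raw screenshot, if any
    state_text: Optional[str] = None    # pseudo-HTML descriptor, if converted


def render_state(elements: List[dict]) -> str:
    """Render detected UI elements as one pseudo-HTML line per element.

    Coordinates are kept as attributes so spatial relationships survive,
    which a free-form caption would lose.
    """
    lines = []
    for el in elements:
        attrs = " ".join(
            f'{k}="{escape(str(v))}"' for k, v in sorted(el.items()) if k != "tag"
        )
        lines.append(f"<{el['tag']} {attrs}/>")
    return "\n".join(lines)


def compress(
    turns: List[Turn],
    describe: Callable[[bytes], str],   # assumed vision-to-text hook
    checkpoint_every: int = 5,
    keep_recent: int = 2,
) -> List[Turn]:
    """Evict old raw images, converting checkpoint screenshots to text first."""
    cutoff = len(turns) - keep_recent
    out = []
    for i, t in enumerate(turns):
        if t.image is None or i >= cutoff:
            out.append(t)               # nothing to evict, or still recent
            continue
        state = t.state_text
        if state is None and t.step % checkpoint_every == 0:
            state = describe(t.image)   # pay conversion cost only at checkpoints
        out.append(Turn(t.step, t.text, image=None, state_text=state))
    return out


def fake_describe(image: bytes) -> str:
    # Stand-in for a real model call; returns a fixed descriptor.
    return render_state(
        [{"tag": "checkbox", "id": "tos", "checked": True, "x": 40, "y": 212}]
    )


history = [Turn(step=i, text=f"step {i}", image=b"png") for i in range(1, 11)]
compacted = compress(history, fake_describe)
```

With these defaults, the two most recent turns keep their raw screenshots, step 5 keeps a text descriptor in place of its image, and the other older turns keep only their text. Converting only at checkpoints is one way to bound the conversion compute the report names as the tradeoff; a real agent might instead convert on detected state changes.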
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:12:26.833254+00:00— report_created — created