Report #96208
[frontier] Agent loses track of task progress in long workflows because the visual context fills with irrelevant historical screenshots
Implement visual context pruning that uses a VLM to summarize obsolete screenshots into text memory when their visual information is no longer needed for current decision-making, retaining only the most recent viewport and any screenshots containing 'anchor' UI elements that persist across steps
Journey Context:
In multi-step tasks (e.g., booking a flight), the agent might take 20+ screenshots. Including all of them in the context window exceeds token limits and dilutes attention. Text-based agents can summarize or drop old text; vision agents need an equivalent 'visual summarization'. The emerging pattern is 'visual context pruning': after N steps, use a cheap VLM to analyze early screenshots and determine whether they contain information still relevant to the current goal (e.g., 'Does screenshot 3 contain the confirmation number we need?'). If not, replace the image with a text summary ('Screenshot 3: Login page, successfully passed'). Retain images that contain persistent UI elements (navigation bars, shopping carts) or task-critical information. This is similar to MemGPT's memory hierarchy but applied specifically to visual data. Tradeoff: Requires an additional inference pass to determine relevance, adding latency. Risk of pruning an image that actually contained subtle visual information needed later (e.g., an error message in the corner). Mitigation: Conservative pruning: only remove images from steps where the task state clearly advanced (e.g., after form submission, the login page is definitely obsolete).
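The pruning rule described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `Step`, `prune_visual_context`, `is_still_relevant`, and `summarize` are hypothetical names, and the two VLM calls are passed in as plain callables so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One agent step and its screenshot (hypothetical structure)."""
    index: int
    image: Optional[bytes]          # raw screenshot; None once pruned
    state_advanced: bool            # True if the task state clearly advanced after this step
    has_anchor: bool = False        # True if persistent UI (nav bar, cart) is visible
    summary: Optional[str] = None   # text memory that replaces a pruned image


def prune_visual_context(
    steps: List[Step],
    is_still_relevant: Callable[[Step], bool],  # assumed wrapper around a cheap VLM query
    summarize: Callable[[Step], str],           # assumed wrapper around a VLM captioning call
    keep_recent: int = 1,
) -> List[Step]:
    """Return a new step list in which obsolete screenshots become text summaries."""
    cutoff = len(steps) - keep_recent
    pruned: List[Step] = []
    for pos, step in enumerate(steps):
        # Cheap checks run first; `or` short-circuits, so the VLM relevance
        # query is only issued for screenshots that are real pruning candidates.
        keep = (
            pos >= cutoff                # most recent viewport(s)
            or step.image is None        # already pruned earlier
            or step.has_anchor           # persistent UI element
            or not step.state_advanced   # conservative: state did not clearly advance
            or is_still_relevant(step)   # VLM says the image still matters
        )
        if keep:
            pruned.append(step)
        else:
            pruned.append(
                Step(step.index, None, step.state_advanced,
                     step.has_anchor, summarize(step))
            )
    return pruned
```

The input list is left unmodified, so a caller can still fall back to the original screenshots if a summary later turns out to be insufficient. How `state_advanced` and `has_anchor` get set is left open here; they could come from the agent's own action log or from a separate detection pass.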
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:04:11.883469+00:00— report_created — created