Report #70498
[frontier] Interleaved image-text agent histories waste compute on irrelevant image regions (e.g., analyzing advertisement banners while filling a form field), which bloats the context and distracts the model.
Implement 'Task-Grounded Attention Masking': use the text query or intent to generate attention masks over image patches before vision encoding. Techniques include: (1) 'Early Fusion Masking': use a lightweight text-image alignment model to mask patches unrelated to the query; (2) 'Region-of-Interest Cropping': parse the DOM or use OCR to extract only the relevant UI element regions; (3) 'Perceiver Resampler with Query Conditioning': use the text embedding as the query to resample visual tokens, reducing thousands of patches to fewer than 100 task-relevant tokens.
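A minimal sketch of technique (1), assuming patch embeddings and a query embedding already live in a shared space produced by some alignment model. The function names `cosine` and `select_patches`, the `keep` budget, and the toy 2-D vectors are hypothetical and only illustrate the idea of ranking patches by similarity to the query and dropping the rest before the main encoder runs.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_patches(
    patch_embs: Sequence[Sequence[float]],
    query_emb: Sequence[float],
    keep: int,
) -> List[int]:
    """Return indices of the `keep` patches most similar to the query.

    Indices come back in ascending order so the surviving patches keep
    their original spatial ordering. Assumption: embeddings come from a
    text-image alignment model not shown here.
    """
    scores = [cosine(p, query_emb) for p in patch_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: four patches, query aligned with the first axis.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
print(select_patches(patches, query, keep=2))  # [0, 2]
```

A fixed `keep` budget gives a predictable token count; a similarity threshold would be an alternative if the number of relevant patches varies widely between steps.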
Journey Context:
Standard VLMs weight every image region equally, but agent tasks usually concern a small part of the screen, so encoding an entire 1920x1080 screenshot as a uniform patch grid is wasteful. The key idea is visual attention masking, borrowed from foveation in human perception. DOM-based agents already do this implicitly, but pure-vision agents need an equivalent mechanism. 'Set-of-marks' is a primitive form; dynamic masking driven by query semantics is the more advanced form. Perceiver resamplers allow dynamic compression conditioned on the query. Tradeoff: masking or cropping requires an additional model forward pass, but can reduce main VLM compute by roughly 5-10x.
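To make the waste concrete, the following back-of-the-envelope sketch counts patches for a full 1920x1080 screenshot versus a cropped form field. The 16-pixel patch size, the bounding box, and the helper names `patch_count` and `crop_patch_count` are assumptions chosen for illustration, not values from the report.

```python
import math
from typing import Tuple


def patch_count(width: int, height: int, patch: int = 16) -> int:
    """Patches needed to tile a width x height image (partial patches round up)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def crop_patch_count(
    box: Tuple[int, int, int, int], patch: int = 16, pad: int = 0
) -> int:
    """Patches for a (left, top, right, bottom) crop with optional padding."""
    left, top, right, bottom = box
    return patch_count(right - left + 2 * pad, bottom - top + 2 * pad, patch)


full = patch_count(1920, 1080)                   # whole screenshot
field = crop_patch_count((600, 400, 1000, 440))  # hypothetical 400x40 form field
print(full, field)  # 8160 75
```

The raw token ratio overstates the end-to-end saving, since the masking or cropping step adds its own forward pass and a crop usually needs some surrounding context to be interpretable.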
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:55:04.598726+00:00— report_created — created