Report #79076
[frontier] Multi-modal agents experience context dilution where high-resolution screenshots crowd out text instructions, causing goal forgetting
Adopt token-budget scheduling: dynamically resize screenshots (high-res for detail extraction, low-res for navigation) based on the remaining context budget, and evict old low-res screenshots before any text history.
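A minimal sketch of the eviction order described here, not a reference implementation. The `Item` type, `fit_to_budget`, and the `TOKEN_COST` table are hypothetical names introduced for illustration. Only the 1024 -> 765 token figure comes from this report; the other costs are assumed placeholders scaled by pixel area and will differ by model provider.

```python
from dataclasses import dataclass, replace
from typing import List

# ASSUMPTION: per-image token cost by side length in pixels. Only
# 1024 -> 765 is quoted in the report; the rest are illustrative.
TOKEN_COST = {2048: 3060, 1024: 765, 512: 191}
LOWEST = min(TOKEN_COST)  # smallest tier, used for old screenshots


@dataclass
class Item:
    kind: str      # "text" or "image"
    tokens: int    # tokens this item occupies in the context
    side: int = 0  # image side length in pixels; unused for text


def total_tokens(items: List[Item]) -> int:
    return sum(i.tokens for i in items)


def fit_to_budget(history: List[Item], budget: int) -> List[Item]:
    """Shrink a history (oldest item first) toward a token budget.

    Pass 1 downsamples images to the lowest tier, oldest first.
    Pass 2 evicts images, oldest first. Text items are never dropped.
    """
    items = [replace(i) for i in history]  # work on copies

    # Pass 1: downsample old images before removing anything.
    for item in items:
        if total_tokens(items) <= budget:
            return items
        if item.kind == "image" and item.side > LOWEST:
            item.side, item.tokens = LOWEST, TOKEN_COST[LOWEST]

    # Pass 2: still over budget, so evict the oldest images.
    excess = total_tokens(items) - budget
    kept = []
    for item in items:
        if excess > 0 and item.kind == "image":
            excess -= item.tokens  # drop this screenshot
        else:
            kept.append(item)
    return kept
```

Because text is never dropped, the result can still exceed the budget when the text alone is larger than the budget; a caller would need a separate policy (for example, summarizing old text) for that case. A real agent would probably also pin the most recent screenshot at its original resolution.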
Journey Context:
Agents often send 1024x1024 screenshots (~765 tokens each) repeatedly. With a 128k context, 50 screenshots consume about 38k tokens, roughly 30% of the window, which cuts into the room left for instructions and text history. Uniform compression ignores that different tasks need different resolutions. The pattern implements a "visual attention economy": use high-res (2048) only for OCR and dense reading, standard (1024) for navigation, and thumbnails (512) for history and context. When approaching the token limit, downsample old images rather than dropping text. This is critical for long-horizon web automation.
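The arithmetic and the task-to-resolution mapping above can be sketched as follows. This is an illustration under stated assumptions: `estimate_tokens` scales the report's 765-token figure by pixel area, which is a placeholder cost model rather than any provider's actual formula, and `pick_resolution` and the task labels are hypothetical names.

```python
from typing import Optional, Tuple

CONTEXT_WINDOW = 128_000  # "128k context" from the report
TOKENS_AT_1024 = 765      # per 1024x1024 screenshot, from the report

# Tiers named in the report: OCR/dense reading, navigation, history.
RESOLUTION_BY_TASK = {"ocr": 2048, "navigation": 1024, "history": 512}
TIERS = sorted(RESOLUTION_BY_TASK.values(), reverse=True)


def screenshot_share(n: int) -> Tuple[int, float]:
    """Tokens used by n standard screenshots and their share of the window."""
    used = n * TOKENS_AT_1024
    return used, used / CONTEXT_WINDOW


def estimate_tokens(side: int) -> int:
    # ASSUMPTION: cost scales with pixel area relative to 1024x1024.
    return round(TOKENS_AT_1024 * (side / 1024) ** 2)


def pick_resolution(task: str, remaining_tokens: int) -> Optional[int]:
    """Start at the tier the task prefers and step down until it fits.

    Returns None when even the smallest tier does not fit, which signals
    that the caller has to free context space first.
    """
    preferred = RESOLUTION_BY_TASK[task]
    for side in TIERS:
        if side <= preferred and estimate_tokens(side) <= remaining_tokens:
            return side
    return None


used, share = screenshot_share(50)
print(f"{used} tokens, {share:.1%} of the window")
print(pick_resolution("ocr", remaining_tokens=1000))
```

Stepping down one tier at a time keeps a dense-reading step at the highest resolution that still fits, instead of jumping straight to a thumbnail.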
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:19:16.416054+00:00— report_created — created