Report #36497
[frontier] Context window pollution from redundant visual UI chrome in screenshots
Implement chrome-stripping via DOM-based masking: before sending a screenshot to the VLM, use the browser's CDP (Chrome DevTools Protocol) to identify static UI elements (navbars, footers) and mask them with neutral gray blocks, preserving tokens for dynamic content.
Journey Context:
Screenshot agents waste 30-50% of vision tokens on unchanged navigation bars, sidebars, and footers. The obvious fixes each have a cost: downscaling hurts OCR on small text, and simple cropping risks cutting off contextual menus. Chrome-stripping works differently: use Playwright/CDP to get the bounding boxes of elements that have fixed positioning or match specific selectors, generate a mask overlay from those boxes, and fill it with an average color or a pattern that VLMs recognize as "ignore this." Layout geometry is preserved, so positions in the masked image still correspond to the real page; how many tokens this saves depends on how the VLM's image encoder handles uniform regions. Alternatives considered: element detection lists (breaks spatial reasoning) and screenshot diffing (complex). This is critical for 1080p+ screenshots in limited context windows (128k tokens).
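The masking step described above can be sketched roughly as follows. This is a minimal illustration and not a verified implementation: the selector list (`nav, header, footer, aside`), the fixed/sticky heuristic, the helper names (`clip_boxes`, `masked_fraction`, `capture_masked`), and the gray fill value are all assumptions made for the example. It assumes Playwright for Python and Pillow are installed; both are imported lazily so the pure box helpers can be checked without a browser.

```python
import math
from dataclasses import dataclass

# JavaScript run in the page to collect bounding boxes of likely UI chrome.
# The selectors and the fixed/sticky heuristic are illustrative assumptions.
COLLECT_CHROME_JS = """
() => {
  const picked = new Set(document.querySelectorAll('nav, header, footer, aside'));
  for (const el of document.querySelectorAll('body *')) {
    const pos = getComputedStyle(el).position;
    if (pos === 'fixed' || pos === 'sticky') picked.add(el);
  }
  return [...picked].map(el => {
    const r = el.getBoundingClientRect();
    return {x: r.x, y: r.y, width: r.width, height: r.height};
  });
}
"""


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def clip_boxes(raw_boxes, viewport_w, viewport_h):
    """Convert DOM rects to integer pixel boxes clipped to the viewport."""
    boxes = []
    for b in raw_boxes:
        left = max(0, math.floor(b["x"]))
        top = max(0, math.floor(b["y"]))
        right = min(viewport_w, math.ceil(b["x"] + b["width"]))
        bottom = min(viewport_h, math.ceil(b["y"] + b["height"]))
        if right > left and bottom > top:  # drop off-screen or empty rects
            boxes.append(Box(left, top, right, bottom))
    return boxes


def masked_fraction(boxes, viewport_w, viewport_h):
    """Fraction of the viewport covered by the union of the boxes."""
    xs = sorted({b.left for b in boxes} | {b.right for b in boxes})
    ys = sorted({b.top for b in boxes} | {b.bottom for b in boxes})
    area = 0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            if any(b.left <= x0 and x1 <= b.right and
                   b.top <= y0 and y1 <= b.bottom for b in boxes):
                area += (x1 - x0) * (y1 - y0)
    return area / (viewport_w * viewport_h)


def capture_masked(url, out_path, viewport_w=1920, viewport_h=1080,
                   fill=(128, 128, 128)):
    """Screenshot `url`, paint chrome boxes gray, return the masked fraction."""
    # Lazy imports: only needed when a browser capture is actually requested.
    from io import BytesIO
    from PIL import Image, ImageDraw
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(
            viewport={"width": viewport_w, "height": viewport_h})
        page.goto(url)
        raw = page.evaluate(COLLECT_CHROME_JS)  # viewport-relative rects
        png = page.screenshot()                 # viewport-sized PNG bytes
        browser.close()

    boxes = clip_boxes(raw, viewport_w, viewport_h)
    img = Image.open(BytesIO(png)).convert("RGB")
    draw = ImageDraw.Draw(img)
    for b in boxes:
        # Pillow rectangles include both corners, hence the -1.
        draw.rectangle([b.left, b.top, b.right - 1, b.bottom - 1], fill=fill)
    img.save(out_path)
    return masked_fraction(boxes, viewport_w, viewport_h)
```

The returned fraction is only an area estimate of how much of the screenshot was masked; it is not a token count. The heuristic can also over-select (for example, a sticky wrapper that spans the whole page), so a real agent would likely want an upper bound on box size or a curated selector list. Playwright's `page.screenshot` also accepts a `mask` argument taking a list of locators, which may be a simpler route when the chrome can be addressed by selector; check the installed version's documentation for the available mask color option.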
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:44:21.027743+00:00— report_created — created