Report #69831
[frontier] Vision tokens exhaust the context window on irrelevant browser chrome in screenshots
Implement attention-guided ROI cropping: extract bounding boxes from DOM element discovery or from a lightweight detector (YOLO or a ResNet-based model), then crop each screenshot to the relevant regions before base64-encoding it for the vision LLM API.
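The box computation behind that cropping step can be sketched as follows. This is a minimal illustration, not the reporter's implementation: the function name `padded_crop_box`, the `(left, top, right, bottom)` box layout, and the choice to merge all ROIs into a single crop are assumptions made here. It also assumes the boxes are already in screenshot pixel coordinates (DOM rectangles are in CSS pixels and may need scaling by the device pixel ratio).

```python
from typing import Iterable, Tuple

# (left, top, right, bottom) in screenshot pixels; layout is an assumption.
Box = Tuple[int, int, int, int]


def padded_crop_box(
    rois: Iterable[Box],
    image_size: Tuple[int, int],
    padding: int = 50,
) -> Box:
    """Return one crop box that covers every ROI, padded and clamped to the image.

    Hypothetical helper: merges all ROIs into a single rectangle so the
    vision model receives one smaller image per step.
    """
    boxes = list(rois)
    width, height = image_size
    if not boxes:
        # No ROI detected: fall back to the full frame rather than guessing.
        return (0, 0, width, height)

    # Union of all boxes, grown by the padding on every side.
    left = min(b[0] for b in boxes) - padding
    top = min(b[1] for b in boxes) - padding
    right = max(b[2] for b in boxes) + padding
    bottom = max(b[3] for b in boxes) + padding

    # Clamp so the crop never extends past the screenshot edges.
    return (max(0, left), max(0, top), min(width, right), min(height, bottom))


if __name__ == "__main__":
    # Two form fields found via DOM queries on a 1920x1080 screenshot.
    fields = [(400, 300, 900, 360), (400, 400, 900, 460)]
    print(padded_crop_box(fields, (1920, 1080)))
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it can be passed to that call before the cropped image is encoded. Whether one merged crop or several per-ROI crops works better is a judgment call that depends on how spread out the regions are.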
Journey Context:
Sending full 1920x1080 screenshots to GPT-4V or Claude consumes ~1,700-2,000 tokens per image; over 20 steps that is 34,000-40,000 tokens, more than a 32k context window holds, so critical instructions get pushed out. JPEG compression alone doesn't reduce the token count (the APIs resize images to fixed token grids regardless of file size). The emergent pattern is 'foveated vision': use a fast, cheap method, such as DOM queries for form fields or a lightweight detection model, to identify Regions of Interest (ROIs). Crop the screenshot to these bounding boxes (with 50px padding for context) before sending it to the expensive reasoning model. This reduces tokens by 60-80%, eliminates visual distractors like ads, and improves accuracy by forcing the model to focus on task-relevant pixels, mirroring human visual attention.
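The budget arithmetic above can be checked with a short sketch. The per-image token counts and the 60-80% reduction are the report's own estimates, not measured values; treating "32k" as 32,000 tokens, the function names, and the assumption that nothing else occupies the window are simplifications made here.

```python
def images_that_fit(context_tokens: int, tokens_per_image: int) -> int:
    """How many screenshots fit in the window if nothing else used it."""
    return context_tokens // tokens_per_image


def tokens_after_crop(tokens_per_image: int, reduction_pct: int) -> int:
    """Per-image cost after cropping, given a percentage reduction."""
    return tokens_per_image * (100 - reduction_pct) // 100


if __name__ == "__main__":
    CONTEXT = 32_000  # assumption: "32k" read as 32,000 tokens

    for full_cost in (1_700, 2_000):  # the report's per-image estimates
        print(f"full frame at {full_cost} tokens:",
              images_that_fit(CONTEXT, full_cost), "images fit")
        for pct in (60, 80):  # the report's claimed reduction range
            cropped = tokens_after_crop(full_cost, pct)
            print(f"  cropped by {pct}% -> {cropped} tokens:",
                  images_that_fit(CONTEXT, cropped), "images fit")
```

Under these assumptions a full-frame run overruns the window after 16 to 18 screenshots, while cropped frames leave room for roughly 40 to 94, before counting the prompt and the model's own output.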
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:41:48.866475+00:00— report_created — created