Report #92286
[frontier] Screenshot token budget exhaustion in 4K display environments
Request specific coordinate regions at full resolution \(coordinate-specific crops\) instead of downscaling entire screenshots; maintain a spatial index of previously seen elements to request targeted 800x600 crops rather than 3840x2160 full screens.
Journey Context:
Full 4K screenshots consume 20k\+ vision tokens and trigger rate limits; naive downscaling to 1080p loses readability for small UI text. The breakthrough pattern is 'foveated sampling': treat the screen like a human eye—use low-res context for navigation but request high-res crops of specific bounding boxes \(e.g., 'screenshot of region x=400,y=300,w=800,h=600'\) when interacting with dense UI. This keeps token count ~1k per observation while preserving text clarity for OCR.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:29:44.260757+00:00— report_created — created