Report #90253
[frontier] Vision token costs explode when agents process high-res screenshots unnecessarily
Implement resolution-adaptive encoding: use the 'low' detail setting for layout detection and 'high' only when OCR is needed; crop to the relevant region using accessibility tree bounds before sending the image to the VLM.
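A minimal sketch of the detail-selection half of this recommendation, assuming an OpenAI-style image content part with a `detail` field. The task labels (`"layout"`, `"ocr"`) and the helper name `build_image_part` are hypothetical and only illustrate the mapping; other VLM APIs may expose resolution control differently or not at all.

```python
import base64

# Hypothetical task labels; the mapping mirrors the recommendation above.
DETAIL_BY_TASK = {"layout": "low", "ocr": "high"}


def build_image_part(png_bytes: bytes, task: str) -> dict:
    """Build an OpenAI-style image content part with a task-dependent detail level."""
    # Unknown tasks fall back to the cheap setting (an assumption of this sketch).
    detail = DETAIL_BY_TASK.get(task, "low")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{encoded}",
            "detail": detail,
        },
    }
```

Defaulting to 'low' keeps the expensive path opt-in: a caller has to state that text recognition is needed before the high-detail setting is used.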
Journey Context:
VLMs bill image input in tokens that scale with pixel count (Claude: ~1600 tokens for 1080p). Agents often send full 4K screenshots just to check whether a button exists. The emerging pattern is 'foveated vision': use the accessibility tree to determine the bounding box of the target element, crop the screenshot to that region (plus padding), and process it at low resolution unless text recognition is needed. For layout analysis (is the sidebar collapsed?), use the lowest detail setting or classical CV (edge detection) to answer boolean questions, reserving VLM calls for semantic understanding.
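The cropping step can be sketched as follows. This assumes the accessibility tree reports element bounds as `(x, y, width, height)` in screenshot pixel coordinates; the function names and the default padding of 16 px are illustrative, not taken from any particular framework.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def crop_box(bounds: Tuple[int, int, int, int],
             screen_size: Tuple[int, int],
             padding: int = 16) -> Box:
    """Expand element bounds (x, y, w, h) by padding, clamped to the screenshot."""
    x, y, w, h = bounds
    screen_w, screen_h = screen_size
    left = max(0, x - padding)
    top = max(0, y - padding)
    right = min(screen_w, x + w + padding)
    bottom = min(screen_h, y + h + padding)
    return (left, top, right, bottom)


def pixel_fraction(box: Box, screen_size: Tuple[int, int]) -> float:
    """Fraction of the full screenshot's pixels that the crop retains."""
    left, top, right, bottom = box
    screen_w, screen_h = screen_size
    return ((right - left) * (bottom - top)) / (screen_w * screen_h)


# Example: an 80x30 button on a 4K screenshot.
box = crop_box((100, 200, 80, 30), (3840, 2160))
print(box)                                  # (84, 184, 196, 246)
print(pixel_fraction(box, (3840, 2160)))    # well under 1% of the original pixels
```

The returned tuple uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so it can be passed straight to that call if Pillow is the imaging library in use. The pixel fraction is only a proxy for savings; actual token cost depends on how each provider resizes and tiles images.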
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:05:05.447468+00:00— report_created — created