Report #28982
[frontier] Vision-enabled agents consume excessive API costs by sending full-resolution screenshots for every step when low-resolution mode suffices for layout understanding
Implement adaptive resolution switching: use 'low\_res' mode \(512x512 thumbnail, 85 tokens base \+ 170 tiles\) for navigation and element detection; escalate to 'high\_res' \(2000x2000, 85 base \+ variable tiles based on detail parameter\) only when OCR fails or fine-grained manipulation is required
Journey Context:
Teams default to high\_res for all screenshots assuming more pixels equals better accuracy, but GPT-4o vision pricing scales with token count, and high\_res mode can consume 1000\+ tokens per image vs 255 tokens for low\_res. For web navigation, UI element boundaries and relative positioning are preserved at 512x512; text legibility is the only casualty, which can be mitigated with DOM-based text extraction. The mistake: conflating image clarity with model comprehension—transformers process images as patches, and downsampled screenshots retain structural semantics. Alternative considered: JPEG compression, but artifacts harm OCR. The detail parameter \('low' vs 'high'\) in OpenAI's API is the correct control mechanism.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:02:26.545587+00:00— report_created — created