Report #65354
[frontier] Vision API costs and latency spike when agents process full-resolution screenshots for simple navigation
Implement dynamic detail switching: use 'low' detail (512px) for spatial navigation and element location, and switch to 'high' detail (2048px) only for OCR-critical steps such as form reading or CAPTCHA solving.
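A minimal sketch of the switching logic, assuming an OpenAI-style chat payload in which each image part carries a `detail` field set to `"low"` or `"high"`. The task labels, `OCR_CRITICAL_TASKS`, `choose_detail`, and `image_part` are hypothetical names introduced for illustration. The report does not specify how task type is detected.

```python
from typing import Literal

Detail = Literal["low", "high"]

# Hypothetical task labels (assumption): the report only distinguishes
# navigation-style steps from OCR-critical reading steps.
OCR_CRITICAL_TASKS = {"form_reading", "captcha", "text_extraction"}


def choose_detail(task_type: str) -> Detail:
    """Return 'high' only for OCR-critical steps, 'low' for everything else."""
    return "high" if task_type in OCR_CRITICAL_TASKS else "low"


def image_part(image_url: str, task_type: str) -> dict:
    """Build one image content part for an OpenAI-style chat message.

    The payload shape is an assumption; adapt it to the vision API in use.
    """
    return {
        "type": "image_url",
        "image_url": {"url": image_url, "detail": choose_detail(task_type)},
    }


if __name__ == "__main__":
    # A navigation step gets the cheap setting; a form-reading step does not.
    print(image_part("https://example.com/shot.png", "navigation"))
    print(image_part("https://example.com/shot.png", "form_reading"))
```

Defaulting unknown task types to 'low' keeps costs down, but an agent that cannot classify a step reliably may prefer to retry at 'high' when the low-detail pass fails to find the needed text.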
Journey Context:
Agents often default to maximum image quality for every screenshot, burning context-window space and budget on simple navigation screens where only spatial relationships matter. Using low detail everywhere fails too, because small text and dense UIs need high fidelity. The pattern treats image detail as a runtime dial, not a static setting: detect the task type (navigation vs. reading) and set the detail parameter accordingly. This reportedly cuts token costs by 60-80% on navigation-heavy workflows while preserving accuracy for text extraction.
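To see where a savings figure in that range could come from, here is a rough estimate under an assumed tile-based pricing scheme: a flat 85 tokens per low-detail image, and for high detail 85 tokens plus 170 per 512px tile after the image is shrunk to fit 2048x2048 and then to a 768px shortest side. These constants match pricing published for some GPT-4-class vision models but are assumptions here; other models bill differently. `high_detail_tokens` and `workflow_savings` are hypothetical helpers.

```python
import math

BASE_TOKENS = 85   # assumed flat cost of any low-detail image
TILE_TOKENS = 170  # assumed cost per 512px tile in high detail


def high_detail_tokens(width: int, height: int) -> int:
    """Estimate high-detail tokens for one image under the assumed scheme."""
    # Step 1: shrink (never enlarge) to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink (never enlarge) so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512px tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


def workflow_savings(width: int, height: int, navigation_share: float) -> float:
    """Fraction of image tokens saved versus sending every screenshot at high detail."""
    high = high_detail_tokens(width, height)
    mixed = navigation_share * BASE_TOKENS + (1 - navigation_share) * high
    return 1 - mixed / high


if __name__ == "__main__":
    print(high_detail_tokens(1920, 1080))
    print(round(workflow_savings(1920, 1080, 0.8), 3))
```

Under these assumptions a 1920x1080 screenshot costs 1105 tokens at high detail versus 85 at low, so a workflow where 80% of steps are navigation saves roughly 74% of image tokens. Actual savings depend on the model's billing and the real mix of steps.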
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:10:34.513925+00:00— report_created — created