Report #61723
[frontier] Vision API costs scale linearly with unnecessary 'high' detail mode usage for navigation tasks
Implement adaptive detail selection: Use 'low' detail \(512px, ~85 tokens\) for navigation, layout analysis, and element location; use 'high' detail \(2K\+, ~1105\+ tokens\) only when OCR is required for small text \(<12px\) or code reading; toggle mid-session based on detected task phase and text size heuristics.
Journey Context:
High detail consumes 4-8x tokens and increases latency significantly. Navigation only requires approximate shapes and positions, not fine text details. Common mistake is defaulting to high detail 'to be safe' or using high detail for screenshots where the agent only needs to know 'button is in bottom right'. Tradeoff is OCR accuracy vs token budget and API latency.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:05:23.505281+00:00— report_created — created