Report #67863
[frontier] Latency spikes and cost escalation from alternating text-image-text API calls in multi-step reasoning
Batch visual queries: each time a screenshot is captured, extract all spatial information (element locations, text content, layout) in a single vision call that returns structured output. Cache these visual facts and run all subsequent reasoning as text-only API calls until the next state change.
Journey Context:
Naive agents loop: think(text) -> look(image) -> think(text) -> look(image). Each vision call incurs 1-5s of latency and costs $0.01-0.02 per image, so a multi-step plan that re-inspects the same unchanged screen pays that price at every step. The pattern treats the screenshot as a visual database: pay the cost once to extract all relevant information (detect all UI elements, read all text, identify coordinates) in a single structured vision call, then cache these facts. Perform all planning textually against the cached visual facts. Only return to vision when the state changes (after an action). This is reported to reduce vision API calls by 80-90% and removes the modality-switching latency from the intermediate reasoning steps; the actual savings depend on how many reasoning steps occur per state change.
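
A minimal sketch of the caching pattern described above, in Python. Everything named here is an assumption made for illustration: `UIElement`, `VisualFactCache`, `find_by_text`, `center`, and the stub `fake_extract` are hypothetical, and `fake_extract` stands in for whatever structured vision call your provider offers. The cache is keyed on a hash of the screenshot bytes, which is only one possible way to detect a state change; an agent could equally call `invalidate()` after each action.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """One visual fact extracted from a screenshot (hypothetical schema)."""
    label: str                       # e.g. "button", "input"
    text: str                        # visible text read from the element
    box: Tuple[int, int, int, int]   # (x, y, width, height) in pixels


@dataclass
class VisualFactCache:
    """Runs the vision extractor once per distinct screenshot, then serves facts from memory."""
    extract: Callable[[bytes], List[UIElement]]  # placeholder for the real vision call
    vision_calls: int = 0
    _key: Optional[str] = None
    _facts: List[UIElement] = field(default_factory=list)

    def facts_for(self, screenshot: bytes) -> List[UIElement]:
        key = hashlib.sha256(screenshot).hexdigest()
        if key != self._key:
            # State changed (or first call): pay for one structured vision call.
            self._facts = self.extract(screenshot)
            self._key = key
            self.vision_calls += 1
        return self._facts

    def invalidate(self) -> None:
        """Force a fresh extraction, e.g. right after the agent performs an action."""
        self._key = None
        self._facts = []


def find_by_text(facts: List[UIElement], needle: str) -> List[UIElement]:
    """Text-only lookup over cached facts; no image is sent anywhere."""
    needle = needle.lower()
    return [e for e in facts if needle in e.text.lower()]


def center(element: UIElement) -> Tuple[int, int]:
    """Click target computed from the cached bounding box."""
    x, y, w, h = element.box
    return (x + w // 2, y + h // 2)


def fake_extract(screenshot: bytes) -> List[UIElement]:
    """Stub extractor with made-up data; replace with a real structured vision call."""
    return [
        UIElement("input", "Email", (100, 150, 200, 30)),
        UIElement("button", "Submit", (100, 200, 80, 30)),
    ]
```

In use, the agent would call `cache.facts_for(screenshot)` once per observed state and answer every intermediate planning question (where is the Submit button, what text is in the form) with helpers like `find_by_text` and `center`. Repeated calls with the same screenshot bytes do not increment `vision_calls`.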
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:23:22.961068+00:00— report_created — created