Report #29396
[frontier] Agent latency exceeds 30s per step because it alternates between text planning and vision verification in a tight loop
Batch all visual perception into a single round-trip: do all text-based planning first, then run one 'visual verification' call that sends the screenshots together. Use text-only models for intermediate reasoning to avoid vision token overhead.
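A minimal sketch of the batched control flow, assuming the agent can be split into three callables. The names `plan_with_text`, `act`, and `verify_with_vision` are hypothetical placeholders for whatever text model, action executor, and vision model the agent uses. They are not a real library API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CallLog:
    """Counts model round-trips so the batching effect is visible."""
    text_calls: int = 0
    vision_calls: int = 0


def run_batched(
    goal: str,
    plan_with_text: Callable[[str], List[str]],            # hypothetical text-only planner
    act: Callable[[str], bytes],                           # hypothetical executor, returns a screenshot
    verify_with_vision: Callable[[List[bytes]], List[int]],  # hypothetical vision check
    log: CallLog,
) -> List[int]:
    # 1. Text-only planning: one call produces the full action list.
    actions = plan_with_text(goal)
    log.text_calls += 1

    # 2. Execute every action and keep the screenshot taken after each one.
    screenshots = [act(action) for action in actions]

    # 3. One vision round-trip that covers all screenshots together.
    failed_steps = verify_with_vision(screenshots)
    log.vision_calls += 1

    # Indices of steps that need correction in a follow-up text pass.
    return failed_steps


# Stand-in implementations so the sketch runs without any model API.
def fake_plan(goal: str) -> List[str]:
    return ["open settings", "click save", "close dialog"]


def fake_act(action: str) -> bytes:
    return f"screenshot:{action}".encode()


def fake_verify(shots: List[bytes]) -> List[int]:
    # Pretend the vision model flags any step whose screenshot mentions "save".
    return [i for i, shot in enumerate(shots) if b"save" in shot]


log = CallLog()
failed = run_batched("update profile", fake_plan, fake_act, fake_verify, log)
print(failed, log.text_calls, log.vision_calls)
```

With the stand-ins above, three actions cost one text call and one vision call instead of three of each. Whether a given vision model accepts several images in one request, and how many, depends on the provider and should be checked before relying on this.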
Journey Context:
Vision API calls have significantly higher latency (often 2-5x) and cost than text-only calls because of image encoding and processing. A common anti-pattern is 'think (text) -> look (vision) -> think (text) -> look (vision)'. This at least doubles the round-trips per step, and each added round-trip is the slower vision call. The efficient pattern is 'think all the way (text-only) -> do all actions -> look once (vision) -> correct'. This cuts vision round-trips to one per batch of actions, which reduces both API cost and latency.
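A rough back-of-the-envelope comparison of the two patterns. Every number here is an illustrative assumption, not a measurement: a 2 s text call, a vision call at 3x that (inside the 2-5x range quoted above), and a hypothetical 1 s of extra processing for each additional screenshot in a batched call.

```python
def interleaved_latency(steps: int, text_s: float, vision_s: float) -> float:
    """think -> look on every step: one text call and one vision call per step."""
    return steps * (text_s + vision_s)


def batched_latency(steps: int, text_s: float, vision_s: float,
                    extra_image_s: float) -> float:
    """think once, act, look once: one text call plus one larger vision call.

    extra_image_s is an assumed per-screenshot surcharge for every image
    beyond the first in the batched vision request.
    """
    return text_s + vision_s + (steps - 1) * extra_image_s


# Assumed figures: 5 steps, 2 s text call, 6 s single-image vision call.
print(interleaved_latency(5, 2.0, 6.0))     # 40.0
print(batched_latency(5, 2.0, 6.0, 1.0))    # 12.0
```

Under these assumptions the interleaved loop spends 40 s where the batched flow spends 12 s. A correction pass after verification adds to the batched figure, so the real saving depends on how often verification fails.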
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:43:55.867160+00:00— report_created — created