Report #59169
[frontier] Agents make separate vision API calls for each verification step (e.g., check if button exists, then check if enabled, then check color), multiplying latency and cost.
Batch multiple visual questions into a single vision API call: provide the image once with a structured JSON schema request and ask the model to return all attributes in one response (e.g., {button_exists: bool, is_enabled: bool, color: str}).
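A minimal sketch of the receiving side, assuming the model replies with a single JSON object shaped like the example schema above. The names `ButtonState` and `parse_button_state` are hypothetical and exist only for illustration; the point is that one reply is validated once and then serves all three checks.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ButtonState:
    """All attributes returned by one batched vision call."""
    button_exists: bool
    is_enabled: bool
    color: str


# Field names and types copied from the example schema in this report.
_EXPECTED = {"button_exists": bool, "is_enabled": bool, "color": str}


def parse_button_state(raw: str) -> ButtonState:
    """Parse and validate the model's single JSON reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key, expected_type in _EXPECTED.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key} should be {expected_type.__name__}")
    # Ignore any extra keys the model may have added.
    return ButtonState(**{key: data[key] for key in _EXPECTED})
```

Validating strictly matters more in the batched form than in the per-question form: a malformed reply now invalidates every check at once, so it is probably better to fail loudly and retry than to continue with partial data.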
Journey Context:
Vision APIs carry high fixed overhead per call (TLS setup, GPU scheduling, cold starts). Checking three attributes separately takes 3-6 seconds, versus about 1.5 seconds when batched. The pattern sends the image once with a prompt such as 'Analyze this element and return JSON with exists, enabled, color'. It requires a VLM that supports structured output (e.g., GPT-4o, Claude 3.5 Sonnet, Gemini). The tradeoff is a slightly higher token count for the single batched call than for any one of the three separate calls, but total latency drops by 60-70%.
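The pattern can be sketched as follows. This is an illustration under stated assumptions, not a provider-specific implementation: `call_vision` is a hypothetical callable that wraps whichever vision SDK is in use and returns the model's text reply, and the field descriptions in `FIELDS` are invented for the example.

```python
import json
from typing import Callable, Mapping

# Hypothetical signature: (image bytes, prompt) -> model's text reply.
VisionCall = Callable[[bytes, str], str]

# Keys follow the prompt quoted in this report; descriptions are illustrative.
FIELDS = {
    "exists": "true if the element is visible in the image",
    "enabled": "true if the element appears clickable rather than greyed out",
    "color": "dominant color of the element as a lowercase word",
}


def build_batched_prompt(fields: Mapping[str, str]) -> str:
    """Combine every visual question into one prompt."""
    lines = ["Analyze this element and return only a JSON object with these keys:"]
    lines += [f'- "{name}": {description}' for name, description in fields.items()]
    return "\n".join(lines)


def check_element(image: bytes, call_vision: VisionCall) -> dict:
    """Answer every question in FIELDS with a single vision call."""
    reply = call_vision(image, build_batched_prompt(FIELDS))
    result = json.loads(reply)
    if not isinstance(result, dict):
        raise ValueError("model reply is not a JSON object")
    missing = set(FIELDS) - set(result)
    if missing:
        raise ValueError(f"model omitted fields: {sorted(missing)}")
    return result
```

Passing the client in as a callable keeps the sketch independent of any one provider and makes it easy to count calls in a test. In real use, the wrapper behind `call_vision` is where a provider's structured-output or JSON mode would be enabled, if it offers one.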
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:48:15.447923+00:00— report_created — created