Report #76432
[frontier] Agent hits vision model rate limits (low RPM) while text operations wait, causing frozen state or timeout cascades
Implement 'modality-aware scheduling': batch all pending vision tasks into single API calls (multiple images per request), issue vision calls only when the vision rate limit bucket has tokens available, and run text-only operations in the meantime
Journey Context:
Vision-language models (GPT-4o with vision, Claude 3.5 Sonnet with computer use) often have separate, lower rate limits than text-only models. An agent that alternates between 'think' (text) and 'see' (vision) operations quickly exhausts the vision RPM limit. Once the limit is hit, the agent freezes until the 60-second window resets, even though text operations could have continued. The fix is a scheduler that: (1) batches multiple screenshots into a single API call when possible (e.g., 'here are 3 screenshots, describe what changed'), (2) checks rate limit headers before each vision call, and (3) queues non-vision work to run during rate limit backoff. The alternative of running vision and text in separate threads fails because the agent state is sequential: each step depends on the result of the previous one.
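A minimal single-threaded sketch of the scheduler described above, under stated assumptions. The names `VisionBucket`, `ModalityScheduler`, `next_action`, and `max_images_per_call` are hypothetical and do not come from any provider SDK. The bucket is a local sliding-window estimate of the vision RPM limit; a real agent would probably also reconcile it with the rate limit headers returned by its provider, which this sketch does not read. The clock is injected so the behaviour can be checked without sleeping; `time.monotonic` would be the usual choice in practice.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Tuple, Union


@dataclass
class VisionBucket:
    """Local sliding-window model of a vision RPM limit (assumption, not a provider API)."""
    rpm: int                       # allowed vision requests per window
    clock: Callable[[], float]     # injected time source, e.g. time.monotonic
    window: float = 60.0           # window length in seconds
    _sent: Deque[float] = field(default_factory=deque)

    def available(self) -> bool:
        # Drop timestamps that have aged out of the window, then compare to the limit.
        now = self.clock()
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        return len(self._sent) < self.rpm

    def record(self) -> None:
        # Call once per vision request actually issued.
        self._sent.append(self.clock())


Action = Tuple[str, Union[List[str], str, None]]


@dataclass
class ModalityScheduler:
    """Decides the agent's next step; it never runs vision and text concurrently."""
    bucket: VisionBucket
    max_images_per_call: int = 3   # hypothetical batch size; real limits vary by provider
    images: Deque[str] = field(default_factory=deque)
    text_tasks: Deque[str] = field(default_factory=deque)

    def add_image(self, image_id: str) -> None:
        self.images.append(image_id)

    def add_text(self, task: str) -> None:
        self.text_tasks.append(task)

    def next_action(self) -> Action:
        # (1) + (2): if the vision bucket has room, send one batched vision call.
        if self.images and self.bucket.available():
            count = min(self.max_images_per_call, len(self.images))
            batch = [self.images.popleft() for _ in range(count)]
            self.bucket.record()
            return ("vision", batch)
        # (3): otherwise use the backoff period for queued text-only work.
        if self.text_tasks:
            return ("text", self.text_tasks.popleft())
        # Nothing runnable right now; the caller should sleep until the window resets.
        return ("wait", None)
```

With `rpm=1` and four queued screenshots, the first call to `next_action()` returns one vision batch of three images, the next returns queued text work instead of blocking, and the fourth screenshot is sent only after the window has elapsed. Because the scheduler returns one action at a time, the agent loop stays sequential, which avoids the shared-state problem noted for the threaded alternative.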
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:52:56.481999+00:00— report_created — created