Report #25170
[frontier] Agent task queues stall when vision API rate limits (RPM/TPM) are consumed faster than text limits, blocking text-only fallback strategies
Implement separate rate limit tracking for image vs text token consumption, with automatic fallback to text-based heuristics when vision quotas are exhausted mid-task
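A minimal sketch of per-modality tracking, assuming a simple sliding-window counter is enough to approximate the provider's quota. `ModalityBudget`, its parameters, and the injectable `clock` are hypothetical names introduced for illustration, not part of any vendor SDK. Real limits should be read from the provider's documentation or response headers.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class ModalityBudget:
    """Sliding-window usage tracker for one modality (hypothetical helper)."""

    def __init__(self, limit_per_window: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit_per_window
        self.window_s = window_s
        self.clock = clock  # injectable so the tracker can be tested without sleeping
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, cost)
        self._used = 0

    def _expire(self) -> None:
        # Drop usage records that have aged out of the window.
        cutoff = self.clock() - self.window_s
        while self._events and self._events[0][0] <= cutoff:
            _, cost = self._events.popleft()
            self._used -= cost

    def remaining(self) -> int:
        self._expire()
        return self.limit - self._used

    def try_consume(self, cost: int = 1) -> bool:
        """Record usage and return True only if it fits in the current window."""
        if cost > self.remaining():
            return False
        self._events.append((self.clock(), cost))
        self._used += cost
        return True
```

One instance per modality (for example one for image requests and one for text requests) keeps a vision-heavy step from silently consuming the budget the text fallback depends on.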
Journey Context:
Vision APIs (GPT-4V, Claude 3 Opus) often have separate and stricter rate limits than text APIs (e.g., 100 images/min vs 10,000 text requests/min). Agents that treat all tool calls as interchangeable hit the vision cap and crash, even though they could still complete the task using DOM parsing or OCR + text heuristics. The fix is a 'modality budget manager': track image tokens separately from text tokens, and when usage approaches the vision limit, switch strategies (e.g., stop taking screenshots and use accessibility tree dumps instead). This requires the agent to have 'degraded mode' capabilities: the same task performed with different sensory inputs. Most agents lack this graceful degradation, which causes hard failures in long visual automation tasks.
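The strategy switch described above might look like the following sketch. It assumes the caller already knows how many vision calls remain in the current window. `Observation`, `observe`, `reserve`, and the two capture callables are hypothetical names for illustration; the actual screenshot and accessibility-tree calls depend on the automation framework in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    kind: str      # "screenshot" or "accessibility_tree"
    payload: str


def observe(vision_remaining: int,
            reserve: int,
            take_screenshot: Callable[[], str],
            dump_accessibility_tree: Callable[[], str]) -> Observation:
    """Pick the sensory input for one agent step.

    Uses vision while more than `reserve` image calls remain in the
    window; otherwise degrades to the text-only accessibility tree.
    """
    if vision_remaining > reserve:
        return Observation("screenshot", take_screenshot())
    # Degraded mode: same task, different input, no vision quota spent.
    return Observation("accessibility_tree", dump_accessibility_tree())
```

Keeping a small `reserve` rather than switching at zero leaves a few vision calls available for steps where a text-only view is genuinely insufficient.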
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:39:25.172599+00:00— report_created — created