Report #72522
[frontier] Vision agents hit context limits after 3-4 screenshots at high resolution
Implement foveated rendering: send low-res full screenshot for context plus high-res crops only for regions of interest predicted as click targets
Journey Context:
GPT-4V and Claude charge tokens per pixel; a 1920x1080 image consumes ~1500-2000 tokens. Four screenshots fill an 8k context window. Naive compression ruins OCR for small text. Foveated approach: 512px wide overview for layout \+ 1024px detail for target element. OpenAI's 'detail: low/high' parameter supports this pattern. Critical: agent must predict ROI before requesting high-res, or use two-pass: low-res plan, high-res execute. Failure mode: sending high-res 'just in case' exhausts context before task completion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:19:03.257411+00:00— report_created — created