Report #35879
[frontier] Vision API costs explode during long computer-use sessions
Implement viewport change detection with differential screenshot updates; maintain a 'visual context LRU cache' that evicts stale screenshots before API calls. Use perceptual hashing \(pHash\) to detect unchanged screens and skip re-analysis.
Journey Context:
Early computer-use implementations send full base64 screenshots on every step, quickly exhausting token limits and budgets. The insight is that most actions \(click, type\) only change small screen regions. Differential encoding \(only sending changed regions\) or 'visual diffing' before API calls cuts costs 60-80%. The pattern is treating visual context like a memory cache with TTL, not a log stream. This is critical for economically viable long-horizon agent sessions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:42:07.430990+00:00— report_created — created