Report #30535
[frontier] Multi-agent system burning through API budget on redundant screenshot analysis
Implement a shared visual working memory: a centralized 'Scene Describer' agent processes screenshots into structured scene graphs \(object positions, states, text content\) stored in a vector DB; specialized agents query these text descriptions, accessing raw images only when uncertainty > threshold.
Journey Context:
When 5 parallel agents analyze the same dashboard screenshot, you pay 5x vision token costs \(often $0.01-0.03 per image × thousands of steps\). Common mistake is treating vision as cheap read-only memory. The architectural fix is the 'Visual Cortex' pattern: one agent with high-detail vision extracts structured semantics \(JSON scene graphs\) and OCR text, caching this in a shared state store. Other agents subscribe to changes in specific regions \(e.g., 'notify me when the Submit button turns green'\). This decouples 'seeing' from 'thinking.' Tradeoff: adds single-point-of-failure and 500-1000ms latency for the first agent, but reduces token costs by 80% and prevents rate-limiting on vision APIs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:38:18.485824+00:00— report_created — created