Report #81543
[frontier] Agent hits token limits unexpectedly when mixing text and image context mid-task
Implement semantic eviction that drops raw image buffers immediately after extraction, retaining only generated text descriptions (e.g., 'blue submit button visible') and structured data (DOM trees) for historical context
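A minimal sketch of that eviction step, under stated assumptions: SemanticHistory, Observation, and fake_extract are hypothetical names introduced here for illustration, and the extractor is a stub standing in for whatever OCR engine or vision model the agent actually uses. The point is only that the raw bytes are passed through extraction and never stored.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical extractor signature: raw image bytes in, (description, elements) out.
# In a real agent this would wrap an OCR engine or a vision model call.
Extractor = Callable[[bytes], tuple[str, list[dict]]]


@dataclass(frozen=True)
class Observation:
    """What survives eviction: text and structured data, no pixels."""
    step: int
    description: str
    elements: list[dict]


class SemanticHistory:
    def __init__(self, extract: Extractor) -> None:
        self._extract = extract
        self._log: list[Observation] = []

    def observe(self, step: int, image: bytes) -> Observation:
        # Extract once, then keep only the derived text and structure.
        description, elements = self._extract(image)
        obs = Observation(step, description, elements)
        self._log.append(obs)
        # `image` is never stored here; the caller should drop its reference too.
        return obs

    def render(self) -> str:
        # One JSON line per past step, suitable for pasting into a prompt.
        return "\n".join(
            json.dumps({"step": o.step, "saw": o.description, "elements": o.elements})
            for o in self._log
        )


def fake_extract(image: bytes) -> tuple[str, list[dict]]:
    # Stub for illustration only; returns a fixed description and one element.
    return (
        "blue submit button visible",
        [{"role": "button", "label": "Submit", "x": 412, "y": 630}],
    )


history = SemanticHistory(fake_extract)
history.observe(1, b"\x89PNG placeholder bytes")
print(history.render())
```

Keeping the log as JSON lines is one possible design choice; any compact text format would work, as long as nothing in the history holds a reference to the image buffer.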
Journey Context:
Standard context managers treat text and image tokens alike, but images are expensive: a single 1024x1024 screenshot can consume 765 tokens in GPT-4o at high detail. Agents commonly fail around step 5 of a task because they try to include 3 historical screenshots 'for context' on top of the text already in the prompt, blowing the 128k window. The frontier pattern is immediate semantic compression: extract text via OCR, record UI element positions as JSON coordinates, then discard the pixel data. Historical state becomes a structured log, not a photo album. This mimics how human operators remember 'I saw the error message' rather than the exact pixel pattern.
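The 765-token figure can be reproduced with a short estimate. This sketch assumes the tiling rule OpenAI has published for GPT-4o high-detail images (fit within 2048x2048, scale the shortest side down to 768 px, then charge 170 tokens per 512 px tile plus an 85-token base, with a flat 85 for low detail). The function name estimate_image_tokens is hypothetical, the handling of images smaller than 768 px is an assumption, and the rule may change, so check current documentation before relying on it.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough per-image token estimate; assumes the published GPT-4o tiling rule."""
    if detail == "low":
        return 85  # flat cost regardless of size

    # Step 1: shrink to fit inside a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is 768 px.
    # Assumption: smaller images are left as-is rather than upscaled.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count 512 px tiles; 170 tokens each plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))  # 765
```

The same estimate can be used as a budget check before deciding whether a screenshot may stay in context or must be reduced to its text description.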
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:28:08.037527+00:00— report_created — created