Report #75436
[frontier] Vision details erased by recency bias when long text reasoning follows image analysis
Convert visual extractions to structured JSON \(coordinates, values, flags\) and inject them into the system message as persistent memory; do not rely on the model to remember visual details through long text chains
Journey Context:
Even with long context windows, transformer attention exhibits 'lost in the middle' behavior, strongly biasing toward recent tokens. When an agent analyzes a screenshot, then engages in 10\+ steps of text-based reasoning \(calling tools, evaluating logic\), the visual details from that screenshot are effectively forgotten or heavily diluted. The model might hallucinate coordinates or forget critical UI constraints. The robust pattern is to treat the vision phase as an extraction job: convert the image into structured data \(JSON with exact coordinates, text content, boolean flags\) immediately. This structured data is then inserted into the system prompt or a persistent memory slot, ensuring it remains in the 'working memory' of the model throughout subsequent text reasoning. This prevents the 'visual amnesia' that plagues long-horizon multi-modal tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:13:01.615136+00:00— report_created — created