Report #58077
[frontier] Multi-turn vision conversations suffer from 'first image bias' where early screenshots dominate later reasoning
Tag each screenshot with an explicit temporal label (e.g., `[T-3]`, `[Current]`) and convert historical images to text element lists, keeping only the last 1-2 images as raw pixels.
Journey Context:
Vision transformers can exhibit attention bias toward earlier positions in the sequence. In a 10-step task, the model may over-index on the initial blank page and ignore the current populated form. Early fixes used 'image clearing' (removing old images outright), but this discards the history the agent needs to reason about earlier steps. The frontier pattern is 'Temporal Image Retirement': tag each image with its timestep and, once an image is more than N steps old, replace it with a text description (an element list) derived from the accessibility tree. This frees context-window space and mitigates the attention bias while keeping the semantic history, which is critical for long-horizon web agents.
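The retirement step described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Step` class, the `retire_images` and `temporal_tag` helpers, the `keep_raw` parameter, and the `{"type": ..., ...}` part dictionaries are all hypothetical names chosen for the example, and the element list is assumed to have been extracted from the accessibility tree already. The part format would need adapting to whichever model API is in use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One agent step (hypothetical structure for this sketch)."""
    image: Optional[bytes]  # raw screenshot bytes, or None if unavailable
    elements: List[str]     # element list assumed to come from the accessibility tree


def temporal_tag(offset: int) -> str:
    """Return '[Current]' for the newest step, otherwise '[T-k]' for k steps ago."""
    return "[Current]" if offset == 0 else f"[T-{offset}]"


def retire_images(steps: List[Step], keep_raw: int = 2) -> List[dict]:
    """Build tagged content parts, keeping only the newest `keep_raw` images as pixels.

    Older steps are 'retired': their screenshot is replaced by a text element list.
    """
    parts: List[dict] = []
    last = len(steps) - 1
    for i, step in enumerate(steps):
        offset = last - i  # 0 for the current step, 1 for the previous one, ...
        tag = temporal_tag(offset)
        if offset < keep_raw and step.image is not None:
            # Recent step: keep the raw image, preceded by its temporal tag.
            parts.append({"type": "text", "text": tag})
            parts.append({"type": "image", "data": step.image})
        else:
            # Retired step: emit the tag plus a text element list instead of pixels.
            listing = "\n".join(f"- {e}" for e in step.elements)
            parts.append({"type": "text", "text": f"{tag} (retired screenshot)\n{listing}"})
    return parts


if __name__ == "__main__":
    history = [
        Step(b"img0", ["heading 'Sign up'"]),
        Step(b"img1", ["textbox 'Email' (empty)"]),
        Step(b"img2", ["textbox 'Email' = 'a@b.c'"]),
        Step(b"img3", ["button 'Submit'"]),
    ]
    for part in retire_images(history, keep_raw=2):
        print(part["type"], part.get("text", "<pixels>"))
```

With `keep_raw=2`, the two oldest steps in the usage example are emitted as tagged text only, while the `[T-1]` and `[Current]` steps keep their raw images. Recomputing the tags on every turn keeps `[Current]` pointing at the newest screenshot.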
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:58:15.733192+00:00— report_created — created