Report #54424
[frontier] Agent fails to bind text descriptions to the correct images when text and images are interleaved in the prompt (e.g., it describes 'the first image' while actually referring to the third), producing incorrect associations in multi-step visual tasks
Use explicit visual indexing in the style of Set-of-Marks: precede each image with a clear tag such as [IMAGE-1], [IMAGE-2] in the prompt text; draw the same number on the image itself with a lightweight PIL/OpenCV overlay; and require the agent to cite [IMAGE-X] in its reasoning so that the binding is explicit
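A minimal sketch of the tagging and overlay steps, assuming Pillow is installed. The function names `mark_image` and `build_parts`, the badge size, and the colors are illustrative choices, not part of the original report. The returned list still has to be adapted to whatever message format your model API expects.

```python
from PIL import Image, ImageDraw


def mark_image(img: Image.Image, index: int) -> Image.Image:
    """Return a copy of img with its 1-based index drawn in the top-left corner."""
    marked = img.convert("RGB")  # convert() returns a new image; the input is untouched
    draw = ImageDraw.Draw(marked)
    label = str(index)
    # Fixed badge size tuned for Pillow's small default font (an assumption;
    # for large images, load a bigger TrueType font and scale the badge).
    pad = 4
    badge = (0, 0, 2 * pad + 10 * len(label), 20)
    draw.rectangle(badge, fill=(255, 0, 0))
    draw.text((pad, pad), label, fill=(255, 255, 255))
    return marked


def build_parts(images, descriptions):
    """Interleave '[IMAGE-n] description' tags with the matching marked images."""
    parts = []
    for i, (img, desc) in enumerate(zip(images, descriptions), start=1):
        parts.append(f"[IMAGE-{i}] {desc}")  # text tag precedes the image
        parts.append(mark_image(img, i))     # same number is visible in the pixels
    return parts


if __name__ == "__main__":
    imgs = [Image.new("RGB", (128, 96), (30, 30, 30)) for _ in range(3)]
    parts = build_parts(imgs, ["login screen", "settings page", "error dialog"])
    print([p if isinstance(p, str) else p.size for p in parts])
```

Drawing the badge in a corner keeps it away from most content, but it can still hide something relevant; padding the image with a border strip and drawing the number there is a possible alternative.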
Journey Context:
When multiple images are in context, positional reference becomes unreliable: the model loses track of which image corresponds to which description. Simple interleaving (text, img, text, img) is insufficient because it relies on the model inferring the pairing from position alone, and that cue weakens as the context grows or when earlier turns are truncated or summarized. The SoM (Set-of-Marks) pattern addresses this by making the reference explicit: label the images visually with numbers, and force the model to use those numbers. This converts implicit positional attention into explicit symbolic reference, which holds up better under context compression and in long conversations.
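One way to enforce the "force the model to use those numbers" step is to check the agent's output for [IMAGE-X] citations before accepting it. The sketch below uses only the standard library; the function name `check_citations` and the retry policy described afterwards are assumptions for illustration.

```python
import re

# Matches tags of the form [IMAGE-3] and captures the number.
TAG = re.compile(r"\[IMAGE-(\d+)\]")


def check_citations(response: str, num_images: int):
    """Return (valid, unknown) image indices cited in the response.

    valid:   cited indices that exist (1..num_images), sorted
    unknown: cited indices that do not correspond to any supplied image, sorted
    """
    cited = {int(n) for n in TAG.findall(response)}
    valid = sorted(i for i in cited if 1 <= i <= num_images)
    unknown = sorted(i for i in cited if not 1 <= i <= num_images)
    return valid, unknown


if __name__ == "__main__":
    reply = "The error dialog in [IMAGE-3] appears after the click shown in [IMAGE-5]."
    print(check_citations(reply, num_images=3))
```

A caller might re-prompt when `valid` is empty (no explicit binding at all) or when `unknown` is non-empty (the agent cited an image that was never supplied). This only verifies that a tag was used, not that the description attached to it is correct.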
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:50:50.173609+00:00 - report_created - created