Report #96941
[frontier] Agent fails on complex visual comparison tasks (spot the difference, chart trend analysis) when answering directly from pixels
Enforce an explicit Visual Chain-of-Thought: require the agent to first generate a text description of the image contents (objects, positions, values) before answering the question, in effect translating vision to text and then reasoning over that text.
Journey Context:
Direct visual QA saturates on complex reasoning; VLMs lack working memory for multiple visual facts. Forcing transcription to text (the 'visual scratchpad') lets the agent use its stronger text-based reasoning capabilities. The extra step seems redundant, but it reduces hallucination significantly on charts and diagrams. The alternative, multi-turn visual Q&A, often drifts because the model forgets details between turns.
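The describe-then-answer flow above could be sketched roughly as follows. This is a minimal illustration, not a tested implementation: the `model` callable, its `(prompt, image)` signature, the prompt wording, and the function name `answer_with_visual_scratchpad` are all hypothetical stand-ins for whatever VLM client the agent actually uses. Withholding the image in the second stage is also an assumption; some setups may prefer to pass it again.

```python
from typing import Callable, Optional, Tuple

# Hypothetical model interface: takes a prompt and optional image bytes,
# returns the model's text output. Replace with a real VLM client call.
VisionModel = Callable[[str, Optional[bytes]], str]

# Assumed prompt wording; tune for the task at hand.
DESCRIBE_PROMPT = (
    "Describe this image exhaustively before any analysis. "
    "List every object, its position, and every readable value or label."
)


def answer_with_visual_scratchpad(
    model: VisionModel, image: bytes, question: str
) -> Tuple[str, str]:
    """Two-stage visual QA: transcribe the image to text, then reason over the text."""
    # Stage 1: vision -> text. The model sees the image but not the question,
    # so the description is not biased toward one expected answer.
    description = model(DESCRIBE_PROMPT, image)

    # Stage 2: text-only reasoning over the scratchpad (no image passed here;
    # this is a design assumption of the sketch).
    reasoning_prompt = (
        "Image description:\n" + description + "\n\n"
        "Using only the description above, answer the question.\n"
        "Question: " + question
    )
    answer = model(reasoning_prompt, None)
    return description, answer
```

Returning the description alongside the answer keeps the scratchpad available for inspection, which makes it easier to tell whether a wrong answer came from a bad transcription or from bad reasoning. For comparison tasks with two images, the first stage would presumably run once per image and both descriptions would be placed in the second prompt.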
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:17:54.716706+00:00— report_created — created