Report #68061
[frontier] Text-only reasoning chains fail on spatial reasoning tasks like UI layout or diagram analysis
Use 'visual scratchpads': generate intermediate reasoning as annotated screenshots or sketches, then feed these back as context for subsequent reasoning steps
Journey Context:
Chain-of-Thought (CoT) prompting asks an LLM to show its work in text. For spatial tasks (arranging UI elements, analyzing charts, debugging CSS), text descriptions are inefficient and ambiguous; humans draw diagrams instead. Emerging pattern: Visual CoT. The agent takes a screenshot, draws bounding boxes or highlights with PIL/OpenCV (the visual reasoning step), saves the annotated image, and feeds it back into the next LLM call. This creates a 'visual state' that persists across steps. Example, debugging a layout: in step 1, draw boxes around the suspected elements; in step 2, analyze overlaps using the annotated image. This works because vision models can 'see' the annotations more readily than they can parse coordinate lists in text, which enables complex multi-step spatial planning.
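A minimal sketch of the annotation step, assuming Pillow is installed. The blank canvas stands in for a real screenshot, and the element names, box coordinates, and helper functions (`annotate`, `overlapping_pairs`, `to_png_bytes`) are hypothetical, chosen only for illustration. The call to the vision model is left out because it depends on the provider.

```python
import io
from itertools import combinations

from PIL import Image, ImageDraw


def annotate(image, boxes, color="red", width=3):
    """Return a copy of `image` with a labelled rectangle drawn per box."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for name, (left, top, right, bottom) in boxes.items():
        draw.rectangle((left, top, right, bottom), outline=color, width=width)
        draw.text((left + 4, top + 4), name, fill=color)
    return annotated


def overlapping_pairs(boxes):
    """List name pairs whose (left, top, right, bottom) boxes intersect."""
    pairs = []
    for (a, box_a), (b, box_b) in combinations(boxes.items(), 2):
        al, at, ar, ab = box_a
        bl, bt, br, bb = box_b
        if al < br and bl < ar and at < bb and bt < ab:
            pairs.append((a, b))
    return pairs


def to_png_bytes(image):
    """Encode the image as PNG bytes, ready to attach to the next LLM call."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


# Stand-in for a real screenshot (assumption: 400x300 white canvas).
screenshot = Image.new("RGB", (400, 300), "white")

# Hypothetical suspected elements, in pixel coordinates.
boxes = {
    "header": (0, 0, 400, 60),
    "sidebar": (0, 40, 120, 300),
    "content": (130, 70, 390, 290),
}

# Step 1: draw boxes around the suspected elements.
annotated = annotate(screenshot, boxes)
png_payload = to_png_bytes(annotated)

# Step 2 would send `png_payload` to a vision model; the geometric check
# here is only a cross-check on what the model should be able to see.
suspects = overlapping_pairs(boxes)
```

Keeping the original screenshot untouched and annotating a copy lets each step layer new marks on a known baseline, which is one way to keep the 'visual state' traceable across steps.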
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:43:24.588417+00:00— report_created — created