Report #100000
[counterintuitive] Vision LLM mislocalizes objects, confuses left/right, or fails on chess positions and precise geometry
Use dedicated CV/geometry tools (OCR, object detectors, depth estimators, FEN parsers) for precise spatial tasks. Use vision models for semantic description, not coordinate-level reasoning.
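As an illustration of the FEN-parser route, here is a minimal sketch in plain Python. It assumes the board state is already available as a FEN string (for example from a chess engine or a dedicated board-recognition tool) rather than read off the image by the vision model. The function name `parse_fen_board` is hypothetical, and the sketch handles only the piece-placement field, not castling or en passant data.

```python
PIECES = "pnbrqkPNBRQK"
FILES = "abcdefgh"


def parse_fen_board(fen: str) -> dict[str, str]:
    """Map each occupied square (e.g. 'e1') to its FEN piece letter."""
    placement = fen.split()[0]  # the first FEN field is the piece placement
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("FEN placement must contain 8 ranks")

    board: dict[str, str] = {}
    for rank_index, rank in enumerate(ranks):
        rank_number = 8 - rank_index  # FEN lists rank 8 first
        file_index = 0
        for ch in rank:
            if ch in "12345678":
                file_index += int(ch)  # a digit is a run of empty squares
            elif ch in PIECES:
                if file_index >= 8:
                    raise ValueError(f"rank {rank_number} is too long")
                board[FILES[file_index] + str(rank_number)] = ch
                file_index += 1
            else:
                raise ValueError(f"unexpected character {ch!r} in FEN")
        if file_index != 8:
            raise ValueError(f"rank {rank_number} does not cover 8 files")
    return board


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = parse_fen_board(start)
print(board["e1"], board["d8"], len(board))
```

With exact square-to-piece data in hand, the vision model can be limited to describing the scene, while position questions are answered from the parsed board.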
Journey Context:
People expect vision LLMs to 'see' like humans. OpenAI's vision documentation lists precise spatial localization as a known limitation, using chess positions as the canonical example. The model converts images to patch features and then generates text autoregressively; it does not maintain an explicit internal coordinate frame. Cropping to the region of interest or supplying a higher-resolution image can help, but neither gives the model a true geometric workspace.
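For left/right questions, one option is to compute the relation from explicit coordinates instead of asking the model. The sketch below assumes bounding boxes in pixel coordinates have already been produced by some object detector. The `Box` type, the `horizontal_relation` helper, and the sample values are hypothetical and for illustration only.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in image pixels (origin at top-left)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def horizontal_relation(a: Box, b: Box, tolerance: float = 0.0) -> str:
    """Describe where a sits relative to b along the image x-axis."""
    a_center = (a.x_min + a.x_max) / 2
    b_center = (b.x_min + b.x_max) / 2
    if abs(a_center - b_center) <= tolerance:
        return "aligned with"
    return "left of" if a_center < b_center else "right of"


# Hypothetical detector output for a desk photo.
cup = Box("cup", 40, 120, 110, 200)
laptop = Box("laptop", 300, 90, 620, 310)
print(f"{cup.label} is {horizontal_relation(cup, laptop)} {laptop.label}")
# prints "cup is left of laptop"
```

Here "left" means left in the image as the viewer sees it. Because the answer comes from a numeric comparison, it cannot flip between runs the way a generated description can.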
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:25:19.241630+00:00— report_created — created