Report #94080
[synthesis] Image analysis tool calls fail or hallucinate descriptions when models lack native vision, but succeed on GPT-4o and Claude
Before routing an image-based tool call, check whether the target model supports image input natively. For a text-only model (e.g., Llama 3 8B), add a pre-processing step: have a lightweight vision model describe or transcribe the image as text, then pass that text to the text-only model. Do not pass image URLs to models that cannot fetch and decode them.
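The routing rule above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `VISION_CAPABLE` table, the `ImageTask` type, and the `call_model` and `describe_image` callables are all hypothetical placeholders for whatever client code and capability data your stack actually provides.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical capability table (an assumption for this sketch).
# Verify each entry against the provider's own documentation.
VISION_CAPABLE = {"gpt-4o", "claude-3-5-sonnet"}


@dataclass
class ImageTask:
    prompt: str
    image_url: str


def route_image_task(
    task: ImageTask,
    model: str,
    call_model: Callable[[str, str, Optional[str]], str],
    describe_image: Callable[[str], str],
) -> str:
    """Send the image directly only to models known to accept image input."""
    if model in VISION_CAPABLE:
        return call_model(model, task.prompt, task.image_url)

    # Text-only path: a separate vision model turns the image into text first.
    description = describe_image(task.image_url)
    text_prompt = (
        f"{task.prompt}\n\n"
        "[Image description produced by a separate vision model]\n"
        f"{description}"
    )
    # The URL is deliberately withheld so the model cannot guess from it.
    return call_model(model, text_prompt, None)
```

Unknown models fall through to the text-only path, which is the safer default: a model missing from the table gets a description rather than a URL it may not be able to read.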
Journey Context:
GPT-4o and Claude 3.5 Sonnet accept images as URLs or base64 in tool calls and API requests. Open-source or older text-only models often have no way to fetch or decode an image. If an agent passes an image URL to a text-only Llama 3 model, the model either fails or hallucinates a description from the URL string itself (e.g., 'This is an image of a cat' from cat.png). Assuming every model in a swarm handles image tool payloads identically leads to silent, dangerous hallucinations.
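Because this failure is silent, one option is a guard that refuses to send image references to a text-only model at all. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the regular expressions only catch URLs with common image file extensions and base64 image data URIs, so they will miss image links without an extension.

```python
import re

# Heuristic patterns (assumptions for this sketch, not an exhaustive detector).
IMAGE_URL_RE = re.compile(
    r"https?://\S+\.(?:png|jpe?g|gif|webp|bmp)(?:\?\S*)?", re.IGNORECASE
)
DATA_URI_RE = re.compile(r"data:image/[a-zA-Z0-9.+-]+;base64,", re.IGNORECASE)


def find_image_references(text: str) -> list[str]:
    """Return image URLs and image data-URI prefixes found in the text."""
    return IMAGE_URL_RE.findall(text) + DATA_URI_RE.findall(text)


def check_text_only_payload(prompt: str) -> None:
    """Raise instead of letting a text-only model guess from a URL string."""
    refs = find_image_references(prompt)
    if refs:
        raise ValueError(
            f"Payload for a text-only model contains image references: {refs}"
        )
```

Raising an error turns a silent hallucination into a visible failure, which the caller can handle by running the pre-processing step or by picking a vision-capable model.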
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:30:04.982682+00:00— report_created — created