Report #44427
[synthesis] Image processing failures in multi-modal agents when routing to different LLM providers
Implement a media normalization middleware: fetch each image from its URL, downscale it to fit within 2048x2048, and base64-encode the result before passing it to any model. The same payload then works across providers: it satisfies Claude's base64 requirement, it is accepted by GPT-4o (which takes base64 data URLs as well as remote URLs), and it stays within Llama's resolution limits.
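The normalization step could be sketched roughly as follows. This is a minimal illustration, not a verified implementation: the names fit_within, to_base64 and normalize_image are hypothetical, the 2048 bound is taken from this report, and the resize path assumes the third-party Pillow library is installed. Output is always re-encoded as PNG, which is an assumption made here for simplicity.

```python
import base64
import io
import urllib.request

MAX_SIDE = 2048  # bound suggested in this report; adjust per provider


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) down to fit max_side x max_side, keeping aspect ratio.

    Images that already fit are returned unchanged (never upscaled).
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def normalize_image(url: str, timeout: float = 10.0) -> str:
    """Fetch an image, downscale it if needed, and return base64-encoded PNG data."""
    from PIL import Image  # third-party dependency (Pillow), imported lazily

    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()

    img = Image.open(io.BytesIO(raw)).convert("RGB")  # drops alpha; fine for screenshots
    new_size = fit_within(*img.size)
    if new_size != img.size:
        img = img.resize(new_size)

    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return to_base64(buf.getvalue())
```

Resizing before encoding keeps the base64 payload small, since base64 inflates the byte count by about a third.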
Journey Context:
A common agentic pattern is a browser tool that takes a screenshot and passes its URL to the LLM. GPT-4o handles the URL without issue. Claude returns an error unless the image is supplied as base64 in the correct content-block format. Llama returns an error if the resolution is too high. Rather than emitting model-specific tool outputs, a middleware that normalizes every image to base64 at a safe resolution lets the agent swap models without breaking the tool loop.
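Once an image is normalized to base64, wrapping it for each provider is a small formatting step. The sketch below is illustrative only: image_block and the provider labels are hypothetical names, the "anthropic" shape follows the base64 image content block of the Anthropic Messages API, and the "openai" shape uses a data URL inside an image_url part. Treating Llama as "openai_compatible" is an assumption that holds only if the model is served behind an OpenAI-style chat endpoint.

```python
def image_block(provider: str, b64_data: str, media_type: str = "image/png") -> dict:
    """Wrap normalized base64 image data in a provider-specific content block."""
    if provider == "anthropic":
        # Base64 image block as used in Anthropic Messages API content lists.
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": b64_data,
            },
        }
    if provider in ("openai", "openai_compatible"):
        # OpenAI-style chat content part; base64 travels as a data URL.
        return {
            "type": "image_url",
            "image_url": {"url": f"data:{media_type};base64,{b64_data}"},
        }
    raise ValueError(f"unknown provider: {provider!r}")
```

With this split, the browser tool always emits the same normalized payload, and only the final wrapping step depends on which model the agent is routed to.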
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:02:20.576407+00:00— report_created — created