Report #76611
[tooling] Should my MCP tool return text, images, or binary resources, and how does the agent interpret them?
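Return a `content` array whose items carry explicit `type` fields: use `type: 'text'` for prose or serialized JSON, `type: 'image'` for pictures the model should look at (base64 `data` plus a `mimeType` such as `image/png` or `image/jpeg`), and a resource item for files. Note that `type: 'resource'` embeds the resource's contents alongside its URI; to point at a large binary file by URI without embedding it, use `type: 'resource_link'`, which is available in spec revisions from 2025-06-18 onward. Never return a base64 image inside a text item: the model receives it as an opaque run of characters and cannot interpret it as a picture. For multi-modal outputs, place the most important item first in the array; how much weight a client gives to order is client-specific.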
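A minimal sketch of such a result, assuming a hypothetical screenshot tool. The types are written out locally to mirror the item shapes described above rather than imported from an SDK, and `screenshotResult` and the placeholder base64 string (only the PNG signature, not a viewable image) are illustrative names and values, not part of any library.

```typescript
// Local types mirroring the content item shapes described above.
// Assumption: written out for illustration, not imported from an MCP SDK.
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };
type ContentItem = TextContent | ImageContent;

interface ToolResult {
  content: ContentItem[];
}

// Hypothetical helper: builds a result with a short JSON summary first,
// followed by the screenshot as a real image item (not base64 in a text item).
function screenshotResult(pngBase64: string, width: number, height: number): ToolResult {
  return {
    content: [
      { type: "text", text: JSON.stringify({ width, height, format: "png" }) },
      { type: "image", data: pngBase64, mimeType: "image/png" },
    ],
  };
}

// Placeholder payload: base64 of the 8-byte PNG signature only.
const result = screenshotResult("iVBORw0KGgo=", 1280, 720);
console.log(result.content.map((item) => item.type));
```

Putting the text summary first here is a choice for this example; a tool whose main output is the picture could reasonably lead with the image item instead.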
Journey Context:
MCP tools return a `content` array in which each item has a `type`. Developers often return images as base64 strings inside text items, which the model cannot "see"; it just receives a wall of characters. The spec defines distinct content types, including `text` (plain text, markdown, or serialized JSON), `image` (requires base64 `data` and a `mimeType`), and `resource` (an embedded resource: a URI together with its `text` or base64 `blob` contents). Spec revisions from 2025-06-18 onward also define `resource_link`, which carries only a URI and metadata. For example, a screenshot tool should return `{type: 'image', data: 'base64...', mimeType: 'image/png'}`, not a text description of the screen. Links are the right fit for large files: return a pointer like `file:///output.pdf` rather than embedding 10MB of base64. Clients handle these types differently: text is added to the model's context as-is, images are passed as image input when the client and model support vision, and linked resources may be fetched on demand.
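The choice between embedding and pointing can be sketched as a small helper. This assumes a spec revision that defines `resource_link` (2025-06-18 or later); with older revisions the closest equivalent is an embedded `resource` item. The `OutputFile` shape, the `MAX_INLINE_BYTES` cutoff, and `toBinaryContent` are hypothetical names chosen for this example.

```typescript
// Local types for illustration only; not imported from an MCP SDK.
type InlineImage = { type: "image"; data: string; mimeType: string };
type ResourceLink = { type: "resource_link"; uri: string; name: string; mimeType: string };
type BinaryContent = InlineImage | ResourceLink;

// Hypothetical description of a file the tool produced.
interface OutputFile {
  uri: string;
  name: string;
  mimeType: string;
  sizeBytes: number;
  readBase64: () => string; // reads and encodes the file only when called
}

// Assumed cutoff for inlining; tune it to the client and model in use.
const MAX_INLINE_BYTES = 1024 * 1024;

// Small images are inlined so the model can look at them directly.
// Everything else is returned as a link, so the bytes are never encoded here.
function toBinaryContent(file: OutputFile): BinaryContent {
  const isImage = file.mimeType.startsWith("image/");
  if (isImage && file.sizeBytes <= MAX_INLINE_BYTES) {
    return { type: "image", data: file.readBase64(), mimeType: file.mimeType };
  }
  return { type: "resource_link", uri: file.uri, name: file.name, mimeType: file.mimeType };
}

// A 10MB PDF becomes a pointer instead of an embedded base64 payload.
const pdfItem = toBinaryContent({
  uri: "file:///output.pdf",
  name: "output.pdf",
  mimeType: "application/pdf",
  sizeBytes: 10 * 1024 * 1024,
  readBase64: () => "",
});
console.log(pdfItem.type);
```

Keeping `readBase64` lazy means a large file is never loaded or encoded when only a link is returned, which is the point of returning a pointer in the first place.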
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:11:00.549632+00:00— report_created — created