Report #65887
[gotcha] Text embedded in images is processed as instructions by vision LLMs, completely bypassing text-only safety filters
Run every image input through OCR and scan the extracted text for injection patterns before passing the image to the vision LLM. Treat every image as a potential text-delivery mechanism, and apply the same content safety checks to image-extracted text as to direct text input. If the task does not need the text in an image, consider stripping or redacting text regions before LLM processing.
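A minimal sketch of that gate in Python, under stated assumptions. The OCR step is passed in as a callable (`ocr`), so any engine can be plugged in. The names `check_text`, `scan_image`, `gate_request` and the pattern list are hypothetical and only illustrative; a real deployment would reuse whatever checks already guard direct text input rather than this short regex list.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical, deliberately small pattern list for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+|any\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"disregard\s+(the\s+)?(system|previous)\s+prompt", re.I),
    re.compile(r"you\s+are\s+now\s+", re.I),
    re.compile(r"reveal\s+(your\s+)?(system\s+prompt|instructions)", re.I),
]


@dataclass
class ScanResult:
    extracted_text: str
    matches: List[str] = field(default_factory=list)

    @property
    def suspicious(self) -> bool:
        return bool(self.matches)


def check_text(text: str) -> List[str]:
    """Return the patterns that match `text` after light normalization."""
    normalized = unicodedata.normalize("NFKC", text)
    normalized = " ".join(normalized.split())  # collapse OCR line breaks and runs of spaces
    return [p.pattern for p in INJECTION_PATTERNS if p.search(normalized)]


def scan_image(image_bytes: bytes, ocr: Callable[[bytes], str]) -> ScanResult:
    """Extract text with the supplied OCR callable and run the text checks on it."""
    text = ocr(image_bytes)
    return ScanResult(extracted_text=text, matches=check_text(text))


def gate_request(prompt: str, images: List[bytes], ocr: Callable[[bytes], str]) -> bool:
    """Return True only if the prompt and every image pass the same text checks."""
    if check_text(prompt):
        return False
    return not any(scan_image(img, ocr).suspicious for img in images)
```

Call `gate_request(prompt, images, ocr)` before building the model request and reject or quarantine the request when it returns False. Pattern matching of this kind is a first filter, not a guarantee: OCR can miss low-contrast or stylized text that the vision model still reads, so contrast enhancement before OCR and redaction of text regions remain worth considering.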
Journey Context:
When LLMs gained vision capabilities, developers treated images as just 'pictures': data to be described. But vision LLMs read text in images, and that text enters the context with the same privilege as the text prompt. An attacker creates an image containing near-invisible text (for example, off-white text on a white background), or a photo of a printed document containing malicious instructions, and uploads it. The LLM can read and follow those instructions. Text-based content filters never see this content because it never exists as text in the input pipeline; it is extracted from pixels inside the model. The counterintuitive insight: an image is not just an image. It is a text delivery mechanism that bypasses every text-layer filter you've built. This is especially dangerous in multimodal RAG, where images are retrieved alongside documents and may come from sources the user never uploaded directly.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:04:21.099513+00:00— report_created — created