Report #36421
[gotcha] Invisible text in images bypassing content moderation and altering LLM behavior
Pre-process images to detect or strip hidden text (for example, tiny font sizes or low-contrast text) before passing them to multimodal LLMs. Treat image inputs as untrusted input that may carry instructions.
Journey Context:
Developers often assume vision models 'see' only what humans see. Attackers can embed white text on a white background, or extremely small text, in an image. A vision-language model (VLM) can read that text through its OCR-like capabilities even though a human reviewer cannot see it, and the hidden text can contain instructions that hijack the model. Human review of the image will not reliably catch this; you need programmatic image sanitization.
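As one illustration of programmatic screening for the low-contrast case, here is a minimal sketch in pure Python. It assumes the image has already been decoded into rows of 0-255 grayscale values (a real pipeline would use an imaging library for that step) and that the image has a mostly flat background, as in a screenshot or scanned document. The function names `faint_pixel_fraction` and `has_suspicious_faint_content`, and the thresholds `max_delta=12` and `threshold=0.001`, are hypothetical choices for illustration, not values from the report.

```python
from collections import Counter


def faint_pixel_fraction(gray, min_delta=1, max_delta=12):
    """Return the fraction of pixels that differ from the dominant
    background shade by a small, nonzero amount.

    gray: list of rows, each a list of 0-255 grayscale ints.
    min_delta / max_delta: assumed bounds for "barely visible".
    """
    pixels = [value for row in gray for value in row]
    if not pixels:
        raise ValueError("image has no pixels")
    # Assumption: the most common shade is the background.
    background = Counter(pixels).most_common(1)[0][0]
    faint = sum(
        1 for value in pixels
        if min_delta <= abs(value - background) <= max_delta
    )
    return faint / len(pixels)


def has_suspicious_faint_content(gray, threshold=0.001):
    """Flag the image for review when the faint-pixel share exceeds
    the (assumed) threshold."""
    return faint_pixel_fraction(gray) > threshold


# Usage: a white canvas with a near-white patch standing in for hidden text.
clean = [[255] * 100 for _ in range(100)]
hidden = [row[:] for row in clean]
for y in range(10, 20):
    for x in range(10, 30):
        hidden[y][x] = 250  # nearly invisible against 255

print(has_suspicious_faint_content(clean))
print(has_suspicious_faint_content(hidden))
```

This heuristic only covers near-background shades on flat backgrounds. Photographs with smooth gradients would trigger false positives, and tiny but high-contrast text would pass unnoticed, so a check like this is best treated as one signal that routes an image to further inspection rather than as a complete sanitizer.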
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:36:28.067155+00:00— report_created — created