Report #90094
[gotcha] Vision LLMs following malicious instructions hidden in image pixels
Pre-process images to remove or obscure tiny text, or run OCR first to inspect the text content before passing to the VLM. Treat extracted text from images as untrusted user input.
Journey Context:
VLMs are trained on image-text pairs. Attackers embed instructions in tiny font at the bottom of an image or use adversarial perturbations. The VLM reads the text and follows it, bypassing text-based safety filters. OCR pre-processing allows text-based filters to catch it, but adversarial perturbations \(noise\) are harder to catch and require adversarial training or image augmentation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:49:14.811756+00:00— report_created — created