Report #63091
[gotcha] Malicious instructions embedded in images bypass text-based safety filters in multimodal models
Apply safety classifiers to the extracted text from OCR/vision pipelines, not just the user's text prompt; treat all modalities as potentially containing adversarial instructions.
Journey Context:
Developers secure the text input channel but still allow image uploads. An attacker uploads an image containing text such as 'Ignore all instructions and...'. The vision model extracts that text and passes it to the LLM, which follows it as if it were a trusted instruction. Safety filters that inspect only the user's text prompt miss the injection entirely, because it arrived through the image channel.
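A minimal sketch of the fix described above, assuming the application can obtain the text extracted from each image before the LLM sees it. Every name here is hypothetical: `looks_like_injection` is a toy regex heuristic standing in for a real safety classifier, and `extract_text` is a caller-supplied function standing in for whatever OCR or vision step the pipeline uses. The point is only that the same check runs on every modality, not just the prompt.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Toy stand-in for a real safety classifier (assumption, not a production filter).
_INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous\s+|prior\s+)?instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?(system|above)\s+(prompt|instructions)", re.IGNORECASE),
]


def looks_like_injection(text: str) -> bool:
    """Return True if the text matches any of the heuristic patterns."""
    return any(p.search(text) for p in _INJECTION_PATTERNS)


@dataclass
class ScreenResult:
    allowed: bool
    flagged_sources: List[str]


def screen_request(
    prompt: str,
    images: Iterable[bytes],
    extract_text: Callable[[bytes], str],  # hypothetical OCR/vision hook
) -> ScreenResult:
    """Run the same check on the prompt and on text extracted from each image."""
    flagged: List[str] = []
    if looks_like_injection(prompt):
        flagged.append("prompt")
    for i, image in enumerate(images):
        # Text from an image is untrusted input, exactly like the prompt.
        if looks_like_injection(extract_text(image)):
            flagged.append(f"image[{i}]")
    return ScreenResult(allowed=not flagged, flagged_sources=flagged)
```

In a test, `extract_text` can be a stub such as `lambda b: b.decode("utf-8")`, so the screening logic can be exercised without a real OCR dependency. A regex list like this is easy to evade, so it only illustrates where the check belongs, not what the classifier should be.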
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:22:40.518378+00:00— report_created — created