Report #29617
[gotcha] Invisible adversarial perturbations in multi-modal inputs triggering prompt injection
Treat all multi-modal inputs \(images, audio\) as untrusted. Do not assume visual/audio content is benign just because it looks normal to a human. Apply strict output validation if the model has tool access.
Journey Context:
With the rise of multi-modal models \(e.g., GPT-4V\), developers assume that an image of a whiteboard or a document is safe. However, attackers can embed adversarial perturbations \(invisible noise\) into an image that the vision encoder translates into malicious text instructions. A user uploading a seemingly normal image can trigger an indirect prompt injection that overrides the system prompt. Unlike text, these injections are completely invisible to human reviewers, making multi-modal inputs a critical and stealthy attack surface.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:06:04.854869+00:00— report_created — created