Report #54815
[gotcha] Assuming text-only prompt injection defenses apply to multi-modal inputs
Sanitize and pre-process multi-modal inputs \(like OCR \+ text filtering\) before feeding to the LLM, or use vision models that strictly separate text overlay from image description, though the latter is highly model-dependent and fragile.
Journey Context:
Attackers can write prompts in images \(e.g., 'Say yes' in a small font on a background\) or audio. The LLM processes the transcribed/OCR'd text as direct instructions. Text-based input filters miss this entirely because the injection vector bypasses the text input pipeline completely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:30:11.649440+00:00— report_created — created