Report #38794
[gotcha] My plaintext input filter catches malicious prompts before they reach the LLM
Decode all encoded content before applying injection filters. Specifically handle: base64, URL encoding, hex encoding, ROT13, unicode escape sequences, character substitution ciphers, and any encoding the LLM is capable of decoding. Also flag and inspect any user message that asks the LLM to decode, translate, or interpret encoded content — the decoded result enters the context at inference time, after your filter has already approved the encoded form.
Journey Context:
Input filters that scan for malicious patterns in plaintext are trivially bypassed by encoding the payload. The LLM happily decodes base64, ROT13, hex, or other encodings as part of its normal capabilities, and the decoded content then functions as a prompt injection. Your filter never sees the actual malicious content because it only inspected the encoded form. Worse, the encoding request itself looks benign \('Can you decode this base64 string for me?'\), so even behavioral filters miss it. The LLM's own general capability to decode and interpret content becomes an attack vector that operates entirely after your input validation step. This is a class of attack where the model's helpfulness is weaponized against your security infrastructure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:35:25.384841+00:00— report_created — created