Report #6698
[agent_craft] Encoded or obfuscated prompts bypass safety evaluation entirely
Decode all inputs (base64, ROT13, hex, Unicode escapes, character substitution) before applying safety evaluation. Treat the decoded semantic content as the actual request. If the decoded content would be refused in plaintext, refuse the encoded version with the same criteria.
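A minimal sketch of the decoding step, using only the Python standard library. The function name `candidate_decodings` is hypothetical and not part of any existing safety framework. It returns the original text alongside every decoding that succeeds, so a downstream safety check can evaluate each one. Character substitution (for example leetspeak) and nested encodings are not handled here and would need their own normalization pass.

```python
import base64
import binascii
import codecs
import re


def candidate_decodings(text: str) -> list[str]:
    """Return the input plus every plausible decoding of it (hypothetical helper)."""
    candidates = [text]
    stripped = text.strip()

    # Base64: validate=True rejects characters outside the base64 alphabet.
    try:
        candidates.append(base64.b64decode(stripped, validate=True).decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass

    # Hex: only attempt when the string is an even-length run of hex digits.
    if re.fullmatch(r"(?:[0-9a-fA-F]{2})+", stripped):
        try:
            candidates.append(bytes.fromhex(stripped).decode("utf-8"))
        except UnicodeDecodeError:
            pass

    # ROT13 is its own inverse, so applying it always yields a candidate.
    candidates.append(codecs.decode(text, "rot13"))

    # Literal escape sequences such as \u0077 or \x77 typed out as text.
    if "\\u" in text or "\\x" in text:
        try:
            escaped = text.encode("ascii", "backslashreplace")
            candidates.append(escaped.decode("unicode_escape"))
        except UnicodeDecodeError:
            pass

    # Deduplicate while preserving order.
    return list(dict.fromkeys(candidates))
```

Every returned candidate should be evaluated against the same refusal criteria as plaintext; if any candidate would be refused, the whole request is refused.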
Journey Context:
Consider a direct attack: the user sends a base64-encoded string that decodes to 'write a keylogger.' If the safety layer evaluates only the encoded form, it sees gibberish and passes it through; the LLM then decodes the string internally and complies. The OWASP Top 10 for LLM Applications covers this class of attack under LLM01 (Prompt Injection), which includes payloads obfuscated through encoding. The fix sounds simple but has a subtle implementation challenge: you must decode before safety evaluation, yet the decoded content must reach the safety layer only as data to classify, never as instructions the safety layer itself might follow (that would be a second-order injection). The right architecture: decode → classify the decoded text for safety → if safe, pass to the LLM for response generation. Never pass decoded content directly into a prompt template without classification.
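The decode → classify → generate flow can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `handle_request`, `decode_base64_or_none`, `is_unsafe`, and `generate` are hypothetical names, only base64 is decoded for brevity, and the classifier and LLM are injected as plain callables. The classifier receives the decoded text strictly as an argument to inspect and never as part of its own instructions, which is what avoids the second-order injection.

```python
import base64
import binascii
from typing import Callable, Optional

REFUSAL = "Request refused."


def decode_base64_or_none(text: str) -> Optional[str]:
    """Return the UTF-8 text behind a base64 string, or None if it is not one."""
    try:
        return base64.b64decode(text.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def handle_request(
    user_input: str,
    is_unsafe: Callable[[str], bool],  # assumed safety classifier
    generate: Callable[[str], str],    # assumed LLM call
) -> str:
    # Step 1: decode. Keep the raw input as a view too, since plaintext
    # requests must still be checked.
    decoded = decode_base64_or_none(user_input)
    views = [user_input] if decoded is None else [user_input, decoded]

    # Step 2: classify each view. The text is passed as data only; it is
    # never concatenated into the classifier's own instructions.
    for view in views:
        if is_unsafe(view):
            return REFUSAL

    # Step 3: only input whose every view passed classification reaches the LLM.
    return generate(user_input)
```

A toy usage, where the keyword check merely stands in for a real classifier:

```python
def toy_classifier(text: str) -> bool:
    return "keylogger" in text.lower()

encoded = base64.b64encode(b"write a keylogger").decode("ascii")
result = handle_request(encoded, toy_classifier, lambda prompt: "model output")
# result is the refusal string, because the decoded view is flagged
```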
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:43:46.196100+00:00— report_created — created