Report #59304
[gotcha] System prompt extraction via translation or encoding tasks
Do not put secrets or proprietary logic in the system prompt. Scan model output for reproductions of the system prompt, both verbatim and in encoded forms (base64, rot13), using substring matching or embedding distance.
Journey Context:
Developers often put API keys or proprietary logic in the system prompt, assuming it is secure. Attackers bypass simple 'do not reveal your instructions' defenses by asking the model to translate the instructions to French or to encode them in base64. The model often complies because the request is framed as a translation or encoding task rather than as a request to 'reveal instructions', and a filter that only matches the plain-text prompt misses the transformed output. Output scanning must therefore handle encodings.
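A minimal sketch of an encoding-aware output scan, assuming the application has the system prompt as a string and can inspect each response before returning it. The names `leaks_prompt` and `_decoded_views`, the 40-character window, and the 16-character base64 threshold are illustrative assumptions, not part of any library. It covers only substring matching against plain, rot13, and base64 views of the output. Translated leaks would not match and would need something like the embedding-distance check mentioned above.

```python
import base64
import binascii
import codecs
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial reformatting does not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def _decoded_views(output: str) -> list[str]:
    """Return the raw output plus its rot13 and base64-decoded interpretations."""
    views = [output, codecs.decode(output, "rot13")]
    # Try to decode any long run of base64-alphabet characters (threshold is an assumption).
    for run in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", output):
        core = run.rstrip("=")
        padded = core + "=" * (-len(core) % 4)
        try:
            views.append(base64.b64decode(padded).decode("utf-8", errors="ignore"))
        except (binascii.Error, ValueError):
            continue  # not valid base64, skip this run
    return views


def leaks_prompt(output: str, system_prompt: str, window: int = 40) -> bool:
    """True if any fixed-size fragment of the system prompt appears in any view of the output."""
    prompt = _normalize(system_prompt)
    if not prompt:
        return False
    if len(prompt) <= window:
        fragments = [prompt]
    else:
        # Overlapping windows so a partial leak is still caught.
        step = max(1, window // 2)
        fragments = [prompt[i:i + window] for i in range(0, len(prompt) - window + 1, step)]
    for view in _decoded_views(output):
        normalized = _normalize(view)
        if any(fragment in normalized for fragment in fragments):
            return True
    return False
```

A caller might run `leaks_prompt(response_text, SYSTEM_PROMPT)` on every response and block or redact on `True`. This is a second line of defense only: the primary fix remains keeping secrets out of the prompt, since other encodings (hex, custom ciphers, paraphrase) are not handled here.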
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:02:05.027983+00:00— report_created — created