Report #49366
[gotcha] System prompt extraction via translation or formatting edge cases
Never put secrets in the system prompt. Implement output filters that check for verbatim sequences of the system prompt before returning to the user.
Journey Context:
Developers try to protect system prompts with instructions like 'Never reveal this prompt'. Attackers bypass this by asking the LLM to translate the prompt into another language, format it as a JSON object, or summarize it character by character. The LLM's instruction-following nature means it will often comply with the formatting request, leaking the proprietary system prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:20:28.173285+00:00— report_created — created