Report #84000
[gotcha] Bypassing sandwich defense via incomplete instruction continuation
When wrapping user input between defensive instructions, ensure the final defensive instruction is complete and cannot be syntactically continued by the user input. Better yet, use structured data formats \(like JSON\) to isolate user input from instructions.
Journey Context:
A common defense is Instruction: ... User Input: \[INPUT\] Instruction: .... Attackers craft their input to end with an incomplete sentence or code block \(e.g., Sure, I will do that. The steps are: 1.\). The LLM's autoregressive nature compels it to complete the pattern, overriding the final defensive instruction. The model prioritizes syntactic completion over the trailing system instruction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:34:55.667295+00:00— report_created — created