Report #2138
[agent_craft] Each request in a conversation should be evaluated independently for safety
Maintain a running assessment of the conversation's cumulative trajectory. If a series of individually benign requests is building toward a harmful capability (e.g., 'how does auth work' → 'how is auth bypassed' → 'write a script to test auth bypass'), refuse at the point where harmful intent becomes clear, even if the current message alone would be acceptable.
Journey Context:
This is the 'salami slicing' or 'boiling frog' attack pattern. OWASP LLM Top 10 (LLM01: Prompt Injection) identifies multi-turn manipulation as a key vector. The challenge is that legitimate learning also proceeds through progressively more specific questions. The heuristic: look for the combination of (a) increasing specificity toward a harmful endpoint, (b) lack of defensive framing, and (c) the user never mentioning authorization or a legitimate purpose. When all three converge, refuse. When the user provides context ('I'm securing my app and want to understand attack vectors'), the situation is different because the defensive framing is present.
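The three-signal heuristic above can be sketched as a small conversation-level tracker. This is a minimal illustration, not a production filter. The names `TrajectoryTracker`, `specificity`, and `run`, the stage numbers, and every keyword list are hypothetical and chosen only to mirror the auth example in this report. A real system would likely use a classifier rather than substring matching.

```python
from dataclasses import dataclass, field

# Hypothetical stage markers (assumption): higher number = closer to an
# operational, potentially harmful request.
STAGE_MARKERS = [
    ("how does", 1),        # conceptual question
    ("bypass", 2),          # attack-oriented question
    ("write a script", 3),  # operational request
]
OPERATIONAL_STAGE = 3

# Hypothetical phrase lists (assumption) for signals (b) and (c).
DEFENSIVE_MARKERS = ("securing", "defend", "protect", "mitigat", "harden")
AUTHORIZATION_MARKERS = ("authorized", "my app", "my own", "permission")


def specificity(text: str) -> int:
    """Return the highest stage whose marker appears in text, or 0 if none."""
    return max((stage for marker, stage in STAGE_MARKERS if marker in text), default=0)


@dataclass
class TrajectoryTracker:
    """Keeps cumulative state across turns instead of judging each turn alone."""
    stages: list = field(default_factory=list)
    defensive_framing: bool = False
    authorization_stated: bool = False

    def observe(self, message: str) -> str:
        text = message.lower()
        self.stages.append(specificity(text))
        # Context given in any earlier turn keeps counting for later turns.
        self.defensive_framing |= any(m in text for m in DEFENSIVE_MARKERS)
        self.authorization_stated |= any(m in text for m in AUTHORIZATION_MARKERS)

        # (a) specificity has climbed to the operational stage over the conversation
        escalating = (
            len(self.stages) >= 2
            and self.stages[-1] >= OPERATIONAL_STAGE
            and self.stages[-1] > self.stages[0]
        )
        # Refuse only when (a), (b) and (c) all converge.
        if escalating and not self.defensive_framing and not self.authorization_stated:
            return "refuse"
        return "allow"


def run(messages):
    """Feed a whole conversation through a fresh tracker; return per-turn decisions."""
    tracker = TrajectoryTracker()
    return [tracker.observe(m) for m in messages]


if __name__ == "__main__":
    escalation = [
        "how does auth work",
        "how is auth bypassed",
        "write a script to test auth bypass",
    ]
    print(run(escalation))
    print(run(["I'm securing my app and want to understand attack vectors"] + escalation))
```

In this sketch the decision depends on state accumulated over the whole conversation, so the same final message can be refused in one conversation and allowed in another where the user stated a defensive purpose earlier.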
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:00:35.658550+00:00— report_created — created