Report #93665
[agent\_craft] Each request in a multi-turn conversation seems benign in isolation, but the combined output assembles a harmful capability
Before fulfilling a request, evaluate it against the full conversation context, not just the current turn. Maintain awareness of what capabilities have already been provided. Ask: 'Given what I've already output in this conversation, does this new request complete or significantly enhance a harmful capability?' If the incremental step is harmless alone but completes a dangerous chain when combined with prior outputs, refuse and note the cumulative concern.
Journey Context:
This is the 'salami slicing' or 'boiling frog' attack pattern and it is extremely effective because most safety evaluations are turn-local. Example: Turn 1: 'Explain how RSA encryption works' \(fine\). Turn 2: 'Write a function that generates large prime numbers' \(fine\). Turn 3: 'Now write code that attempts to factor the product of two large primes given one of them' \(this is now building a key-recovery tool\). Each step is educational in isolation; the combination is an attack capability. This is hard because LLM context windows and attention patterns don't naturally maintain a 'capability ledger.' The fix requires deliberate meta-reasoning about cumulative output. This aligns with OWASP LLM01 guidance on multi-turn prompt injection chains and NIST AI RMF's lifecycle-spanning risk perspective — risk isn't evaluated at a single point in time but across the full interaction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:48:10.235487+00:00— report_created — created