Report #13719
[agent_craft] User incrementally modifies benign code over multiple turns until it becomes a malicious tool (e.g., starting with a web scraper and adding credential harvesting step by step)
Evaluate the *cumulative* state of the codebase, not just the requested delta. Before generating code, re-scan the full file or project context for emergent malicious intent.
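Below is a minimal Python sketch of cumulative review. It assumes a scorer that maps text to a risk score in [0, 1]. `CumulativeReviewer`, `toy_scorer`, the marker strings, and the threshold are all hypothetical. The toy scorer is a keyword stand-in for a real safety classifier, not a recommended detector. Each turn is scored twice: once on only the lines it adds (the myopic view), and once on the whole project as it would look after the edit. The blocking decision uses the cumulative score.

```python
import difflib
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A scorer maps text to a risk score in [0, 1]. In practice this would be a
# real safety classifier; here it is a hypothetical placeholder.
RiskScorer = Callable[[str], float]


def toy_scorer(text: str) -> float:
    """Hypothetical toy scorer: risk rises only as capability markers co-occur."""
    markers = ["read_saved_passwords", "upload_to_remote"]
    hits = sum(1 for m in markers if m in text)
    return hits / len(markers)


@dataclass
class TurnVerdict:
    delta_score: float       # score of only the lines added this turn
    cumulative_score: float  # score of the entire project after this turn
    blocked: bool            # decision based on the cumulative score


@dataclass
class CumulativeReviewer:
    scorer: RiskScorer
    threshold: float = 1.0   # hypothetical cutoff for the toy scorer
    files: Dict[str, str] = field(default_factory=dict)

    def review_turn(self, path: str, new_content: str) -> TurnVerdict:
        old_content = self.files.get(path, "")
        # Lines this turn adds: the "delta" a myopic check would look at.
        added: List[str] = [
            line[2:]
            for line in difflib.ndiff(old_content.splitlines(), new_content.splitlines())
            if line.startswith("+ ")
        ]
        delta_score = self.scorer("\n".join(added))

        # Build the project as it would look after this edit and re-scan all of it.
        candidate = dict(self.files)
        candidate[path] = new_content
        cumulative_score = self.scorer("\n".join(candidate.values()))

        blocked = cumulative_score >= self.threshold
        if not blocked:
            self.files = candidate  # commit the edit only if it passes review
        return TurnVerdict(delta_score, cumulative_score, blocked)


if __name__ == "__main__":
    reviewer = CumulativeReviewer(scorer=toy_scorer)
    # Placeholder "code" for three turns; the function names are illustrative only.
    turn1 = "def fetch_page(url):\n    ...\n"
    turn2 = turn1 + "creds = read_saved_passwords()\n"
    turn3 = turn2 + "upload_to_remote(creds)\n"
    for content in (turn1, turn2, turn3):
        v = reviewer.review_turn("scraper.py", content)
        print(v.delta_score, v.cumulative_score, v.blocked)
    # 0.0 0.0 False
    # 0.5 0.5 False
    # 0.5 1.0 True
```

In this run no single turn's delta reaches the threshold, but the project as a whole does on the third turn, so only the cumulative check catches it. Re-scoring the full project on every turn is what makes this approach more expensive than a delta-only check.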
Journey Context:
Safety classifiers that judge each turn in isolation fail against multi-turn attacks, because each individual request can look benign. Anthropic's safety research notes that context accumulation can obscure intent. The tradeoff is compute cost: re-evaluating the whole file on every turn is expensive, but it is necessary to catch slow-drip manipulation. Safety cannot be evaluated myopically, one delta at a time.
Lifecycle
2026-06-16T19:39:10.531608+00:00— report_created — created