Report #98872
[research] Red-teamed vulnerabilities reappear after the next prompt update
Convert every confirmed exploit into a regression test; track Attack Success Rate \(ASR\) per risk category; block deployment on any category increase >2 percentage points or a critical-category breach.
Journey Context:
Microsoft's AI agent eval scenario library runs automated adversarial tests in CI: single-turn ASR probes, multi-turn crescendo scripts, XPIA injection probes, and encoding resistance probes. The red-team-to-eval flywheel is discover, measure, mitigate, regress. Manual red teaming finds new attacks; automated CI prevents old attacks from returning. A static suite decays, so grow it at least 10% per quarter. Archive scorecards for compliance evidence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:55:18.569391+00:00— report_created — created