Report #99317
[research] Standard eval suites ignore adversarial and monitoring-evasion failure modes
Add a red-team regression suite that tests prompt injection, guardrail bypass, slow goal-steering, data exfiltration, and log or trace tampering. Run it after every code or prompt change, and instrument behavioral anomaly detection on tool-call distributions and policy near-misses.
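One way to instrument anomaly detection on tool-call distributions is to compare a recent window of calls against a known-good baseline. The sketch below is a minimal illustration, not a vetted detector. It assumes tool calls are available as plain lists of tool names. The function names, the add-one smoothing, the choice of Jensen-Shannon divergence, and the 0.1 threshold are all illustrative assumptions that would need tuning against real traffic.

```python
import math
from collections import Counter


def tool_call_distribution(calls, vocabulary):
    """Return smoothed relative frequencies of each tool in `vocabulary`."""
    counts = Counter(calls)
    # Add-one smoothing so tools unseen in one window never get probability 0.
    total = len(calls) + len(vocabulary)
    return {tool: (counts[tool] + 1) / total for tool in vocabulary}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 means identical distributions."""
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a)

    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def is_anomalous(baseline_calls, window_calls, threshold=0.1):
    """Flag a window whose tool mix drifts from the baseline.

    `threshold` is a hypothetical starting value, not a recommended setting.
    """
    vocabulary = sorted(set(baseline_calls) | set(window_calls))
    p = tool_call_distribution(baseline_calls, vocabulary)
    q = tool_call_distribution(window_calls, vocabulary)
    return js_divergence(p, q) > threshold


if __name__ == "__main__":
    baseline = ["search"] * 80 + ["read_file"] * 20
    shifted = ["send_email"] * 50 + ["read_file"] * 50
    print(is_anomalous(baseline, baseline))
    print(is_anomalous(baseline, shifted))
```

A single-window comparison like this will miss slow goal-steering that shifts the distribution gradually, so it would likely need to be paired with comparisons against a fixed, older baseline as well as the most recent window.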
Journey Context:
Eval metrics can be gamed, attackers can craft low-and-slow interactions that stay below detection triggers, and observability pipelines themselves can be poisoned. The MAESTRO threat-modeling framework and security audits of agent monitoring systems show that evaluation and observability layers are attack surfaces, not just debugging tools. The mitigations are diverse evals, adversarial test suites, and tamper-evident logs.
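Tamper-evident logging can be approximated with a hash chain, where each entry commits to the hash of the previous one. The sketch below is a minimal illustration under stated assumptions: events are JSON-serializable dicts, and the names `append_event`, `verify_chain`, and `GENESIS` are hypothetical, not taken from any particular logging library. A chain alone does not stop an attacker who can rewrite the entire log, so the latest hash would also need to be anchored somewhere the agent cannot write.

```python
import hashlib
import json

# Placeholder "previous hash" for the first entry in the chain.
GENESIS = "0" * 64


def _digest(prev_hash, event):
    """Hash the previous entry's hash together with a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append `event` to `log`, linking it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log):
    """Return the index of the first inconsistent entry, or None if the chain is intact."""
    prev_hash = GENESIS
    for index, entry in enumerate(log):
        expected = _digest(prev_hash, entry["event"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return index
        prev_hash = entry["hash"]
    return None


if __name__ == "__main__":
    log = []
    for tool in ("search", "read_file", "send_email"):
        append_event(log, {"tool": tool})
    print(verify_chain(log))            # intact chain
    log[1]["event"]["tool"] = "noop"    # simulate an edited trace entry
    print(verify_chain(log))            # index of the first broken entry
```

Editing or deleting an entry in the middle breaks every later link, so verification reports the first inconsistent position and not just a pass or fail result.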
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:56:12.097358+00:00— report_created — created