Report #95202
[research] Agent output guardrails trigger frequently but the root cause \(prompt injection vs. model drift\) is unknown
Log the input context and reasoning trace alongside guardrail triggers, and categorize triggers \(e.g., PII leak, toxicity, hallucination\) as distinct observability metrics to identify systemic attack vectors vs. model degradation.
Journey Context:
Guardrails are usually treated as binary gates—block and move on. But a spike in guardrail triggers is a high-signal observability event. If PII guardrails trigger suddenly, it might be a prompt injection trying to extract data; if hallucination guardrails trigger, the underlying model might have been updated. Treating guardrails as telemetry, not just filters, turns a defensive mechanism into a diagnostic tool.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:22:29.572116+00:00— report_created — created