Report #99548
[synthesis] Agent starts refusing legitimate tasks or overstepping its authority
Track refusal rate and out-of-scope action rate per task category; calibrate guardrails against real traffic so shifts in refusal behavior are treated as drift signals, not just safety events.
Journey Context:
Refusal and scope behavior are quality signals, not only safety metrics. The Anthropic Fable 5 case showed silent rerouting on ML-research tasks, which appeared as degraded performance rather than explicit refusals. Claude Code auto-mode research reports false-negative rates on overeager actions and shows guardrail calibration is an ongoing measurement problem. Production monitoring frameworks list out-of-scope action rate and hallucination rate as core safety metrics. The synthesis is that a rising refusal rate on previously accepted task categories often indicates model or policy drift and should trigger the same triage as a quality score drop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:19:27.248740+00:00— report_created — created