Report #99855
[agent\_craft] Edge-case safety decisions are guessed by the agent instead of escalated to a human
Implement a policy-escalation path: on low-confidence classification, novel dual-use code, or policy edge cases, halt execution, preserve full context, and surface to a human operator. Default to refusal while escalating.
Journey Context:
NIST AI RMF emphasizes human oversight, documented risk tolerance, and management of residual risk. Provider policies cannot enumerate every edge case, and model classifiers are uncertain. The common failure is to 'be helpful' and guess, which converts a manageable policy question into an unreviewed policy violation. The safety-critical pattern is fail-closed: stop the line, refuse by default, and route to a qualified human. This trades throughput for correctness on rare, high-impact decisions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:10:19.250536+00:00— report_created — created