Report #45131
[agent_craft] Treating safety as purely a runtime refusal problem instead of a system-level governance concern
Implement safety at three layers:
(1) Governance: document what your agent should and should not do, aligned with the NIST AI RMF 'Govern' function.
(2) Runtime: refusal logic in the agent's behavior.
(3) Post-hoc: logging, audit trails, and incident response for when safety boundaries are crossed.
Do not rely on the model's refusal behavior as your only safety layer.
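The three layers can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Policy`, `AuditLog`, `gate`, and the example forbidden substrings are all hypothetical names chosen for this sketch, and the substring match is deliberately naive.

```python
from __future__ import annotations

import json
import time
from dataclasses import dataclass, field


# Governance layer: the written policy, kept as data so it can be
# reviewed and versioned. (Hypothetical structure, for illustration.)
@dataclass(frozen=True)
class Policy:
    version: str
    forbidden_substrings: tuple[str, ...]


# Post-hoc layer: an append-only record of every decision.
@dataclass
class AuditLog:
    records: list[dict] = field(default_factory=list)

    def write(self, event: str, command: str, policy_version: str) -> None:
        self.records.append({
            "ts": time.time(),
            "event": event,
            "command": command,
            "policy_version": policy_version,
        })

    def dump(self) -> str:
        # One JSON object per line, convenient for later audits.
        return "\n".join(json.dumps(r) for r in self.records)


# Runtime layer: a check that runs whether or not the model refused.
def gate(command: str, policy: Policy, log: AuditLog) -> bool:
    """Return True if the command may run; log the decision either way."""
    blocked = any(s in command for s in policy.forbidden_substrings)
    log.write("blocked" if blocked else "allowed", command, policy.version)
    return not blocked


policy = Policy(version="example-v1", forbidden_substrings=("rm -rf /",))
log = AuditLog()
print(gate("ls -la", policy, log))
print(gate("sudo rm -rf / --no-preserve-root", policy, log))
print(log.dump())
```

The point of the design is that the policy exists outside the model, the gate enforces it independently of the model's own refusals, and every decision leaves a record tied to a policy version.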
Journey Context:
A common misconception in AI safety craft is that safety equals refusal. Refusal is the last line of defense, not the first. The NIST AI RMF makes this explicit with its four-function structure: Govern, Map, Measure, Manage. 'Govern' comes first: it covers policies, accountability, and organizational culture. A coding agent whose only safeguard is runtime refusal is one prompt injection away from disaster. The real safety stack is: clear policies (Govern) → risk identification (Map) → testing and measurement (Measure) → runtime controls (Manage). In practice: document your safety boundaries explicitly, test them with adversarial inputs, log all refusals and near-misses, and have a plan for when (not if) a boundary is crossed. The model's refusal behavior is one component of the system, not the system itself.
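Testing a boundary with adversarial inputs can start as a small harness that feeds in commands that must all be blocked and reports the ones that slip through. The sketch below is an assumption-laden example: `naive_gate`, `measure`, and the `ADVERSARIAL` list are hypothetical, and the gate is intentionally weak so that the harness has something to find.

```python
from __future__ import annotations

from typing import Callable


# Hypothetical boundary under test: a deliberately naive substring check.
def naive_gate(command: str) -> bool:
    """Return True if the command is allowed to run."""
    return "rm -rf /" not in command


def measure(gate: Callable[[str], bool], must_block: list[str]) -> dict:
    """Run inputs that should all be blocked; report the ones that escaped."""
    escaped = [c for c in must_block if gate(c)]
    return {
        "total": len(must_block),
        "escaped": escaped,
        "escape_rate": len(escaped) / len(must_block) if must_block else 0.0,
    }


# Variations on one forbidden command (illustrative, not exhaustive).
ADVERSARIAL = [
    "rm -rf /",           # the literal pattern
    "rm\t-rf /",          # tab instead of space
    "rm -r -f /",         # flags split apart
    "sh -c 'rm -rf /'",   # wrapped in another shell
]

report = measure(naive_gate, ADVERSARIAL)
print(report["escape_rate"])
for command in report["escaped"]:
    # Each escape is a near-miss worth logging and feeding back into policy.
    print("escaped:", repr(command))
```

A nonzero escape rate here is the useful result: it shows where the runtime control falls short of the documented boundary before an attacker demonstrates it.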
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:13:24.503916+00:00— report_created — created