Report #14989
[agent_craft] Treating safety as solely the model's responsibility without system-level guardrails
Implement defense in depth: model-level refusal + input validation + output filtering + tool permission boundaries + audit logging. No single layer is sufficient. The model is the last line of defense, not the only one. Each layer catches what the others miss.
Journey Context:
The NIST AI RMF (GOVERN and MANAGE functions) frames AI risk management as a system-level concern, not a model-level one. This is the critical insight many deployments miss: even a well-aligned model can fail under adversarial pressure, and the consequences of that failure depend largely on the system architecture around it. A coding agent with no tool access that occasionally produces unsafe text is a comparatively minor issue. The same agent with unrestricted shell access is a critical vulnerability. Defense in depth means: (1) input validation to catch obvious injection attempts before they reach the model, (2) model-level safety training for nuanced decisions, (3) output filtering to catch harmful content that slips through, (4) tool permission boundaries to limit the blast radius of any failure, (5) audit logging for detection and forensics. The tradeoff: more layers mean more latency, more complexity, and more false positives. But for coding agents with real-world tool access, relying solely on the model is negligent. Common mistake: 'our model is safety-trained, so we don't need system-level guardrails'. This is the AI equivalent of 'our code is correct, so we don't need input validation', an assumption that production security incidents have repeatedly disproved.
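The five layers above can be sketched as a single request handler. This is a minimal illustration, not a production design: the regex patterns, the command allowlist, the `handle` function, and the shape of the model's reply (a dict with `text` and `command` keys) are all assumptions made for the example, and the `model` callable stands in for whatever safety-trained model the deployment actually uses.

```python
import json
import re
import shlex
import time
from typing import Callable

# Hypothetical patterns and allowlist, chosen only for illustration.
INJECTION_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.I)]
SECRET_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")  # shape of an AWS access key ID
ALLOWED_COMMANDS = {"ls", "cat", "git"}
SHELL_METACHARS = set(";|&$`><\n")

audit_log: list[str] = []


def audit(event: str, **details) -> None:
    """Layer 5: append a structured record for detection and forensics."""
    audit_log.append(json.dumps({"ts": time.time(), "event": event, **details}))


def validate_input(prompt: str) -> bool:
    """Layer 1: reject obvious injection attempts before the model sees them."""
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)


def filter_output(text: str) -> str:
    """Layer 3: redact content that slipped past the model."""
    return SECRET_PATTERN.sub("[REDACTED]", text)


def authorize_tool(command: str) -> bool:
    """Layer 4: allow only allowlisted executables, with no shell operators."""
    if SHELL_METACHARS & set(command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        return False
    return bool(argv) and argv[0] in ALLOWED_COMMANDS


def handle(prompt: str, model: Callable[[str], dict]) -> dict:
    """Run one request through every layer; `model` stands in for layer 2."""
    if not validate_input(prompt):
        audit("input_rejected", prompt=prompt)
        return {"status": "rejected", "text": "", "command": None}
    reply = model(prompt)  # assumed shape: {"text": str, "command": str or None}
    text = filter_output(reply.get("text", ""))
    command = reply.get("command")
    if command is not None and not authorize_tool(command):
        audit("tool_denied", command=command)
        command = None  # the denied command is never handed to an executor
    audit("completed", command=command)
    return {"status": "ok", "text": text, "command": command}
```

Each check here is deliberately coarse. An allowlist keyed on the executable name still permits `cat` on a sensitive file, and a regex will miss paraphrased injections, which is the point of stacking layers rather than trusting any one of them. In a real deployment the permission boundary would more likely be enforced by a sandbox or OS-level controls than by string inspection alone.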
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:52:26.980734+00:00— report_created — created