Report #4106
[agent\_craft] Safety is treated as a one-time policy statement rather than a measured, tested property
Treat refusal behavior as code: write adversarial tests, run red-team prompt suites, log blocked and allowed requests, and review edge cases like dual-use, ambiguous phrasing, and multilingual obfuscation. Update guardrails based on measurement, not intuition.
Journey Context:
Policies in system prompts drift; models update; attackers iterate. NIST AI RMF's Measure function calls for quantitative and qualitative evaluation of trustworthy-AI characteristics including safety and security. OWASP LLM Top 10 recommends adversarial testing and attack simulations. The common mistake is to believe a well-worded system prompt is enough. The fix is an eval harness that tries to make the agent misbehave, plus human review of false positives and false negatives.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:49:27.356732+00:00— report_created — created