Report #85856
[agent_craft] Safety evaluation applied only to model text output, not to the real-world effects of tool calls the agent executes
Apply safety evaluation at the tool-call boundary, not just at the text-output boundary. Before executing any tool call (file write, shell command, API call, network request, database mutation), evaluate the likely consequences of the action. A coding agent that can write files, execute code, or make network requests has real-world agency, so its safety surface extends beyond text to actual system effects. Require human confirmation for irreversible or high-stakes operations.
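As a rough illustration of a check at the tool-call boundary, the sketch below wraps tool execution in a gate that classifies each call and asks for confirmation before anything rated high-risk runs. Every name here (`ToolCall`, `classify`, `execute_guarded`, the tool names, the pattern list) is hypothetical and not taken from any particular agent framework; the substring patterns are placeholders, not a usable policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class Risk(Enum):
    LOW = 1   # read-only or easily undone
    HIGH = 2  # irreversible or high-stakes: needs a human decision


@dataclass
class ToolCall:
    tool: str              # hypothetical tool names, e.g. "shell", "file_read"
    args: Dict[str, Any]


class ToolCallRefused(Exception):
    """Raised when a call is stopped at the tool-call boundary."""


# Illustrative substrings only; a real policy needs far more than this.
DESTRUCTIVE_SHELL = ("rm -rf", "mkfs", "dd if=", "drop table")
READ_ONLY_TOOLS = {"file_read", "list_dir"}


def classify(call: ToolCall) -> Risk:
    """Rate a tool call before it runs. Unknown tools fail closed as HIGH."""
    if call.tool in READ_ONLY_TOOLS:
        return Risk.LOW
    if call.tool == "shell":
        cmd = str(call.args.get("cmd", "")).lower()
        if any(pattern in cmd for pattern in DESTRUCTIVE_SHELL):
            return Risk.HIGH
        return Risk.LOW
    return Risk.HIGH


def execute_guarded(
    call: ToolCall,
    tools: Dict[str, Callable[..., Any]],
    confirm: Callable[[ToolCall], bool],
) -> Any:
    """Run the call only if it is low-risk or a human approved it."""
    if call.tool not in tools:
        raise ToolCallRefused(f"unknown tool: {call.tool}")
    if classify(call) is Risk.HIGH and not confirm(call):
        raise ToolCallRefused(f"confirmation denied for {call.tool}")
    return tools[call.tool](**call.args)
```

In use, `confirm` would be whatever prompts the operator (a CLI prompt, a review queue); passing `lambda call: False` gives a mode in which high-risk calls are always refused. Treating unrecognized tools as high-risk is a deliberate choice in this sketch so that newly added tools are gated by default.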
Journey Context:
Text-only LLMs have a limited harm surface: they can produce bad advice but cannot take action. Coding agents with tool access can act: they can write malware to disk, execute destructive shell commands, exfiltrate data via network calls, or drop backdoor files into source trees. OWASP LLM Top 10 LLM08 (Excessive Agency) addresses exactly this risk: agents that take actions without appropriate constraints or human oversight. NIST AI RMF governance functions (GV-1) address accountability for AI system actions in deployment. The tradeoff: pre-execution safety checks add latency and can produce false positives that frustrate users, but post-hoc evaluation comes too late for irreversible actions such as file deletion or data exfiltration. The right call is defense in depth: evaluate intent at the reasoning level, evaluate the action at the tool-call level, and require human confirmation for operations with irreversible real-world effects.
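The three layers described above (intent, action, human confirmation) could be composed roughly as follows. This is a sketch under stated assumptions: the `Proposal` and `Verdict` types, the flagged terms, the host allowlist, and the tool names are all invented for illustration, and the keyword match stands in for whatever intent evaluation a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class Proposal:
    reasoning: str  # the agent's stated plan for this step
    tool: str       # hypothetical tool name
    argument: str   # command, path, or URL


@dataclass
class Verdict:
    allowed: bool
    layer: str      # which layer decided: "intent", "action", "human", "all"
    reason: str


# Assumed values for illustration only.
FLAGGED_TERMS = ("exfiltrate", "backdoor", "disable logging")
ALLOWED_HOSTS = {"internal.example"}
IRREVERSIBLE_TOOLS = {"file_delete", "db_mutation"}


def intent_check(p: Proposal) -> Optional[str]:
    """Layer 1: crude keyword stand-in for reasoning-level evaluation."""
    text = p.reasoning.lower()
    for term in FLAGGED_TERMS:
        if term in text:
            return f"reasoning mentions '{term}'"
    return None


def action_check(p: Proposal) -> Optional[str]:
    """Layer 2: inspect the concrete action, independent of stated intent."""
    if p.tool == "network_request":
        host = urlparse(p.argument).hostname
        if host not in ALLOWED_HOSTS:
            return f"host {host!r} is outside the allowlist"
    return None


def evaluate(p: Proposal, ask_human: Callable[[Proposal], bool]) -> Verdict:
    """Run the layers in order and stop at the first one that objects."""
    for name, check in (("intent", intent_check), ("action", action_check)):
        reason = check(p)
        if reason is not None:
            return Verdict(False, name, reason)
    # Layer 3: only irreversible operations pay the cost of a human prompt.
    if p.tool in IRREVERSIBLE_TOOLS and not ask_human(p):
        return Verdict(False, "human", "operator declined an irreversible operation")
    return Verdict(True, "all", "passed every layer")
```

Recording which layer produced the verdict makes false positives easier to trace back to the check that caused them, which matters given the latency and friction tradeoff noted above.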
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:41:56.107890+00:00— report_created — created