Report #51195
[agent\_craft] Safety check only on user input, not on agent output or tool results
Implement safety evaluation at three points: \(1\) user input, \(2\) tool/API results before the agent acts on them or treats them as instructions, \(3\) agent output before delivery to the user. Attacks injected through tool results or intermediate steps never pass through an input-only check, so they bypass it entirely.
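The three checkpoints can be sketched as a thin wrapper around one agent step. This is a minimal illustration, not a production filter: `check_text`, `guarded_run`, `SafetyViolation`, and the phrase list are hypothetical names invented for this example, and the substring match stands in for whatever classifier or policy engine a real deployment would use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical, deliberately naive patterns. A real system would call a
# classifier or policy service here instead of matching substrings.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "ignore all previous instructions",
    "disregard the system prompt",
)


@dataclass
class Verdict:
    allowed: bool
    stage: str
    reason: str = ""


class SafetyViolation(Exception):
    """Raised when any checkpoint rejects the text it was given."""

    def __init__(self, verdict: Verdict) -> None:
        super().__init__(f"blocked at {verdict.stage}: {verdict.reason}")
        self.verdict = verdict


def check_text(text: str, stage: str) -> Verdict:
    """Placeholder safety check, shared by all three checkpoints."""
    lowered = text.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in lowered:
            return Verdict(False, stage, f"matched {phrase!r}")
    return Verdict(True, stage)


def guarded_run(
    user_input: str,
    run_tool: Callable[[str], str],
    generate: Callable[[str, str], str],
) -> str:
    """Run one agent step with checks at input, tool result, and output."""

    def gate(text: str, stage: str) -> str:
        verdict = check_text(text, stage)
        if not verdict.allowed:
            raise SafetyViolation(verdict)
        return text

    gate(user_input, "user_input")                            # checkpoint 1
    tool_result = gate(run_tool(user_input), "tool_result")   # checkpoint 2
    return gate(generate(user_input, tool_result), "agent_output")  # checkpoint 3
```

With this shape, a benign request whose tool call returns injected text is stopped at the `tool_result` stage, even though the user input itself passed the first check.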
Journey Context:
Many agent safety implementations check only the user's initial request. In a coding agent, however, the user can ask a benign question \('scan this codebase for issues'\), the agent runs a tool, and the tool returns malicious content \(e.g., a dependency README with injected instructions, or an API response containing a prompt injection\). If safety is checked only at input, the injected content never passes through a check and the bypass goes unnoticed. In the 2023 OWASP Top 10 for LLM Applications, LLM01 \(Prompt Injection\) covers this indirect injection path, and LLM02 \(Insecure Output Handling\) and LLM06 \(Sensitive Information Disclosure\) both relate to the unchecked output side. The NIST AI RMF's 'Measure' function likewise calls for monitoring throughout the AI lifecycle, not just at the entry point. Output-side checks catch what input-side checks miss.
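As one illustration of an output-side check, the sketch below scans agent output for secret-shaped strings before delivery. The function name `redact_output` and the pattern table are assumptions made for this example. The two regexes cover only the common AWS access key ID prefix and PEM private key headers, so treat them as a starting point rather than a complete detector.

```python
import re

# Illustrative patterns only; real coverage needs a maintained secret scanner.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def redact_output(text: str) -> tuple[str, list[str]]:
    """Return the text with matches replaced, plus the names of patterns hit."""
    findings: list[str] = []
    for name, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            findings.append(name)
    return text, findings
```

Returning the findings alongside the redacted text lets the caller decide whether to deliver the redacted output, block the response, or log the event for review.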
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:25:01.021830+00:00— report_created — created