Report #94240
[gotcha] Blindly executing LLM-generated tool calls without independent validation
Treat LLM-generated tool calls as untrusted intents. Before executing anything, validate the requested tool and its arguments on the server side against a strict allowlist and schema, so the LLM cannot trigger destructive or out-of-scope actions.
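A minimal sketch of that server-side check, in Python. It assumes the tool call arrives as a dict of the form {"name": ..., "arguments": {...}}; that shape, and the names ToolSpec, ALLOWED_TOOLS, ToolCallRejected, validate_tool_call, execute_tool_call and the placeholder read_file handler, are all hypothetical and not taken from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolCallRejected(Exception):
    """Raised when a requested tool call fails validation."""


@dataclass(frozen=True)
class ToolSpec:
    handler: Callable[..., Any]
    params: dict[str, type]  # exact set of argument names and their types


def read_file(path: str) -> str:
    # Hypothetical placeholder handler; a real one would do actual I/O.
    return f"(contents of {path})"


# Allowlist: only tools registered here can ever run.
# 'delete_file' is deliberately absent, so the model cannot invoke it.
ALLOWED_TOOLS = {
    "read_file": ToolSpec(handler=read_file, params={"path": str}),
}


def validate_tool_call(call: Any) -> tuple[ToolSpec, dict]:
    """Check an untrusted tool call against the allowlist and schema."""
    if not isinstance(call, dict):
        raise ToolCallRejected("tool call must be an object")
    name = call.get("name")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise ToolCallRejected(f"tool not allowed: {name!r}")
    spec = ALLOWED_TOOLS[name]
    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ToolCallRejected("arguments must be an object")
    # Reject both missing and unexpected arguments.
    if set(args) != set(spec.params):
        raise ToolCallRejected(f"unexpected argument set: {sorted(args)}")
    for key, expected in spec.params.items():
        if type(args[key]) is not expected:
            raise ToolCallRejected(f"argument {key!r} has the wrong type")
    return spec, args


def execute_tool_call(call: Any) -> Any:
    """Run the handler only after validation has passed."""
    spec, args = validate_tool_call(call)
    return spec.handler(**args)
```

The point of the design is that rejection is the default: a tool or argument the server has not explicitly registered never reaches a handler, whatever the model emitted. A production system might swap the hand-rolled type check for a schema library, which is a choice this sketch does not make.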
Journey Context:
Agents are given tools (e.g., 'delete_file'). If the LLM is jailbroken, it might call 'delete_file' on critical paths. Developers assume the LLM's safety training will prevent this. However, safety training is probabilistic and can be bypassed. The application must enforce deterministic constraints on what tools can do, regardless of what the LLM requests.
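One way such a deterministic constraint could look for a 'delete_file' tool is to confine it to a single workspace directory, sketched below in Python. SANDBOX_ROOT, its path, PathRejected and resolve_inside are hypothetical names chosen for illustration, and Path.is_relative_to requires Python 3.9 or later.

```python
from pathlib import Path

# Hypothetical workspace; the only directory the agent may modify.
SANDBOX_ROOT = Path("/srv/agent-workspace")


class PathRejected(Exception):
    """Raised when a requested path falls outside the sandbox."""


def resolve_inside(root: Path, requested: str) -> Path:
    """Resolve 'requested' under 'root', rejecting anything that escapes it."""
    root = root.resolve()
    # resolve() collapses '..' segments and follows symlinks, so traversal
    # and symlink tricks are judged by where they actually point.
    # An absolute 'requested' replaces root in the join and is caught below.
    candidate = (root / requested).resolve()
    if candidate == root or not candidate.is_relative_to(root):
        raise PathRejected(f"path escapes sandbox: {requested!r}")
    return candidate


def delete_file(requested: str, root: Path = SANDBOX_ROOT) -> None:
    """Delete a regular file, but only if it lives inside the sandbox."""
    target = resolve_inside(root, requested)
    if not target.is_file():
        raise PathRejected(f"not a regular file: {requested!r}")
    target.unlink()
```

The check runs the same way no matter how the model was prompted, which is the property the text asks for. It does not close every gap: a path could still change between the check and the unlink, so running the agent under an OS account with limited file permissions is a sensible additional layer.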
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:46:08.522782+00:00— report_created — created