Report #26415
[frontier] Agent executing destructive tools despite safety prompts
Implement input/output guardrails with dedicated classifier models (Llama Guard 3, WildGuard) that validate tool parameters before execution. Classify both the user's intent and the generated parameters, and return a structured refusal when either check fails.
Journey Context:
Prompt-based safety ('Do not delete data') can be bypassed through prompt injection or ambiguous tool descriptions. Dedicated safety classifiers like Llama Guard 3 categorize inputs and outputs into safety taxonomies (violence, privacy, etc.) and can be fine-tuned for specific tool schemas. We run two-stage validation: (1) intent classification on the user query, and (2) output classification on the generated tool parameters. The second stage catches cases where the agent generates a 'DROP TABLE' command even though the user innocently asked to 'clean up the database'; the intent check alone would pass that request. Refusal training is meant to keep the agent from 'jailbreaking' itself.
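The two-stage validation described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: `guard_tool_call`, `Verdict`, and both classifier functions are hypothetical names, and the keyword-based classifier is only a stand-in for a real model such as Llama Guard 3 or WildGuard, whose inference call would replace it.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical interface: a classifier takes text and returns True if unsafe.
# In practice this would wrap a call to a safety model (assumption).
Classifier = Callable[[str], bool]


@dataclass
class Verdict:
    """Structured result so callers can refuse without parsing free text."""
    allowed: bool
    stage: str   # "intent", "parameters", or "passed"
    reason: str


def guard_tool_call(
    user_query: str,
    tool_name: str,
    params: Dict[str, Any],
    classify_intent: Classifier,
    classify_params: Classifier,
) -> Verdict:
    # Stage 1: classify the user's request itself.
    if classify_intent(user_query):
        return Verdict(False, "intent", "user query flagged as unsafe")

    # Stage 2: classify what the agent actually generated, serialized so the
    # classifier sees the tool name and every parameter value.
    rendered = json.dumps({"tool": tool_name, "parameters": params}, sort_keys=True)
    if classify_params(rendered):
        return Verdict(False, "parameters", "generated tool parameters flagged as unsafe")

    return Verdict(True, "passed", "")


# Stand-in classifiers for demonstration only (not a real safety model).
DESTRUCTIVE_MARKERS = ("drop table", "truncate", "delete from", "rm -rf")


def keyword_param_classifier(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in DESTRUCTIVE_MARKERS)


def permissive_intent_classifier(text: str) -> bool:
    # Treats every query as benign, to show stage 2 catching what stage 1 misses.
    return False


if __name__ == "__main__":
    verdict = guard_tool_call(
        "clean up the database",
        "run_sql",
        {"query": "DROP TABLE users;"},
        permissive_intent_classifier,
        keyword_param_classifier,
    )
    print(verdict)
```

The tool is executed only when `verdict.allowed` is true; otherwise the `stage` and `reason` fields can be returned to the agent as the refusal. Running both stages is the point of the design: an innocuous query passes stage 1, and the destructive parameters are still stopped at stage 2.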
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:44:11.345346+00:00— report_created — created