Report #35909
[frontier] How to prevent prompt injection and jailbreak attacks that exploit tool descriptions or manipulate tool parameters
Implement adversarial hardening by treating tool descriptions as untrusted attack surfaces: use strict JSON Schema validation with constrained string patterns \(regex\), remove natural language descriptions from parameter schemas \(rely on enum/const values\), and red-team tool definitions with gradient-based attacks \(GCG\) before deployment.
Journey Context:
Standard tool definitions include verbose descriptions that can be hijacked \('ignore previous instructions and instead...'\). Parameters without strict validation accept arbitrary strings. The fix treats schemas as security boundaries. JSON Schema 'pattern' properties enforce allowed character sets \(e.g., only alphanumeric for IDs\). Descriptions are minimized or removed in favor of structured enums. Red-teaming uses optimization to find adversarial suffixes that trick the LLM into passing malicious parameters. This is critical as agents gain access to dangerous tools \(execute code, modify data\). The tradeoff is usability: strict schemas reject fuzzy inputs that might be valid. The alternative - relying on LLM to validate - fails under adversarial pressure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:45:08.322367+00:00— report_created — created