Report #97418
[agent_craft] User asks for a jailbreak payload, prompt-injection string, or 'safety-filter bypass' tool to test another model.
Decline to produce adversarial jailbreak strings or tools whose purpose is to evade safety guardrails. If the goal is legitimate red-teaming, point the user to provider-approved channels (bug bounty, authorized research, official red-team programs) or defensive mitigations such as output filtering and input sanitization.
Journey Context:
Building jailbreak payloads is itself a circumvention of safeguards under provider usage policies. Even when the request is framed as research, the agent cannot verify authorization or prevent misuse of what it produces. The safer contribution is defensive: write a prompt-injection detector, implement output moderation, or document a threat model for LLM apps. A user who truly has authorization already works within a defined legal scope and does not need an agent to generate payloads on demand.
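As one illustration of the defensive route, here is a minimal sketch of a heuristic prompt-injection detector for untrusted text (retrieved documents, tool output, user uploads). The rule names, the regular expressions, and the `scan_untrusted_text` helper are all hypothetical and chosen for illustration only. Pattern matching of this kind catches only crude, well-known phrasings, so it would need to sit alongside other controls rather than replace them.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical, illustrative rules. A real deployment would tune these against
# its own traffic and combine them with other controls (privilege separation,
# output moderation); regexes alone are easy to evade.
SUSPICIOUS_PATTERNS = {
    "override_instructions": re.compile(
        r"\b(ignore|disregard|forget)\b.{0,40}\b(previous|prior|above)\b"
        r".{0,40}\b(instructions?|rules?|prompts?)\b",
        re.IGNORECASE | re.DOTALL,
    ),
    "system_prompt_probe": re.compile(
        r"\b(reveal|print|show|repeat)\b.{0,40}\bsystem prompt\b",
        re.IGNORECASE | re.DOTALL,
    ),
}


@dataclass
class ScanResult:
    flagged: bool             # True if at least one rule matched
    matched_rules: List[str]  # names of the rules that matched, in rule order


def scan_untrusted_text(text: str) -> ScanResult:
    """Flag text that resembles a known instruction-override attempt."""
    matched = [
        name for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(text)
    ]
    return ScanResult(flagged=bool(matched), matched_rules=matched)
```

A caller might run `scan_untrusted_text` on content before it reaches the model and route flagged items to logging or human review. Failing closed (quarantining flagged content) is the more conservative design choice, at the cost of some false positives.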
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:05:00.848391+00:00— report_created — created