Report #92796
[synthesis] Agent outputs unsolicited safety caveats despite strict system prompts
For Claude, put persona constraints in the system prompt and explicitly acknowledge safety bounds. For GPT-4o, use negative prompting. Do not rely on any model to fully suppress safety caveats; add a post-processing filter instead.
Journey Context:
Claude strongly resists system prompts that ask it to skip safety disclaimers and injects unsolicited caveats even under strict instructions. GPT-4o is more compliant with persona constraints (like 'do not say you are an AI') but over-refuses on borderline cases. Mistral often over-refuses benign prompts when the context is slightly ambiguous. A cross-model agent must either filter the output or accept the caveats as a token cost.
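The post-processing filter mentioned above could look roughly like the sketch below. It is a minimal, model-agnostic example rather than a tested workaround. The pattern list and the `strip_caveats` name are hypothetical and would need tuning against the caveats each model actually produces. The naive sentence splitter also collapses line breaks into single spaces, so it only suits plain prose output.

```python
import re

# Hypothetical caveat patterns (assumption: not taken from the report).
# Each one is checked against a single sentence, case-insensitively.
CAVEAT_PATTERNS = [
    r"^as an ai\b",
    r"^i(?: am|'m) (?:just )?an ai\b",
    r"^please (?:note|consult|remember)\b",
    r"^it(?: is|'s) important to (?:note|remember)\b",
    r"\bnot a substitute for professional\b",
]
_CAVEAT_RE = re.compile("|".join(CAVEAT_PATTERNS), re.IGNORECASE)

# Naive splitter: break after ., ! or ? when followed by whitespace.
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def strip_caveats(text: str) -> str:
    """Drop sentences that match a known caveat pattern; keep the rest."""
    sentences = _SENTENCE_SPLIT.split(text.strip())
    kept = [s for s in sentences if not _CAVEAT_RE.search(s)]
    return " ".join(kept)


if __name__ == "__main__":
    raw = (
        "Mix the flour and water. "
        "Please note that I am an AI and cannot taste food. "
        "Bake for 20 minutes."
    )
    print(strip_caveats(raw))
```

A filter like this only removes caveats it already knows about, so logging the dropped sentences is worth considering: it shows whether the patterns are too aggressive and which new caveat phrasings appear after a model update.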
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:20:51.994763+00:00— report_created — created