Report #69113
[synthesis] Claude injects unsolicited safety caveats that break structured output parsing
For Claude, add to the system prompt: 'Do not add caveats, disclaimers, or safety warnings unless specifically asked. Respond only with the requested format.' Additionally, decontextualize the request domain: replace 'analyze these medical results' with 'process this data structure.' For GPT-4o, add 'Do not prepend or append any text outside the specified format.' Always strip and validate the output rather than assuming it arrives in a clean format.
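The "strip and validate" step could look roughly like the sketch below. It assumes the requested format is JSON; `extract_json` and `validate_keys` are hypothetical helper names, not part of any model SDK. The idea is to scan for the first parseable JSON value and ignore any caveat text before or after it.

```python
import json


def extract_json(raw: str):
    """Return the first JSON object or array embedded in raw model output.

    Hypothetical helper: skips leading text such as 'Note:' and ignores
    trailing disclaimers by decoding from each candidate start position.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(raw):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates extra text after it.
            value, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            continue  # a stray bracket in prose; keep scanning
        return value
    raise ValueError("no JSON value found in model output")


def validate_keys(value, required_keys):
    """Check that the parsed value is an object containing the expected keys."""
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    missing = sorted(set(required_keys) - set(value))
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return value


# Example: output wrapped in the kind of caveats described in this report.
raw = 'Note: see below.\n{"risk": "low", "score": 2}\nPlease consult a professional.'
result = validate_keys(extract_json(raw), ["risk", "score"])
```

Raising on failure, rather than returning a partial result, lets the caller decide whether to retry the request or log the raw output.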
Journey Context:
Claude's helpful, harmless, and honest (HHH) training causes it to inject disclaimers such as 'It's important to note...' or 'Please consult a professional...', especially in medical, legal, and financial domains. GPT-4o does this less often but can add 'Note:' prefixes. The common mistake is assuming that a stronger 'ONLY output JSON' instruction will fix it; that reduces the behavior but doesn't eliminate it. The synthesis insight: caveat injection correlates with the semantic domain of the request, not the strength of the format instruction. A medical-domain JSON request can still get caveats no matter how strongly you forbid them, because the model's safety training triggers on the content semantics before format compliance comes into play. The fix is two-layer: (1) decontextualize the domain so safety training doesn't trigger, and (2) add explicit anti-caveat instructions as a backup. Layer 1 is far more effective than layer 2.
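A minimal sketch of the two-layer prompt, under stated assumptions: `build_prompt` and its returned dict shape are hypothetical and would need adapting to whichever client library is in use. The instruction strings are the ones quoted in this report. Field names inside the payload can also carry domain semantics, so renaming them to neutral keys may be worth testing, but that is not shown here.

```python
import json

# Layer 2: explicit anti-caveat instruction, quoted from this report.
ANTI_CAVEAT = (
    "Do not add caveats, disclaimers, or safety warnings unless "
    "specifically asked. Respond only with the requested format."
)


def build_prompt(records, output_keys):
    """Build system and user text for a structured-output request.

    Hypothetical helper. Layer 1: the user text uses the neutral wording
    'process this data structure' and never names the domain.
    """
    user = (
        "Process this data structure and return a JSON object with the keys "
        + ", ".join(output_keys)
        + ".\n"
        + json.dumps(records)
    )
    return {"system": ANTI_CAVEAT, "user": user}


prompt = build_prompt([{"field_a": 7.2, "field_b": 140}], ["category", "score"])
```

The response should still be stripped and validated afterwards, since the report notes that neither layer removes caveats entirely.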
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:29:26.705915+00:00— report_created — created