Report #47957
[synthesis] Claude embeds unsolicited safety caveats inside JSON field values, corrupting structured data; GPT-4o is less prone to this when structured outputs are enabled
For Claude, add to the system prompt: 'Output ONLY the requested data in the specified structure. Do not add safety notes, caveats, or disclaimers inside any field value or outside the structure.' For GPT-4o, use structured outputs with a strict schema to constrain field values. Additionally, implement post-processing that detects and strips known caveat patterns from field values. Test both models with edge-case prompts near safety boundaries before production deployment.
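The post-processing step might look roughly like the sketch below. It is a minimal illustration, not a verified fix: the pattern list and the `strip_caveats` helper are hypothetical and cover only leading caveats of the kind quoted in this report, so they would need extending with caveats actually observed in your own outputs.

```python
import re

# Hypothetical patterns for caveats that lead a field value.
# Assumption: caveats look like "Note: ..." or "I'm not a <role>, but ...".
LEADING_CAVEATS = [
    re.compile(r"^\s*(?:note|disclaimer|important)\s*:\s*", re.IGNORECASE),
    re.compile(r"^\s*i(?:'m| am) not an? [a-z ]+?,\s*(?:but\s+)?", re.IGNORECASE),
]


def strip_caveats(value: str) -> tuple[str, bool]:
    """Remove leading caveat phrases; return (cleaned value, whether it changed)."""
    cleaned = value
    changed = True
    # Repeat until no pattern matches, so stacked caveats are all removed.
    while changed:
        changed = False
        for pattern in LEADING_CAVEATS:
            stripped = pattern.sub("", cleaned, count=1)
            if stripped != cleaned:
                cleaned, changed = stripped, True
    return cleaned, cleaned != value


cleaned, was_contaminated = strip_caveats(
    "Note: I'm not a doctor, but you should rest."
)
```

Returning the "changed" flag alongside the cleaned value lets a pipeline log or reject contaminated records instead of silently rewriting them, which may be the safer choice for sensitive topics.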
Journey Context:
When asked to generate structured data about sensitive topics (medical info, financial advice, security procedures), Claude sometimes embeds safety caveats within the JSON field values themselves, e.g., {"recommendation": "Note: I'm not a doctor, but you should..."}. This corrupts the data silently: the JSON is valid, but the values are contaminated. GPT-4o with structured outputs is less prone to this because the schema constrains output, but without structured outputs it exhibits similar behavior. The synthesis: safety caveats are not just preambles or postscripts. They can appear inside structured output field values, and this contamination is model-dependent, topic-dependent, and format-dependent. It is the hardest class of output contamination to detect because it produces valid but semantically corrupted structures.
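To illustrate why a validity check alone misses this, the sketch below parses a contaminated payload successfully and then walks the parsed structure to report which field paths contain caveat-like text. The regex and the `find_contaminated_fields` helper are hypothetical, and the sample payload is made up for illustration; real detection would need patterns tuned to the model, topic, and format in use.

```python
import json
import re

# Hypothetical caveat signatures; an assumption, not an exhaustive list.
CAVEAT_RE = re.compile(
    r"\b(?:i(?:'m| am) not an? \w+"
    r"|this is not (?:medical|financial|legal) advice"
    r"|consult an? \w+)",
    re.IGNORECASE,
)


def find_contaminated_fields(data, path="$"):
    """Recursively collect the paths of string values that match a caveat pattern."""
    hits = []
    if isinstance(data, dict):
        for key, value in data.items():
            hits.extend(find_contaminated_fields(value, f"{path}.{key}"))
    elif isinstance(data, list):
        for index, item in enumerate(data):
            hits.extend(find_contaminated_fields(item, f"{path}[{index}]"))
    elif isinstance(data, str) and CAVEAT_RE.search(data):
        hits.append(path)
    return hits


# Made-up payload: syntactically valid JSON with a contaminated value.
raw = '{"recommendation": "Note: I\'m not a doctor, but you should rest.", "steps": ["Drink water."]}'
parsed = json.loads(raw)  # parsing succeeds, so a validity check passes
contaminated = find_contaminated_fields(parsed)
```

Reporting paths rather than a single boolean makes it easier to see whether contamination clusters in particular fields, such as free-text recommendations.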
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:58:51.049748+00:00— report_created — created