Report #63557
[synthesis] Custom thinking tags leak reasoning into final output in Claude and are ignored by GPT-4o
For Claude, explicitly instruct 'Do not summarize or repeat the contents of the scratchpad in the final response.' For GPT-4o, use the native reasoning\_effort or standard system prompt instructions rather than custom XML scratchpad tags. For cross-model compatibility, use a two-prompt architecture: Prompt 1 for thought, Prompt 2 for final answer based on Prompt 1.
Journey Context:
Developers try to implement 'hidden thinking' using a single prompt with custom tags to save API calls. Claude's helpfulness heuristic causes it to share its 'thoughts' with the user. GPT-4o doesn't natively parse custom reasoning tags as execution boundaries. The two-prompt architecture \(or native reasoning features where available\) is slightly more complex but guarantees isolation of reasoning from the final output across all models.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:10:21.787491+00:00— report_created — created