Report #99892
[synthesis] Same prompt yields different response structures, verbosity, and caveat density across Claude, GPT-4o, and Kimi
Anchor outputs with a forced response_format or tool schema rather than asking for 'concise' or 'detailed' in prose. For Claude, expect more caveats and longer outputs, so set explicit length constraints. For GPT-4o, expect higher compliance with formatting instructions but more sycophancy. For Kimi, verify thinking-mode behavior separately, because K2 Thinking changes response structure via tool-integrated reasoning.
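A minimal sketch of the structural approach, assuming the model has been asked to return a JSON object with `answer` and `caveats` fields. The field names, the limits, and the `validate_reply` helper are hypothetical and not tied to any provider's SDK. The same check is applied to every model's output, whatever its default verbosity.

```python
import json

# Hypothetical contract for illustration: field names and limits are assumptions.
MAX_ANSWER_WORDS = 60
MAX_CAVEATS = 2


def validate_reply(raw: str) -> dict:
    """Parse a model reply and enforce one structure for every model."""
    # json.JSONDecodeError subclasses ValueError, so free-form prose fails here.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    answer = data.get("answer")
    caveats = data.get("caveats", [])

    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("missing or empty 'answer'")
    if not isinstance(caveats, list) or not all(isinstance(c, str) for c in caveats):
        raise ValueError("'caveats' must be a list of strings")

    # Length and caveat density are enforced in code, not requested in prose.
    if len(answer.split()) > MAX_ANSWER_WORDS:
        raise ValueError("answer exceeds word limit")
    if len(caveats) > MAX_CAVEATS:
        raise ValueError("too many caveats")

    return {"answer": answer.strip(), "caveats": caveats}
```

A rejected reply can be retried or routed to review, so downstream code never has to interpret each model's own reading of 'concise'.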
Journey Context:
Developers often prompt 'be concise' and assume all models interpret it the same way. Empirical studies show Claude produces longer, more diverse responses; GPT-4o scores higher on conversational authenticity; and Kimi K2 Thinking uses a distinct reasoning path. The synthesis is that model 'personality' is a real systems-level variable. You cannot reliably prompt it away; you must constrain it structurally. The right design is to use schemas for anything downstream code depends on, and to human-review a sample of each model's prose output.
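The review step could be sketched as follows, assuming prose outputs have already been collected per model in a plain dictionary. The helper names `sample_for_review` and `mean_word_count` are hypothetical, and word count is only a rough proxy for verbosity.

```python
import random


def sample_for_review(
    outputs_by_model: dict[str, list[str]], k: int, seed: int = 0
) -> dict[str, list[str]]:
    """Draw up to k outputs per model, reproducibly, for human review."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    picked = {}
    for model in sorted(outputs_by_model):  # stable order keeps sampling deterministic
        outputs = outputs_by_model[model]
        picked[model] = rng.sample(outputs, min(k, len(outputs)))
    return picked


def mean_word_count(outputs: list[str]) -> float:
    """Rough verbosity proxy: average whitespace-separated words per output."""
    if not outputs:
        return 0.0
    return sum(len(o.split()) for o in outputs) / len(outputs)
```

Sampling each model separately keeps a verbose model from dominating the review queue, and comparing `mean_word_count` across models for the same prompt gives a quick signal of how differently each one reads the same instruction.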
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:14:15.526545+00:00— report_created — created