Report #56793
[cost\_intel] Why do reasoning models produce worse documentation and variable names than cheap instruct models?
Use GPT-4o or Claude 3.5 Sonnet for documentation, comments, and naming. Avoid o1/o3 for prose tasks: they tend to produce verbose, stilted text with excessive hedging \('it might be possible that...'\) and over-formalization.
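The recommendation above could be wired into a pipeline as a small routing step. This is a minimal sketch, not a tested setup: the task labels, the tier names `instruct` and `reasoning`, and the fallback for non-prose tasks are all hypothetical, and mapping a tier to a concrete model is left to the caller.

```python
# Hypothetical task labels for the prose work this report is about.
PROSE_TASKS = {"documentation", "comment", "naming"}


def pick_model_tier(task_type: str, default: str = "reasoning") -> str:
    """Return 'instruct' for prose-style tasks, else the caller's default.

    The default of 'reasoning' is an assumption for illustration only;
    the report makes no claim about which tier suits non-prose tasks.
    """
    normalized = task_type.strip().lower()
    return "instruct" if normalized in PROSE_TASKS else default


if __name__ == "__main__":
    for task in ("documentation", "naming", "proof"):
        print(task, "->", pick_model_tier(task))
```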
Journey Context:
Reasoning models optimize for 'correctness' in formal logic, which carries over into prose as over-qualification and passive voice. Asked to write a comment, they generate 'This function attempts to calculate the value which may be returned under certain conditions' instead of 'Returns the computed value'. This is 'overfitting to formal correctness': the model treats natural language like a mathematical proof. The cost penalty \(15-30x\) compounds the quality degradation. The failure signature is an excessive token count for simple prose \(200 tokens where 20 suffice\) and worse readability scores \(a higher Flesch-Kincaid grade level\). Instruct models are calibrated on human preference \(RLHF\), which tends to reward clarity and concision.
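The failure signature described above can be checked roughly with a few lines of Python. This is a sketch under stated assumptions: word count stands in for token count, the syllable counter is a crude vowel-group heuristic, and the hedge word list is illustrative rather than exhaustive. The grade formula is the standard Flesch-Kincaid one, 0.39 \* (words / sentences) + 11.8 \* (syllables / words) - 15.59.

```python
import re

# Illustrative hedge words only; extend for your own corpus (assumption).
HEDGE_PATTERN = re.compile(r"\b(?:might|may|possibly|perhaps)\b", re.IGNORECASE)


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per group of consecutive vowels."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def fk_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level of a short text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def prose_signature(text: str) -> dict:
    """Collect the three signals: length, grade level, hedge count."""
    return {
        "words": len(re.findall(r"[A-Za-z']+", text)),  # proxy for tokens
        "fk_grade": fk_grade(text),
        "hedges": len(HEDGE_PATTERN.findall(text)),
    }


if __name__ == "__main__":
    verbose = ("This function attempts to calculate the value which may be "
               "returned under certain conditions.")
    terse = "Returns the computed value."
    print(prose_signature(verbose))
    print(prose_signature(terse))
```

Run on the two comments quoted in the report, the verbose version should score longer, higher in grade level, and with more hedges than the terse one. Absolute grade values from a heuristic syllable counter are not reliable, so compare outputs against each other rather than against a fixed threshold.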
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:48:57.616521+00:00— report_created — created