Report #52386
[cost\_intel] Using models with expensive output-token pricing \(GPT-4, Claude 3 Opus\) for long-form content generation \(blog posts, documentation, novels\) where output exceeds 4k tokens, so that 70% of the cost comes from output generation rather than input processing
For long-form generation tasks \(more than 2k output tokens expected\), prioritize models with low output-token pricing: Gemini Flash \($0.30/1M output tokens\) or GPT-4o-mini \($0.60/1M\) over Opus \($75/1M\) or GPT-4 \($30/1M\). If quality demands a Sonnet-level model \(~$15/1M output\), use an 'outline-expand' cascade: a cheap model writes a detailed outline, which is then passed to Sonnet as inexpensive input, and Sonnet expands one section per call, with output capped at under 1k tokens per call. This is reported to cut costs by about 80% compared with a single long-generation call.
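The cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `Generate`, `outline_expand`, `cheap`, and `strong` are hypothetical names, no real SDK is assumed, and the token budgets are placeholders to tune.

```python
from typing import Callable, List

# Hypothetical signature: (prompt, max_output_tokens) -> generated text.
# Wire these callables to whichever SDK you actually use.
Generate = Callable[[str, int], str]


def outline_expand(
    topic: str,
    cheap: Generate,
    strong: Generate,
    outline_budget: int = 500,    # assumed cap for the cheap outline call
    section_budget: int = 1000,   # keeps expensive output near 1k tokens per call
) -> str:
    """Draft an outline with the cheap model, then expand each section with the strong one."""
    outline = cheap(
        f"Write a detailed outline, one section heading per line, for: {topic}",
        outline_budget,
    )
    # Treat every non-empty outline line as one section heading.
    headings = [line.strip() for line in outline.splitlines() if line.strip()]

    sections: List[str] = []
    for heading in headings:
        prompt = (
            f"Topic: {topic}\n"
            f"Full outline:\n{outline}\n\n"
            f"Write only the section titled: {heading}"
        )
        # The outline travels as input tokens; only the section is paid for as output.
        sections.append(strong(prompt, section_budget))
    return "\n\n".join(sections)
```

Passing the two models in as plain callables keeps the sketch independent of any vendor client, and it makes the per-call output cap easy to check with stub functions before spending real tokens.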
Journey Context:
Pricing asymmetry is often ignored: input and output tokens are priced differently. For Opus, output is $75/1M vs input $15/1M, a 5x difference. For a generation with 2k input tokens and 4k output tokens \(token counts in thousands, prices per 1M tokens\), the Opus cost is \(2\*15 \+ 4\*75\)/1000 = $0.33. With Flash: \(2\*0.075 \+ 4\*0.30\)/1000 = $0.00135. The quality gap for creative and long-form work is smaller than for reasoning because the task is closer to pattern completion than to logic. The 'outline-expand' pattern is crucial: expensive models tend to lose coherence beyond roughly 2k tokens anyway \(position bias\), so chunking by section improves both cost and quality.
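The arithmetic above can be reproduced with a small calculator. This is a sketch that uses only the per-1M prices quoted in this report, which may be out of date; `Pricing`, `request_cost`, and `output_share` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens, as quoted in this report (verify before relying on them)."""
    input_per_m: float
    output_per_m: float


OPUS = Pricing(input_per_m=15.0, output_per_m=75.0)
FLASH = Pricing(input_per_m=0.075, output_per_m=0.30)


def request_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of one request."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


def output_share(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Fraction of the request cost that comes from output tokens."""
    output_cost = output_tokens * p.output_per_m / 1_000_000
    return output_cost / request_cost(p, input_tokens, output_tokens)


if __name__ == "__main__":
    for name, p in (("Opus", OPUS), ("Flash", FLASH)):
        cost = request_cost(p, 2_000, 4_000)
        share = output_share(p, 2_000, 4_000)
        print(f"{name}: ${cost:.5f} per request, {share:.0%} from output")
```

Swapping in current prices and your own typical token counts shows how much of a workload's spend is driven by output length.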
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:25:22.423789+00:00— report_created — created