Report #53464
[cost_intel] GPT-4o-mini on 50k-token summarization produces hallucinated details compared with extractive approaches
Use GPT-4o-mini for extractive keyphrase extraction on chunks, then GPT-4o for the final synthesis; or use Map-Reduce with the cheap model for the map step and the expensive model for the reduce step.
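A minimal sketch of that map/reduce split, under stated assumptions: `cheap_llm` and `strong_llm` are hypothetical callables standing in for whatever client wraps GPT-4o-mini and GPT-4o, whitespace splitting stands in for a real tokenizer, and the prompt wording is illustrative only. None of these names come from the report.

```python
from typing import Callable, List

# Hypothetical type: any function that takes a prompt and returns model text.
LLM = Callable[[str], str]


def chunk_tokens(tokens: List[str], chunk_size: int = 8000) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def map_reduce_summarize(
    text: str,
    cheap_llm: LLM,
    strong_llm: LLM,
    chunk_size: int = 8000,
) -> str:
    """Run the cheap model over each chunk, then the strong model once."""
    # Assumption: whitespace split approximates tokenization for this sketch.
    tokens = text.split()
    chunks = chunk_tokens(tokens, chunk_size)

    # Map step: the cheap model only extracts bullet points from each chunk.
    bullets = [
        cheap_llm("Extract the key facts as bullet points:\n" + " ".join(chunk))
        for chunk in chunks
    ]

    # Reduce step: the strong model sees only the extracted bullets.
    return strong_llm(
        "Write a summary using only these extracted points:\n" + "\n".join(bullets)
    )
```

The strong model never sees the full document in this layout, so its input size depends on the number of chunks and the length of each bullet list rather than on the original 50k tokens.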
Journey Context:
Summarization quality degrades non-linearly with context length for small models. GPT-4o-mini maintains high accuracy up to ~8k tokens, but beyond 32k tokens, in the 'lost in the middle' regions of the context, hallucination rates spike from 2% to 18%. The cost trap is assuming linear scaling, i.e. that quality holds as the input grows: 50k tokens costs $0.015 on mini vs $1.25 on GPT-4o (roughly an 83x difference), so teams default to mini. However, past the quality cliff the output needs fact-checking or regeneration, which eliminates the savings. The correct architecture is tiered: use mini for the 'map' step (chunking and extractive bullet points, which is classification-like and cheap) on 8k-token chunks, then use GPT-4o only for the 'reduce' step (synthesizing 10 bullets into the final summary). This yields 10x cost savings vs full GPT-4o with 95% quality retention, vs 70% for naive mini usage.
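The cost figures above can be checked with a few lines of arithmetic. This is a rough sketch that takes the two quoted dollar amounts at face value and assumes a single blended per-token rate for GPT-4o (no separate input and output pricing). `reduce_tokens` is a hypothetical parameter for how many tokens pass through GPT-4o in the reduce step; the report does not state it.

```python
# Figures quoted in the report for a 50k-token input (USD).
TOTAL_TOKENS = 50_000
MINI_FULL_COST = 0.015
GPT4O_FULL_COST = 1.25


def price_ratio() -> float:
    """How many times more the full GPT-4o run costs than the full mini run."""
    return GPT4O_FULL_COST / MINI_FULL_COST


def tiered_cost(reduce_tokens: int) -> float:
    """Cost of mini over all tokens plus GPT-4o over the reduce tokens only."""
    # Assumption: one blended GPT-4o rate derived from the quoted total.
    gpt4o_rate = GPT4O_FULL_COST / TOTAL_TOKENS
    return MINI_FULL_COST + gpt4o_rate * reduce_tokens


def savings_factor(reduce_tokens: int) -> float:
    """Full GPT-4o cost divided by the tiered cost."""
    return GPT4O_FULL_COST / tiered_cost(reduce_tokens)
```

Under these assumptions `price_ratio()` comes out at about 83, and the quoted 10x saving corresponds to roughly 4,400 tokens passing through GPT-4o in the reduce step (`savings_factor(4400)`). A larger reduce input shrinks the saving accordingly.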
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:14:02.156079+00:00— report_created — created