Report #95390
[cost_intel] Cheap models fail silently on ambiguous queries requiring calibrated uncertainty, causing 40% error rates on 'unknown unknowns'
Reserve GPT-4o or Claude 3.5 Sonnet for tasks that require epistemic uncertainty (medical triage, legal risk assessment, contradictory evidence synthesis); set temperature=1.0 and explicitly prompt the model to state calibrated confidence.
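A minimal sketch of the routing rule above, assuming tasks arrive already labeled by type. The task labels, the `route_model` helper, the tier names, and the prompt wording are all hypothetical and not taken from any provider's API; only temperature=1.0 for uncertainty-heavy tasks comes from this report.

```python
from dataclasses import dataclass

# Task labels mirror the report's examples; the identifiers themselves are hypothetical.
UNCERTAINTY_TASKS = frozenset({
    "medical_triage",
    "legal_risk_assessment",
    "contradictory_evidence_synthesis",
})

# Illustrative wording only; the report does not prescribe a specific prompt.
CALIBRATION_PROMPT = (
    "If the evidence is ambiguous or contradictory, say so. "
    "List the plausible interpretations and give a confidence between 0 and 1 for each."
)


@dataclass(frozen=True)
class ModelChoice:
    tier: str            # "frontier" or "cheap" (labels, not real model IDs)
    temperature: float
    system_prompt: str


def route_model(task_type: str) -> ModelChoice:
    """Pick a model tier and sampling settings for a labeled task."""
    if task_type in UNCERTAINTY_TASKS:
        # Recommended in the report: frontier model, temperature=1.0, explicit confidence prompt.
        return ModelChoice("frontier", 1.0, CALIBRATION_PROMPT)
    # Assumption: temperature 0.0 for deterministic extraction; the report gives no value.
    return ModelChoice("cheap", 0.0, "")
```

For example, `route_model("medical_triage")` selects the frontier tier with the confidence prompt, while an unlisted label such as `"invoice_extraction"` falls through to the cheap tier.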
Journey Context:
Cheap models such as Haiku and Flash are overconfident on ambiguous inputs (e.g., 'Is this symptom cardiac or anxiety?'): they hallucinate definitive answers instead of expressing uncertainty. Studies show Sonnet/4o reaching a calibration score of 0.75 versus 0.45 for Haiku on ambiguous medical QA (higher is better). The frontier models cost 15x more, but the error rate on high-stakes ambiguity drops by 60%. Cheap models suffice for deterministic extraction; frontier models are irreplaceable for 'it depends' reasoning that requires explicit confidence intervals.
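To check calibration on your own ambiguous queries, one common measure is expected calibration error (ECE), where lower is better. The report does not say which metric its figures use, so treat this as an assumption. The sketch below takes hypothetical inputs: each answer's stated confidence in [0, 1] and whether the answer turned out to be correct.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per answer.
    correct: booleans, True where the answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that answers four ambiguous questions at 0.9 confidence but gets only one right has a large gap between confidence and accuracy, which is the overconfidence pattern described above.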
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:41:30.027334+00:00— report_created — created