Report #61075
[cost_intel] Switching from GPT-4 to GPT-3.5 for 'simple' summarization causes a 40% silent hallucination rate on implicit negation (e.g., 'The doctor did not recommend surgery')
Implement a 'capability probe' test suite: before downgrading, test each candidate model on 100 task-specific samples that cover implicit negation, temporal reasoning, and pronoun resolution. If accuracy drops by more than 5% in any category, either retain the stronger model or use a 'critic model' pattern, in which the cheap model generates and the expensive model verifies the output in a single call (about 1.2x the cost of the cheap model alone, versus 10x for generating everything with the expensive model).
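A minimal sketch of the capability probe described above, under stated assumptions: the models are passed in as plain callables (prompt in, text out), `Sample`, `accuracy_by_category`, and `safe_to_downgrade` are hypothetical names, the ">5%" threshold is read as 5 percentage points, and the substring check stands in for whatever grading the real task needs.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple

Model = Callable[[str], str]  # assumption: a model is any prompt -> text callable


class Sample(NamedTuple):
    category: str  # e.g. "negation", "temporal", "pronoun"
    prompt: str
    expected: str  # placeholder grading: substring a correct output must contain


def accuracy_by_category(model: Model, samples: List[Sample]) -> Dict[str, float]:
    """Fraction of samples answered correctly, per category."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for s in samples:
        totals[s.category] += 1
        if s.expected.lower() in model(s.prompt).lower():
            hits[s.category] += 1
    return {c: hits[c] / totals[c] for c in totals}


def safe_to_downgrade(
    baseline: Model,
    candidate: Model,
    samples: List[Sample],
    max_drop: float = 0.05,  # assumption: ">5%" means 5 percentage points
) -> Tuple[bool, Dict[str, float]]:
    """Compare per-category accuracy and report categories that regress too far."""
    base = accuracy_by_category(baseline, samples)
    cand = accuracy_by_category(candidate, samples)
    failing = {c: base[c] - cand[c] for c in base if base[c] - cand[c] > max_drop}
    return (not failing), failing
```

Checking each category separately is the point of the probe: a candidate can match the baseline on average while still failing badly on negation alone.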
Journey Context:
Cost savings from model downgrading are immediate (10x cheaper), but quality cliffs are task-specific and the failure modes are silent: hallucinated output looks correct. Generic benchmarks such as MMLU do not capture task-specific cliffs like negation understanding. One alternative is fine-tuning the cheap model, which is expensive and requires data. The critic pattern instead relies on verification being easier than generation: the cheap model does the heavy lifting while the expensive model only validates, cutting costs by 80% vs full expensive generation with 95% accuracy retention.
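The critic pattern could look roughly like the sketch below. It assumes both models are plain prompt-to-text callables; `CriticResult`, `summarize_with_critic`, the prompt wording, and the PASS/FAIL protocol are all hypothetical. The report does not say what to do with a rejected draft, so the sketch only flags it and leaves that decision to the caller.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # assumption: a model is any prompt -> text callable


@dataclass
class CriticResult:
    summary: str
    verified: bool  # True only if the critic explicitly answered PASS


def summarize_with_critic(source: str, cheap_model: Model, critic_model: Model) -> CriticResult:
    """Cheap model drafts the summary; expensive model verifies it in one call."""
    draft = cheap_model(f"Summarize the following text:\n\n{source}")

    # Hypothetical verification prompt: the critic only judges, it never rewrites.
    verdict = critic_model(
        "Check the summary against the source. Answer PASS if every claim is "
        "supported, or FAIL if anything is unsupported or a negation is reversed.\n\n"
        f"Source:\n{source}\n\nSummary:\n{draft}"
    )

    # Anything other than an explicit PASS is treated as a rejection.
    verified = verdict.strip().upper().startswith("PASS")
    return CriticResult(summary=draft, verified=verified)
```

Treating any reply other than an explicit PASS as a failure keeps the silent-failure case from slipping through when the critic's reply is malformed.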
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:59:58.755365+00:00— report_created — created