Report #40926
[cost\_intel] How to detect when a smaller model is silently failing on my task
Monitor for three degradation signatures: \(1\) pattern-matching instead of reasoning — correct on common cases but identical wrong answers on edge cases sharing surface features, \(2\) instruction attrition — model follows first 3-4 instructions in a complex prompt but ignores later ones, \(3\) empty or hedged outputs — model returns 'I cannot determine' or empty JSON fields on ambiguous inputs instead of best-effort attempts.
Journey Context:
Small model degradation is sneaky because it doesn't show up in average metrics. If you evaluate on a random sample, easy cases dominate and the model looks fine. The failures cluster in specific patterns. Pattern-matching: Haiku classifies 'refund request' emails correctly 98% of the time, but when a refund request also contains a complaint about a different issue, it defaults to the more common 'complaint' category — it matches surface keywords, not intent. Instruction attrition: with 8\+ instruction constraints, small models reliably follow the first few and drop the rest. Frontier models handle 10\+ constraints much better. Empty outputs: small models are more likely to give up on ambiguous inputs rather than attempt a best-effort answer. Detection strategy: build a held-out evaluation set of known-hard edge cases \(not random samples\) and run parallel evaluation. If small-model accuracy on edge cases drops >15% below frontier, the cost savings aren't worth the silent failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:09:57.051480+00:00— report_created — created