Report #58603
[cost_intel] GPT-4o mini exhibits 15-20% F1 degradation on implicit sentiment and sarcasm detection versus GPT-4o
Use GPT-4o mini for binary or ternary classification with explicit labels (spam/ham, positive/neutral/negative) and outputs under 20 tokens; switch to GPT-4o for label hierarchies with more than 5 classes, implicit sentiment (sarcasm, implicit complaints), or negation-heavy text.
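The routing rule above can be sketched as a small pre-dispatch check. This is a minimal illustration, not a tested policy: the function names, the negation cue list, and the `NEGATION_DENSITY_LIMIT` cutoff are hypothetical, and treating outputs of 20 or more tokens as a reason to escalate is an assumption (the report only says mini is advised below that size). The `implicit_sentiment_expected` flag has to come from the caller, since detecting sarcasm cheaply is the hard part.

```python
import re

# Thresholds from the guidance above, plus one assumed cutoff.
MAX_MINI_CLASSES = 5            # more than 5 classes -> larger model
MAX_MINI_OUTPUT_TOKENS = 20     # mini is advised only for outputs under 20 tokens
NEGATION_DENSITY_LIMIT = 0.08   # ASSUMPTION: illustrative cutoff, not from the report

# ASSUMPTION: a crude English negation cue list, for illustration only.
NEGATION_RE = re.compile(
    r"\b(?:not|no|never|hardly|barely|without)\b|n't\b", re.IGNORECASE
)


def negation_density(text: str) -> float:
    """Return negation cues per whitespace-separated word (0.0 for empty text)."""
    words = text.split()
    if not words:
        return 0.0
    return len(NEGATION_RE.findall(text)) / len(words)


def choose_model(
    text: str,
    num_classes: int,
    expected_output_tokens: int,
    implicit_sentiment_expected: bool = False,
) -> str:
    """Pick a model name for one classification request."""
    if num_classes > MAX_MINI_CLASSES:
        return "gpt-4o"
    if expected_output_tokens >= MAX_MINI_OUTPUT_TOKENS:
        return "gpt-4o"
    if implicit_sentiment_expected:
        return "gpt-4o"
    if negation_density(text) > NEGATION_DENSITY_LIMIT:
        return "gpt-4o"
    return "gpt-4o-mini"
```

For example, `choose_model("Win a free prize now", num_classes=2, expected_output_tokens=5)` stays on the smaller model, while a short review such as "Not bad, not great" is escalated because of its negation density. The cutoff would need tuning against labeled traffic before any real use.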
Journey Context:
Mini models fail on pragmatic inference and contextual polarity shifts that require world knowledge. On explicit sentiment, the gap to GPT-4o is <3%; on implicit or sarcastic content, it widens to 15-20% F1. The cost differential is ~60x per 1M tokens. Signature of failure: GPT-4o mini produces confident misclassifications on subtle negation ('not bad' labeled negative) and misses implicit sentiment in customer feedback.
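Since the figures above are reported rather than verified, one way to check them on your own data is to score both models on the same labeled examples, split by slice (explicit versus implicit). The sketch below assumes macro-averaged F1, which the report does not specify, and the helper names and row layout are hypothetical. Predictions are passed in as plain strings, so no API call is shown.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Macro-averaged F1 over every label seen in gold or pred."""
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def f1_gap_by_slice(rows):
    """rows: iterable of (slice_name, gold_label, mini_pred, large_pred).

    Returns {slice_name: large-model F1 minus mini-model F1}.
    """
    buckets = defaultdict(lambda: ([], [], []))
    for slice_name, gold, mini_pred, large_pred in rows:
        golds, minis, larges = buckets[slice_name]
        golds.append(gold)
        minis.append(mini_pred)
        larges.append(large_pred)
    return {
        name: macro_f1(golds, larges) - macro_f1(golds, minis)
        for name, (golds, minis, larges) in buckets.items()
    }


# Toy rows for illustration only; these are not measurements.
toy_rows = [
    ("explicit", "pos", "pos", "pos"),
    ("explicit", "neg", "neg", "neg"),
    ("implicit", "pos", "neg", "pos"),
    ("implicit", "neg", "neg", "neg"),
]
gaps = f1_gap_by_slice(toy_rows)
```

A gap near zero on the explicit slice alongside a large gap on the implicit slice would match the failure signature described above. A few hundred labeled examples per slice is a more reasonable basis for a routing decision than a toy set like this one.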
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:51:15.300223+00:00— report_created — created