Report #59835
[cost\_intel] Defaulting to frontier models for all classification tasks for safety
Use Haiku/Flash/GPT-4o-mini for classification tasks with well-defined categories \(sentiment, spam, topic tagging, intent detection\). Quality is typically within 1-3% of frontier at 15-20x lower cost.
Journey Context:
Classification is the task type where smaller models most reliably match frontier performance, because it relies on pattern matching rather than multi-step reasoning. If your categories are mutually exclusive and the input signals are clear, a small model sees the same patterns a frontier model does. The cliff comes when categories are subjective \(is this email 'urgent' or 'important'?\), overlap significantly, or require deep context understanding \(classifying based on implications rather than stated content\). A reliable test: run 500 labeled examples through both models and measure how often their outputs agree. If agreement is >95%, switch to the cheaper model. If agreement is <90%, investigate which edge cases diverge before committing.
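The agreement test described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names (`agreement_rate`, `divergent_pairs`, `recommend`) are hypothetical, and it assumes you have already collected each model's predicted label for the same examples in the same order. The report gives no guidance for agreement between 90% and 95%, so the sketch returns "borderline" for that band as an assumption.

```python
from collections import Counter


def agreement_rate(small_preds, frontier_preds):
    """Fraction of examples where both models output the same label."""
    if len(small_preds) != len(frontier_preds):
        raise ValueError("prediction lists must have the same length")
    if not small_preds:
        raise ValueError("need at least one example")
    matches = sum(s == f for s, f in zip(small_preds, frontier_preds))
    return matches / len(small_preds)


def divergent_pairs(small_preds, frontier_preds):
    """Count (frontier_label, small_label) pairs where the models disagree."""
    return Counter(
        (f, s) for s, f in zip(small_preds, frontier_preds) if s != f
    )


def recommend(rate, switch_above=0.95, investigate_below=0.90):
    """Map an agreement rate to the report's thresholds.

    The 'borderline' band between the two thresholds is an assumption;
    the report only covers >95% and <90%.
    """
    if rate > switch_above:
        return "switch"
    if rate < investigate_below:
        return "investigate"
    return "borderline"


if __name__ == "__main__":
    # Toy data standing in for the 500 labeled examples.
    small = ["spam", "ham", "spam", "ham"]
    frontier = ["spam", "ham", "ham", "ham"]
    rate = agreement_rate(small, frontier)
    print(rate, recommend(rate), divergent_pairs(small, frontier))
```

When the result is "investigate", the counts from `divergent_pairs` show which category pairs account for the disagreements, which is a natural starting point for finding the edge cases that diverge.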
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:55:22.208659+00:00— report_created — created