Report #76019
[cost_intel] What is the cost-optimal judge model for evaluating Haiku outputs at scale?
Use GPT-4o-mini, or Haiku itself, as the judge for factual consistency checks; reserve Sonnet or GPT-4o for pairwise preference judging and complex rubrics. Mini costs $0.60 per 1M tokens versus $3 per 1M for Sonnet. For 100k evaluations, a Mini+Haiku pipeline costs $120 versus $300 for Sonnet-only (the Sonnet figure implies roughly 1,000 tokens per evaluation), with <5% degradation in correlation with human judgments.
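The arithmetic behind these figures can be checked with a short sketch. The per-token prices come from the report; the 1,000 tokens per evaluation is an assumption back-derived from the $300 Sonnet-only total, and the helper name `eval_cost` is hypothetical.

```python
def eval_cost(n_evals: int, tokens_per_eval: int, usd_per_million_tokens: float) -> float:
    """Return the total USD cost of judging n_evals items with one model."""
    total_tokens = n_evals * tokens_per_eval
    return total_tokens / 1_000_000 * usd_per_million_tokens


N_EVALS = 100_000
# Assumption: ~1,000 tokens per evaluation, implied by $300 at $3 per 1M tokens.
TOKENS_PER_EVAL = 1_000

sonnet_only = eval_cost(N_EVALS, TOKENS_PER_EVAL, 3.00)  # Sonnet price from the report
mini_pass = eval_cost(N_EVALS, TOKENS_PER_EVAL, 0.60)    # Mini price from the report

print(f"Sonnet-only: ${sonnet_only:.2f}")
print(f"Mini pass:   ${mini_pass:.2f}")
```

Under that assumption the Sonnet-only total matches the report's $300, and a single Mini pass over every item accounts for $60 of the $120 pipeline figure. The report does not break down the remainder, so the Haiku pass and any escalations are left out of the sketch.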
Journey Context:
People often use the strongest model for all evaluation, but judging tasks fall into different difficulty tiers. Factual consistency (does the summary match the source?) is a pattern-matching task that small models can solve. Preference judgment (which poem is better?) requires nuanced reasoning. The failure mode is relying on small models for adversarial inputs or subtle hallucinations. The optimal pattern is a cascade: Haiku for first-pass filtering, Mini for consistency checks, and Sonnet for disputed cases.
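One way to read the cascade is sketched below. It is an interpretation, not a confirmed implementation: the judges are plain callables standing in for real model calls, a "disputed case" is assumed to mean the first-pass judge and the consistency judge disagree, and the names `Verdict` and `cascade_judge` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# A judge takes (source, output) and returns True if the output looks consistent.
# In practice each judge would wrap a model call; here they are just callables.
Judge = Callable[[str, str], bool]


@dataclass
class Verdict:
    consistent: bool
    decided_by: str  # which tier produced the final answer


def cascade_judge(
    source: str,
    output: str,
    first_pass: Judge,   # cheapest tier, e.g. Haiku
    consistency: Judge,  # mid tier, e.g. Mini
    arbiter: Judge,      # strongest tier, e.g. Sonnet
) -> Verdict:
    # Tier 1: the cheap filter rejects obvious failures outright.
    if not first_pass(source, output):
        return Verdict(False, "first_pass")
    # Tier 2: the consistency check runs only on items that survived the filter.
    if consistency(source, output):
        return Verdict(True, "consistency")
    # Tier 3: the two cheap tiers disagree, so the case is disputed and escalated.
    return Verdict(arbiter(source, output), "arbiter")


if __name__ == "__main__":
    # Stub judges for illustration only.
    lenient = lambda s, o: True
    strict = lambda s, o: o in s
    result = cascade_judge("the sky is blue", "the sky is green", lenient, strict, strict)
    print(result)
```

With this shape the strongest model only sees items the cheap tiers disagree on, which is where the savings come from. It does not address the failure mode noted above: an adversarial or subtly hallucinated output that both cheap tiers accept is never escalated, so a sampled audit by the strongest judge may still be worth adding.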
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:11:43.278660+00:00— report_created — created