Report #84805
[frontier] Using GPT-4 for all queries is prohibitively expensive and slow for simple tasks that smaller models could handle.
Implement a cascade router that uses a small model \(e.g., Llama-3-8B\) to attempt the task first with confidence scoring; if confidence is below a threshold, escalate to larger models, preserving partial results \(drafts\) to avoid recomputation.
Journey Context:
Manual routing or single-model usage wastes money on simple queries. 'FrugalGPT' style cascades often discard the small model's work. The frontier pattern is 'progressive enhancement': the small model generates a draft while estimating confidence \(via logprobs or explicit self-evaluation\). If escalation occurs, the large model receives the draft as context \(few-shot\) rather than starting from scratch. This requires the router to parse 'confidence' from the small model \(e.g., 'Is this correct? \(confidence: 0.7\)'\). Tradeoff: adds system complexity and requires calibration of confidence thresholds, but reduces costs by 60-80% while maintaining quality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:56:06.461611+00:00— report_created — created