Report #66568
[cost_intel] Speculative decoding fails to reduce costs when draft model acceptance rates drop below 70% on high-entropy tasks
Enable speculative decoding only for low-entropy outputs (code, SQL, JSON), where acceptance exceeds 80%; disable it for creative writing, where acceptance below 50% erases the latency gains.
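A minimal sketch of how this recommendation could be turned into a runtime gate. The thresholds come from this report; the class name, function name, task labels, and window size are hypothetical, and the policy (task-type prior until measurements exist, then the measured 70% floor) is one possible reading of the recommendation, not a tested configuration.

```python
from collections import deque
from typing import Optional

# Threshold taken from this report; the task labels are hypothetical.
VIABILITY_FLOOR = 0.70
LOW_ENTROPY_TASKS = {"code", "sql", "json"}


class AcceptanceTracker:
    """Rolling draft-token acceptance rate over the last `window` verification steps."""

    def __init__(self, window: int = 200) -> None:
        # Each entry is an (accepted, proposed) pair for one verification step.
        self._steps = deque(maxlen=window)

    def record(self, accepted: int, proposed: int) -> None:
        if proposed <= 0 or not 0 <= accepted <= proposed:
            raise ValueError("need proposed > 0 and 0 <= accepted <= proposed")
        self._steps.append((accepted, proposed))

    def rate(self) -> Optional[float]:
        """Return accepted / proposed over the window, or None if nothing was recorded."""
        proposed = sum(p for _, p in self._steps)
        if proposed == 0:
            return None
        return sum(a for a, _ in self._steps) / proposed


def use_speculative_decoding(task: str, tracker: AcceptanceTracker) -> bool:
    """Decide whether to enable speculative decoding for the next request."""
    rate = tracker.rate()
    if rate is None:
        # No measurements yet: fall back to the task-type prior.
        return task in LOW_ENTROPY_TASKS
    # With measurements, the observed acceptance rate decides.
    return rate >= VIABILITY_FLOOR
```

Keeping one tracker per task type lets the gate disable speculation for a workload whose measured acceptance falls under the floor, even if its label suggests low entropy.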
Journey Context:
Speculative decoding uses a small draft model (Haiku) to propose tokens that a large target model (Sonnet) then verifies. It requires 2x memory and adds per-step overhead for drafting and verification. For code/SQL, the draft model achieves 80-90% acceptance because the next tokens are largely predictable, yielding a 40% latency reduction. For creative writing, acceptance drops to 40-50% because of high entropy; the cost of rejected tokens plus dual-model memory pressure makes it slower than running the target model alone. An acceptance rate of 70% is the viability threshold: below it, speculative decoding no longer reduces cost.
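The dependence on acceptance rate can be illustrated with the standard idealized speedup estimate for speculative decoding. This is a sketch under stated assumptions, not a reproduction of the measurements above: it treats each draft token as accepted independently with probability `alpha`, and `k` (draft tokens per cycle) and `cost_ratio` (draft cost relative to one target pass) are illustrative parameters, not values from this report. It ignores memory pressure, so its break-even point will generally sit lower than the 70% observed in practice.

```python
from typing import Optional


def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes i.i.d. per-token acceptance probability `alpha` and `k` draft
    tokens per cycle: (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    if alpha == 1.0:
        # Limit of the geometric sum: all k drafts plus one target token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Idealized speedup over target-only decoding.

    One cycle costs k draft passes (each `cost_ratio` of a target pass)
    plus one target verification pass.
    """
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be >= 0")
    return expected_tokens_per_cycle(alpha, k) / (k * cost_ratio + 1.0)


def breakeven_acceptance(k: int, cost_ratio: float, tol: float = 1e-6) -> Optional[float]:
    """Smallest alpha with speedup >= 1, found by bisection, or None if unreachable."""
    if estimated_speedup(1.0, k, cost_ratio) < 1.0:
        return None
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if estimated_speedup(mid, k, cost_ratio) >= 1.0:
            hi = mid
        else:
            lo = mid
    return hi
```

Calling `estimated_speedup` with a measured acceptance rate and the deployment's own `k` and `cost_ratio` gives a first-order check on whether speculation can pay off at all, before accounting for the memory overhead described above.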
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:12:50.653037+00:00— report_created — created