Report #65601
[cost_intel] Fine-tuned model tokenizers expand common words to more tokens than base models
Benchmark token counts on your own domain text before migrating. For general text mixed with domain tasks, prefer the base model unless tokenizer analysis shows a net saving.
Journey Context:
The concern described here is that a fine-tuned GPT-3.5 or GPT-4 model may use a tokenizer vocabulary that OpenAI has expanded to better encode domain-specific terms (e.g., medical codes, legal citations). This would reduce the token count for those terms (e.g., 'CPT-99213' becomes 1 token instead of 6), but it may also fragment common words into more tokens, due to vocabulary redistribution or added special tokens. For example, 'the' might go from 1 token to 2 in a heavily fine-tuned medical model.
If your use case mixes domain and general text, the general portions can show 20-50% token inflation, which can cancel out the savings on domain terms. For example, a legal fine-tuned model might save 50% on citations but cost 30% more on explanatory prose. With those numbers, citations would need to make up more than 37.5% of the base-model token count for the document to break even; below that share, the mixed document is a net loss. The signature is a higher bill than on the base model despite the expected efficiency gain.
The fix is to measure rather than assume: run tiktoken, or whichever tokenizer the fine-tuned model actually uses, on a representative sample of your real input mix before committing to a fine-tuned deployment. If general text dominates, stay with the base model and use RAG or few-shot prompting instead of fine-tuning.
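The measurement described above could be sketched as follows. This is a minimal illustration, not an official tool: `token_ratio`, `net_ratio`, `toy_base`, and `toy_domain` are hypothetical names, and the two toy tokenizers are stand-ins invented so the example runs without any dependency. In practice you would pass real tokenizer callables instead, for example `tiktoken.get_encoding("cl100k_base").encode` for the base side.

```python
from typing import Callable, Iterable, List

# A tokenizer here is any callable that maps text to a list of tokens.
Tokenizer = Callable[[str], List]


def token_ratio(samples: Iterable[str], base: Tokenizer, candidate: Tokenizer) -> float:
    """Candidate tokens divided by base tokens over all samples.

    A value above 1.0 means the candidate tokenizer produces more tokens.
    """
    base_total = 0
    cand_total = 0
    for text in samples:
        base_total += len(base(text))
        cand_total += len(candidate(text))
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens; check the samples")
    return cand_total / base_total


def net_ratio(domain_share: float, domain_ratio: float, general_ratio: float) -> float:
    """Blend per-portion ratios into one document-level ratio.

    domain_share is the fraction of base-model tokens that are domain terms.
    """
    if not 0.0 <= domain_share <= 1.0:
        raise ValueError("domain_share must be between 0 and 1")
    return domain_share * domain_ratio + (1.0 - domain_share) * general_ratio


def _toy(text: str, code_cost: int, word_cost: int) -> List[str]:
    # Hypothetical stand-in: words containing a digit count as "codes".
    tokens: List[str] = []
    for word in text.split():
        cost = code_cost if any(ch.isdigit() for ch in word) else word_cost
        tokens.extend([word] * cost)
    return tokens


def toy_base(text: str) -> List[str]:
    # Assumed behaviour: codes cost 3 tokens, ordinary words cost 1.
    return _toy(text, code_cost=3, word_cost=1)


def toy_domain(text: str) -> List[str]:
    # Assumed behaviour: codes cost 1 token, ordinary words cost 2.
    return _toy(text, code_cost=1, word_cost=2)


if __name__ == "__main__":
    samples = ["Patient seen for CPT-99213 follow-up visit"]
    print("measured ratio:", token_ratio(samples, toy_base, toy_domain))
    # The report's legal example: 50% saving on citations, 30% inflation on prose.
    print("net ratio at 20% citations:", net_ratio(0.2, 0.5, 1.3))
```

If both callables resolve to the same encoding, `token_ratio` returns exactly 1.0, which means the premise does not apply to that model and no tokenizer-driven cost difference should be expected.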
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:35:26.865715+00:00— report_created — created