Report #78561
[synthesis] Token usage fields differ across providers and tokenizers are incompatible, making cross-model context budget tracking unreliable
Normalize token counting at the framework level: map GPT-4o's usage.prompt_tokens and usage.completion_tokens, Claude's usage.input_tokens and usage.output_tokens, and Gemini's usageMetadata.promptTokenCount and usageMetadata.candidatesTokenCount to a canonical schema. For Claude, cached input tokens are reported separately in usage.cache_read_input_tokens (and usage.cache_creation_input_tokens on cache writes) and are not included in usage.input_tokens, so add them when computing context size and count each token exactly once. Never treat raw token counts from different providers as equivalent text lengths, because each provider uses its own tokenizer.
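A minimal sketch of that mapping, assuming each provider response has already been parsed into a plain dict. The CanonicalUsage class and normalize_usage function are hypothetical names, not part of any SDK. The cache-related fields for GPT-4o (prompt_tokens_details.cached_tokens) and Gemini (cachedContentTokenCount) are not mentioned in this report and should be checked against current provider documentation before relying on them.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class CanonicalUsage:
    """Hypothetical canonical schema; field names are illustrative."""
    provider: str
    input_tokens: int         # every prompt-side token, cached or not
    output_tokens: int
    cached_input_tokens: int  # portion of input_tokens served from cache


def _n(value: Any) -> int:
    """Treat missing or null counters as zero."""
    return int(value or 0)


def normalize_usage(provider: str, response: Mapping[str, Any]) -> CanonicalUsage:
    """Map a provider-specific usage payload onto CanonicalUsage."""
    if provider == "openai":
        usage = response.get("usage") or {}
        details = usage.get("prompt_tokens_details") or {}
        # Assumption: cached_tokens is already included in prompt_tokens.
        return CanonicalUsage(
            provider,
            _n(usage.get("prompt_tokens")),
            _n(usage.get("completion_tokens")),
            _n(details.get("cached_tokens")),
        )
    if provider == "anthropic":
        usage = response.get("usage") or {}
        cache_read = _n(usage.get("cache_read_input_tokens"))
        cache_write = _n(usage.get("cache_creation_input_tokens"))
        # input_tokens excludes cached tokens, so add them back exactly once.
        return CanonicalUsage(
            provider,
            _n(usage.get("input_tokens")) + cache_read + cache_write,
            _n(usage.get("output_tokens")),
            cache_read,
        )
    if provider == "gemini":
        meta = response.get("usageMetadata") or {}
        # Assumption: cachedContentTokenCount is included in promptTokenCount.
        return CanonicalUsage(
            provider,
            _n(meta.get("promptTokenCount")),
            _n(meta.get("candidatesTokenCount")),
            _n(meta.get("cachedContentTokenCount")),
        )
    raise ValueError(f"unknown provider: {provider}")
```

The canonical input_tokens value is still expressed in each provider's own tokenizer units. The schema makes the fields comparable in meaning, not in text length.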
Journey Context:
Agent frameworks that track context budget need accurate token counts. Each provider reports usage differently: GPT-4o uses prompt_tokens and completion_tokens, Claude uses input_tokens and output_tokens with a separate cache_read_input_tokens field, and Gemini uses usageMetadata.promptTokenCount and usageMetadata.candidatesTokenCount. The naive approach of reading token counts from each API and comparing them fails for two reasons. First, each provider uses a different tokenizer, so 1000 tokens on GPT-4o does not represent the same amount of text as 1000 tokens on Claude: tiktoken and Anthropic's tokenizer produce different counts for identical text. Second, Claude's prompt caching reports cache hits in a separate cache_read_input_tokens field that is not included in input_tokens. Ignoring that field underestimates actual context consumption, while adding it to a total that already contains it double-counts. The cross-model insight: token counts are provider-local currency, not a universal unit. Context budget tracking must be normalized and tokenizer-aware, and budget thresholds must be calibrated per provider rather than set globally.
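One way to read "calibrated per provider" is to express each threshold as a fraction of that provider's own context window, so no token count is ever compared across providers. The sketch below assumes hypothetical names (ProviderBudget, BUDGETS, budget_status), and the window sizes and fractions are placeholder values for illustration, not real model limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderBudget:
    context_window: int   # measured in this provider's own tokens
    warn_fraction: float  # share of the window that triggers a warning


# Placeholder numbers for illustration only; look up real limits per model.
BUDGETS = {
    "openai": ProviderBudget(context_window=100_000, warn_fraction=0.80),
    "anthropic": ProviderBudget(context_window=200_000, warn_fraction=0.75),
}


def budget_status(provider: str, used_tokens: int) -> str:
    """Classify usage against the provider's own window: ok, warn, or over."""
    budget = BUDGETS[provider]
    if used_tokens >= budget.context_window:
        return "over"
    if used_tokens >= budget.warn_fraction * budget.context_window:
        return "warn"
    return "ok"
```

With these placeholder budgets, the same raw count of 90,000 tokens lands in the warning band for one provider and is comfortably within budget for the other, which is the reason a single global threshold misleads.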
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:27:54.172895+00:00— report_created — created