Report #85280
[cost\_intel] Token bloat from over-verbose XML/JSON schemas can silently inflate costs up to 10x in structured generation
Use compact schema formats \(JSON Schema with 'additionalProperties': false, minimal descriptions\) and constrained generation \(regex/grammar\) to reduce output tokens by a reported 60-80%; avoid XML tags in prompts.
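A minimal sketch of the compact-schema idea, in Python. The invoice fields, both schema constants, and the size helper are hypothetical and made up for illustration. Character count of the minified schema is only a rough proxy for tokens, and it measures the schema sent with the request, not the model's output.

```python
import json

# Hypothetical verbose schema: every field carries a long description.
VERBOSE_SCHEMA = {
    "type": "object",
    "description": "An object representing an invoice extracted from the document text.",
    "properties": {
        "vendor": {
            "type": "string",
            "description": "The full legal name of the vendor that issued the invoice.",
        },
        "total": {
            "type": "number",
            "description": "The total amount due on the invoice, including all taxes.",
        },
    },
}

# Same fields, compact: no descriptions, explicit required list, no extra keys.
COMPACT_SCHEMA = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
    "required": ["vendor", "total"],
    "additionalProperties": False,
}


def serialized_size(schema: dict) -> int:
    """Character count of the minified schema, a rough proxy for token count."""
    return len(json.dumps(schema, separators=(",", ":")))


verbose_chars = serialized_size(VERBOSE_SCHEMA)
compact_chars = serialized_size(COMPACT_SCHEMA)
print(f"verbose: {verbose_chars} chars, compact: {compact_chars} chars")
```

To get real numbers, replace the character count with the tokenizer of the model actually in use.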
Journey Context:
Standard practice: Detailed XML tags and verbose JSON schemas \(describing every field\) are common, but they explode token count.
Example: A 500-token response becomes 3000 tokens with XML metadata, a 6x increase.
Cost impact: At 4M tokens/day and the 10x inflation cited in the title, $20 becomes $200.
Quality paradox: Verbose XML doesn't improve accuracy; constrained decoding \(Outlines, JSON Schema\) forces valid outputs with fewer tokens.
Specific fix: Use 'guided\_json' in vLLM, or the 'json\_schema' or GBNF 'grammar' options in llama.cpp, with compact schemas; strip markdown fences with regex post-processing; use delimiter-based parsing \(\| or ^\) instead of JSON for simple extractions.
Critical: 'additionalProperties': false in JSON Schema reduces token count by preventing the model from emitting extra fields. This holds when the schema is enforced by constrained decoding; in a prompt-only setup it is just a hint.
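The two post-processing steps named in the fix can be sketched as follows, assuming the model returns either a single fenced block or one delimited line. Function names, field names, and sample values are hypothetical. The regex handles only a reply that consists entirely of one fenced block and returns anything else unchanged apart from trimming.

```python
import re

# Build the fence marker programmatically so this example stays easy to embed.
FENCE = "`" * 3

# Matches a reply that is exactly one fenced block with an optional language tag.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fences(text: str) -> str:
    """Return the body of a fenced reply, or the trimmed text if it is not fenced."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_delimited(line: str, fields: list[str], sep: str = "|") -> dict[str, str]:
    """Split one delimited line into named fields, failing loudly on a count mismatch."""
    parts = [part.strip() for part in line.strip().split(sep)]
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}")
    return dict(zip(fields, parts))


fenced_reply = FENCE + "json\n{\"vendor\": \"Acme\"}\n" + FENCE
print(strip_fences(fenced_reply))
print(parse_delimited("Acme | 12.5", ["vendor", "total"]))
```

Delimiter parsing is only safe when field values cannot contain the separator; otherwise JSON with constrained decoding is the more robust choice.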
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:43:53.732054+00:00— report_created — created