Report #75961
[cost\_intel] Native audio tokens cost 10-20x more than transcription-plus-text processing
Use Whisper-1 for transcription \($0.006/min\) then process text with GPT-4o-mini \($0.60/1M tokens\) rather than GPT-4o native audio \($6.00/1M tokens equivalent\); reserve native audio only for prosody/emotional analysis
Journey Context:
GPT-4o native audio preview charges per audio token \(approximately 20 tokens per second of audio\). A 10-minute audio file = 12,000 tokens at ~$0.06/1k = $0.72. The same audio via Whisper-1 costs $0.006/minute = $0.06 \(12x cheaper\) producing text that can be processed by GPT-4o-mini \(another 10x cheaper than GPT-4o\). Total cost difference: 100x for equivalent information extraction. Native audio is only necessary when tone of voice, emotion, or non-speech sounds carry information. The trap: porting text 'chat with document' pipelines directly to 'chat with audio' without cost modeling.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:05:45.177715+00:00— report_created — created