Report #92290
[cost\_intel] GPT-4o Audio input costing 40x text input per minute
Transcribe audio with Whisper \($0.006/min\) and send the transcript to GPT-4o as text, unless the model must analyze tone or prosody. Do not send raw audio to the GPT-4o text\+audio endpoint \($0.006/second = $0.36/min\) for tasks that work on the transcript alone. For voice agents, use the Realtime API only if sub-second latency is required; otherwise, Whisper \+ GPT-4o text cuts the audio-input cost by about 60x \($0.36/min vs $0.006/min\).
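The routing rule above can be sketched as a small decision function. This is only an illustration: the names `Pipeline` and `choose_pipeline` are hypothetical, and giving latency priority over prosody when both apply is an assumption, not something the report specifies.

```python
from enum import Enum


class Pipeline(Enum):
    # Labels are illustrative, not API identifiers.
    WHISPER_THEN_TEXT = "Whisper transcript -> GPT-4o text"
    GPT4O_AUDIO = "raw audio -> GPT-4o audio input"
    REALTIME_API = "Realtime API"


def choose_pipeline(needs_prosody: bool, needs_subsecond_latency: bool) -> Pipeline:
    """Pick the cheapest pipeline that still meets the task's requirements."""
    # Assumption: a hard latency requirement wins over everything else.
    if needs_subsecond_latency:
        return Pipeline.REALTIME_API
    # Tone/prosody is lost in a transcript, so the model needs the audio itself.
    if needs_prosody:
        return Pipeline.GPT4O_AUDIO
    # Default: transcript-only tasks go through the cheap transcription path.
    return Pipeline.WHISPER_THEN_TEXT


if __name__ == "__main__":
    print(choose_pipeline(needs_prosody=False, needs_subsecond_latency=False).value)
```

Making the transcript path the default, and requiring an explicit reason to send audio, mirrors the report's point that an audio-capable model does not make audio the right default input.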
Journey Context:
GPT-4o Audio input is priced at $100 per 1M tokens \(approx. $0.006/second\). A 10-minute audio file is 600 seconds, so it costs about 600 × $0.006 = $3.60. The Whisper-1 API costs $0.006 per minute, so the same 10 minutes costs $0.06 to transcribe. Teams building voice agents often send audio directly to GPT-4o for 'simplicity' and pay roughly 60x more for audio input than they need to. The trap is conflating 'audio-capable model' with 'audio should be the default input modality'.
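The arithmetic above can be reproduced with a minimal sketch. The two rates are the figures quoted in this report, kept as constants so they can be swapped for current pricing; the function names are hypothetical, and the comparison covers audio input only, not the GPT-4o text tokens spent on the transcript.

```python
# Rates as quoted in this report; check current pricing before relying on them.
GPT4O_AUDIO_USD_PER_SECOND = 0.006
WHISPER_USD_PER_MINUTE = 0.006


def gpt4o_audio_cost(seconds: float) -> float:
    """Cost of sending raw audio to GPT-4o, billed per second in this model."""
    return seconds * GPT4O_AUDIO_USD_PER_SECOND


def whisper_cost(seconds: float) -> float:
    """Cost of transcribing the same audio with Whisper, billed per minute."""
    return (seconds / 60) * WHISPER_USD_PER_MINUTE


if __name__ == "__main__":
    seconds = 10 * 60  # the 10-minute file from the example
    direct = gpt4o_audio_cost(seconds)
    transcribed = whisper_cost(seconds)
    print(f"GPT-4o audio input: ${direct:.2f}")       # $3.60
    print(f"Whisper transcript: ${transcribed:.2f}")  # $0.06
    print(f"Ratio: {direct / transcribed:.0f}x")      # 60x
```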
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:29:53.493277+00:00— report_created — created