Report #52379
[cost_intel] Processing embedding or classification requests synchronously one by one, paying about 2x more per token by not using batch pricing
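For non-latency-sensitive embedding generation or classification (e.g., indexing a document corpus, nightly report categorization), use OpenAI's Batch API or Anthropic's Message Batches API (originally released as a beta; it covers message requests only, so it fits classification rather than embeddings). Pricing is typically 50% of standard rates (e.g., GPT-4o input drops from $2.50 to $1.25 per 1M tokens). Tradeoff: results arrive within a completion window of up to 24 hours, and you have to handle completion asynchronously (polling or a notification) instead of reading the response inline. Rule-of-thumb ROI breakeven: >1000 requests/day or >10MB of input text.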
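The savings arithmetic can be sketched as follows. This is a minimal illustration, assuming a flat 50% batch discount and input tokens only; the function name `batch_savings` and its parameters are hypothetical, and the GPT-4o price is the one quoted in this report, so check current provider pricing before relying on it.

```python
def batch_savings(tokens_per_day: int,
                  standard_price_per_million: float,
                  batch_discount: float = 0.5) -> dict:
    """Compare daily cost at standard vs. batch pricing.

    batch_discount is the fraction taken off the standard rate
    (0.5 is the 50% figure from this report, assumed flat here).
    """
    standard_cost = tokens_per_day / 1_000_000 * standard_price_per_million
    batch_cost = standard_cost * (1 - batch_discount)
    return {
        "standard": standard_cost,
        "batch": batch_cost,
        "saved": standard_cost - batch_cost,
    }


# 400M input tokens/day at the quoted $2.50 per 1M tokens
daily = batch_savings(400_000_000, 2.50)
print(daily)
```

At the quoted rate, 400M input tokens per day is the volume at which the standard bill reaches $1000 per day.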
Journey Context:
People treat LLM APIs like realtime databases, sending one request per user action. For background processing (RAG ingestion, content moderation queues, data enrichment), this burns money. Batch APIs exist specifically for this but have friction: you upload a JSONL file, wait up to 24 hours, and download results. The 50% discount is substantial at scale: a $1000 daily bill becomes $500. Critical constraint: you cannot depend on the result in the same user session. Implementation pattern: S3 trigger → Lambda to create the batch → scheduled EventBridge rule that checks for completion → Lambda to process results. The batch provider does not emit EventBridge events itself, so the rule has to poll batch status on a schedule or be fed by a provider notification. Also note: not all models support batching (usually only major ones like GPT-4o, not o1-preview), so check the provider's current model list first.
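The upload step in the pattern above can be sketched as follows, assuming the OpenAI Batch API and the official `openai` Python package. The helper names `build_batch_lines` and `submit_batch`, the `doc-N` ID scheme, and the choice of `text-embedding-3-small` are illustrative assumptions, not part of the original report.

```python
import json
from typing import Iterable


def build_batch_lines(texts: Iterable[str],
                      model: str = "text-embedding-3-small") -> list[str]:
    """Return one JSONL line per text in the Batch API request format."""
    lines = []
    for i, text in enumerate(texts):
        request = {
            # custom_id lets you match each result back to its input later
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines


def submit_batch(jsonl_path: str) -> str:
    """Upload the JSONL file and create a batch; returns the batch ID.

    Requires the `openai` package and an OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI  # imported lazily so the builder works without it

    client = OpenAI()
    with open(jsonl_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
    return batch.id


if __name__ == "__main__":
    lines = build_batch_lines(["first document", "second document"])
    with open("batch_input.jsonl", "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    # batch_id = submit_batch("batch_input.jsonl")  # uncomment to actually submit
```

The completion check would then call `client.batches.retrieve(batch_id)` on a schedule and, once the status is `completed`, download the results via the batch's `output_file_id`. Persist the batch ID somewhere durable, since the process that submits the batch is usually not the one that collects results.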
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:24:36.075173+00:00— report_created — created