Report #35683
[cost_intel] What batch size and concurrency settings maximize throughput per dollar on OpenAI or Anthropic APIs?
Use the Batch API (OpenAI) or Message Batches (Anthropic) only when the job can tolerate results arriving up to 24 hours later and is large (more than about 100k requests). For synchronous calls, standard rate limits favor a concurrency of 500-1000 for GPT-4 class models, though the workable ceiling depends on the account's rate-limit tier; beyond that range, queuing delays increase wall-clock time without improving cost. For Anthropic, request-level batching, also called prompt batching (sending, for example, 10 tasks in one prompt with structured separators), cuts costs by about 50% versus separate calls when total output is under 2k tokens.
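A concurrency cap of the kind described above can be sketched with an `asyncio.Semaphore`. This is a minimal illustration, not a tuned client: `run_bounded` and `fake_api_call` are hypothetical names, the fake call only sleeps, and the cap of 5 in `demo` is a toy value chosen so the example runs quickly. A real deployment would swap in an actual SDK call and add retry handling for rate-limit responses.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency):
    """Run worker(item) for every item with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # Each task waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_api_call(prompt):
    # Hypothetical stand-in for a real API request (assumption, no network I/O).
    await asyncio.sleep(0.01)
    return f"done:{prompt}"


def demo():
    prompts = [f"task-{i}" for i in range(20)]
    return asyncio.run(run_bounded(prompts, fake_api_call, max_concurrency=5))
```

Raising `max_concurrency` only helps until the provider's rate limit becomes the constraint; past that point, extra in-flight requests just wait or get rejected, which matches the queuing behaviour noted above.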
Journey Context:
Teams assume 'batching = cheaper' universally. In reality, OpenAI's Batch API offers a 50% discount but only guarantees results within a 24-hour window, which makes it unusable for interactive flows. For real-time systems, the bottleneck is token generation rate, not request overhead. The real win there is 'prompt batching': packing several independent tasks (for example, 5 classification tasks) into one prompt with clear delimiters. This amortizes the fixed context cost (system prompt, instructions, few-shot examples) across tasks, since that context is sent once instead of once per task. Watch for 'cross-contamination', where the model conflates tasks; guard against it with strong XML separators around each task.
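The prompt-batching idea can be sketched as follows. This is one possible layout under stated assumptions, not a provider-recommended format: the `<task>` / `<answer>` tag names and the helpers `pack_tasks`, `parse_answers`, and `input_tokens` are all hypothetical, and the token counts are illustrative inputs, not measurements. The parser rejects replies with missing, duplicated, or unexpected ids, which is a cheap check against tasks being merged or skipped.

```python
import re


def pack_tasks(instructions, texts):
    """Build one prompt: shared instructions once, each task in its own tag."""
    blocks = [
        f'<task id="{i}">\n{text}\n</task>'
        for i, text in enumerate(texts, start=1)
    ]
    return (
        f"{instructions}\n\n"
        "Treat every <task> independently. Reply with exactly one "
        '<answer id="N">label</answer> per task and nothing else.\n\n'
        + "\n".join(blocks)
    )


ANSWER_RE = re.compile(r'<answer id="(\d+)">(.*?)</answer>', re.DOTALL)


def parse_answers(reply, n_tasks):
    """Return {task_id: answer}; raise if the ids are not exactly 1..n_tasks."""
    found = ANSWER_RE.findall(reply)
    ids = [int(task_id) for task_id, _ in found]
    if sorted(ids) != list(range(1, n_tasks + 1)):
        raise ValueError(f"expected ids 1..{n_tasks}, got {ids}")
    return {int(task_id): text.strip() for task_id, text in found}


def input_tokens(shared, per_task, n_tasks, packed):
    """Input tokens for n_tasks: shared context is paid once when packed."""
    if packed:
        return shared + per_task * n_tasks
    return (shared + per_task) * n_tasks
```

With an assumed 800 tokens of shared context and 50 tokens per task, 5 packed tasks cost 1050 input tokens versus 4250 for separate calls. Output tokens are unchanged, so the overall saving depends on how much of the bill is shared context. Task text that itself contains `</task>` would break this layout and would need escaping first.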
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:22:07.347581+00:00— report_created — created