Report #25139
[synthesis] Cost and latency scaling unpredictably with prompt length
Implement aggressive context window management (e.g., summarization, sliding windows, or RAG retrieval limits) rather than passing full conversation histories, and set hard token limits on both input and output with circuit breakers.
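As an illustration of the sliding-window option, here is a minimal sketch that keeps a leading system message plus the newest messages that fit a token budget. The names `estimate_tokens` and `trim_history` are hypothetical, and the 4-characters-per-token estimate is an assumption standing in for a real tokenizer; swap in your provider's tokenizer for accurate counts.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], max_input_tokens: int) -> List[Message]:
    """Keep a leading system message plus the newest messages that fit the budget."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    budget = max_input_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: List[Message] = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break  # stop at the first message that does not fit
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Dropped messages are simply discarded here. A summarization step could replace them with a short recap instead, at the price of one extra (small) model call.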
Journey Context:
Traditional web requests have roughly constant latency and cost. AI API calls do not: both latency and cost scale linearly (or worse) with token count. A user who has a long conversation or pastes a massive document can cause a single API call to take 30 seconds and cost dollars. This breaks the user experience (unpredictable latency) and the business model (unpredictable cost). Teams often build UIs that just append each turn to a message array and send the whole array on every request. Treat the context window as a scarce resource instead: actively prune or summarize it, and implement circuit breakers that truncate inputs before they hit the API to prevent runaway costs and latency.
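The circuit-breaker idea can be sketched as a guard that runs before every API call. It truncates the prompt to a hard input cap, clamps the requested output length, and rejects the request once a rolling token budget is spent. `TokenCircuitBreaker`, `CircuitOpenError`, and all limits below are hypothetical, and token counts again use an assumed 4-characters-per-token heuristic rather than a real tokenizer.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

CHARS_PER_TOKEN = 4  # rough heuristic (assumption), not a real tokenizer


class CircuitOpenError(RuntimeError):
    """Raised when the rolling token budget is exhausted."""


@dataclass
class TokenCircuitBreaker:
    max_input_tokens: int    # hard cap per request
    max_output_tokens: int   # hard cap per response
    window_budget: int       # total tokens allowed per rolling window
    window_seconds: float = 60.0
    clock: Callable[[], float] = time.monotonic
    _spend: List[Tuple[float, int]] = field(default_factory=list)

    def _spent_in_window(self) -> int:
        # Forget spend that has aged out of the rolling window.
        cutoff = self.clock() - self.window_seconds
        self._spend = [(t, n) for t, n in self._spend if t > cutoff]
        return sum(n for _, n in self._spend)

    def prepare(self, prompt: str, requested_output_tokens: int) -> Tuple[str, int]:
        """Truncate the prompt, clamp the output cap, and reserve worst-case spend."""
        prompt = prompt[: self.max_input_tokens * CHARS_PER_TOKEN]
        input_tokens = max(1, len(prompt) // CHARS_PER_TOKEN)
        output_cap = min(requested_output_tokens, self.max_output_tokens)
        worst_case = input_tokens + output_cap
        if self._spent_in_window() + worst_case > self.window_budget:
            raise CircuitOpenError("token budget exhausted; rejected before the API call")
        self._spend.append((self.clock(), worst_case))
        return prompt, output_cap
```

The caller would pass the returned prompt and output cap to whatever API client it uses. Reserving the worst case up front is a conservative choice: it may reject requests that would have finished under budget, but it bounds spend per window even when several calls are in flight.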
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:35:55.965617+00:00 — report_created — created