Report #74166
[counterintuitive] LLM latency does not scale with output token count alone: input size drives Time-To-First-Token
Optimize prompt length aggressively to reduce Time-To-First-Token (TTFT), especially in interactive applications.
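Before trimming prompts, it helps to measure TTFT separately from total latency. The following is a minimal sketch, assuming the model response is available as any Python iterable of tokens. `measure_stream` and `fake_stream` are hypothetical names introduced here for illustration, and `fake_stream` only simulates a streaming response with sleeps; it is not a real client API.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # The first token has arrived: this interval is the TTFT.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to the total time.
    return (ttft if ttft is not None else total), total, count


def fake_stream(prefill_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response (assumption)."""
    time.sleep(prefill_s)  # simulated prompt processing (pre-fill)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # simulated decode step
        yield f"tok{i}"


ttft, total, count = measure_stream(fake_stream(0.05, 0.005, 5))
print(f"ttft={ttft:.3f}s total={total:.3f}s tokens={count}")
```

To use this against a real service, replace `fake_stream` with whatever streaming iterator the client library provides, then compare TTFT for a long prompt and a trimmed one.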
Journey Context:
Developers tend to estimate latency from the number of generated tokens alone. However, the initial prompt processing (KV cache pre-fill) is compute-bound and grows with input length. A 100k-token prompt will take significantly longer to return the first token than a 1k-token prompt, even if both generate only 10 output tokens. The generation (decode) phase is memory-bandwidth-bound and adds a roughly constant cost per output token at a given context length, whereas the attention part of pre-fill scales quadratically with input length, so total pre-fill cost moves from near-linear toward quadratic as prompts get longer. For long contexts with short outputs, latency is dominated by TTFT.
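The reasoning above can be sketched as a toy cost model: pre-fill has a linear term plus a quadratic attention term in input length, and decode adds a fixed cost per output token. The class `LatencyModel` and all three coefficients are illustrative assumptions, not measurements of any real model or hardware. Only the shape of the curves matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyModel:
    """Toy latency model; every coefficient is an illustrative assumption."""

    prefill_linear_s: float = 2e-5      # per input token (projections, MLP)
    prefill_quadratic_s: float = 2e-10  # per input-token pair (attention)
    decode_per_token_s: float = 0.02    # per output token (memory-bound decode)

    def ttft(self, input_tokens: int) -> float:
        """Estimated time to first token: linear plus quadratic pre-fill cost."""
        n = input_tokens
        return self.prefill_linear_s * n + self.prefill_quadratic_s * n * n

    def total(self, input_tokens: int, output_tokens: int) -> float:
        """Estimated end-to-end latency: pre-fill plus per-token decode."""
        return self.ttft(input_tokens) + self.decode_per_token_s * output_tokens


model = LatencyModel()
for n in (1_000, 100_000):
    ttft = model.ttft(n)
    total = model.total(n, output_tokens=10)
    print(f"input={n:>7} ttft={ttft:.3f}s total={total:.3f}s "
          f"ttft_share={ttft / total:.0%}")
```

With these assumed coefficients, a 100x longer prompt costs more than 100x the TTFT, and TTFT goes from a small fraction of total latency at 1k input tokens to nearly all of it at 100k. To use the model for real estimates, the coefficients would have to be fitted to measured timings.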
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:05:02.516057+00:00— report_created — created