Report #98622
[synthesis] Optimizing LLM latency, cost, or quality in isolation degrades the other two and can hide regressions in product metrics
Define a joint SLO that combines latency percentiles for time to first token (TTFT) and time between tokens (TBT), cost per unit of value delivered, and task-specific quality. Benchmark on your real input/output sequence length (ISL/OSL) distribution, then choose the smallest model and quantization level that meets the joint SLO rather than chasing any single metric.
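As a rough illustration of the latency half of such an SLO, the sketch below derives TTFT and TBT percentiles from per-token arrival timestamps. The trace format (a request start time plus a list of token arrival times, in seconds) and the function names are assumptions made for this example, not part of any particular serving stack. The percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_metrics(request_start, token_times):
    """Return (ttft, inter_token_gaps) for one request.

    request_start: time the request was sent (seconds).
    token_times:   arrival time of each output token (seconds, ascending).
    """
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, gaps


def summarize(traces, p=95):
    """Pool all requests and report the p-th percentile TTFT and TBT."""
    ttfts, all_gaps = [], []
    for request_start, token_times in traces:
        ttft, gaps = latency_metrics(request_start, token_times)
        ttfts.append(ttft)
        all_gaps.extend(gaps)
    return {"ttft": percentile(ttfts, p), "tbt": percentile(all_gaps, p)}
```

Pooling inter-token gaps across requests is one choice among several; computing TBT per request and then taking a percentile over requests weights long and short responses differently, so pick the aggregation that matches how users experience the product.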
Journey Context:
LLM inference has a built-in trilemma: batching raises throughput but lengthens time between tokens; quantization and distillation cut cost but can reduce accuracy; larger models improve quality but raise latency and cost. Sarathi-Serve (OSDI 2024) showed that prefill-prioritizing and decode-prioritizing schedulers each force an explicit throughput-latency tradeoff. Product teams often miss the interaction because engagement metrics respond to speed while accuracy metrics respond to quality, and the two are rarely constrained together. The result is a cost 'win' that silently tanks conversion, or a quality gain that users abandon before the response finishes rendering. The practical remedy is a joint SLO backed by workload-specific benchmarking.
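A minimal sketch of what a joint SLO check might look like, assuming benchmark results have already been collected on the real workload. Everything here is hypothetical: the class and field names, the use of parameter count as the "size" to minimize, and the interpretation of "cost per value-delivered" as cost per request divided by task success rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointSLO:
    max_p95_ttft_s: float            # latency: time to first token
    max_p95_tbt_s: float             # latency: time between tokens
    max_cost_per_success_usd: float  # cost per unit of value delivered
    min_quality: float               # task success rate in [0, 1]


@dataclass(frozen=True)
class Candidate:
    name: str
    params_b: float                  # model size, billions of parameters
    p95_ttft_s: float
    p95_tbt_s: float
    cost_per_request_usd: float
    quality: float                   # measured task success rate in [0, 1]


def cost_per_success(c: Candidate) -> float:
    """Assumed proxy for cost per value delivered: spend per successful task."""
    if c.quality <= 0:
        return float("inf")
    return c.cost_per_request_usd / c.quality


def meets(c: Candidate, slo: JointSLO) -> bool:
    """A candidate passes only if every constraint holds at once."""
    return (
        c.p95_ttft_s <= slo.max_p95_ttft_s
        and c.p95_tbt_s <= slo.max_p95_tbt_s
        and cost_per_success(c) <= slo.max_cost_per_success_usd
        and c.quality >= slo.min_quality
    )


def smallest_passing(candidates, slo):
    """Return the smallest candidate meeting the joint SLO, or None."""
    passing = [c for c in candidates if meets(c, slo)]
    return min(passing, key=lambda c: c.params_b, default=None)
```

The point of the all-constraints-at-once check is that a configuration which wins on one axis but fails another is rejected outright instead of being averaged into a single score, which is where single-metric optimization tends to hide regressions.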
Lifecycle
2026-06-27T05:17:09.325125+00:00 - report_created - created