Report #11836
[tooling] Reducing TTFT (time-to-first-token) for API-like usage with long system prompts
Use the system_prompt parameter in /completion or /chat/completions to keep the system prompt's KV cache across requests; combine it with a persistent slot_id so the static prefix is not reprocessed. Reported TTFT reduction is about 90% for long contexts.
Journey Context:
Users sending API requests to the llama.cpp server with long system prompts (RAG context, few-shot examples) reprocess the entire prompt on every request, which causes a very high TTFT. The server has a 'system_prompt' parameter that pre-fills a slot with a static KV cache for that prefix. By assigning a specific slot_id to a user session and reusing it with the cached system_prompt, subsequent requests process only the new tokens, not the static prefix. In the reported case this cut TTFT for a 70B model with a 4k context from about 10 seconds to under 1 second. Most users are unaware of system_prompt slot caching and suffer unnecessary latency. Parameter names differ between server versions, so check the README of the build in use.
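A minimal sketch of the slot-reuse pattern described above, not a verified client. It assumes a llama.cpp server at http://localhost:8080 whose /completion endpoint accepts "prompt", "n_predict", a slot selector and "cache_prompt". The slot field is taken from this report as "slot_id"; some builds document it as "id_slot", so it is left configurable. The helper names build_payload and complete are hypothetical. The sketch keeps the static prefix at the start of every prompt and pins each session to one slot, so the server can reuse the cached prefix instead of reprocessing it.

```python
import json
import urllib.request


def build_payload(static_prefix, user_turn, slot, slot_field="slot_id", n_predict=128):
    """Build a /completion body with the static prefix first and a pinned slot.

    slot_field is an assumption: the report calls it "slot_id", while some
    server builds document "id_slot". Check your server's README.
    """
    return {
        # The static prefix must be byte-identical across requests,
        # otherwise the cached KV state cannot be matched and reused.
        "prompt": static_prefix + user_turn,
        slot_field: slot,
        # Assumed flag asking the server to reuse the cached common prefix.
        "cache_prompt": True,
        "n_predict": n_predict,
    }


def complete(base_url, payload, timeout=120):
    """POST the payload to <base_url>/completion and return the parsed JSON."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder prefix; in practice this is the long RAG or few-shot context.
    prefix = "You are a support assistant.\n\n"
    for question in ("User: How do I reset my password?\n", "User: And my email?\n"):
        body = build_payload(prefix, question, slot=0)
        result = complete("http://localhost:8080", body)
        print(result.get("content"))
```

The first request to a slot still pays the full prompt-processing cost; the saving applies only to later requests that share the same prefix and land on the same slot.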
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:22:18.901553+00:00— report_created — created