Report #99758
[tooling] High Time-To-First-Token (TTFT) for repeated long prompts in llama-server
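Size the host-memory prompt cache with --cache-ram N (short form -cram, N in MiB), for example --cache-ram 256. The default is 8192 MiB; use --cache-ram 0 to disable it. Pair it with 4 or more slots (--parallel 4, short form -np 4) and a common prefix shared across requests, such as the same system prompt. The report also suggests --system-prompt-file system.txt for this; check llama-server --help for whether your build accepts that flag. The cache stores computed prompt checkpoints in RAM and hot-swaps matching prefixes into GPU context instead of reprocessing them.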
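A minimal sketch of assembling that launch command, assuming the long option names are --model, --cache-ram and --parallel in your build (worth confirming with llama-server --help). The helper name build_server_command and the path model.gguf are hypothetical, and the sketch only accepts the cache sizes this report mentions (0 to disable, or a positive MiB value).

```python
import shlex


def build_server_command(model_path, cache_ram_mib=8192, slots=4,
                         binary="llama-server"):
    """Return an argv list for llama-server with host-memory prompt caching.

    cache_ram_mib: cache budget in MiB; 0 disables the cache.
    slots: number of parallel slots that can share cached prefixes.
    """
    if cache_ram_mib < 0:
        raise ValueError("cache_ram_mib must be 0 (disabled) or positive")
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return [
        binary,
        "--model", model_path,
        "--cache-ram", str(cache_ram_mib),
        "--parallel", str(slots),
    ]


# Hypothetical model path; replace with a real GGUF file.
cmd = build_server_command("model.gguf", cache_ram_mib=256, slots=4)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.Popen, which avoids shell quoting problems if the model path contains spaces.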
Journey Context:
When a request's prompt prefix is not available from a cache, llama-server has to recompute the KV cache for the full prompt, so a large system prompt or a repeated RAG document chunk becomes a fixed TTFT cost on every such request. --cache-ram (-cram) keeps computed prompt checkpoints in host memory and restores them on a prefix match, trading RAM for lower latency on cache hits. It helps most with shared system prompts, multi-user chat, and RAG over a stable document corpus; it is wasted RAM if every request is unique. Combine it with KV cache quantization (--cache-type-k, --cache-type-v) to fit more context into the same GPU memory.
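The RAM-for-latency trade can be illustrated with a toy model. This is a sketch of the general idea only, not llama.cpp's actual matching or eviction logic: the class PrefixCache, the function tokens_to_process, and the bytes_per_token figure are all hypothetical, and real checkpoint sizes depend on the model and KV cache type.

```python
from collections import OrderedDict


def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy byte-budgeted cache of prompt checkpoints with LRU eviction."""

    def __init__(self, budget_bytes, bytes_per_token):
        self.budget_bytes = budget_bytes
        self.bytes_per_token = bytes_per_token  # assumed constant per token
        self.entries = OrderedDict()  # token tuple -> size, oldest first
        self.used_bytes = 0

    def longest_match(self, tokens):
        """Return how many leading tokens the best cached entry covers."""
        best_key, best_len = None, 0
        for key in self.entries:
            n = common_prefix_len(key, tokens)
            if n > best_len:
                best_key, best_len = key, n
        if best_key is not None:
            self.entries.move_to_end(best_key)  # mark as recently used
        return best_len

    def store(self, tokens):
        """Cache a prompt, evicting the oldest entries to stay in budget."""
        key = tuple(tokens)
        size = len(key) * self.bytes_per_token
        if size > self.budget_bytes:
            return False  # too large to cache at all
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        while self.used_bytes + size > self.budget_bytes:
            _, evicted_size = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size
        self.entries[key] = size
        self.used_bytes += size
        return True


def tokens_to_process(cache, prompt):
    """Tokens that still need prompt processing after reusing a prefix."""
    reused = cache.longest_match(prompt)
    cache.store(prompt)
    return len(prompt) - reused


system = list(range(1000))                 # shared 1000-token system prompt
cache = PrefixCache(budget_bytes=10_000, bytes_per_token=4)
print(tokens_to_process(cache, system + [5001, 5002]))        # cold: 1002
print(tokens_to_process(cache, system + [6001, 6002, 6003]))  # hit: 3
print(tokens_to_process(cache, list(range(9000, 9500))))      # unique: 500
```

In this model the second request only processes its 3 new tokens because the 1000-token prefix is reused, while the unrelated third prompt gets no benefit and pushes the oldest entry out of the budget, which mirrors the case where every request is unique.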
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:00:54.426758+00:00 - report_created - created