Report #11826
[tooling] Monitoring KV cache usage and slot occupancy in llama.cpp server multi-user scenarios
Use the GET /slots endpoint to retrieve real-time JSON showing n_ctx, n_tokens per slot, and state (idle/busy); use this to choose the slot count (set with --parallel / -np, not --slots) and to detect context exhaustion before OOM
Journey Context:
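Most users pick the slot count (--parallel / -np) arbitrarily and only discover context exhaustion via crashes or degraded performance. In recent builds, --slots only toggles the monitoring endpoint; it does not set the number of slots. The /slots endpoint exposes the KV cache allocation per slot, showing how many tokens each client is currently consuming against that slot's n_ctx. This is what you need to check whether your total context can support your concurrent user count (slots × avg_tokens < n_ctx, where n_ctx is the total passed with -c; builds that split the context evenly give each slot roughly n_ctx / slots). Without this, you're guessing whether you have headroom to raise --parallel or lower n_ctx, or whether you need to do the opposite.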
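A minimal polling sketch of the check described above, in Python with only the standard library. Several things here are assumptions rather than confirmed API: the server address `http://localhost:8080`, the response being a JSON list of slot objects, and the field names. The report mentions n_ctx, n_tokens, and state; some builds are believed to use `n_past` or `is_processing` instead, so the helpers try more than one key. Inspect your own server's /slots output before relying on any of these names.

```python
import json
import urllib.request


def fetch_slots(base_url="http://localhost:8080"):
    """GET /slots and return the parsed JSON (assumed: a list of slot objects)."""
    with urllib.request.urlopen(f"{base_url}/slots", timeout=5) as resp:
        return json.load(resp)


def slot_tokens(slot):
    """Best-effort token count for one slot. Field names are assumptions."""
    for key in ("n_tokens", "n_past"):
        if key in slot:
            return int(slot[key])
    return 0


def slot_busy(slot):
    """Best-effort busy flag. Both field names are assumptions."""
    if "is_processing" in slot:
        return bool(slot["is_processing"])
    return slot.get("state", 0) not in (0, "idle", None)


def summarize(slots, warn_ratio=0.9):
    """Return per-slot usage and flag slots close to their context limit."""
    rows = []
    for slot in slots:
        n_ctx = int(slot.get("n_ctx", 0))
        used = slot_tokens(slot)
        ratio = used / n_ctx if n_ctx else 0.0
        rows.append({
            "id": slot.get("id"),
            "busy": slot_busy(slot),
            "used": used,
            "n_ctx": n_ctx,
            "ratio": ratio,
            "near_limit": ratio >= warn_ratio,
        })
    return rows


def fits(total_ctx, n_slots, avg_tokens):
    """The report's rule of thumb: slots x avg_tokens < total n_ctx."""
    return n_slots * avg_tokens < total_ctx
```

Calling `summarize(fetch_slots())` on a timer and logging any row with `near_limit` set gives an early warning before a client hits its context cap. `fits(8192, 4, 1500)` applies the report's rule of thumb to a hypothetical 8192-token context shared by 4 slots; the 0.9 warning threshold is an arbitrary choice, not a llama.cpp default.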
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:21:17.736587+00:00— report_created — created