Report #38565
[tooling] llama-server unpredictable latency or OOM after multiple API requests in production
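Monitor the `/slots` endpoint (e.g., `curl http://localhost:8080/slots`) to inspect KV cache usage per slot; depending on the build, the endpoint may need to be enabled at startup. Use `--parallel N` (`-np N`) to cap the number of slots, and therefore concurrent contexts, and `--metrics` to expose a Prometheus-compatible metrics endpoint. Check `n_tokens` in the response (field names vary between versions): a value approaching the slot's context length means the slot is nearly full and is a likely source of the latency or memory pressure. To release a slot's context, erase it through the slot management API (the server README documents `POST /slots/{id_slot}?action=erase`; verify that your build supports it), or set `cache_prompt: false` on specific requests so they do not reuse the slot's previously cached prompt.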
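The `n_tokens` check can be scripted rather than eyeballed. Below is a minimal Python sketch, assuming `/slots` returns a JSON list of slot objects and that each object carries `id`, `n_tokens`, and `n_ctx`. Those field names, the 0.9 threshold, and both helper function names are assumptions for illustration; compare them against what your build actually returns.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port from the curl example in this report


def fetch_slots(base_url=BASE_URL, timeout=5.0):
    """GET /slots and return the decoded JSON (assumed to be a list of slot objects)."""
    with urllib.request.urlopen(f"{base_url}/slots", timeout=timeout) as resp:
        return json.load(resp)


def nearly_full_slots(slots, threshold=0.9):
    """Return (slot_id, fill_ratio) for slots at or above threshold * n_ctx.

    Assumed field names: "id", "n_tokens", "n_ctx". Slots that lack a
    usable token count or context size are skipped rather than guessed at.
    """
    flagged = []
    for slot in slots:
        n_tokens = slot.get("n_tokens")
        n_ctx = slot.get("n_ctx")
        if not n_tokens or not n_ctx:
            continue
        ratio = n_tokens / n_ctx
        if ratio >= threshold:
            flagged.append((slot.get("id"), round(ratio, 2)))
    return flagged
```

Against a running server, `nearly_full_slots(fetch_slots())` returns the slots worth clearing first. Keeping the HTTP call separate from the check makes the check easy to run on a saved `/slots` response.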
Journey Context:
llama-server uses a slot system to multiplex the KV cache across concurrent requests, with each slot holding one conversation context. Common mistakes: assuming the server is stateless (slots retain the previous context), not realizing that a long conversation keeps occupying VRAM until its slot is cleared, and not monitoring slot usage for fragmentation. The `/slots` endpoint reveals which slots are busy (`state: processing`), how many tokens each holds (`n_tokens`), and each slot's ID. This is essential for production load balancing and for debugging 'slow' requests that are actually waiting for a free slot. Restarting the server also frees the slots, but it is a crude alternative that loses all state.
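For the load-balancing case, one possible approach is to route each new request to the backend with the most idle slots. The sketch below is hypothetical: `is_busy` and `pick_backend` are illustrative names, and the busy check accepts the `state: processing` form described here plus two assumed alternatives (a boolean `is_processing` and an integer `state`) that other builds might report.

```python
def is_busy(slot):
    """Best-effort busy check across the shapes a slot object might take.

    Assumptions: a string state of "processing" (as in this report), a
    boolean "is_processing", or an integer "state" where non-zero means busy.
    A slot with none of these fields is treated as idle.
    """
    if "is_processing" in slot:
        return bool(slot["is_processing"])
    state = slot.get("state")
    if isinstance(state, str):
        return state.lower() == "processing"
    return bool(state)


def pick_backend(slots_by_backend):
    """Return the backend name with the most idle slots, or None if all are busy.

    slots_by_backend maps a backend name to the slot list returned by
    that backend's /slots endpoint.
    """
    best_name, best_idle = None, 0
    for name, slots in slots_by_backend.items():
        idle = sum(1 for s in slots if not is_busy(s))
        if idle > best_idle:
            best_name, best_idle = name, idle
    return best_name
```

A `None` result means every slot on every backend is busy, which is the situation where a request would queue and appear 'slow'.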
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:12:19.947547+00:00 - report_created - created