Report #6715
[tooling] Optimizing llama.cpp server parameters (batch size, threads, GPU layers) requires manual trial-and-error without visibility into bottlenecks
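Start `llama-server` with `--metrics` to expose a Prometheus-compatible endpoint at `/metrics` (port 8080 by default). It reports prompt-processing throughput (`llamacpp:prompt_tokens_seconds`), generation throughput (`llamacpp:predicted_tokens_seconds`), and queue depth (`llamacpp:requests_processing`, `llamacpp:requests_deferred`), so `-ngl` (GPU layers), `-cb` (continuous batching), and `-np` (parallel slots) can be tuned from measurements instead of guesses. Metric names may differ between llama.cpp versions, so check the endpoint output of your build.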
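A minimal Python sketch of reading that endpoint, assuming the server runs locally on the default port and emits plain `name value` lines without labels that contain spaces. The helper names `parse_metrics` and `fetch_metrics` are illustrative and not part of llama.cpp.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into a {metric_name: float} dict.

    Assumption: each sample line is "name value" (optionally followed by a
    timestamp), and label values contain no whitespace.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            # Ignore lines whose value is not numeric.
            continue
    return metrics


def fetch_metrics(url="http://localhost:8080/metrics", timeout=5.0):
    """Fetch and parse the metrics endpoint (the URL is an assumed default)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics()` after each restart with a different `-ngl` or `-np` value gives one dictionary per configuration, which can then be compared directly.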
Journey Context:
Users typically tune local LLM inference by guessing thread counts or GPU layer splits and restarting the server repeatedly to compare speeds. This is slow and often misses the real bottleneck (for example, CPU-GPU transfer bandwidth rather than compute). The `--metrics` flag exposes Prometheus-compatible metrics, including prompt and generation throughput, token counters, and counts of requests being processed or deferred. By scraping these with Prometheus or curl, you can determine empirically whether adding more GPU layers actually helps (returns diminish once memory bandwidth saturates) or whether continuous batching across parallel slots is adding scheduling overhead. This turns tuning from guesswork into measurement.
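The diminishing-returns check described above can be sketched as follows. This is an illustration under stated assumptions: you have recorded one generation-throughput reading per `-ngl` setting, each setting appears once, and the function names and the `min_gain` threshold are hypothetical choices, not llama.cpp features.

```python
def marginal_gains(samples):
    """Return (ngl, tokens/s gained per added layer) for consecutive settings.

    samples: list of (ngl, tokens_per_second) tuples with distinct ngl values.
    """
    ordered = sorted(samples)
    gains = []
    for (prev_ngl, prev_tps), (ngl, tps) in zip(ordered, ordered[1:]):
        gains.append((ngl, (tps - prev_tps) / (ngl - prev_ngl)))
    return gains


def first_plateau(samples, min_gain=0.1):
    """Return the first ngl whose per-layer gain falls below min_gain.

    Returns None if every step still improves by at least min_gain.
    The threshold is an arbitrary illustrative value; pick one that
    matters for your workload.
    """
    for ngl, gain in marginal_gains(samples):
        if gain < min_gain:
            return ngl
    return None
```

If `first_plateau` returns a layer count, offloading beyond the previous setting bought little throughput in your measurements, which suggests the bottleneck lies elsewhere.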
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:45:46.427335+00:00— report_created — created