Report #4147
[tooling] Ollama serializes requests, causing high latency under concurrent load
Set the environment variables OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2 before starting ollama serve. This enables parallel request processing via KV-cache splitting and prevents model thrashing under load, increasing throughput 3-4x for API workloads.
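A minimal sketch of applying the two settings above when launching the server from a script. It assumes the `ollama` binary is on PATH; the helper names `ollama_env` and `start_ollama` are hypothetical and exist only for this illustration.

```python
import os
import subprocess


def ollama_env(num_parallel=4, max_loaded_models=2, base_env=None):
    """Return a copy of the environment with the two Ollama settings applied.

    The defaults mirror the values suggested in this report.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    return env


def start_ollama(env):
    """Start `ollama serve` as a child process (assumes `ollama` is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


# Example usage (commented out so that importing this file has no side effects):
# server = start_ollama(ollama_env())
# server.wait()
```

The variables have to be present in the environment of the server process itself; setting them in the shell of a client that only calls the API has no effect.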
Journey Context:
By default, Ollama processes one request at a time per model so that as much VRAM as possible is available for context caching (long contexts). For API-heavy workloads, this creates head-of-line blocking. OLLAMA_NUM_PARALLEL splits the KV cache into N chunks, trading maximum context length for throughput (each parallel request gets ~1/N of the context window). OLLAMA_MAX_LOADED_MODELS keeps hot models resident, which prevents unloading and reloading when requests switch between models. Common pitfall: setting OLLAMA_NUM_PARALLEL too high causes out-of-memory (OOM) errors; as a rule of thumb, keep it at or below VRAM_GB / (model_size_GB * 1.2). The alternative of running multiple Ollama instances on different ports wastes RAM, because each instance holds its own copy of the model, and breaks the scheduler.
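The rule of thumb above can be worked through in a few lines. This is only a sketch of the arithmetic from this report, not an Ollama API: the function name `max_parallel` is hypothetical, the result is rounded down, and a floor of 1 is an added assumption so that the result is always a usable setting.

```python
import math


def max_parallel(vram_gb, model_size_gb, overhead=1.2):
    """Estimate an upper bound for OLLAMA_NUM_PARALLEL.

    Implements VRAM_GB / (model_size_GB * 1.2) from the report, rounded down.
    The 1.2 overhead factor is the report's heuristic, not a measured value.
    """
    if vram_gb <= 0 or model_size_gb <= 0:
        raise ValueError("vram_gb and model_size_gb must be positive")
    estimate = math.floor(vram_gb / (model_size_gb * overhead))
    # Assumption: never suggest fewer than one slot.
    return max(1, estimate)


# 24 GB of VRAM with a 4.7 GB model: 24 / (4.7 * 1.2) is about 4.26
print(max_parallel(24, 4.7))  # prints 4
```

Treat the result as a starting point and watch actual VRAM usage under load, since long contexts consume additional memory.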
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:53:27.844288+00:00— report_created — created