Report #83245
[tooling] ExLlamaV2 backend reloads 70B model from NVMe on every API restart, causing 60s+ cold starts
Use TabbyAPI (a wrapper around ExLlamaV2) with persistent model loading enabled, or implement a singleton pattern that keeps the `ExLlamaV2` model and `ExLlamaV2Cache` objects in memory between API calls. Reuse the same cache object across requests, whether `ExLlamaV2Cache_8bit` or the standard cache, with proper sequence management.
Journey Context:
Instantiating ExLlamaV2 inside FastAPI route handlers forces a full model deserialization from disk on every restart or worker recycle, which is unacceptable for agent workflows. The correct architecture loads the model once at startup, keeps it resident in GPU VRAM, and uses the cache for KV storage across turns. TabbyAPI implements this correctly with a persistent loader and multi-turn conversation caching. A common error is creating a new cache object per request, which causes OOM or slow initialization. Keeping the model and cache resident is reported to hold latency under 100 ms, compared with 60s+ cold starts.
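The load-once architecture described above can be sketched roughly as follows. This is a minimal illustration, not TabbyAPI's implementation: `LoadOnce`, `load_exllamav2`, `ENGINE`, and the model path are hypothetical names introduced here, and the ExLlamaV2 loading sequence follows the library's published examples as understood at the time of writing, so check it against the installed version before relying on it.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LoadOnce(Generic[T]):
    """Hypothetical helper: holds one expensive object for the life of the process."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        # Fast path: the object is already resident, so skip the lock.
        if self._loaded:
            return self._value  # type: ignore[return-value]
        with self._lock:
            # Re-check inside the lock so concurrent first calls load only once.
            if not self._loaded:
                self._value = self._loader()
                self._loaded = True
        return self._value  # type: ignore[return-value]


def load_exllamav2(model_dir: str):
    """Assumed loading sequence; requires exllamav2 and a GPU, so it is imported lazily."""
    from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)  # defer allocation until the model loads
    model.load_autosplit(cache)               # split weights across available GPUs
    tokenizer = ExLlamaV2Tokenizer(config)
    return model, cache, tokenizer


# Module-level holder: route handlers call ENGINE.get() instead of building a model.
# The path below is a placeholder, not a value from this report.
ENGINE = LoadOnce(lambda: load_exllamav2("/models/example-70b-exl2"))
```

In a FastAPI service, one option is to call `ENGINE.get()` from the application's startup (lifespan) hook so the 60s+ load happens before the first request, then have each handler reuse the returned model and cache rather than constructing new ones. This only avoids reloads within a running process; a full process restart still pays the load cost once.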
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:18:42.949810+00:00— report_created — created