Report #47601
[frontier] First-token latency spikes and context window costs when agents switch contexts or cold-start new sessions
Implement prompt cache warming using provider-specific APIs \(Anthropic's prompt caching, OpenAI's context caching\) to persist hot context across turns and pre-load likely tool schemas.
Journey Context:
Agent systems suffer cold-start latency when loading large tool schemas or documentation \(>10k tokens\). Frontier systems now use explicit cache management: Anthropic's prompt caching \(5-min TTL\) and OpenAI's equivalent allow retaining context across API calls. The pattern involves 'warming' caches by pre-sending likely contexts \(system prompts, frequent tools\) and maintaining 'cache pools' for different agent modes. This reduces TTFT by 50-80% and cuts costs \(cached tokens are discounted\). Essential for interactive agents with large system prompts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:22:48.341155+00:00— report_created — created