Report #29583
[cost_intel] Not using prompt caching for repeated system prompts and few-shot prefixes
Structure your prompts so that requests to the same cache namespace share an identical static prefix (system prompt + few-shot examples). After the first call, this saves ~90% of the input token cost on the cached prefix and reduces latency by roughly 2x on cache hits.
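The savings figure can be sanity-checked with simple arithmetic. The sketch below is a minimal estimate, assuming the ~90% discount applies only to the static prefix, that the first call pays full price, and that there is no extra charge for writing the cache (some providers add one). The function names and the token-equivalent unit are hypothetical, chosen for illustration.

```python
def estimate_input_cost(prefix_tokens, variable_tokens, calls, cached_discount=0.9):
    """Return (uncached_cost, cached_cost) in token-equivalents.

    Assumptions (not taken from any provider's price list):
    - the first call pays full price for every input token
    - later calls pay (1 - cached_discount) on the static prefix only
    - variable tokens are never discounted
    - no surcharge for writing the cache
    """
    per_call_full = prefix_tokens + variable_tokens
    uncached = calls * per_call_full
    per_call_hit = prefix_tokens * (1 - cached_discount) + variable_tokens
    cached = per_call_full + (calls - 1) * per_call_hit
    return uncached, cached


def savings_fraction(prefix_tokens, variable_tokens, calls, cached_discount=0.9):
    """Fraction of total input cost saved by caching the prefix."""
    uncached, cached = estimate_input_cost(
        prefix_tokens, variable_tokens, calls, cached_discount
    )
    return 1 - cached / uncached


if __name__ == "__main__":
    print(f"{savings_fraction(2000, 200, 100):.0%}")
```

With a 2,000-token prefix, 200 variable tokens per request, and 100 calls, this estimate comes out to about 81% overall, below the 90% headline because the variable tokens and the first call are billed in full. The longer the prefix is relative to the variable part, the closer the overall saving gets to the per-token discount.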
Journey Context:
Prompt caching works by matching a static prefix of your prompt against a previously computed key-value cache. The critical constraint is that the prefix must be byte-identical across calls: any change, even in whitespace, causes a cache miss. The ROI is highest for: (1) long system prompts (>1K tokens), (2) few-shot examples that don't change between calls, (3) high-frequency request patterns to the same model. The cache has a 5-minute TTL that is extended on each hit, so sustained traffic keeps it warm. A common mistake is putting variable content (like user messages) before the static prefix, which prevents caching entirely. Always order the prompt with static content first and variable content last.
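The ordering rule above can be sketched as follows. This is a minimal illustration, not any provider's API: the prompt text, the `User:`/`Assistant:` layout, and all function names are hypothetical. The SHA-256 fingerprint is only a local check that the prefix has not drifted between requests, not how a provider matches cache entries.

```python
import hashlib

# Hypothetical static content; in practice this would be a long system prompt.
SYSTEM_PROMPT = "You are a support assistant. Answer briefly and politely."
FEW_SHOT_EXAMPLES = [
    ("How do I change my email?", "Go to Settings, then Account, then Email."),
    ("Can I export my data?", "Yes, use the Export button on the Data page."),
]


def build_static_prefix() -> str:
    """Assemble the static part deterministically: same input, same bytes."""
    parts = [SYSTEM_PROMPT]
    for question, answer in FEW_SHOT_EXAMPLES:
        parts.append(f"User: {question}\nAssistant: {answer}")
    return "\n\n".join(parts)


def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix bytes so accidental changes (even whitespace) show up."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def build_prompt(user_message: str) -> tuple[str, str]:
    """Return (prompt, prefix fingerprint) with static content first."""
    prefix = build_static_prefix()
    prompt = prefix + "\n\nUser: " + user_message  # variable content goes last
    return prompt, prefix_fingerprint(prefix)


if __name__ == "__main__":
    _, fp_a = build_prompt("How do I reset my password?")
    _, fp_b = build_prompt("Where is my invoice?")
    print(fp_a == fp_b)  # same prefix for both requests
```

Logging the fingerprint alongside each request is one cheap way to notice when a template edit, a timestamp, or a stray space has changed the prefix and started causing cache misses.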
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:02:47.775454+00:00 - report_created - created