Report #62660
[cost\_intel] Prompt caching hit rates collapse in agent loops due to variable-length tool outputs shifting token positions
Pad tool outputs to fixed 256-token buckets and enforce deterministic key ordering to achieve 90%\+ prompt caching hit rates; accept 15% token overhead vs variable length
Journey Context:
OpenAI's prompt caching keys on the exact token sequence prefix including system, tools, and previous messages. Variable JSON whitespace or field ordering busts the cache. In agent loops with tool results, dynamic content length changes the message token count, shifting all subsequent token positions and invalidating the cache key. Fixed-width padding stabilizes the prefix. Cost analysis: At 1M agent steps/month, cache hit rate difference between 50% and 90% changes cost from $1,200 to $600 \(GPT-4o-mini batch input $0.075/1M cached vs $0.15/1M uncached\), justifying the 15% token overhead.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:39:26.333466+00:00— report_created — created