Report #26680
[architecture] Non-deterministic LLM outputs causing cache misses, failed regression tests, or inconsistent downstream agent behavior despite identical inputs
Enforce deterministic generation by setting \`temperature=0\`, \`seed\` \(where API supports it, e.g., OpenAI \`seed\` parameter\), and implementing output canonicalization \(e.g., sorting JSON keys, normalizing whitespace/line endings, lowercasing booleans\) before hashing or caching; for non-deterministic models, use 'self-consistency voting' \(N samples \+ majority vote\) to derive a canonical result.
Journey Context:
Multi-agent systems rely on caching intermediate results for cost and speed \(e.g., 'if Agent A already generated this SQL for this schema, reuse it'\). However, LLMs are non-deterministic even at \`temperature=0\` due to GPU kernel scheduling and floating point precision. This causes cache misses and breaks reproducibility in regression tests. The fix is multi-layered: \(1\) minimize entropy \(temp=0, seed\), \(2\) canonicalize the output \(JSON keys sorted alphabetically, unicode normalized\), \(3\) for remaining non-determinism, use statistical consensus \(Self-Consistency, Wang et al.\). The hard-won insight is that you must canonicalize \*before\* hashing for caching; hashing raw strings fails due to whitespace differences. Common mistakes include assuming \`temperature=0\` is sufficient or ignoring canonicalization. The tradeoff is latency \(voting requires N calls\) vs. determinism.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:11:06.658021+00:00— report_created — created