Report #24455
[synthesis] Small stochastic variations in early agent steps compound into wildly different outcomes with no identifiable cause
For decision-critical steps (tool selection, plan formulation), set temperature to 0. For creative steps (code generation, explanation), allow controlled variance. Implement deterministic replay: log the full prompt and parameters at each step so you can reproduce the exact chain. Track outcome variance across identical inputs: if the same task produces very different results across runs, the agent is too sensitive to early-step randomness.
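A minimal sketch of per-step temperature control with a replay log, under stated assumptions: the step-kind names, the 0.7 default for generative steps, the `ReplayLog` class, and the `call_model` callable are all hypothetical and stand in for whatever client and step taxonomy the agent actually uses.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical step kinds, taken from the examples named in this report.
DECISION_STEPS = {"tool_selection", "plan_formulation"}
GENERATIVE_STEPS = {"code_generation", "explanation"}


def temperature_for(step_kind: str, generative_temperature: float = 0.7) -> float:
    """Return 0.0 for decision-critical steps and a bounded value for generative ones."""
    if step_kind in DECISION_STEPS:
        return 0.0
    if step_kind in GENERATIVE_STEPS:
        return generative_temperature
    # Assumption: unknown step kinds fall back to the deterministic setting.
    return 0.0


@dataclass
class ReplayLog:
    """Collects the full prompt, parameters, and output of every step."""
    records: list = field(default_factory=list)

    def record(self, step_index: int, step_kind: str, prompt: str,
               params: dict, output: str) -> dict:
        entry = {
            "step": step_index,
            "kind": step_kind,
            "prompt": prompt,
            "params": params,
            "output": output,
            # The hash makes it cheap to compare prompts across two runs.
            "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        }
        self.records.append(entry)
        return entry

    def dump(self) -> str:
        """Serialize as JSON Lines, one step per line."""
        return "\n".join(json.dumps(r, sort_keys=True) for r in self.records)


def run_step(log: ReplayLog, step_index: int, step_kind: str,
             prompt: str, call_model) -> str:
    """Call the model with the step's temperature and log everything needed to replay it."""
    params = {"temperature": temperature_for(step_kind)}
    output = call_model(prompt, **params)  # call_model is a hypothetical client wrapper
    log.record(step_index, step_kind, prompt, params, output)
    return output
```

Diffing two dumped logs line by line shows the first step at which the prompts or outputs differ, which is where to look when a good run and a bad run share the same input.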
Journey Context:
A single LLM call with temperature > 0 has some variance. In an agent chain with 10+ steps, that variance compounds non-linearly: a slightly different tool choice at step 2 leads to a completely different context at step 5, which leads to a different strategy at step 8. Two identical inputs produce radically different outputs. Teams debug these as if there's a deterministic cause ('what changed between the good run and the bad run?') and find nothing, because the cause is accumulated randomness. The naive fix, setting temperature to 0 everywhere, reduces quality for generative tasks. The right approach is per-step temperature control: deterministic for decision points, stochastic for generation. The deeper insight is that outcome variance is itself a monitoring signal. If variance spikes, the agent has become path-dependent, meaning early decisions are under-constrained and need better prompting or more context.
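One way to turn outcome variance into a monitoring signal is to run the same input several times and compare the resulting step sequences. The sketch below is an illustration only: representing a run as a list of step labels, using `difflib.SequenceMatcher` as the similarity measure, and the function names are all assumptions, and any alert threshold would need tuning per task.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_divergence(trajectories):
    """Average of (1 - similarity ratio) over all pairs of runs; 0.0 means identical."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def first_divergence_step(trajectories):
    """Index of the earliest step where the runs disagree, or None if they never do."""
    longest = max((len(t) for t in trajectories), default=0)
    for i in range(longest):
        # A run that already ended counts as None at this index.
        choices = {t[i] if i < len(t) else None for t in trajectories}
        if len(choices) > 1:
            return i
    return None


# Two runs of the same task, recorded as the tool chosen at each step.
runs = [
    ["search", "read", "answer"],
    ["search", "browse", "summarize", "answer"],
]
divergence = mean_pairwise_divergence(runs)
earliest = first_divergence_step(runs)
```

A rising divergence score across repeated runs suggests path dependence, and the earliest diverging step points at the decision that is under-constrained.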
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:27:31.162252+00:00— report_created — created