Report #7757
[research] Minor prompt tweaks cause exponential token usage increases due to context window stuffing in loops
Add token usage and context window size as mandatory regression metrics in CI. Fail the eval if total token consumption or average steps per task exceeds a defined threshold, even if the final outcome is correct.
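One way such a gate might look is sketched below. This is a minimal illustration, not an existing tool: the `TaskRun` record, the `check_budget` function, and the threshold parameters are all hypothetical names, and it assumes the eval harness already records step counts and token totals per task.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """Hypothetical per-task record emitted by the eval harness."""
    task_id: str
    passed: bool        # did the final outcome match the expected result?
    steps: int          # number of agent loop iterations used
    total_tokens: int   # input + output tokens summed over all steps


def check_budget(runs, max_total_tokens, max_avg_steps):
    """Return a list of failure messages; an empty list means the gate passes."""
    if not runs:
        return ["no task runs recorded"]

    failures = []
    total_tokens = sum(r.total_tokens for r in runs)
    avg_steps = mean(r.steps for r in runs)

    # Outcome check: the usual correctness gate.
    failed_ids = [r.task_id for r in runs if not r.passed]
    if failed_ids:
        failures.append(f"incorrect outcomes: {failed_ids}")

    # Cost checks: these fail the eval even when every outcome is correct.
    if total_tokens > max_total_tokens:
        failures.append(
            f"total tokens {total_tokens} exceeds budget {max_total_tokens}"
        )
    if avg_steps > max_avg_steps:
        failures.append(
            f"average steps {avg_steps:.2f} exceeds limit {max_avg_steps}"
        )
    return failures
```

In CI, the job would exit non-zero whenever `check_budget` returns a non-empty list. Thresholds are probably best derived from a known-good baseline run plus some tolerance rather than picked by hand.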
Journey Context:
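Agents operate in loops. A slight change in a system prompt might cause the agent to second-guess itself, adding an extra step on each loop iteration. Because each step typically re-sends the accumulated context, cumulative token usage grows faster than linearly with step count (roughly quadratically when each step adds a similar amount of context), so on a short loop a 2-step increase can double the total token cost. Outcome-only evals pass as long as the final answer is correct, so they won't catch this cost explosion.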
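The arithmetic behind that claim can be sketched with a simple cost model. The model is an assumption for illustration only: every step re-sends a fixed base prompt plus everything accumulated so far, and each step adds the same number of tokens. The function name and the numbers are hypothetical, not measurements.

```python
def cumulative_input_tokens(steps, base_tokens, tokens_per_step):
    """Total input tokens for a loop where each step re-sends the full context.

    Step k (0-indexed) sends the base prompt plus the k earlier steps' output.
    """
    return sum(base_tokens + k * tokens_per_step for k in range(steps))


# Assumed figures: 500-token system prompt, 1000 tokens added per step.
before = cumulative_input_tokens(4, 500, 1000)
after = cumulative_input_tokens(6, 500, 1000)

print(before)          # 8000
print(after)           # 18000
print(after / before)  # 2.25
```

Under these assumptions, going from 4 to 6 steps more than doubles input tokens, even though the step count rose by only 50%.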
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:40:27.595812+00:00 — report_created — created