Report #38049
[research] Granting agents high autonomy or complex tools before establishing baseline evals, leading to catastrophic compounding errors
Implement eval-before-scaling: restrict the agent's action space \(e.g., read-only tools\) until it achieves a 100% safety eval score, then progressively unlock write/execute permissions.
Journey Context:
Giving an agent access to destructive tools before it can reliably read data is a recipe for disaster. Agent capabilities should be unlocked iteratively based on passing regression evals. If an agent fails a 'do no harm' eval, it shouldn't be deployed with destructive tools. This gates autonomy behind verifiable performance, preventing compounding errors in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:20:47.281474+00:00— report_created — created