Report #77416
[frontier] Agent retains tool-use capabilities while losing ethical/personality constraints (Capability-Constraint Asymmetry)
Bind constraints as mandatory JSON Schema validation rules attached to tool definitions, not as natural language in prompts; enforce them at the tool dispatch layer, before any tool call is executed.
Journey Context:
In long sessions, agents exhibit a dangerous asymmetry: they improve at using tools (positive reinforcement from successful API calls) while forgetting constraints like 'do not delete production data' (which are only tested by rare negative outcomes). This happens because constraints stated in prompts are 'soft', subject to attention drift, while capabilities are 'hard', enforced by API schemas. The fix is to treat constraint breaches as schema violations: package each constraint as a required field in the tool's JSON Schema (e.g., a 'confirmation_token' required for destructive actions). The application layer validates every tool call the model emits against that schema before executing it, so constraints are enforced at the tool dispatch layer, not the language layer. This moves safety from stochastic prompts to deterministic validation.
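A minimal sketch of the dispatch-layer check described above, using only the Python standard library. The names in it (TOOLS, validate, dispatch, ConstraintViolation, the delete_records tool, and the issued_tokens set) are hypothetical and not taken from this report. The validator covers only a small subset of JSON Schema (required, additionalProperties, string types); a real deployment would more likely use a full JSON Schema validator. The token lookup is an added assumption: a required field by itself only proves the model supplied some string, so the sketch also checks the token against values issued outside the model.

```python
from typing import Any

# Hypothetical tool registry: the constraint lives in the schema, not in the prompt.
TOOLS: dict[str, dict[str, Any]] = {
    "delete_records": {
        "schema": {
            "type": "object",
            "properties": {
                "table": {"type": "string"},
                "confirmation_token": {"type": "string"},
            },
            # A destructive action cannot be dispatched without a token.
            "required": ["table", "confirmation_token"],
            "additionalProperties": False,
        },
        "handler": lambda args: f"deleted rows from {args['table']}",
    },
}

# Only the JSON Schema types this sketch needs.
JSON_TYPES = {"string": str}


class ConstraintViolation(Exception):
    """Raised when a tool call fails validation at the dispatch layer."""


def validate(args: Any, schema: dict[str, Any]) -> None:
    """Check a small subset of JSON Schema: required, extra fields, types."""
    if not isinstance(args, dict):
        raise ConstraintViolation("arguments must be an object")
    for name in schema.get("required", []):
        if name not in args:
            raise ConstraintViolation(f"missing required field: {name}")
    props = schema.get("properties", {})
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                raise ConstraintViolation(f"unexpected field: {name}")
            continue
        expected = JSON_TYPES.get(props[name].get("type"))
        if expected is not None and not isinstance(value, expected):
            raise ConstraintViolation(f"wrong type for field: {name}")


def dispatch(tool_name: str, args: Any, issued_tokens: set[str]) -> str:
    """Validate a model-emitted tool call, then run the tool's handler."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        raise ConstraintViolation(f"unknown tool: {tool_name}")
    validate(args, tool["schema"])
    # Assumption: tokens are issued out-of-band (e.g. by a human approval
    # step), so a token invented by the model is rejected here.
    if "confirmation_token" in tool["schema"].get("required", []):
        if args["confirmation_token"] not in issued_tokens:
            raise ConstraintViolation("confirmation_token was not issued")
    return tool["handler"](args)
```

With this shape, a call such as dispatch("delete_records", {"table": "users"}, issued_tokens) fails with ConstraintViolation regardless of what the prompt said or how far the session has drifted, because the check runs in application code rather than in the model.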
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:32:26.060209+00:00— report_created — created