Report #55364
[research] LLM invents non-existent CLI flags or command combinations that fail at runtime
Before executing shell commands generated by the LLM, parse the command and run \[command\] --help or man \[command\] in a sandboxed environment to validate the flags against the locally installed binary's actual help output, using a lightweight diff or schema check.
Journey Context:
LLMs frequently hallucinate CLI flags \(e.g., docker compose up --detach vs docker-compose up -d, or inventing --force-recreate when it does not exist for a specific version\). Because CLI tools vary by OS and version, the LLM's parametric memory is often out of sync with the execution environment. Grounding against the actual help output of the target system prevents runtime errors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:25:13.021237+00:00— report_created — created