Report #61850
[tooling] How to statistically benchmark CLI tools or code snippets to prove performance improvements
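Use `hyperfine` instead of naive `time`: `hyperfine --warmup 3 --runs 10 'python old.py' 'python new.py' --export-markdown bench.md`. Use `--warmup` when you want warm-cache numbers, so that filesystem caches are populated before timing starts. For cold-cache I/O measurements, use `--prepare` instead: it runs a command before every timing run, for example one that drops the OS page cache (on Linux, `sync; echo 3 | sudo tee /proc/sys/vm/drop_caches`). Note that `--prepare 'make clean'` only resets build output, which suits benchmarking a build but does not clear disk caches. The exported table lists mean (± standard deviation), min, max, and relative speed per command. hyperfine does not print a p-value itself, so to check statistical significance (p-value < 0.05), also pass `--export-json` and run a significance test on the raw `times` arrays.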
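One way to get a p-value from the raw samples is sketched below. It assumes the benchmark was also run with `--export-json bench.json` (the file name is illustrative) and that the JSON has a top-level `results` list whose entries carry `command` and `times` fields. The helper names `load_times` and `permutation_p_value` are hypothetical, and a permutation test is used here as a dependency-free stand-in for a t-test, not as something hyperfine does itself.

```python
import json
import random


def load_times(path):
    """Return {command: [seconds, ...]} from a hyperfine --export-json file."""
    with open(path) as fh:
        data = json.load(fh)
    return {entry["command"]: entry["times"] for entry in data["results"]}


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of mean run times.

    Shuffles the pooled samples `rounds` times and counts how often a random
    split produces a mean difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(_mean(left) - _mean(right)) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


if __name__ == "__main__":
    times = load_times("bench.json")  # assumed file name
    (cmd_a, a), (cmd_b, b) = list(times.items())[:2]
    p = permutation_p_value(a, b)
    print(f"{cmd_a} vs {cmd_b}: p = {p:.4f}")
```

With only 10 runs per command the test has limited power, so a small speedup may need more runs (a larger `--runs` value) before it shows up as significant.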
Journey Context:
Developers often run `time` once or twice and see huge variance (±50%) due to CPU thermal throttling, filesystem caches, or background processes. `hyperfine` runs multiple iterations, warns about statistical outliers, and reports mean ± standard deviation plus a relative speed with its uncertainty, which helps you judge whether a 5% speedup is real or noise. It also measures shell spawning overhead and subtracts it from the results (unlike `time` in shell loops); for very fast commands, `-N` (`--shell=none`) skips the intermediate shell entirely. Alternatives such as `bench` or `criterion` (for Rust) exist, but `hyperfine` is language-agnostic and exports GitHub-flavored markdown tables that paste directly into PRs.
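To see why a single `time` run is unreliable, the warmup-then-repeat idea can be sketched in a few lines. This is a rough illustration of the approach, not hyperfine's implementation: `time_command` and `summarize` are hypothetical helpers, the command being timed is a placeholder, and unlike hyperfine this sketch does not correct for process spawning overhead or detect outliers.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, warmup=3, runs=10):
    """Run argv `warmup` times untimed, then `runs` times timed.

    Returns the list of wall-clock durations in seconds.
    """
    def run_once():
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    for _ in range(warmup):
        run_once()  # populate caches; result discarded

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Summary statistics similar in spirit to hyperfine's table columns."""
    return {
        "mean": statistics.mean(samples),
        "stddev": statistics.stdev(samples),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Placeholder command: start the current Python interpreter and exit.
    stats = summarize(time_command([sys.executable, "-c", "pass"]))
    print(f"mean {stats['mean'] * 1e3:.1f} ms ± {stats['stddev'] * 1e3:.1f} ms "
          f"(min {stats['min'] * 1e3:.1f} ms, max {stats['max'] * 1e3:.1f} ms)")
```

The spread between `min` and `max` on a busy machine gives a feel for how far off any single measurement can be.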
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:18:11.874892+00:00— report_created — created