Report #7462
[tooling] Benchmarking shell commands with 'time' produces noisy single-run results subject to cold-start cache effects
Run 'hyperfine --warmup 3 --runs 10 "sleep 0.1" "sleep 0.2"' to compare commands with statistical analysis. Add '--export-markdown results.md' to produce a table for CI reports, or '--parameter-scan num_threads 1 8' to benchmark scaling across thread counts; the benchmarked command must then reference the parameter as '{num_threads}'.
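The comparison above can be sketched as a small script. It assumes hyperfine is installed and on PATH; the two 'sleep' commands are placeholders for the real commands under test, and 'results.md' is an arbitrary output name.

```shell
# Compare two commands: 3 untimed warmup runs, then 10 timed runs each.
# Assumption: hyperfine is on PATH; 'sleep' stands in for real workloads.
if command -v hyperfine >/dev/null 2>&1; then
  hyperfine --warmup 3 --runs 10 \
    --export-markdown results.md \
    'sleep 0.1' 'sleep 0.2'
else
  echo "hyperfine not found; install it before benchmarking" >&2
fi
```

The Markdown table written to 'results.md' can be attached to a CI job summary or pasted into a pull request.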
Journey Context:
The shell 'time' builtin runs a command only once, so its result is subject to cold-start cache effects, CPU throttling, and random noise. Developers often rerun commands by hand and eyeball the average, which makes outliers easy to miss. hyperfine runs each command repeatedly and reports the mean, standard deviation, minimum, and maximum. It flags statistical outliers using modified Z-scores, reports the relative speed of the compared commands with an uncertainty estimate so that near-identical results are visible, and supports parameterized benchmarks (e.g., varying thread count). It measures shell startup time and subtracts it from the results (unlike naive bash loops), can run warmup iterations to reduce cold-start bias, and can export to Markdown/JSON/CSV for CI integration. Unlike 'bench' or 'time', it warns when outliers or very short runtimes make a measurement unreliable, and it supports preparation commands ('--prepare') that run before each timing run without counting toward it.
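A minimal sketch of a parameterized benchmark with a preparation command, again assuming hyperfine is on PATH. The parameter name 'delay', the 'sync' preparation step, and the 'scan.json' file name are illustrative choices, not part of the original report.

```shell
# Scan a parameter from 1 to 3; hyperfine substitutes {delay} in the command,
# so this benchmarks 'sleep 0.01', 'sleep 0.02', and 'sleep 0.03'.
# --prepare runs before every timing run and is excluded from the measurement.
# Assumption: 'sync' is only a stand-in for a real preparation step.
if command -v hyperfine >/dev/null 2>&1; then
  hyperfine --warmup 2 --runs 5 \
    --prepare 'sync' \
    --parameter-scan delay 1 3 \
    --export-json scan.json \
    'sleep 0.0{delay}'
else
  echo "hyperfine not found; install it before benchmarking" >&2
fi
```

For a thread-scaling benchmark, the same pattern applies with '--parameter-scan num_threads 1 8' and a command that accepts '{num_threads}' as an argument.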
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:46:01.224193+00:00— report_created — created