Report #70129
[tooling] Manual llama-bench output parsing causes flaky CI detection of performance regressions in llama.cpp PRs
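Run `llama-bench -o json > results.json` and parse `results.json` instead of scraping the default text table. In recent llama.cpp builds, `-o` selects the output format (`csv`, `json`, `jsonl`, `md`, `sql`) rather than a file name, so the JSON is written to stdout and has to be redirected. `--progress` is a boolean flag that is off by default, so leaving it out keeps progress messages out of stderr and the CI logs. Each entry in the JSON array describes one test, with fields such as `model_filename`, `n_prompt`, `n_gen`, `avg_ts` (average tokens/sec, the `t/s` column of the text table), and `avg_ns` (average time in nanoseconds); confirm the exact names against the output of your own build. Assert that `avg_ts` is within 5% of a baseline stored in git. Parsing JSON avoids the regex fragility of the text output, including varying locale number formats.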
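A minimal sketch of the baseline check described above, in Python. It assumes the JSON is an array of objects carrying `model_filename`, `n_prompt`, `n_gen`, and `avg_ts`; those field names should be checked against your build's output. The helper names (`index_by_test`, `find_regressions`, `load`) and the file name `baseline.json` are hypothetical. The sketch flags slowdowns only; a run that is faster than baseline passes.

```python
import json

TOLERANCE = 0.05  # 5% allowed slowdown, as proposed in this report


def index_by_test(entries):
    """Map (model file, prompt tokens, generated tokens) -> average tokens/sec.

    Field names are assumed; verify them against your llama-bench JSON.
    """
    return {
        (e["model_filename"], e["n_prompt"], e["n_gen"]): e["avg_ts"]
        for e in entries
    }


def find_regressions(baseline, current, tolerance=TOLERANCE):
    """Return (key, baseline_ts, current_ts) for each test slower than allowed."""
    base = index_by_test(baseline)
    cur = index_by_test(current)
    regressions = []
    for key, base_ts in base.items():
        cur_ts = cur.get(key)
        if cur_ts is None:
            continue  # test absent from this run; handle separately if needed
        if cur_ts < base_ts * (1.0 - tolerance):
            regressions.append((key, base_ts, cur_ts))
    return regressions


def load(path):
    """Read one llama-bench JSON result file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

In a CI step this could be called as `find_regressions(load("baseline.json"), load("results.json"))`, failing the job when the returned list is non-empty.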
Journey Context:
Developers run llama-bench manually before merging, but CI lacks automated perf gates. The default output is a human-readable text table that can carry ANSI colors and locale-dependent number formatting (commas vs. periods), which breaks naive parsing. The JSON output is documented but underused. Key insight: the JSON reports throughput (`avg_ts`, shown as `t/s` in the text table) separately for each test type (pp = prompt processing, tg = token generation), allowing granular regression detection (e.g., only prompt processing slowed down due to a KV-cache change). Without this, teams only catch regressions after release, when users complain about speed.
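To illustrate the granular detection mentioned above, here is a rough Python sketch that groups results by test type and reports the worst relative change for each. It assumes each JSON entry has `n_prompt`, `n_gen`, and `avg_ts`, with prompt-processing tests having `n_gen == 0` and token-generation tests having `n_prompt == 0`; the function names `kind_of` and `worst_change_by_kind` are hypothetical.

```python
def kind_of(entry):
    """Label an entry 'pp' (prompt processing) or 'tg' (token generation).

    Assumes pp runs have n_gen == 0 and tg runs have n_prompt == 0.
    """
    if entry["n_prompt"] > 0 and entry["n_gen"] == 0:
        return "pp"
    if entry["n_gen"] > 0 and entry["n_prompt"] == 0:
        return "tg"
    return "pp+tg"  # combined prompt + generation runs


def worst_change_by_kind(baseline, current):
    """Return {kind: worst fractional change in avg_ts}; negative means slower."""
    cur = {(e["n_prompt"], e["n_gen"]): e["avg_ts"] for e in current}
    worst = {}
    for entry in baseline:
        key = (entry["n_prompt"], entry["n_gen"])
        if key not in cur:
            continue  # no matching test in the current run
        change = (cur[key] - entry["avg_ts"]) / entry["avg_ts"]
        kind = kind_of(entry)
        worst[kind] = min(change, worst.get(kind, change))
    return worst
```

A result such as `pp` near -0.10 with `tg` near 0.0 would point at prompt processing alone, which is the kind of signal a single aggregate number hides.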
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:18:01.198432+00:00— report_created — created