Claude Code daily benchmarks for degradation tracking

Last updated: Jan 29, 2026

The goal of this tracker is to detect statistically significant degradations in the performance of Claude Code with Opus 4.5 on software engineering (SWE) tasks.

  • Updated daily: Benchmarks run every day on a curated subset of SWE-Bench-Pro.
  • Detect degradation: Statistical tests flag drops in pass rate relative to a historical baseline.
  • What you see is what you get: We benchmark directly in the Claude Code CLI with the SOTA model (currently Opus 4.5), with no custom harnesses.

Summary

Degradation Status

Shows whether any time period has a statistically significant drop in pass rate (p < 0.05).

Degradation detected over past 30 days
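As a rough illustration of how a status like this could be derived from the figures on this page, the sketch below runs a two-sided one-proportion z-test of each window's pass rate against the 58% baseline. The choice of test, the two-sided alternative, and the function name two_sided_p_value are assumptions; the page does not state which test the tracker actually uses.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(rate: float, n: int, baseline: float) -> float:
    """Two-sided p-value for an observed pass rate vs. a fixed baseline.

    Uses the normal approximation to the binomial (an assumed method,
    not confirmed by the tracker).
    """
    standard_error = sqrt(baseline * (1 - baseline) / n)
    z = (rate - baseline) / standard_error
    return 2 * NormalDist().cdf(-abs(z))


BASELINE = 0.58
# (label, pass rate, evaluations) as reported in the summary cards.
WINDOWS = [("daily", 0.50, 50), ("7-day", 0.53, 250), ("30-day", 0.54, 655)]

for label, rate, n in WINDOWS:
    p = two_sided_p_value(rate, n, BASELINE)
    flag = "degradation" if rate < BASELINE and p < 0.05 else "no significant change"
    print(f"{label}: p = {p:.3f} -> {flag}")
```

Under these assumptions only the 30-day window falls below p = 0.05, which agrees with the status shown here.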

Baseline Pass Rate

Historical average pass rate used as reference for detecting performance changes.

58%

reference rate

Daily Pass Rate

Percentage of benchmark tasks passed in the most recent day's evaluations.

50%

50 evaluations

7-day Pass Rate

Aggregate pass rate over the last 7 days. Provides a more stable measure than daily results.

53%

250 evaluations

30-day Pass Rate

Aggregate pass rate over the last 30 days. Best measure of overall sustained performance.

54%

655 evaluations
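The 7-day and 30-day figures are described as aggregate pass rates. One plausible reading, sketched below, is that passes and evaluations are pooled across the window before dividing, so days with more evaluations carry more weight. The record layout, the function name pooled_pass_rate, and the sample numbers are all made up for illustration and are not tracker data.

```python
from typing import List, Tuple

# Hypothetical record per day: (tasks passed, evaluations run).
DailyRecord = Tuple[int, int]


def pooled_pass_rate(days: List[DailyRecord]) -> float:
    """Sum passes and evaluations over the window, then divide once."""
    total_passed = sum(passed for passed, _ in days)
    total_run = sum(run for _, run in days)
    if total_run == 0:
        raise ValueError("window contains no evaluations")
    return total_passed / total_run


# Made-up data: two full days and one partial day.
sample_days = [(25, 50), (30, 50), (10, 20)]

pooled = pooled_pass_rate(sample_days)
mean_of_daily = sum(p / n for p, n in sample_days) / len(sample_days)
print(f"pooled: {pooled:.4f}, mean of daily rates: {mean_of_daily:.4f}")
```

The two numbers differ whenever the daily evaluation counts differ, which matters here because the 30-day window reports 655 evaluations rather than 30 full days of 50.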

Daily Trend

Pass rate over time

Daily benchmark pass rates over the past 30 days.

Pass Rate

Daily benchmark pass rate showing the percentage of tasks solved each day.

Baseline

Historical average pass rate (58%) used as reference for detecting performance changes.

Threshold

Shaded region around baseline (±14.0%). Changes within this band are not statistically significant (p ≥ 0.05).

95% confidence interval for each data point. Wider intervals indicate more uncertainty (fewer samples).

Dashed line at 58% baseline with ±14.0% significance threshold
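To see why the significance band narrows as more evaluations are pooled, the sketch below computes the half-width of a two-sided band around the baseline using the normal approximation. This is an assumed method and significance_band is a hypothetical name. For the 58% baseline it gives roughly 13.7 percentage points at 50 evaluations and 6.1 at 250, which is close to but not identical to the ±14.0% and ±5.6% shown on this page, so the tracker presumably uses a different test or different sample counts.

```python
from math import sqrt
from statistics import NormalDist


def significance_band(baseline: float, n: int, alpha: float = 0.05) -> float:
    """Half-width of the band around the baseline within which an observed
    pass rate is not significant at level alpha (two-sided z-test)."""
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z_critical * sqrt(baseline * (1 - baseline) / n)


for n in (50, 250):
    half_width = significance_band(0.58, n)
    print(f"n = {n}: +/- {100 * half_width:.1f} percentage points")
```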

Weekly Trend

Aggregated 7-day pass rate

Rolling 7-day aggregated pass rates for a smoother trend view with reduced day-to-day noise.

Pass Rate

7-day rolling pass rate aggregating daily results for a smoother trend view.

Baseline

Historical average pass rate (58%) used as reference for detecting performance changes.

Threshold

Shaded region around baseline (±5.6%). Changes within this band are not statistically significant (p ≥ 0.05).

95% confidence interval for each data point. Wider intervals indicate more uncertainty (fewer samples).

Dashed line at 58% baseline with ±5.6% significance threshold
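The per-point 95% confidence intervals could be computed in several ways; the page does not say which one is used. The sketch below uses the Wilson score interval, a common choice for binomial proportions with small samples, purely as an assumption. The function name wilson_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(rate: float, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for an observed pass rate from n evaluations."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    denominator = 1 + z * z / n
    center = (rate + z * z / (2 * n)) / denominator
    half_width = z * sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denominator
    return center - half_width, center + half_width


# Pass rates and evaluation counts taken from the summary cards.
for label, rate, n in [("daily", 0.50, 50), ("7-day", 0.53, 250)]:
    low, high = wilson_interval(rate, n)
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}%")
```

The daily interval comes out noticeably wider than the 7-day one, matching the note that fewer samples mean more uncertainty.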