Last updated: Jan 29, 2026
This tracker aims to detect statistically significant degradations in the performance of Claude Code with Opus 4.5 on software engineering (SWE) tasks.
Status
Degradation Status
Shows if any time period has a statistically significant performance drop (p < 0.05).
Degradation detected over past 30 days
Baseline
Baseline Pass Rate
Historical average pass rate used as reference for detecting performance changes.
58%
reference rate
Daily Pass Rate
Percentage of benchmark tasks passed in the most recent day's evaluations.
50%
50 evaluations
7-day Pass Rate
Aggregate pass rate over the last 7 days. Provides a more stable measure than daily results.
53%
250 evaluations
30-day Pass Rate
Aggregate pass rate over the last 30 days. With the largest sample of the three windows, it is the best measure of overall sustained performance.
54%
655 evaluations
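The tracker does not state which significance test it applies. As a rough sketch only, the snippet below assumes a two-sided one-sample z-test (normal approximation to the binomial) that treats the 58% baseline as a fixed reference. The function name `z_test_vs_baseline` is hypothetical, and the inputs are the rounded percentages shown above, so the p-values are approximate.

```python
from math import sqrt
from statistics import NormalDist


def z_test_vs_baseline(pass_rate, n, baseline):
    """Two-sided z-test of an observed pass rate against a fixed baseline.

    Assumption: the baseline is treated as a known constant and the
    normal approximation to the binomial is used. This is not confirmed
    to be the tracker's actual method.
    """
    se = sqrt(baseline * (1 - baseline) / n)   # standard error under the baseline
    z = (pass_rate - baseline) / se            # negative z means below baseline
    p_value = 2 * NormalDist().cdf(-abs(z))    # two-sided p-value
    return z, p_value


BASELINE = 0.58
# (rounded pass rate, number of evaluations) as displayed on the cards
windows = {"daily": (0.50, 50), "7-day": (0.53, 250), "30-day": (0.54, 655)}

for name, (rate, n) in windows.items():
    z, p = z_test_vs_baseline(rate, n, BASELINE)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: z = {z:.2f}, p = {p:.3f} ({verdict})")
```

Under these assumptions only the 30-day window falls below p = 0.05, which is consistent with the status shown above. The daily and 7-day windows sit below the baseline too, but their smaller samples leave the gap within noise.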
Pass rate over time
Daily benchmark pass rates over the past 30 days.
Pass Rate
Daily benchmark pass rate showing the percentage of tasks solved each day.
Baseline
Historical average pass rate (58%) used as reference for detecting performance changes.
Threshold
Shaded region around the baseline (±14.0%). Daily pass rates that fall inside this band do not differ from the baseline at a statistically significant level (p ≥ 0.05).
Dashed line at 58% baseline with ±14.0% significance threshold
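How the band width is derived is not documented here. One plausible reading, offered as an assumption, is that the band is expressed in percentage points and is the half-width of a 95% normal-approximation interval around the baseline for a given number of evaluations. The helper `band_half_width` below is a hypothetical name used to sketch that reading.

```python
from math import sqrt
from statistics import NormalDist


def band_half_width(baseline, n, alpha=0.05):
    """Half-width of the non-significance band around a fixed baseline.

    Assumption: normal approximation to the binomial with n evaluations,
    two-sided at level alpha. The tracker's real formula is unknown.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z_crit * sqrt(baseline * (1 - baseline) / n)


# Sample sizes taken from the daily and 7-day cards
for label, n in [("daily", 50), ("7-day", 250)]:
    hw = band_half_width(0.58, n)
    print(f"{label} (n = {n}): ±{100 * hw:.1f} percentage points")
```

With 50 evaluations this sketch gives roughly ±13.7 points, close to the ±14.0% drawn on the daily chart. With 250 evaluations it gives roughly ±6.1 points rather than the ±5.6% on the 7-day chart, so the tracker presumably uses a different method or sample size than assumed here. Either way, the band narrows as the number of evaluations grows.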
Aggregated 7-day pass rate
Rolling 7-day aggregated pass rates for a smoother trend view with reduced day-to-day noise.
Pass Rate
7-day rolling pass rate aggregating daily results for a smoother trend view.
Baseline
Historical average pass rate (58%) used as reference for detecting performance changes.
Threshold
Shaded region around the baseline (±5.6%). 7-day pass rates that fall inside this band do not differ from the baseline at a statistically significant level (p ≥ 0.05).
Dashed line at 58% baseline with ±5.6% significance threshold
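The rolling 7-day series can be sketched as follows, assuming each point pools the passed and total counts from the trailing seven days rather than averaging the daily percentages. The function name `rolling_pass_rate` is hypothetical, and the daily counts are made-up illustrative numbers, not tracker data.

```python
from collections import deque


def rolling_pass_rate(daily_counts, window=7):
    """Pooled pass rate over a trailing window of days.

    daily_counts: iterable of (passed, total) pairs, oldest first.
    Assumes every day has total > 0. Returns one rate per full window.
    """
    recent = deque(maxlen=window)  # oldest day drops out automatically
    rates = []
    for passed, total in daily_counts:
        recent.append((passed, total))
        if len(recent) == window:
            pooled_passed = sum(p for p, _ in recent)
            pooled_total = sum(t for _, t in recent)
            rates.append(pooled_passed / pooled_total)
    return rates


# Illustrative counts only: eight days of 20 evaluations each
daily = [(12, 20), (10, 20), (11, 20), (13, 20),
         (9, 20), (12, 20), (10, 20), (14, 20)]
print([round(r, 3) for r in rolling_pass_rate(daily)])  # [0.55, 0.564]
```

Pooling counts weights each day by how many evaluations it ran, so a day with few runs cannot swing the aggregate as much as it would in a plain average of daily percentages.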