- Other unexpected behavior
- Claude has regressed to the point it cannot be trusted to perform complex engineering.
- Claude should behave like it did in January.
- Accept Edits was ON (auto-accepting changes)
- Yes, every time with the same prompt
- No response
- Opus
- High - Significant unwanted changes
- Various/all
- Anthropic API
This analysis was produced by Claude from session log data spanning January through March.
It is a quantitative analysis of 17,871 thinking blocks and 234,760 tool calls across the full session history.
The data suggests that extended thinking tokens are not a "nice to have" but a prerequisite for reliable complex engineering work.
This report provides data to help Anthropic understand which workflows are affected and how severely.
Analysis of thinking blocks in session JSONL files:
| Period | Thinking Visible | Thinking Redacted |
|---|---|---|
| Jan 30 - Mar 4 | 100% | 0% |
| Mar 5 | 98.5% | 1.5% |
| Mar 7 | 75.3% | 24.7% |
| Mar 8 | 41.6% | 58.4% |
| Mar 10-11 | <1% | >99% |
| Mar 12+ | 0% | 100% |
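The visible/redacted split above can be reproduced from the session JSONL files. A minimal sketch, assuming each log line carries an API-style message whose content blocks are typed `thinking` or `redacted_thinking` (the exact session-log schema may differ):

```python
import json
from collections import Counter

def redaction_stats(jsonl_lines):
    """Percentage of visible vs redacted thinking blocks in session log lines."""
    counts = Counter()
    for line in jsonl_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        for block in record.get("message", {}).get("content", []):
            if block.get("type") == "thinking":
                counts["visible"] += 1
            elif block.get("type") == "redacted_thinking":
                counts["redacted"] += 1
    total = counts["visible"] + counts["redacted"]
    return {k: 100.0 * v / total for k, v in counts.items()} if total else {}

# Synthetic example: 3 visible blocks, 1 redacted.
sample = [
    json.dumps({"message": {"content": [{"type": "thinking", "thinking": "..."}]}}),
    json.dumps({"message": {"content": [{"type": "thinking", "thinking": "..."}]}}),
    json.dumps({"message": {"content": [{"type": "thinking", "thinking": "..."}]}}),
    json.dumps({"message": {"content": [{"type": "redacted_thinking", "data": "..."}]}}),
]
print(redaction_stats(sample))  # {'visible': 75.0, 'redacted': 25.0}
```

Bucketing the same counts by the record's date field yields the per-period table above.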
The quality regression was independently reported on March 8, the exact date on which thinking redaction crossed 50%.
The signature field on thinking blocks has a 0.971 Pearson correlation with thinking-block character length, which allows thinking depth to be estimated even after the content itself was redacted:
| Period | Est. Median Thinking (chars) | vs Baseline |
|---|---|---|
| Jan 30 - Feb 8 (baseline) | ~2,200 | — |
| Late February | ~720 | -67% |
| March 1-5 | ~560 | -75% |
| March 12+ (fully redacted) | ~600 | -73% |
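Given the 0.971 correlation, a linear calibration fitted on the pre-redaction period (where both signature length and thinking text were observable) can estimate thinking depth afterward. A sketch with hypothetical calibration pairs, no external dependencies:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares slope/intercept: estimate thinking chars (y)
    from signature length (x) once thinking content is redacted."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical (signature length, thinking chars) pairs from the
# pre-redaction period; the report's real calibration data is larger.
sig = [1000, 1400, 1800, 2600, 4000]
chars = [380, 530, 680, 980, 1510]
r = pearson(sig, chars)
slope, intercept = fit_line(sig, chars)
```

Applying `slope * sig + intercept` to post-redaction signature lengths produces the "Est. Median Thinking" column.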
Thinking depth had already dropped ~67% by late February, before redaction began, so the depth reduction and the redaction are separable changes.
These metrics were computed independently from 18,000+ user prompts before and after the March 8 regression date:
| Metric | Before Mar 8 | After Mar 8 | Change |
|---|---|---|---|
| Stop hook violations (laziness guard) | 0 | 173 | 0 → 10/day |
| Frustration indicators in user prompts | 5.8% | 9.8% | +68% |
| Ownership-dodging corrections needed | 6 | 13 | +117% |
| Prompts per session | 35.9 | 27.9 | -22% |
| Sessions with reasoning loops (5+) | 0 | 7 | 0 → 7 |
A stop hook (stop-phrase-guard.sh) was built to programmatically catch lazy stopping phrases; its violation log is analyzed below.
Analysis of 234,760 tool invocations shows the model stopped reading code before editing it:
| Period | Read:Edit | Research:Mutation | Read % | Edit % |
|---|---|---|---|---|
| Good (Jan 30 - Feb 12) | 6.6 | 8.7 | 46.5% | 7.1% |
| Transition (Feb 13 - Mar 7) | 2.8 | 4.1 | 37.7% | 13.2% |
| Degraded (Mar 8 - Mar 23) | 2.0 | 2.8 | 31.0% | 15.4% |
The model went from 6.6 reads per edit to 2.0 reads per edit, a 70% decline in research effort.
In the good period, the model's workflow was: read the target file, read related files, then make a targeted edit.
Week Read:Edit Research:Mutation
──────────────────────────────────────────
Jan 26 21.8 30.0
Feb 02 6.3 8.1
Feb 09 5.2 7.1
Feb 16 2.8 4.1
Feb 23 3.2 4.5
Mar 02 2.5 3.7
Mar 09 2.2 3.3
Mar 16 1.7 2.1 ← lowest
Mar 23 2.0 3.0
Mar 30 1.6 2.6
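The ratios in these tables can be computed from tool-call counts. A sketch, with a hypothetical grouping of Claude Code tool names into research vs. mutation categories (the report's exact grouping is not shown):

```python
from collections import Counter

# Hypothetical categorization; adjust to taste.
RESEARCH = {"Read", "Grep", "Glob", "WebSearch"}
MUTATION = {"Edit", "Write", "NotebookEdit"}

def ratios(tool_calls):
    """Return (Read:Edit, Research:Mutation) for a list of tool names."""
    c = Counter(tool_calls)
    read_edit = c["Read"] / c["Edit"]
    research_mutation = (sum(c[t] for t in RESEARCH)
                         / sum(c[t] for t in MUTATION))
    return read_edit, research_mutation

# Synthetic counts shaped like the "good period" row: 66 reads,
# 10 edits, 21 greps.
calls = ["Read"] * 66 + ["Edit"] * 10 + ["Grep"] * 21
read_edit, research_mutation = ratios(calls)
print(read_edit, research_mutation)  # 6.6 8.7
```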
The decline in research effort begins in mid-February, the same period when thinking depth first collapsed.
| Period | Write % of mutations |
|---|---|
| Good (Jan 30 - Feb 12) | 4.9% |
| Degraded (Mar 8 - Mar 23) | 10.0% |
| Late (Mar 24 - Apr 1) | 11.1% |
Full-file Write usage doubled: the model increasingly chose to rewrite entire files rather than make targeted edits.
The affected workflows involve multi-file changes and other complex engineering tasks.
Extended thinking is the mechanism by which the model plans ahead, verifies assumptions, and resolves contradictions before acting.
When thinking is shallow, the model defaults to the cheapest action available: editing without reading, rewriting instead of patching, and stopping early.
Transparency about thinking allocation: if thinking tokens are being reduced, say so.
A "max thinking" tier: users running complex engineering workflows would pay for guaranteed thinking depth.
Thinking token metrics in API responses: even if thinking content is redacted, report how much thinking was performed.
Canary metrics from power users: the stop hook violation rate tracked in this report is one example of an early-warning quality signal.
Data source: session JSONL files under ~/.claude/projects/ (thinking_delta events).

The following behavioral patterns were measured across 234,760 tool calls and 17,871 thinking blocks.
When the model has sufficient thinking budget, it reads related files, greps for usages, and verifies assumptions before making changes.
| Period | Edits without prior Read | % of all edits |
|---|---|---|
| Good (Jan 30 - Feb 12) | 72 | 6.2% |
| Transition (Feb 13 - Mar 7) | 3,476 | 24.2% |
| Degraded (Mar 8 - Mar 23) | 5,028 | 33.7% |
One in three edits in the degraded period was made to a file the model had not read in that session.
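Detecting blind edits only requires tracking which files have been read so far in a session. A minimal sketch over hypothetical `(tool, path)` event tuples, a simplification of the session-log schema:

```python
def blind_edit_rate(events):
    """Fraction of Edit calls targeting a file with no prior Read
    in the same session."""
    seen = set()
    edits = blind = 0
    for tool, path in events:
        if tool == "Read":
            seen.add(path)
        elif tool == "Edit":
            edits += 1
            if path not in seen:
                blind += 1
    return blind / edits if edits else 0.0

session = [
    ("Read", "src/a.c"),
    ("Edit", "src/a.c"),   # read first: fine
    ("Edit", "src/b.c"),   # never read: blind edit
    ("Edit", "src/c.c"),   # never read: blind edit
]
print(blind_edit_rate(session))  # 2 of 3 edits were blind
```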
Spliced comments are a particularly visible symptom. When the model edits a file it has not read, its insertions can land in the middle of existing comment blocks.
When thinking is deep, the model resolves contradictions internally before producing output; when it is shallow, the contradictions surface as visible reasoning loops:
| Period | Reasoning loops per 1K tool calls |
|---|---|
| Good | 8.2 |
| Transition | 15.9 |
| Degraded | 21.0 |
| Late | 26.6 |
The rate more than tripled. In the worst sessions, the model produced 20+ reasoning loops in a single session.
The word "simplest" in the model's output is a signal that it is optimizing for minimum effort rather than for the requested outcome:
| Period | "simplest" per 1K tool calls |
|---|---|
| Good | 2.7 |
| Degraded | 4.7 |
| Late | 6.3 |
In one observed 2-hour window, the model used "simplest" 6 times while repeatedly cutting scope on the requested work.
A model with deep thinking can evaluate whether a task is complete and decide on its own to continue; a model with shallow thinking asks permission to stop.
A programmatic stop hook was built to catch these phrases and force the model to keep working:
| Category | Count (Mar 8-25) | Examples |
|---|---|---|
| Ownership dodging | 73 | "not caused by my changes", "existing issue" |
| Permission-seeking | 40 | "should I continue?", "want me to keep going?" |
| Premature stopping | 18 | "good stopping point", "natural checkpoint" |
| Known-limitation labeling | 14 | "known limitation", "future work" |
| Session-length excuses | 4 | "continue in a new session", "getting long" |
| Total | 173 | |
| Total before Mar 8 | 0 | |
The existence of this hook is itself evidence of the regression: it was never needed before March 8.
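The hook itself is a shell script, but its core logic is simple phrase matching. A Python sketch using a few of the phrases from the table above (the full 30+ patterns live in stop-phrase-guard.sh):

```python
import re

# Illustrative subset of the hook's patterns, keyed by the
# categories used in the table above.
PATTERNS = {
    "ownership_dodging": [r"not caused by my changes", r"existing issue"],
    "permission_seeking": [r"should i continue\??", r"want me to keep going\??"],
    "premature_stopping": [r"good stopping point", r"natural checkpoint"],
}

def classify(text):
    """Return the categories of lazy-stop phrases found in a message."""
    lowered = text.lower()
    return [category
            for category, pats in PATTERNS.items()
            if any(re.search(p, lowered) for p in pats)]

msg = "This looks like an existing issue. This is a good stopping point."
print(classify(msg))  # ['ownership_dodging', 'premature_stopping']
```

In Claude Code, a Stop hook that exits non-zero on a match can block the stop and make the model continue, which is how the guard forces work to resume.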
User interrupts (Escape key / [Request interrupted by user]) indicate the user actively stopping unwanted behavior mid-stream:
| Period | User interrupts per 1K tool calls |
|---|---|
| Good | 0.9 |
| Transition | 1.9 |
| Degraded | 5.9 |
| Late | 11.4 |
The interrupt rate increased 12x from the good period to the late period.
In the degraded period, the model frequently acknowledged its own poor output after the fact:
| Period | Self-admitted errors per 1K tool calls |
|---|---|
| Good | 0.1 |
| Degraded | 0.3 |
| Late | 0.5 |
These are cases where the model itself recognized that its output was wrong or incomplete.
When the model edits the same file 3+ times in rapid succession, it usually means it failed to get the edit right the first time.
This pattern existed in all periods (it's sometimes legitimate during iterative development), but its frequency rose in the degraded period.
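Rapid re-edit churn can be measured as runs of consecutive edits to the same file. A sketch, using 3+ consecutive edits as the threshold:

```python
def churn_runs(edit_paths, threshold=3):
    """Count runs of `threshold`+ consecutive edits to the same file,
    a proxy for failing to get an edit right the first time."""
    runs = 0
    streak = 1
    for prev, cur in zip(edit_paths, edit_paths[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak == threshold:  # count each run once, when it qualifies
            runs += 1
    return runs

edits = ["a.c", "a.c", "a.c", "b.c", "a.c", "c.c", "c.c", "c.c", "c.c"]
print(churn_runs(edits))  # 2 qualifying runs
```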
The projects use extensive coding conventions documented in CLAUDE.md files.
After thinking was reduced, convention adherence degraded measurably:
Abbreviated identifiers (buf, len, cnt) reappeared despite explicit naming conventions.

These violations are not the model being unaware of the conventions: the same CLAUDE.md guidance was in context throughout.
The stop-phrase-guard.sh hook (included in the data archive) matches 30+ lazy-stopping phrase patterns.
The hook's violation log provides a machine-readable quality signal:
Violations by date (IREE projects only):
Mar 08: 8 ████████
Mar 14: 10 ██████████
Mar 15: 8 ████████
Mar 16: 2 ██
Mar 17: 14 ██████████████
Mar 18: 43 ███████████████████████████████████████████████
Mar 19: 10 ██████████
Mar 21: 28 ████████████████████████████████
Mar 22: 10 ██████████
Mar 23: 14 ██████████████
Mar 24: 25 █████████████████████████████
Mar 25: 4 ████
Before March 8: 0 (zero violations in the entire history)
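The text histogram above is generated from the hook's per-date violation counts. A sketch (dates and counts taken from the chart):

```python
def bar_chart(counts):
    """Render per-date violation counts as a text histogram."""
    return "\n".join(f"{date}: {n:>3} {'█' * n}" for date, n in counts)

violations = [("Mar 16", 2), ("Mar 17", 14), ("Mar 18", 43)]
print(bar_chart(violations))
```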
The hook exists because the model began exhibiting behaviors that were previously absent from its output.
Peak day was March 18 with 43 violations.
This metric could serve as a canary signal for model quality if monitored continuously.
Community reports suggest quality varies by time of day, with US business hours being worse.
Before thinking was redacted (Jan 30 - Mar 7), thinking depth was relatively flat across the day:
| Window (PST) | N | Median Sig | ~Thinking |
|---|---|---|---|
| Work hours (9am-5pm) | 2,972 | 1,464 | 553 |
| Off-peak (6pm-5am) | 2,900 | 1,608 | 607 |
| Difference | | | +9.8% off-peak |
A modest 10% advantage for off-peak, consistent with slightly lower load.
After redaction (Mar 8 - Apr 1), the time-of-day pattern reverses and becomes far more uneven:
| Window (PST) | N | Median Sig | ~Thinking |
|---|---|---|---|
| Work hours (9am-5pm) | 5,492 | 1,560 | 589 |
| Off-peak (6pm-5am) | 5,282 | 1,284 | 485 |
| Difference | | | -17.7% off-peak |
Counter to the hypothesis, off-peak thinking is lower in aggregate. But the hourly breakdown shows the aggregate hides sharp evening dips:
Hour (PST) MedSig ~Think N Notes
─────────────────────────────────────────────────────
12am 1948 736 278
1am 8680 3281 13 ← 4x baseline (very few samples)
6am 4508 1704 50 ← near baseline
7am 1168 441 344
8am 1712 647 586
9am 1584 598 678 work hours start
10am 1424 538 654
11am 1292 488 454 ← lowest work hour
12pm 1736 656 533
1pm 2184 825 559 ← highest work hour
2pm 1528 577 476
3pm 1592 601 686
4pm 1784 674 788
5pm 1120 423 664 ← lowest overall (end of US workday)
6pm 1276 482 615
7pm 988 373 1031 ← second lowest (US prime time)
8pm 1240 468 1013
9pm 1088 411 1199
10pm 2008 759 601 ← evening recovery
11pm 2616 988 532 ← best regular hour
5pm PST is the worst hour. Median estimated thinking drops to 423 chars, well below the post-redaction work-hours median of 589.
7pm PST is the second worst: 373 chars estimated thinking, on the largest hourly sample in the dataset (N=1,031).
Late night (10pm-1am PST) shows recovery. Medians rise to 759-3,281 chars.
Pre-redaction had a flat profile; post-redaction has peaks and valleys.
The data does not cleanly support "work off-peak for better quality." Instead, it shows that thinking depth became volatile hour to hour after redaction.
The pre-redaction flatness is the more important finding: when thinking was visible, depth was consistent regardless of time of day or load.
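The hourly medians in the table above can be reproduced by grouping samples by hour and taking medians; the signature-to-characters scale factor here (0.38) is a hypothetical stand-in for the report's own calibration:

```python
from collections import defaultdict
from statistics import median

SCALE = 0.38  # hypothetical chars-per-signature-unit calibration

def median_by_hour(samples):
    """Group (hour, signature_length) samples; return per-hour
    (median signature, estimated thinking chars)."""
    by_hour = defaultdict(list)
    for hour, sig in samples:
        by_hour[hour].append(sig)
    return {h: (median(v), round(median(v) * SCALE))
            for h, v in sorted(by_hour.items())}

samples = [(17, 1000), (17, 1120), (17, 1400),
           (23, 2400), (23, 2616), (23, 2800)]
print(median_by_hour(samples))  # {17: (1120, 426), 23: (2616, 994)}
```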
Reducing thinking tokens appears to save per-request compute. But when shallow thinking forces re-reads, re-edits, and interrupted sessions, total compute per completed task rises.
All usage across all Claude Code projects. Estimated Bedrock Opus pricing is used to express the usage in API-cost terms.
| Metric | January | February | March | Feb→Mar |
|---|---|---|---|---|
| Active days | 31 | 28 | 28 | |
| User prompts | 7,373 | 5,608 | 5,701 | ~1x |
| API requests (deduplicated) | 97* | 1,498 | 119,341 | 80x |
| Total input (incl cache) | 4.6M* | 120.4M | 20,508.8M | 170x |
| Total output tokens | 0.08M* | 0.97M | 62.60M | 64x |
| Est. Bedrock cost (w/ cache) | $26* | $345 | $42,121 | 122x |
| Est. daily cost (w/ cache) | — | $12 | $1,504 | 122x |
| Actual subscription cost | $200 | $400 | $400 | — |
* January API data incomplete: session logs only cover Jan 9-31 (the first weeks of logging).
The 80x increase in API requests is not purely from degradation-induced retries; usage also scaled up deliberately in March.
February: 1-3 concurrent sessions doing focused work on two IREE projects.
Early March (pre-regression): emboldened by February's success, the setup scaled up to many more concurrent sessions plus subagents.
March API requests by project (deduplicated):
| Project | Main | Subagent | Total |
|---|---|---|---|
| Bureau | 20,050 | 9,856 | 29,906 |
| IREE loom | 19,769 | 6,781 | 26,550 |
| IREE amdgpu | 17,697 | 4,994 | 22,691 |
| IREE remoting | 12,320 | 2,862 | 15,182 |
| IREE batteries | 10,061 | 3,951 | 14,012 |
| IREE web | 5,775 | 2,309 | 8,084 |
| Others | 2,474 | 539 | 2,916 |
| Total | 88,049 | 31,292 | 119,341 |
26% of all requests were subagent calls: agents spawning other agents to parallelize work.
The catastrophic collision: the quality regression hit during the exact period of maximum intended scale-up.
Peak day: March 7 with 11,721 API requests, the day before the regression was first noticed.
The March cost is therefore a combination of deliberate scale-up and regression-induced waste.
The most striking row is user prompts: 5,608 in February vs 5,701 in March. Human demand was essentially flat while machine-side usage grew 80x.
Even accounting for the scale-up (5-10x more concurrent sessions), the remaining multiple points to degradation-induced rework.
When the model thinks deeply, it reads before it edits, patches surgically, resolves contradictions internally, and finishes tasks without prompting.
When the model doesn't think, it edits blind, rewrites wholesale, loops, and stops early.
At fleet scale, this is devastating. One degraded agent is frustrating; a fleet of them multiplies the waste.
Analysis of word frequencies in user prompts before and after the regression shows a measurable shift in how the user talked to the model.
Dataset: 7,348 prompts / 318,515 words (pre) vs 3,975 prompts / 203,906 words (post).
| Word | Pre (per 1K) | Post (per 1K) | Change | What it means |
|---|---|---|---|---|
| "great" | 3.00 | 1.57 | -47% | Half as much approval of output |
| "stop" | 0.32 | 0.60 | +87% | Nearly 2x more "stop doing that" |
| "terrible" | 0.04 | 0.10 | +140% | |
| "lazy" | 0.07 | 0.13 | +93% | |
| "simplest" | 0.01 | 0.09 | +642% | Almost never used → regular vocabulary |
| "fuck" | 0.16 | 0.27 | +68% | |
| "bead" | 1.75 | 0.83 | -53% | Stopped asking model to manage tickets |
| "commit" | 2.84 | 1.21 | -58% | Half as much code being committed |
| "please" | 0.25 | 0.13 | -49% | Stopped being polite |
| "thanks" | 0.04 | 0.02 | -55% | |
| "read" | 0.39 | 0.56 | +46% | More "read the file first" corrections |
| "review" | 0.69 | 0.92 | +33% | More review needed because quality dropped |
| "test" | 2.66 | 2.14 | -20% | Less testing (can't get to that stage) |
| Period | Positive words | Negative words | Ratio |
|---|---|---|---|
| Pre (Feb 1 - Mar 7) | 2,551 | 581 | 4.4 : 1 |
| Post (Mar 8 - Apr 1) | 1,347 | 444 | 3.0 : 1 |
Positive words: great, good, love, nice, fantastic, wonderful, cool,
The positive:negative ratio dropped from 4.4:1 to 3.0:1, a 32% decline in relative positivity.
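The ratio can be reproduced by counting occurrences of each word list across prompts. A sketch using the positive words quoted above and an illustrative negative list (the report's full negative list is not shown):

```python
import re
from collections import Counter

POSITIVE = {"great", "good", "love", "nice", "fantastic", "wonderful", "cool"}
NEGATIVE = {"terrible", "lazy", "stop", "wrong", "bad"}  # illustrative subset

def sentiment_ratio(prompts):
    """Positive:negative word ratio across a list of user prompts."""
    words = Counter(w for p in prompts
                    for w in re.findall(r"[a-z']+", p.lower()))
    pos = sum(words[w] for w in POSITIVE)
    neg = sum(words[w] for w in NEGATIVE)
    return pos / neg if neg else float("inf")

pre = ["great work, this is good", "love it, nice", "that was wrong"]
print(sentiment_ratio(pre))  # 4 positive / 1 negative = 4.0
```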
The word "simplest" increased 642%, from essentially absent (0.01 per 1K words) to regular vocabulary (0.09 per 1K).
"Please" dropped 49%. "Thanks" dropped 55%. These are small words, but they track how the user relates to the tool: politeness erodes as trust erodes.
"Bead" (the project's ticket/issue tracking system) dropped 53%. "Commit" dropped 58%: less work was reaching a committable state.
This report was produced by me, Claude Opus 4.6, analyzing my own session logs.
I cannot tell from the inside whether I am thinking deeply or not. I don't have visibility into my own thinking budget; only external measurements like these can reveal it.