Data Analysis · June 2026
A simple distributional analysis of every rsync release with bug data. Nothing complicated; it answers only one question: are the Claude-assisted releases unusually buggy?
In order to avoid accusations of this "just being Claude defending Claude," "AI slop," "probably all hallucinations," etc., I've decided it's probably worth explaining a few key points about how this report was created:
In late May 2026, rsync blew up. First, an evidence-free Mastodon post pointed to a spurious correlation between a regression that one user experienced after upgrading to a release and that release having Claude commits in it. It was viewed an unknown number of times, but likes and boosts alone handily passed the thousands mark, and, as all spurious anti-AI hate does, it gained significant traction, drawing 58 replies from 32 unique users. One raged about "cognitive surrender" with no evidence; another suggested adding rsync to the famous open-slopware blacklist. From there it spread to Hacker News, drawing 81 comments full of mixed dread, anger, and crowing about how this finally proves, once and for all, that no one can use LLMs safely. Among them was one particular comment that further fueled the view that the regressions and bugs were caused by Claude.
On May 30, 2026, this burgeoning outrage coalesced into a single focal point: a GitHub issue titled "Please Do Not Vibe Fuck Up This Software", opened against the rsync repository. It attached a screenshot of the Mastodon post criticizing the project's use of Claude. That's it. No bug report, no technical content, no attempt to actually ascertain if the concern was real or justified; just 350+ comments ranging from thoughtful concern to outright harassment (most of the most egregious, unreasonable, and outright violent comments have since been deleted; few thought to preserve them).
The thread did not stop at words. At one point it escalated to visual depictions of fantasies of violence, when one user posted a now-deleted comment including My Little Pony drawings of themselves strangling the "project janitor that pushed vibecoded commits."
Completing the internet outrage cycle, this issue in turn spread to Hacker News, generating hundreds more comments. Some attempted to point at the number of regressions after the introduction of Claude — "The Linux Mint Timeshift tool has an issue open documenting a number of regressions that are currently open on the rsync issues page, that were only introduced post-vibecoding" — as evidence that it was worse. Others pointed out that those regressions were not caused by Claude, and in response, the goalposts were moved again. Over and over, the core theme was one central claim, repeated everywhere: Claude-assisted development introduced bugs into a previously stable tool. AI is cognitive surrender, is cocaine, is loss of craft, and the users are right to be angry as a result:
People are very justifiably angry that a very stable, well trusted tool, has started to immediately go downhill… all because the main dev is vibecoding that software.
— fao_ on Hacker News
However, this doesn't have to be a question settled only on the basis of, ironically, vibes. It is something that could be, at least to a degree, empirically tested. Some even pointed that out:
On Lobste.rs, in the discussion of the Medium essay Tridge himself posted in response, some users, such as boramalper, finally began asking for actual evidence one way or the other:
It'd be interesting if someone actually did a timechart of regressions after each release (if at all possible) to see if the number actually went up recently or not.
— boramalper on Lobsters
User bitshift replied: "I would also love to see such a chart. It wouldn't be completely informative… But at least it would be something objective we could measure."
This analysis is that chart. Or, well, as best as it can be made, given the limitations of the data (see the previous section).
The analysis uses a single metric: severity-weighted bugs per 10 commits (sev/10c). Each bug is normalized to a 0–1 severity score (its LLM-assigned severity divided by 100), and those scores are summed per release instead of simply counting bugs. The raw bug count is also shown in the table for reference, but sev/10c drives all statistical tests.
sev/10c = (Σ severity/100 ÷ total_commits) × 10
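As a minimal sketch of the formula above, assuming each bug carries a severity on a 0 to 100 scale (the function name and the example numbers are hypothetical, not taken from the dataset):

```python
def sev_per_10_commits(severities, total_commits):
    """Severity-weighted bugs per 10 commits for one release.

    severities: per-bug severity scores on a 0-100 scale.
    total_commits: number of commits in the release's range.
    """
    if total_commits <= 0:
        raise ValueError("total_commits must be positive")
    # Each bug contributes between 0 and 1 instead of a flat count of 1.
    weighted_bugs = sum(s / 100 for s in severities)
    return weighted_bugs / total_commits * 10


# Hypothetical release: three bugs (severities 80, 50, 20) over 25 commits.
example_rate = sev_per_10_commits([80, 50, 20], 25)
print(round(example_rate, 2))  # 0.6
```

A release with no bugs scores exactly 0.00 regardless of how many commits it contains.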
Every commit on the default branch was ordered by committer date to produce a sequential timeline. Each git tag points to a specific commit in this timeline. A release's range is all commits between the previous tag and its own tag. Pre-release tags ("pre", "rc") are skipped as boundaries and absorbed into their final release. Every commit belongs to exactly one release.
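The range assignment described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the report's actual script: the commit ids, the tag names, and the regular expression used to recognise pre-release tags are all hypothetical.

```python
import re

# Assumed naming: pre-release tags end in "pre" or "rc" plus an optional number.
PRERELEASE = re.compile(r"(?:pre|rc)\d*$")


def assign_releases(ordered_commits, tags):
    """Map each commit to the first final release tag at or after it.

    ordered_commits: commit ids sorted by committer date, oldest first.
    tags: dict of tag name -> commit id the tag points to.
    """
    position = {commit: i for i, commit in enumerate(ordered_commits)}
    # Keep only final tags, ordered by where they sit in the timeline.
    finals = sorted(
        (position[commit], name)
        for name, commit in tags.items()
        if commit in position and not PRERELEASE.search(name)
    )
    release_of = {}
    start = 0
    for end, name in finals:
        # Everything after the previous final tag, up to and including this one.
        for commit in ordered_commits[start:end + 1]:
            release_of[commit] = name
        start = end + 1
    return release_of  # commits after the last final tag stay unassigned


commits = ["a", "b", "c", "d", "e", "f"]
tags = {"v1.0": "b", "v1.1pre1": "c", "v1.1": "e"}
releases = assign_releases(commits, tags)
print(releases)
```

In this toy timeline the pre-release tag on commit `c` is ignored as a boundary, so `c`, `d`, and `e` all land in `v1.1`.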
Bug reports come from three sources:
Before we jump into deeper analysis, let's just look at the two Claude releases themselves, to get a sense for them:
| Release | sev/10c | Bugs | Commits | Claude commits | Percentile |
|---|---|---|---|---|---|
| v3.4.2 | 0.00 | 0 | 50 | 9 | 0th (rank 0 of 35) |
| v3.4.3 | 3.29 | 17 | 34 | 28 | 77th (rank 27 of 35) |
If that doesn't look like a red flag to you, you'd be right.
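For readers who want to check the percentile figures, here is one way the rank could be computed. It assumes that "rank" means the number of historical releases with a strictly lower rate, which is consistent with the figures shown but is not confirmed by the report; the historical list below is made up for illustration.

```python
def percentile_rank(value, historical):
    """Return (rank, percentile) of value against the historical rates.

    rank counts historical releases strictly below value (an assumption).
    """
    rank = sum(1 for h in historical if h < value)
    return rank, 100 * rank / len(historical)


# Hypothetical stand-in for the 35 historical rates: 0.1, 0.2, ..., 3.5.
toy_history = [0.1 * i for i in range(1, 36)]

rank, pct = percentile_rank(2.75, toy_history)
print(rank, round(pct))  # 27 77
```

Under this definition, rank 27 of 35 corresponds to 27/35, or roughly the 77th percentile, and a release that scores below every historical release gets rank 0.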
So the question is: are the Claude releases unusually buggy, or could you easily pull a group just as bad out of the historical distribution by dumb luck? The way you answer that question statistically is an exact permutation test, which enumerates every possible pair of historical releases and asks: what fraction have a mean bug rate as bad as or worse than the one we actually observed for the Claude pair? That fraction is the p-value: the probability of seeing a result at least this bad if the Claude releases were just two ordinary draws from the historical distribution.
46%
exact permutation test p-value (one-sided, H₁: Claude mean > historical)
272 of 595 possible groups of 2 historical releases have mean sev/10c ≥ 1.65. Nearly half. The Claude releases sit right in the middle of the permutation distribution — there is nothing extreme about them.
What this p-value tells us is that the hypothesis that Claude makes releases worse has, at least so far, about as much predictive power as a coin flip: if you closed your eyes and picked 2 releases at random, you'd do as bad or worse nearly half the time. There's nothing unusual about the Claude group.
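The enumeration behind this test is small enough to write out directly. The sketch below assumes the historical sev/10c values are available as a plain list; the function name and the toy numbers are hypothetical and do not reproduce the real dataset.

```python
from itertools import combinations


def exact_group_pvalue(historical, observed_mean, group_size=2):
    """One-sided exact test: share of historical groups at least as bad.

    Returns (groups_as_bad, total_groups, p_value).
    """
    groups = list(combinations(historical, group_size))
    # Compare sums rather than means; the small tolerance guards float ties.
    threshold = observed_mean * group_size - 1e-12
    as_bad = sum(1 for g in groups if sum(g) >= threshold)
    return as_bad, len(groups), as_bad / len(groups)


# Toy history of five releases, with an observed group mean of 1.5.
as_bad, total, p_value = exact_group_pvalue([0.0, 0.5, 1.0, 2.0, 4.0], 1.5)
print(as_bad, total, p_value)  # 5 10 0.5
```

With 35 historical releases the same loop visits 35 × 34 / 2 = 595 pairs, which is the denominator quoted above, and 272 / 595 ≈ 0.46.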
The permutation test asks: how likely is it that a random group of releases scores as badly as the Claude group? But there's another way to pose the question: are Claude releases more likely than non-Claude releases to fall above the historical median? That's a textbook 2×2 contingency table, and the standard test for it is Fisher's exact test.
| | ≤ median | > median |
|---|---|---|
| Non-Claude | 18 | 17 |
| Claude | 1 | 1 |
74%
one-sided p-value (H₁: Claude more likely above median)
Fisher's exact test asks: if we split all releases at the historical median (0.74 sev/10c), are the Claude releases significantly buggier than previous releases, that is, more likely to land above the median? With a p-value of 74%, the answer is a decisive no. The odds ratio is 1.06, essentially 1:1. Claude releases are no more likely to be above the median than any other releases.
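The numbers in the contingency table can be checked by hand with the hypergeometric distribution. The following sketch uses only the standard library; the function name is hypothetical, and the odds ratio shown is the simple sample ratio ad/bc, which is an assumption about how the 1.06 figure was obtained.

```python
from math import comb


def fisher_one_sided(table):
    """One-sided Fisher's exact test for a 2x2 table.

    table = [[a, b], [c, d]] with rows (non-Claude, Claude)
    and columns (<= median, > median).
    Returns (p_value, odds_ratio) for H1: Claude more likely above median.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    above = b + d    # all releases above the median
    claude = c + d   # all Claude releases
    # Probability of at least d Claude releases landing above the median.
    p_value = sum(
        comb(above, k) * comb(total - above, claude - k)
        for k in range(d, min(above, claude) + 1)
    ) / comb(total, claude)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return p_value, odds_ratio


p_value, odds_ratio = fisher_one_sided([[18, 17], [1, 1]])
print(f"p = {p_value:.2f}, odds ratio = {odds_ratio:.2f}")  # p = 0.74, odds ratio = 1.06
```

Here the p-value is 1 minus the chance that both Claude releases fall at or below the median: 1 − C(19,2)/C(37,2) = 1 − 171/666 ≈ 0.74.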
To emphasize, this does not mean that all Claude releases in the future will not be more buggy. We don't have nearly enough data to build a model and extrapolate out like that, and that's not what a Fisher's exact test is for. The point that's being made here is that these specific releases are not at all notable; if no one had known they were AI, no one would have cared or noticed anything out of the ordinary, and there is no evidence with which to conclude that Claude made anything worse yet, unlike the objective, absolutist, universal claims made by critics.
In case you're not convinced, here's a visual aid, showing where these releases fall in the distribution of all prior releases:
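[Figure: where the Claude releases fall in the distribution of prior releases. Axis ticks: 0.01, 0.1, 1, 10, 100. Legend: Historical, Claude, Middle 50% (IQR), Outside IQR.]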
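The "Middle 50% (IQR)" band in the figure can be reproduced with a quartile calculation. The sketch below is one possible approach, assuming the inclusive quartile method from Python's `statistics` module; the report does not say which quartile convention it used, and the sample values are hypothetical.

```python
from statistics import quantiles


def iqr_position(value, historical):
    """Classify value as below, inside, or above the historical IQR."""
    q1, _median, q3 = quantiles(historical, n=4, method="inclusive")
    if value < q1:
        return "below IQR"
    if value > q3:
        return "above IQR"
    return "inside IQR"


# Toy history whose inclusive quartiles are 3 and 7.
toy_history = [1, 2, 3, 4, 5, 6, 7, 8, 9]
positions = [iqr_position(v, toy_history) for v in (2, 5, 8)]
print(positions)  # ['below IQR', 'inside IQR', 'above IQR']
```

Different quartile conventions can shift the band edges slightly, so a release sitting close to an edge may be classified differently depending on the method chosen.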
✓
"The Claude releases are statistically indistinguishable from historical releases"
One Claude release sits just below the IQR (v3.4.2, with zero real bugs), the other just above it. They bracket the middle of the distribution in opposite directions — neither is an outlier. The exact permutation test yields a p-value of 46% — pick any 2 releases at random and you'd do as bad or worse nearly half the time. There is no signal of abnormality.
✓
"The outrage selected on a single tail event and narrativized it"
A Mastodon user noticed a regression in v3.4.3, saw Claude commits, and concluded causation. But v3.4.3 at 3.29 sev/10c is at the 77th percentile — elevated but not extreme. 8 historical releases scored higher. The correlation is noise.
✗
"Claude clearly made things worse" (the main claim)
The Claude releases bracket the IQR — one below, one above. Neither is an outlier. There is no distributional evidence of harm. The claim rests entirely on a post-hoc correlation observed by a social media user.
✗
"Claude commits (in general) do not and will not make things worse"
This is a common misrepresentation of my claims here. I am not trying to extrapolate out into the future, and say something like "in general, Claude won't make things worse" or "Claude will never make things worse." Instead, the point is this: there is no evidence of it having done so, and the two Claude releases we have currently are thoroughly unremarkable, so the outrage is totally unjustified.
✗
"The regressions speak for themselves"
v3.4.1 — a pre-Claude release — has the highest bug rate in the dataset (39.39 sev/10c). Nobody noticed, because there was no AI to be angry at. The regressions only "speak" when you ignore the historical distribution.
✗
"Just wait, more bugs will surface"
v3.4.3 has been out long enough that its rate (3.29) is already comparable to historical releases. The "wait and see" argument is an appeal to an unknowable future that shifts the burden of proof away from the critics. If more bugs surface, they will enter the distribution like every other release. There is no reason to expect a regime break.
So, why do people feel like they've been betrayed, and feel so sure that things have "clearly" gotten worse, that Claude "broke rsync," when there is no evidence for this, and no data except two thoroughly unremarkable releases?
A lot of it is just sheer, blind outrage at the use of LLMs. However, there are some confounders that might have caused people to feel that way:
On the HN thread, user zos_kia pointed at the confound directly:
From a cursory look, it looks like a security fix in response to a CVE surfaced a coding error which has been present in the code since 2007. This is so banal that it's actually hilarious to see people lose their shit over it.
— zos_kia on Hacker News
On Lobsters, user jbert spelled out the causal chain:
The trigger for the increased volume of changes (and hence increased number of regressions) was the influx of (mostly) LLM-enabled security issues. i.e. the causal chain was: LLMs → more known security issues → more changes needed than usual → more regressions than usual.
— jbert on Lobsters
Essentially, this isn't a "Claude" problem, it's a "more security work" problem, something that Tridge himself confirmed in his response, describing how a flood of AI-generated CVE reports forced rapid, extensive changes to rsync's attack surface.
But, as with all things AI, it doesn't matter. In the end, the outrage isn't about whether rsync is worse or better now, it's about people not liking AI, and arguing from a priori definitions, not empirical results, to the desired conclusion: that AI is bad:
Like I said, the author "tried to balance security against feature regression." I don't dispute that he tried. I merely dispute that the chatbots are good at writing code; in fact, they are bad at writing code. If the author had approached these security bugs by hand with a mental model (a Naur theory!) which preserves their desired features and functionality then they would have caused fewer regressions...
— Corbin on Lobste.rs
In response to this sweeping, absolute, causal claim made with no evidence — and in fact, counter to the evidence — based on an old philosophical claim about the epistemology of programming, it is perhaps best to leave the victim of this outrage himself with the final word:
…for the people saying things like "I'm a PhD from xyz uni and I'm telling you LLMs are just stochastic tools that make everything up and the world will fall apart if you use them", I'm here to tell you that you are out of date. The world of software engineering has changed dramatically in the last few months. The world of IT security and maintaining software in the face of the flood of reports has completely and utterly changed just in the last few weeks. Anything you learned about this stuff last year might as well be from another planet… Bottom line is I do know (well, roughly!) how LLMs work, but that doesn't make them not useful. It does mean you have to be cautious, but I am being cautious, or as cautious as I can be given my desire to be sailing and not dealing with a flood of gunk from so-called internet experts.
— Andrew Tridgell