AI labs frequently update their models post-launch. These updates sometimes introduce "nerfs" such as aggressive censorship, excessive quantization (to save compute costs), or behavioral degradation. This chart exposes these hidden trends.
Note on Web UIs vs. API: LM Arena tests model performance via API endpoints (the "raw" model). Consumer chat interfaces (like gemini.google.com or chatgpt.com) often add system prompts, safety filters, and UI-specific wrappers not present in the raw API. Providers may also silently switch to quantized (lower-precision) versions of models to save compute during peak load, leading to perceived "nerfing" that API benchmarks don't fully capture. PRs are welcome for data sources representing true web-interface evaluations.
The data is automatically fetched daily from the official LM Arena Leaderboard Dataset on Hugging Face. The Arena relies on thousands of blind, crowdsourced human evaluations, making it one of the most robust metrics of actual model capability.
Each major AI lab has exactly ONE curve representing its flagship lineage. At each point in time, the curve tracks the lab's highest-rated flagship-eligible model on the leaderboard, not just the most recently announced one.
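The selection rule above can be sketched as a running "best model per lab per snapshot" pass. This is a minimal illustration, not the project's actual pipeline; the labs, model names, and scores below are made up for the example.

```python
def flagship_curve(rows):
    """For each (date, lab) snapshot, keep the highest-rated eligible model.

    rows: iterable of (date, lab, model, arena_score) tuples.
    Returns {(date, lab): (model, arena_score)}.
    """
    best = {}
    for date, lab, model, score in rows:
        key = (date, lab)
        if key not in best or score > best[key][1]:
            best[key] = (model, score)
    return best

# Illustrative snapshots only (not real leaderboard data).
rows = [
    ("2024-01-01", "LabA", "a-1",      1250),
    ("2024-01-01", "LabA", "a-1-mini", 1200),
    ("2024-02-01", "LabA", "a-1",      1250),
    ("2024-02-01", "LabA", "a-2",      1280),
]

curve = flagship_curve(rows)
# In February, LabA's curve switches to "a-2" because it outscores "a-1",
# even though "a-1" is still on the leaderboard.
```

Because the rule is "highest-rated at each point in time," a curve can also fall back to an older model if a newer release scores worse.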
Variant suffixes such as -thinking, -reasoning, and -high denote the same underlying model running in a different mode; they are merged into a single series so the curve doesn't flip-flop between them.
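The merge can be implemented as a simple name normalization that strips the mode suffix before grouping. A minimal sketch, assuming the three suffixes listed above are the complete set (extend the pattern as new variant modes appear):

```python
import re

# Hypothetical suffix list based on the variants named in this README.
VARIANT_SUFFIX = re.compile(r"-(thinking|reasoning|high)$")

def canonical_name(model: str) -> str:
    """Collapse mode variants onto one base model name for charting."""
    return VARIANT_SUFFIX.sub("", model)

# canonical_name("a-2-thinking") and canonical_name("a-2")
# both map to "a-2", so they share one curve.
```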