AI labs frequently update their models post-launch. These updates sometimes introduce "nerfs" such as aggressive censorship, excessive quantization (to save compute costs), or behavioral degradation. This chart exposes these hidden trends.
Note on Web UIs vs. API: LM Arena tests model performance via API endpoints (the "raw" model). Consumer chat interfaces (like gemini.google.com or chatgpt.com) often add system prompts, safety filters, and UI-specific wrappers not present in the raw API. Providers may also silently switch to quantized (lower-precision) versions of models to save compute during peak load, leading to perceived "nerfing" that API benchmarks don't fully capture. PRs are welcome for data sources representing true web-interface evaluations.
The data is automatically fetched daily from the official LM Arena Leaderboard Dataset on Hugging Face. The Arena relies on thousands of blind, crowdsourced human evaluations, making it one of the most robust metrics of actual model capability.
Each major AI lab has exactly ONE curve representing its flagship lineage. At each point in time, the curve tracks the lab's highest-rated flagship-eligible model on the leaderboard, not just the most recently announced one.
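The selection rule above can be sketched as a running "best model per lab per snapshot" pass. This is a minimal illustration, not the project's actual pipeline; the labs, model names, and scores below are made up for the example.

```python
def flagship_curve(rows):
    """For each (date, lab) snapshot, keep the highest-rated eligible model.

    rows: iterable of (date, lab, model, arena_score) tuples.
    Returns {(date, lab): (model, arena_score)}.
    """
    best = {}
    for date, lab, model, score in rows:
        key = (date, lab)
        if key not in best or score > best[key][1]:
            best[key] = (model, score)
    return best

# Illustrative snapshots only (not real leaderboard data).
rows = [
    ("2024-01-01", "LabA", "a-1",      1250),
    ("2024-01-01", "LabA", "a-1-mini", 1200),
    ("2024-02-01", "LabA", "a-1",      1250),
    ("2024-02-01", "LabA", "a-2",      1280),
]

curve = flagship_curve(rows)
# In February, LabA's curve switches to "a-2" because it outscores "a-1",
# even though "a-1" is still on the leaderboard.
```

Because the rule is "highest-rated at each point in time," a curve can also fall back to an older model if a newer release scores worse.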
Variant suffixes such as -thinking, -reasoning, and -high denote the same underlying model running in a different mode; they are merged into a single series so the curve doesn't flip-flop between them.
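The merge can be implemented as a simple name normalization that strips the mode suffix before grouping. A minimal sketch, assuming the three suffixes listed above are the complete set (extend the pattern as new variant modes appear):

```python
import re

# Hypothetical suffix list based on the variants named in this README.
VARIANT_SUFFIX = re.compile(r"-(thinking|reasoning|high)$")

def canonical_name(model: str) -> str:
    """Collapse mode variants onto one base model name for charting."""
    return VARIANT_SUFFIX.sub("", model)

# canonical_name("a-2-thinking") and canonical_name("a-2")
# both map to "a-2", so they share one curve.
```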