Arena AI Model Elo History

Tracking the public Elo lifecycle of flagship AI models over time to reveal potential nerfing.

Why this exists

AI labs frequently update their models post-launch, and users regularly report perceived "nerfs": excessive quantization (to save compute costs), aggressive censorship, or behavioral degradation. This chart plots each flagship's public Elo lifecycle on one timeline, so any such trend would be visible at a glance.

Data is fetched daily from the official Arena AI Leaderboard Dataset on Hugging Face, which is built from thousands of blind, crowdsourced head-to-head human votes. It's an imperfect lens (see the caveats below), but it is the most consistent long-running signal currently available.

How the chart works

Each lab gets exactly one curve. The toggle above the chart switches how that curve picks the lab's active model at each point in time.

  • Highest Elo (default): tracks the lab's highest-rated flagship-eligible model, not just the most recently announced one. If a lab ships a mid-tier model (e.g. Sonnet) while a higher-tier one (e.g. Opus) still ranks above it, the curve stays on Opus.
  • Latest release: tracks the lab's newest flagship instead, even when its Elo has dipped below a predecessor's (e.g. claude-opus-4-8 below 4-6). This is the clearest view of post-release degradation.
  • Inference-mode variants (suffixes like -thinking, -reasoning, -high) are merged into the parent so the curve doesn't flip-flop between modes.
  • New releases appear as labeled marker points, often with a jump in score.
  • Downward trends between releases are visible too, but read the caveats below before treating them as proof.
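
The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the site's actual code: the Snapshot record, the function names, the model names, and the Elo numbers are all hypothetical. The sketch assumes a merged variant inherits the best score among its modes, and it skips the flagship-eligibility filter.

```python
from dataclasses import dataclass
from datetime import date

# Suffixes treated as inference-mode variants (taken from the examples in the list above).
MODE_SUFFIXES = ("-thinking", "-reasoning", "-high")


@dataclass(frozen=True)
class Snapshot:
    model: str       # model name as listed on the leaderboard
    released: date   # release date (assumed to be known per model)
    elo: float       # rating on the day being plotted


def parent_name(model: str) -> str:
    """Strip inference-mode suffixes so variants collapse into their parent."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in MODE_SUFFIXES:
            if model.endswith(suffix):
                model = model[: -len(suffix)]
                stripped = True
    return model


def merge_variants(rows):
    """One row per parent model. Assumption: best variant score, earliest release date."""
    merged = {}
    for row in rows:
        key = parent_name(row.model)
        prev = merged.get(key)
        if prev is None:
            merged[key] = Snapshot(key, row.released, row.elo)
        else:
            merged[key] = Snapshot(
                key, min(prev.released, row.released), max(prev.elo, row.elo)
            )
    return list(merged.values())


def active_model(rows, mode="highest"):
    """Pick the model a lab's curve follows on a given day."""
    candidates = merge_variants(rows)
    if mode == "highest":
        return max(candidates, key=lambda r: r.elo)
    if mode == "latest":
        return max(candidates, key=lambda r: r.released)
    raise ValueError(f"unknown mode: {mode!r}")


# Invented data for one lab on one day (hypothetical names and ratings).
rows = [
    Snapshot("alpha-large-1", date(2025, 1, 10), 1450.0),
    Snapshot("alpha-large-2", date(2025, 6, 1), 1430.0),
    Snapshot("alpha-large-2-thinking", date(2025, 6, 1), 1441.0),
]

print(active_model(rows, "highest").model)  # the older model still leads
print(active_model(rows, "latest").model)   # the newest release, variants merged
```

With this invented data the two toggle positions disagree: "highest" stays on the older model, while "latest" follows the newer one even though it rates lower.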

Caveats

01

Web UIs vs. API

Arena tests models via API endpoints, i.e. the "raw" model. Consumer chat interfaces (gemini.google.com, chatgpt.com, etc.) add system prompts, safety filters, and UI wrappers that are not present in the raw API, and providers may silently switch to quantized (lower-precision) versions under load. Perceived "nerfing" in those products may therefore not show up here.

02

Elo is relative

Ratings shift against the rest of the leaderboard. When stronger models enter (or peers improve), an unchanged model's Elo can drift down anyway; conversely, if every model regresses in parallel, Elo won't reveal it. A fixed-benchmark longitudinal dataset would be cleaner, but no such public archive seems to exist.
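
Both effects can be reproduced with a toy simulation. The sketch below uses the textbook Elo update (expected score from the rating gap, then a K-factor correction), which is an assumption and not necessarily how Arena computes its published ratings. The hidden "strengths", the starting ratings, the K value, and the model names are all invented, and each match result is replaced by its average so the run is deterministic.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Textbook Elo: probability that A beats B given the rating gap."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def play_round(ratings, strengths, k=16.0):
    """Every pair meets once; each result is set to its average under the hidden strengths."""
    updated = dict(ratings)
    names = sorted(ratings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            outcome = expected_score(strengths[a], strengths[b])  # what voters do on average
            predicted = expected_score(ratings[a], ratings[b])    # what the ratings predict
            delta = k * (outcome - predicted)
            updated[a] += delta   # points gained by one side...
            updated[b] -= delta   # ...are lost by the other (zero-sum)
    return updated


def settle(ratings, strengths, rounds=200):
    """Repeat rounds until the ratings have effectively stopped moving."""
    for _ in range(rounds):
        ratings = play_round(ratings, strengths)
    return ratings


# Hypothetical pool: "old" and "peer" are equally strong, "new" is stronger.
strengths = {"old": 1400.0, "peer": 1400.0, "new": 1500.0}
start = {name: 1400.0 for name in strengths}

after = settle(start, strengths)
print(after)  # "old" ends below 1400 although its strength never changed

# Every model gets 100 points weaker at once: the ratings come out the same.
weaker = {name: s - 100.0 for name, s in strengths.items()}
after_regression = settle(start, weaker)
print(after_regression)
```

Because each update is zero-sum, the points the newcomer gains come out of the incumbents' ratings, so "old" settles roughly 33 points lower without having changed. The second run depends only on the gaps between strengths, which is why a uniform regression leaves no trace.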