James Routley

Last updated: March 15, 2026

This page collects architecture figures and fact sheets from The Big LLM Architecture Comparison and A Dream of Spring for Open-Weight LLMs. It focuses on the architecture panels only. Click a figure to enlarge it and use the model title to jump to the corresponding article section.

If you spot an inaccurate fact sheet, mislabeled architecture, or broken link, please file an issue here: Architecture Gallery issue tracker.

Upon popular request, you can now also get this as a physical poster via Zazzle. The preview there may look a bit low-resolution, but the upload is based on a fresh high-resolution export at 14570 x 12490 pixels (a 56 MB PNG file with 182 megapixels). I just ordered one myself but please be aware that I haven't been able to verify the quality, yet.

Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.

Scale: 8B parameters
Date: 2024-04-18
Decoder type: Dense
Attention: GQA with RoPE
Key detail: Pre-norm baseline; wider than OLMo 2 at a similar scale.

Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.

Scale: 7B parameters
Date: 2024-11-25
Decoder type: Dense
Attention: MHA with QK-Norm
Key detail: Uses inside-residual post-norm instead of the usual pre-norm layout.

DeepSeek's flagship template kicked off the recent wave of large open MoE models.

Scale: 671B total, 37B active
Date: 2024-12-26
Decoder type: Sparse MoE
Attention: MLA
Key detail: Uses a dense prefix plus a shared expert to keep a very large model practical at inference.

Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.

Scale: 671B total, 37B active
Date: 2025-01-20
Decoder type: Sparse MoE
Attention: MLA
Key detail: Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.

Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.

Scale: 27B parameters
Date: 2025-03-11
Decoder type: Dense
Attention: GQA with QK-Norm and 5:1 sliding-window/global attention
Key detail: Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.

Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.

Scale: 24B parameters
Date: 2025-03-18
Decoder type: Dense
Attention: Standard GQA
Key detail: Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.

Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.

Scale: 400B total, 17B active
Date: 2025-04-05
Decoder type: Sparse MoE
Attention: GQA
Key detail: Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.

Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.

Scale: 235B total, 22B active
Date: 2025-04-28
Decoder type: Sparse MoE
Attention: GQA with QK-Norm
Key detail: High-capacity MoE design optimized for serving efficiency without a shared expert.

Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.

Scale: 32B parameters
Date: 2025-04-28
Decoder type: Dense
Attention: GQA with QK-Norm
Key detail: Reference dense Qwen stack with QK-Norm and 8 KV heads.

Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.

Scale: 4B parameters
Date: 2025-04-28
Decoder type: Dense
Attention: GQA with QK-Norm
Key detail: Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.

Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.

Scale: 8B parameters
Date: 2025-04-28
Decoder type: Dense
Attention: GQA with QK-Norm
Key detail: Reference Qwen3 dense stack with QK-Norm and 8 KV heads.

Compact dense model that experiments with leaving out positional encodings in selected layers.

Scale: 3B parameters
Date: 2025-06-19
Decoder type: Dense
Attention: GQA with periodic NoPE layers
Key detail: Every fourth layer omits RoPE to test a NoPE-style cadence.

Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.

Scale: 1T total, 32B active
Date: 2025-07-10
Decoder type: Sparse MoE
Attention: MLA
Key detail: More experts and fewer MLA heads than DeepSeek V3.

Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.

Scale: 355B total, 32B active
Date: 2025-07-28
Decoder type: Sparse MoE
Attention: GQA with QK-Norm
Key detail: Starts with three dense layers before MoE routing and keeps a shared expert.

Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.

Scale: 120B parameters
Date: 2025-08-04
Decoder type: Sparse MoE
Attention: GQA with alternating sliding-window and global layers
Key detail: Shared architectural template scaled up for OpenAI's flagship open-weight release.

OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.

Scale: 20B total, 3.6B active
Date: 2025-08-04
Decoder type: Sparse MoE
Attention: GQA with alternating sliding-window and global layers
Key detail: Wider and shallower than Qwen3, with attention bias and sink mechanisms.

Rare production-model release that shows an older MoE style with fewer, larger experts.

Scale: 270B parameters
Date: 2025-08-22
Decoder type: Sparse MoE
Attention: GQA
Key detail: Adds an always-on SwiGLU path that effectively behaves like a shared expert.

Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.

Scale: 80B total, 3B active
Date: 2025-09-09
Decoder type: Sparse hybrid
Attention: 3:1 Gated DeltaNet and Gated Attention
Key detail: Adds many more experts, a shared expert, and a native 262k context.

MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.

Scale: 230B total, 10B active
Date: 2025-10-23
Decoder type: Sparse MoE
Attention: GQA with QK-Norm and partial RoPE
Key detail: Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.

Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.

Scale: 48B total, 3B active
Date: 2025-10-30
Decoder type: Sparse hybrid
Attention: 3:1 Kimi Delta Attention and MLA
Key detail: Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.

Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.

Scale: 32B parameters
Date: 2025-11-20
Decoder type: Dense
Attention: GQA with QK-Norm and 3:1 sliding-window/global attention
Key detail: Keeps post-norm while scaling width and applying YaRN only on global layers.

New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.

Scale: 7B parameters
Date: 2025-11-20
Decoder type: Dense
Attention: MHA with QK-Norm and 3:1 sliding-window/global attention
Key detail: Retains post-norm, keeps MHA, and applies YaRN only on global layers.

DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.

Scale: 671B total, 37B active
Date: 2025-12-01
Decoder type: Sparse MoE
Attention: MLA with DeepSeek Sparse Attention
Key detail: An evolutionary update focused on efficiency rather than a new base layout.

Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.

Scale: 673B total, 41B active
Date: 2025-12-02
Decoder type: Sparse MoE
Attention: MLA
Key detail: Near-clone of DeepSeek V3 with larger experts, fewer routed experts, and multimodal support.

NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.

Scale: 30B total, 3B active
Date: 2025-12-04
Decoder type: Hybrid MoE
Attention: Mostly Mamba-2 with a few GQA layers
Key detail: Interleaves Mamba-2 and MoE blocks, using attention only sparingly.

Xiaomi MiMo-V2-Flash 309B

Large MoE model that pushes sliding-window attention harder than most contemporaries.

Scale: 309B total, 15B active
Date: 2025-12-16
Decoder type: Sparse MoE
Attention: 5:1 sliding-window/global attention
Key detail: Uses an unusually small 128-token local window plus multi-token prediction.

Immediate GLM predecessor that stays closer to the older GLM-4.5 style before the MLA shift.

Scale: 355B total, 32B active
Date: 2025-12-22
Decoder type: Sparse MoE
Attention: GQA with QK-Norm
Key detail: Serves as the pre-MLA, pre-sparse-attention baseline with the same 32B active path as GLM-4.5.

Arcee AI Trinity Large 400B

Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.

Scale: 400B total, 13B active
Date: 2026-01-27
Decoder type: Sparse MoE
Attention: GQA with gated attention and 3:1 sliding-window/global attention
Key detail: Combines QK-Norm, RoPE+NoPE, sandwich norm, and a coarse-grained MoE.

Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.

Scale: 744B total, 40B active
Date: 2026-02-11
Decoder type: Sparse MoE
Attention: MLA with DeepSeek Sparse Attention
Key detail: Bigger than GLM-4.7, with more experts and fewer layers.

Nemotron 3 Super 120B-A12B

The Super variant scales up Nano and adds both latent experts and native speculative decoding support.

Scale: 120B total, 12B active
Date: 2026-03-11
Decoder type: Hybrid MoE
Attention: Mostly Mamba-2 with a few GQA layers
Key detail: Adds latent-space MoE and shared-weight MTP for fast inference.

Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.

Scale: 196B total, 11B active
Date: 2026-02-01
Decoder type: Sparse MoE
Attention: GQA with 3:1 sliding-window attention
Key detail: Uses MTP-3 during both training and inference for unusually high throughput.

Small on-device oriented model that stays close to Llama 3.2 while nudging the scaling choices.

Scale: 3B parameters
Date: 2026-02-10
Decoder type: Dense
Attention: GQA
Key detail: Llama-like stack without tying input embeddings to the output layer.

Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.

Scale: 230B total, 10B active
Date: 2026-02-12
Decoder type: Sparse MoE
Attention: GQA with QK-Norm
Key detail: Deliberately avoids sliding-window or linear-attention hybrids while keeping a 10B active path.

Compact multilingual model from Cohere with a rare parallel transformer block.

Scale: 3.35B parameters
Date: 2026-02-13
Decoder type: Dense
Attention: GQA with 3:1 sliding-window attention
Key detail: Runs attention and the MLP in parallel while mixing RoPE with NoPE.

Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.

Scale: 1T total, 63B active
Date: 2026-02-15
Decoder type: Sparse hybrid
Attention: Lightning Attention plus MLA
Key detail: Uses a 7:1 linear-attention/MLA ratio and a much larger 63B active path.

Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.

Scale: 397B total, 17B active
Date: 2026-02-16
Decoder type: Sparse hybrid
Attention: 3:1 Gated DeltaNet and Gated Attention
Key detail: Turns the former Qwen3-Next side branch into the new core design with 512 experts and 17B active parameters.

Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.

Scale: 105B total
Date: 2026-03-03
Decoder type: Sparse MoE
Attention: MLA with KV LayerNorm and NoPE + RoPE
Key detail: Large vocabulary and strong Indic language support carried into the larger MLA-based sparse MoE variant.

Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.

Scale: 30B total
Date: 2026-03-03
Decoder type: Sparse MoE
Attention: GQA with QK-Norm
Key detail: Large vocabulary and strong Indic language support paired with a reasoning-focused sparse MoE design.

Source article

The Big LLM Architecture Comparison

The original comparison article that walks through the architecture figures in context and explains the key design choices across dense, MoE, MLA, and hybrid decoder families.

Read article

The Big LLM Architecture Comparison overview figure