
LLM Architecture Gallery

Last updated: March 15, 2026

This page collects architecture figures and fact sheets from The Big LLM Architecture Comparison and A Dream of Spring for Open-Weight LLMs. It focuses on the architecture panels only. Click a figure to enlarge it and use the model title to jump to the corresponding article section.

If you spot an inaccurate fact sheet, mislabeled architecture, or broken link, please file an issue here: Architecture Gallery issue tracker.

By popular request, you can now also get this as a physical poster via Zazzle. The preview there may look a bit low-resolution, but the upload is based on a fresh high-resolution export at 14570 x 12490 pixels (a 56 MB PNG file with 182 megapixels). I just ordered one myself, but please be aware that I haven't been able to verify the print quality yet.

Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.

Scale
8B parameters

Date
2024-04-18

Decoder type
Dense

Attention
GQA with RoPE

Key detail
Pre-norm baseline; wider than OLMo 2 at a similar scale.

Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.

Scale
7B parameters

Date
2024-11-25

Decoder type
Dense

Attention
MHA with QK-Norm

Key detail
Uses inside-residual post-norm instead of the usual pre-norm layout.
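
To make the contrast concrete, here is a minimal sketch of the two residual layouts. It assumes that "inside-residual post-norm" means normalizing the sublayer output before it is added back to the residual stream. The `rms_norm` helper (without a learnable scale) and `toy_sublayer` are illustrative stand-ins, not OLMo 2 code.

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learnable scale (a simplification for this sketch).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_norm_block(x, sublayer):
    # Usual pre-norm: normalize the sublayer input, then add the residual.
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]

def inside_residual_post_norm_block(x, sublayer):
    # Assumed post-norm variant: normalize the sublayer output,
    # still inside the residual branch, then add the residual.
    return [a + b for a, b in zip(x, rms_norm(sublayer(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the MLP: a large scaling.
    return [100.0 * v for v in x]

x = [3.0, 4.0]
pre = pre_norm_block(x, toy_sublayer)
post = inside_residual_post_norm_block(x, toy_sublayer)
# Size of the update that each layout adds to the residual stream.
post_update = [b - a for a, b in zip(x, post)]
```

In this toy setup the post-norm update has roughly unit RMS no matter how large the sublayer output is, while the pre-norm update scales with the sublayer.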

DeepSeek's flagship template kicked off the recent wave of large open MoE models.

Scale
671B total, 37B active

Date
2024-12-26

Decoder type
Sparse MoE

Attention
MLA

Key detail
Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
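
The shared-expert idea can be sketched as a toy layer in which one expert always runs and only the top-scoring routed experts are added on top. This is a simplified illustration with scalar inputs; the function name, the gate normalization, and the example experts are assumptions of this sketch, and the dense prefix is not modeled.

```python
def moe_layer(x, router_scores, routed_experts, shared_expert, top_k):
    """Toy MoE layer: the shared expert always runs, only top_k routed experts run."""
    # Pick the top_k routed experts by router score.
    chosen = sorted(range(len(routed_experts)),
                    key=lambda i: router_scores[i], reverse=True)[:top_k]
    # Normalize the chosen scores so the gates sum to 1 (an assumption here).
    total = sum(router_scores[i] for i in chosen)
    out = shared_expert(x)  # always-on path
    for i in chosen:
        out += (router_scores[i] / total) * routed_experts[i](x)
    return out, chosen

# Hypothetical experts: each one just scales its input.
routed = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda v: 0.5 * v

out, chosen = moe_layer(10.0, [0.1, 0.4, 0.2, 0.3], routed, shared, top_k=2)
```

Only two of the four routed experts are evaluated for this input, which is the mechanism behind the gap between total and active parameter counts.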

Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.

Scale
671B total, 37B active

Date
2025-01-20

Decoder type
Sparse MoE

Attention
MLA

Key detail
Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.

Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.

Scale
27B parameters

Date
2025-03-11

Decoder type
Dense

Attention
GQA with QK-Norm and 5:1 sliding-window/global attention

Key detail
Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
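
A 5:1 sliding-window/global ratio can be written down as a per-layer schedule. The sketch below assumes the global layer closes each group of six; the exact placement and the 12-layer depth are illustrative choices, not values taken from the fact sheet.

```python
def attention_schedule(n_layers, local_per_global=5):
    # Every (local_per_global + 1)-th layer is global; the rest use a sliding window.
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(n_layers)]

# Hypothetical 12-layer stack, used only to show the pattern.
schedule = attention_schedule(12)
```

The same helper covers the 3:1 layouts listed elsewhere in the gallery by passing `local_per_global=3`.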

Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.

Scale
24B parameters

Date
2025-03-18

Decoder type
Dense

Attention
Standard GQA

Key detail
Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.

Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.

Scale
400B total, 17B active

Date
2025-04-05

Decoder type
Sparse MoE

Attention
GQA

Key detail
Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.

Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.

Scale
235B total, 22B active

Date
2025-04-28

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
High-capacity MoE design optimized for serving efficiency without a shared expert.

Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.

Scale
32B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Reference dense Qwen stack with QK-Norm and 8 KV heads.
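
The practical effect of using 8 KV heads is easiest to see in the KV-cache size. The following back-of-the-envelope sketch uses the standard estimate (keys plus values, per layer, per KV head); the layer count, query-head count, head dimension, and sequence length are placeholder assumptions, not figures from this fact sheet.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2: one cached tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration for illustration (16-bit cache entries).
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 64, 64, 128, 32_768

mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa_bytes = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)              # 8 shared KV heads
reduction = mha_bytes // gqa_bytes
```

Under these assumptions the cache for one sequence shrinks by the ratio of query heads to KV heads.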

Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.

Scale
4B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.

Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.

Scale
8B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Reference Qwen3 dense stack with QK-Norm and 8 KV heads.

Compact dense model that experiments with leaving out positional encodings in selected layers.

Scale
3B parameters

Date
2025-06-19

Decoder type
Dense

Attention
GQA with periodic NoPE layers

Key detail
Every fourth layer omits RoPE to test a NoPE-style cadence.

Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.

Scale
1T total, 32B active

Date
2025-07-10

Decoder type
Sparse MoE

Attention
MLA

Key detail
More experts and fewer MLA heads than DeepSeek V3.
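
The "total, active" pairs in these fact sheets can be turned into an active fraction, which gives a rough sense of how sparse each model is per token. The sketch below only uses numbers listed in this gallery; the dictionary labels are informal descriptions.

```python
def active_fraction(total_b, active_b):
    # Share of parameters used per token, from counts given in billions.
    return active_b / total_b

# (total, active) in billions, as listed in the fact sheets above.
models = {
    "DeepSeek V3": (671, 37),
    "Moonshot 1T model": (1000, 32),
}
fractions = {name: active_fraction(*counts) for name, counts in models.items()}
```

By this measure the trillion-parameter model activates a smaller share of its weights per token than DeepSeek V3 does.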

Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.

Scale
355B total, 32B active

Date
2025-07-28

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Starts with three dense layers before MoE routing and keeps a shared expert.

Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.

Scale
120B parameters

Date
2025-08-04

Decoder type
Sparse MoE

Attention
GQA with alternating sliding-window and global layers

Key detail
Shared architectural template scaled up for OpenAI's flagship open-weight release.

OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.

Scale
20B total, 3.6B active

Date
2025-08-04

Decoder type
Sparse MoE

Attention
GQA with alternating sliding-window and global layers

Key detail
Wider and shallower than Qwen3, with attention bias and sink mechanisms.
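
One way to picture an attention sink is as an extra logit that takes part in the softmax but has no value vector attached. The sketch below is an assumption about the general mechanism, not a reproduction of the gpt-oss code; `softmax_with_sink` and its arguments are hypothetical names.

```python
import math

def softmax_with_sink(scores, sink_logit):
    # The sink logit joins the softmax denominator but contributes no output,
    # so the returned weights over real tokens can sum to less than 1.
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]

scores = [1.0, 2.0, 3.0]
with_sink = softmax_with_sink(scores, sink_logit=0.0)
# A sink logit of -inf recovers the ordinary softmax.
plain = softmax_with_sink(scores, sink_logit=float("-inf"))
```

The sink lets a head put probability mass "nowhere" instead of being forced to spread all of it over the visible tokens.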

Rare production-model release that shows an older MoE style with fewer, larger experts.

Scale
270B parameters

Date
2025-08-22

Decoder type
Sparse MoE

Attention
GQA

Key detail
Adds an always-on SwiGLU path that effectively behaves like a shared expert.

Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.

Scale
80B total, 3B active

Date
2025-09-09

Decoder type
Sparse hybrid

Attention
3:1 Gated DeltaNet and Gated Attention

Key detail
Adds many more experts, a shared expert, and a native 262k context.

MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.

Scale
230B total, 10B active

Date
2025-10-23

Decoder type
Sparse MoE

Attention
GQA with QK-Norm and partial RoPE

Key detail
Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.

Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.

Scale
48B total, 3B active

Date
2025-10-30

Decoder type
Sparse hybrid

Attention
3:1 Kimi Delta Attention and MLA

Key detail
Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.

Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.

Scale
32B parameters

Date
2025-11-20

Decoder type
Dense

Attention
GQA with QK-Norm and 3:1 sliding-window/global attention

Key detail
Keeps post-norm while scaling width and applying YaRN only on global layers.

New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.

Scale
7B parameters

Date
2025-11-20

Decoder type
Dense

Attention
MHA with QK-Norm and 3:1 sliding-window/global attention

Key detail
Retains post-norm, keeps MHA, and applies YaRN only on global layers.

DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.

Scale
671B total, 37B active

Date
2025-12-01

Decoder type
Sparse MoE

Attention
MLA with DeepSeek Sparse Attention

Key detail
An evolutionary update focused on efficiency rather than a new base layout.

Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.

Scale
673B total, 41B active

Date
2025-12-02

Decoder type
Sparse MoE

Attention
MLA

Key detail
Near-clone of DeepSeek V3 with larger experts, fewer routed experts, and multimodal support.

NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.

Scale
30B total, 3B active

Date
2025-12-04

Decoder type
Hybrid MoE

Attention
Mostly Mamba-2 with a few GQA layers

Key detail
Interleaves Mamba-2 and MoE blocks, using attention only sparingly.

Xiaomi MiMo-V2-Flash 309B

Large MoE model that pushes sliding-window attention harder than most contemporaries.

Scale
309B total, 15B active

Date
2025-12-16

Decoder type
Sparse MoE

Attention
5:1 sliding-window/global attention

Key detail
Uses an unusually small 128-token local window plus multi-token prediction.
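
To illustrate what a 128-token local window means for a single query position, here is a small sketch of causal sliding-window visibility. It assumes the window includes the current token; `visible_keys` is a hypothetical helper, not part of any library.

```python
def visible_keys(query_pos, window=128):
    # Causal sliding window: a query sees itself and the previous window - 1 tokens.
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)

late = visible_keys(200)   # deep into the sequence: window is full
early = visible_keys(10)   # near the start: fewer tokens are available
```

Global layers lift this restriction, which is why the ratio of local to global layers matters for long-context behavior.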

Immediate GLM predecessor that stays closer to the older GLM-4.5 style before the MLA shift.

Scale
355B total, 32B active

Date
2025-12-22

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Serves as the pre-MLA, pre-sparse-attention baseline with the same 32B active path as GLM-4.5.

Arcee AI Trinity Large 400B

Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.

Scale
400B total, 13B active

Date
2026-01-27

Decoder type
Sparse MoE

Attention
GQA with gated attention and 3:1 sliding-window/global attention

Key detail
Combines QK-Norm, RoPE+NoPE, sandwich norm, and a coarse-grained MoE.

Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.

Scale
744B total, 40B active

Date
2026-02-11

Decoder type
Sparse MoE

Attention
MLA with DeepSeek Sparse Attention

Key detail
Bigger than GLM-4.7, with more experts and fewer layers.

Nemotron 3 Super 120B-A12B

The Super variant scales up Nano and adds both latent experts and native speculative decoding support.

Scale
120B total, 12B active

Date
2026-03-11

Decoder type
Hybrid MoE

Attention
Mostly Mamba-2 with a few GQA layers

Key detail
Adds latent-space MoE and shared-weight MTP for fast inference.

Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.

Scale
196B total, 11B active

Date
2026-02-01

Decoder type
Sparse MoE

Attention
GQA with 3:1 sliding-window attention

Key detail
Uses MTP-3 during both training and inference for unusually high throughput.

Small on-device-oriented model that stays close to Llama 3.2 while nudging the scaling choices.

Scale
3B parameters

Date
2026-02-10

Decoder type
Dense

Attention
GQA

Key detail
Llama-like stack without tying input embeddings to the output layer.
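
Untying the input embeddings from the output layer has a direct parameter cost, which the following sketch estimates. The vocabulary size and model width are placeholder assumptions chosen for illustration, not values from this fact sheet.

```python
def embedding_params(vocab_size, d_model, tied):
    # One vocab_size x d_model table, shared if tied, duplicated if untied.
    table = vocab_size * d_model
    return table if tied else 2 * table

# Hypothetical sizes for illustration only.
VOCAB_SIZE, D_MODEL = 128_000, 2048

tied_params = embedding_params(VOCAB_SIZE, D_MODEL, tied=True)
untied_params = embedding_params(VOCAB_SIZE, D_MODEL, tied=False)
extra_params = untied_params - tied_params
```

At small model sizes this extra table is a noticeable share of the total budget, which is why the choice is worth calling out for a 3B model.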

Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.

Scale
230B total, 10B active

Date
2026-02-12

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Deliberately avoids sliding-window or linear-attention hybrids while keeping a 10B active path.

Compact multilingual model from Cohere with a rare parallel transformer block.

Scale
3.35B parameters

Date
2026-02-13

Decoder type
Dense

Attention
GQA with 3:1 sliding-window attention

Key detail
Runs attention and the MLP in parallel while mixing RoPE with NoPE.
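
The difference between a sequential and a parallel transformer block fits in a few lines. This is a minimal sketch with toy sublayers; `rms_norm` omits the learnable scale, and `toy_attn` and `toy_mlp` are hypothetical stand-ins rather than real attention or MLP modules.

```python
import math

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def sequential_block(x, attn, mlp):
    # Standard layout: the MLP sees the output of the attention step.
    h = [a + b for a, b in zip(x, attn(rms_norm(x)))]
    return [a + b for a, b in zip(h, mlp(rms_norm(h)))]

def parallel_block(x, attn, mlp):
    # Parallel layout: attention and MLP read the same normalized input,
    # and both outputs are added to the residual in one step.
    n = rms_norm(x)
    return [a + b + c for a, b, c in zip(x, attn(n), mlp(n))]

def toy_attn(v):
    # Hypothetical stand-in: reverse the vector.
    return v[::-1]

def toy_mlp(v):
    # Hypothetical stand-in: shift every element.
    return [t + 1.0 for t in v]

x = [3.0, 4.0]
seq_out = sequential_block(x, toy_attn, toy_mlp)
par_out = parallel_block(x, toy_attn, toy_mlp)
```

Because the two branches in the parallel block do not depend on each other, they can in principle be computed at the same time.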

Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.

Scale
1T total, 63B active

Date
2026-02-15

Decoder type
Sparse hybrid

Attention
Lightning Attention plus MLA

Key detail
Uses a 7:1 linear-attention/MLA ratio and a much larger 63B active path.

Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.

Scale
397B total, 17B active

Date
2026-02-16

Decoder type
Sparse hybrid

Attention
3:1 Gated DeltaNet and Gated Attention

Key detail
Turns the former Qwen3-Next side branch into the new core design with 512 experts and 17B active parameters.

Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.

Scale
105B total

Date
2026-03-03

Decoder type
Sparse MoE

Attention
MLA with KV LayerNorm and NoPE + RoPE

Key detail
Large vocabulary and strong Indic language support carried into the larger MLA-based sparse MoE variant.

Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.

Scale
30B total

Date
2026-03-03

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Large vocabulary and strong Indic language support paired with a reasoning-focused sparse MoE design.