
Diffusion language models (DLMs) offer a compelling promise: parallel token generation could break the sequential bottleneck of autoregressive (AR) decoding. Yet in practice, DLMs consistently lag behind AR models in quality.
We argue that this gap stems from a fundamental failure of introspective consistency: AR models agree with what they generate, whereas DLMs often do not. We introduce the Introspective Diffusion Language Model (I-DLM), which uses introspective strided decoding (ISD) to verify previously generated tokens while advancing new ones in the same forward pass.
Empirically, I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart, outperforming LLaDA-2.1-mini (16B) by +26 on AIME-24 and +15 on LiveCodeBench-v6 with half the parameters, while delivering 2.9-4.1x throughput at high concurrency. With gated LoRA, ISD enables bit-for-bit lossless acceleration.
Key Insight: AR training unifies generation and introspection in one forward pass. Existing DLMs miss this — they learn to denoise but not to introspect.
We identify three fundamental bottlenecks in current DLMs:

(1) Low introspective consistency: DLMs often disagree with the tokens they themselves generated (consistency 0.699 for SDAR vs. 0.984 for I-DLM).

(2) Compute inefficiency: parallel generation spends far more FLOPs per output token than AR decoding (~7.8x overhead for TiDAR vs. ~2.5x for I-DLM).

(3) Infrastructure mismatch: throughput fails to scale with concurrency on standard AR serving stacks (throughput-concurrency slope of 84 for SDAR vs. 549 for I-DLM).
1. Training: convert pretrained AR models via causal attention, a logit shift, and an all-masked objective.
2. Decoding: generate N tokens per forward pass while verifying prior tokens via the p/q acceptance criterion.
3. Serving: strict causal attention enables direct integration into SGLang with no custom infrastructure.

Decoding paradigm comparison. I-DLM is a drop-in replacement within AR serving infrastructure.
I-DLM is the first DLM to match same-scale AR quality while surpassing all prior DLMs across 15 benchmarks.
Blue = best non-AR <30B. Bold = best non-AR <100B.
| Benchmark | Qwen3 | Qwen3 | LLaDA-2.1 | LLaDA-2.0 | LLaDA-2.1 | SDAR | SDAR | Mercury | Gemini | I-DLM | I-DLM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Knowledge & Reasoning | |||||||||||
| ARC-C | 95.8 | 97.2 | 90.2 | --- | --- | 91.9 | 93.2 | --- | --- | 95.8 | 96.8 |
| MMLU | 83.5 | 87.2 | 74.5 | --- | --- | 78.6 | 82.8 | --- | --- | 82.4 | 86.8 |
| MMLU-Pro | 75.1 | 80.1 | 64.8 | 74.8 | 76.6 | 56.9 | 61.5 | --- | --- | 73.1 | 79.7 |
| GPQA-D | 58.9 | 64.1 | 46.0 | --- | --- | 40.2 | 36.7 | --- | --- | 55.6 | 62.1 |
| GPQA | 55.4 | 65.0 | 53.3 | 62.3 | 67.3 | --- | --- | --- | --- | 54.9 | 58.7 |
| Math | |||||||||||
| GSM8K | 96.0 | 94.7 | 89.0 | --- | --- | 91.7 | 91.4 | --- | --- | 95.0 | 94.9 |
| MATH-500 | 95.8 | 97.8 | 85.0 | --- | --- | 78.6 | 77.8 | --- | --- | 96.8 | 97.6 |
| MathBench | 93.1 | 95.5 | 84.2 | --- | --- | 76.9 | 79.3 | --- | --- | 89.1 | 95.6 |
| AIME-24 | 73.1 | 76.7 | 43.3 | --- | --- | 10.0 | 16.7 | --- | --- | 69.6 | 83.3 |
| AIME-25 | 65.4 | 80.0 | 43.3 | 60.0 | 63.3 | 10.0 | 10.8 | --- | --- | 60.8 | 80.0 |
| Code | |||||||||||
| HumanEval | 95.1 | 96.3 | 86.0 | --- | --- | 78.7 | 87.2 | 90.0 | 89.6 | 93.3 | 96.3 |
| MBPP | 93.4 | 95.7 | 82.1 | --- | --- | 72.0 | 71.6 | 76.6 | 76.0 | 92.2 | 94.6 |
| LCB-v6 | 50.3 | 58.3 | 30.4 | 42.5 | 45.4 | 16.6 | 21.7 | --- | --- | 45.7 | 57.1 |
| Instruction Following | |||||||||||
| IFEval | 84.7 | 84.5 | 83.2 | 82.6 | 83.6 | 61.4 | 60.6 | --- | --- | 84.7 | 84.7 |

Throughput-latency tradeoff compared with DLMs across batch sizes (1, 4, 16, 64). I-DLM delivers 2.9-4.1x higher throughput than LLaDA-2.1-mini and SDAR at C=64.
In the memory-bound decode regime, TPF closely approximates wall-clock speedup: a TPF of 2.5 represents roughly 2.5x faster decoding than AR. The acceptance rate p and the stride size N jointly determine this speedup.
Gated LoRA adds compute at MASK positions for bit-for-bit lossless output. α=1.12 matches empirical overhead.
Memory-bound: $\text{Speedup} \approx \text{TPF} = \dfrac{2 + p + \cdots + p^{N-2}}{2 - p^{N-1}}$
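As a quick sanity check, the formula above can be evaluated directly; a minimal sketch (the helper name `tpf` is ours, not part of the released code):

```python
def tpf(p: float, n: int) -> float:
    """Memory-bound tokens-per-forward for stride n and acceptance rate p:
    TPF = (2 + p + ... + p^(n-2)) / (2 - p^(n-1))."""
    numerator = 2 + sum(p ** k for k in range(1, n - 1))
    denominator = 2 - p ** (n - 1)
    return numerator / denominator

print(round(tpf(0.9, 4), 2))  # a stride of 4 at 90% acceptance gives TPF ≈ 2.92
```

Full acceptance (p = 1) recovers TPF = N: every drafted token survives.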
How do DLMs perform as they approach the compute-bound regime?
At high concurrency, forward pass latency scales with query count per forward. We can measure compute efficiency as TPF²/query_size — how much useful output each FLOP produces relative to AR (efficiency = 1):
Efficiency > 1 means parallel decoding actually saves total compute vs. AR. This is why I-DLM's throughput scales with concurrency while SDAR and LLaDA plateau in the throughput figure above.
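The efficiency metric can be computed from the same quantities; a sketch, assuming query_size equals the stride N (our reading of the setup) and the memory-bound TPF formula:

```python
def compute_efficiency(p: float, n: int) -> float:
    """Efficiency = TPF^2 / query_size, relative to AR at efficiency 1.
    Assumes each forward pass carries n query tokens (query_size = n)."""
    tpf = (2 + sum(p ** k for k in range(1, n - 1))) / (2 - p ** (n - 1))
    return tpf ** 2 / n

print(compute_efficiency(0.9, 4) > 1.0)  # high acceptance: saves total compute vs. AR
print(compute_efficiency(0.3, 4) > 1.0)  # low acceptance: wastes compute
```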
Acceptance compounds geometrically: position k has probability $p^{k-1}$. Position 1 is always accepted (logit shift).
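Concretely (illustrative helper; positions are 1-based as in the text):

```python
def positional_acceptance(p: float, n: int) -> list[float]:
    # Position 1 is always accepted thanks to the logit shift; position k is
    # reached only if every earlier draft survived, so its probability is p**(k-1).
    return [p ** (k - 1) for k in range(1, n + 1)]

probs = positional_acceptance(0.9, 4)
print(probs[0])    # position 1: always accepted
print(sum(probs))  # expected number of accepted drafts per forward (linearity of expectation)
```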
Everything you need to train, serve, and deploy I-DLM. Click any card to expand.
```shell
git clone https://github.com/Introspective-Diffusion/I-DLM.git
cd I-DLM/inference
bash install.sh
```
See inference/README.md for detailed environment setup.
1. Launch server:
```shell
python -m sglang.launch_server \
  --model-path yifanyu/I-DLM-8B \
  --trust-remote-code --tp-size 1 --dtype bfloat16 \
  --mem-fraction-static 0.85 --max-running-requests 32 \
  --attention-backend flashinfer --dllm-algorithm IDLMBlockN \
  --dllm-algorithm-config inference/configs/idlm_blockN4_config.yaml \
  --port 30000
```
2. Generate:
```shell
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "max_tokens": 4096, "temperature": 1.0
  }'
```
Convert a pretrained AR model into I-DLM via introspective-consistency training:
The model is trained on concatenated inputs of the form [x_t | x_0]. See training/README.md for scripts and configs.
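A toy sketch of how a training example might be laid out under this objective (the MASK id, function name, and exact target layout are our illustrative assumptions, not the released recipe):

```python
MASK = 0  # hypothetical mask-token id

def build_training_example(x0: list[int]) -> tuple[list[int], list[int]]:
    """All-masked objective on concatenated input [x_t | x_0]: x_t is x0 with
    every position masked. With the AR-style logit shift, each prediction is
    read one position earlier, so the first masked position behaves exactly
    like AR next-token prediction."""
    x_t = [MASK] * len(x0)
    inputs = x_t + x0   # [x_t | x_0]
    targets = list(x0)  # each masked position is supervised with its clean token
    return inputs, targets

inp, tgt = build_training_example([5, 7, 9])
print(inp)  # [0, 0, 0, 5, 7, 9]
```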
Introspective Strided Decoding (ISD) generates and verifies in a single forward pass:
The acceptance criterion min(1, p(x)/q(x)) guarantees that outputs follow the AR model's distribution. See inference/README.md for algorithm configs.
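The verification step follows standard speculative-sampling acceptance; a minimal sketch with toy distributions (helper names are ours, and the resampling step after a rejection is omitted):

```python
import random

def verify_drafts(drafts, p_dists, q_dists, rand=random.random):
    """Accept drafted tokens left to right: token x survives with probability
    min(1, p(x)/q(x)); the first rejection stops acceptance, since later
    positions were drafted conditioned on the rejected token."""
    accepted = []
    for x, p, q in zip(drafts, p_dists, q_dists):
        if rand() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            break
    return accepted

# Toy 3-token vocabulary: p is the target (AR) model, q the diffusion proposal.
p_dist = [0.7, 0.2, 0.1]
q_dist = [0.5, 0.3, 0.2]
out = verify_drafts([0, 0], [p_dist] * 2, [q_dist] * 2)
print(out)  # token 0 has p/q = 1.4 >= 1, so both drafts are accepted: [0, 0]
```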
I-DLM uses strict causal attention, enabling direct integration into SGLang with no custom infrastructure:
The full system achieves 2.1-2.5x higher throughput than a naive baseline.
Residual ISD (R-ISD) adds a gated LoRA adapter for bit-for-bit lossless acceleration:
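A small sketch of the gated-LoRA idea (shapes and names are illustrative): the low-rank update is applied only at MASK positions, so non-mask positions compute the frozen base layer exactly, which is what makes lossless acceleration possible:

```python
import numpy as np

def gated_lora_forward(x, W, A, B, is_mask):
    """y = x @ W.T + gate * ((x @ A.T) @ B.T), with gate = 1 only at MASK
    positions. Rows where gate = 0 reproduce the base projection exactly."""
    base = x @ W.T            # frozen base projection
    delta = (x @ A.T) @ B.T   # rank-r LoRA update
    gate = is_mask[:, None].astype(x.dtype)
    return base + gate * delta

rng = np.random.default_rng(0)
seq, d, r = 4, 8, 2
x = rng.standard_normal((seq, d))
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))
is_mask = np.array([False, False, True, True])  # last two positions are MASKs
y = gated_lora_forward(x, W, A, B, is_mask)
print(np.array_equal(y[:2], (x @ W.T)[:2]))  # non-mask rows match the base exactly
```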
| Model | Base | Description |
|---|---|---|
| I-DLM-8B | Qwen3-8B | Main model, matches AR quality |
| I-DLM-32B | Qwen3-32B | Large scale, outperforms LLaDA-2.1-flash (100B) |
| I-DLM-8B-LoRA | Qwen3-8B | Gated LoRA (rank=128) for lossless R-ISD |
All models use trust_remote_code=True (custom SDARForCausalLM architecture).
We evaluate on 15 benchmarks across 4 categories with thinking mode enabled:
See inference/eval/ for reproduction scripts.
@article{yu2026introspective,
title={Introspective Diffusion Language Models},
author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu
and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri
and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon
and Dao, Tri and Athiwaratkun, Ben and Zou, James
and Lai, Fan and Xu, Chenfeng},
journal={arXiv preprint arXiv:7471639},
year={2026}
}
© 2025 I-DLM Team. Built with care.