James Routley


GLM-5.2 is Z.ai’s new open model, delivering SOTA performance across long-horizon coding, reasoning, and agentic tasks. With 744B parameters, 40B active parameters, and a 1M context window, it can now be run locally using Unsloth Dynamic GGUFs. GLM-5.2 is the strongest open model to date, performing on par with Claude 4.8 Opus, GPT-5.5, and Gemini 3.1 Pro across Artificial Analysis and many other benchmarks.

Dynamic 1-bit reaches ~76.2% top-1 accuracy while being 86% smaller, and dynamic 2-bit reaches ~82% while being 84% smaller. In other words, an 86% smaller model is not 86% worse: its top-1 accuracy is only ~24% below that of the full 1.5TB model. Thanks to Z.ai for giving Unsloth day-zero access. GLM-5.2-GGUF


⚙️ Usage Guide

The 2-bit dynamic quant UD-IQ2_M uses 239GB of disk space. It fits directly on a Mac with 256GB of unified memory, and also works well on a single 24GB GPU with 256GB of RAM using MoE offloading. The 1-bit quant fits in 223GB of RAM, and 8-bit requires 810GB of RAM.

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

For best performance, make sure your total available memory, including VRAM and system RAM, exceeds the quantized model file size by a comfortable margin.
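The rule of thumb above can be sketched as a quick check. This is a minimal illustration, not part of any Unsloth or llama.cpp tooling: the function name `fits_in_memory` and the 8GB default headroom are assumptions chosen for the example, and the check ignores KV cache and context size, which add to real memory use.

```python
def fits_in_memory(model_gb: float, ram_gb: float, vram_gb: float = 0.0,
                   headroom_gb: float = 8.0) -> bool:
    """Return True if RAM + VRAM covers the quant file plus some headroom.

    headroom_gb is a hypothetical margin for the OS and runtime buffers;
    the guide only asks for "a comfortable margin" and gives no number.
    """
    return ram_gb + vram_gb >= model_gb + headroom_gb


# File size quoted in this guide for the 2-bit dynamic quant.
UD_IQ2_M_GB = 239

# 256GB unified-memory Mac (pass unified memory as ram_gb).
print(fits_in_memory(UD_IQ2_M_GB, ram_gb=256))
# One 24GB GPU plus 256GB of system RAM with MoE offloading.
print(fits_in_memory(UD_IQ2_M_GB, ram_gb=256, vram_gb=24))
# A smaller machine that would have to spill to disk.
print(fits_in_memory(UD_IQ2_M_GB, ram_gb=128, vram_gb=24))
```

Adjust `headroom_gb` upward if you plan to use long contexts.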

Recommended Settings

GLM-5.2 has three thinking modes: non-thinking, and thinking at two effort levels, High and Max. Use Max thinking for complicated tasks. In Unsloth Studio, you can toggle between non-thinking, High, and Max in the UI.

Use these settings for most use cases:

Maximum context window: 1,048,576.

Disabling Thinking, changing reasoning effort

GLM-5.2 reasons by default. It also supports reasoning effort levels: reasoning_effort can be "high" or "max", or reasoning can be disabled.

To disable thinking, use --chat-template-kwargs '{"enable_thinking":false}'. If you're on Windows PowerShell, use: --chat-template-kwargs "{\"enable_thinking\":false}"
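The two quoting styles differ only in how the same JSON payload is escaped for the shell. If you build the command from a script, a small sketch like the following can generate both forms. The helper names are hypothetical, and the PowerShell escaping simply mirrors the example given above rather than covering every PowerShell quoting rule.

```python
import json
import shlex


def chat_template_kwargs(enable_thinking: bool) -> str:
    """Compact JSON payload for the --chat-template-kwargs flag."""
    return json.dumps({"enable_thinking": enable_thinking}, separators=(",", ":"))


def posix_arg(payload: str) -> str:
    """Quote the payload for macOS, Linux, and WSL shells."""
    return shlex.quote(payload)


def powershell_arg(payload: str) -> str:
    """Escape inner double quotes with backslashes, as in the PowerShell example."""
    return '"' + payload.replace('"', '\\"') + '"'


payload = chat_template_kwargs(False)
print("--chat-template-kwargs", posix_arg(payload))
print("--chat-template-kwargs", powershell_arg(payload))
```

Building the payload with `json.dumps` avoids hand-typed JSON mistakes such as Python's `False` leaking into the argument.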

You can now also use --reasoning on or --reasoning off in llama.cpp!

To customize reasoning effort or disable reasoning, use the examples below:

📈 Quantization analysis

We also ran KLD (KL Divergence) benchmarks to gauge the accuracy of our quantizations of GLM-5.2-GGUF. Dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are mostly lossless, and smaller quants also work well because important layers are dynamically kept in higher precision while unimportant ones are quantized to fewer bits.

On pure top-1 accuracy, dynamic 1-bit reaches around 76.2% while being 86% smaller! Dynamic 2-bit reaches around 82% while being 84% smaller. This shows that dynamically keeping some layers in higher precision means an 86% smaller model is not 86% worse: its top-1 accuracy is only ~24% below that of the full 1.5TB model.

But what does "76% accuracy" actually describe?

76% top-1 does NOT mean that "The capital of France is" gets completed with Paris 76% of the time and Sydney 24% of the time. The model is NOT "dumber" by 24%. For this prompt, Paris is always 100% and Sydney is 0%. The 76% figure is measured over every token in the entire corpus, including filler words and stop words. For example:

Because of LLM sampling, the prompt "Create a novel" can produce any of these openings:

I will now create a novel...
The novel is below:
What genre would you like it to be?

Each example is correct, but the [I, The, What] distribution is what changes - the baseline might use [I] 100% of the time, but now [I] is 76% and [The] is 24%.

It does NOT mean that you get gibberish or incorrect outputs 24% of the time.

99.9% KLD is also generally good. There is a larger uplift from 4-bit onwards though, so for heavily out-of-distribution tasks, dynamic 4-bit is probably best.

Top-1 is a "forced" binomial distribution of the KLD itself. KLD is the "distance" between the probabilities of the baseline (BF16 or Q8_0) and those of the quantized version. The goal of quantization is to minimize the objective below:

$\text{minimize } \frac{1}{n} \sum D_{\text{KL}}\left[\, f(q(W)) \,\|\, f(W) \,\right]$

Here f is the language model's forward pass, q is the quantization operation, and W is the model's parameters (weights). The goal is to make the "distance" between the output logits of the baseline f(W) and those of the quantized model f(q(W)) as small as possible. If you reach 0 KLD, you have perfectly reconstructed the model!

We use mean KLD, as shown below, since it's expensive to run KLD across the full training corpus (15T tokens, for example). Instead, we sample a small representative subset of the training corpus / downstream task and optimize on that. Mean KLD generally follows a monotonic trend vs disk space, and shows that GLM-5.2 works well even at 1-bit!
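To make the objective concrete, here is a minimal sketch of mean KLD over a handful of token positions. It uses toy logits and plain Python, keeps the argument order written in the objective above, D_KL[f(q(W)) || f(W)], and is an illustration only. It is not the benchmark code used for these results, and the function names are assumptions for this example.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL[p || q] in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_kld(quant_logits, base_logits):
    """Average D_KL[quantized || baseline] over sampled token positions."""
    per_position = [
        kl_divergence(softmax(ql), softmax(bl))
        for ql, bl in zip(quant_logits, base_logits)
    ]
    return sum(per_position) / len(per_position)


# Toy logits over a 3-token vocabulary at two positions.
base = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.5]]
quant = [[1.8, 1.1, 0.2], [0.3, 2.7, 0.6]]

print(mean_kld(base, base))   # identical outputs give zero divergence
print(mean_kld(quant, base))  # small positive value for a close approximation
```

A value of 0 means the quantized outputs match the baseline exactly; larger values mean the probability distributions have drifted further apart.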

Top-1 accuracy is simply a greedy decoding operator: we assume the argmax token is picked, and for 1-bit, it matches the baseline's argmax 76% of the time.
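As a small sketch of that definition, the snippet below counts how often the quantized model's argmax token matches the baseline's argmax across positions. The logits are made-up toy values and the function names are hypothetical.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top1_agreement(base_logits, quant_logits):
    """Fraction of positions where both models pick the same greedy token."""
    matches = sum(
        argmax(b) == argmax(q) for b, q in zip(base_logits, quant_logits)
    )
    return matches / len(base_logits)


# Four token positions over a 3-token vocabulary.
base = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.5], [1.5, 0.3, 0.2], [0.1, 0.2, 2.5]]
quant = [[1.8, 1.1, 0.2], [0.3, 2.7, 0.6], [0.4, 1.6, 0.2], [0.2, 0.1, 2.2]]

print(top1_agreement(base, quant))  # 0.75: three of four positions agree
```

Note that the one disagreeing position is not necessarily a wrong answer. It only means the quantized model's most likely token differs from the baseline's at that position.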

Run GLM-5.2 Tutorials:

You can now run GLM-5.2 in llama.cpp and Unsloth Studio. We will use the 239GB UD-IQ2_M quant, which offers the best balance of accessibility and accuracy.

🦥 Run GLM-5.2 in Unsloth Studio

GLM-5.2 can run in Unsloth Studio, an open-source web UI for local AI. Unsloth Studio automatically offloads to RAM and detects multi-GPU setups. With Unsloth Studio, you can run models locally on macOS, Windows, and Linux, and:

Search, download, run GGUFs and safetensor models
Fast CPU + GPU inference via llama.cpp

Install and Launch Unsloth

To install, run in your terminal:

macOS, Linux, WSL:

Windows PowerShell:

Launch Unsloth

macOS, Linux, WSL and Windows:

Then open http://127.0.0.1:8888 (or your specific URL) in your browser.

Launch Unsloth securely with HTTPS and Cloudflare

NEW! Unsloth now provides a secure way to launch Studio over HTTPS through a free Cloudflare tunnel. Use the command below (works on Windows, Mac & Linux):

Search and download GLM-5.2

On first launch, you will need to create a password to secure your account. You will use it to sign in again later.

Then go to the Studio Chat tab, search for GLM-5.2 in the search bar, and download your desired model and quant. Ensure you have enough compute to run the model.

Run GLM-5.2

Inference parameters should be set automatically when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

🦙 Run GLM-5.2 in llama.cpp

For this guide, we'll be running the UD-IQ2_M quant, which requires at least 245GB of RAM. Feel free to change the quantization type. For these tutorials, we will be using llama.cpp for fast local inference. GGUF: GLM-5.2-GGUF

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.

You can now use llama.cpp directly to load and download models, just like ollama run. First, select the quantization type you want like UD-IQ2_M. Also use export LLAMA_CACHE="unsloth/GLM-5.2-GGUF" to force llama.cpp to save to a specific location. Note this download process might be very slow, so it's probably best to use the manual download process in the next section.

If you want to download the model manually (much faster!), use the code below after running pip install huggingface_hub. If downloads get stuck, see: Hugging Face Hub, XET debugging
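A minimal sketch of a manual download with `huggingface_hub.snapshot_download` might look like the following. The helper names are hypothetical, and the `*UD-IQ2_M*` glob is an assumption based on the shard paths quoted later in this guide; check the repository file listing before relying on it.

```python
def quant_patterns(quant: str) -> list[str]:
    """Glob patterns selecting only the shards of one quantization type."""
    return [f"*{quant}*"]


def download_quant(quant: str = "UD-IQ2_M",
                   repo_id: str = "unsloth/GLM-5.2-GGUF") -> str:
    """Download one quant into a local folder named after the repo.

    Requires `pip install huggingface_hub`. The import is kept inside the
    function so the pattern helper above works without the package.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=repo_id,                     # e.g. unsloth/GLM-5.2-GGUF/
        allow_patterns=quant_patterns(quant),  # skip every other quant
    )


# Example usage (downloads hundreds of GB, so it is left commented out):
# download_quant("UD-IQ2_M")
```

Restricting `allow_patterns` matters here because the repository holds several quants, and downloading all of them would take far more disk space than a single one.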

If you want to use the dynamic 1-bit quant, do:

Then run the model in conversation mode. Use unsloth/GLM-5.2-GGUF/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf for 2-bit or unsloth/GLM-5.2-GGUF/UD-IQ1_S/GLM-5.2-UD-IQ1_S-00001-of-00006.gguf for 1-bit.

When you launch llama-cli, you will see:

Then after prompting it to make a short Flappy Bird game, we get:

With the full conversation and game below:

Full game in HTML

Full conversation

The game has sound and works wonderfully! As a reminder, this was a 1-bit quantization, and it worked well!

📐Long context via KV Cache quantization

To utilize long context in llama.cpp, we need to employ KV cache quantization to reduce memory usage. Recently llama.cpp added higher accuracy tricks to KV cache quantization - see and other PRs!

Currently, these KV cache dtypes are supported:

By default, f16 is used. If you use q4_0, which is around 4.5 bits per weight, you can extend context length by around 16 / 4.5 = 3.5x! So if your model used to support 10K, 35K can be within reach! q4_1 is probably better since you also get a shifting parameter, and it uses 5 bits per weight - so 3.2x longer contexts.
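The arithmetic above can be reproduced with a short sketch. It assumes, as the paragraph does, that KV cache memory scales linearly with bits per value and that everything else stays fixed; the function name and the 10K starting budget are just for illustration.

```python
F16_BITS = 16.0  # default KV cache precision


def context_multiplier(bits_per_value: float) -> float:
    """How much longer the context can be at the same KV cache memory."""
    return F16_BITS / bits_per_value


# Bits-per-value figures quoted in the paragraph above.
kv_types = {"q4_0": 4.5, "q4_1": 5.0}

for name, bits in kv_types.items():
    mult = context_multiplier(bits)
    tokens = round(10_000 * mult)
    print(f"{name}: {mult:.2f}x, so a 10K budget becomes about {tokens:,} tokens")
```

The exact q4_0 ratio is closer to 3.56x, which the text rounds down to 3.5x.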

Use it like below:

📊 Benchmarks

GLM-5.2 benchmarks are shown below in table format:


GLM-5.2 – How to Run Locally