vLLM/Recipes
GLM (Z-AI)

zai-org/GLM-4.6

GLM-4.6 MoE language model (~357B total parameters, BF16) with MTP speculative decoding, native tool calling, and reasoning

MoE · 357B total / 32B active params · 202,752 context · vLLM 0.11.0+ · text

Overview

GLM-4.6 is the successor to GLM-4.5 with ~357B total parameters. It retains the MoE architecture and built-in Multi-Token Prediction (MTP) layers used for speculative decoding. FP8 is the recommended precision for cost-efficient serving with minimal accuracy loss relative to BF16.

Prerequisites

  • vLLM version: >= 0.11.0 (latest stable recommended)
  • Hardware: 8x H200 (BF16) or 4x-8x H200 (FP8), AMD MI300X / MI325X / MI355X for ROCm
  • Python: 3.10 - 3.13 (3.12 required for ROCm wheels)
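Since the recipe requires vLLM >= 0.11.0, it can help to compare the installed version against that minimum. A minimal sketch; `parse_version` and `meets_minimum` are hypothetical helpers written here for illustration (for real projects, `packaging.version` is more robust):

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '0.11.0rc1' -> (0, 11, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.11.0") -> bool:
    """Compare release tuples, e.g. against vllm.__version__."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("0.10.2"))  # False: older than the 0.11.0 requirement
```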

Install vLLM (NVIDIA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Install vLLM (AMD ROCm)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm

Launching the Server

Tensor Parallel (FP8)

vllm serve zai-org/GLM-4.6-FP8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

Enabling MTP Speculative Decoding

vllm serve zai-org/GLM-4.6-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
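vLLM accepts the speculative config as a JSON string on the command line; if you generate launch commands programmatically, a small sketch like this keeps the argument well-formed:

```python
import json

# MTP speculative decoding settings matching the launch command above
spec_config = {"method": "mtp", "num_speculative_tokens": 1}
arg = json.dumps(spec_config)
# pass on the command line as: --speculative-config '<arg>'
print(arg)
```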

Tuning Tips

  • --max-model-len=65536 works well for most scenarios; the model supports up to 202,752 tokens (~200K).
  • --max-num-batched-tokens=32768 is a good default for prompt-heavy workloads.
  • --gpu-memory-utilization=0.95 maximizes KV cache headroom.
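Putting the tips together, a sketch that assembles a full launch command from these values (the model and parallelism flags mirror the FP8 example above; adjust to your setup):

```python
# Tuning defaults suggested above
tuning = {
    "--max-model-len": 65536,
    "--max-num-batched-tokens": 32768,
    "--gpu-memory-utilization": 0.95,
}

cmd = ["vllm", "serve", "zai-org/GLM-4.6-FP8", "--tensor-parallel-size", "8"]
for flag, value in tuning.items():
    cmd += [flag, str(value)]

print(" ".join(cmd))
```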

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-4.6-FP8",
    messages=[{"role": "user", "content": "Summarize MTP speculative decoding."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
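With the server launched with `--tool-call-parser glm45 --enable-auto-tool-choice`, tool calling goes through the standard OpenAI `tools` field. A sketch of the request shape; `get_weather` is a hypothetical tool invented for illustration:

```python
# Hypothetical tool definition (OpenAI function-calling schema)
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "zai-org/GLM-4.6-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
    "tool_choice": "auto",
}
# resp = client.chat.completions.create(**request)
# resp.choices[0].message.tool_calls holds the calls parsed by glm45
```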

Benchmarking

vllm bench serve \
  --model zai-org/GLM-4.6-FP8 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos
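For context on what this run pushes through the server, the rough token totals implied by the settings above:

```python
num_prompts, input_len, output_len = 16, 8000, 1000

total_input = num_prompts * input_len    # 128,000 prompt tokens
total_output = num_prompts * output_len  # 16,000 generated tokens (--ignore-eos forces full length)
print(total_input + total_output)        # 144000 tokens total
```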

Troubleshooting

  • MTP memory overhead: Monitor GPU memory and tune batch size when enabling MTP.
  • Tool calling not firing: Ensure --tool-call-parser glm45 --enable-auto-tool-choice are both present.
