Qwen/Qwen3-Next-80B-A3B-Instruct
Advanced Qwen3-Next MoE model (80B total / 3B active) with hybrid attention, highly sparse experts, and multi-token prediction.
Overview
Qwen3-Next is an advanced LLM from the Qwen team featuring:
- A hybrid attention mechanism
- A highly sparse Mixture-of-Experts (MoE) structure
- Training-stability-friendly optimizations
- A multi-token prediction mechanism for faster inference
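The sparse MoE design is why only about 3B of the 80B parameters are active per token: a router scores all experts but only the top-k actually run. A toy sketch of top-k routing (illustrative only, not the Qwen3-Next implementation; the expert count, scoring rule, and k below are made up):

```python
import math
import random

def top_k_route(logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their
    scores so the selected experts' weights sum to 1."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in idx)
    exp = [math.exp(logits[i] - m) for i in idx]
    total = sum(exp)
    return [(i, e / total) for i, e in zip(idx, exp)]

def moe_layer(x, experts, router, k=2):
    """Sparse MoE forward for one token: evaluate only the k routed
    experts and combine their outputs by the routing weights."""
    return sum(w * experts[i](x) for i, w in top_k_route(router(x), k))

# Tiny demo: 8 scalar "experts", each just scales its input.
random.seed(0)
scales = [random.uniform(0.5, 2.0) for _ in range(8)]
experts = [lambda x, s=s: s * x for s in scales]
router = lambda x: [s * x for s in scales]  # made-up scoring rule
y = moe_layer(1.0, experts, router, k=2)
```

The key property: compute per token scales with k, not with the total number of experts.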
Prerequisites
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
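After installing, a quick stdlib-only check confirms the packages are importable from the active venv (this only probes for the modules, it does not load them):

```python
import importlib.util

# Check that the packages installed above are importable from this venv.
for pkg in ("vllm", "torch"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING (rerun the install step)'}")
```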
Deployment Configurations
The configurations below assume 4x H200/H20 or 4x A100/A800 GPUs.
Basic Multi-GPU (BF16)
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
--tensor-parallel-size 4 \
--served-model-name qwen3-next \
--enable-prefix-caching
If you hit torch.AcceleratorError: CUDA error: an illegal memory access was encountered, add --compilation_config.cudagraph_mode=PIECEWISE.
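Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch, assuming the default port 8000 and the --served-model-name from the command above (adjust base_url for your deployment):

```python
import json
import urllib.request

def build_payload(prompt, model="qwen3-next", max_tokens=128):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8000/v1", model="qwen3-next"):
    """Send one chat completion request to the vLLM server."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires a running server):
# print(chat("Summarize mixture-of-experts in one line."))
```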
FP8 (SM90/SM100)
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--tensor-parallel-size 4 \
--enable-prefix-caching
On SM100, accelerate with the FP8 FlashInfer TRTLLM MoE kernel:
VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_TRTLLM_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--tensor-parallel-size 4
MTP (Multi-Token Prediction)
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
--tokenizer-mode auto --gpu-memory-utilization 0.8 \
--speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}' \
--tensor-parallel-size 4 --no-enable-chunked-prefill
Tool / Function Calling
vllm serve ... --tool-call-parser hermes --enable-auto-tool-choice
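With the hermes parser and auto tool choice enabled, clients pass OpenAI-style tool definitions in the request body. A sketch of such a request (get_weather is a made-up example tool):

```python
import json

# Made-up example tool in the OpenAI function-calling schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "qwen3-next",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}

print(json.dumps(request_body, indent=2))
```

If the model decides to call the tool, the response's choices[0].message.tool_calls carries the function name and JSON arguments; run the tool yourself and send the result back as a role "tool" message in the next request.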
AMD (MI300X/MI325X/MI355X)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
SAFETENSORS_FAST_GPU=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--no-enable-prefix-caching \
--trust-remote-code
Client Usage
Benchmark the running server:
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-Next-80B-A3B-Instruct \
--served-model-name qwen3-next \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
Troubleshooting
- Sub-optimal MoE performance warning: tune the MoE Triton kernel with benchmark_moe, then set VLLM_TUNED_CONFIG_FOLDER to the directory containing the generated config.
- IMA error in DP mode: add --compilation_config.cudagraph_mode=PIECEWISE.
- For more parallel topologies, see the Data Parallel Deployment docs.