
deepseek-ai/DeepSeek-V3.2-Exp

Experimental DeepSeek-V3.2 preview with sparse attention (MQA-like logits) and FP8 KV cache; architecture matches DeepSeek-V3.1 except for the sparse attention mechanism.

MoE · 671B total / 37B active params · 163,840 token context · vLLM 0.12.0+ · text

Overview

DeepSeek-V3.2-Exp is an experimental sparse-attention MoE model. Its architecture matches DeepSeek-V3.1 except for the sparse attention mechanism. Only Hopper and Blackwell data-center GPUs are supported for now.

Prerequisites

  • Hardware: 8x H200, 8x H20, or 8x B200 GPUs
  • vLLM: Current stable release
  • DeepGEMM: Required for MQA logits computation (and optionally for MoE)
source .venv/bin/activate
# Install the latest vLLM; uv selects the matching PyTorch backend automatically
uv pip install -U vllm --torch-backend auto
# DeepGEMM is required for MQA logits computation
uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation

Note: DeepGEMM is required for MQA logits computation and is also used, optionally, for MoE. To disable only the MoE path, set VLLM_USE_DEEP_GEMM=0. Some users report better performance with this setting (e.g. on H20), and it also skips the long DeepGEMM warmup.
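For example, disabling the DeepGEMM MoE path for a single launch can be done via the environment variable (a sketch; the serve flags match the EP/DP command shown below):

```shell
# Disable the DeepGEMM MoE path (MQA logits still use DeepGEMM);
# reported to help on H20 and to skip the long warmup.
VLLM_USE_DEEP_GEMM=0 vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel
```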

Launching DeepSeek-V3.2-Exp

Serving on 8xH200 (or H20) GPUs

Using the recommended EP/DP mode:

vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel

Using tensor parallel:

vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8

Serving on 8xB200 GPUs

Use the same commands as for H200/H20 above.

Accuracy Benchmarking

lm-eval --model local-completions --tasks gsm8k \
  --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False

Reported GSM8K score: 0.9591 (5-shot) and 0.9538 (20-shot).
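Once the server is up, a quick smoke test against the OpenAI-compatible completions endpoint (the prompt and token budget are illustrative):

```shell
# Requires a running `vllm serve` instance on 127.0.0.1:8000
curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3.2-Exp", "prompt": "2+2=", "max_tokens": 8}'
```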

Performance Tips

  1. The kernels are mainly optimized for TP=1, so it is recommended to run this model in EP/DP mode (e.g. DP=8, EP=8, TP=1). If you hit errors or hangs, fall back to tensor parallel: plain TP is more robust, but its performance is not optimal.
  2. The default config uses a custom FP8 KV cache. You can switch to a bfloat16 KV cache by specifying kv_cache_dtype=bfloat16. FP8 fits more tokens in the cache but incurs quantization/dequantization overhead: prefer bfloat16 for short requests and FP8 for long ones.
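As a sketch, switching the KV cache to bfloat16 for a short-request workload (assuming the standard --kv-cache-dtype engine flag accepts bfloat16 for this model):

```shell
# bfloat16 KV cache: fewer cached tokens, but no quant/dequant overhead
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel \
  --kv-cache-dtype bfloat16
```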

Troubleshooting

  • CUDA error (flashmla-src/csrc/smxx/mla_combine.cu:201): invalid configuration argument: This may be caused by too large a batch size. Try --max-num-seqs 256 or smaller (default is 1024).
  • For thinking-mode toggling, refer to the DeepSeek-V3.1 recipe (deepseek-ai/DeepSeek-V3.1).
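The batch-size workaround for the mla_combine.cu error can be applied at launch, e.g.:

```shell
# Cap concurrent sequences (default 1024) to avoid the invalid-configuration
# CUDA error in flashmla mla_combine at large batch sizes
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel \
  --max-num-seqs 256
```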

References