deepseek-ai/DeepSeek-V3.2-Exp
Experimental DeepSeek-V3.2 preview with sparse attention (MQA-like logits) and FP8 KV cache; architecture matches DeepSeek-V3.1 except for the sparse attention mechanism.
Overview
DeepSeek-V3.2-Exp is an experimental sparse-attention MoE preview. Its architecture matches DeepSeek-V3.1 except for the new sparse attention mechanism. Only Hopper and Blackwell data-center GPUs are supported for now.
Prerequisites
- Hardware: 8x H200, 8x H20, or 8x B200 GPUs
- vLLM: Current stable release
- DeepGEMM: Required for MQA logits computation (and optionally for MoE)
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation
Note: DeepGEMM is used for both MoE and MQA logits computation, and it is required for MQA logits. To disable only the MoE path, set VLLM_USE_DEEP_GEMM=0. Some users report better performance with VLLM_USE_DEEP_GEMM=0 (e.g. on H20), and this also skips the long warmup.
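For example, the two settings above can be combined in a single launch; this is a sketch using the serve command and env var from this guide, so adjust flags to your setup:

```shell
# Disable only the DeepGEMM MoE path; MQA logits computation still uses DeepGEMM.
VLLM_USE_DEEP_GEMM=0 vllm serve deepseek-ai/DeepSeek-V3.2-Exp \
  -dp 8 --enable-expert-parallel
```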
Launching DeepSeek-V3.2-Exp
Serving on 8xH200 (or H20) GPUs
Using the recommended EP/DP mode:
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel
Using tensor parallel:
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8
Serving on 8xB200 GPUs
Same as the above.
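Once the server is up, a quick sanity check against the OpenAI-compatible completions endpoint looks like this (a minimal example; the model name must match what you served, and the prompt is arbitrary):

```shell
curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "prompt": "The capital of France is",
        "max_tokens": 16
      }'
```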
Accuracy Benchmarking
lm-eval --model local-completions --tasks gsm8k \
--model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False
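If lm-eval is not already installed, it can be added to the same virtual environment; this assumes the `api` extra covers the `local-completions` backend used above:

```shell
uv pip install "lm_eval[api]"
```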
Reported GSM8K scores: 0.9591 (5-shot) and 0.9538 (20-shot).
Performance Tips
- The kernels are mainly optimized for TP=1, so it is recommended to run this model under EP/DP mode (e.g. DP=8, EP=8, TP=1). If you hit any errors or hangs, try tensor parallel instead. Simple TP works and is more robust, but the performance is not optimal.
- The default config uses a custom fp8 KV cache. You can also use a bfloat16 KV cache by specifying `kv_cache_dtype=bfloat16`. FP8 allows more tokens to be cached but incurs quantization/dequantization overhead. Use bfloat16 for short requests and fp8 for long requests.
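To see why fp8 roughly doubles cache capacity, here is a back-of-the-envelope sketch. The layer count and per-token KV width below are illustrative assumptions, not the model's published dimensions; only the 1-byte-vs-2-byte ratio matters:

```shell
# Hypothetical per-token KV footprint (illustrative numbers only).
LAYERS=61                              # assumed number of layers
KV_DIM=576                             # assumed KV width per layer per token
BUDGET=$((10 * 1024 * 1024 * 1024))    # example 10 GiB KV-cache budget
BYTES_BF16=$((LAYERS * KV_DIM * 2))    # 2 bytes per element in bfloat16
BYTES_FP8=$((LAYERS * KV_DIM * 1))     # 1 byte per element in fp8
echo "tokens cached (bf16): $((BUDGET / BYTES_BF16))"
echo "tokens cached (fp8):  $((BUDGET / BYTES_FP8))"
```

With the same memory budget, fp8 holds about twice as many cached tokens, which is why it pays off for long requests despite the quantization overhead.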
Troubleshooting
- CUDA error (flashmla-src/csrc/smxx/mla_combine.cu:201): invalid configuration argument: This may be caused by too large a batch size. Try `--max-num-seqs 256` or smaller (the default is 1024).
- For thinking-mode toggling, refer to the DeepSeek-V3.1 recipe (deepseek-ai/DeepSeek-V3.1).
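If the DeepSeek-V3.1 convention carries over, thinking mode is toggled per request through `chat_template_kwargs`; this is an assumption based on the V3.1 recipe, so verify the exact key against it:

```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": "Hello"}],
        "chat_template_kwargs": {"thinking": true}
      }'
```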