
deepseek-ai/DeepSeek-V3.2

DeepSeek V3.2 MoE model with MLA attention, sparse attention, and scalable RL for strong reasoning and agent capabilities.

MoE, 671B total / 37B active parameters, 163,840-token context, vLLM 0.18.0+, text

Overview

DeepSeek-V3.2 is a Mixture-of-Experts model that balances computational efficiency with strong reasoning and agent capabilities through three technical innovations: DeepSeek Sparse Attention (DSA) for efficient long-context processing, a scalable reinforcement learning framework achieving GPT-5-level performance, and a large-scale agentic task synthesis pipeline for robust tool-use generalization.

Prerequisites

  • Hardware: Minimum 8x H100/H200 80GB GPUs (BF16) or 3x H200 (NVFP4 variant).
  • vLLM: Version 0.18.0 or later (nightly recommended).
  • Python: 3.10+
  • CUDA: 12.x or later (CUDA 13.x may require extra env vars; see Troubleshooting).
  • Disk: ~1.3 TB for BF16 weights; ~350 GB for the NVFP4 variant (see the size estimate after this list).
  • DeepGEMM (recommended):
    uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation
    
    Note: Set VLLM_USE_DEEP_GEMM=0 to disable MoE DeepGEMM if you experience issues (e.g., on H20 GPUs) or want to skip the long warmup.
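
A quick back-of-the-envelope check of the disk figures above, assuming 2 bytes per parameter for BF16 and roughly 0.5 bytes per parameter (before per-block scaling factors) for NVFP4; treat these as rough estimates rather than exact checkpoint sizes.

# Rough checkpoint-size estimate for a 671B-parameter model
total_params = 671e9
print(f"BF16:  ~{total_params * 2 / 1e12:.2f} TB")   # ≈ 1.34 TB
print(f"NVFP4: ~{total_params * 0.5 / 1e9:.0f} GB")  # ≈ 336 GB, plus scaling-factor overhead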

Client Usage

Launch the server:

vllm serve deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --kernel-config.enable_flashinfer_autotune=False \
  --tokenizer-mode deepseek_v32 \
  --tool-call-parser deepseek_v32 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v3

Use the OpenAI Python SDK to interact with the server:

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8000/v1",
)

# Standard chat
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Thinking / reasoning mode
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",
    messages=[{"role": "user", "content": "Solve this step by step..."}],
    extra_body={"chat_template_kwargs": {"thinking": True}},
)
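
The server flags above also enable reasoning and tool-call parsing. The sketch below shows how to consume both from the client side; it assumes the launch command from this guide, and the get_weather tool schema is a hypothetical placeholder, not part of the model or the server. With --reasoning-parser deepseek_v3, vLLM returns the reasoning trace as a separate reasoning_content field on the message; with --tool-call-parser deepseek_v32 and --enable-auto-tool-choice, the standard OpenAI tools parameter works as usual.

# Reasoning trace vs. final answer (reasoning_content is a vLLM extension field)
message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # reasoning trace, if any
print(message.content)                              # final answer

# Tool calling via the standard OpenAI tools parameter
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# Parsed tool calls, if the model decided to use a tool
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)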

Troubleshooting

ptxas fatal: Value 'sm_110a' is not defined for option 'gpu-name'

This error can occur on CUDA 13.x. Fix it by exporting:

export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}

TP=8 performance on Hopper/Blackwell

Avoid -tp 8 with FlashMLA-Sparse. Due to kernel restrictions, TP=8 leaves only 16 attention heads per rank, which are padded to 64, causing significant overhead. Prefer TP=2 (Hopper) or TP=1 (Blackwell) with DP/EP mode:

vllm serve deepseek-ai/DeepSeek-V3.2 -dp 8 --enable-expert-parallel

DeepGEMM warmup too slow

Set VLLM_USE_DEEP_GEMM=0 to disable MoE DeepGEMM and skip the long warmup.

References