deepseek-ai/DeepSeek-V3
DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model with native FP8 weights and strong reasoning, coding, and math capabilities.
Overview
DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model (37B activated per token)
shipped with native FP8 weights. It shares its architecture with DeepSeek-R1, so the
same launch recipes apply to both models. For Blackwell GPUs, NVIDIA publishes an FP4
quantized variant (nvidia/DeepSeek-V3-FP4 / nvidia/DeepSeek-R1-FP4) that runs on
fewer GPUs.
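As a rough sanity check on those GPU counts, weight memory alone scales with bytes per parameter. The sketch below is a back-of-the-envelope estimate that ignores KV cache, activations, and runtime overhead, and assumes 141 GB of HBM per H200 and 192 GB per B200:

```python
# Weight-memory-only estimate for a ~671e9-parameter model.
PARAMS = 671e9
fp8_gb = PARAMS * 1.0 / 1e9  # 1 byte per parameter at FP8
fp4_gb = PARAMS * 0.5 / 1e9  # 0.5 byte per parameter at FP4

print(f"FP8 weights: ~{fp8_gb:.0f} GB vs 8x H200 = {8 * 141} GB HBM")
print(f"FP4 weights: ~{fp4_gb:.0f} GB vs 4x B200 = {4 * 192} GB HBM")
```

FP8 weights (~671 GB) do not fit on four B200s, which is why the FP4 variant is what enables the smaller deployment.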
Prerequisites
- Hardware (FP8): 8x H200 GPUs (verified)
- Hardware (FP4): 4x B200 GPUs
- vLLM: Install the latest release in a fresh virtual environment:
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Serving
8xH200 (FP8)
Tensor Parallel + Expert Parallel (TP8+EP):
vllm serve deepseek-ai/DeepSeek-V3 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-expert-parallel
Data Parallel + Expert Parallel (DP8+EP):
vllm serve deepseek-ai/DeepSeek-V3 \
--trust-remote-code \
--data-parallel-size 8 \
--enable-expert-parallel
4xB200 (FP4)
Enable FlashInfer MoE kernels before launching:
# For FP4 (recommended on Blackwell)
export VLLM_USE_FLASHINFER_MOE_FP4=1
# For FP8 on Blackwell
export VLLM_USE_FLASHINFER_MOE_FP8=1
Tensor Parallel + Expert Parallel (TP4+EP):
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-V3-FP4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel
Data Parallel + Expert Parallel (DP4+EP):
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-V3-FP4 \
--trust-remote-code \
--data-parallel-size 4 \
--enable-expert-parallel
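Any of the launches above exposes an OpenAI-compatible API (on port 8000 by default). Below is a minimal sketch of the request body for the `/v1/chat/completions` endpoint; the helper name `chat_payload` and the example prompt are illustrative, not part of vLLM:

```python
import json

def chat_payload(model, prompt, max_tokens=256):
    """Build a chat-completions request body for vLLM's OpenAI-compatible server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_payload("deepseek-ai/DeepSeek-V3", "Explain MoE routing in two sentences.")
print(json.dumps(body, indent=2))
# POST this body to http://localhost:8000/v1/chat/completions, e.g. with:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @body.json
```

For the FP4 launches, set `model` to `nvidia/DeepSeek-V3-FP4` to match the served model name.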
Benchmarking
For benchmarking, disable prefix caching by adding --no-enable-prefix-caching
to the server command.
# Prompt-heavy benchmark (8k/1k)
vllm bench serve \
--model deepseek-ai/DeepSeek-V3 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Test different workloads by adjusting input/output lengths:
- Prompt-heavy: 8000 input / 1000 output
- Decode-heavy: 1000 input / 8000 output
- Balanced: 1000 input / 1000 output
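The three profiles differ only in the two length flags. A tiny helper (hypothetical, not part of vLLM) makes the mapping explicit:

```python
# Map each workload profile to its (input_len, output_len) pair and emit
# the corresponding `vllm bench serve` flags.
WORKLOADS = {
    "prompt-heavy": (8000, 1000),
    "decode-heavy": (1000, 8000),
    "balanced": (1000, 1000),
}

def bench_flags(profile):
    in_len, out_len = WORKLOADS[profile]
    return f"--random-input-len {in_len} --random-output-len {out_len}"

for name in WORKLOADS:
    print(name, "->", bench_flags(name))
```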
Troubleshooting
- Disaggregated Serving with Wide EP (Experimental GB200): See vLLM issue #33583 and the vLLM blog post for GB200 disaggregated serving recipes.