deepseek-ai/DeepSeek-V3.1
DeepSeek-V3.1 is a hybrid MoE model that supports dynamic switching between thinking and non-thinking modes, along with tool and function calling.
Overview
DeepSeek-V3.1 is a hybrid MoE model that supports both thinking and non-thinking modes.
You can dynamically switch between the two modes from the client by passing
extra_body={"chat_template_kwargs": {"thinking": True|False}}.
Prerequisites
- Hardware: 8x H200 GPUs (141 GB per GPU), or equivalent (e.g., H20)
- vLLM: Current stable release
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Launching DeepSeek-V3.1
Serving on 8xH200 (or H20) GPUs
vllm serve deepseek-ai/DeepSeek-V3.1 \
--enable-expert-parallel \
--tensor-parallel-size 8 \
--served-model-name ds31
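As a rough sanity check on why 8 GPUs of this size are needed (a back-of-envelope approximation, not an official sizing guide), the model's ~671B total parameters at FP8 (~1 byte per parameter) fit within the 8x141 GB aggregate memory, leaving headroom for KV cache and activations:

```python
# Back-of-envelope memory estimate (approximation; assumes FP8 weights at ~1 byte/param).
total_params_b = 671            # DeepSeek-V3.1 total parameters, in billions
bytes_per_param = 1             # FP8
weight_gb = total_params_b * bytes_per_param   # ~671 GB of weights

cluster_gb = 8 * 141            # 8x H200, 141 GB each
headroom_gb = cluster_gb - weight_gb           # left for KV cache, activations, CUDA graphs

print(f"weights: ~{weight_gb} GB, cluster: {cluster_gb} GB, headroom: ~{headroom_gb} GB")
```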
Function calling
vLLM supports user-defined tool calling for DeepSeek-V3.1. Add the flags below when launching
the server. The example chat template, tool_chat_template_deepseekv31.jinja, ships in the
official container and can also be downloaded from the vLLM repo.
vllm serve ... \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--chat-template examples/tool_chat_template_deepseekv31.jinja
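On the client side, tools are declared in the OpenAI function-calling schema and the model's tool calls are executed locally, with results fed back as `role: "tool"` messages. A minimal sketch (the `get_weather` tool and its stub implementation are hypothetical, for illustration only):

```python
import json

# Hypothetical example tool, declared in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> dict:
    """Execute a tool call returned by the model and build the reply message."""
    args = json.loads(tool_call["function"]["arguments"])
    if tool_call["function"]["name"] == "get_weather":
        result = f"Sunny in {args['city']}"   # stub implementation
    else:
        result = "unknown tool"
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}

# Against a running server, you would pass tools=tools to
# client.chat.completions.create(...), then append dispatch(...)'s return value
# to the message list and call the API again for the final answer.
print(dispatch({"id": "call_0",
                "function": {"name": "get_weather",
                             "arguments": '{"city": "Paris"}'}}))
```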
Client Usage
OpenAI Python SDK
Control thinking mode via extra_body={"chat_template_kwargs": {"thinking": False}}
(or True to enable thinking).
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
messages = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Who are you?"},
{"role": "assistant", "content": "<think>Hmm</think>I am DeepSeek"},
{"role": "user", "content": "9.11 and 9.8, which is greater?"},
]
response = client.chat.completions.create(
model=model,
messages=messages,
extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(response.choices[0].message.content)
When thinking=True, the output includes a chain-of-thought segment terminated by a </think> tag;
when thinking=False, the model produces a direct answer with no thinking segment.
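If you want to separate the reasoning from the final answer yourself, you can split the raw text on the closing </think> tag. A minimal sketch (the helper name is hypothetical; vLLM can also do this server-side with a reasoning parser):

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer) on the closing </think> tag."""
    reasoning, sep, answer = text.partition("</think>")
    if not sep:                  # no thinking segment present (thinking=False)
        return "", text
    return reasoning.removeprefix("<think>").strip(), answer.strip()

print(split_thinking("<think>compare the decimals</think>9.8 is greater."))
```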
curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ds31",
"messages": [
{"role": "user", "content": "9.11 and 9.8, which is greater?"}
],
"chat_template_kwargs": {"thinking": true}
}'