mistralai/Ministral-3-8B-Reasoning-2512
Ministral 3 Reasoning family (3B/8B/14B) with BF16 weights, vision support, and 256K context
Overview
Ministral-3 Reasoning comes with BF16 weights in 3 sizes:
- 3B (tied embeddings)
- 8B, 14B (independent embeddings/outputs)
Each variant has vision support and a 256K maximum context. On GB200 we observe significant speed-ups with NVFP4; a Marlin fallback is available for older GPUs.
Prerequisites
- Hardware: 1x H200 (3B/8B), 2x H200 recommended for 14B with full context
- vLLM >= 0.11.0
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
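To confirm the environment meets the version requirement, you can print the installed version (a quick sanity check; vllm exposes its version string as vllm.__version__):
python -c "import vllm; print(vllm.__version__)"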
Launch command
3B or 8B on 1x H200:
vllm serve mistralai/Ministral-3-8B-Reasoning-2512 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
14B on 2x H200:
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
--tensor-parallel-size 2 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
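Once the server is up, a quick way to confirm it is serving the expected model is to query the OpenAI-compatible models endpoint (a minimal sketch, assuming the default port 8000):
from openai import OpenAI

# Connect to the local vLLM server (default port assumed)
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# The served model ID should appear in this list
print([m.id for m in client.models.list().data])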
Key flags:
- --enable-auto-tool-choice: required for tool usage
- --tool-call-parser mistral: required for tool usage
- --reasoning-parser mistral: required to extract reasoning content
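The first two flags enable OpenAI-style function calling. Below is a minimal sketch of a tool-call request; the get_weather tool and its schema are hypothetical and shown only to illustrate the request shape:
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Hypothetical tool definition, purely for illustration
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Reasoning-2512",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decided to call the tool, the call appears here
print(resp.choices[0].message.tool_calls)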
Client Usage
Streaming reasoning + answer:
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

stream = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Reasoning-2512",
    messages=[{"role": "user", "content": "Solve: use 2,5,6,3 to make 24."}],
    stream=True, temperature=0.7, top_p=0.95, max_tokens=262144,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content carries the trace extracted by --reasoning-parser mistral
    rc = getattr(delta, "reasoning_content", None)
    if rc:
        print(rc, end="", flush=True)
    # content carries the final answer tokens
    if delta.content:
        print(delta.content, end="", flush=True)
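For non-streaming requests, the reasoning parser returns the trace in a reasoning_content field alongside the final answer. A minimal sketch follows; reasoning_content is a vLLM extension to the OpenAI schema, so it is read defensively here:
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Reasoning-2512",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    temperature=0.7, top_p=0.95, max_tokens=4096,
)

msg = resp.choices[0].message
# reasoning_content holds the reasoning trace; content holds the final answer
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)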
Troubleshooting
- OOM on 14B: use --tensor-parallel-size 2 or lower --max-model-len (see the example below).
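For example, to cap the context instead of sharding across GPUs (131072 is an arbitrary illustrative value; pick what fits your memory budget):
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral \
--max-model-len 131072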