mistralai/Mistral-Large-3-675B-Instruct-2512
Mistral Large 3 (675B) with FP8 and NVFP4 weights for 8xH200 / 4xB200 deployments
Overview
Mistral-Large-3-675B-Instruct-2512 is available in FP8 and NVFP4 formats:
- FP8 (mistralai/Mistral-Large-3-675B-Instruct-2512): up to 256K context, 8xH200
- NVFP4 (mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4): best for <64K context, 4xB200
NVFP4 gives a significant speed-up on B200 (native FP4 support). On older GPUs (A100/H100), vLLM falls back to the Marlin FP4 kernel, which matches FP8 speed while saving memory.
For large contexts (>64K) we observed a performance regression on NVFP4 — use FP8 in those cases. A minor regression on vision datasets is expected with NVFP4 (calibration was mainly on text).
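The hardware split above can be sanity-checked with a back-of-envelope weight-memory estimate. This sketch assumes ~1 byte per parameter for FP8 and ~0.5 bytes for NVFP4, and ignores quantization scales, activations, and the KV cache, so real usage is higher:

```python
# Rough weight footprint of a 675B-parameter model under each format.
# Ignores quantization scales, activations, and KV cache.
PARAMS = 675e9

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9

fp8_gb = weight_gb(1.0)    # FP8: 1 byte per parameter
nvfp4_gb = weight_gb(0.5)  # NVFP4: 4 bits per parameter

print(f"FP8 weights:   ~{fp8_gb:.0f} GB")    # ~675 GB -> needs 8xH200 (1128 GB HBM)
print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB")  # ~338 GB -> fits 4xB200 (768 GB HBM)
```

The remaining HBM after loading weights is what bounds the KV cache, which is why longer contexts favor the larger 8xH200 deployment.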
Prerequisites
- Hardware: 8xH200 for FP8, 4xB200 for NVFP4
- vLLM >= 0.11.0
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Launch commands
FP8 on 8xH200:
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tensor-parallel-size 8 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral
NVFP4 on 4xB200:
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
--tensor-parallel-size 4 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral
Additional flags:
- --max-model-len: defaults to 262144; reduce to save memory
- --max-num-batched-tokens: balance throughput vs. latency
- --limit-mm-per-prompt.image 0: skip the vision encoder for text-only tasks
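Putting the flags together, a text-only FP8 server trimmed to 128K context might look like the following (the flag values here are illustrative choices, not recommendations):

```shell
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192 \
  --limit-mm-per-prompt.image 0
```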
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="mistralai/Mistral-Large-3-675B-Instruct-2512",
messages=[{"role": "user", "content": "Write a sentence..."}],
temperature=0.15, max_tokens=262144,
)
print(response.choices[0].message.content)
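Since the server is launched with --enable-auto-tool-choice and --tool-call-parser mistral, tool schemas can be passed in the standard OpenAI format. A minimal sketch follows; the get_weather function is a hypothetical example, not part of the model or vLLM:

```python
# A hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example function
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Passed alongside messages on the same client as above, e.g.:
# response = client.chat.completions.create(
#     model="mistralai/Mistral-Large-3-675B-Instruct-2512",
#     messages=[{"role": "user", "content": "Weather in Paris?"}],
#     tools=tools,
# )
# Tool invocations then appear in response.choices[0].message.tool_calls.
```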
Troubleshooting
- Accuracy regression with NVFP4 at long context: switch to FP8 variant.
- OOM: reduce --max-model-len or adjust the tensor-parallel size.