vLLM/Recipes
Mistral AI

mistralai/Mistral-Large-3-675B-Instruct-2512

Mistral Large 3 (675B) with FP8 and NVFP4 weights for 8xH200 / 4xB200 deployments

MoE · 675B total / 22B active · 294,912 context · vLLM 0.11.0+ · multimodal

Overview

Mistral-Large-3-675B-Instruct-2512 is available in FP8 and NVFP4 formats:

  • FP8 (mistralai/Mistral-Large-3-675B-Instruct-2512): up to 256K context, 8xH200
  • NVFP4 (mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4): best for <64K context, 4xB200

NVFP4 gives a significant speed-up on B200, which has native FP4 support. On older GPUs (A100/H100), vLLM falls back to the Marlin FP4 kernel, which matches FP8 speed while saving memory.

For large contexts (>64K) we observed a performance regression with NVFP4; use FP8 in those cases. A minor accuracy regression on vision datasets is also expected with NVFP4, since calibration was performed mainly on text.
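The variant choice above boils down to a simple rule of thumb. As a toy illustration (the `pick_variant` helper is hypothetical, not part of vLLM), it can be encoded as:

```python
def pick_variant(context_len: int, gpu: str) -> str:
    """Heuristic from this guide: NVFP4 on B200 below 64K context, else FP8.

    `context_len` is the expected maximum context in tokens; `gpu` is the
    accelerator name (e.g. "B200", "H200").
    """
    if gpu.upper().startswith("B200") and context_len < 64 * 1024:
        return "mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4"
    # FP8 variant: safest default, and the right call for long contexts.
    return "mistralai/Mistral-Large-3-675B-Instruct-2512"


print(pick_variant(32 * 1024, "B200"))   # NVFP4 variant
print(pick_variant(128 * 1024, "B200"))  # FP8 variant (long context)
```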

Prerequisites

  • Hardware: 8xH200 for FP8, 4xB200 for NVFP4
  • vLLM >= 0.11.0

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
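After installing, it is worth confirming the version meets the >= 0.11.0 requirement. A minimal sketch of a numeric version check (`meets_minimum` is illustrative; it handles plain dotted releases only, not pre-release tags):

```python
def meets_minimum(installed: str, minimum: str = "0.11.0") -> bool:
    """Compare dotted release versions numerically, e.g. "0.11.2" >= "0.11.0"."""
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return parse(installed) >= parse(minimum)


# In practice you would pass vllm.__version__ here.
print(meets_minimum("0.11.2"))  # True
```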

Launch commands

FP8 on 8xH200:

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral

NVFP4 on 4xB200:

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --tensor-parallel-size 4 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral

Additional flags:

  • --max-model-len: defaults to 262144; reduce it to save memory
  • --max-num-batched-tokens: balance throughput vs. latency
  • --limit-mm-per-prompt.image 0: skip loading the vision encoder for text-only serving
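Combining these with the FP8 launch command above, a memory-conscious text-only deployment might look like the following (the flag values are examples, not tuned recommendations):

```shell
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --max-model-len 131072 \
  --limit-mm-per-prompt.image 0
```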

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512",
    messages=[{"role": "user", "content": "Write a sentence..."}],
    temperature=0.15, max_tokens=262144,
)
print(response.choices[0].message.content)

Troubleshooting

  • Accuracy or performance regression with NVFP4 at long context: switch to the FP8 variant.
  • OOM: reduce --max-model-len or adjust the tensor-parallel size.
