vLLM/Recipes
Mistral AI

mistralai/Mistral-Large-3-675B-Instruct-2512

Mistral Large 3 (675B) with FP8 and NVFP4 weights for 8xH200 / 4xB200 deployments

MoE · 675B total / 22B active · 294,912 context · vLLM 0.11.0+ · multimodal

Overview

Mistral-Large-3-675B-Instruct-2512 is available in FP8 and NVFP4 formats:

  • FP8 (mistralai/Mistral-Large-3-675B-Instruct-2512): up to 256K context, 8xH200
  • NVFP4 (mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4): best for <64K context, 4xB200

NVFP4 gives a significant speed-up on B200, which has native FP4 support. On older GPUs (A100/H100), vLLM falls back to the Marlin FP4 kernel, which matches FP8 speed while saving memory.

For large contexts (>64K) we observed a performance regression with NVFP4; use FP8 in those cases. A minor accuracy regression on vision datasets is also expected with NVFP4, since calibration was performed mainly on text.
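The variant choice above boils down to a simple rule of thumb. As a toy illustration (the `pick_variant` helper is hypothetical, not part of vLLM), it can be encoded as:

```python
def pick_variant(context_len: int, gpu: str) -> str:
    """Heuristic from this guide: NVFP4 on B200 below 64K context, else FP8.

    `context_len` is the expected maximum context in tokens; `gpu` is the
    accelerator name (e.g. "B200", "H200").
    """
    if gpu.upper().startswith("B200") and context_len < 64 * 1024:
        return "mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4"
    # FP8 variant: safest default, and the right call for long contexts.
    return "mistralai/Mistral-Large-3-675B-Instruct-2512"


print(pick_variant(32 * 1024, "B200"))   # NVFP4 variant
print(pick_variant(128 * 1024, "B200"))  # FP8 variant (long context)
```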

Prerequisites

  • Hardware: 8xH200 for FP8, 4xB200 for NVFP4
  • vLLM >= 0.11.0

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
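After installing, it is worth confirming the version meets the >= 0.11.0 requirement. A minimal sketch of a numeric version check (`meets_minimum` is illustrative; it handles plain dotted releases only, not pre-release tags):

```python
def meets_minimum(installed: str, minimum: str = "0.11.0") -> bool:
    """Compare dotted release versions numerically, e.g. "0.11.2" >= "0.11.0"."""
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return parse(installed) >= parse(minimum)


# In practice you would pass vllm.__version__ here.
print(meets_minimum("0.11.2"))  # True
```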

Launch commands

FP8 on 8xH200:

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral

NVFP4 on 4xB200:

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
  --tensor-parallel-size 4 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral

Additional flags:

  • --max-model-len: defaults to 262144; reduce it to save memory
  • --max-num-batched-tokens: balance throughput vs. latency
  • --limit-mm-per-prompt.image 0: skip loading the vision encoder for text-only serving
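Combining these with the FP8 launch command above, a memory-conscious text-only deployment might look like the following (the flag values are examples, not tuned recommendations):

```shell
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --max-model-len 131072 \
  --limit-mm-per-prompt.image 0
```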

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512",
    messages=[{"role": "user", "content": "Write a sentence..."}],
    temperature=0.15, max_tokens=262144,
)
print(response.choices[0].message.content)

Troubleshooting

  • Accuracy or performance regression with NVFP4 at long context: switch to the FP8 variant.
  • OOM: reduce --max-model-len or adjust the tensor-parallel size.
