MiniMaxAI/MiniMax-M2.1
MiniMax M2.1 MoE language model (230B total / 10B active) for coding, agent toolchains, and long-context reasoning — native FP8 checkpoint
View on HuggingFaceGuide
Overview
MiniMax-M2.1 is part of the MiniMax M2 series of advanced MoE language models. It retains the M2 architecture (10B active, 230B total) with improvements over the original M2 release. Supports 196K context per sequence.
Prerequisites
- OS: Linux
- Python: 3.10 - 3.13
- NVIDIA: compute capability >= 7.0; ~220 GB for weights + 240 GB per 1M context tokens
- AMD: MI300X / MI325X / MI350X / MI355X with ROCm 7.0+
Install vLLM (NVIDIA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Docker (dedicated M2-series image)
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.1 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--trust-remote-code
Launching the Server
NVIDIA — TP4
vllm serve MiniMaxAI/MiniMax-M2.1 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice \
--trust-remote-code
Pure TP8 is not supported. For >4 GPUs use DP+EP or TP+EP.
TP4+EP (recommended for H100)
vllm serve MiniMaxAI/MiniMax-M2.1 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice
AMD ROCm
VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2.1 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--trust-remote-code
Benchmarking
vllm bench serve \
--backend vllm \
--model MiniMaxAI/MiniMax-M2.1 \
--endpoint /v1/completions \
--dataset-name random \
--random-input 2048 \
--random-output 1024 \
--max-concurrency 10 \
--num-prompt 100
Troubleshooting
- See MiniMax-M2 for shared troubleshooting notes
(
fuse_minimax_qk_norm, nightly vs stable, DeepGEMM, AITER).