XiaomiMiMo/MiMo-V2-Flash
Xiaomi's MoE reasoning model (309B total / 15B active) with hybrid attention and MTP for fast agentic workflows
Overview
MiMo-V2-Flash is a MoE language model with 309B total parameters and 15B active. Designed for high-speed reasoning and agentic workflows, it features hybrid attention and Multi-Token Prediction (MTP) to reduce inference cost.
Prerequisites
- Hardware: 4x H200 (TP4) or equivalent aggregate VRAM (~320 GB with FP8)
- vLLM >= 0.11.0
Install vLLM (NVIDIA)
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
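To confirm the install meets the vLLM >= 0.11.0 requirement, a tiny version-check helper (hypothetical, not part of vLLM) can be run in the same venv:

```python
from importlib.metadata import PackageNotFoundError, version

def meets_min_version(installed: str, minimum: str = "0.11.0") -> bool:
    """Numeric compare of dotted version strings (ignores pre-release tags)."""
    parse = lambda s: tuple(int(p) for p in s.split(".") if p.isdigit())
    return parse(installed) >= parse(minimum)

# Report the installed vLLM version, if any.
try:
    print(version("vllm"), "ok:", meets_min_version(version("vllm")))
except PackageNotFoundError:
    print("vllm is not installed in this environment")
```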
Install vLLM (AMD ROCm MI300X/MI325X/MI355X)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
Launch commands
Basic TP4:
vllm serve XiaomiMiMo/MiMo-V2-Flash \
--host 0.0.0.0 --port 9001 --seed 1024 \
--served-model-name mimo_v2_flash \
--tensor-parallel-size 4 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--generation-config vllm
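Loading a 309B-parameter checkpoint takes a while; a simple readiness poll against the OpenAI-compatible `/v1/models` endpoint (port and served model name taken from the command above) avoids sending requests too early:

```shell
# Poll until the server lists the served model, then report ready.
until curl -sf http://localhost:9001/v1/models | grep -q mimo_v2_flash; do
  echo "waiting for server..."
  sleep 10
done
echo "server ready"
```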
With tool calling + reasoning:
vllm serve XiaomiMiMo/MiMo-V2-Flash \
--tensor-parallel-size 4 --trust-remote-code --gpu-memory-utilization 0.9 \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--generation-config vllm \
--served-model-name mimo_v2_flash
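With the tool-call parser enabled, clients can pass OpenAI-style function schemas. A minimal sketch of such a request, where the `get_weather` tool is a hypothetical example (the commented SDK calls assume the `openai` package and the server launched above):

```python
# An OpenAI-style tool schema; get_weather is a made-up example tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With the openai SDK against the vLLM server:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:9001/v1", api_key="EMPTY")
# resp = client.chat.completions.create(
#     model="mimo_v2_flash",
#     messages=[{"role": "user", "content": "Weather in Beijing?"}],
#     tools=tools,
# )
# print(resp.choices[0].message.tool_calls)
```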
DP + TP + EP (2-way data parallel x 4-way tensor parallel = 8 GPUs, with expert parallelism):
vllm serve XiaomiMiMo/MiMo-V2-Flash \
--data-parallel-size 2 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--generation-config vllm \
--served-model-name mimo_v2_flash
AMD:
export VLLM_ROCM_USE_AITER=0
vllm serve XiaomiMiMo/MiMo-V2-Flash --tensor-parallel-size 4 \
--trust-remote-code --gpu-memory-utilization 0.9 --generation-config vllm
Tunable flags:
- --max-model-len=65536 works well; the maximum is 128K.
- --max-num-batched-tokens=32768 for prompt-heavy workloads; 16K or 8K for lower latency.
- --gpu-memory-utilization=0.95 to maximize KV cache.
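Putting the tunables together, a latency-oriented launch might look like this (the specific values are illustrative choices, not official defaults):

```shell
vllm serve XiaomiMiMo/MiMo-V2-Flash \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --served-model-name mimo_v2_flash \
  --generation-config vllm \
  --max-model-len 65536 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.95
```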
Client Usage
curl -X POST http://localhost:9001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mimo_v2_flash",
"messages": [{"role": "user", "content": "Hello MiMo!"}],
"chat_template_kwargs": {"enable_thinking": true}
}'
Set "enable_thinking": false (or omit chat_template_kwargs entirely) to disable thinking mode.
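The same request from Python with only the standard library, building the JSON body explicitly (endpoint and model name follow the launch commands above):

```python
import json
import urllib.request

# Same payload as the curl example; flip enable_thinking to disable thinking.
payload = {
    "model": "mimo_v2_flash",
    "messages": [{"role": "user", "content": "Hello MiMo!"}],
    "chat_template_kwargs": {"enable_thinking": True},
}
req = urllib.request.Request(
    "http://localhost:9001/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```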
Benchmarking
vllm bench serve \
--model XiaomiMiMo/MiMo-V2-Flash \
--dataset-name random --random-input-len 8000 --random-output-len 1000 \
--request-rate 3 --num-prompts 1800 --ignore-eos
Accuracy (GSM8K)
Reported 5-shot exact_match: flexible 0.9128, strict 0.9075.
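One way to reproduce a GSM8K-style check against the running server is EleutherAI's lm-evaluation-harness pointed at the OpenAI-compatible completions endpoint. This is a sketch; the `local-completions` backend and its `model_args` are assumptions about the harness setup, not part of the MiMo release:

```shell
pip install lm-eval
lm_eval --model local-completions \
  --model_args model=mimo_v2_flash,base_url=http://localhost:9001/v1/completions \
  --tasks gsm8k --num_fewshot 5
```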