Xiaomi MiMo

XiaomiMiMo/MiMo-V2-Flash

Xiaomi's MoE reasoning model (309B total / 15B active) with hybrid attention and MTP for fast agentic workflows

MoE · 309B total / 15B active · 262,144 context · vLLM 0.11.0+ · text

Overview

MiMo-V2-Flash is a MoE language model with 309B total parameters and 15B active. Designed for high-speed reasoning and agentic workflows, it features hybrid attention and Multi-Token Prediction (MTP) to reduce inference cost.

Prerequisites

  • Hardware: 4x H200 (TP4) or equivalent aggregate VRAM (~320 GB with FP8)
  • vLLM >= 0.11.0
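As a back-of-the-envelope sanity check on the VRAM figure (an estimate for the weights only, not an exact measurement): FP8 stores one byte per parameter, so 309B parameters need roughly 309 GB before KV cache and activations, which is where the ~320 GB aggregate comes from.

```python
# Rough VRAM estimate for MiMo-V2-Flash weights in FP8 (weights only).
total_params = 309e9        # 309B total parameters
bytes_per_param_fp8 = 1     # FP8 = 1 byte per parameter

weight_gb = total_params * bytes_per_param_fp8 / 1e9
print(f"FP8 weights: ~{weight_gb:.0f} GB")       # ~309 GB in aggregate

# Per-GPU share of the weights under TP4 (KV cache comes on top):
per_gpu_gb = weight_gb / 4
print(f"Per GPU (TP4): ~{per_gpu_gb:.0f} GB")
```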

Install vLLM (NVIDIA)

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto

Install vLLM (AMD ROCm MI300X/MI325X/MI355X)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Launch commands

Basic TP4:

vllm serve XiaomiMiMo/MiMo-V2-Flash \
  --host 0.0.0.0 --port 9001 --seed 1024 \
  --served-model-name mimo_v2_flash \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --generation-config vllm

With tool calling + reasoning:

vllm serve XiaomiMiMo/MiMo-V2-Flash \
  --tensor-parallel-size 4 --trust-remote-code --gpu-memory-utilization 0.9 \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --generation-config vllm \
  --served-model-name mimo_v2_flash
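With the parsers above enabled, a tool-calling request body follows the standard OpenAI tools schema. Below is a hedged sketch of such a payload; the `get_weather` tool is hypothetical, and the commented note describes the intended flow rather than a verified response shape.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
payload = {
    "model": "mimo_v2_flash",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
body = json.dumps(payload)
# POST `body` to http://localhost:9001/v1/chat/completions (see Client Usage);
# the qwen3_xml parser converts the model's tool-call markup into structured
# `tool_calls` entries in the response.
```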

DP + TP + EP (DP=2 × TP=4, i.e. 8 GPUs total):

vllm serve XiaomiMiMo/MiMo-V2-Flash \
  --data-parallel-size 2 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --generation-config vllm \
  --served-model-name mimo_v2_flash

AMD:

export VLLM_ROCM_USE_AITER=0
vllm serve XiaomiMiMo/MiMo-V2-Flash --tensor-parallel-size 4 \
  --trust-remote-code --gpu-memory-utilization 0.9 --generation-config vllm

Tunable flags:

  • --max-model-len=65536 works well for most workloads; max is 128K.
  • --max-num-batched-tokens=32768 for prompt-heavy workloads; drop to 16384 or 8192 for lower latency.
  • --gpu-memory-utilization=0.95 to maximize KV cache capacity.
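Putting the tunable flags together, one plausible combination for prompt-heavy serving (a sketch to adapt, not a benchmarked optimum):

```shell
vllm serve XiaomiMiMo/MiMo-V2-Flash \
  --tensor-parallel-size 4 --trust-remote-code \
  --max-model-len 65536 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.95 \
  --generation-config vllm \
  --served-model-name mimo_v2_flash
```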

Client Usage

curl -X POST http://localhost:9001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mimo_v2_flash",
    "messages": [{"role": "user", "content": "Hello MiMo!"}],
    "chat_template_kwargs": {"enable_thinking": true}
  }'

Set "enable_thinking": false (or omit the kwargs) to disable thinking mode.
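The same call from Python, as a minimal standard-library sketch: the helper below only builds the request body, and the commented-out lines show how it would be sent to the server launched above.

```python
import json
import urllib.request

def build_chat_request(enable_thinking: bool) -> dict:
    """Build a chat-completions body for the local MiMo server."""
    return {
        "model": "mimo_v2_flash",
        "messages": [{"role": "user", "content": "Hello MiMo!"}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

body = json.dumps(build_chat_request(enable_thinking=True)).encode()
# req = urllib.request.Request(
#     "http://localhost:9001/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```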

Benchmarking

vllm bench serve \
  --model XiaomiMiMo/MiMo-V2-Flash \
  --dataset-name random --random-input-len 8000 --random-output-len 1000 \
  --request-rate 3 --num-prompts 1800 --ignore-eos
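Rough expectations for that run, from simple arithmetic on its parameters (not measured numbers): at 3 requests/s, submitting 1800 prompts takes about 10 minutes, and with --ignore-eos forcing full-length outputs the run processes roughly 14.4M input and 1.8M output tokens.

```python
# Sanity-check the benchmark run's scale from its parameters.
num_prompts = 1800
request_rate = 3          # requests per second
input_len = 8000          # tokens per prompt
output_len = 1000         # tokens per response (--ignore-eos forces full length)

submit_seconds = num_prompts / request_rate
total_input = num_prompts * input_len
total_output = num_prompts * output_len
print(f"Submission window: {submit_seconds:.0f} s (~{submit_seconds/60:.0f} min)")
print(f"Input tokens: {total_input/1e6:.1f}M, output tokens: {total_output/1e6:.1f}M")
```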

Accuracy (GSM8K)

Reported 5-shot exact_match: flexible 0.9128, strict 0.9075.
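Numbers like these are typically produced with lm-evaluation-harness pointed at the running server. A hedged sketch (exact flag names vary across lm-eval versions; check your installed version's docs):

```shell
lm_eval --model local-completions \
  --tasks gsm8k \
  --num_fewshot 5 \
  --model_args model=mimo_v2_flash,base_url=http://localhost:9001/v1/completions
```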

References