moonshotai/Kimi-K2-Instruct
Moonshot AI's Kimi-K2 is a trillion-parameter MoE instruction model (~32B active) with native FP8 weights and strong tool-calling capabilities.
Overview
Kimi-K2-Instruct is Moonshot AI's trillion-parameter Mixture-of-Experts instruction model (approximately 32B activated per token) shipped with native FP8 weights. The smallest deployment unit for Kimi-K2 FP8 weights with 128k seqlen on mainstream H800 platforms is a 16-GPU cluster using either Tensor Parallel (TP) or Data Parallel + Expert Parallel (DP+EP). This guide is partially adapted from the official Kimi-K2-Instruct Deployment Guidance.
Prerequisites
- Hardware (FP8): 16x H800 or 16x H200 GPUs (verified)
- vLLM: Current stable release
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
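A quick sanity check, assuming the virtual environment above is still active, confirms the install before moving on to multi-node setup:

```shell
# Report the installed vLLM version, or a fallback if the import fails.
VLLM_VERSION=$(python3 -c "import vllm; print(vllm.__version__)" 2>/dev/null || echo "not installed")
echo "vllm: $VLLM_VERSION"
```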
Running Kimi-K2 with FP8 on 16xH800
Tensor Parallelism + Pipeline Parallelism (TP8+PP2)
# node 0 (start Ray on both nodes first)
vllm serve moonshotai/Kimi-K2-Instruct \
--trust-remote-code \
--tokenizer-mode auto \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--dtype bfloat16 \
--quantization fp8 \
--max-model-len 2048 \
--max-num-seqs 1 \
--max-num-batched-tokens 1024 \
--enable-chunked-prefill \
--disable-log-requests \
--kv-cache-dtype fp8 \
-dcp 8
Key parameter notes:
- --enable-auto-tool-choice: required to enable tool usage (add it to the serve command above).
- --tool-call-parser kimi_k2: selects Kimi-K2's tool-call parser; required alongside --enable-auto-tool-choice.
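Once the server logs show startup is complete, you can smoke-test the OpenAI-compatible endpoint. A minimal sketch, assuming the default port 8000 (adjust if you passed --port):

```shell
# Minimal chat-completion request against the deployment above.
PAYLOAD='{
  "model": "moonshotai/Kimi-K2-Instruct",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "max_tokens": 32
}'
# Validate the JSON locally before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# POST to the node-0 endpoint; falls through with a note if the server is not up.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```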
Data Parallelism + Expert Parallelism (DP16+EP)
You can optionally install kernel libraries such as DeepEP and DeepGEMM as needed. Then run the following commands (example on H800):
# node 0
vllm serve moonshotai/Kimi-K2-Instruct \
--port 8000 --served-model-name kimi-k2 \
--trust-remote-code \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port $PORT \
--enable-expert-parallel \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2
# node 1
vllm serve moonshotai/Kimi-K2-Instruct \
--headless \
--data-parallel-start-rank 8 \
--port 8000 --served-model-name kimi-k2 \
--trust-remote-code \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port $PORT \
--enable-expert-parallel \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2
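With --enable-auto-tool-choice and --tool-call-parser kimi_k2 set as above, the server accepts OpenAI-style tool definitions in chat requests. A minimal sketch, assuming the DP+EP deployment above on port 8000 and a hypothetical get_weather tool (define your own function schema in practice):

```shell
# Tool-calling request against the served model name from the commands above.
PAYLOAD='{
  "model": "kimi-k2",
  "messages": [{"role": "user", "content": "What is the weather in Paris today?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
# Validate the JSON locally before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# POST to node 0; a successful tool call appears under choices[0].message.tool_calls.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```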
Additional flags:
- --max-model-len caps the context length and therefore memory use; --max-model-len=65536 is usually enough for most scenarios.
- --max-num-batched-tokens balances throughput vs. latency: 32768 suits prompt-heavy workloads; reduce to 16384 or 8192 to cut activation memory and lower latency.
- vLLM conservatively uses 90% of GPU memory by default; set --gpu-memory-utilization=0.95 to maximize KV cache space.
Benchmarking
FP8 Benchmark on 16xH800
vllm bench serve \
--model moonshotai/Kimi-K2-Instruct \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 512 \
--request-rate 1.0 \
--num-prompts 8 \
--ignore-eos \
--trust-remote-code
FP8 Benchmark on 16xH200
vllm bench serve \
--model moonshotai/Kimi-K2-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos \
--trust-remote-code
Adding -dcp 8 at launch can further improve throughput on H200 (observed ~33% lower mean TTFT and higher tok/s in internal benchmarks).
Test different batch sizes by changing --num-prompts, e.g. 1, 16, 32, 64, 128, 256, 512.
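One way to script that sweep is a simple loop over the batch sizes. The sketch below is a dry run that only prints each command (remove the leading echo to actually launch the benchmarks, with a server already running):

```shell
# Dry-run sweep over --num-prompts; drop "echo" to execute each benchmark.
for n in 1 16 32 64 128 256 512; do
  echo vllm bench serve \
    --model moonshotai/Kimi-K2-Instruct \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 512 \
    --num-prompts "$n" \
    --ignore-eos \
    --trust-remote-code
done
```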