moonshotai/Kimi-K2-Instruct
Moonshot AI's Kimi-K2 is a trillion-parameter MoE instruction model (~32B active) with native FP8 weights and strong tool-calling capabilities.
Overview
Kimi-K2-Instruct is Moonshot AI's trillion-parameter Mixture-of-Experts instruction model (approximately 32B activated per token) shipped with native FP8 weights. The smallest deployment unit for Kimi-K2 FP8 weights with 128k seqlen on mainstream H800 platforms is a 16-GPU cluster using either Tensor Parallel (TP) or Data Parallel + Expert Parallel (DP+EP). This guide is partially adapted from the official Kimi-K2-Instruct Deployment Guidance.
Prerequisites
- Hardware (FP8): 16x H800 or 16x H200 GPUs (verified)
- vLLM: Current stable release
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
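A quick sanity check, assuming the virtual environment above is still active, confirms the install before moving on to multi-node setup:

```shell
# Report the installed vLLM version, or a fallback if the import fails.
VLLM_VERSION=$(python3 -c "import vllm; print(vllm.__version__)" 2>/dev/null || echo "not installed")
echo "vllm: $VLLM_VERSION"
```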
Running Kimi-K2 with FP8 on 16xH800
Tensor Parallelism + Pipeline Parallelism (TP8+PP2)
# node 0 (start Ray on both nodes first)
vllm serve moonshotai/Kimi-K2-Instruct \
--trust-remote-code \
--tokenizer-mode auto \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--dtype bfloat16 \
--quantization fp8 \
--max-model-len 2048 \
--max-num-seqs 1 \
--max-num-batched-tokens 1024 \
--enable-chunked-prefill \
--disable-log-requests \
--kv-cache-dtype fp8 \
-dcp 8
Key parameter notes:
- --enable-auto-tool-choice: required to enable tool usage (add it to the serve command above).
- --tool-call-parser kimi_k2: selects Kimi-K2's tool-call parser; required alongside --enable-auto-tool-choice.
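Once the server logs show startup is complete, you can smoke-test the OpenAI-compatible endpoint. A minimal sketch, assuming the default port 8000 (adjust if you passed --port):

```shell
# Minimal chat-completion request against the deployment above.
PAYLOAD='{
  "model": "moonshotai/Kimi-K2-Instruct",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "max_tokens": 32
}'
# Validate the JSON locally before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# POST to the node-0 endpoint; falls through with a note if the server is not up.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```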
Data Parallelism + Expert Parallelism (DP16+EP)
You can optionally install kernel libraries such as DeepEP and DeepGEMM as needed. Then run the following commands (example on H800):
# node 0
vllm serve moonshotai/Kimi-K2-Instruct \
--port 8000 --served-model-name kimi-k2 \
--trust-remote-code \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port $PORT \
--enable-expert-parallel \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2
# node 1
vllm serve moonshotai/Kimi-K2-Instruct \
--headless \
--data-parallel-start-rank 8 \
--port 8000 --served-model-name kimi-k2 \
--trust-remote-code \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port $PORT \
--enable-expert-parallel \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2
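With --enable-auto-tool-choice and --tool-call-parser kimi_k2 set as above, the server accepts OpenAI-style tool definitions in chat requests. A minimal sketch, assuming the DP+EP deployment above on port 8000 and a hypothetical get_weather tool (define your own function schema in practice):

```shell
# Tool-calling request against the served model name from the commands above.
PAYLOAD='{
  "model": "kimi-k2",
  "messages": [{"role": "user", "content": "What is the weather in Paris today?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
# Validate the JSON locally before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# POST to node 0; a successful tool call appears under choices[0].message.tool_calls.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```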
Additional flags:
- --max-model-len caps the context length and therefore memory use; --max-model-len=65536 is usually enough for most scenarios.
- --max-num-batched-tokens balances throughput vs. latency: 32768 suits prompt-heavy workloads; reduce to 16384 or 8192 to cut activation memory and lower latency.
- vLLM conservatively uses 90% of GPU memory by default; set --gpu-memory-utilization=0.95 to maximize KV cache space.
Benchmarking
FP8 Benchmark on 16xH800
vllm bench serve \
--model moonshotai/Kimi-K2-Instruct \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 512 \
--request-rate 1.0 \
--num-prompts 8 \
--ignore-eos \
--trust-remote-code
FP8 Benchmark on 16xH200
vllm bench serve \
--model moonshotai/Kimi-K2-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos \
--trust-remote-code
Adding -dcp 8 at launch can further improve throughput on H200 (observed ~33% lower mean TTFT and higher tok/s in internal benchmarks).
Test different batch sizes by changing --num-prompts, e.g. 1, 16, 32, 64, 128, 256, 512.
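One way to script that sweep is a simple loop over the batch sizes. The sketch below is a dry run that only prints each command (remove the leading echo to actually launch the benchmarks, with a server already running):

```shell
# Dry-run sweep over --num-prompts; drop "echo" to execute each benchmark.
for n in 1 16 32 64 128 256 512; do
  echo vllm bench serve \
    --model moonshotai/Kimi-K2-Instruct \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 512 \
    --num-prompts "$n" \
    --ignore-eos \
    --trust-remote-code
done
```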