vLLM/Recipes
Moonshot AI

moonshotai/Kimi-K2-Thinking

Kimi-K2-Thinking is an advanced reasoning MoE model with native INT4 QAT weights, designed for long-horizon agent workflows interleaving chain-of-thought reasoning with tool calls.

MoE · 1T total / 32B active parameters · 262,144 token context · vLLM 0.12.0+ · text

Overview

Kimi-K2-Thinking is an advanced trillion-parameter MoE model created by Moonshot AI with these highlights:

  • Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
  • Native INT4 Quantization: Quantization-Aware Training (QAT) delivers a lossless 2x speed-up in low-latency mode.
  • Stable Long-Horizon Agency: Maintains coherent, goal-directed behavior across 200-300 consecutive tool invocations, surpassing prior models that degrade after 30-50 steps.

Prerequisites

  • Hardware: 8x H200 or 8x H20 GPUs
  • vLLM: Current stable release
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Launching Kimi-K2-Thinking with vLLM

Low-Latency Scenarios (TP8)

vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code

The --reasoning-parser flag specifies the parser used to extract reasoning content from the model output.
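With both parsers enabled, the OpenAI-compatible /v1/chat/completions response carries the chain of thought in a reasoning_content field, separate from the final content and any tool_calls. A minimal sketch of the request payload and response handling (the get_weather tool and the sample response dict below are illustrative assumptions, not real server output):

```python
import json

# Hypothetical request payload for the OpenAI-compatible
# /v1/chat/completions endpoint exposed by `vllm serve`.
payload = {
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    # Illustrative tool definition; --enable-auto-tool-choice lets the
    # model decide on its own when to call it.
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

def split_response(message: dict) -> tuple:
    """Separate the fields produced by the kimi_k2 parsers:
    reasoning_content (chain of thought) vs. content / tool_calls."""
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    calls = message.get("tool_calls") or []
    return reasoning, answer, calls

# Assumed response shape for a turn that triggers a tool call.
sample = {
    "reasoning_content": "The user wants current weather; call the tool.",
    "content": None,
    "tool_calls": [{
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": json.dumps({"city": "Paris"})},
    }],
}

reasoning, answer, calls = split_response(sample)
print(reasoning)
print(calls[0]["function"]["name"])
```

In an agent loop, each tool result is appended as a tool-role message and the model is queried again; the reasoning_content of intermediate turns is what the kimi_k2 reasoning parser extracts for you.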

High-Throughput Scenarios (TP8+DCP8)

vLLM supports Decode Context Parallelism (DCP), which provides significant benefits in high-throughput scenarios. Enable DCP by adding --decode-context-parallel-size 8:

vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code

Metrics (GSM8K)

Config      exact_match (flexible)   exact_match (strict)
TP8         0.9416                   0.9409
TP8+DCP8    0.9386                   0.9371

Benchmarking

We used the following command to benchmark moonshotai/Kimi-K2-Thinking on 8x H200:

vllm bench serve \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 4000 \
  --request-rate 100 \
  --num-prompts 1000 \
  --trust-remote-code

DCP Gain Analysis

Metric                            TP8      TP8+DCP8   Change    Improvement
Request throughput (req/s)        1.25     1.57       +0.32     +25.6%
Output token throughput (tok/s)   485.78   695.13     +209.35   +43.1%
Mean TTFT (s)                     271.2    227.8      -43.4     +16.0%
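The Change and Improvement columns follow directly from the two measured configurations (for TTFT, lower is better, so the improvement is the relative reduction). A quick check of the arithmetic:

```python
# Reproduce the Change / Improvement columns from the measured values.
rows = {
    # metric: (TP8, TP8+DCP8, higher_is_better)
    "request_throughput_req_s": (1.25, 1.57, True),
    "output_token_throughput_tok_s": (485.78, 695.13, True),
    "mean_ttft_s": (271.2, 227.8, False),
}

for name, (tp8, dcp8, higher_better) in rows.items():
    change = dcp8 - tp8
    # Improvement is reported as a positive relative gain either way.
    gain = change / tp8 if higher_better else -change / tp8
    print(f"{name}: change={change:+.2f}, improvement={gain:+.1%}")
```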

DCP multiplies the GPU KV cache size by dcp_world_size:

  • TP8 KV cache: 715,072 tokens
  • TP8+DCP8 KV cache: 5,721,088 tokens (8x)
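Since each DCP rank holds a distinct shard of the decoded context, total KV cache capacity scales with dcp_world_size. A back-of-the-envelope check (the small gap versus the reported 5,721,088 tokens is assumed to come from per-rank block-size rounding):

```python
# KV cache capacity scales with the DCP world size because each
# rank stores a different slice of the decoded context.
tp8_kv_tokens = 715_072       # measured capacity with plain TP8
dcp_world_size = 8

estimated = tp8_kv_tokens * dcp_world_size
print(f"{estimated:,}")       # close to the reported 5,721,088
```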

Enabling DCP delivers strong advantages (43% faster token generation, 26% higher request throughput, 16% lower mean TTFT) with minimal drawbacks: GSM8K exact match drops by only about 0.3 points. Read the DCP doc and try it in your own LLM workloads.
