Google/gemma-4-31B-it
Google's unified multimodal Gemma 4 dense model (31B) with native text, image, and audio, plus thinking mode and tool-use protocol.
Overview
Gemma 4 is Google's most capable open model family, featuring a unified multimodal architecture that natively processes text, images, and audio. Gemma 4 models support structured thinking/reasoning, function calling with a custom tool-use protocol, and dynamic vision resolution — all available through vLLM's OpenAI-compatible API.
Key Features
- Multimodal: Text + images natively (video via a custom frame-extraction pipeline). The smaller E2B and E4B models also support audio.
- MoE variant: 128 fine-grained experts with top-8 routing and custom GELU-activated FFN (Gemma 4 26B-A4B).
- Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
- Thinking Mode: Structured reasoning via `<|channel>thought\n...<channel|>` delimiters.
- Function Calling: Custom tool-call protocol with dedicated special tokens.
- Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
Supported Variants
Dense:
- `google/gemma-4-E2B-it` (effective 2B)
- `google/gemma-4-E4B-it` (effective 4B)
- `google/gemma-4-31B-it` (31B)
MoE:
- `google/gemma-4-26B-A4B-it` (26B total / 4B active)
TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.
Prerequisites
pip (NVIDIA CUDA)
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match
```
pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)
Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).
```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
  --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade
```
Docker
```bash
docker pull vllm/vllm-openai:gemma4        # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130  # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4   # AMD
```
TPU images are published separately by vllm-project/tpu-inference; see the Trillium / Ironwood tpu-recipes below for the pinned tag.
Deployment Configurations
Quick Start (Single GPU)
```bash
vllm serve google/gemma-4-E4B-it \
  --max-model-len <n_of_tokens>  # up to 131072
```
31B Dense on 2xA100/H100 (TP=2, BF16)
```bash
vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```
26B MoE on 1xA100/H100 (BF16)
```bash
vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```
Full-Featured Server Launch
Enables text, image, audio, thinking, and tool calling:
```bash
vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt image=4,audio=1 \
  --async-scheduling \
  --host 0.0.0.0 \
  --port 8000
```
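With `--enable-auto-tool-choice` and the `gemma4` tool-call parser active, the server returns parsed tool calls through the standard OpenAI `tool_calls` field. The sketch below is illustrative, not part of the recipe: the `get_weather` tool, its parameters, and the `ask_with_tools` helper are all made-up names, and it assumes the server above is running locally.

```python
import json

# Illustrative tool schema; `get_weather` and its parameters are examples,
# not part of the Gemma 4 tool-use protocol itself.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def ask_with_tools(prompt: str, base_url: str = "http://localhost:8000/v1"):
    # Requires `pip install openai` and the server launched as above.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = resp.choices[0].message
    # With --enable-auto-tool-choice, tool invocations arrive already parsed.
    for call in msg.tool_calls or []:
        print(call.function.name, json.loads(call.function.arguments))
    return msg
```

After executing a tool locally, append a `{"role": "tool", ...}` message with the result and call the endpoint again to get the final answer.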
Docker (NVIDIA)
```bash
docker run -itd --name gemma4 \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4 \
  --model google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
Docker (AMD MI300X/MI325X/MI350X/MI355X)
```bash
docker run -itd --name gemma4-rocm \
  --ipc=host --network=host --privileged \
  --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
  --group-add=video --cap-add=SYS_PTRACE \
  --security-opt=seccomp=unconfined --shm-size 16G \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:gemma4 \
  --model google/gemma-4-31B-it \
  --host 0.0.0.0 --port 8000
```
Docker (Cloud TPU — Trillium / Ironwood)
TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:
```bash
docker run -itd --name gemma4-tpu \
  --privileged --network host --shm-size 16G \
  -v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:latest \
  --model google/gemma-4-31B-it \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --disable_chunked_mm_input \
  --host 0.0.0.0 --port 8000
```
Client Usage
Text Generation
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
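For interactive use you can stream tokens as they are generated instead of waiting for the full reply. This is a minimal sketch using the standard OpenAI streaming interface against the same endpoint; the `stream_completion` helper name is ours.

```python
def stream_completion(prompt: str, base_url: str = "http://localhost:8000/v1"):
    # Requires `pip install openai` and a running vLLM server.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    stream = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.7,
        stream=True,  # yields chunks as tokens are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    print()
```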
Image Understanding
```python
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=1024,
)
```
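To send a local file instead of a remote URL, encode it as a base64 `data:` URL; the OpenAI-compatible `image_url` field accepts these. The helper below is a stdlib convenience we wrote for this recipe, not part of any API.

```python
import base64

def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local image file as a data: URL usable in the image_url field."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"

# Usage: swap the remote URL in the request above for a local file, e.g.
#   {"type": "image_url", "image_url": {"url": image_to_data_url("cat.jpg")}}
```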
Dynamic Vision Resolution
Supported values: 70, 140, 280 (default), 560, 1120 tokens/image.
```bash
vllm serve google/gemma-4-31B-it \
  --mm-processor-kwargs '{"max_soft_tokens": 560}'
```
Audio (E2B / E4B)
Requires `uv pip install "vllm[audio]"`.
```bash
vllm serve google/gemma-4-E2B-it \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=4,audio=1
```
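A client request then attaches the clip as a content part. This sketch assumes vLLM's `audio_url` chat content type; the `build_audio_messages`/`transcribe` helpers and the prompt wording are our own.

```python
def build_audio_messages(url: str) -> list:
    # Assumes vLLM's `audio_url` multimodal content part.
    return [{"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": url}},
        {"type": "text", "text": "Transcribe this audio clip."},
    ]}]

def transcribe(url: str, base_url: str = "http://localhost:8000/v1") -> str:
    # Requires `pip install openai` and the audio-enabled server above.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(
        model="google/gemma-4-E2B-it",
        messages=build_audio_messages(url),
        max_tokens=512,
    )
    return resp.choices[0].message.content
```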
Thinking Mode
```bash
vllm serve google/gemma-4-31B-it \
  --max-model-len 16384 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja
```
Enable thinking per request via `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`, or enable it by default with `--default-chat-template-kwargs '{"enable_thinking": true}'`.
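Putting that together on the client side: with a reasoning parser configured, vLLM returns the thought channel separately in `reasoning_content`. A minimal sketch (the `think` helper is ours; it assumes the thinking-enabled server above is running):

```python
def think(prompt: str, base_url: str = "http://localhost:8000/v1"):
    # Requires `pip install openai` and the thinking-enabled server above.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[{"role": "user", "content": prompt}],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    msg = resp.choices[0].message
    # The reasoning parser splits the thought channel out of the final answer.
    print("thinking:", getattr(msg, "reasoning_content", None))
    print("answer:", msg.content)
    return msg
```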
Structured Outputs
vLLM guided decoding constrains output to a JSON schema. Include semantic instructions in the system prompt — the model does not see schema descriptions.
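As a sketch of the pattern: the schema goes in `extra_body` via vLLM's `guided_json` extension, while the semantic instructions ("extract the name and minutes") go in the system prompt. The `RECIPE_SCHEMA` fields and the `extract_recipe` helper are illustrative, not part of the recipe.

```python
import json

# Example schema; field names are illustrative.
RECIPE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "minutes": {"type": "integer"},
    },
    "required": ["name", "minutes"],
}

def extract_recipe(text: str, base_url: str = "http://localhost:8000/v1") -> dict:
    # Uses vLLM's `guided_json` extension via extra_body; needs `pip install openai`.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[
            # Semantic instructions live here; the model never sees the
            # schema's description fields, only the structural constraint.
            {"role": "system", "content": "Extract the recipe name and total minutes."},
            {"role": "user", "content": text},
        ],
        extra_body={"guided_json": RECIPE_SCHEMA},
    )
    # Decoding is constrained to the schema, so this parse should not fail.
    return json.loads(resp.choices[0].message.content)
```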
Configuration Tips
- Set `--max-model-len` to match your workload.
- `--gpu-memory-utilization 0.90–0.95` maximizes KV cache.
- Image-only workloads: pass `--limit-mm-per-prompt audio=0`.
- Text-only workloads: pass `--limit-mm-per-prompt image=0,audio=0` to skip MM profiling.
- `--async-scheduling` improves throughput.
- FP8 KV cache (`--kv-cache-dtype fp8`) saves ~50% KV memory.
Throughput vs Latency
| Goal | TP | `--max-num-seqs` | Notes |
|---|---|---|---|
| Max throughput | 1-2 | 256-512 | Best tok/s per GPU |
| Min latency | 4-8 | 8-16 | Best TTFT/TPOT |
| Balanced | 2 | 128 | Mixed workloads |
Deploy on Modal
Modal deployment script: gemma4-modal.py in the original recipe directory.
```bash
pip install modal
modal setup
modal deploy gemma4-modal.py
modal run gemma4-modal.py
```