vLLM/Recipes

Google/gemma-4-31B-it

Google's unified multimodal Gemma 4 dense model (31B) with native text, image, and audio, plus thinking mode and tool-use protocol.

dense · 31B · 262,144 ctx · vLLM 0.19.1+ · multimodal · text

Overview

Gemma 4 is Google's most capable open model family, featuring a unified multimodal architecture that natively processes text, images, and audio. Gemma 4 models support structured thinking/reasoning, function calling with a custom tool-use protocol, and dynamic vision resolution — all available through vLLM's OpenAI-compatible API.

Key Features

  • Multimodal: Text + images natively (video via custom frame-extraction pipeline). The smaller E2B and E4B models also support audio.
  • MoE variant: 128 fine-grained experts with top-8 routing and custom GELU-activated FFN (Gemma 4 26B-A4B).
  • Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
  • Thinking Mode: Structured reasoning via <|channel|>thought\n...<|channel|> delimiters.
  • Function Calling: Custom tool-call protocol with dedicated special tokens.
  • Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).

Supported Variants

Dense:

  • google/gemma-4-E2B-it (effective 2B)
  • google/gemma-4-E4B-it (effective 4B)
  • google/gemma-4-31B-it (31B)

MoE:

  • google/gemma-4-26B-A4B-it (26B total / 4B active)

TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.

Prerequisites

pip (NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)

Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).

uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
  --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade

Docker

docker pull vllm/vllm-openai:gemma4        # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130  # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4   # AMD

TPU images are published separately by vllm-project/tpu-inference; see the Trillium / Ironwood tpu-recipes below for the pinned tag.

Deployment Configurations

Quick Start (Single GPU)

vllm serve google/gemma-4-E4B-it \
  --max-model-len <n_of_tokens>   # up to 131072

31B Dense on 2xA100/H100 (TP=2, BF16)

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

26B MoE on 1xA100/H100 (BF16)

vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Full-Featured 31B on 2xA100/H100 (TP=2, BF16)

Enables text, image, and audio input plus thinking mode and tool calling:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt image=4,audio=1 \
  --async-scheduling \
  --host 0.0.0.0 \
  --port 8000
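With --enable-auto-tool-choice and the gemma4 tool-call parser enabled, the server returns OpenAI-style tool_calls. A hedged client-side sketch: the get_weather tool and the sample response object are illustrative, not part of the recipe; a real request would pass tools=tools to client.chat.completions.create.

```python
import json

# OpenAI-style tool schema (standard format; the function itself is illustrative)
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def extract_tool_calls(message: dict) -> list:
    """Return (name, parsed-arguments) pairs from a chat completion message."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls

# Illustrative response shape; a real call would be
# client.chat.completions.create(..., tools=tools, tool_choice="auto")
sample = {"role": "assistant", "content": None,
          "tool_calls": [{"id": "call_0", "type": "function",
                          "function": {"name": "get_weather",
                                       "arguments": '{"city": "Zurich"}'}}]}
print(extract_tool_calls(sample))
```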

Docker (NVIDIA)

docker run -itd --name gemma4 \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4 \
    --model google/gemma-4-31B-it \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 --port 8000

Docker (AMD MI300X/MI325X/MI350X/MI355X)

docker run -itd --name gemma4-rocm \
  --ipc=host --network=host --privileged \
  --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
  --group-add=video --cap-add=SYS_PTRACE \
  --security-opt=seccomp=unconfined --shm-size 16G \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:gemma4 \
    --model google/gemma-4-31B-it \
    --host 0.0.0.0 --port 8000

Docker (Cloud TPU — Trillium / Ironwood)

TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:

docker run -itd --name gemma4-tpu \
  --privileged --network host --shm-size 16G \
  -v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:latest \
    --model google/gemma-4-31B-it \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --disable_chunked_mm_input \
    --host 0.0.0.0 --port 8000

Client Usage

Text Generation

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512, temperature=0.7,
)
print(response.choices[0].message.content)

Image Understanding

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=1024,
)
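Local image files can be sent the same way by base64-encoding them into a data URL for the image_url field; a small sketch (the file path in the comment is illustrative):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL accepted by the image_url field."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# e.g.  with open("cat.jpg", "rb") as f: url = to_data_url(f.read())
url = to_data_url(b"\xff\xd8\xff")  # placeholder JPEG magic bytes
print(url[:30])
```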

Dynamic Vision Resolution

Supported values: 70, 140, 280 (default), 560, 1120 tokens/image.

vllm serve google/gemma-4-31B-it \
  --mm-processor-kwargs '{"max_soft_tokens": 560}'
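Since the budget is described as per-request configurable, it can likely also be passed per request via extra_body; a hedged sketch that builds the payload (the max_soft_tokens field mirrors the serve flag above; confirm your vLLM build accepts per-request mm_processor_kwargs overrides):

```python
# Hedged sketch: per-request vision token budget via extra_body.
def vision_budget_body(tokens: int) -> dict:
    """Build extra_body selecting one of the supported token budgets."""
    allowed = {70, 140, 280, 560, 1120}
    if tokens not in allowed:
        raise ValueError(f"unsupported budget {tokens}; choose from {sorted(allowed)}")
    return {"mm_processor_kwargs": {"max_soft_tokens": tokens}}

print(vision_budget_body(560))
```

Usage: client.chat.completions.create(..., extra_body=vision_budget_body(560)).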

Audio (E2B / E4B)

Requires uv pip install "vllm[audio]".

vllm serve google/gemma-4-E2B-it \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=4,audio=1
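Audio clips can then be sent as a content part alongside text. A hedged sketch building the message with the audio_url content type (one convention vLLM accepts for audio inputs; the placeholder bytes stand in for a real WAV file):

```python
import base64

def audio_message(wav_bytes: bytes, prompt: str) -> dict:
    """Build a chat message pairing an audio part with a text instruction."""
    b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
        {"type": "text", "text": prompt},
    ]}

msg = audio_message(b"RIFF", "Transcribe this clip.")  # placeholder bytes
print(msg["content"][0]["type"])
```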

Thinking Mode

vllm serve google/gemma-4-31B-it \
  --max-model-len 16384 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja

Enable thinking per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}, or default-on with --default-chat-template-kwargs '{"enable_thinking": true}'.
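With a reasoning parser configured, the thought channel is typically surfaced as a separate reasoning_content field on the message rather than as inline delimiters; a hedged sketch of splitting the two (the sample message object is illustrative, standing in for response.choices[0].message):

```python
from types import SimpleNamespace

def split_reasoning(message) -> tuple:
    """Return (reasoning, answer). With --reasoning-parser set, vLLM
    exposes the thought channel as message.reasoning_content."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content or ""

# Illustrative stand-in for a parsed chat completion message
msg = SimpleNamespace(reasoning_content="Consider rhyme schemes...",
                      content="Here is the poem.")
print(split_reasoning(msg))
```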

Structured Outputs

vLLM guided decoding constrains output to a JSON schema. Include semantic instructions in the system prompt — the model does not see schema descriptions.
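A hedged sketch of a guided-decoding request: guided_json is a vLLM extension passed through extra_body (check your vLLM version for the exact parameter name), and the schema and example output below are illustrative.

```python
import json

# JSON schema the output must conform to (put semantic hints in the prompt,
# since the model does not see "description" fields)
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

def guided_request_kwargs(schema: dict) -> dict:
    """extra_body for vLLM guided decoding (guided_json is a vLLM extension)."""
    return {"extra_body": {"guided_json": schema}}

# Usage: client.chat.completions.create(..., **guided_request_kwargs(schema))
raw = '{"city": "Lagos", "population": 15000000}'  # illustrative constrained output
parsed = json.loads(raw)
print(sorted(parsed))
```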

Configuration Tips

  • Set --max-model-len to match your workload.
  • --gpu-memory-utilization 0.90–0.95 maximizes KV cache.
  • Image-only workloads: pass --limit-mm-per-prompt audio=0.
  • Text-only workloads: pass --limit-mm-per-prompt image=0,audio=0 to skip MM profiling.
  • --async-scheduling improves throughput.
  • FP8 KV cache (--kv-cache-dtype fp8) saves ~50% KV memory.
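The ~50% saving follows directly from per-token KV bytes scaling linearly with dtype width; a back-of-envelope sketch (the layer and head counts are illustrative placeholders, not the real Gemma 4 31B hyperparameters):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    # 2 tensors (K and V) per layer, per KV head
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative config (NOT the actual Gemma 4 hyperparameters)
layers, kv_heads, head_dim = 48, 8, 128
bf16 = kv_bytes_per_token(layers, kv_heads, head_dim, 2)  # BF16 = 2 bytes
fp8 = kv_bytes_per_token(layers, kv_heads, head_dim, 1)   # FP8 = 1 byte
print(bf16, fp8, fp8 / bf16)
```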

Throughput vs Latency

Goal             TP    --max-num-seqs   Notes
Max throughput   1-2   256-512          Best tok/s per GPU
Min latency      4-8   8-16             Best TTFT/TPOT
Balanced         2     128              Mixed workloads
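These presets can be captured as a small lookup helper; the values are copied from the table above (the lower bound of each range) and are starting points to tune from, not guarantees:

```python
PRESETS = {
    # goal: tensor-parallel and max-num-seqs ranges from the table above
    "max_throughput": {"tp": (1, 2), "max_num_seqs": (256, 512)},
    "min_latency":    {"tp": (4, 8), "max_num_seqs": (8, 16)},
    "balanced":       {"tp": (2, 2), "max_num_seqs": (128, 128)},
}

def serve_flags(goal: str) -> str:
    """Render the low end of a preset as vllm serve flags."""
    p = PRESETS[goal]
    return (f"--tensor-parallel-size {p['tp'][0]} "
            f"--max-num-seqs {p['max_num_seqs'][0]}")

print(serve_flags("balanced"))
```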

Deploy on Modal

Modal deployment script: gemma4-modal.py in the original recipe directory.

pip install modal
modal setup                   # authenticate once
modal deploy gemma4-modal.py  # deploy the server as a persistent app
modal run gemma4-modal.py     # or run an ad-hoc test invocation

References