Google/gemma-4-26B-A4B-it
Google's Gemma 4 MoE multimodal model (26B total / 4B active) with 128 fine-grained experts, top-8 routing, thinking mode, and tool-use protocol.
Overview
Gemma 4 26B-A4B is the Mixture-of-Experts member of Google's Gemma 4 family — 26B total parameters with only 4B active per token via 128 fine-grained experts and top-8 routing. It supports text + images natively, structured thinking, function calling, and dynamic vision resolution.
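The routing scheme can be sketched in a few lines. This is a hedged, pure-Python illustration of generic top-k expert routing as described above (not the actual Gemma 4 kernels): a router scores all 128 experts for each token, keeps the 8 highest-scoring ones, and softmax-renormalizes their weights.

```python
# Illustrative top-8 routing over 128 experts (not the real implementation).
import math
import random

NUM_EXPERTS, TOP_K = 128, 8

def route(router_logits, top_k=TOP_K):
    """Return (expert_ids, mixing_weights) for one token's router logits."""
    ids = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    m = max(router_logits[i] for i in ids)            # for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in ids]
    z = sum(exps)
    return ids, [e / z for e in exps]                 # weights sum to 1

random.seed(0)
ids, weights = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
print(len(ids), round(sum(weights), 6))  # 8 1.0
```

Only the 8 selected experts' FFNs run for that token, which is why a 26B-parameter model has the per-token compute of a ~4B dense model.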
Key Features
- MoE: 128 fine-grained experts with top-8 routing and custom GELU-activated FFN.
- Multimodal: Text + images natively (video via custom frame-extraction pipeline). Audio is only supported on the smaller E2B/E4B variants.
- Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
- Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
- Function Calling: Custom tool-call protocol with dedicated special tokens.
- Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.
Prerequisites
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)
Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
--extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade
Docker
docker pull vllm/vllm-openai:gemma4 # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130 # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4 # AMD
TPU images are published separately by vllm-project/tpu-inference; see the Trillium / Ironwood tpu-recipes below for the pinned tag.
Deployment Configurations
26B MoE on 1x A100/H100 (BF16)
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
Full-Featured Server Launch
Enables text, image, thinking, and tool calling:
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt image=4 \
--async-scheduling \
--host 0.0.0.0 \
--port 8000
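With --enable-auto-tool-choice and the gemma4 tool-call parser, the server speaks the standard OpenAI tools API. A minimal sketch of a tool-calling request follows; the get_weather function and its JSON schema are illustrative placeholders, not part of the model or server.

```python
# Hedged sketch: building an OpenAI-style tool-calling request for the
# full-featured server above. `get_weather` is a made-up example tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
# Against a live server:
#   response = client.chat.completions.create(**request)
# and any emitted calls appear in response.choices[0].message.tool_calls.
```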
Docker (NVIDIA)
docker run -itd --name gemma4-moe \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:gemma4 \
--model google/gemma-4-26B-A4B-it \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 --port 8000
Docker (AMD MI300X/MI325X/MI350X/MI355X)
docker run -itd --name gemma4-rocm \
--ipc=host --network=host --privileged \
--cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
--group-add=video --cap-add=SYS_PTRACE \
--security-opt=seccomp=unconfined --shm-size 16G \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:gemma4 \
--model google/gemma-4-26B-A4B-it \
--host 0.0.0.0 --port 8000
Docker (Cloud TPU — Trillium / Ironwood)
TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:
docker run -itd --name gemma4-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
--model google/gemma-4-26B-A4B-it \
--tensor-parallel-size 8 \
--max-model-len 16384 \
--disable_chunked_mm_input \
--host 0.0.0.0 --port 8000
Trillium requires a 4-chip slice minimum; Ironwood runs on a single chip.
Client Usage
Text Generation
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="google/gemma-4-26B-A4B-it",
messages=[{"role": "user", "content": "Write a poem about the ocean."}],
max_tokens=512, temperature=0.7,
)
print(response.choices[0].message.content)
Image Understanding
response = client.chat.completions.create(
model="google/gemma-4-26B-A4B-it",
messages=[{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Describe this image in detail."},
]}],
max_tokens=1024,
)
Thinking Mode
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 16384 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--chat-template examples/tool_chat_template_gemma4.jinja
Enable per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
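The toggle rides on the OpenAI client's extra_body passthrough. A minimal sketch, assuming the server launch above (the split of thoughts into reasoning_content is the reasoning parser's usual behavior, stated here as an assumption):

```python
# Hedged sketch: toggling thinking per request via chat_template_kwargs.
extra_body = {"chat_template_kwargs": {"enable_thinking": True}}

request = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [{"role": "user", "content": "How many primes are below 30?"}],
    "extra_body": extra_body,
}
# Against a live server with --reasoning-parser gemma4, the thought channel
# is typically surfaced as response.choices[0].message.reasoning_content,
# while the final answer stays in .content.
```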
Dynamic Vision Resolution
Supported values: 70, 140, 280 (default), 560, 1120 tokens/image.
vllm serve google/gemma-4-26B-A4B-it \
--mm-processor-kwargs '{"max_soft_tokens": 560}'
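Each image consumes up to its token budget from the context window, so the budget and --limit-mm-per-prompt together bound multimodal overhead. A quick back-of-the-envelope check, using the budget values above and the limits from the full-featured launch (image=4, --max-model-len 16384):

```python
# Context-budget arithmetic for the vision token settings above.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)  # tokens per image
MAX_MODEL_LEN = 16384                          # from the serve example
IMAGES_PER_PROMPT = 4                          # --limit-mm-per-prompt image=4

for budget in SUPPORTED_BUDGETS:
    image_tokens = budget * IMAGES_PER_PROMPT
    print(f"{budget:>5} tok/image -> {image_tokens:>5} image tokens, "
          f"{MAX_MODEL_LEN - image_tokens:>5} left for text")
```

At the maximum setting, four images alone take 4480 tokens, so higher budgets are best paired with a larger --max-model-len or a lower image limit.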
Configuration Tips
- Set --max-model-len to match your workload.
- --gpu-memory-utilization 0.90-0.95 maximizes KV cache.
- Text-only workloads: --limit-mm-per-prompt image=0,audio=0.
- --async-scheduling improves throughput.
- FP8 KV cache (--kv-cache-dtype fp8) saves ~50% KV memory.
- For MoE, TEP (tensor-expert parallelism) and DEP (data-expert parallelism) strategies scale better than pure TP at large node counts.
Throughput vs Latency
| Goal | TP | --max-num-seqs | Notes |
|---|---|---|---|
| Max throughput | 1-2 | 256-512 | Best tok/s per GPU |
| Min latency | 4-8 | 8-16 | Best TTFT/TPOT |
| Balanced | 2 | 128 | Mixed workloads |