Google/gemma-4-26B-A4B-it
Google's Gemma 4 MoE multimodal model (26B total / 4B active) with 128 fine-grained experts, top-8 routing, thinking mode, and tool-use protocol.
Overview
Gemma 4 26B-A4B is the Mixture-of-Experts member of Google's Gemma 4 family — 26B total parameters with only 4B active per token via 128 fine-grained experts and top-8 routing. It supports text + images natively, structured thinking, function calling, and dynamic vision resolution.
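The routing scheme can be sketched in a few lines. This is a hedged, pure-Python illustration of generic top-k expert routing as described above (not the actual Gemma 4 kernels): a router scores all 128 experts for each token, keeps the 8 highest-scoring ones, and softmax-renormalizes their weights.

```python
# Illustrative top-8 routing over 128 experts (not the real implementation).
import math
import random

NUM_EXPERTS, TOP_K = 128, 8

def route(router_logits, top_k=TOP_K):
    """Return (expert_ids, mixing_weights) for one token's router logits."""
    ids = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    m = max(router_logits[i] for i in ids)            # for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in ids]
    z = sum(exps)
    return ids, [e / z for e in exps]                 # weights sum to 1

random.seed(0)
ids, weights = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
print(len(ids), round(sum(weights), 6))  # 8 1.0
```

Only the 8 selected experts' FFNs run for that token, which is why a 26B-parameter model has the per-token compute of a ~4B dense model.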
Key Features
- MoE: 128 fine-grained experts with top-8 routing and custom GELU-activated FFN.
- Multimodal: Text + images natively (video via custom frame-extraction pipeline). Audio is only supported on the smaller E2B/E4B variants.
- Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
- Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
- Function Calling: Custom tool-call protocol with dedicated special tokens.
- Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.
Prerequisites
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)
Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
--extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade
Docker
docker pull vllm/vllm-openai:gemma4 # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130 # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4 # AMD
TPU images are published separately by vllm-project/tpu-inference; see the Trillium / Ironwood tpu-recipes below for the pinned tag.
Deployment Configurations
26B MoE on 1x A100/H100 (BF16)
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
Full-Featured Server Launch
Enables text, image, thinking, and tool calling:
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt image=4 \
--async-scheduling \
--host 0.0.0.0 \
--port 8000
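With --enable-auto-tool-choice and the gemma4 tool-call parser, the server speaks the standard OpenAI tools API. A minimal sketch of a tool-calling request follows; the get_weather function and its JSON schema are illustrative placeholders, not part of the model or server.

```python
# Hedged sketch: building an OpenAI-style tool-calling request for the
# full-featured server above. `get_weather` is a made-up example tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
# Against a live server:
#   response = client.chat.completions.create(**request)
# and any emitted calls appear in response.choices[0].message.tool_calls.
```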
Docker (NVIDIA)
docker run -itd --name gemma4-moe \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:gemma4 \
--model google/gemma-4-26B-A4B-it \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 --port 8000
Docker (AMD MI300X/MI325X/MI350X/MI355X)
docker run -itd --name gemma4-rocm \
--ipc=host --network=host --privileged \
--cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
--group-add=video --cap-add=SYS_PTRACE \
--security-opt=seccomp=unconfined --shm-size 16G \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:gemma4 \
--model google/gemma-4-26B-A4B-it \
--host 0.0.0.0 --port 8000
Docker (Cloud TPU — Trillium / Ironwood)
TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:
docker run -itd --name gemma4-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
--model google/gemma-4-26B-A4B-it \
--tensor-parallel-size 8 \
--max-model-len 16384 \
--disable_chunked_mm_input \
--host 0.0.0.0 --port 8000
Trillium requires a 4-chip slice minimum; Ironwood runs on a single chip.
Client Usage
Text Generation
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="google/gemma-4-26B-A4B-it",
messages=[{"role": "user", "content": "Write a poem about the ocean."}],
max_tokens=512, temperature=0.7,
)
print(response.choices[0].message.content)
Image Understanding
response = client.chat.completions.create(
model="google/gemma-4-26B-A4B-it",
messages=[{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Describe this image in detail."},
]}],
max_tokens=1024,
)
Thinking Mode
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 16384 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--chat-template examples/tool_chat_template_gemma4.jinja
Enable per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
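The toggle rides on the OpenAI client's extra_body passthrough. A minimal sketch, assuming the server launch above (the split of thoughts into reasoning_content is the reasoning parser's usual behavior, stated here as an assumption):

```python
# Hedged sketch: toggling thinking per request via chat_template_kwargs.
extra_body = {"chat_template_kwargs": {"enable_thinking": True}}

request = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [{"role": "user", "content": "How many primes are below 30?"}],
    "extra_body": extra_body,
}
# Against a live server with --reasoning-parser gemma4, the thought channel
# is typically surfaced as response.choices[0].message.reasoning_content,
# while the final answer stays in .content.
```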
Dynamic Vision Resolution
Supported values: 70, 140, 280 (default), 560, 1120 tokens/image.
vllm serve google/gemma-4-26B-A4B-it \
--mm-processor-kwargs '{"max_soft_tokens": 560}'
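Each image consumes up to its token budget from the context window, so the budget and --limit-mm-per-prompt together bound multimodal overhead. A quick back-of-the-envelope check, using the budget values above and the limits from the full-featured launch (image=4, --max-model-len 16384):

```python
# Context-budget arithmetic for the vision token settings above.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)  # tokens per image
MAX_MODEL_LEN = 16384                          # from the serve example
IMAGES_PER_PROMPT = 4                          # --limit-mm-per-prompt image=4

for budget in SUPPORTED_BUDGETS:
    image_tokens = budget * IMAGES_PER_PROMPT
    print(f"{budget:>5} tok/image -> {image_tokens:>5} image tokens, "
          f"{MAX_MODEL_LEN - image_tokens:>5} left for text")
```

At the maximum setting, four images alone take 4480 tokens, so higher budgets are best paired with a larger --max-model-len or a lower image limit.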
Configuration Tips
- Set --max-model-len to match your workload.
- --gpu-memory-utilization 0.90-0.95 maximizes KV cache.
- Text-only workloads: --limit-mm-per-prompt image=0,audio=0.
- --async-scheduling improves throughput.
- FP8 KV cache (--kv-cache-dtype fp8) saves ~50% KV memory.
- For MoE, TEP (tensor-expert parallelism) and DEP (data-expert parallelism) strategies scale better than pure TP at large node counts.
Throughput vs Latency
| Goal | TP | --max-num-seqs | Notes |
|---|---|---|---|
| Max throughput | 1-2 | 256-512 | Best tok/s per GPU |
| Min latency | 4-8 | 8-16 | Best TTFT/TPOT |
| Balanced | 2 | 128 | Mixed workloads |