vLLM Recipes: GLM (Z-AI)

zai-org/GLM-Image

Hybrid autoregressive + diffusion image generation model — text-to-image and image-to-image with strong text rendering and knowledge-intensive generation

dense · 16B · 4,096 ctx · vLLM 0.11.0+ · omni

Overview

GLM-Image is an image generation model with a hybrid architecture:

  • Autoregressive Generator (9B): initialized from GLM-4-9B-0414 with an expanded vocabulary for visual tokens. Produces a compact encoding (~256 tokens), then expands to 1K–4K tokens corresponding to 1K–2K resolution images.
  • Diffusion Decoder (7B): single-stream DiT that decodes latents into pixels. Includes a Glyph Encoder text module for accurate in-image text rendering.
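As a back-of-envelope check on those token counts, a 32-pixel patch grid reproduces the stated figures exactly (the patch size here is an assumption inferred from the numbers above, not a documented detail of the tokenizer):

```python
def visual_tokens(width, height, patch=32):
    # Hypothetical 32-pixel patch grid, chosen because it reproduces the
    # token counts stated above; the real tokenizer details may differ.
    return (width // patch) * (height // patch)

print(visual_tokens(1024, 1024))  # 1024 tokens ("1K") for a 1K x 1K image
print(visual_tokens(2048, 2048))  # 4096 tokens ("4K") for a 2K x 2K image
```

This also explains the divisible-by-32 resolution constraint noted under Troubleshooting.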

Served via vLLM-Omni for OpenAI-compatible online inference.

Key Capabilities

  • Text-to-image and image-to-image (editing, style transfer, identity-preserving)
  • Exceptional text rendering inside generated images
  • Strong knowledge-intensive generation

Prerequisites

  • vLLM version: 0.11.0 or later, with the vllm-omni extension
  • Transformers: >= 5.0.0 (install from source if no matching release is available)
  • Hardware: single H100-class GPU (approx. 33 GB for weights)
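The weight figure is consistent with 16B parameters at 2 bytes each (bf16/fp16); a quick estimate, not accounting for activations or the KV cache:

```python
# Back-of-envelope weight memory: 9B (AR generator) + 7B (DiT decoder)
# parameters at 2 bytes each (bf16/fp16), in decimal GB. Activations and
# the KV cache are extra, which is why an 80 GB H100-class GPU is comfortable.
params = (9 + 7) * 1e9
weights_gb = params * 2 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~32 GB, in line with the ~33 GB figure above
```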

Install Dependencies

uv venv --python 3.12 --seed
source .venv/bin/activate

uv pip install -U vllm --torch-backend auto
uv pip install vllm-omni

pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git
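To confirm what actually got installed, a small stdlib-only check can query distribution metadata without importing the (heavy) packages; the distribution names below are assumptions about how these projects register themselves:

```python
from importlib import metadata

def installed_versions(packages):
    # Look up distribution versions without importing the packages themselves.
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # not installed
    return versions

print(installed_versions(["vllm", "vllm-omni", "transformers", "diffusers"]))
```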

Online Serving

vllm serve zai-org/GLM-Image --omni
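Loading the weights takes a while, so client scripts should wait for the server before sending requests. A minimal stdlib poll of the OpenAI-compatible /v1/models endpoint (a sketch, not part of vLLM itself):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url, timeout_s=120, interval_s=2):
    # Poll the OpenAI-compatible /v1/models endpoint until it answers.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5):
                return True  # server is up and responding
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)  # not ready yet; retry
    return False
```

For example, `wait_for_server("http://localhost:8000")` returns True once `vllm serve` finishes loading.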

Client Usage

OpenAI SDK (Text-to-Image)

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-Image",
    messages=[{"role": "user", "content": "A beautiful landscape painting with mountains and a lake at sunset"}],
)

image_url = response.choices[0].message.content[0].image_url.url
image_data = base64.b64decode(image_url.split(",")[1])
with open("output.png", "wb") as f:
    f.write(image_data)
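The data-URL handling above can be factored into a small helper that validates the header before decoding; a plain-Python sketch, independent of the server:

```python
import base64

def decode_data_url(data_url):
    # A data URL looks like "data:image/png;base64,<payload>".
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError(f"unexpected data URL header: {header!r}")
    return base64.b64decode(payload)
```

Equivalent to `split(",")[1]` above, but it fails loudly on malformed input instead of silently decoding garbage.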

cURL (Text-to-Image)

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-Image",
    "messages": [{"role": "user", "content": "A beautiful landscape painting"}]
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > output.png

Image-to-Image

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("input.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="zai-org/GLM-Image",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
            {"type": "text", "text": "Replace the background with a sunset beach scene"}
        ]
    }],
)

image_url = response.choices[0].message.content[0].image_url.url
image_data = base64.b64decode(image_url.split(",")[1])
with open("output.png", "wb") as f:
    f.write(image_data)
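Building the input data URL by hand is easy to get subtly wrong for non-PNG inputs; a small helper that guesses the MIME type from the file extension (illustrative, not part of the recipe):

```python
import base64
import mimetypes

def file_to_data_url(path):
    # Guess the MIME type from the extension and base64-encode the contents.
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"cannot guess MIME type for {path!r}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"
```

With this, the image part of the message becomes `{"type": "image_url", "image_url": {"url": file_to_data_url("input.png")}}`.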

Offline Inference

# Text to Image
cd examples/offline_inference/text_to_image
python3 text_to_image.py --model zai-org/GLM-Image --output t2i_output.png

# Image to Image
cd examples/offline_inference/image_to_image
wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/omni-assets/qwen-bear.png
python3 image_to_image.py --model zai-org/GLM-Image --image qwen-bear.png --output i2i_output.png

Troubleshooting

  • Resolution errors: Target image dimensions must be divisible by 32.
  • Text rendering: Wrap text that should appear in the image with quotation marks in the prompt.
  • Output stability: Default temperature=0.9, top_p=0.75. Higher temperature gives more diverse outputs but may reduce stability.
  • Transformers version: Requires transformers >= 5.0.0.
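To satisfy the divisible-by-32 constraint programmatically, one option is to snap requested dimensions down to the nearest valid multiple before sending a request (`snap_to_multiple` is a hypothetical helper name, not part of vLLM):

```python
def snap_to_multiple(value, multiple=32):
    # Floor to the nearest multiple so the result never exceeds the request,
    # with a lower bound of one full multiple.
    return max(multiple, (value // multiple) * multiple)

print(snap_to_multiple(1000))  # 992
print(snap_to_multiple(1024))  # 1024
```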

References