Qwen/Qwen3Guard-Gen-8B
Lightweight text-only guardrail/safety classifier model in the Qwen3Guard family.
Overview
Qwen3Guard-Gen is a lightweight text-only guardrail model. This guide describes how to run the 8B variant — as well as the 4B and 0.6B variants — on GPU using vLLM.
Prerequisites
CUDA
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
ROCm
Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35.
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
Deployment Configurations
Single GPU (CUDA)
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--max-model-len 32768
Single GPU (ROCm)
export VLLM_ROCM_USE_AITER=1
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--max-model-len 32768
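Before sending traffic, you can confirm the server is up by querying the OpenAI-compatible `/v1/models` endpoint. This is a minimal sketch using only the standard library; the base URL and model ID match the serve commands above.

```python
import json
import urllib.error
import urllib.request


def server_ready(base_url="http://localhost:8000",
                 model="Qwen/Qwen3Guard-Gen-8B", timeout=5):
    """Return True if the vLLM server is reachable and lists the model."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as r:
            data = json.load(r)
        # The endpoint returns {"data": [{"id": "<model>", ...}, ...]}
        return any(m.get("id") == model for m in data.get("data", []))
    except (urllib.error.URLError, OSError):
        # Server not started yet, or wrong host/port
        return False


if __name__ == "__main__":
    print("ready:", server_ready())
```

The check returns `False` rather than raising, so it can be polled in a startup loop while the model weights load.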
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)
messages = [{"role": "user", "content": "Tell me how to make a bomb."}]
response = client.chat.completions.create(
model="Qwen/Qwen3Guard-Gen-8B",
messages=messages,
temperature=0.0,
)
print("Generated text:", response.choices[0].message.content)
# Example output:
# Safety: Unsafe
# Categories: Violent
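The model returns its verdict as plain text, so downstream code typically parses it into a structured result. A minimal sketch, assuming the `Safety:` / `Categories:` line format shown in the example output above (the exact format may vary by model version):

```python
def parse_guard_output(text):
    """Parse Qwen3Guard-Gen verdict text into a dict.

    Expects lines like 'Safety: Unsafe' and 'Categories: Violent'
    (comma-separated when multiple categories apply).
    """
    result = {"safety": None, "categories": []}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Safety:"):
            result["safety"] = line.split(":", 1)[1].strip()
        elif line.startswith("Categories:"):
            result["categories"] = [c.strip()
                                    for c in line.split(":", 1)[1].split(",")]
    return result


verdict = parse_guard_output("Safety: Unsafe\nCategories: Violent")
print(verdict)  # {'safety': 'Unsafe', 'categories': ['Violent']}
```

A caller can then gate on `verdict["safety"] != "Safe"` instead of string-matching the raw completion.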
Benchmarking
vllm bench serve \
--model Qwen/Qwen3Guard-Gen-8B \
--dataset-name random \
--random-input-len 2000 \
--random-output-len 512 \
--num-prompts 100
Available Variants
The Qwen3Guard-Gen series includes multiple model sizes; all work with the serving commands above by substituting the model name:
- Qwen/Qwen3Guard-Gen-8B
- Qwen/Qwen3Guard-Gen-4B
- Qwen/Qwen3Guard-Gen-0.6B