zai-org/GLM-ASR-Nano-2512
Open-source speech recognition model (~2B) with strong dialect support (Cantonese and others) and robust low-volume speech transcription
Overview
GLM-ASR-Nano-2512 is an open-source automatic speech recognition model with 1.5B active parameters (2B total). It outperforms OpenAI's Whisper Large V3 on multiple benchmarks while remaining compact enough for single-GPU deployment.
Key Capabilities
- Dialect support: Beyond standard Mandarin and English, strong on Cantonese (粤语) and other Chinese dialects.
- Low-volume speech: Specifically trained for "whisper/quiet speech" scenarios.
- SOTA accuracy: Lowest average error rate (4.10) among comparable open-source models, strong on Wenet Meeting, Aishell-1, and similar Chinese benchmarks.
Prerequisites
- vLLM version: >= 0.14.1 (with the [audio] extras)
- Transformers: install from source for the latest model support
Install Dependencies
uv venv
source .venv/bin/activate
uv pip install git+https://github.com/huggingface/transformers.git
uv pip install -U "vllm[audio]" --torch-backend auto
Launching the Server
vllm serve zai-org/GLM-ASR-Nano-2512
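Once `vllm serve` finishes loading the model, it exposes an OpenAI-compatible API. A quick way to confirm it is up is to poll the standard /v1/models route; a minimal sketch using only the Python standard library (the function name is illustrative):

```python
import urllib.error
import urllib.request


def server_ready(base_url: str = "http://localhost:8000") -> bool:
    """Return True once the vLLM server answers on its OpenAI-compatible
    /v1/models route, False while it is still starting (or not running)."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

This avoids sending requests before model weights are loaded; the clients below assume the server is already ready.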
Client Usage
OpenAI SDK (Audio URL)
import base64
import httpx
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Download the sample clip and base64-encode it for the chat API
audio_url = "https://github.com/zai-org/GLM-ASR/raw/main/examples/example_en.wav"
audio_data = base64.b64encode(httpx.get(audio_url).content).decode("utf-8")
response = client.chat.completions.create(
    model="zai-org/GLM-ASR-Nano-2512",
    messages=[{
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {"data": audio_data, "format": "wav"},
        }],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)
Transcribe Endpoint
import httpx
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Download the sample clip; the file argument takes a (filename, bytes) tuple
audio_file = httpx.get("https://github.com/zai-org/GLM-ASR/raw/main/examples/example_en.wav").content
response = client.audio.transcriptions.create(
    model="zai-org/GLM-ASR-Nano-2512",
    file=("audio.wav", audio_file),
)
print(response.text)
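If the audio lives on disk rather than at a URL, the same (filename, bytes) tuple can be built from a local path; a small convenience sketch (the helper name is illustrative, not part of the SDK):

```python
from pathlib import Path


def as_upload(path: str) -> tuple[str, bytes]:
    """Read a local audio file into the (filename, bytes) tuple that
    client.audio.transcriptions.create accepts for its file argument."""
    p = Path(path)
    return (p.name, p.read_bytes())
```

Usage: `client.audio.transcriptions.create(model="zai-org/GLM-ASR-Nano-2512", file=as_upload("your_audio.wav"))`.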
cURL (Transcribe)
curl http://localhost:8000/v1/audio/transcriptions \
-H "Authorization: Bearer EMPTY" \
-F "model=zai-org/GLM-ASR-Nano-2512" \
-F "file=@your_audio.wav"
Local Audio File (chat API)
import base64
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("your_audio.mp3", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
    model="zai-org/GLM-ASR-Nano-2512",
    messages=[{
        "role": "user",
        "content": [{"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}}],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)
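The base64-and-wrap step above repeats for every file, so it can be factored into a helper. The extension-to-format mapping below is an assumption drawn from the formats this guide mentions, not an exhaustive list:

```python
import base64
from pathlib import Path

# Assumed mapping from file extension to the chat API's input_audio format
# field, based on the wav/mp3/flac formats named in this guide.
_AUDIO_FORMATS = {".wav": "wav", ".mp3": "mp3", ".flac": "flac"}


def audio_part(path: str) -> dict:
    """Base64-encode a local audio file into an input_audio content part
    suitable for the messages list in client.chat.completions.create."""
    p = Path(path)
    fmt = _AUDIO_FORMATS.get(p.suffix.lower())
    if fmt is None:
        raise ValueError(f"unsupported audio extension: {p.suffix}")
    data = base64.b64encode(p.read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

With this, a request body becomes `messages=[{"role": "user", "content": [audio_part("your_audio.mp3")]}]`.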
Troubleshooting
- Transformers version: requires transformers >= 5.0.0 for best compatibility.
- Audio formats: wav, mp3, flac, and other common formats are supported.