
Qwen/Qwen3-ASR-1.7B

Speech-to-text model supporting 11 languages, multiple accents, and singing voice with customizable text-context prompting.

dense · 2.3B parameters · 65,536 ctx · vLLM 0.12.0+ · multimodal

Overview

Qwen3-ASR is a speech-to-text model that achieves accurate and robust recognition across 11 languages and multiple accents. It supports prompting the model with text context in any format to produce customized ASR results and performs well on singing-voice recognition. This guide demonstrates how to deploy Qwen3-ASR efficiently with vLLM.

Prerequisites

Install vLLM with audio dependencies:

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
uv pip install "vllm[audio]"

Launching with vLLM

vllm serve Qwen/Qwen3-ASR-1.7B

Client Usage

Chat Completions (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[{
        "role": "user",
        "content": [{
            "type": "audio_url",
            "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"},
        }],
    }],
)
print(response.choices[0].message.content)

Transcription API

import httpx
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
audio_file = httpx.get(audio_url).content

transcription = client.audio.transcriptions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    # Pass a (filename, bytes) tuple so the server can infer the audio format
    file=("asr_en.wav", audio_file),
)
print(transcription.text)

cURL

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-ASR-1.7B",
    "messages": [
      {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"}}
      ]}
    ]
  }'

Offline Inference

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(model="Qwen/Qwen3-ASR-1.7B")
audio_asset = AudioAsset("winning_call")

conversation = [{
    "role": "user",
    "content": [{"type": "audio_url", "audio_url": {"url": audio_asset.url}}],
}]

# Near-greedy decoding keeps transcripts deterministic
sampling_params = SamplingParams(temperature=0.01, max_tokens=256)
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
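llm.chat also accepts a list of conversations, so transcribing several clips in one batched call is straightforward. A minimal sketch (the helper name is ours):

```python
def build_batch(audio_urls: list[str]) -> list[list[dict]]:
    """Turn each audio URL into a single-turn conversation, ready to
    pass as a batch to llm.chat."""
    return [
        [{"role": "user",
          "content": [{"type": "audio_url", "audio_url": {"url": url}}]}]
        for url in audio_urls
    ]

# batch = build_batch([url_a, url_b])
# outputs = llm.chat(batch, sampling_params=sampling_params)
# transcripts = [o.outputs[0].text for o in outputs]
```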

Troubleshooting

  • Ensure the vllm[audio] extras are installed; audio requests will fail without them.
  • Use the nightly wheel until Qwen3-ASR support lands in a stable vLLM release.
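The first point can be checked programmatically by probing for the audio-decoding libraries the extras pull in. Treat librosa and soundfile here as an assumption; the members of the vllm[audio] extra can change between vLLM versions.

```python
import importlib.util


def audio_extras_installed() -> bool:
    """Return True if the assumed vllm[audio] dependencies are importable.

    NOTE: librosa/soundfile are assumed members of the extra; verify
    against the extras list of your installed vLLM version.
    """
    return all(
        importlib.util.find_spec(mod) is not None
        for mod in ("librosa", "soundfile")
    )
```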
