vLLM Recipes: NVIDIA

nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16

NVIDIA Nemotron-Nano 12B vision-language model with video support and Efficient Video Sampling (EVS)

Dense · 12B · 131,072 ctx · vLLM 0.11.1+ · multimodal

Overview

Nemotron-Nano-12B-v2-VL is a vision-language model with image and video support. It includes Efficient Video Sampling (EVS) to prune video tokens and reduce compute. The model is available in BF16, FP8, and NVFP4 (QAD) precisions.

Prerequisites

  • Hardware: 1x GPU (A100/H100/B200, etc.)
  • vLLM: 0.11.0 does NOT include this model; use 0.11.1+, the latest nightly, or install from source
  • DGX Spark: use nvcr.io/nvidia/vllm:25.12.post1-py3

Install vLLM

docker pull vllm/vllm-openai:nightly-8bff831f0aa239006f34b721e63e1340e3472067
# or for DGX Spark:
docker pull nvcr.io/nvidia/vllm:25.12.post1-py3

Launch command

export VLLM_VIDEO_LOADER_BACKEND=opencv
export CHECKPOINT_PATH="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
export CUDA_VISIBLE_DEVICES=0

python3 -m vllm.entrypoints.openai.api_server \
  --model ${CHECKPOINT_PATH} \
  --trust-remote-code \
  --media-io-kwargs '{"video": {"fps": 2, "num_frames": 128}}' \
  --max-model-len 131072 \
  --data-parallel-size 1 \
  --port 5566 \
  --allowed-local-media-path / \
  --video-pruning-rate 0.75 \
  --served-model-name "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"

Flags:

  • --max-model-len: reduce for shorter contexts to save memory
  • --allowed-local-media-path <root>: limit local-file access
  • --video-pruning-rate <0..1>: EVS compression; higher prunes more video tokens
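
To build intuition for `--video-pruning-rate`, here is a back-of-the-envelope token count. This is only a sketch: `tokens_per_frame` is an assumed illustrative value, not the model's actual figure, which depends on the vision encoder and input resolution.

```python
# Rough effect of EVS pruning on the video token budget.
# ASSUMPTION: tokens_per_frame = 256 is illustrative, not the model's real value.
num_frames = 128          # matches --media-io-kwargs num_frames
tokens_per_frame = 256    # assumed; depends on the vision encoder
pruning_rate = 0.75       # matches --video-pruning-rate

raw_tokens = num_frames * tokens_per_frame
kept_tokens = int(raw_tokens * (1 - pruning_rate))
print(raw_tokens, kept_tokens)  # 32768 8192
```

Under these assumptions, pruning at 0.75 keeps a quarter of the video tokens, which is why raising the rate is an effective OOM mitigation.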

Client Usage

Describe a video:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5566/v1", api_key="<ignored>")
completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the video."},
            {"type": "video_url", "video_url": {"url": "file:///path/to/video.mp4"}},
        ],
    }],
)
print(completion.choices[0].message.content)
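
As an alternative to `file://` URLs (which require `--allowed-local-media-path`), media can be sent inline as a base64 data URL. The helper below is a sketch; whether the server accepts data URLs for a given media type depends on your vLLM version.

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime: str) -> str:
    """Encode a local file as a base64 data URL for an OpenAI-style request."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Example: attach a local image inline instead of via a file:// path.
# image_part = {
#     "type": "image_url",
#     "image_url": {"url": to_data_url("/path/to/frame.jpg", "image/jpeg")},
# }
```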

Offline / LLM API

from vllm import LLM, SamplingParams

llm = LLM(
    "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    trust_remote_code=True,
    max_model_len=2**17,  # 131,072
    allowed_local_media_path="/",
    video_pruning_rate=0.75,
    media_io_kwargs=dict(video=dict(fps=2, num_frames=128)),
)
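
A rough way to reason about how many frames the `media_io_kwargs` above contribute per video. This is a sketch under an assumed semantics: the loader samples at the requested `fps` and `num_frames` acts as a cap; check your vLLM version for the exact behavior.

```python
def sampled_frames(duration_s: float, fps: float = 2, num_frames: int = 128) -> int:
    """ASSUMED semantics: sample at `fps`, capped at `num_frames` total."""
    return min(num_frames, int(duration_s * fps))

print(sampled_frames(30))   # 60  -> a short clip is fps-limited
print(sampled_frames(600))  # 128 -> a long video hits the num_frames cap
```

Under this assumption, long videos all cost the same frame budget, while short clips scale with duration.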

Troubleshooting

  • Set VLLM_VIDEO_LOADER_BACKEND=opencv (required for video inputs).
  • OOM: lower --max-model-len or increase --video-pruning-rate.
