vLLM Recipes: NVIDIA

nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16

NVIDIA Nemotron-Nano 12B vision-language model with video support and Efficient Video Sampling (EVS)

Dense · 12B · 131,072 ctx · vLLM 0.11.1+ · multimodal

Overview

Nemotron-Nano-12B-v2-VL is a vision-language model with image and video support. It includes Efficient Video Sampling (EVS) to prune video tokens and reduce compute. The model is available in BF16, FP8, and NVFP4 (QAD) precisions.

Prerequisites

  • Hardware: 1x GPU (A100/H100/B200, etc.)
  • vLLM: 0.11.0 does NOT include this model; use 0.11.1+, the latest nightly, or install from source
  • DGX Spark: use nvcr.io/nvidia/vllm:25.12.post1-py3

Install vLLM

docker pull vllm/vllm-openai:nightly-8bff831f0aa239006f34b721e63e1340e3472067
# or for DGX Spark:
docker pull nvcr.io/nvidia/vllm:25.12.post1-py3

Launch command

export VLLM_VIDEO_LOADER_BACKEND=opencv
export CHECKPOINT_PATH="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
export CUDA_VISIBLE_DEVICES=0

python3 -m vllm.entrypoints.openai.api_server \
  --model ${CHECKPOINT_PATH} \
  --trust-remote-code \
  --media-io-kwargs '{"video": {"fps": 2, "num_frames": 128}}' \
  --max-model-len 131072 \
  --data-parallel-size 1 \
  --port 5566 \
  --allowed-local-media-path / \
  --video-pruning-rate 0.75 \
  --served-model-name "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"

Flags:

  • --max-model-len: reduce for shorter contexts to save memory
  • --allowed-local-media-path <root>: limit local-file access
  • --video-pruning-rate <0..1>: EVS compression; higher prunes more video tokens
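
To build intuition for `--video-pruning-rate`, here is a back-of-the-envelope token count. This is only a sketch: `tokens_per_frame` is an assumed illustrative value, not the model's actual figure, which depends on the vision encoder and input resolution.

```python
# Rough effect of EVS pruning on the video token budget.
# ASSUMPTION: tokens_per_frame = 256 is illustrative, not the model's real value.
num_frames = 128          # matches --media-io-kwargs num_frames
tokens_per_frame = 256    # assumed; depends on the vision encoder
pruning_rate = 0.75       # matches --video-pruning-rate

raw_tokens = num_frames * tokens_per_frame
kept_tokens = int(raw_tokens * (1 - pruning_rate))
print(raw_tokens, kept_tokens)  # 32768 8192
```

Under these assumptions, pruning at 0.75 keeps a quarter of the video tokens, which is why raising the rate is an effective OOM mitigation.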

Client Usage

Describe a video:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5566/v1", api_key="<ignored>")
completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the video."},
            {"type": "video_url", "video_url": {"url": "file:///path/to/video.mp4"}},
        ],
    }],
)
print(completion.choices[0].message.content)
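
As an alternative to `file://` URLs (which require `--allowed-local-media-path`), media can be sent inline as a base64 data URL. The helper below is a sketch; whether the server accepts data URLs for a given media type depends on your vLLM version.

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime: str) -> str:
    """Encode a local file as a base64 data URL for an OpenAI-style request."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Example: attach a local image inline instead of via a file:// path.
# image_part = {
#     "type": "image_url",
#     "image_url": {"url": to_data_url("/path/to/frame.jpg", "image/jpeg")},
# }
```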

Offline / LLM API

from vllm import LLM, SamplingParams

llm = LLM(
    "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    trust_remote_code=True,
    max_model_len=2**17,  # 131,072
    allowed_local_media_path="/",
    video_pruning_rate=0.75,
    media_io_kwargs=dict(video=dict(fps=2, num_frames=128)),
)
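
A rough way to reason about how many frames the `media_io_kwargs` above contribute per video. This is a sketch under an assumed semantics: the loader samples at the requested `fps` and `num_frames` acts as a cap; check your vLLM version for the exact behavior.

```python
def sampled_frames(duration_s: float, fps: float = 2, num_frames: int = 128) -> int:
    """ASSUMED semantics: sample at `fps`, capped at `num_frames` total."""
    return min(num_frames, int(duration_s * fps))

print(sampled_frames(30))   # 60  -> a short clip is fps-limited
print(sampled_frames(600))  # 128 -> a long video hits the num_frames cap
```

Under this assumption, long videos all cost the same frame budget, while short clips scale with duration.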

Troubleshooting

  • Set VLLM_VIDEO_LOADER_BACKEND=opencv (required for video inputs).
  • OOM: lower --max-model-len or increase --video-pruning-rate.
