
meta-llama/Llama-3.1-8B-Instruct

Meta's Llama 3.1 8B dense instruction-tuned language model with 128K context

Architecture: dense · Parameters: 8B · Context: 131,072 tokens · Requires: vLLM 0.6.0+ · Modality: text

Overview

Llama 3.1 Instruct is Meta's instruction-tuned language model family. The 8B dense variant is lightweight and ideal for single-GPU deployment, with 128K context support. A 70B variant is also available (see related recipes).

Prerequisites

  • Hardware: 1x GPU with >=16 GB VRAM (e.g. A100, L40S, H100, H200)
  • vLLM >= 0.6.0
  • CUDA Driver compatible with your vLLM version
  • Docker with NVIDIA Container Toolkit (recommended)
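If you go the Docker route, a typical invocation looks like the sketch below. It assumes the official vllm/vllm-openai image and a Hugging Face token with access to the gated Llama 3.1 weights exported as HF_TOKEN; adjust ports and flags to taste.

```shell
# Sketch: run the OpenAI-compatible server in a container
# (assumes the vllm/vllm-openai image and an HF_TOKEN env var)
docker run --gpus all \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```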

Install vLLM

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
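With vLLM installed, the OpenAI-compatible server can be launched with vllm serve. The flags below are illustrative defaults for a single GPU, not requirements:

```shell
# Start the OpenAI-compatible API server (listens on port 8000 by default)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```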


Client Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
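For incremental output, the same endpoint also supports streaming. This sketch assumes a server is already running on localhost:8000 as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# stream=True yields chunks as tokens are generated instead of one final message
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```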

Troubleshooting

Out-of-memory (OOM) on smaller GPUs: lower --max-model-len to shrink the KV cache, or reduce --gpu-memory-utilization (default 0.9) to leave more headroom.
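For example, a more conservative launch on a 16 GB card might look like this (the exact values are illustrative and workload-dependent):

```shell
# Smaller context window and lower memory fraction to avoid OOM on ~16 GB GPUs
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```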
