vLLM Installation for High-Performance LLM Inference
vLLM is a high-throughput LLM inference engine built around PagedAttention, which dramatically improves GPU memory utilization and enables significantly higher request throughput compared to naive implementations. This guide covers installing vLLM, loading models, using the OpenAI-compatible API, applying quantization for memory reduction, and tuning GPU memory configuration.
Prerequisites
- Ubuntu 20.04/22.04 with NVIDIA GPU
- CUDA 11.8 or 12.x installed
- At least 16GB VRAM for 7B models (A10, RTX 3090/4090, A100)
- Python 3.9+
- 50GB+ disk space for model weights
Installing vLLM
# Create a dedicated virtual environment
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
# Install vLLM (this installs PyTorch with CUDA support automatically)
pip install --upgrade pip
pip install vllm
# For a specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
# Verify installation
python -c "import vllm; print(vllm.__version__)"
# Verify GPU is accessible
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0)}')"
Serving a Model
vLLM downloads models from Hugging Face automatically:
source ~/vllm-env/bin/activate
# Serve Llama 3.1 8B Instruct (requires HF access token for gated models)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# For public models (no token needed):
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--host 0.0.0.0 \
--port 8000
# For gated models (Llama, Gemma):
export HUGGING_FACE_HUB_TOKEN=hf_your_token_here
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 127.0.0.1 \
--port 8000
# Specify local model directory
python -m vllm.entrypoints.openai.api_server \
--model /data/models/mistral-7b-instruct \
--host 0.0.0.0 \
--port 8000
Download models in advance:
# Use Hugging Face CLI to download to a local directory
pip install huggingface_hub
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
--local-dir /data/models/mistral-7b-instruct
OpenAI-Compatible API
vLLM's server implements the OpenAI API format, so existing OpenAI clients and tooling work against it:
# List available models
curl http://localhost:8000/v1/models
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [
{"role": "system", "content": "You are a helpful Linux sysadmin."},
{"role": "user", "content": "How do I check disk usage in Linux?"}
],
"temperature": 0.7,
"max_tokens": 500
}'
# Streaming completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Write a Python hello world"}],
"stream": true
}'
# Text completion (legacy)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "The capital of France is",
"max_tokens": 50
}'
Using the OpenAI Python client pointing to vLLM:
from openai import OpenAI
# Point to your vLLM server
client = OpenAI(
api_key="not-needed", # vLLM doesn't require a real key by default
base_url="http://localhost:8000/v1"
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3",
messages=[
{"role": "user", "content": "Explain Docker in 2 sentences"}
],
temperature=0.7,
max_tokens=200
)
print(response.choices[0].message.content)
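Under the hood, `"stream": true` returns Server-Sent Events in the OpenAI format: each event is a `data: {json}` line, and the stream ends with a `data: [DONE]` sentinel. A stdlib-only sketch of pulling the incremental token text out of each event (the sample payload below is illustrative, shaped like what the server streams back):

```python
import json

def delta_from_sse(line: str):
    """Extract the incremental text from one OpenAI-format SSE line.

    Returns None for non-data lines and for the final [DONE] sentinel.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    # Each chunk carries a partial message under choices[0].delta
    return event["choices"][0].get("delta", {}).get("content")

# Illustrative chunk in the OpenAI streaming shape
sample = 'data: {"choices": [{"delta": {"content": "Hello"}, "index": 0}]}'
print(delta_from_sse(sample))             # -> Hello
print(delta_from_sse("data: [DONE]"))     # -> None
```

The same parsing applies whether you read the stream with `curl`, `requests`, or the OpenAI client (which does this for you).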
Quantization for Lower Memory Usage
Quantization reduces model memory requirements significantly:
# AWQ quantization (best quality/size tradeoff)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--host 0.0.0.0 \
--port 8000
# GPTQ quantization
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-13B-GPTQ \
--quantization gptq \
--host 0.0.0.0 \
--port 8000
# FP8 quantization (NVIDIA Hopper/Ada GPUs, e.g. H100, RTX 4090)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 \
--host 0.0.0.0 \
--port 8000
Memory requirements with quantization:
| Model | FP16 | AWQ/GPTQ (4-bit) |
|---|---|---|
| 7B | ~14GB | ~5GB |
| 13B | ~26GB | ~9GB |
| 70B | ~140GB | ~40GB |
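The FP16 column follows directly from 2 bytes per parameter; 4-bit weights are 0.5 bytes each, with the gap to the table's figures (~5GB vs. a theoretical ~3.5GB for 7B) coming from quantization scales, non-quantized layers, and runtime overhead. A rough weight-only estimator (a sketch; real usage also needs KV cache and activation memory on top of this):

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate; excludes KV cache and activations."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, matching the table's rough figures

print(f"7B  FP16:  ~{weight_memory_gb(7, 16):.0f} GB")   # ~14 GB
print(f"7B  4-bit: ~{weight_memory_gb(7, 4):.1f} GB")    # ~3.5 GB + overhead
print(f"70B FP16:  ~{weight_memory_gb(70, 16):.0f} GB")  # ~140 GB
```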
Multi-GPU Serving with Tensor Parallelism
# Use 2 GPUs with tensor parallelism for a 13B model
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-13b-chat-hf \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
# Use 4 GPUs for 70B models
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--port 8000
# Verify GPU utilization
watch -n 1 nvidia-smi
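`--tensor-parallel-size` must evenly divide the model's attention head count, since vLLM shards attention heads across GPUs. A small helper to pick the largest usable size for a given GPU count (illustrative; the head counts in the comment are typical values for common architectures, not read from any config):

```python
def pick_tensor_parallel_size(num_gpus: int, num_attention_heads: int) -> int:
    """Largest GPU count <= num_gpus that evenly divides the head count.

    vLLM shards attention heads across GPUs, so the tensor-parallel size
    must divide num_attention_heads exactly.
    """
    for size in range(num_gpus, 0, -1):
        if num_attention_heads % size == 0:
            return size
    return 1

# Typical head counts (assumed): 32 for 7B/8B, 40 for 13B, 64 for 70B
print(pick_tensor_parallel_size(4, 32))  # -> 4
print(pick_tensor_parallel_size(3, 40))  # -> 2 (40 heads don't split 3 ways)
```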
Running as a Systemd Service
sudo tee /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM OpenAI-Compatible API Server
After=network.target
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
Environment="HUGGING_FACE_HUB_TOKEN=hf_your_token"
ExecStart=/home/ubuntu/vllm-env/bin/python \
-m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--host 127.0.0.1 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90
Restart=on-failure
RestartSec=30
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo journalctl -u vllm -f
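Model loading can take a minute or more after `systemctl start`, so it helps to poll vLLM's `/health` endpoint before routing traffic to the service. A stdlib-only readiness check (the URL assumes the host/port from the service file above):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout_s: float = 120.0,
                    interval_s: float = 2.0) -> bool:
    """Poll a health URL until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    ready = wait_for_server("http://127.0.0.1:8000/health")
    print("ready" if ready else "timed out")
```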
Python Client and Batch Inference
For high-throughput offline batch inference, use the vLLM Python API directly:
from vllm import LLM, SamplingParams
# Load model (downloads from HuggingFace on first run)
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.3",
gpu_memory_utilization=0.90,
max_model_len=4096
)
# Define sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=256
)
# Batch generate (much more efficient than one at a time)
prompts = [
"What is a VPS?",
"Explain Docker in one sentence.",
"What is Nginx used for?",
"How does SSH work?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Output: {generated_text!r}")
print()
Troubleshooting
"CUDA out of memory" on startup
# Reduce GPU memory allocation
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-memory-utilization 0.75 # Default is 0.90
# Reduce max context length
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--max-model-len 2048
Slow first request
# vLLM compiles CUDA kernels and captures CUDA graphs on first startup (~30-60s is normal)
# Subsequent requests are fast
# CUDA graphs are enabled by default; disable them if they cause startup or stability issues
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--enforce-eager  # Run in eager mode, skipping CUDA graph capture
Model download fails
# Check HuggingFace token
echo $HUGGING_FACE_HUB_TOKEN
# Download manually
huggingface-cli download MODEL_NAME --token $HUGGING_FACE_HUB_TOKEN
Server not accessible remotely
# Ensure bound to 0.0.0.0, not 127.0.0.1
ss -tlnp | grep 8000
# Check firewall (open the port only if the API should be public)
sudo ufw allow 8000/tcp
Conclusion
vLLM delivers production-grade LLM inference with significantly higher throughput than alternatives, thanks to PagedAttention's efficient KV cache management. Its OpenAI-compatible API makes it a drop-in replacement for any application already using the OpenAI client library, and support for AWQ/GPTQ quantization enables deploying large models on GPUs with limited VRAM.


