vLLM Installation for High-Performance LLM Inference
vLLM is a high-throughput LLM inference engine built around PagedAttention, which dramatically improves GPU memory utilization and enables significantly higher request throughput compared to naive implementations. This guide covers installing vLLM, loading models, using the OpenAI-compatible API, applying quantization for memory reduction, and tuning GPU memory configuration.
Prerequisites
- Ubuntu 20.04/22.04 with NVIDIA GPU
- CUDA 11.8 or 12.x installed
- At least 16GB VRAM for 7B models (A10, RTX 3090/4090, A100)
- Python 3.9+
- 50GB+ disk space for model weights
Installing vLLM
# Create a dedicated virtual environment
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
# Install vLLM (this installs PyTorch with CUDA support automatically)
pip install --upgrade pip
pip install vllm
# For a specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
# Verify installation
python -c "import vllm; print(vllm.__version__)"
# Verify GPU is accessible
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0)}')"
Serving a Model
vLLM downloads models from Hugging Face automatically:
source ~/vllm-env/bin/activate
# Serve Llama 3.1 8B Instruct (requires HF access token for gated models)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# For public models (no token needed):
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--host 0.0.0.0 \
--port 8000
# For gated models (Llama, Gemma):
export HUGGING_FACE_HUB_TOKEN=hf_your_token_here
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 127.0.0.1 \
--port 8000
# Specify local model directory
python -m vllm.entrypoints.openai.api_server \
--model /data/models/mistral-7b-instruct \
--host 0.0.0.0 \
--port 8000
Download models in advance:
# Use Hugging Face CLI to download to a local directory
pip install huggingface_hub
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
--local-dir /data/models/mistral-7b-instruct
OpenAI-Compatible API
vLLM's server implements the OpenAI API format, so existing OpenAI clients and tooling work against it:
# List available models
curl http://localhost:8000/v1/models
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [
{"role": "system", "content": "You are a helpful Linux sysadmin."},
{"role": "user", "content": "How do I check disk usage in Linux?"}
],
"temperature": 0.7,
"max_tokens": 500
}'
# Streaming completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Write a Python hello world"}],
"stream": true
}'
# Text completion (legacy)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "The capital of France is",
"max_tokens": 50
}'
Using the OpenAI Python client pointing to vLLM:
from openai import OpenAI
# Point to your vLLM server
client = OpenAI(
api_key="not-needed", # vLLM doesn't require a real key by default
base_url="http://localhost:8000/v1"
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3",
messages=[
{"role": "user", "content": "Explain Docker in 2 sentences"}
],
temperature=0.7,
max_tokens=200
)
print(response.choices[0].message.content)
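Under the hood, `"stream": true` returns Server-Sent Events in the OpenAI format: each event is a `data: {json}` line, and the stream ends with a `data: [DONE]` sentinel. A stdlib-only sketch of pulling the incremental token text out of each event (the sample payload below is illustrative, shaped like what the server streams back):

```python
import json

def delta_from_sse(line: str):
    """Extract the incremental text from one OpenAI-format SSE line.

    Returns None for non-data lines and for the final [DONE] sentinel.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    # Each chunk carries a partial message under choices[0].delta
    return event["choices"][0].get("delta", {}).get("content")

# Illustrative chunk in the OpenAI streaming shape
sample = 'data: {"choices": [{"delta": {"content": "Hello"}, "index": 0}]}'
print(delta_from_sse(sample))             # -> Hello
print(delta_from_sse("data: [DONE]"))     # -> None
```

The same parsing applies whether you read the stream with `curl`, `requests`, or the OpenAI client (which does this for you).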
Quantization for Lower Memory Usage
Quantization reduces model memory requirements significantly:
# AWQ quantization (best quality/size tradeoff)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--host 0.0.0.0 \
--port 8000
# GPTQ quantization
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-13B-GPTQ \
--quantization gptq \
--host 0.0.0.0 \
--port 8000
# FP8 quantization (NVIDIA Hopper/Ada GPUs, e.g. H100, RTX 4090)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 \
--host 0.0.0.0 \
--port 8000
Memory requirements with quantization:
| Model | FP16 | AWQ/GPTQ (4-bit) |
|---|---|---|
| 7B | ~14GB | ~5GB |
| 13B | ~26GB | ~9GB |
| 70B | ~140GB | ~40GB |
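The FP16 column follows directly from 2 bytes per parameter; 4-bit weights are 0.5 bytes each, with the gap to the table's figures (~5GB vs. a theoretical ~3.5GB for 7B) coming from quantization scales, non-quantized layers, and runtime overhead. A rough weight-only estimator (a sketch; real usage also needs KV cache and activation memory on top of this):

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate; excludes KV cache and activations."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, matching the table's rough figures

print(f"7B  FP16:  ~{weight_memory_gb(7, 16):.0f} GB")   # ~14 GB
print(f"7B  4-bit: ~{weight_memory_gb(7, 4):.1f} GB")    # ~3.5 GB + overhead
print(f"70B FP16:  ~{weight_memory_gb(70, 16):.0f} GB")  # ~140 GB
```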
Multi-GPU Serving with Tensor Parallelism
# Use 2 GPUs with tensor parallelism for a 13B model
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-13b-chat-hf \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
# Use 4 GPUs for 70B models
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--port 8000
# Verify GPU utilization
watch -n 1 nvidia-smi
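`--tensor-parallel-size` must evenly divide the model's attention head count, since vLLM shards attention heads across GPUs. A small helper to pick the largest usable size for a given GPU count (illustrative; the head counts in the comment are typical values for common architectures, not read from any config):

```python
def pick_tensor_parallel_size(num_gpus: int, num_attention_heads: int) -> int:
    """Largest GPU count <= num_gpus that evenly divides the head count.

    vLLM shards attention heads across GPUs, so the tensor-parallel size
    must divide num_attention_heads exactly.
    """
    for size in range(num_gpus, 0, -1):
        if num_attention_heads % size == 0:
            return size
    return 1

# Typical head counts (assumed): 32 for 7B/8B, 40 for 13B, 64 for 70B
print(pick_tensor_parallel_size(4, 32))  # -> 4
print(pick_tensor_parallel_size(3, 40))  # -> 2 (40 heads don't split 3 ways)
```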
Running as a Systemd Service
sudo tee /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM OpenAI-Compatible API Server
After=network.target
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
Environment="HUGGING_FACE_HUB_TOKEN=hf_your_token"
ExecStart=/home/ubuntu/vllm-env/bin/python \
-m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--host 127.0.0.1 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90
Restart=on-failure
RestartSec=30
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo journalctl -u vllm -f
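Model loading can take a minute or more after `systemctl start`, so it helps to poll vLLM's `/health` endpoint before routing traffic to the service. A stdlib-only readiness check (the URL assumes the host/port from the service file above):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout_s: float = 120.0,
                    interval_s: float = 2.0) -> bool:
    """Poll a health URL until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    ready = wait_for_server("http://127.0.0.1:8000/health")
    print("ready" if ready else "timed out")
```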
Python Client and Batch Inference
For high-throughput offline batch inference, use the vLLM Python API directly:
from vllm import LLM, SamplingParams
# Load model (downloads from HuggingFace on first run)
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.3",
gpu_memory_utilization=0.90,
max_model_len=4096
)
# Define sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=256
)
# Batch generate (much more efficient than one at a time)
prompts = [
"What is a VPS?",
"Explain Docker in one sentence.",
"What is Nginx used for?",
"How does SSH work?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Output: {generated_text!r}")
print()
Troubleshooting
"CUDA out of memory" on startup
# Reduce GPU memory allocation
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-memory-utilization 0.75 # Default is 0.90
# Reduce max context length
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--max-model-len 2048
Slow first request
# vLLM compiles CUDA kernels and captures CUDA graphs on first startup (~30-60s is normal)
# Subsequent requests are fast
# CUDA graphs are enabled by default; disable them if they cause startup or stability issues
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--enforce-eager  # Run in eager mode, skipping CUDA graph capture
Model download fails
# Check HuggingFace token
echo $HUGGING_FACE_HUB_TOKEN
# Download manually
huggingface-cli download MODEL_NAME --token $HUGGING_FACE_HUB_TOKEN
Server not accessible remotely
# Ensure bound to 0.0.0.0, not 127.0.0.1
ss -tlnp | grep 8000
# Check firewall (open the port only if the API should be public)
sudo ufw allow 8000/tcp
Conclusion
vLLM delivers production-grade LLM inference with significantly higher throughput than alternatives, thanks to PagedAttention's efficient KV cache management. Its OpenAI-compatible API makes it a drop-in replacement for any application already using the OpenAI client library, and support for AWQ/GPTQ quantization enables deploying large models on GPUs with limited VRAM.


