Ollama Installation for Local LLM Deployment

Ollama makes it simple to run large language models like Llama 3, Mistral, and Gemma locally on Linux servers, providing a straightforward CLI and REST API without requiring complex infrastructure setup. This guide covers installing Ollama, running models with CPU and GPU acceleration, managing and customizing models, using the REST API, and tuning performance for production use.

Prerequisites

  • Ubuntu 20.04+ or CentOS/Rocky Linux 8+ (64-bit)
  • Minimum 8GB RAM (16GB+ recommended for 7B models)
  • For GPU acceleration: NVIDIA GPU with CUDA 11.8+ or AMD GPU with ROCm
  • Sufficient disk space (7B model: ~4GB, 13B model: ~8GB, 70B model: ~40GB)
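
A rough rule of thumb behind those sizes: a quantized model needs about parameters x bits-per-weight / 8 bytes, plus some overhead for runtime buffers. A small sketch of that estimate (the 1.2 overhead factor is this guide's assumption, not an Ollama figure):

```python
def estimate_model_gb(params_billions: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    """Rough disk/memory footprint of a quantized model.

    params_billions: parameter count in billions (7 for a 7B model)
    bits_per_weight: quantization level (4 for q4, 8 for q8, 16 for fp16)
    overhead: fudge factor for runtime buffers (an assumption, not measured)
    """
    return params_billions * 1e9 * bits_per_weight / 8 * overhead / 1e9

# A 7B model at 4-bit quantization lands near the ~4GB figure above:
print(f"7B @ q4:  {estimate_model_gb(7, 4):.1f} GB")
print(f"70B @ q4: {estimate_model_gb(70, 4):.1f} GB")
```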

Installing Ollama

One-Line Install Script

curl -fsSL https://ollama.com/install.sh | sh

This installs the ollama binary to /usr/local/bin and creates a systemd service.

Manual Installation

# Download the latest release tarball
curl -L https://ollama.com/download/ollama-linux-amd64.tgz \
  -o /tmp/ollama.tgz

# The tarball contains bin/ and lib/ directories, so extract into /usr
sudo tar -C /usr -xzf /tmp/ollama.tgz

# Verify
ollama --version

Verify Installation

ollama --version
# ollama version is 0.x.x

# Check service status (if using automatic installer)
sudo systemctl status ollama

Running Your First Model

# Pull and run Llama 3.2 (3B — fast, low memory)
ollama run llama3.2

# Pull and run Mistral 7B
ollama run mistral

# Run with a one-shot prompt (non-interactive)
ollama run llama3.2 "Explain Docker containers in 3 sentences"

# Run Gemma 2 (Google's efficient model)
ollama run gemma2

# Run CodeLlama for coding tasks
ollama run codellama "Write a Python function to reverse a linked list"

Press Ctrl+D or type /bye to exit the interactive session.

Model Management

# List available models locally
ollama list

# Pull a model without running it
ollama pull llama3.1:8b
ollama pull llama3.1:70b  # Large — requires ~40GB RAM/VRAM

# Pull a specific quantization (smaller = faster, less accurate)
ollama pull llama3.2:3b-instruct-q4_K_M   # 4-bit quantized
ollama pull llama3.2:3b-instruct-fp16      # Full precision

# Show model details
ollama show llama3.2

# Remove a model
ollama rm mistral

# Copy a model locally
ollama cp llama3.2 my-custom-llama
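
The same inventory is available over the API: GET /api/tags returns each local model's name and size in bytes. A sketch that totals disk usage from such a response (the helper names are illustrative, not part of Ollama):

```python
import json
from urllib.request import urlopen

def list_local_models(base_url="http://localhost:11434"):
    """Fetch the local model inventory from a running Ollama server."""
    with urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)

def total_model_gb(tags_response):
    """Sum the 'size' field (bytes) across models in an /api/tags response."""
    return sum(m["size"] for m in tags_response.get("models", [])) / 1e9

# Offline demonstration with a canned response (same shape as /api/tags):
sample = {"models": [{"name": "llama3.2:latest", "size": 2_019_393_189},
                     {"name": "mistral:latest", "size": 4_113_301_824}]}
print(f"{total_model_gb(sample):.1f} GB across {len(sample['models'])} models")
```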

Custom Modelfiles

Create custom models by extending a base model:

cat > Modelfile << 'EOF'
FROM llama3.2

# Set system prompt
SYSTEM """
You are a helpful Linux sysadmin assistant. Provide concise, accurate answers
about server administration, networking, and DevOps. Always include commands
when relevant.
"""

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
EOF

# Build the custom model
ollama create sysadmin-assistant -f Modelfile

# Run it
ollama run sysadmin-assistant "How do I check disk usage on Linux?"
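
Since a Modelfile is plain text, it is easy to template when you generate assistants programmatically. A hypothetical helper that renders one from a base model, system prompt, and parameter dict (the function is this guide's sketch, not part of Ollama):

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render Modelfile text like the hand-written example above."""
    lines = [f"FROM {base}", ""]
    lines += ['SYSTEM """', system.strip(), '"""', ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"

modelfile = render_modelfile(
    base="llama3.2",
    system="You are a helpful Linux sysadmin assistant.",
    parameters={"temperature": 0.7, "top_p": 0.9, "num_ctx": 4096},
)
with open("Modelfile", "w") as f:
    f.write(modelfile)
# Then build it as before: ollama create sysadmin-assistant -f Modelfile
```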

GPU Acceleration

Ollama automatically detects and uses NVIDIA and AMD GPUs when drivers are installed:

# Verify GPU is detected by Ollama
ollama run llama3.2 "test"
# Look for "using GPU" in the output or check:
OLLAMA_DEBUG=1 ollama run llama3.2 "test" 2>&1 | grep -i "gpu\|cuda"

# Check GPU memory usage during inference
watch -n 1 nvidia-smi

# Force CPU-only mode if needed by hiding GPUs from the server process
# (inference runs in the server, so env vars on `ollama run` have no effect)
CUDA_VISIBLE_DEVICES="" ollama serve

When a model doesn't fit fully in VRAM, Ollama automatically offloads as many layers as fit. To control the split manually, set the num_gpu option (the number of layers offloaded to the GPU):

# In an interactive session:
ollama run llama3.1:70b
>>> /set parameter num_gpu 24

# Or in a Modelfile:
# PARAMETER num_gpu 24
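
These generation settings can also be supplied per request through the API's "options" field (num_gpu, num_ctx, and others). A sketch that builds such a request body; the helper name is illustrative:

```python
import json

def build_generate_request(model, prompt, num_gpu=None, num_ctx=None):
    """Build an /api/generate body with optional per-request options."""
    body = {"model": model, "prompt": prompt, "stream": False}
    options = {}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu  # number of layers to offload to the GPU
    if num_ctx is not None:
        options["num_ctx"] = num_ctx  # context window for this request only
    if options:
        body["options"] = options
    return body

# Offload 24 layers of the 70B model, leaving the rest on the CPU:
body = build_generate_request("llama3.1:70b", "hello", num_gpu=24)
print(json.dumps(body, indent=2))
# POST the body to http://localhost:11434/api/generate.
```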

Using the Ollama REST API

Ollama's native REST API listens on port 11434, and an OpenAI-compatible API is exposed under /v1:

# Generate a completion
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "What is a VPS?",
    "stream": false
  }'

# Chat API (multi-turn conversation)
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What is Docker?"},
      {"role": "assistant", "content": "Docker is a containerization platform..."},
      {"role": "user", "content": "How does it differ from a VM?"}
    ],
    "stream": false
  }'
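
Note that the chat endpoint is stateless: the client resends the full message history each turn, exactly as the curl example does. A minimal conversation-state sketch (the Conversation class is illustrative, not an Ollama client API):

```python
class Conversation:
    """Accumulate the messages list that /api/chat expects each turn."""

    def __init__(self, system=None):
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def user(self, content):
        self.messages.append({"role": "user", "content": content})
        return self.messages  # ready to send as the "messages" field

    def assistant(self, content):
        self.messages.append({"role": "assistant", "content": content})

conv = Conversation(system="You are a concise sysadmin assistant.")
conv.user("What is Docker?")
conv.assistant("Docker is a containerization platform...")
payload = {"model": "llama3.2",
           "messages": conv.user("How does it differ from a VM?"),
           "stream": False}
print(len(payload["messages"]))  # → 4 (system prompt + three turns)
```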

# List local (downloaded) models
curl http://localhost:11434/api/tags

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python client example:

import requests

def ask_ollama(prompt, model="llama3.2"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # generation can be slow on CPU
    )
    response.raise_for_status()
    return response.json()["response"]

print(ask_ollama("List 5 Linux commands every sysadmin should know"))
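
With "stream": true (the default), /api/generate instead returns one JSON object per line, each carrying a "response" fragment until a final record with "done": true. A sketch of a parser for that framing, demonstrated offline with canned chunks:

```python
import json

def join_stream(lines):
    """Concatenate 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Against a live server with requests:
#   resp = requests.post(url, json={..., "stream": True}, stream=True)
#   print(join_stream(resp.iter_lines(decode_unicode=True)))

# Offline demonstration with canned chunks:
fake = ['{"response": "Hel", "done": false}',
        '{"response": "lo!", "done": true}']
print(join_stream(fake))  # → Hello!
```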

Running Ollama as a Service

# The install script creates a systemd service automatically
# To configure the service:
sudo systemctl edit ollama

# Add environment variables
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_MODELS=/data/ollama/models"
# Environment="OLLAMA_NUM_PARALLEL=2"

sudo systemctl daemon-reload
sudo systemctl restart ollama

# Check logs
sudo journalctl -u ollama -f

# Enable on boot
sudo systemctl enable ollama

To allow remote API access (bind to all interfaces):

# Edit service override
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"

# Or, if you run `ollama serve` manually, set it in ~/.bashrc
export OLLAMA_HOST=0.0.0.0

# Restrict access with a firewall (allow only trusted IPs)
sudo ufw allow from 192.168.1.0/24 to any port 11434

Performance Optimization

# These variables must be set in the environment of the server process
# (ollama serve), e.g. via a systemd override; exporting them in a client
# shell has no effect on a server that is already running under systemd.

# Set the number of parallel inference requests per model
export OLLAMA_NUM_PARALLEL=2

# Set the default context window size (newer Ollama releases; older builds
# default to 2048 and take num_ctx as a model parameter instead)
export OLLAMA_CONTEXT_LENGTH=4096

# Enable Flash Attention for better GPU utilization
export OLLAMA_FLASH_ATTENTION=1

# Keep models in memory between requests
export OLLAMA_KEEP_ALIVE="30m"  # Keep loaded for 30 minutes

# Run one instance per GPU on multi-GPU systems (note the separate ports)
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &  # GPU 0
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &  # GPU 1
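
With one instance per GPU, clients need to spread requests across them, since Ollama itself does not load-balance. A minimal round-robin sketch, assuming two instances listening on ports 11434 and 11435:

```python
from itertools import cycle

# Assumption: one Ollama instance per GPU, on ports 11434 and 11435.
BACKENDS = ["http://localhost:11434", "http://localhost:11435"]
_backend_cycle = cycle(BACKENDS)

def next_backend():
    """Pick the next Ollama instance, round-robin."""
    return next(_backend_cycle)

# Each request goes to the next instance in turn:
print(next_backend())  # http://localhost:11434
print(next_backend())  # http://localhost:11435
print(next_backend())  # http://localhost:11434
```

For production traffic a real reverse proxy (nginx, HAProxy) is the more robust choice; this only illustrates the idea.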

Troubleshooting

"Error: model not found"

# Pull the model first
ollama pull llama3.2
ollama list  # Verify it's downloaded

GPU not detected

# Verify NVIDIA driver
nvidia-smi

# Check Ollama debug output
OLLAMA_DEBUG=1 ollama run llama3.2 "test" 2>&1 | head -50

"Out of memory" errors

# Use a smaller or more aggressively quantized model
ollama pull llama3.2:3b-instruct-q4_K_S  # Most compressed of the q4 variants

# Reduce the context window in an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 1024

# Or per API request: "options": {"num_ctx": 1024}

API not accessible remotely

# Check if Ollama is bound to all interfaces
ss -tlnp | grep 11434

# Set OLLAMA_HOST and restart
sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama

Conclusion

Ollama makes deploying large language models on Linux servers accessible without complex configuration, and its OpenAI-compatible API lets you integrate local inference into existing applications with minimal code changes. GPU acceleration with NVIDIA CUDA or AMD ROCm is detected automatically, and models can be kept warm in memory for low-latency inference in production environments.