Paperless-ngx Document Management Installation

Paperless-ngx is an open-source document management system that OCRs, tags, and indexes scanned documents into a searchable digital archive, reducing paper clutter and manual filing. It runs on Docker, uses Tesseract for OCR, and provides full-text search. It can consume documents automatically from email, scanners, and watched folders, which makes it a good fit for digitizing home office or small business paperwork on a self-hosted Linux server.

Prerequisites

  • Ubuntu 20.04+, Debian 11+, or CentOS/Rocky 8+
  • Docker Engine and the Docker Compose plugin installed (the commands below use "docker compose", not "docker-compose")
  • Minimum 2 GB RAM (4+ GB recommended)
  • Root or sudo access
  • A domain name or static IP for access

Installing Paperless-ngx with Docker

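# Create the Paperless-ngx directory and make it writable for your user
# (otherwise the curl downloads below fail with "Permission denied")
sudo mkdir -p /opt/paperless
sudo chown "$USER": /opt/paperless
cd /opt/paperless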

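# Download the official docker-compose with PostgreSQL and Redis
curl -LO https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/main/docker/compose/docker-compose.postgres.yml
mv docker-compose.postgres.yml docker-compose.yml

# The upstream compose file loads container settings from docker-compose.env
# (env_file); point it at .env so the settings configured below reach the container
sed -i 's/docker-compose\.env/.env/' docker-compose.yml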

# Download the environment file template
curl -LO https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/main/docker/compose/.env

Edit the .env file:

nano /opt/paperless/.env
# Key settings to configure in .env:

# Secret key — generate with: openssl rand -hex 32
PAPERLESS_SECRET_KEY=your-secret-key-here

# Admin user credentials
PAPERLESS_ADMIN_USER=admin
PAPERLESS_ADMIN_PASSWORD=secure-password-here
PAPERLESS_ADMIN_MAIL=admin@example.com

# Timezone
PAPERLESS_TIME_ZONE=America/New_York

# Language for OCR (3-letter ISO code)
PAPERLESS_OCR_LANGUAGE=eng

# URL for reverse proxy access
PAPERLESS_URL=https://paperless.example.com

# Storage paths
PAPERLESS_DATA_DIR=/usr/src/paperless/data
PAPERLESS_MEDIA_ROOT=/usr/src/paperless/media
PAPERLESS_CONSUMPTION_DIR=/usr/src/paperless/consume
PAPERLESS_EXPORT_DIR=/usr/src/paperless/export
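
The secret key placeholder has to be replaced before the first start. A minimal sketch of doing that from the shell, assuming the .env path used in this guide; generate_hex_key and set_secret_key are hypothetical helper names, not Paperless-ngx commands:

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for filling in PAPERLESS_SECRET_KEY.

# Print 64 hex characters of randomness; prefer openssl, fall back to /dev/urandom.
generate_hex_key() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

# Replace the PAPERLESS_SECRET_KEY line in the given env file, or append one.
set_secret_key() {
  local env_file="$1" key
  key="$(generate_hex_key)" || return 1
  if grep -q '^PAPERLESS_SECRET_KEY=' "$env_file"; then
    # The key is plain hex, so "|" is a safe sed delimiter
    sed -i "s|^PAPERLESS_SECRET_KEY=.*|PAPERLESS_SECRET_KEY=${key}|" "$env_file"
  else
    printf 'PAPERLESS_SECRET_KEY=%s\n' "$key" >> "$env_file"
  fi
}

# Usage: set_secret_key /opt/paperless/.env
```

Changing the key later invalidates existing login sessions, so this is best run once during setup.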

Start the containers:

# Start Paperless-ngx
sudo docker compose up -d

# Monitor startup (the first start runs database migrations and may take a few minutes)
sudo docker compose logs -f webserver

# Verify it's running
sudo docker compose ps

Access the web interface at http://your-server:8000.
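
The first start can take a while, so a script that continues with setup may want to wait until the web interface answers. A rough sketch, assuming curl is installed on the host and the default port 8000; wait_for_web is a hypothetical helper name, and the attempt count and delay are arbitrary:

```bash
#!/usr/bin/env bash
# Sketch: poll a URL until it answers or the attempts run out.
wait_for_web() {
  local url="$1" attempts="${2:-30}" delay="${3:-5}" i
  for (( i = 1; i <= attempts; i++ )); do
    # -f makes curl fail on HTTP 4xx/5xx; redirects to the login page count as success
    if curl -fs -o /dev/null "$url"; then
      echo "Web interface is up after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "No response from ${url} after ${attempts} attempts" >&2
  return 1
}

# Usage: wait_for_web http://localhost:8000 30 5
```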

Initial Configuration

Create the admin user (if not auto-created from .env):

sudo docker compose exec webserver \
  python3 manage.py createsuperuser \
  --username admin \
  --email admin@example.com

Configure document storage volumes:

# Create local directories for document storage
sudo mkdir -p /opt/paperless/{consume,data,media,export}
sudo chown -R 1000:1000 /opt/paperless/{consume,data,media,export}

Update docker-compose.yml to mount local paths:

services:
  webserver:
    volumes:
      - /opt/paperless/data:/usr/src/paperless/data
      - /opt/paperless/media:/usr/src/paperless/media
      - /opt/paperless/consume:/usr/src/paperless/consume
      - /opt/paperless/export:/usr/src/paperless/export

OCR Configuration

Paperless-ngx uses Tesseract for OCR. Configure language and quality:

# In .env, set OCR options:

# Primary OCR language
PAPERLESS_OCR_LANGUAGE=eng

# Multiple languages (separate with +)
PAPERLESS_OCR_LANGUAGE=eng+deu+fra

# OCR mode:
# skip  = OCR only pages that have no text layer (default)
# redo  = OCR all pages, replacing existing OCR text layers
# force = Rasterize and OCR every page, discarding existing text
PAPERLESS_OCR_MODE=skip

# Image cleanup with unpaper before OCR
PAPERLESS_OCR_CLEAN=clean

# Deskew and auto-rotate pages
PAPERLESS_OCR_DESKEW=true
PAPERLESS_OCR_ROTATE_PAGES=true
PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=12

# Output format (PDF/A for long-term archiving)
PAPERLESS_OCR_OUTPUT_TYPE=pdfa

Install additional Tesseract language packs:

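# Packages installed with apt-get inside a running container are lost when the
# container is recreated. With the Docker image, list extra packs in .env instead;
# they are installed at container start (package-style names, e.g. chi-sim):
PAPERLESS_OCR_LANGUAGES=deu fra

# List the language packs currently installed
sudo docker compose exec webserver tesseract --list-langs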
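
The OCR language string and the list of packs to install use different separators. A small sketch of converting one into the other, assuming the Docker image's PAPERLESS_OCR_LANGUAGES setting takes a whitespace-separated list with "-" where Tesseract codes use "_"; ocr_language_to_packages is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: derive a PAPERLESS_OCR_LANGUAGES value from a PAPERLESS_OCR_LANGUAGE string.
ocr_language_to_packages() {
  # "+" separates languages; package names use "-" where Tesseract codes use "_"
  printf '%s\n' "$1" | tr '+_' ' -'
}

# Example: prints "eng deu chi-sim"
ocr_language_to_packages 'eng+deu+chi_sim'
```

Languages that already appear in the tesseract --list-langs output do not need to be listed again.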

Consumption Workflows

Paperless-ngx watches the consume directory for new documents:

Drop files for automatic processing:

# Copy files to the consumption directory
cp invoice.pdf /opt/paperless/consume/
cp scan.jpg /opt/paperless/consume/

# Paperless processes them automatically within seconds
# Monitor processing (the consumer runs inside the webserver container)
sudo docker compose logs -f webserver
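
Large scans copied over a slow link can sit half-written in the consume directory for a while. A minimal sketch that avoids this by copying into a staging directory first and then renaming the file into place; the staging path and the deliver_to_consume name are hypothetical, and the rename is only atomic when both directories are on the same filesystem:

```bash
#!/usr/bin/env bash
# Sketch: hand a file to the consume directory in one rename step.
# deliver_to_consume SRC CONSUME_DIR STAGING_DIR
deliver_to_consume() {
  local src="$1" consume_dir="$2" staging_dir="$3" name
  name="$(basename -- "$src")"
  mkdir -p -- "$staging_dir" || return 1
  # Slow part: the copy happens outside the watched directory
  cp -- "$src" "$staging_dir/$name" || return 1
  # Fast part: a rename on the same filesystem, so the consumer never sees a partial file
  mv -- "$staging_dir/$name" "$consume_dir/$name"
}

# Usage: deliver_to_consume invoice.pdf /opt/paperless/consume /opt/paperless/staging
```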

Configure watched folder with inotify:

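# The consumer watches the consumption directory with inotify. On network shares
# (NFS, SMB) where inotify events do not arrive, set PAPERLESS_CONSUMER_POLLING
# to a polling interval in seconds.
# Show the consumer-related settings
grep -i consum /opt/paperless/.env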

# Manual consumption trigger
sudo docker compose exec webserver \
  python3 manage.py document_consumer --oneshot

Pre-assign tags with subdirectories:

# Enable recursive consumption in .env; each subdirectory name below the
# consume directory then becomes a tag on the documents placed in it:
#   PAPERLESS_CONSUMER_RECURSIVE=true
#   PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=true
# Example: this file is tagged "bills" and "utilities"
mkdir -p /opt/paperless/consume/bills/utilities
cp electric-bill.pdf /opt/paperless/consume/bills/utilities/

Tagging and Document Organization

Paperless-ngx uses correspondents, document types, and tags:

Create tags via the web UI:

  1. Go to Tags > Create Tag
  2. Name: bills, Color: Red, Auto-match: enabled

Set up automatic tagging rules:

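  1. Go to Tags and edit the tag utilities (use Correspondents > Add Correspondent to auto-assign a correspondent the same way)
  2. Set Matching Algorithm: Auto (ML-based) or Regular expression
  3. For regex: Match Electric Company → matching documents receive the tag utilities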

Bulk assign documents:

# Via the web UI:
# 1. Select multiple documents (checkboxes)
# 2. Click the tag icon
# 3. Apply tags in bulk

# Via CLI
sudo docker compose exec webserver \
  python3 manage.py shell -c "
from documents.models import Document, Tag
tag = Tag.objects.get(name='bills')
# Add the tag to every document whose title contains 'invoice'
for doc in Document.objects.filter(title__icontains='invoice'):
    doc.tags.add(tag)
"

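Bulk tagging can also be scripted from outside the container. A sketch, assuming the REST API's /api/documents/bulk_edit/ endpoint with the add_tag method and token authentication; the numeric tag and document IDs have to be looked up beforehand, and bulk_edit_payload and bulk_add_tag are hypothetical helper names:

```bash
#!/usr/bin/env bash
# Sketch: add one tag to several documents through the REST API.

# Build the JSON body: bulk_edit_payload TAG_ID DOC_ID [DOC_ID...]
bulk_edit_payload() {
  local tag_id="$1"; shift
  local ids
  # Join the remaining arguments with commas
  ids="$(IFS=,; printf '%s' "$*")"
  printf '{"documents": [%s], "method": "add_tag", "parameters": {"tag": %s}}\n' \
    "$ids" "$tag_id"
}

# Send it: bulk_add_tag BASE_URL TOKEN TAG_ID DOC_ID [DOC_ID...]
bulk_add_tag() {
  local base_url="$1" token="$2"; shift 2
  curl -fsS -X POST "${base_url}/api/documents/bulk_edit/" \
    -H "Authorization: Token ${token}" \
    -H "Content-Type: application/json" \
    -d "$(bulk_edit_payload "$@")"
}

# Usage: bulk_add_tag http://localhost:8000 "$PAPERLESS_TOKEN" 5 3 7 12
```

PAPERLESS_TOKEN here is just a shell variable holding an API token created for your user in the web UI.
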
Full-Text Search

Paperless-ngx uses Whoosh for full-text search indexing:

# Rebuild the search index
sudo docker compose exec webserver \
  python3 manage.py document_index reindex

# Substring search through the Django ORM (bypasses the search index; useful for scripting)
sudo docker compose exec webserver \
  python3 manage.py shell -c "
from documents.models import Document
results = Document.objects.filter(content__icontains='invoice 2024')
for doc in results:
    print(doc.title, doc.created)
"

Search query syntax in the web UI:

content:invoice           # Search document content
title:electric            # Search by title
tag:bills                 # Filter by tag
correspondent:amazon      # Filter by correspondent
created:[2024-01-01 TO *] # Date range
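
The same query strings can be sent to the REST API for scripting. A sketch, assuming the /api/documents/ endpoint accepts a query parameter for full-text search and that an API token is available; search_documents is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: run a full-text search through the REST API and print the JSON response.
# search_documents BASE_URL TOKEN QUERY
search_documents() {
  local base_url="$1" token="$2" query="$3"
  # --get turns the --data-urlencode pair into a URL-encoded query string
  curl -fsS --get "${base_url}/api/documents/" \
    -H "Authorization: Token ${token}" \
    --data-urlencode "query=${query}"
}

# Usage: search_documents http://localhost:8000 "$PAPERLESS_TOKEN" 'tag:bills invoice'
```

Letting curl do the URL encoding avoids quoting problems with colons, spaces, and brackets in the query.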

Email Integration

Automatically import documents from email accounts:

# Configure mail accounts in Admin panel:
# Mail > Mail Accounts > Add Mail Account
# Settings:
# - IMAP Server: imap.gmail.com
# - Port: 993
# - Username: your-address@gmail.com
# - Password: app-password
# - IMAP Security: SSL

# Configure mail rules:
# Mail > Mail Rules > Add Mail Rule
# - Account: your Gmail account
# - Subject filter: "invoice" or "receipt"
# - Action: Consume attachments
# - Tags to assign: bills, email

Poll mail manually:

sudo docker compose exec webserver \
  python3 manage.py mail_fetcher

Troubleshooting

Documents not being processed from consume folder:

# Check the consumer logs (the consumer runs inside the webserver container)
sudo docker compose logs --tail 50 webserver

# Verify file permissions
ls -la /opt/paperless/consume/
sudo chown 1000:1000 /opt/paperless/consume/*.pdf

# Check that the consume directory is mounted correctly
sudo docker compose exec webserver ls /usr/src/paperless/consume/
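
When many files are stuck, it helps to list only those with the wrong owner instead of reading the whole directory listing. A small sketch, assuming GNU find and the UID 1000 used in the chown commands in this guide; list_foreign_files is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: print files under a directory that are NOT owned by the expected UID.
# list_foreign_files DIR [UID]   (UID defaults to 1000)
list_foreign_files() {
  local dir="$1" uid="${2:-1000}"
  find "$dir" -type f ! -uid "$uid" -print
}

# Usage: list_foreign_files /opt/paperless/consume 1000
```

An empty result means ownership is not the cause and the logs are the next place to look.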

OCR producing garbled text:

# Check if the correct language is set
grep OCR_LANGUAGE /opt/paperless/.env

# Test OCR on a single page image (Tesseract reads images, not PDFs)
sudo docker compose exec webserver \
  tesseract /path/to/test.png stdout -l eng

# Enable deskew for scanned documents
# PAPERLESS_OCR_DESKEW=true
# PAPERLESS_OCR_ROTATE_PAGES=true

High memory usage during indexing:

# Check memory
sudo docker stats

# Limit concurrent tasks in .env
PAPERLESS_TASK_WORKERS=1
PAPERLESS_THREADS_PER_WORKER=1

# Rebuilding the search index can be memory-intensive
sudo docker compose exec webserver \
  python3 manage.py document_index reindex --no-progress-bar
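
On machines with more headroom, the two settings can be raised together. A rule-of-thumb sketch, assuming the goal is to keep workers times threads within the number of CPU cores; suggest_threads_per_worker is a hypothetical helper name, not something Paperless-ngx provides:

```bash
#!/usr/bin/env bash
# Sketch: pick a PAPERLESS_THREADS_PER_WORKER value for a given worker count.
# suggest_threads_per_worker CORES WORKERS
suggest_threads_per_worker() {
  local cores="$1" workers="$2" threads
  threads=$(( cores / workers ))   # integer division rounds down
  if (( threads < 1 )); then
    threads=1                      # never go below one thread
  fi
  echo "$threads"
}

# Usage: suggest_threads_per_worker "$(nproc)" 2
```

With 4 cores and 2 workers this suggests 2 threads per worker; on a low-memory host the values of 1 and 1 shown above remain the safer choice.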

Conclusion

Paperless-ngx transforms document management from manual filing into an automated, searchable digital archive with OCR, intelligent tagging, and multi-source consumption from email, scanners, and watched folders. The combination of full-text search, automatic correspondent detection, and email integration makes it practical to maintain a paperless office workflow entirely on self-hosted infrastructure without cloud document services.