Compressed Backup with tar and gzip: Complete Practical Guide

Introduction

The tar (Tape Archive) utility, combined with gzip compression, has been a cornerstone of Unix/Linux system backups for decades. Despite the emergence of modern backup tools with advanced features like deduplication and encryption, tar and gzip remain relevant due to their universality, simplicity, and reliability. Every Linux system includes these tools by default, making them ideal for situations where you need portable, self-contained backups that can be restored on any Unix-like system without installing specialized software.

Understanding tar and gzip is fundamental for any Linux system administrator. These tools provide the foundation for many backup workflows, from simple file archives to complex automated backup systems. Their straightforward operation, combined with powerful options for selective backups, compression, and incremental archiving, makes them suitable for everything from quick personal backups to enterprise disaster recovery implementations following the 3-2-1 backup rule.

This comprehensive guide explores tar and gzip from basics through advanced production use, covering syntax, compression strategies, incremental backups, automation, restoration procedures, and real-world scenarios.

Understanding tar and gzip

What is tar?

Tar (Tape Archive) creates archive files by combining multiple files and directories into a single file. Originally designed for magnetic tape backups, tar now serves as a universal archiving format.

Key characteristics:

  • Preserves file metadata (permissions, ownership, timestamps)
  • Maintains directory structures
  • Can archive special files (symbolic links, devices)
  • Creates sequential archives suitable for streaming
  • No compression by default (tar itself doesn't compress)
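
The last characteristic is what makes tar pipelines possible: the archive can be written to stdout and consumed directly by another program. A minimal sketch, assuming SSH access to a host named remote-host with a writable /backups directory (both placeholders):

# Stream an archive over ssh without creating a local file first
tar -czf - /home/user/ | ssh backup@remote-host 'cat > /backups/user-home.tar.gz'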

What is gzip?

Gzip (GNU zip) is a compression utility that reduces file sizes using the DEFLATE algorithm. While gzip compresses files, it traditionally operates on single files at a time.

Key characteristics:

  • Good compression ratio (typically 50-80% reduction)
  • Fast compression/decompression
  • Available on virtually all Unix/Linux systems
  • Combines naturally with tar for compressed archives
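
Because gzip works on one file at a time, it helps to see it on its own before combining it with tar; report.txt below is a placeholder file name:

# Compress a single file in place (produces report.txt.gz, removes the original)
gzip report.txt

# Keep the original alongside the compressed copy
gzip -k report.txt

# Decompress
gunzip report.txt.gz

# View compressed content without writing a decompressed copy to disk
zcat report.txt.gz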

tar + gzip Workflow

The standard workflow combines both:

# Create tar archive
tar -cf archive.tar /path/to/backup

# Compress with gzip
gzip archive.tar
# Result: archive.tar.gz

# Or combine in single command
tar -czf archive.tar.gz /path/to/backup
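
The reverse of this workflow is symmetric, either as two steps or a single command:

# Decompress, then extract
gunzip archive.tar.gz
tar -xf archive.tar

# Or extract in one command
tar -xzf archive.tar.gz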

Basic tar Operations

Creating Archives

Basic syntax:

tar -cf archive.tar file1 file2 directory/

Essential options:

  • -c: Create archive
  • -f: Specify filename
  • -v: Verbose output
  • -z: gzip compression
  • -j: bzip2 compression
  • -J: xz compression
  • -t: List contents
  • -x: Extract archive

Create basic archive:

# Archive single directory
tar -cf backup.tar /home/user/

# Archive multiple items
tar -cf backup.tar /etc/ /var/www/ /home/

# Verbose output
tar -cvf backup.tar /home/user/

# Archive with gzip compression
tar -czf backup.tar.gz /home/user/

# Archive with bzip2 (better compression, slower)
tar -cjf backup.tar.bz2 /home/user/

# Archive with xz (best compression, slowest)
tar -cJf backup.tar.xz /home/user/

Leading slashes (relative paths are recommended for portability):

# GNU tar strips the leading slash by default and stores relative paths
tar -czf backup.tar.gz /home/user/
# Members stored as: home/user/...

# Keep absolute paths in member names (-P / --absolute-names); use with caution
tar -czPf backup.tar.gz /home/user/

# Store relative paths explicitly by changing directory first
tar -czf backup.tar.gz -C / home/user/
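
Because relative member names extract beneath whatever directory you choose, a restore can be staged safely before files are copied into place; /tmp/restore below is an arbitrary staging location:

# Restore into a staging directory instead of over the live system
mkdir -p /tmp/restore
tar -xzf backup.tar.gz -C /tmp/restore
# Files appear under /tmp/restore/home/user/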

Listing Archive Contents

# List files in archive
tar -tf archive.tar.gz

# Verbose listing with details
tar -tvf archive.tar.gz

# List specific files matching pattern
tar -tf archive.tar.gz | grep '\.conf$'

# Count files in archive
tar -tf archive.tar.gz | wc -l

Extracting Archives

# Extract to current directory
tar -xzf archive.tar.gz

# Extract to specific directory
tar -xzf archive.tar.gz -C /restore/location/

# Extract specific files
tar -xzf archive.tar.gz path/to/specific/file.txt

# Extract with verbose output
tar -xzvf archive.tar.gz

# Extract files matching pattern
tar -xzf archive.tar.gz --wildcards '*.conf'

# Extract and preserve exact permissions (-p; the default when running as root)
tar -xzpf archive.tar.gz

# Overwrite existing files
tar -xzf archive.tar.gz --overwrite

Compression Options and Strategies

Comparing Compression Algorithms

gzip (default, balanced):

tar -czf archive.tar.gz /data/
# Compression ratio: ~60-70%
# Speed: Fast
# Compatibility: Universal

bzip2 (better compression):

tar -cjf archive.tar.bz2 /data/
# Compression ratio: ~70-80%
# Speed: Slower than gzip
# Compatibility: Very good

xz (best compression):

tar -cJf archive.tar.xz /data/
# Compression ratio: ~80-90%
# Speed: Slowest
# Compatibility: Good (modern systems)

lz4 (fastest):

tar -cf - /data/ | lz4 > archive.tar.lz4
# Compression ratio: ~50-60%
# Speed: Very fast
# Compatibility: Requires lz4 installed
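
tar has no single-letter flag for lz4, so extraction also goes through a pipe:

# Extract an lz4-compressed archive
lz4 -dc archive.tar.lz4 | tar -xf -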

Performance comparison example:

#!/bin/bash
# Compare compression methods

SOURCE="/var/www"
RESULTS="compression-test-results.txt"

echo "Compression Performance Test" > "$RESULTS"
echo "Source: $SOURCE" >> "$RESULTS"
echo "Original size: $(du -sh $SOURCE | cut -f1)" >> "$RESULTS"
echo "" >> "$RESULTS"

# gzip
echo "Testing gzip..." >> "$RESULTS"
{ time tar -czf test-gzip.tar.gz "$SOURCE"; } 2>&1 | grep real >> "$RESULTS"
echo "Size: $(du -h test-gzip.tar.gz | cut -f1)" >> "$RESULTS"
echo "" >> "$RESULTS"

# bzip2
echo "Testing bzip2..." >> "$RESULTS"
{ time tar -cjf test-bzip2.tar.bz2 "$SOURCE"; } 2>&1 | grep real >> "$RESULTS"
echo "Size: $(du -h test-bzip2.tar.bz2 | cut -f1)" >> "$RESULTS"
echo "" >> "$RESULTS"

# xz
echo "Testing xz..." >> "$RESULTS"
{ time tar -cJf test-xz.tar.xz "$SOURCE"; } 2>&1 | grep real >> "$RESULTS"
echo "Size: $(du -h test-xz.tar.xz | cut -f1)" >> "$RESULTS"

cat "$RESULTS"

Compression Level Tuning

gzip compression levels (1-9):

# tar has no flag for the gzip level, so pipe through gzip to choose one

# Fast compression (level 1)
tar -cf - /data/ | gzip -1 > archive.tar.gz

# Default (level 6)
tar -czf archive.tar.gz /data/

# Maximum compression (level 9)
tar -cf - /data/ | gzip -9 > archive.tar.gz

Using pigz (parallel gzip):

# Install pigz
sudo apt install pigz  # Ubuntu/Debian
sudo yum install pigz  # CentOS/RHEL

# Use with tar
tar -cf - /data/ | pigz -p 4 > archive.tar.gz
# -p 4: Use 4 CPU cores

# Extract
pigz -dc archive.tar.gz | tar -xf -

Optimal Compression Strategy

Choose based on use case:

Quick backups (prioritize speed):

tar -cf - /data/ | gzip -1 > backup.tar.gz
# or
tar -cf - /data/ | lz4 > backup.tar.lz4

Storage-constrained (prioritize size):

tar -cJf backup.tar.xz /data/
# or
tar -cjf backup.tar.bz2 /data/

Balanced (production default):

tar -czf backup.tar.gz /data/
# Standard gzip, good balance

Large datasets with multi-core CPU:

tar -cf - /data/ | pigz -p $(nproc) > backup.tar.gz
# Parallel compression using all cores

Advanced tar Features

Incremental Backups

Tar supports incremental backups using snapshot files:

Create snapshot file:

# Full backup with snapshot
tar -czf full-backup.tar.gz \
    --listed-incremental=backup.snar \
    /home/user/

Incremental backup (only changed files):

# Incremental backup 1
tar -czf incremental-1.tar.gz \
    --listed-incremental=backup.snar \
    /home/user/

# Incremental backup 2
tar -czf incremental-2.tar.gz \
    --listed-incremental=backup.snar \
    /home/user/

Restore incremental backups:

# Restore must be in order
tar -xzf full-backup.tar.gz --listed-incremental=/dev/null
tar -xzf incremental-1.tar.gz --listed-incremental=/dev/null
tar -xzf incremental-2.tar.gz --listed-incremental=/dev/null
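
A minimal wrapper sketch that ties the snapshot file into a schedule: a fresh level-0 full backup on Sundays, incrementals on other days. The paths, source directory, and Sunday cutover are assumptions to adapt.

#!/bin/bash
# Sketch: weekly full + daily incremental backups of /home/user
set -euo pipefail

BACKUP_DIR="/backup/incremental"
SNAPSHOT="$BACKUP_DIR/backup.snar"
DATE=$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

if [ "$(date +%u)" -eq 7 ]; then
    # Sunday: remove the snapshot file to start a new chain (level-0 full)
    rm -f "$SNAPSHOT"
    TARGET="$BACKUP_DIR/full-$DATE.tar.gz"
else
    TARGET="$BACKUP_DIR/incr-$DATE.tar.gz"
fi

tar -czf "$TARGET" --listed-incremental="$SNAPSHOT" /home/user/
echo "Created $TARGET"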

Excluding Files and Directories

Exclude patterns:

# Exclude specific files/directories
tar -czf backup.tar.gz \
    --exclude='*.log' \
    --exclude='*.tmp' \
    --exclude='cache' \
    --exclude='node_modules' \
    /var/www/

Exclude file:

Create /etc/tar-exclude.txt:

*.log
*.tmp
.cache
cache/
tmp/
*.swp
node_modules/
vendor/
__pycache__/
.git/

Use with tar:

tar -czf backup.tar.gz \
    --exclude-from=/etc/tar-exclude.txt \
    /var/www/

Exclude version control:

tar -czf backup.tar.gz --exclude-vcs /project/
# Excludes .git, .svn, .hg, etc.

Selective File Inclusion

Include specific files only:

# Backup only .conf files (null-delimited so unusual filenames are handled safely)
find /etc -name '*.conf' -print0 | \
    tar -czf configs.tar.gz --null --no-recursion -T -

Backup specific file types:

# All PHP and HTML files
find /var/www -name '*.php' -o -name '*.html' | \
    tar -czf web-files.tar.gz -T -

Preserving Metadata and Permissions

# tar records permissions, ownership, and timestamps when creating an archive
tar -czf backup.tar.gz /data/

# Restore those permissions exactly on extraction (-p; the default when run as root)
tar -xzpf backup.tar.gz

# Preserve SELinux context
tar -czf backup.tar.gz --selinux /data/

# Preserve extended attributes
tar -czf backup.tar.gz --xattrs /data/

# Preserve ACLs
tar -czf backup.tar.gz --acls /data/

# All preservation options
tar -czf backup.tar.gz \
    --selinux \
    --xattrs \
    --acls \
    /data/
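
A quick spot-check that ACLs survive a round trip, assuming the acl utilities (getfacl) are installed and /data/file is a placeholder for a file carrying an ACL:

# Record the ACL, archive, restore to a staging area, and compare
getfacl --omit-header /data/file > /tmp/before.acl
tar -czf /tmp/acl-test.tar.gz --acls /data/file
mkdir -p /tmp/acl-restore
tar -xzf /tmp/acl-test.tar.gz --acls -C /tmp/acl-restore
getfacl --omit-header /tmp/acl-restore/data/file > /tmp/after.acl
diff /tmp/before.acl /tmp/after.acl && echo "ACLs preserved"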

Production Backup Scripts

Comprehensive Backup Script

#!/bin/bash
# /usr/local/bin/tar-backup.sh
# Production tar-based backup with rotation and verification

set -euo pipefail

# Configuration
BACKUP_NAME="backup-$(hostname)-$(date +%Y%m%d-%H%M%S)"
BACKUP_ROOT="/backup"
BACKUP_PATH="$BACKUP_ROOT/$BACKUP_NAME.tar.gz"
LOG_FILE="/var/log/tar-backup.log"
RETENTION_DAYS=30
ADMIN_EMAIL="[email protected]"

# Sources to backup
BACKUP_SOURCES=(
    "/etc"
    "/home"
    "/var/www"
    "/opt"
    "/root"
)

# Exclude file
EXCLUDE_FILE="/etc/tar-exclude.txt"

# Logging
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

error_exit() {
    log "ERROR: $1"
    echo "Backup failed: $1" | mail -s "Backup FAILED - $(hostname)" "$ADMIN_EMAIL"
    exit 1
}

# Create exclude file if doesn't exist
if [ ! -f "$EXCLUDE_FILE" ]; then
    cat > "$EXCLUDE_FILE" << 'EOF'
*.log
*.tmp
.cache/
tmp/
cache/
node_modules/
vendor/
__pycache__/
.git/
lost+found/
*.swp
EOF
fi

log "Starting backup: $BACKUP_NAME"

# Pre-backup: Database dumps
log "Creating database dumps"
mkdir -p /var/backups/db-dumps

if command -v mysqldump &> /dev/null; then
    mysqldump --all-databases --single-transaction | \
        gzip > /var/backups/db-dumps/mysql-all.sql.gz
fi

if command -v pg_dumpall &> /dev/null; then
    sudo -u postgres pg_dumpall | \
        gzip > /var/backups/db-dumps/postgresql-all.sql.gz
fi

# Create tar backup with parallel compression
log "Creating compressed archive"
tar -cf - \
    --exclude-from="$EXCLUDE_FILE" \
    --exclude-caches \
    --exclude-vcs \
    "${BACKUP_SOURCES[@]}" \
    /var/backups/db-dumps \
    2>> "$LOG_FILE" | \
    pigz -p "$(nproc)" > "$BACKUP_PATH" || error_exit "Tar command failed"
# With pipefail set, the || handler fires if either tar or pigz fails;
# a separate PIPESTATUS check would never run because set -e aborts first.

# Verify archive integrity
log "Verifying archive integrity"
if ! gzip -t "$BACKUP_PATH" 2>> "$LOG_FILE"; then
    error_exit "Archive integrity check failed"
fi

# Check archive size
ARCHIVE_SIZE=$(stat -c%s "$BACKUP_PATH" 2>/dev/null || stat -f%z "$BACKUP_PATH")
MIN_SIZE=1048576  # 1MB minimum

if [ "$ARCHIVE_SIZE" -lt "$MIN_SIZE" ]; then
    error_exit "Archive suspiciously small: $ARCHIVE_SIZE bytes"
fi

# Create checksum
log "Creating checksum"
sha256sum "$BACKUP_PATH" > "$BACKUP_PATH.sha256"

# Create manifest
log "Creating manifest"
cat > "$BACKUP_ROOT/$BACKUP_NAME.txt" << EOF
Backup Manifest
Date: $(date)
Server: $(hostname)
Archive: $BACKUP_NAME.tar.gz
Size: $(du -h "$BACKUP_PATH" | cut -f1)
Checksum: $(cat "$BACKUP_PATH.sha256")

Sources:
$(printf '%s\n' "${BACKUP_SOURCES[@]}")

File count: $(tar -tzf "$BACKUP_PATH" | wc -l)
EOF

# Cleanup old backups
log "Cleaning up old backups"
find "$BACKUP_ROOT" -name "backup-*.tar.gz" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_ROOT" -name "backup-*.txt" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_ROOT" -name "backup-*.sha256" -mtime +$RETENTION_DAYS -delete

# Cleanup old database dumps
find /var/backups/db-dumps -name "*.sql.gz" -mtime +3 -delete

log "Backup completed successfully"

# Success notification
{
    echo "Backup completed successfully"
    echo ""
    cat "$BACKUP_ROOT/$BACKUP_NAME.txt"
} | mail -s "Backup Success - $(hostname)" "$ADMIN_EMAIL"

exit 0

Make executable:

sudo chmod +x /usr/local/bin/tar-backup.sh

Rotating Backup Script with Tiered Retention

#!/bin/bash
# /usr/local/bin/tar-rotating-backup.sh
# Implements GFS (Grandfather-Father-Son) rotation

BACKUP_TYPE="$1"  # daily, weekly, or monthly
BACKUP_ROOT="/backup/tar"
DATE=$(date +%Y%m%d)
SOURCES="/etc /home /var/www"

case "$BACKUP_TYPE" in
    daily)
        BACKUP_FILE="$BACKUP_ROOT/daily/backup-daily-$DATE.tar.gz"
        KEEP_DAYS=7
        ;;
    weekly)
        BACKUP_FILE="$BACKUP_ROOT/weekly/backup-weekly-$(date +%YW%V).tar.gz"
        KEEP_DAYS=28
        ;;
    monthly)
        BACKUP_FILE="$BACKUP_ROOT/monthly/backup-monthly-$(date +%Y%m).tar.gz"
        KEEP_DAYS=365
        ;;
    *)
        echo "Usage: $0 {daily|weekly|monthly}"
        exit 1
        ;;
esac

# Create backup directory
mkdir -p "$(dirname "$BACKUP_FILE")"

# Create backup
echo "Creating $BACKUP_TYPE backup: $BACKUP_FILE"
tar -czf "$BACKUP_FILE" \
    --exclude='*.log' \
    --exclude='cache' \
    --exclude='tmp' \
    $SOURCES

# Cleanup old backups
find "$(dirname "$BACKUP_FILE")" -name "*.tar.gz" -mtime +$KEEP_DAYS -delete

echo "$BACKUP_TYPE backup completed: $BACKUP_FILE"

Cron schedule:

# Daily backup at 2 AM
0 2 * * * /usr/local/bin/tar-rotating-backup.sh daily

# Weekly backup on Sunday at 3 AM
0 3 * * 0 /usr/local/bin/tar-rotating-backup.sh weekly

# Monthly backup on 1st at 4 AM
0 4 1 * * /usr/local/bin/tar-rotating-backup.sh monthly

Restoration Procedures

Basic Restoration

# Extract entire archive
tar -xzf backup.tar.gz

# Extract to specific location
tar -xzf backup.tar.gz -C /restore/path/

# Extract with verbose output
tar -xzvf backup.tar.gz

# Extract and overwrite existing files
tar -xzf backup.tar.gz --overwrite

Selective Restoration

# List archive contents first
tar -tzf backup.tar.gz | grep important-file.txt

# Extract specific file
tar -xzf backup.tar.gz path/to/important-file.txt

# Extract directory
tar -xzf backup.tar.gz path/to/directory/

# Extract multiple items
tar -xzf backup.tar.gz \
    path/to/file1.txt \
    path/to/file2.txt \
    path/to/directory/

# Extract files matching pattern
tar -xzf backup.tar.gz --wildcards '*.conf'

# Extract to stdout (view without extracting)
tar -xzf backup.tar.gz -O path/to/file.txt

Disaster Recovery

Complete system restoration:

# Boot from live USB
# Mount target filesystem
mount /dev/sda1 /mnt/target

# Extract backup (preserve permissions and numeric ownership)
tar -xzpf /backup/full-system-backup.tar.gz \
    --numeric-owner -C /mnt/target

# Reinstall bootloader from inside the restored system
for fs in dev proc sys; do mount --bind /$fs /mnt/target/$fs; done
chroot /mnt/target grub-install /dev/sda
chroot /mnt/target update-grub

# Reboot
reboot

Verification After Restoration

# Rough comparison of restored files against the archive listing
# (both lists reduced to relative file paths; directories excluded)
tar -tzf backup.tar.gz | grep -v '/$' | sort > archive-files.txt
(cd /restored/path && find . -type f | sed 's|^\./||' | sort) > restored-files.txt
diff archive-files.txt restored-files.txt

# Verify non-directory entry counts roughly match
ARCHIVE_COUNT=$(tar -tzf backup.tar.gz | grep -vc '/$')
RESTORED_COUNT=$(find /restored/path -type f | wc -l)
echo "Archive: $ARCHIVE_COUNT | Restored: $RESTORED_COUNT"

Automation and Monitoring

Systemd Timer Implementation

Service file (/etc/systemd/system/tar-backup.service):

[Unit]
Description=Tar Backup Service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/tar-backup.sh
User=root
Nice=19
IOSchedulingClass=2
IOSchedulingPriority=7

[Install]
WantedBy=multi-user.target

Timer file (/etc/systemd/system/tar-backup.timer):

[Unit]
Description=Daily Tar Backup
Requires=tar-backup.service

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable:

sudo systemctl daemon-reload
sudo systemctl enable --now tar-backup.timer
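
Confirm the timer is active, see when it next fires, and review the last run:

systemctl list-timers tar-backup.timer
systemctl status tar-backup.service
journalctl -u tar-backup.service --since today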

Monitoring Script

#!/bin/bash
# /usr/local/bin/monitor-tar-backups.sh

BACKUP_DIR="/backup"
MAX_AGE_HOURS=26
ADMIN_EMAIL="[email protected]"

# Find latest backup
LATEST_BACKUP=$(find "$BACKUP_DIR" -name "backup-*.tar.gz" -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)

if [ -z "$LATEST_BACKUP" ]; then
    echo "ERROR: No backups found" | \
        mail -s "Backup Monitoring Alert" "$ADMIN_EMAIL"
    exit 1
fi

# Check age
BACKUP_TIME=$(stat -c %Y "$LATEST_BACKUP")
CURRENT_TIME=$(date +%s)
AGE_HOURS=$(( (CURRENT_TIME - BACKUP_TIME) / 3600 ))

if [ $AGE_HOURS -gt $MAX_AGE_HOURS ]; then
    echo "WARNING: Latest backup is $AGE_HOURS hours old" | \
        mail -s "Backup Age Alert" "$ADMIN_EMAIL"
    exit 1
else
    echo "OK: Latest backup is $AGE_HOURS hours old"
    exit 0
fi
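
Schedule the monitor itself, for example each morning after the backup window (the time is an assumption):

# Run the backup monitor daily at 8 AM
0 8 * * * /usr/local/bin/monitor-tar-backups.sh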

Real-World Scenarios

Scenario 1: Website Backup

#!/bin/bash
# Website backup with database

BACKUP_DIR="/backup/website"
DATE=$(date +%Y%m%d)

# Dump database
mysqldump website_db | gzip > /tmp/website-db.sql.gz

# Create tar backup
tar -czf "$BACKUP_DIR/website-$DATE.tar.gz" \
    --exclude='cache/*' \
    --exclude='logs/*' \
    /var/www/website/ \
    /tmp/website-db.sql.gz

# Cleanup
rm /tmp/website-db.sql.gz

# Keep 30 days
find "$BACKUP_DIR" -name "website-*.tar.gz" -mtime +30 -delete

Scenario 2: Configuration Backup

#!/bin/bash
# System configuration backup

tar -czf /backup/config-$(date +%Y%m%d).tar.gz \
    --exclude='/etc/shadow-' \
    --exclude='/etc/gshadow-' \
    /etc \
    /root/.ssh \
    /home/*/.ssh \
    /var/spool/cron

Scenario 3: Offsite Backup

#!/bin/bash
# Create backup and sync to remote server

# Create backup
/usr/local/bin/tar-backup.sh

# Sync to remote
rsync -avz --delete \
    /backup/ \
    user@backup-server:/backups/$(hostname)/

# Upload to S3
LATEST=$(ls -t /backup/backup-*.tar.gz | head -1)
aws s3 cp "$LATEST" s3://my-backups/$(hostname)/

Conclusion

Tar and gzip provide reliable, universal backup solutions that work across all Unix/Linux systems. While modern backup tools offer advanced features, tar+gzip remains relevant for its simplicity, portability, and effectiveness.

Key takeaways:

  1. Choose compression wisely: Balance speed vs. size based on your needs
  2. Exclude unnecessary data: Use exclude patterns to optimize backups
  3. Verify archives: Always test archive integrity after creation
  4. Automate consistently: Schedule regular automated backups
  5. Test restoration: Practice restoration procedures regularly
  6. Implement retention: Automated cleanup prevents disk exhaustion
  7. Follow 3-2-1: Local, remote, and offsite tar backups

Combined with proper automation, monitoring, and the 3-2-1 backup rule, tar and gzip provide a solid foundation for data protection in Linux environments.