Bare Metal Recovery Procedures

Bare metal recovery enables complete server restoration from scratch, including bootloader, filesystem, and all data. This guide covers full system backup strategies, bootable rescue environments, GRUB restoration, LVM recovery, and network boot procedures for comprehensive infrastructure disaster recovery.

Table of Contents

  1. Bare Metal Recovery Overview
  2. Full System Backup Strategies
  3. Creating Bootable Rescue Environments
  4. GRUB Restoration
  5. LVM Recovery
  6. Network Boot Recovery
  7. Automated Recovery
  8. Recovery Testing
  9. Conclusion

Bare Metal Recovery Overview

Bare metal recovery requires:

  • Hardware: Target machine or spare hardware
  • Bootable Media: USB, CD, or network boot
  • System Image: Complete disk backup including partitions
  • Recovery Tools: Imaging software (dd, tar, Clonezilla)
  • Boot Configuration: GRUB/bootloader restoration
  • Testing: Verified procedures before disaster
# Bare metal recovery capabilities matrix
cat > /tmp/recovery-matrix.txt << 'EOF'
Recovery Method | Recovery Time | Data Integrity | Hardware Match
----------------|---------------|----------------|----------------
Full disk image | 30-60 min     | Exact          | Required
Filesystem tar  | 1-4 hours     | Very high      | Not required
Incremental bkp | Variable      | High           | Not required
Network clone   | 30-60 min     | Exact          | Preferred
Logical restore | 4-24 hours    | Good           | Not required

Best choice depends on:
- RTO (Recovery Time Objective)
- RPO (Recovery Point Objective)
- Hardware availability
- Data consistency requirements
EOF

cat /tmp/recovery-matrix.txt

Full System Backup Strategies

Using dd for Disk Imaging

# Create full disk image with dd
create_disk_image_dd() {
    local source_device=$1
    local image_file=$2
    local block_size=4M
    
    echo "Creating disk image: $source_device -> $image_file"
    
    # Check if device exists
    if [ ! -b "$source_device" ]; then
        echo "Error: Device not found: $source_device"
        return 1
    fi
    
    # Create image with progress
    dd if="$source_device" \
        of="$image_file" \
        bs="$block_size" \
        conv=noerror,sync \
        status=progress
    
    if [ $? -eq 0 ]; then
        echo "✓ Disk image created successfully"
        ls -lh "$image_file"
    else
        echo "✗ Disk image creation failed"
        return 1
    fi
}

# Restore disk image with dd
restore_disk_image_dd() {
    local image_file=$1
    local target_device=$2
    
    echo "WARNING: This will overwrite device: $target_device"
    read -p "Are you sure? (type 'yes' to continue): " confirmation
    
    if [ "$confirmation" != "yes" ]; then
        echo "Cancelled"
        return 1
    fi
    
    echo "Restoring disk image to: $target_device"
    
    dd if="$image_file" \
        of="$target_device" \
        bs=4M \
        conv=noerror,sync \
        status=progress
    
    if [ $? -eq 0 ]; then
        echo "✓ Disk restoration complete"
        sync
    else
        echo "✗ Restoration failed"
        return 1
    fi
}

# Compress disk image for storage
compress_disk_image() {
    local image_file=$1
    local compression_level=6
    
    echo "Compressing disk image..."
    
    # Use gzip for compression
    gzip -$compression_level "$image_file"
    
    if [ $? -eq 0 ]; then
        echo "✓ Compression complete"
        ls -lh "${image_file}.gz"
    fi
}

Using tar for Filesystem Backup

# Create incremental filesystem backup
create_filesystem_backup_tar() {
    local source_dir=$1
    local backup_file=$2
    
    echo "Creating filesystem backup: $source_dir -> $backup_file"
    
    # Create comprehensive backup excluding unnecessary files
    tar \
        --exclude='/proc/*' \
        --exclude='/sys/*' \
        --exclude='/dev/*' \
        --exclude='/tmp/*' \
        --exclude='/var/tmp/*' \
        --exclude='/var/cache/*' \
        --exclude='/var/log/*' \
        --exclude='/run/*' \
        --exclude='/mnt/*' \
        --exclude='/backup/*' \
        --exclude='lost+found' \
        --exclude='.cache' \
        --exclude='*.swp' \
        -czf "$backup_file" \
        -C / \
        . 2>&1 | tee /tmp/tar-backup.log
    
    if [ $? -eq 0 ]; then
        echo "✓ Filesystem backup complete"
        du -sh "$backup_file"
    fi
}

# Restore filesystem from tar backup
restore_filesystem_tar() {
    local backup_file=$1
    local restore_mount=$2
    
    echo "Restoring filesystem from: $backup_file"
    echo "Restore location: $restore_mount"
    
    # Verify mount point exists
    if [ ! -d "$restore_mount" ]; then
        echo "Error: Mount point does not exist: $restore_mount"
        return 1
    fi
    
    # Extract backup
    tar -xzf "$backup_file" -C "$restore_mount" \
        --warning=no-file-changed \
        --numeric-owner
    
    if [ $? -eq 0 ]; then
        echo "✓ Filesystem restoration complete"
    else
        echo "✗ Restoration failed"
        return 1
    fi
}

# Incremental backup tracking
track_incremental_backups() {
    local snapshot_file="/var/lib/backup/.last_backup"
    local backup_dir="/backup/incremental"
    
    mkdir -p "$backup_dir"
    
    if [ ! -f "$snapshot_file" ]; then
        touch "$snapshot_file"
    fi
    
    # Create incremental backup of changed files
    tar -czf "$backup_dir/incremental-$(date +%Y%m%d_%H%M%S).tar.gz" \
        --newer-mtime-than "$snapshot_file" \
        --exclude-from=/tmp/tar-excludes.txt \
        /
    
    # Update snapshot
    touch "$snapshot_file"
}

Hash-Based Backup Verification

# Verify backup integrity using checksums
verify_backup_integrity() {
    local backup_file=$1
    local checksum_file="${backup_file}.sha256"
    
    echo "Verifying backup integrity..."
    
    # Create checksum if it doesn't exist
    if [ ! -f "$checksum_file" ]; then
        sha256sum "$backup_file" > "$checksum_file"
        echo "Checksum created: $checksum_file"
    fi
    
    # Verify checksum
    if sha256sum -c "$checksum_file" --quiet; then
        echo "✓ Backup integrity verified"
        return 0
    else
        echo "✗ Backup corruption detected!"
        return 1
    fi
}

Creating Bootable Rescue Environments

Create Bootable Rescue USB

# Create comprehensive rescue environment on USB
create_rescue_usb() {
    local usb_device=$1  # e.g., /dev/sdb
    local iso_file=${2:-"systemrescue-10.00.iso"}
    
    echo "Creating rescue environment on: $usb_device"
    
    # Verify device is actually a USB drive
    if [ ! -b "$usb_device" ]; then
        echo "Error: Not a valid device: $usb_device"
        return 1
    fi
    
    # Unmount if mounted
    umount "${usb_device}"* 2>/dev/null
    
    # Write ISO to USB
    dd if="$iso_file" of="$usb_device" bs=4M status=progress
    sync
    
    if [ $? -eq 0 ]; then
        echo "✓ Rescue USB created successfully"
    fi
}

# Download SystemRescueCD
download_rescue_iso() {
    local download_dir="/tmp/rescue-iso"
    mkdir -p "$download_dir"
    
    echo "Downloading SystemRescueCD..."
    
    # Get latest version
    wget -O "$download_dir/systemrescue.iso" \
        https://download.system-rescue.org/systemrescue-10.00.iso
    
    echo "✓ Downloaded to: $download_dir/systemrescue.iso"
}

Build Custom Rescue Environment

# Create custom rescue environment with required tools
build_custom_rescue_env() {
    local rescue_dir="/opt/rescue"
    
    mkdir -p "$rescue_dir/bin"
    mkdir -p "$rescue_dir/sbin"
    mkdir -p "$rescue_dir/lib64"
    
    echo "Building custom rescue environment..."
    
    # Copy essential utilities
    local essential_tools=(
        "/bin/bash"
        "/bin/ls"
        "/bin/cp"
        "/bin/dd"
        "/bin/mount"
        "/bin/umount"
        "/sbin/fsck"
        "/sbin/mkfs"
        "/sbin/grub-install"
        "/usr/bin/rsync"
        "/usr/bin/curl"
        "/usr/bin/wget"
        "/usr/sbin/parted"
        "/usr/sbin/fdisk"
    )
    
    for tool in "${essential_tools[@]}"; do
        if [ -f "$tool" ]; then
            cp --parents "$tool" "$rescue_dir/"
        fi
    done
    
    # Copy shared libraries
    ldd "${essential_tools[@]}" | grep "=> /" | awk '{print $3}' | sort -u | while read lib; do
        cp --parents "$lib" "$rescue_dir/" 2>/dev/null
    done
    
    echo "✓ Custom rescue environment created"
}

GRUB Restoration

Restore GRUB Bootloader

# Restore GRUB bootloader to MBR or UEFI
restore_grub_bootloader() {
    local target_device=$1
    local boot_partition=${2:-"${target_device}1"}
    local grub_root_partition=${3:-"${target_device}2"}
    
    echo "Restoring GRUB bootloader"
    echo "Target device: $target_device"
    echo "Boot partition: $boot_partition"
    echo "Root partition: $grub_root_partition"
    
    # Mount root partition
    local temp_root="/mnt/restore-root"
    mkdir -p "$temp_root"
    
    mount "$grub_root_partition" "$temp_root"
    
    # Mount boot partition if separate
    if [ "$boot_partition" != "$grub_root_partition" ]; then
        mkdir -p "$temp_root/boot"
        mount "$boot_partition" "$temp_root/boot"
    fi
    
    # Mount essential filesystems
    mount -B /dev "$temp_root/dev"
    mount -t proc proc "$temp_root/proc"
    mount -t sysfs sys "$temp_root/sys"
    mount -t efivarfs efivarfs "$temp_root/sys/firmware/efi/efivars" 2>/dev/null
    
    # Chroot and reinstall GRUB
    echo "Installing GRUB to $target_device..."
    
    chroot "$temp_root" /bin/bash -c "grub-install --target=x86_64-pc $target_device"
    
    if [ $? -eq 0 ]; then
        echo "✓ GRUB installed successfully"
    else
        echo "✗ GRUB installation failed"
    fi
    
    # Reinstall GRUB for UEFI if applicable
    if [ -d "$temp_root/sys/firmware/efi" ]; then
        echo "Restoring GRUB for UEFI..."
        chroot "$temp_root" /bin/bash -c "grub-install --target=x86_64-efi --efi-directory=/boot/efi"
    fi
    
    # Update GRUB configuration
    echo "Updating GRUB configuration..."
    chroot "$temp_root" /bin/bash -c "grub-mkconfig -o /boot/grub/grub.cfg"
    
    # Unmount
    umount "$temp_root"/sys/firmware/efi/efivars 2>/dev/null
    umount "$temp_root"/{sys,proc,dev,boot,}
    
    echo "GRUB restoration complete"
}

# Restore GRUB from rescue environment
restore_grub_from_rescue() {
    local root_device=$1
    local boot_device=$2
    
    echo "Restoring GRUB from rescue environment"
    
    # Create temporary root
    mkdir -p /mnt/recovered
    
    # Mount recovered filesystem
    mount "$root_device" /mnt/recovered
    
    # Bind mount important directories
    mount --rbind /dev /mnt/recovered/dev
    mount --rbind /proc /mnt/recovered/proc
    mount --rbind /sys /mnt/recovered/sys
    
    # Enter chroot
    chroot /mnt/recovered /bin/bash -c "grub-install $boot_device && grub-mkconfig -o /boot/grub/grub.cfg"
    
    # Cleanup
    umount -l /mnt/recovered/{dev,proc,sys}
    umount /mnt/recovered
}

Fix GRUB Configuration

# Fix GRUB configuration after recovery
fix_grub_config() {
    local root_mount=$1
    
    echo "Fixing GRUB configuration..."
    
    # Common GRUB issues and fixes
    local grub_config="$root_mount/boot/grub/grub.cfg"
    
    if [ ! -f "$grub_config" ]; then
        echo "GRUB config not found, regenerating..."
        chroot "$root_mount" grub-mkconfig -o /boot/grub/grub.cfg
    fi
    
    # Verify UUID references are correct
    echo "Verifying filesystem UUIDs..."
    local root_uuid=$(blkid -s UUID -o value "$root_mount")
    
    if ! grep -q "UUID=$root_uuid" "$grub_config"; then
        echo "Updating UUID references..."
        chroot "$root_mount" grub-mkconfig -o /boot/grub/grub.cfg
    fi
    
    echo "✓ GRUB configuration verified"
}

LVM Recovery

Recover LVM Volumes

# Recover data from LVM volumes
recover_lvm_volumes() {
    echo "Scanning for LVM volumes..."
    
    # Scan for physical volumes
    pvscan
    
    # Scan for volume groups
    vgscan
    
    # List all logical volumes
    lvdisplay
    
    # Activate volume groups
    vgchange -ay
    
    # Mount recovered volumes
    local lv_path="/dev/vg0/lv_root"
    local mount_point="/mnt/recovered"
    
    mkdir -p "$mount_point"
    
    echo "Mounting LVM volume: $lv_path"
    mount "$lv_path" "$mount_point"
    
    if [ $? -eq 0 ]; then
        echo "✓ LVM volume mounted successfully"
        df -h "$mount_point"
    fi
}

# Restore LVM configuration from backup
restore_lvm_config() {
    local backup_file=$1
    
    echo "Restoring LVM configuration from: $backup_file"
    
    # Export current LVM state
    vgcfgbackup -f /tmp/vg.backup
    
    # Restore from backup
    vgcfgrestore -f "$backup_file" 2>&1 | head -20
    
    # Activate restored volume group
    vgchange -ay
    
    # Verify restoration
    lvdisplay
}

# Create LVM backup for recovery
create_lvm_backup() {
    local backup_dir="/backup/lvm"
    mkdir -p "$backup_dir"
    
    echo "Creating LVM configuration backup..."
    
    # Backup physical volume info
    pvdisplay > "$backup_dir/pv-info.txt"
    
    # Backup volume group info
    vgdisplay > "$backup_dir/vg-info.txt"
    
    # Backup logical volume info
    lvdisplay > "$backup_dir/lv-info.txt"
    
    # Backup LVM metadata
    vgcfgbackup -f "$backup_dir/vg-config.backup"
    
    echo "✓ LVM backup created in: $backup_dir"
}

Network Boot Recovery

Setup PXE Boot for Recovery

# Configure PXE boot for bare metal recovery
setup_pxe_recovery() {
    local tftp_root="/var/lib/tftpboot"
    local pxelinux_cfgdir="$tftp_root/pxelinux.cfg"
    
    echo "Setting up PXE recovery environment..."
    
    # Create directory structure
    mkdir -p "$pxelinux_cfgdir"
    
    # Copy PXE bootloader
    cp /usr/lib/syslinux/pxelinux.0 "$tftp_root/"
    cp /usr/lib/syslinux/ldlinux.c32 "$tftp_root/"
    
    # Create default PXE configuration
    cat > "$pxelinux_cfgdir/default" << 'EOF'
DEFAULT menu.c32
PROMPT 0
MENU TITLE System Recovery Boot Menu
TIMEOUT 300

LABEL SystemRescue
    MENU LABEL System Rescue Environment
    LINUX systemrescue.kernel
    APPEND initrd=systemrescue.initramfs root=live:/systemrescue.iso

LABEL ClonezillaRestore
    MENU LABEL Clonezilla (Restore)
    LINUX clonezilla/vmlinuz
    APPEND initrd=clonezilla/initrd.img boot=live live-config noswap nolocales edd=on nomodeset vga=788

LABEL DebianInstaller
    MENU LABEL Debian Installer
    LINUX debian-installer/amd64/linux
    APPEND initrd=debian-installer/amd64/initrd.gz
EOF
    
    echo "✓ PXE recovery environment configured"
}

# Boot system from network for recovery
boot_from_network() {
    echo "Instructions for network boot recovery:"
    echo ""
    echo "1. Power on bare metal machine"
    echo "2. Enter BIOS/UEFI setup (typically F2, Del, or Esc)"
    echo "3. Set boot priority:"
    echo "   - Primary: Network/PXE"
    echo "   - Secondary: Local disk"
    echo "4. Save and exit"
    echo "5. Machine will boot from PXE"
    echo ""
    echo "Select recovery option from boot menu"
}

Automated Recovery

Automated Recovery Script

# Automated bare metal recovery with minimal intervention
cat > /usr/local/bin/automated-recovery.sh << 'EOF'
#!/bin/bash

RECOVERY_MODE=${1:-"interactive"}
SOURCE_IMAGE=${2:-""}
TARGET_DEVICE=${3:-""}

# Configuration
RECOVERY_LOG="/var/log/bare-metal-recovery.log"
BACKUP_LOCATION="/backup"

log_recovery() {
    echo "[$(date)] $1" | tee -a "$RECOVERY_LOG"
}

# Step 1: Identify recovery image
find_recovery_image() {
    if [ -n "$SOURCE_IMAGE" ]; then
        log_recovery "Using specified image: $SOURCE_IMAGE"
        return 0
    fi
    
    echo "Available backup images:"
    ls -lh "$BACKUP_LOCATION"/*.img* 2>/dev/null || \
        ls -lh "$BACKUP_LOCATION"/*.tar.gz 2>/dev/null || \
        echo "No images found"
    
    read -p "Enter backup image path: " SOURCE_IMAGE
    
    if [ ! -f "$SOURCE_IMAGE" ]; then
        log_recovery "Error: Image not found"
        return 1
    fi
}

# Step 2: Identify target device
find_target_device() {
    if [ -n "$TARGET_DEVICE" ]; then
        log_recovery "Using target device: $TARGET_DEVICE"
        return 0
    fi
    
    echo "Available devices:"
    lsblk -d -o NAME,SIZE,TYPE
    
    read -p "Enter target device (e.g., /dev/sda): " TARGET_DEVICE
    
    if [ ! -b "$TARGET_DEVICE" ]; then
        log_recovery "Error: Invalid device"
        return 1
    fi
}

# Step 3: Verify backup
verify_backup() {
    log_recovery "Verifying backup image..."
    
    if [[ "$SOURCE_IMAGE" == *.gz ]]; then
        gzip -t "$SOURCE_IMAGE" || return 1
    fi
    
    log_recovery "✓ Backup verification passed"
}

# Step 4: Perform recovery
perform_recovery() {
    log_recovery "Starting recovery to: $TARGET_DEVICE"
    
    if [[ "$SOURCE_IMAGE" == *.img.gz ]]; then
        log_recovery "Restoring disk image..."
        gunzip < "$SOURCE_IMAGE" | dd of="$TARGET_DEVICE" bs=4M status=progress
    elif [[ "$SOURCE_IMAGE" == *.tar.gz ]]; then
        log_recovery "Restoring filesystem from tar..."
        mkdir -p /mnt/recovery
        mount "$TARGET_DEVICE" /mnt/recovery
        tar -xzf "$SOURCE_IMAGE" -C /mnt/recovery
        umount /mnt/recovery
    fi
    
    sync
    log_recovery "✓ Recovery complete"
}

# Step 5: Restore GRUB
restore_bootloader() {
    log_recovery "Restoring GRUB bootloader..."
    
    # Attempt automatic GRUB restoration
    mkdir -p /mnt/recovery
    mount "$TARGET_DEVICE" /mnt/recovery
    
    mount --rbind /dev /mnt/recovery/dev
    mount --rbind /proc /mnt/recovery/proc
    mount --rbind /sys /mnt/recovery/sys
    
    chroot /mnt/recovery grub-install "$TARGET_DEVICE"
    chroot /mnt/recovery grub-mkconfig -o /boot/grub/grub.cfg
    
    umount -l /mnt/recovery/{dev,proc,sys}
    umount /mnt/recovery
    
    log_recovery "✓ Bootloader restored"
}

main() {
    log_recovery "Starting automated bare metal recovery"
    
    if [ "$RECOVERY_MODE" = "interactive" ]; then
        find_recovery_image || exit 1
        find_target_device || exit 1
    fi
    
    verify_backup || exit 1
    
    read -p "WARNING: This will overwrite $TARGET_DEVICE. Continue? (yes/no): " confirm
    [ "$confirm" = "yes" ] || exit 1
    
    perform_recovery || exit 1
    restore_bootloader || exit 1
    
    log_recovery "Recovery process complete"
    echo "System ready for reboot"
}

main
EOF

chmod +x /usr/local/bin/automated-recovery.sh

Recovery Testing

Test Recovery Procedures Regularly

# Test bare metal recovery procedure
test_recovery_procedure() {
    local test_image="/backup/test-recovery.img.gz"
    local test_vm_device="/dev/loop0"
    
    echo "Testing bare metal recovery procedure..."
    
    # Create test environment
    mkdir -p /tmp/recovery-test
    cd /tmp/recovery-test
    
    # Create loop device for testing
    dd if=/dev/zero of=test-disk.img bs=1G count=10
    losetup "$test_vm_device" test-disk.img
    
    # Perform recovery
    if gunzip < "$test_image" | dd of="$test_vm_device" bs=4M; then
        echo "✓ Image restoration successful"
    else
        echo "✗ Image restoration failed"
        return 1
    fi
    
    # Verify filesystem
    fsck -n "$test_vm_device"
    
    # Cleanup
    losetup -d "$test_vm_device"
    rm test-disk.img
    
    echo "✓ Recovery test completed"
}

# Document recovery procedure
document_recovery_steps() {
    cat > /tmp/recovery-runbook.txt << 'EOF'
==============================================
BARE METAL RECOVERY RUNBOOK
==============================================

Prerequisites:
- Bootable rescue media (USB or CD)
- Latest backup image
- Recovery procedure documentation

Recovery Steps:

1. Boot Server into Rescue Environment
   - Plug in bootable USB/CD
   - Power on server
   - Select boot device from BIOS menu
   - Wait for rescue environment to load

2. Identify Disks and Partitions
   lsblk
   fdisk -l

3. Locate Recovery Image
   - Check network storage: NFS, SAMBA
   - Download from backup server if needed
   ls -lh /mnt/backup/

4. Restore System Image
   gunzip < backup.img.gz | dd of=/dev/sda bs=4M status=progress
   OR
   tar -xzf backup.tar.gz -C /mnt/recovered

5. Restore GRUB Bootloader
   mount /dev/sda1 /mnt/recovered
   mount --bind /dev /mnt/recovered/dev
   chroot /mnt/recovered
   grub-install /dev/sda
   grub-mkconfig -o /boot/grub/grub.cfg
   exit

6. Verify Recovery
   fsck -n /dev/sda1
   mount /dev/sda1 /mnt/verify
   ls /mnt/verify/

7. Reboot
   reboot

Expected Recovery Time: 30-60 minutes
Success Indicators:
- System boots without errors
- All filesystems mount
- Services start correctly
- Network connectivity confirmed
EOF

cat /tmp/recovery-runbook.txt
}

Conclusion

Bare metal recovery requires:

  1. Complete Backups: Full disk or filesystem images at regular intervals
  2. Bootable Media: Rescue environment ready for immediate use
  3. GRUB Knowledge: Ability to restore bootloader if needed
  4. LVM Understanding: Recovery of logical volumes if applicable
  5. Network Boot: PXE configured as fallback recovery method
  6. Regular Testing: Proven procedures before actual disaster

Maintain recovery procedures with your infrastructure:

  • Update backup media when hardware changes
  • Document custom configurations
  • Test recovery at least quarterly
  • Keep multiple backup copies in different locations

The ultimate goal is to reduce MTTR (Mean Time To Recovery) and ensure RTO/RPO targets are met.