Smartmontools Disk Health Monitoring
Smartmontools provides access to the SMART (Self-Monitoring, Analysis and Reporting Technology) data built into HDDs, SSDs, and NVMe drives. This guide covers installing Smartmontools on Linux, reading SMART attributes, configuring automated testing and email alerts, monitoring NVMe drives, and using SMART data to predict disk failures.
Prerequisites
- Ubuntu 20.04/22.04 or CentOS/Rocky Linux 8+
- Root or sudo access
- One or more HDDs, SSDs, or NVMe drives
Install Smartmontools
# Ubuntu/Debian
sudo apt update && sudo apt install -y smartmontools
# CentOS/Rocky Linux
sudo dnf install -y smartmontools
# Verify installation
smartctl --version
# List all detected storage devices
sudo smartctl --scan
# Output example:
# /dev/sda -d scsi # /dev/sda, SCSI device
# /dev/sdb -d scsi # /dev/sdb, SCSI device
# /dev/nvme0 -d nvme # /dev/nvme0, NVMe device
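Each scan line carries both the device path and the type that follows -d, so a script can reuse the pair instead of guessing the type later. A minimal sketch, assuming the output format shown above; scan_targets is a hypothetical helper name, and the canned input stands in for a real scan:

```shell
#!/bin/sh
# scan_targets: read `smartctl --scan` output on stdin and print
# "<device> <type>" pairs (field 1 is the path, field 3 follows "-d").
scan_targets() {
  awk '$2 == "-d" { print $1, $3 }'
}

# Canned scan output so the sketch runs without any drives attached.
scan_targets <<'SCAN'
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
SCAN
```

On a real system the input would come from sudo smartctl --scan, and each pair could be passed on as smartctl -H -d TYPE DEVICE.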
Check Disk Health with smartctl
# Get overall health status
sudo smartctl -H /dev/sda
# SMART overall-health self-assessment test result: PASSED
# Show all SMART information for a drive
sudo smartctl -a /dev/sda
# Extended output: everything in -a plus non-SMART data such as device statistics
sudo smartctl -x /dev/sda
# For SSDs, check wear indicators specifically:
sudo smartctl -A /dev/sda | grep -E "Wear|Media_Wearout|SSD_Life"
# Show drive identity information
sudo smartctl -i /dev/sda
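For scripting, the text output is not the only signal: smartctl also returns an exit status that is a bitmask, with the bit meanings documented in its man page. The sketch below decodes such a status; decode_smartctl_status is a hypothetical helper name and the descriptions are paraphrased from the man page:

```shell
#!/bin/sh
# decode_smartctl_status: print the meaning of each bit set in a
# smartctl exit status (bits 0-7, paraphrased from the smartctl man page).
decode_smartctl_status() {
  status=$1
  bit=0
  for meaning in \
    "command line did not parse" \
    "device open failed" \
    "a SMART or ATA command failed, or a checksum error occurred" \
    "SMART status check returned DISK FAILING" \
    "prefail attributes at or below threshold" \
    "attributes were at or below threshold in the past" \
    "device error log contains errors" \
    "self-test log contains errors"
  do
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      echo "bit $bit: $meaning"
    fi
    bit=$((bit + 1))
  done
}

# Example: status 72 = 64 + 8, so bits 3 and 6 are set.
decode_smartctl_status 72
```

In practice the argument would be $? captured immediately after a smartctl call such as sudo smartctl -H /dev/sda.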
Key SMART Attributes Explained
These attributes are most important for predicting failure:
# Show all attributes with current, worst, threshold values
sudo smartctl -A /dev/sda
# Critical attributes to watch (the raw value is the last column; names vary by vendor):
# ID   1 Raw_Read_Error_Rate      Hardware errors reading from the disk surface
#                                 (raw format is vendor-specific; some drives show large values when healthy)
# ID   5 Reallocated_Sector_Ct    Bad sectors remapped to spares
#                                 Any non-zero raw value is concerning
# ID   9 Power_On_Hours           Total operating hours
# ID  10 Spin_Retry_Count         Motor spin-up retries (HDD only)
# ID 177 Wear_Leveling_Count      SSD wear (the normalized value falls as the drive wears)
# ID 187 Reported_Uncorrect       Errors that could not be recovered
# ID 188 Command_Timeout          Commands that timed out
# ID 196 Reallocated_Event_Count  Reallocation events (even one is a warning)
# ID 197 Current_Pending_Sector   Unstable sectors waiting to be reallocated
#                                 Non-zero means some data may already be unreadable
# ID 198 Offline_Uncorrectable    Sectors found bad during offline tests
# ID 199 UDMA_CRC_Error_Count     Data transmission errors (usually cable or interface)
# Script to check critical attributes
sudo smartctl -A /dev/sda | awk '
/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {
if ($10 > 0) {
print "WARNING: " $2 " = " $10 " (non-zero indicates potential failure)"
}
}'
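The raw-value check above covers three attributes. The drive's own thresholds give a second signal: an attribute counts as failed when its normalized VALUE (column 4) drops to or below THRESH (column 6). A minimal sketch of that comparison, assuming the standard ten-column attribute table; failing_attrs is a hypothetical helper name and the sample rows are made up:

```shell
#!/bin/sh
# failing_attrs: read `smartctl -A` output on stdin and print every
# attribute whose normalized VALUE is at or below its THRESH.
# Rows with THRESH 0 are skipped because they can never fail this check.
failing_attrs() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
    printf "%s value=%d thresh=%d\n", $2, $4, $6
  }'
}

# Made-up sample rows so the sketch runs without a drive.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   005   005   010    Pre-fail  Always   FAILING_NOW 4080
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12345'
printf '%s\n' "$sample" | failing_attrs
```

With a real drive the input would be sudo smartctl -A /dev/sda piped into failing_attrs.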
Run SMART Self-Tests
# Short test (about 2 minutes): checks electronics, mechanics, and a small part of the surface
sudo smartctl -t short /dev/sda
# Long test (can take several hours): full surface scan
sudo smartctl -t long /dev/sda
# Conveyance test (ATA drives only, about 5 minutes): checks for damage after transport
sudo smartctl -t conveyance /dev/sda
# Immediate offline test: runs in the background and refreshes the SMART attribute values
sudo smartctl -t offline /dev/sda
# Check test status (run after starting a test)
sudo smartctl -a /dev/sda | grep -A 10 "Self-test execution status"
# View test history
sudo smartctl -l selftest /dev/sda
# Cancel a running test
sudo smartctl -X /dev/sda
Example output for a healthy drive after a long test:
Num  Test_Description    Status                  Remaining  LifeTime(hours)
# 1  Extended offline    Completed without error       00%     12345
# 2  Short offline       Completed without error       00%     12340
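The newest entry in the self-test log is always numbered 1, so a script can decide pass or fail from that single row. A minimal sketch, assuming the log layout shown above; latest_selftest_ok is a hypothetical helper name and the sample log is made up:

```shell
#!/bin/sh
# latest_selftest_ok: read `smartctl -l selftest` output on stdin and
# succeed only if entry "# 1" exists and reads "Completed without error".
latest_selftest_ok() {
  awk '
    /^# *1 / { found = 1; ok = ($0 ~ /Completed without error/) }
    END      { exit (found && ok) ? 0 : 1 }
  '
}

# Made-up log in which the most recent test hit a read failure.
if latest_selftest_ok <<'LOG'
Num  Test_Description    Status                  Remaining  LifeTime(hours)
# 1  Extended offline    Completed: read failure       90%     12345
# 2  Short offline       Completed without error       00%     12340
LOG
then
  echo "latest self-test passed"
else
  echo "latest self-test missing or failed"
fi
```

Treating a missing entry as a failure is a deliberate choice here: a drive that has never been tested should not be reported as healthy.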
NVMe Monitoring
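NVMe drives work with the same smartctl command but report a SMART/Health log instead of the ATA attribute table. The nvme-cli tool reads the same log: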
# Show NVMe SMART health information
sudo smartctl -a /dev/nvme0
# Key NVMe health fields to monitor:
sudo smartctl -a /dev/nvme0 | grep -E \
"Critical Warning|Temperature:|Available Spare|Percentage Used|Data Units|Media and Data Integrity Errors|Error Information"
# Critical Warning: 0x00 is healthy; non-zero indicates a problem
# Available Spare: should be above Available Spare Threshold
# Percentage Used: close to 100% means drive is near end of life
# Media and Data Integrity Errors: should be 0
# NVMe-specific: list namespaces
sudo nvme list # requires nvme-cli: apt install nvme-cli
# NVMe SMART log via nvme-cli
sudo nvme smart-log /dev/nvme0
# Check error log
sudo nvme error-log /dev/nvme0
# Monitor NVMe temperature
sudo smartctl -a /dev/nvme0 | grep "Temperature"
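The four health rules listed above (critical warning flags, spare versus threshold, percentage used, media errors) can be checked mechanically. A minimal sketch, assuming the "Name: value" layout that smartctl prints for NVMe drives; nvme_health_warnings is a hypothetical helper name, and the 90% default limit is an arbitrary assumption rather than a vendor recommendation:

```shell
#!/bin/sh
# nvme_health_warnings [LIMIT]: read `smartctl -a /dev/nvmeX` output on
# stdin and print one line per rule that is violated. Prints nothing
# when all checked fields look healthy.
nvme_health_warnings() {
  used_limit=${1:-90}   # assumed default limit for "Percentage Used"
  awk -F': *' -v limit="$used_limit" '
    $1 == "Critical Warning" && $2 != "0x00" { print "critical warning flags set: " $2 }
    $1 == "Available Spare"                  { spare = $2 + 0 }
    $1 == "Available Spare Threshold"        { thresh = $2 + 0; have_thresh = 1 }
    $1 == "Percentage Used" && $2 + 0 >= limit + 0 { print "percentage used is " ($2 + 0) "%" }
    $1 == "Media and Data Integrity Errors" && $2 + 0 > 0 { print "media and data integrity errors: " $2 }
    END { if (have_thresh && spare <= thresh) print "available spare " spare "% is not above threshold " thresh "%" }
  '
}

# Made-up readings from a worn drive.
nvme_health_warnings <<'LOG'
Critical Warning:                   0x04
Available Spare:                    8%
Available Spare Threshold:          10%
Percentage Used:                    97%
Media and Data Integrity Errors:    3
LOG
```

With a real drive the input would be sudo smartctl -a /dev/nvme0 piped into nvme_health_warnings.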
Automated Monitoring with smartd
smartd is a daemon that monitors drives and runs periodic tests:
# Configure smartd (admin@example.com is a placeholder address; replace it with your own)
sudo tee /etc/smartd.conf > /dev/null <<'EOF'
# Monitor a specific drive with more aggressive testing.
# Explicit device entries must come before DEVICESCAN, because smartd
# ignores every line that follows a DEVICESCAN entry.
/dev/sda -d sat -a \
-o on \ # Enable automatic offline testing
-S on \ # Enable attribute auto-save
-s (S/../.././04|L/../../7/04) \ # Short test daily at 4am, long test Sundays at 4am
-m admin@example.com
# Monitor all drives found by scanning
DEVICESCAN -d auto \
-H \ # Check overall health
-l error \ # Report error log changes
-l selftest \ # Report self-test log changes
-f \ # Report attribute failures
-s (S/../.././02|L/../../6/03) \ # Short test daily at 2am, long test Saturdays at 3am
-m admin@example.com \ # Email address for alerts
-M exec /usr/share/smartmontools/smartd-runner # Email handler (Debian/Ubuntu path)
EOF
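# Enable and start smartd
# (on some Debian/Ubuntu releases smartd is only an alias; if enabling fails, try the unit name smartmontools)
sudo systemctl enable smartd
sudo systemctl start smartd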
# Check smartd is running
sudo systemctl status smartd
# View smartd logs
sudo journalctl -u smartd -n 50
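The -s argument is a regular expression that smartd matches against a string of the form T/MM/DD/d/HH, where T is the test type (S or L here), d is the day of the week from 1 (Monday) to 7 (Sunday), and HH is the hour. The sketch below reproduces that match with grep so a schedule can be checked before deploying it. would_run is a hypothetical helper name, and using a whole-line match to mimic smartd is an assumption:

```shell
#!/bin/sh
# would_run REGEX TYPE MM DD DOW HH
# Build the T/MM/DD/d/HH string and test it against the schedule regex.
# -x requires the regex to match the whole string, -E selects extended syntax.
would_run() {
  printf '%s/%s/%s/%s/%s\n' "$2" "$3" "$4" "$5" "$6" | grep -Eqx -- "$1"
}

schedule='(S/../.././02|L/../../6/03)'
would_run "$schedule" S 03 15 5 02 && echo "short test starts (day 5, 02:00)"
would_run "$schedule" L 03 16 6 03 && echo "long test starts (day 6, 03:00)"
would_run "$schedule" L 03 17 7 03 || echo "no long test on day 7"
```

Each dot in the pattern stands for one arbitrary character, which is why ../.. matches any month and day.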
Email Alerts
# Install mail utilities for smartd email alerts
sudo apt install -y mailutils postfix # Ubuntu
sudo dnf install -y mailx postfix # CentOS/Rocky 8 (on release 9 the package is s-nail)
# Test that email is working (admin@example.com is a placeholder; use your own address)
echo "Test from smartmontools" | mail -s "Test Alert" admin@example.com
# Configure how often smartd sends alerts
# In /etc/smartd.conf, the -m directive sets the recipient and -M sets the frequency:
# -m admin@example.com -M once # Send once per problem (default)
# -M daily # Send daily while problem persists
# -M diminishing # Send again after 1, 2, 4, 8, ... days
# -M test # Send a test email when smartd starts
# Test the alert configuration: add "-M test" to a device line in /etc/smartd.conf,
# then restart smartd and look for the test message in the log
sudo systemctl restart smartd
sudo journalctl -u smartd -n 20
# Custom alert script
sudo tee /usr/local/bin/smart-alert.sh > /dev/null <<'EOF'
#!/bin/bash
# Called by smartd when a problem is detected.
# smartd passes details in SMARTD_* environment variables (see the smartd.conf manpage).
# SMARTD_ADDRESS holds the -m recipients separated by spaces, so it is left unquoted.
echo "SMART Alert: $SMARTD_MESSAGE" | \
mail -s "DISK HEALTH WARNING: $SMARTD_DEVICE" $SMARTD_ADDRESS
# Also log to syslog
logger -t smartd "ALERT: $SMARTD_MESSAGE on $SMARTD_DEVICE"
EOF
sudo chmod +x /usr/local/bin/smart-alert.sh
# Reference the script in smartd.conf (placeholder address; it reaches the script as SMARTD_ADDRESS):
# -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
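An alert script can be exercised without waiting for a real failure by setting the environment variables smartd would export and running the formatting logic directly. A minimal dry-run sketch: format_alert is a hypothetical helper name, the variable names are the ones documented in the smartd.conf manpage, and the values are made up:

```shell
#!/bin/sh
# format_alert: build the subject and body an alert script could send,
# using variables that smartd exports to "-M exec" scripts.
format_alert() {
  printf 'Subject: DISK HEALTH WARNING: %s [%s]\n' \
    "${SMARTD_DEVICE:-unknown device}" "${SMARTD_FAILTYPE:-unknown}"
  printf '%s\n' "${SMARTD_MESSAGE:-no message provided}"
}

# Simulate a smartd invocation in a subshell so the made-up values
# do not leak into the calling shell.
(
  export SMARTD_DEVICE=/dev/sda
  export SMARTD_FAILTYPE=CurrentPendingSector
  export SMARTD_MESSAGE='Device: /dev/sda, 8 Currently unreadable (pending) sectors'
  format_alert
)
```

Printing instead of mailing keeps the dry run free of side effects; the output can be piped to mail once the format looks right.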
Failure Prediction and Planning
# Create a health summary script for all drives
sudo tee /usr/local/bin/disk-health-check.sh > /dev/null <<'EOF'
#!/bin/bash
echo "=== Disk Health Report: $(date) ==="
echo ""
for DEV in $(smartctl --scan | awk '{print $1}'); do
echo "--- $DEV ---"
MODEL=$(smartctl -i $DEV 2>/dev/null | grep "Device Model\|Model Number" | awk -F: '{print $2}')
HOURS=$(smartctl -A $DEV 2>/dev/null | grep "Power_On_Hours" | awk '{print $10}')
HEALTH=$(smartctl -H $DEV 2>/dev/null | grep "result:" | awk '{print $NF}')
echo " Model: $MODEL"
echo " Hours: $HOURS"
echo " Health: $HEALTH"
# Check for concerning attributes
PENDING=$(smartctl -A $DEV 2>/dev/null | grep "Current_Pending_Sector" | awk '{print $10}')
REALLOCATED=$(smartctl -A $DEV 2>/dev/null | grep "Reallocated_Sector_Ct" | awk '{print $10}')
[ "$PENDING" -gt 0 ] 2>/dev/null && echo " *** WARNING: Pending sectors: $PENDING ***"
[ "$REALLOCATED" -gt 0 ] 2>/dev/null && echo " *** WARNING: Reallocated sectors: $REALLOCATED ***"
echo ""
done
EOF
sudo chmod +x /usr/local/bin/disk-health-check.sh
# Run it
sudo /usr/local/bin/disk-health-check.sh
# Schedule it with cron
echo "0 6 * * * root /usr/local/bin/disk-health-check.sh | mail -s 'Daily Disk Health' admin@example.com" | \
sudo tee /etc/cron.d/disk-health-check
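A single snapshot shows whether a counter is non-zero, but a counter that keeps growing between runs is a stronger sign of a drive in decline. The sketch below stores the last raw value in a state file and warns when it rises. track_value and the state-file approach are assumptions made for illustration, and the readings are made up:

```shell
#!/bin/sh
# track_value STATE_FILE NEW_VALUE
# Compare NEW_VALUE with the value saved by the previous run, save the
# new one, and warn if the counter increased. The first run only records.
track_value() {
  state_file=$1
  new=$2
  prev=$new
  if [ -f "$state_file" ]; then
    prev=$(cat "$state_file")
  fi
  printf '%s\n' "$new" > "$state_file"
  if [ "$new" -gt "$prev" ]; then
    echo "WARNING: count rose from $prev to $new"
  fi
}

# Demo with a temporary state file.
state=$(mktemp)
rm -f "$state"            # start with no history
track_value "$state" 0    # first reading: nothing to compare
track_value "$state" 0    # unchanged: silent
track_value "$state" 8    # prints a warning
rm -f "$state"
```

In the health script, the second argument would be the raw Reallocated_Sector_Ct or Current_Pending_Sector value, with one state file per drive and attribute.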
Troubleshooting
"SMART support is: Unavailable":
# Try with an explicit device type
sudo smartctl -a -d sat /dev/sda # For SATA drives behind adapters
sudo smartctl -a -d scsi /dev/sda # For SAS drives
sudo smartctl -a -d nvme /dev/nvme0 # For NVMe
# For drives behind RAID controllers, use the controller-specific type
sudo smartctl -a -d megaraid,0 /dev/sda # LSI/Avago MegaRAID, physical device ID 0
sudo smartctl -a -d 3ware,0 /dev/twa0 # 3ware RAID, port 0
Self-test won't run:
# Check if SMART is enabled on the drive
sudo smartctl -i /dev/sda | grep "SMART support is"
# Enable SMART if disabled
sudo smartctl -s on /dev/sda
smartd not sending emails:
# Test that postfix delivers mail (admin@example.com is a placeholder address)
echo "test" | mail -s "test" admin@example.com
# Check smartd logs for email errors
sudo journalctl -u smartd | grep -i "mail\|email"
# Run a single check cycle in the foreground with debug output, then exit
sudo smartd -q onecheck
Conclusion
Smartmontools provides early warning of drive failures through SMART attribute monitoring, automated self-tests, and real-time daemon monitoring. Reallocated sectors, pending sectors, and uncorrectable errors are the most reliable failure predictors — any non-zero value warrants immediate backup verification and drive replacement planning. Integrate smartd with your alerting system to catch failures before they cause data loss.


