I/O Scheduler Selection and Tuning

I/O schedulers determine how the Linux kernel queues and dispatches storage requests to drives. Selecting and tuning the appropriate scheduler significantly impacts throughput, latency, and fairness. Modern solid-state drives and NVMe devices have different optimal schedulers than traditional spinning disks. This guide covers scheduler selection, tuning parameters, and performance optimization strategies.

Table of Contents

  1. I/O Scheduler Overview
  2. Available Schedulers
  3. Scheduler Selection
  4. Tunable Parameters
  5. SSD vs HDD Tuning
  6. Performance Benchmarking
  7. Monitoring I/O
  8. Conclusion

I/O Scheduler Overview

Understanding I/O Scheduling

I/O schedulers optimize storage request handling by:

  • Reordering requests for mechanical efficiency
  • Merging adjacent I/O operations
  • Preventing request starvation
  • Balancing throughput and latency

Different workloads require different schedulers (a quick device check follows this list):

  • Database: Low latency critical
  • Streaming: Throughput optimization
  • Multimedia: Fairness and interactivity
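
Before matching a scheduler to a workload, it helps to know what kind of device backs it. The short loop below is a minimal sketch using standard sysfs attributes; it prints each block device's rotational flag and currently active scheduler.

# List device type and active scheduler for every block device
# (rotational=1 means a spinning disk, 0 means SSD/NVMe)
for dev in /sys/block/*; do
  [ -f "$dev/queue/rotational" ] || continue
  printf '%-12s rotational=%s  scheduler=%s\n' \
    "$(basename "$dev")" \
    "$(cat "$dev/queue/rotational")" \
    "$(cat "$dev/queue/scheduler")"
done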

Available Schedulers

Modern Linux Schedulers

# List available schedulers (the active one is shown in brackets)
cat /sys/block/sda/queue/scheduler

# Output example on a modern (blk-mq) kernel:
# [mq-deadline] kyber bfq none

# Output example on a legacy (pre-5.0) kernel:
# noop [deadline] cfq

# Check current scheduler
cat /sys/block/nvme0n1/queue/scheduler

# Legacy single-queue schedulers (removed in kernel 5.0):
# - noop: No-op (bypass scheduling, for fast storage)
# - deadline: Prioritizes reads, prevents starvation
# - cfq: Completely Fair Queuing (per-process fairness, interactive performance)
#
# Multi-queue (blk-mq) schedulers:
# - none: No scheduling (successor to noop; the usual NVMe default)
# - mq-deadline: Multi-queue deadline (the usual default for SATA/SAS devices)
# - bfq: Budget Fair Queueing (fairness with good interactivity)
# - kyber: Self-tuning latency targets (suited to fast multi-queue devices)

Scheduler Characteristics

# noop (legacy) / none (blk-mq): minimal overhead, trusts drive intelligence
# - Best for: NVMe, high-speed SSDs
# - Pros: Low latency, minimal CPU
# - Cons: No reordering or starvation protection

# deadline (legacy): prevents starvation, prioritizes reads
# - Best for: Traditional rotating disks
# - Pros: Predictable latency, simple
# - Cons: Weaker per-process fairness than CFQ

# cfq (legacy): fair per-process distribution, interactive performance
# - Best for: General-purpose systems on pre-5.0 kernels
# - Pros: Fair, good interactivity
# - Cons: Latency variance, higher overhead

# mq-deadline: multi-queue successor to deadline
# - Best for: SATA/SAS disks and SSDs on modern kernels
# - Pros: Scalable, predictable, low overhead
# - Cons: No per-process fairness guarantees

# bfq: budget-based bandwidth sharing and fairness
# - Best for: Desktop/interactive workloads
# - Pros: Excellent fairness and responsiveness
# - Cons: Higher CPU cost; can limit peak IOPS on very fast devices

# kyber: self-tuning latency targets for reads and writes
# - Best for: Fast multi-queue devices (NVMe, fast SSDs)
# - Pros: Low, controllable latency
# - Cons: Few tunables, no fairness control
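
Putting these characteristics into practice, the sketch below applies a simple runtime policy: rotational devices get mq-deadline, non-rotational devices get none (falling back to mq-deadline if none is not offered). The policy and the device-name globs are illustrative, not recommendations from benchmarks; adjust them to your workload.

# Apply a scheduler per device based on the rotational flag (runtime only)
for dev in /sys/block/sd* /sys/block/nvme*n* /sys/block/vd*; do
  [ -f "$dev/queue/scheduler" ] || continue
  if [ "$(cat "$dev/queue/rotational")" = "1" ]; then
    target="mq-deadline"                       # spinning disk
  elif grep -qw none "$dev/queue/scheduler"; then
    target="none"                              # SSD/NVMe, "none" available
  else
    target="mq-deadline"                       # fallback if "none" is not offered
  fi
  echo "$target" | sudo tee "$dev/queue/scheduler" > /dev/null
  echo "$(basename "$dev"): set to $target"
done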

Scheduler Selection

Changing Scheduler Dynamically

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Change scheduler at runtime (the name must appear in the scheduler file)
echo "mq-deadline" | sudo tee /sys/block/sda/queue/scheduler

# Verify change
cat /sys/block/sda/queue/scheduler

# Change for multiple devices
for device in /sys/block/sd*/queue/scheduler; do
  echo "mq-deadline" | sudo tee $device
done

# Change for NVMe ("none" on blk-mq kernels; "noop" exists only on legacy kernels)
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler

Persistent Scheduler Configuration

# Method 1: kernel command line parameter (legacy kernels only; the global
# elevator= option was removed together with the single-queue block layer)
sudo nano /etc/default/grub

# Add or modify GRUB_CMDLINE_LINUX:
GRUB_CMDLINE_LINUX="... elevator=deadline"

# Update GRUB and reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot

# Method 2: udev rules (recommended on modern kernels; applied without reboot)
sudo tee /etc/udev/rules.d/60-scheduler.rules > /dev/null <<'EOF'
# Set scheduler for NVMe
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
# Set scheduler for SATA/SAS disks
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/scheduler}="mq-deadline"
# Set scheduler for virtual devices
ACTION=="add|change", KERNEL=="vd*", ATTR{queue/scheduler}="mq-deadline"
EOF

# Reload udev rules
sudo udevadm control --reload
sudo udevadm trigger
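
To confirm a rule actually matches before relying on it, udev can dry-run its rules database against a single device (sda here is just an example; exact output varies by systemd version):

# Dry-run the rules against one device and look for the scheduler assignment
sudo udevadm test /sys/block/sda 2>&1 | grep -i scheduler

# Confirm the active scheduler on the live system
cat /sys/block/sda/queue/scheduler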

Tunable Parameters

deadline / mq-deadline Scheduler Tuning

# View deadline parameters
ls -la /sys/block/sda/queue/iosched/

# Read expiration (default 500ms)
# After this time, read requests get priority
cat /sys/block/sda/queue/iosched/read_expire
echo 250 | sudo tee /sys/block/sda/queue/iosched/read_expire

# Write expiration (default 5000ms, 10x read)
cat /sys/block/sda/queue/iosched/write_expire
echo 2500 | sudo tee /sys/block/sda/queue/iosched/write_expire

# Writes starved threshold (default 2): number of read batches that may be
# dispatched before a pending write batch must be serviced
cat /sys/block/sda/queue/iosched/writes_starved
echo 1 | sudo tee /sys/block/sda/queue/iosched/writes_starved  # Lower = writes serviced sooner
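
As a worked example, here is one way to bias mq-deadline toward read latency on a single device. The device name and the numbers are illustrative starting points, not recommendations; benchmark before and after.

# Hypothetical example: read-latency-biased mq-deadline profile for /dev/sdb
DEV=sdb
echo mq-deadline | sudo tee /sys/block/$DEV/queue/scheduler
echo 100  | sudo tee /sys/block/$DEV/queue/iosched/read_expire     # expire reads quickly
echo 3000 | sudo tee /sys/block/$DEV/queue/iosched/write_expire    # keep writes from waiting too long
echo 1    | sudo tee /sys/block/$DEV/queue/iosched/writes_starved  # service writes sooner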

CFQ Scheduler Tuning (legacy kernels only; CFQ was removed in kernel 5.0)

# View CFQ parameters
ls -la /sys/block/sda/queue/iosched/

# Synchronous time slice per process (default 100ms)
cat /sys/block/sda/queue/iosched/slice_sync
echo 50 | sudo tee /sys/block/sda/queue/iosched/slice_sync

# Async request expiration (default 250ms; sync requests expire sooner)
cat /sys/block/sda/queue/iosched/fifo_expire_async

# Slice idle (waits for more I/O from the same process, default 8ms)
cat /sys/block/sda/queue/iosched/slice_idle
echo 0 | sudo tee /sys/block/sda/queue/iosched/slice_idle  # Disable for SSDs and batch workloads

BFQ Scheduler Tuning

# View BFQ parameters
ls /sys/block/sda/queue/iosched/

# Maximum budget (in sectors) a queue may consume before being deactivated
# (0 = auto-tuned by BFQ)
cat /sys/block/sda/queue/iosched/max_budget

# Maximum time a queue may hold the device before its budget is declared spent
cat /sys/block/sda/queue/iosched/timeout_sync

# Optimize for interactive latency (default 1)
echo 1 | sudo tee /sys/block/sda/queue/iosched/low_latency
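
BFQ tunables reset when the scheduler is reselected or the machine reboots. One way to persist them, assuming a udev rule is acceptable, is to set the scheduler and the tunable in the same rule (assignments apply left to right); the rule file name below is arbitrary.

# Persist BFQ with low_latency for SATA disks via udev (illustrative rule)
sudo tee /etc/udev/rules.d/61-bfq-tuning.rules > /dev/null <<'EOF'
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/scheduler}="bfq", ATTR{queue/iosched/low_latency}="1"
EOF
sudo udevadm control --reload && sudo udevadm trigger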

SSD vs HDD Tuning

SSD Optimization

# SSDs benefit from minimal scheduling
# Scheduler: none (NVMe) or mq-deadline (SATA SSDs)
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler

# Confirm the device is reported as non-rotational
cat /sys/block/nvme0n1/queue/rotational
# Should be 0 for SSDs (usually detected automatically)

# Check the NCQ (Native Command Queuing) depth for SATA devices
cat /sys/block/sda/device/queue_depth
# Raise it if the device and controller support a deeper queue

# Disable I/O merging (optional; merging buys little on devices with high random IOPS)
echo 2 | sudo tee /sys/block/nvme0n1/queue/nomerges
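
To apply the settings above across every NVMe namespace at once, a short loop over sysfs works; the glob assumes the standard nvmeXnY naming.

# Apply "none" and disabled merging to all NVMe namespaces
for q in /sys/block/nvme*n*/queue; do
  [ -d "$q" ] || continue
  echo none | sudo tee "$q/scheduler" > /dev/null
  echo 2    | sudo tee "$q/nomerges"  > /dev/null
  dev=${q%/queue}
  echo "${dev##*/}: scheduler=$(cat "$q/scheduler")"
done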

HDD Optimization

# HDDs benefit from more aggressive scheduling
echo "mq-deadline" | sudo tee /sys/block/sda/queue/scheduler

# Verify rotational media detection
cat /sys/block/sda/queue/rotational
# Should be 1 for HDDs

# Keep the read expiration at its 500ms default (lower it only if read latency is critical)
echo 500 | sudo tee /sys/block/sda/queue/iosched/read_expire

# Batch size per dispatch (default 16); larger batches favor sequential throughput
cat /sys/block/sda/queue/iosched/fifo_batch

Performance Benchmarking

Scheduler Performance Comparison

# Prepare test file
dd if=/dev/zero of=/tmp/test.img bs=1M count=10000

# Benchmark with different schedulers (names must appear in the scheduler file)
for scheduler in none mq-deadline bfq kyber; do
  echo "=== Testing $scheduler ==="
  echo $scheduler | sudo tee /sys/block/sda/queue/scheduler

  # Drop the page cache so reads actually reach the device and its scheduler
  sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

  # Sequential read
  time dd if=/tmp/test.img of=/dev/null bs=4M

  # Random read (direct I/O bypasses the page cache)
  fio --filename=/tmp/test.img --rw=randread --bs=4k --direct=1 \
    --iodepth=32 --runtime=30 --time_based --name=test
done
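
The same comparison can be driven from a reusable fio job file instead of a long command line. The file below is a minimal sketch (the name compare-sched.fio is arbitrary) and assumes the libaio engine is available; run it once per scheduler with: fio compare-sched.fio

# compare-sched.fio -- random read then random write against the test file
[global]
filename=/tmp/test.img
ioengine=libaio
direct=1
bs=4k
iodepth=32
runtime=30
time_based=1
group_reporting=1

[randread]
rw=randread

[randwrite]
rw=randwrite
stonewall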

Database Workload Testing

# pgbench benchmark with different schedulers
for scheduler in none mq-deadline bfq; do
  echo "=== Testing $scheduler ==="
  echo $scheduler | sudo tee /sys/block/sdb/queue/scheduler
  
  # Initialize database
  pgbench -i -s 100 test_db
  
  # Run benchmark
  pgbench -c 20 -j 4 -T 60 test_db | grep "tps ="
done

Monitoring I/O

I/O Statistics

# Monitor device I/O
iostat -x 1

# Fields to watch:
# - r/s, w/s: Read/write requests completed per second
# - rrqm/s, wrqm/s: Read/write requests merged per second
# - r_await, w_await: Average time (ms) a request spends queued plus being serviced
# - aqu-sz: Average request queue length (avgqu-sz in older sysstat)
# - %util: Device utilization (a saturation signal mainly for single-queue devices)

# Per-process I/O activity
pidstat -d 1

# System-wide I/O summary
iotop -b -n 1

# Block device event tracing
blktrace /dev/sda
# Parse results
blkparse /dev/sda > trace.txt

Scheduler Metrics

# The iosched/ directory exposes tunables, not counters; per-device
# counters live in /sys/block/<dev>/stat and /proc/diskstats

# Requests currently in flight (reads writes)
watch -n 1 'cat /sys/block/sda/inflight'

# Cumulative counters: completed reads, merged reads, sectors read, read ticks,
# then the same for writes, plus in-flight and queue-time fields
cat /sys/block/sda/stat

# Current tunable values for the active scheduler
grep . /sys/block/sda/queue/iosched/*

# Same counters, system-wide view
grep ' sda ' /proc/diskstats
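
For a quick look at merge activity without iostat, the raw counters in /sys/block/sda/stat can be sampled directly. The loop below is a small sketch that prints per-second deltas of completed and merged requests (press Ctrl-C to stop).

# Per-second completed and merged requests for sda, from /sys/block/sda/stat
# (fields: reads, read merges, read sectors, read ms, writes, write merges, ...)
read -r r rm _ _ w wm _ < /sys/block/sda/stat
while sleep 1; do
  pr=$r; prm=$rm; pw=$w; pwm=$wm
  read -r r rm _ _ w wm _ < /sys/block/sda/stat
  echo "reads/s=$((r - pr)) rmerges/s=$((rm - prm)) writes/s=$((w - pw)) wmerges/s=$((wm - pwm))"
done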

Conclusion

I/O scheduler selection and tuning directly impact storage performance, affecting both throughput and latency. Modern systems with NVMe and high-performance SSDs often benefit from minimal scheduling overhead, while traditional spinning disks need more sophisticated request ordering. By understanding scheduler characteristics, tunable parameters, and workload-specific requirements, infrastructure teams can optimize storage performance without application changes. Regular benchmarking validates scheduler choices and tuning effectiveness, ensuring the storage layer delivers the right balance of throughput, latency, and fairness for each workload.