NUMA Aware Application Deployment
Non-Uniform Memory Access (NUMA) architectures dominate modern high-performance systems, where memory latency varies significantly based on access location. Applications unaware of NUMA topology suffer performance degradation through remote memory access penalties. This guide covers understanding NUMA architecture, detecting topology, and deploying applications with NUMA awareness for optimal performance.
Table of Contents
- NUMA Architecture Overview
- Hardware Topology Detection
- NUMA-aware CPU Pinning
- Memory Policy Configuration
- Database NUMA Optimization
- Monitoring NUMA Performance
- Conclusion
NUMA Architecture Overview
Understanding NUMA
Modern multi-socket servers employ NUMA where:
- Each socket has local memory and CPUs
- CPUs access local memory faster than remote
- Remote memory access latency is typically 2-3x higher than local access
- Applications must account for topology
Failure to optimize causes:
- Reduced memory bandwidth
- Increased latency variance
- Poor cache efficiency
- Wasted CPU performance
Hardware Topology Detection
Detecting NUMA Configuration
# List NUMA nodes
numactl --hardware
# Output example:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 0 free: 28900 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node 1 free: 29100 MB
# Detailed topology
lscpu --all
# NUMA node assignment
grep -E "processor|physical id|core id" /proc/cpuinfo
# Memory information per node
cat /sys/devices/system/node/node*/meminfo
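The `numactl --hardware` output above is easy to parse with awk. As a sketch, the snippet below extracts the node count and per-node CPU lists from an embedded sample of that output (the sample values are illustrative, not taken from a real machine):

```shell
# Sample 'numactl --hardware' output (illustrative values)
sample='available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32768 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB'

# Count NUMA nodes from the "available:" line
node_count=$(printf '%s\n' "$sample" | awk '/^available:/ {print $2}')

# Extract the CPU list for a given node
cpus_for_node() {
  printf '%s\n' "$sample" | awk -v n="$1" '$1=="node" && $2==n && $3=="cpus:" {
    for (i = 4; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
  }'
}

echo "nodes: $node_count"
echo "node 0 cpus: $(cpus_for_node 0)"
```

On a live system, replace the embedded sample with the real command output, e.g. `sample=$(numactl --hardware)`.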
Visualizing Topology
# hwloc tools provide detailed topology
sudo apt-get install -y hwloc
# Visual topology display
lstopo
# Text summary
lstopo-no-graphics
# More details
hwloc-info
# Export topology
lstopo topology.xml
hwloc-info --input topology.xml
NUMA-aware CPU Pinning
Process CPU Affinity
# Check current CPU assignment
taskset -pc $$
# Example: Pin single process to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./application
# Parameters:
# --cpunodebind: CPUs to use
# --membind: Memory nodes to use
# Pin to specific CPUs
numactl -C 0,1,2,3 ./application
# Pin to node 1
numactl --cpunodebind=1 --membind=1 ./application
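`taskset -c` and the sysfs files under `/sys/devices/system/node/node*/cpulist` use the compact Linux cpulist syntax such as `0-3,8-11`. A small helper (a sketch; the helper name is an illustration, the input format is the standard cpulist format) to expand that syntax into individual CPU numbers for scripting:

```shell
# Expand a Linux cpulist string (e.g. "0-3,8-11") into space-separated CPUs
expand_cpulist() {
  echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
    seq "$lo" "${hi:-$lo}"
  done | tr '\n' ' ' | sed 's/ $//'
}

expand_cpulist "0-3,8-11"   # → 0 1 2 3 8 9 10 11
```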
Multi-threaded Application Pinning
# Set up pinning script for a multi-threaded app: launch it,
# then pin its threads round-robin across NUMA nodes by TID
cat > run_numa_app.sh <<'EOF'
#!/bin/bash
APP=$1
# Determine NUMA node count
NODE_COUNT=$(numactl --hardware | awk '/^available:/ {print $2}')
# Run the application with memory interleaved across all nodes
numactl --interleave=all "$APP" &
PID=$!
# Give the application a moment to spawn its worker threads
sleep 1
# Pin each thread (TID) to the CPUs of one node, round-robin
i=0
for tid in $(ps -L -o tid= -p "$PID"); do
  node=$((i % NODE_COUNT))
  taskset -cp "$(cat /sys/devices/system/node/node$node/cpulist)" "$tid"
  i=$((i + 1))
done
wait "$PID"
EOF
chmod +x run_numa_app.sh
./run_numa_app.sh ./myapp
Interleaved Thread Pinning
# Balance OpenMP threads evenly across NUMA nodes
cat > interleaved_pin.sh <<'EOF'
#!/bin/bash
APP=$1
# One OpenMP place per socket (= NUMA node on most systems),
# with threads spread evenly across the places
export OMP_NUM_THREADS=16
export OMP_PLACES=sockets
export OMP_PROC_BIND=spread
"$APP"
EOF
chmod +x interleaved_pin.sh
./interleaved_pin.sh ./myapp
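When explicit CPU lists are needed instead of `OMP_PLACES=sockets`, an `OMP_PLACES` string can be assembled from per-node cpulists. A minimal sketch (on a real system the cpulists would come from `/sys/devices/system/node/node*/cpulist`; here they are passed as arguments, and the helper name is hypothetical):

```shell
# Build an OMP_PLACES string with one place per NUMA node
build_omp_places() {
  places=""
  for list in "$@"; do
    # Append each cpulist as a brace-delimited place
    places="${places:+$places,}{$list}"
  done
  echo "$places"
}

# Two-node example matching the topology shown earlier
build_omp_places "0-7" "8-15"   # → {0-7},{8-15}
```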
Memory Policy Configuration
numactl Memory Binding
# Allocate memory from specific node
numactl --membind=0 ./application
# Allocate on the node where the task is running (the default local policy)
numactl --localalloc ./application
# Interleaved allocation across nodes
numactl --interleave=all ./application
# Preferred node with fallback
numactl --preferred=0 ./application
# Multiple memory nodes
numactl --membind=0,1 ./application
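Deployment scripts often select among these policies based on workload intent. A hedged sketch (the helper name and the policy labels are hypothetical; the flags it emits are standard numactl options):

```shell
# Map a workload intent to numactl flags
numa_policy_flags() {
  case "$1" in
    pinned)      echo "--cpunodebind=$2 --membind=$2" ;;  # strict single-node
    preferred)   echo "--preferred=$2" ;;                 # soft preference
    interleaved) echo "--interleave=all" ;;               # spread pages evenly
    local)       echo "--localalloc" ;;                   # allocate where running
    *)           echo "unknown policy: $1" >&2; return 1 ;;
  esac
}

numa_policy_flags pinned 0      # → --cpunodebind=0 --membind=0
numa_policy_flags interleaved   # → --interleave=all
```

Usage would then be `numactl $(numa_policy_flags pinned 0) ./application`.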
Memory Policy Verification
# Check the current shell's memory policy and per-node usage
numactl -s
numastat -p $$
# Memory placement for a given process (here: the shell)
cat /proc/$$/numa_maps
# Detailed memory statistics
numastat
# Output shows:
# Per-node memory usage
# Page migration activity
# Remote access patterns
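The `local_node` / `other_node` counters in numastat output can be turned into a locality percentage for dashboards or alerts. The snippet below computes it from an embedded sample (the counter names are numastat's; the values are illustrative):

```shell
# Sample numastat counters for one node (illustrative values)
sample='numa_hit 9500000
numa_miss 120000
local_node 9400000
other_node 220000'

# Locality = local_node / (local_node + other_node), as a percentage
locality_pct=$(printf '%s\n' "$sample" | awk '
  $1 == "local_node" { local_n = $2 }
  $1 == "other_node" { other_n = $2 }
  END { printf "%.1f", 100 * local_n / (local_n + other_n) }')

echo "local access: ${locality_pct}%"
```

On a live system, feed it real output, e.g. `sample=$(numastat)` (picking one node column first).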
Transparent Huge Pages and NUMA
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Enable THP
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Per-node THP usage
grep AnonHugePages /sys/devices/system/node/node*/meminfo
# Monitor THP allocation counters
grep thp_fault /proc/vmstat
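Per-node THP usage shows up as `AnonHugePages` lines in the per-node meminfo files. A small sketch that converts those lines into a per-node MB report, run here against an embedded sample (values illustrative):

```shell
# Sample lines as found in /sys/devices/system/node/node*/meminfo
sample='Node 0 AnonHugePages:   2048000 kB
Node 1 AnonHugePages:    512000 kB'

# Report THP usage per node in MB ($2 = node id, $4 = size in kB)
thp_report=$(printf '%s\n' "$sample" | awk '/AnonHugePages/ {
  printf "node %s THP: %d MB\n", $2, $4 / 1024
}')

echo "$thp_report"
```

A skewed report (all THP pages on one node) suggests the allocating threads are not where you pinned them.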
Database NUMA Optimization
MySQL NUMA Configuration
# Configure MySQL for NUMA-aware operation
# In /etc/mysql/mysql.conf.d/mysqld.cnf:
cat <<'EOF' | sudo tee -a /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
# Interleave the InnoDB buffer pool across NUMA nodes at startup
innodb_numa_interleave = ON
# Buffer pool size (total, spread across nodes)
innodb_buffer_pool_size = 16G
# Per-thread memory usage
sort_buffer_size = 2M
read_rnd_buffer_size = 2M
EOF
# Restart MySQL
sudo systemctl restart mysql
# Alternatively, start mysqld with interleaved memory by hand
numactl --interleave=all mysqld_safe
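On systemd-managed hosts, a hand-run `numactl ... mysqld_safe` does not survive service restarts; the usual place for the numactl prefix is a unit drop-in. A sketch (the binary paths and the interleave choice are assumptions to adapt to your distribution):

```ini
# /etc/systemd/system/mysql.service.d/numa.conf
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/sbin/mysqld
```

Apply it with `sudo systemctl daemon-reload && sudo systemctl restart mysql`.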
PostgreSQL NUMA Configuration
# PostgreSQL has no built-in NUMA awareness; set its memory policy
# at startup with numactl and size its buffers in postgresql.conf:
cat <<'EOF' | sudo tee -a /etc/postgresql/*/main/postgresql.conf
# Work memory per operation
work_mem = 4MB
# Shared buffers (usually 25% of RAM)
shared_buffers = 8GB
# Effective cache size
effective_cache_size = 32GB
EOF
# Start PostgreSQL with NUMA awareness
sudo -u postgres numactl --localalloc /usr/lib/postgresql/*/bin/postgres \
-D /var/lib/postgresql/*/main \
-c config_file=/etc/postgresql/*/main/postgresql.conf
Monitoring NUMA Performance
NUMA Statistics
# Overall NUMA statistics
numastat
# Per-process NUMA stats
numastat -p $(pgrep mysql)
# Memory allocation distribution
cat /proc/self/numa_maps
# Monitor remote memory access
watch -n 1 'numastat'
# Count explicitly bound memory mappings across all processes
while true; do
clear
cat /proc/*/numa_maps 2>/dev/null | grep -c 'bind:'
sleep 1
done
Performance Monitoring
# Count cross-node memory accesses (generic perf cache events)
perf stat -e node-loads,node-load-misses ./application
# Detailed NUMA event counting
perf record -e cycles,instructions,dTLB-loads,dTLB-load-misses,LLC-loads,LLC-load-misses -g ./application
perf report
# Per-node meminfo-style breakdown
numastat -m
# Check memory latency with lmbench's lat_mem_rd:
# compare local vs. remote node access
numactl --cpunodebind=0 --membind=0 lat_mem_rd 64M
numactl --cpunodebind=0 --membind=1 lat_mem_rd 64M
Detecting NUMA Issues
# A high other_node count indicates poor NUMA locality
numastat | grep other_node
# Check automatic NUMA balancing and migration activity
grep numa_ /proc/vmstat
# Monitor task placement: count tasks per last-run CPU (stat field 39)
watch -n 1 'cat /proc/*/stat | awk "{print \$39}" | sort -n | uniq -c'
# Check whether automatic NUMA balancing is enabled
cat /proc/sys/kernel/numa_balancing
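For automated checks, the hit/miss counters above can be reduced to a pass/warn decision. A minimal sketch (the function name and the 5% threshold are arbitrary choices, not established defaults):

```shell
# Flag poor NUMA locality given numa_hit and numa_miss counter values
numa_check() {
  hit=$1; miss=$2
  awk -v h="$hit" -v m="$miss" 'BEGIN {
    ratio = 100 * m / (h + m)
    if (ratio > 5) printf "WARN: %.1f%% remote/miss\n", ratio
    else          printf "OK: %.1f%% remote/miss\n", ratio
  }'
}

numa_check 9500000 120000    # → OK: 1.2% remote/miss
numa_check 8000000 2000000   # → WARN: 20.0% remote/miss
```

On a live system the two arguments would come from `grep -E 'numa_(hit|miss)' /proc/vmstat`.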
Conclusion
NUMA-aware application deployment transforms performance on modern multi-socket systems. By aligning CPU and memory affinity with hardware topology, organizations eliminate remote memory access penalties and achieve optimal utilization. Understanding the numactl tools, memory policies, and topology-specific tuning enables deployment teams to realize the full potential of high-end hardware. Combined with monitoring tools, NUMA optimization becomes a measurable, repeatable practice yielding significant performance improvements for latency-sensitive applications.


