NUMA Aware Application Deployment

Non-Uniform Memory Access (NUMA) architectures dominate modern high-performance systems, where memory latency varies significantly based on access location. Applications unaware of NUMA topology suffer performance degradation through remote memory access penalties. This guide covers understanding NUMA architecture, detecting topology, and deploying applications with NUMA awareness for optimal performance.

Table of Contents

  1. NUMA Architecture Overview
  2. Hardware Topology Detection
  3. NUMA-aware CPU Pinning
  4. Memory Policy Configuration
  5. Database NUMA Optimization
  6. Monitoring NUMA Performance
  7. Conclusion

NUMA Architecture Overview

Understanding NUMA

Modern multi-socket servers employ NUMA where:

  • Each socket has local memory and CPUs
  • CPUs access local memory faster than remote memory
  • Remote access latency is typically 2-3x higher than local
  • Applications must account for topology to perform well

Failure to optimize causes:

  • Reduced memory bandwidth
  • Increased latency variance
  • Poor cache efficiency
  • Wasted CPU performance
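The latency gap shows up directly in the node distance matrix that `numactl --hardware` prints, where 10 is the local reference cost. A small sketch that parses a captured two-node sample (real output is hardware-specific, so a sample is embedded here):

```shell
# Sample distance matrix as printed by `numactl --hardware`
# (10 = local reference cost, 21 = remote cost on this example machine)
sample='node distances:
node   0   1
  0:  10  21
  1:  21  10'

# Extract local (node 0 -> 0) and remote (node 0 -> 1) distances
local_d=$(echo "$sample"  | awk '/^  0:/ {print $2}')
remote_d=$(echo "$sample" | awk '/^  0:/ {print $3}')

# The remote/local ratio approximates the relative latency penalty
awk -v l="$local_d" -v r="$remote_d" 'BEGIN {printf "remote/local = %.1fx\n", r/l}'
```

On this sample the ratio is 2.1x, consistent with the 2-3x penalty noted above.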

Hardware Topology Detection

Detecting NUMA Configuration

# List NUMA nodes
numactl --hardware

# Output example:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 0 free: 28900 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node 1 free: 29100 MB

# Detailed per-CPU topology (the NODE column shows NUMA placement)
lscpu -e

# NUMA node assignment
grep -E "processor|physical id|core id" /proc/cpuinfo

# Memory information per node (meminfo itself has no NUMA breakdown)
cat /sys/devices/system/node/node*/meminfo

Visualizing Topology

# hwloc tools provide detailed topology
sudo apt-get install -y hwloc

# Visual topology display
lstopo

# Text summary
lstopo-no-graphics

# More details
hwloc-info

# Export topology
lstopo topology.xml
hwloc-info --input topology.xml

NUMA-aware CPU Pinning

Process CPU Affinity

# Check current CPU assignment
taskset -pc $$

# Example: Pin single process to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./application

# Parameters:
# --cpunodebind: CPUs to use
# --membind: Memory nodes to use

# Pin to specific CPUs
numactl -C 0,1,2,3 ./application

# Pin to node 1
numactl --cpunodebind=1 --membind=1 ./application
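Pinning can be verified from inside the pinned process by reading the kernel's effective affinity masks. A quick check, assuming `taskset` (util-linux) is installed:

```shell
# Run a one-shot command pinned to CPU 0 and print its allowed CPUs;
# the kernel exposes the effective mask in /proc/<pid>/status
taskset -c 0 grep Cpus_allowed_list /proc/self/status
# Expected line: "Cpus_allowed_list:  0"

# Memory binding is visible the same way
grep Mems_allowed_list /proc/self/status
```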

Multi-threaded Application Pinning

# Set up pinning script for multi-threaded app
cat > run_numa_app.sh <<'EOF'
#!/bin/bash
APP=$1

# Count NUMA nodes (numactl prints "available: N nodes (...)")
NODE_COUNT=$(numactl --hardware | awk '/^available:/ {print $2}')

# Build a node list such as "0,1" covering all detected nodes
NODES=$(seq -s, 0 $((NODE_COUNT - 1)))

# Bind the process to all nodes' CPUs; --localalloc keeps each
# thread's allocations on the node where that thread runs, so the
# scheduler can balance threads without paying remote-memory cost
numactl --cpunodebind="$NODES" --localalloc "$APP"
EOF

chmod +x run_numa_app.sh
./run_numa_app.sh ./myapp

Interleaved Thread Pinning

# Spread threads evenly across both NUMA nodes for balance
cat > interleaved_pin.sh <<'EOF'
#!/bin/bash
APP=$1

# For the 2-node topology shown earlier (node 0: CPUs 0-7,
# node 1: CPUs 8-15), define one OpenMP place per node and spread
# threads across the places: 8 threads land on each node.

export OMP_NUM_THREADS=16
export OMP_PLACES="{0:8},{8:8}"
export OMP_PROC_BIND=spread

$APP
EOF

chmod +x interleaved_pin.sh
./interleaved_pin.sh ./myapp

Memory Policy Configuration

numactl Memory Binding

# Allocate memory from a specific node only
# (allocations fail if that node's memory is exhausted)
numactl --membind=0 ./application

# Local allocation: memory comes from the node the task runs on,
# with fallback to other nodes under memory pressure
numactl --localalloc ./application

# Interleaved allocation across nodes
numactl --interleave=all ./application

# Preferred node with fallback
numactl --preferred=0 ./application

# Multiple memory nodes
numactl --membind=0,1 ./application

Memory Policy Verification

# Check the current shell's NUMA policy and per-node usage
numactl -s
numastat -p $$

# Memory placement for a given process (here: the current shell)
cat /proc/$$/numa_maps

# Detailed memory statistics
numastat

# Output shows per-node allocation counters:
# numa_hit / numa_miss (local vs off-node allocations)
# numa_foreign (allocations intended for a node but placed elsewhere)
# local_node / other_node (allocations relative to the running task's node)
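These counters can be reduced to a single locality figure. A sketch that computes the local-allocation ratio from sample counters (real values come from `numastat` or `/proc/vmstat`):

```shell
# Compute the fraction of local allocations from numastat-style
# counters (sample values embedded for illustration)
sample='numa_hit 9800000
numa_miss 200000'

echo "$sample" | awk '
  $1 == "numa_hit"  {hit = $2}
  $1 == "numa_miss" {miss = $2}
  END {printf "local allocation ratio: %.1f%%\n", 100 * hit / (hit + miss)}'
```

A ratio close to 100% indicates the memory policy is keeping allocations on the local node.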

Transparent Huge Pages and NUMA

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled

# Enable THP
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# NUMA and THP counters live in /proc/vmstat (not in meminfo)
grep -E 'numa|thp' /proc/vmstat

# Monitor THP allocation activity
grep thp_fault /proc/vmstat

Database NUMA Optimization

MySQL NUMA Configuration

# Configure MySQL for NUMA-aware operation
# In /etc/mysql/mysql.conf.d/mysqld.cnf:

cat <<'EOF' | sudo tee -a /etc/mysql/mysql.conf.d/mysqld.cnf

# NUMA Optimization: interleave the buffer pool across nodes at startup
innodb_numa_interleave = ON

# Buffer pool size (total, spread across nodes when interleaved)
innodb_buffer_pool_size = 16G

# Per-session memory usage
sort_buffer_size = 2M
read_rnd_buffer_size = 2M
EOF

# Restart MySQL
sudo systemctl restart mysql

# Alternatively, launch the server itself under numactl
# (instead of via systemd) to interleave all allocations
sudo numactl --interleave=all mysqld_safe &
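When MySQL is managed by systemd, the numactl prefix can be applied through a drop-in override instead of mysqld_safe. A sketch, assuming the stock Ubuntu unit name `mysql` and binary path `/usr/sbin/mysqld`:

```shell
# Create a drop-in that relaunches mysqld under numactl
sudo mkdir -p /etc/systemd/system/mysql.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/mysql.service.d/numa.conf
[Service]
# First clear the packaged ExecStart, then set the numactl-wrapped one
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/sbin/mysqld
EOF

sudo systemctl daemon-reload
sudo systemctl restart mysql
```

Verify the unit name and ExecStart path with `systemctl cat mysql` before applying, as they vary by distribution and package.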

PostgreSQL NUMA Configuration

# PostgreSQL has no NUMA-specific settings of its own; size its
# memory appropriately and apply the NUMA policy at launch.
# Configure in postgresql.conf:

cat <<'EOF' | sudo tee -a /etc/postgresql/*/main/postgresql.conf

# Work memory per operation
work_mem = 4MB

# Shared buffers (usually 25% of RAM)
shared_buffers = 8GB

# Effective cache size
effective_cache_size = 32GB
EOF

# Start PostgreSQL with NUMA awareness
sudo -u postgres numactl --localalloc /usr/lib/postgresql/*/bin/postgres \
  -D /var/lib/postgresql/*/main \
  -c config_file=/etc/postgresql/*/main/postgresql.conf

Monitoring NUMA Performance

NUMA Statistics

# Overall NUMA statistics
numastat

# Per-process NUMA stats (pgrep -n picks the newest matching PID)
numastat -p $(pgrep -n mysqld)

# Memory allocation distribution for a process (here: the shell)
cat /proc/$$/numa_maps

# Monitor remote memory access
watch -n 1 'numastat'

# Count mappings with an explicit bind policy across all processes
while true; do
  clear
  cat /proc/*/numa_maps 2>/dev/null | grep -c 'bind:'
  sleep 1
done

Performance Monitoring

# Measure local vs remote memory loads
# (numa_hit/numa_miss are vmstat counters, not perf events)
perf stat -e node-loads,node-load-misses ./application

# Detailed NUMA event counting
perf record -e cycles,instructions,dTLB-loads,dTLB-load-misses,LLC-loads,LLC-load-misses -g ./application
perf report

# Real-time NUMA allocation counters
watch -n 1 'grep numa /proc/vmstat'

# Check memory latency with lmbench's lat_mem_rd (if installed)
# Local node:
numactl --cpunodebind=0 --membind=0 lat_mem_rd 256 128
# Remote node, for comparison:
numactl --cpunodebind=0 --membind=1 lat_mem_rd 256 128

Detecting NUMA Issues

# A high other_node count indicates poor NUMA awareness
numastat | grep other_node

# Check automatic NUMA balancing activity
grep -E 'numa_pte_updates|numa_pages_migrated' /proc/vmstat

# Show which CPU each task last ran on (field 39 of /proc/PID/stat);
# rapid shifts across node boundaries suggest cross-node migration
watch -n 1 'cat /proc/*/stat | awk "{print \$39}" | sort -n | uniq -c'

# Identify per-node memory pressure
grep MemFree /sys/devices/system/node/node*/meminfo
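The counters above can feed a simple locality alert. A sketch that flags a high off-node allocation rate from sample `/proc/vmstat`-style counters (the 5% threshold is an arbitrary example, not a standard):

```shell
# Flag poor NUMA locality when the miss fraction exceeds a threshold
# (sample counters; on a live system read them from /proc/vmstat)
sample='numa_hit 9000000
numa_miss 1000000'

echo "$sample" | awk '
  $1 == "numa_hit"  {hit = $2}
  $1 == "numa_miss" {miss = $2}
  END {
    pct = 100 * miss / (hit + miss)
    printf "numa_miss: %.1f%% of allocations\n", pct
    if (pct > 5) print "WARNING: high off-node allocation rate"
  }'
```

Because these counters only increase, a production check should compare deltas over an interval rather than absolute totals.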

Conclusion

NUMA-aware application deployment transforms performance on modern multi-socket systems. By aligning CPU and memory affinity with hardware topology, organizations eliminate memory access latency penalties and achieve optimal utilization. Understanding numactl tools, memory policies, and topology-specific tuning enables deployment teams to realize full potential of high-end hardware. Combined with monitoring tools, NUMA optimization becomes a measurable, repeatable practice yielding significant performance improvements for latency-sensitive applications.