CPU Pinning and NUMA Configuration: Complete Guide

Introduction

CPU pinning and NUMA (Non-Uniform Memory Access) configuration are advanced virtualization techniques that dramatically improve virtual machine performance by optimizing CPU and memory locality. These techniques are essential for high-performance workloads, real-time applications, and environments where predictable, consistent performance is critical.

This comprehensive guide explores CPU pinning, NUMA topology configuration, and performance optimization strategies for KVM/QEMU virtual machines. Whether you're running latency-sensitive applications, high-throughput databases, or compute-intensive workloads, understanding these concepts will help you extract maximum performance from your virtualization infrastructure.

Without proper CPU and NUMA configuration, virtual machines may experience performance variability, cache thrashing, and memory access penalties that can cut the performance of memory-bound workloads by 50% or more. Modern multi-socket systems with NUMA architecture require careful configuration to avoid cross-socket memory access and ensure optimal resource locality.

By the end of this guide, you'll master CPU pinning strategies, NUMA topology configuration, and performance tuning techniques that enable your VMs to achieve near-native performance for demanding production workloads.

Understanding CPU Architecture and NUMA

CPU Topology Fundamentals

Modern servers use hierarchical CPU architectures:

Server
  └─ Sockets (Physical CPUs)
      └─ Cores (Physical cores per socket)
          └─ Threads (Logical CPUs via Hyperthreading/SMT)

Example topology:

  • 2 sockets (NUMA nodes)
  • 8 cores per socket
  • 2 threads per core (Hyperthreading)
  • Total: 32 logical CPUs (threads)

Check Host CPU Topology

# View CPU topology
lscpu

# Output includes:
# CPU(s):              32
# Thread(s) per core:  2
# Core(s) per socket:  8
# Socket(s):           2
# NUMA node(s):        2
# NUMA node0 CPU(s):   0-15
# NUMA node1 CPU(s):   16-31

# Detailed topology
lscpu -e

# CPU NODE SOCKET CORE L1d:L1i:L2:L3
#   0    0      0    0 0:0:0:0
#   1    0      0    0 0:0:0:0
#   2    0      0    1 1:1:1:0
# ...

# Visual topology (from the hwloc package)
lstopo
# Or for terminal
lstopo-no-graphics

Understanding NUMA

NUMA (Non-Uniform Memory Access) means memory access time depends on memory location relative to the processor.

┌─────────────────────────────────────────┐
│         NUMA Node 0                     │
│  ┌──────────────┐    ┌──────────────┐  │
│  │  CPU 0-15    │───▶│   Memory     │  │
│  │  (Socket 0)  │    │   64GB       │  │
│  └──────────────┘    └──────────────┘  │
└────────────┬────────────────────────────┘
             │
             │ Interconnect (slower)
             │
┌────────────▼────────────────────────────┐
│         NUMA Node 1                     │
│  ┌──────────────┐    ┌──────────────┐  │
│  │  CPU 16-31   │───▶│   Memory     │  │
│  │  (Socket 1)  │    │   64GB       │  │
│  └──────────────┘    └──────────────┘  │
└─────────────────────────────────────────┘

Local vs Remote Memory Access:

  • Local: CPU accessing memory on same NUMA node (fast)
  • Remote: CPU accessing memory on different NUMA node (slower, ~2x latency)
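
The gap is easy to demonstrate. The sketch below (assuming the sysbench package is installed and the host is otherwise idle) runs the same memory benchmark twice from node 0's CPUs, once against local and once against remote memory; the remote run typically shows noticeably lower throughput.

# Local access: node 0 CPUs, node 0 memory
numactl --cpunodebind=0 --membind=0 sysbench memory --memory-block-size=1M --memory-total-size=10G run

# Remote access: node 0 CPUs, node 1 memory
numactl --cpunodebind=0 --membind=1 sysbench memory --memory-block-size=1M --memory-total-size=10G run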

Check NUMA Configuration

# Install numactl
apt install numactl  # Debian/Ubuntu
dnf install numactl  # RHEL/CentOS

# View NUMA topology
numactl --hardware

# Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# node 0 size: 65536 MB
# node 0 free: 32768 MB
# node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# node 1 size: 65536 MB
# node 1 free: 28672 MB
# node distances:
# node   0   1
#   0:  10  20
#   1:  20  10

# Distance interpretation:
# 10 = local (same node)
# 20 = remote (different node)
# Higher numbers = more hops/slower

# System-wide NUMA allocation statistics (hits, misses, foreign allocations)
numastat

# Per-node memory usage breakdown
numastat -m

# Per-node memory usage of processes matching a pattern (e.g. QEMU)
numastat -c qemu-system-x86

# Watch NUMA statistics
watch -n 1 'numastat -c qemu-system-x86'

CPU Pinning Concepts

Why Pin CPUs?

Without CPU pinning:

  • VMs can run on any physical CPU
  • Linux scheduler migrates VM threads
  • Cache thrashing occurs
  • Inconsistent performance
  • Poor CPU cache utilization

With CPU pinning:

  • VM bound to specific physical CPUs
  • No unwanted migrations
  • Better cache locality
  • Predictable performance
  • Reduced context switching
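
You can watch the scheduler move an unpinned VM around before configuring anything. A minimal sketch, assuming a running guest named ubuntu-vm as in the examples below:

# PSR shows the physical CPU each QEMU thread is currently on;
# without pinning, the values change over time as threads migrate
watch -n 1 "ps -L -o tid,psr,comm -p \$(pgrep -f 'qemu.*ubuntu-vm' | head -n1)"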

CPU Pinning Strategies

1. Dedicated CPU Assignment:

# Each vCPU pinned to dedicated physical CPU
# VM with 4 vCPUs → Pin to CPUs 0,1,2,3
# Best performance, exclusive resources

2. Shared CPU Assignment:

# Multiple VMs share same physical CPUs
# Useful for non-critical workloads
# Higher density, lower cost

3. NUMA-Aware Pinning:

# Pin VMs entirely within single NUMA node
# Avoid cross-node memory access
# Critical for performance

4. Hyperthreading Considerations:

# Sibling threads share the resources of one physical core
# Sibling numbering varies by system (CPU 0's sibling may be CPU 1, 8, or 16)
# Check: cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

Implementing CPU Pinning

Check VM CPU Configuration

# View current VM CPU info
virsh vcpuinfo ubuntu-vm

# Output:
# VCPU:           0
# CPU:            3
# State:          running
# CPU time:       25.2s
# CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

# A "y" for every host CPU means the vCPU may run on any of them

# View vCPU count
virsh vcpucount ubuntu-vm

Pin vCPUs to Physical CPUs

Basic CPU pinning:

# Pin vCPU 0 to physical CPU 0
virsh vcpupin ubuntu-vm 0 0

# Pin vCPU 1 to physical CPU 1
virsh vcpupin ubuntu-vm 1 1

# Pin vCPU 2 to physical CPU 2
virsh vcpupin ubuntu-vm 2 2

# Pin vCPU 3 to physical CPU 3
virsh vcpupin ubuntu-vm 3 3

# Verify pinning
virsh vcpupin ubuntu-vm

# Output:
# VCPU   CPU Affinity
# -----------------------
# 0      0
# 1      1
# 2      2
# 3      3

Pin to CPU range:

# Allow vCPU to run on range of CPUs
virsh vcpupin ubuntu-vm 0 0-3

# Pin to specific CPUs (non-contiguous)
virsh vcpupin ubuntu-vm 0 0,2,4,6

# The same pinning expressed with explicit options
virsh vcpupin ubuntu-vm --vcpu 0 --cpulist 0
virsh vcpupin ubuntu-vm --vcpu 1 --cpulist 1
virsh vcpupin ubuntu-vm --vcpu 2 --cpulist 2
virsh vcpupin ubuntu-vm --vcpu 3 --cpulist 3

Make CPU Pinning Persistent

By default, virsh vcpupin on a running VM changes only the live state, which is lost at shutdown. To make the pinning permanent, pass --live --config or edit the VM XML:

# Edit VM XML
virsh edit ubuntu-vm

# Add CPU pinning in <cputune> section:
<domain>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
  </cputune>
  ...
</domain>

# Save and exit
# Restart VM for changes to take effect
virsh shutdown ubuntu-vm
virsh start ubuntu-vm

# Verify
virsh vcpupin ubuntu-vm
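
Alternatively, a short loop applies the same pinning both live and in the persistent configuration without editing XML (a sketch for the 4-vCPU example above):

# Pin vCPU i to host CPU i, updating the live state and the stored config
for i in 0 1 2 3; do
    virsh vcpupin ubuntu-vm "$i" "$i" --live --config
done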

Avoiding Hyperthreading Siblings

# Find hyperthreading siblings
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo -n "CPU $(basename $cpu | sed 's/cpu//'): "
    cat $cpu/topology/thread_siblings_list
done

# Example output (thread numbering varies between systems):
# CPU 0: 0,16
# CPU 1: 1,17
# CPU 2: 2,18
# ...

# Pin one vCPU per physical core and avoid placing vCPUs on sibling threads
# With the N,N+16 pairing above: use CPUs 0-15 OR 16-31, never both siblings
# of the same core

virsh edit ubuntu-vm

<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
</cputune>

# With N,N+16 sibling pairing this uses one thread per core; if your host
# pairs siblings as N,N+1, pin to even-numbered CPUs instead

Emulator Thread Pinning

QEMU emulator threads handle device emulation and other housekeeping, and optional I/O threads offload disk I/O. Define the I/O threads at the domain level, then pin both to CPUs not used by vCPUs:

virsh edit ubuntu-vm

<domain>
  <!-- Define two I/O threads (domain-level element) -->
  <iothreads>2</iothreads>

  <cputune>
    <!-- vCPU pinning -->
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>

    <!-- Emulator thread pinning -->
    <emulatorpin cpuset='4,5'/>

    <!-- I/O thread pinning -->
    <iothreadpin iothread='1' cpuset='6'/>
    <iothreadpin iothread='2' cpuset='7'/>
  </cputune>
</domain>

Verify emulator pinning:

# Find QEMU process
ps aux | grep qemu | grep ubuntu-vm

# Check thread affinity
ps -mo pid,tid,comm,psr -p <qemu-pid>

# PSR column shows which CPU thread is running on
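
For a more detailed view, each QEMU thread's name and allowed CPU set can be read from /proc (a sketch; substitute the real QEMU PID for <qemu-pid>):

# vCPU threads are named "CPU 0/KVM", "CPU 1/KVM", ...; the rest are the
# emulator and I/O threads
for tid in /proc/<qemu-pid>/task/*; do
    printf '%s  %-16s %s\n' "$(basename "$tid")" "$(cat "$tid"/comm)" \
        "$(grep Cpus_allowed_list "$tid"/status)"
done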

NUMA Configuration

Single NUMA Node Assignment

Restrict VM to single NUMA node (recommended):

virsh edit ubuntu-vm

<domain>
  <vcpu placement='static'>8</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-passthrough'>
    <numa>
      <cell id='0' cpus='0-7' memory='16' unit='GiB' memAccess='shared'/>
    </numa>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
  </cputune>
</domain>

# This ensures:
# - All vCPUs from NUMA node 0
# - All memory from NUMA node 0
# - No cross-node access
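
After restarting the VM, a quick check confirms that both the policy and the actual allocations stay on node 0 (sketch; substitute the real QEMU PID):

# Policy as seen by libvirt
virsh numatune ubuntu-vm

# Actual per-node memory of the QEMU process - the node 1 column should be ~0
numastat -p <qemu-pid>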

Multi-Node NUMA Topology

Create guest NUMA topology matching host:

# For VM spanning multiple NUMA nodes
virsh edit ubuntu-vm

<domain>
  <vcpu placement='static'>16</vcpu>
  <cpu mode='host-passthrough'>
    <numa>
      <cell id='0' cpus='0-7' memory='16' unit='GiB' memAccess='shared'/>
      <cell id='1' cpus='8-15' memory='16' unit='GiB' memAccess='shared'/>
    </numa>
  </cpu>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='1'/>
  </numatune>
  <cputune>
    <!-- Node 0 vCPUs -->
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
    <!-- Node 1 vCPUs -->
    <vcpupin vcpu='8' cpuset='16'/>
    <vcpupin vcpu='9' cpuset='17'/>
    <vcpupin vcpu='10' cpuset='18'/>
    <vcpupin vcpu='11' cpuset='19'/>
    <vcpupin vcpu='12' cpuset='20'/>
    <vcpupin vcpu='13' cpuset='21'/>
    <vcpupin vcpu='14' cpuset='22'/>
    <vcpupin vcpu='15' cpuset='23'/>
  </cputune>
</domain>

NUMA Memory Policies

# strict: Allocation must come from specified nodes (fail if unavailable)
<memory mode='strict' nodeset='0'/>

# preferred: Try specified nodes first, fall back if needed
<memory mode='preferred' nodeset='0'/>

# interleave: Distribute memory across nodes
<memory mode='interleave' nodeset='0-1'/>

# restrictive: enforce the nodeset only through cgroups (no host memory policy),
# so memory can still be migrated later with virsh numatune
<memory mode='restrictive' nodeset='0'/>
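
Policies can also be set with virsh numatune instead of editing XML. A minimal sketch that changes only the stored configuration, so it takes effect at the next boot:

# Inspect the current policy
virsh numatune ubuntu-vm

# Persist a preferred-node policy for the next boot
virsh numatune ubuntu-vm --mode preferred --nodeset 0 --config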

Automatic NUMA Placement

# Let libvirt automatically place VM on optimal NUMA node
virsh edit ubuntu-vm

<vcpu placement='auto'>4</vcpu>
<numatune>
  <memory mode='strict' placement='auto'/>
</numatune>

# libvirt will:
# - Analyze host NUMA topology
# - Find node with sufficient resources
# - Pin VM to that node automatically

# View auto-placement result
virsh vcpuinfo ubuntu-vm
virsh numatune ubuntu-vm

Advanced CPU Configuration

CPU Topology Definition

Define specific CPU topology inside guest:

virsh edit ubuntu-vm

<cpu mode='host-passthrough'>
  <topology sockets='2' cores='4' threads='2'/>
  <numa>
    <cell id='0' cpus='0-7' memory='16' unit='GiB'/>
    <cell id='1' cpus='8-15' memory='16' unit='GiB'/>
  </numa>
</cpu>

# This creates inside guest:
# - 2 sockets
# - 4 cores per socket
# - 2 threads per core
# - Total: 16 vCPUs
# - 2 NUMA nodes visible in guest
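
Inside the guest, the exposed topology can be verified with the same tools used on the host (a quick sketch):

# Run inside the guest
lscpu | grep -E 'Socket|Core|Thread|NUMA'
numactl --hardware   # should report 2 nodes with 8 vCPUs and 16 GiB each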

CPU Model Selection

# Host-passthrough (best performance, limits migration)
<cpu mode='host-passthrough'/>

# Host-model (good performance, better migration)
<cpu mode='host-model'/>

# Custom (most compatible, limits features)
<cpu mode='custom' match='exact'>
  <model>Broadwell</model>
</cpu>

# With specific features
<cpu mode='custom' match='exact'>
  <model>Broadwell</model>
  <feature policy='require' name='pdpe1gb'/>
  <feature policy='require' name='pcid'/>
  <feature policy='disable' name='x2apic'/>
</cpu>
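
To see which named models the host and libvirt actually support before choosing one, something like the following works (sketch):

# All CPU models libvirt knows for this architecture
virsh cpu-models x86_64

# Models and features the local hypervisor can provide
virsh domcapabilities | grep -A30 '<cpu>'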

CPU Cache Passthrough

# Pass through host CPU cache topology
virsh edit ubuntu-vm

<cpu mode='host-passthrough'>
  <cache mode='passthrough'/>
  <topology sockets='1' cores='4' threads='2'/>
</cpu>

# Improves performance for cache-sensitive workloads

CPU Scheduling

# Set CPU scheduler parameters
virsh schedinfo ubuntu-vm --set cpu_shares=2048

# CPU shares (relative weight)
# Default: 1024
# Higher value = more CPU time

# View current settings
virsh schedinfo ubuntu-vm

# Real-time scheduling (requires RT kernel)
virsh edit ubuntu-vm

<cputune>
  <vcpusched vcpus='0-3' scheduler='fifo' priority='1'/>
</cputune>
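
Whether the FIFO policy actually took effect can be checked from the host with chrt against the QEMU threads (a sketch; substitute the real QEMU PID):

# vCPU threads should report SCHED_FIFO with the configured priority;
# emulator and I/O threads normally stay SCHED_OTHER
for tid in $(ps -L -o tid= -p <qemu-pid>); do
    chrt -p "$tid"
done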

Performance Optimization

CPU Governor Configuration

# Set CPU governor to performance
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Verify
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Make persistent
cat > /etc/systemd/system/cpu-performance.service << 'EOF'
[Unit]
Description=Set CPU governor to performance

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

[Install]
WantedBy=multi-user.target
EOF

systemctl enable cpu-performance
systemctl start cpu-performance

Disable CPU Power Saving

# Disable C-states for consistent latency
# Add to kernel parameters
vim /etc/default/grub

GRUB_CMDLINE_LINUX="intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable idle=poll"

# Update grub
update-grub  # Debian/Ubuntu
grub2-mkconfig -o /boot/grub2/grub.cfg  # RHEL/CentOS

# Reboot
reboot

# Verify
cat /sys/module/intel_idle/parameters/max_cstate
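
To confirm which idle states the kernel still exposes after the reboot, list them from sysfs (sketch):

# If cpuidle is still active, only shallow states (POLL/C1) should remain;
# with idle=poll the cpuidle directory may be absent entirely
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name 2>/dev/null || echo "cpuidle disabled"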

CPU Isolation

Isolate CPUs from host scheduler for VM use only:

# Edit kernel parameters
vim /etc/default/grub

# Isolate CPUs 0-7 for VMs
GRUB_CMDLINE_LINUX="isolcpus=0-7 nohz_full=0-7 rcu_nocbs=0-7"

# Update grub and reboot
update-grub
reboot

# Verify isolation
cat /sys/devices/system/cpu/isolated

# Now VMs pinned to CPUs 0-7 have minimal host interference
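
To see what is still scheduled on the isolated CPUs (only per-CPU kernel threads and the pinned VMs should appear), a quick sketch:

# List tasks currently running on CPUs 0-7
ps -eo psr,pid,comm --sort=psr | awk '$1 <= 7'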

Huge Pages Configuration

# Calculate huge pages needed
# Example: 16GB VM = 8192 huge pages (2MB each)
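# Same calculation in shell (2 MiB pages assumed):
echo $(( 16 * 1024 / 2 ))   # prints 8192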

# Configure huge pages
echo 8192 > /proc/sys/vm/nr_hugepages

# Make persistent
echo "vm.nr_hugepages=8192" >> /etc/sysctl.conf
sysctl -p

# Verify
cat /proc/meminfo | grep Huge

# Configure VM to use huge pages
virsh edit ubuntu-vm

<memoryBacking>
  <hugepages/>
  <locked/>
</memoryBacking>

# Restart VM
virsh shutdown ubuntu-vm
virsh start ubuntu-vm

# Verify on the host that the VM consumed huge pages (HugePages_Free drops)
grep Huge /proc/meminfo

NUMA Balancing

# Disable automatic NUMA balancing (can interfere with pinning)
echo 0 > /proc/sys/kernel/numa_balancing

# Make persistent
echo "kernel.numa_balancing=0" >> /etc/sysctl.conf
sysctl -p

# Verify
cat /proc/sys/kernel/numa_balancing

Monitoring and Verification

Verify CPU Pinning

# Check vCPU affinity
virsh vcpupin ubuntu-vm

# Check actual CPU usage
virsh vcpuinfo ubuntu-vm

# Find QEMU process
ps aux | grep qemu | grep ubuntu-vm

# Check thread affinity
taskset -cp <qemu-pid>

# Detailed per-thread info
ps -mo pid,tid,comm,psr -p <qemu-pid>

# PSR = Processor number (which CPU)

Monitor NUMA Statistics

# VM NUMA tuning policy
virsh numatune ubuntu-vm

# Per-process NUMA memory placement of the VM
numastat -p <qemu-pid>

# System-wide NUMA stats
numastat

# Per-node memory usage
numastat -m

# Watch in real-time
watch -n 1 'numastat -c qemu-system-x86'

# Check for NUMA misses (bad - indicates cross-node access)
numastat | grep numa_miss

CPU Performance Monitoring

# Install perf
apt install linux-tools-generic  # Ubuntu (on Debian: linux-perf)
dnf install perf  # RHEL/CentOS

# Monitor VM CPU performance
perf stat -p <qemu-pid> sleep 10

# Check cache misses
perf stat -e cache-misses,cache-references -p <qemu-pid> sleep 10

# Monitor context switches
perf stat -e context-switches -p <qemu-pid> sleep 10

# Detailed analysis
perf top -p <qemu-pid>

Latency Testing

# Inside VM: Test latency with cyclictest
apt install rt-tests

# Run latency test
cyclictest -t4 -p80 -n -i1000 -l10000

# Output shows min/max/avg latency
# Lower max latency = better for real-time workloads

Troubleshooting

Issue: Poor Performance Despite Pinning

# Check if CPUs are actually isolated
cat /sys/devices/system/cpu/isolated

# Verify CPU governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Check for high context switches
pidstat -w -p <qemu-pid> 1 10

# Monitor CPU steal time (run top inside the guest)
top -d 1
# Look for %st (steal time) - it should stay near 0

# Solutions:
# 1. Isolate CPUs from host scheduler
# 2. Set performance governor
# 3. Disable power management
# 4. Use huge pages

Issue: High NUMA Misses

# Check system-wide NUMA statistics for misses
numastat | grep numa_miss

# Verify VM is on single node
virsh numatune ubuntu-vm

# Check if memory is split across nodes
numastat -p <qemu-pid>

# Solutions:
# 1. Pin VM to single NUMA node
virsh numatune ubuntu-vm --mode strict --nodeset 0

# 2. Edit VM XML for strict NUMA
virsh edit ubuntu-vm
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>

# 3. Disable automatic NUMA balancing
echo 0 > /proc/sys/kernel/numa_balancing

Issue: CPU Pinning Not Working

# Verify pinning is set
virsh vcpupin ubuntu-vm

# Check if libvirt service is running
systemctl status libvirtd

# Verify XML configuration
virsh dumpxml ubuntu-vm | grep -A 10 cputune

# Check for conflicts
# - Ensure CPUs not oversubscribed
# - Verify no overlapping pinning with other VMs

# Re-apply pinning
virsh vcpupin ubuntu-vm 0 0 --live --config
virsh vcpupin ubuntu-vm 1 1 --live --config

# Restart VM
virsh shutdown ubuntu-vm
virsh start ubuntu-vm

Issue: Migration Fails with Pinning

# CPU pinning can prevent migration if destination lacks same CPUs

# Solution 1: Use cpuset ranges for flexibility
<vcpupin vcpu='0' cpuset='0-3'/>

# Solution 2: Remove pinning before migration
virsh vcpupin ubuntu-vm 0 0-31  # Allow vCPU 0 on any CPU (repeat for each vCPU; see sketch below)
virsh migrate --live ubuntu-vm qemu+ssh://dest/system
# Re-pin on destination

# Solution 3: Ensure identical CPU topology on both hosts
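
A loop makes the temporary relaxation in Solution 2 less tedious (a sketch for a 4-vCPU guest on a 32-CPU host):

# Temporarily allow every vCPU on any host CPU before migrating
for i in 0 1 2 3; do
    virsh vcpupin ubuntu-vm "$i" 0-31 --live
done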

Real-World Configuration Examples

High-Performance Database Server

<domain type='kvm'>
  <name>db-server</name>
  <memory unit='GiB'>32</memory>
  <vcpu placement='static'>8</vcpu>
  <memoryBacking>
    <hugepages/>
    <locked/>
  </memoryBacking>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-passthrough'>
    <cache mode='passthrough'/>
    <topology sockets='1' cores='8' threads='1'/>
    <numa>
      <cell id='0' cpus='0-7' memory='32' unit='GiB'/>
    </numa>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
    <emulatorpin cpuset='8-9'/>
  </cputune>
</domain>

Real-Time Application Server

<domain type='kvm'>
  <name>rt-app</name>
  <memory unit='GiB'>16</memory>
  <vcpu placement='static'>4</vcpu>
  <memoryBacking>
    <hugepages/>
    <locked/>
    <nosharepages/>
  </memoryBacking>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-passthrough'>
    <cache mode='passthrough'/>
    <topology sockets='1' cores='4' threads='1'/>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <emulatorpin cpuset='4'/>
    <vcpusched vcpus='0-3' scheduler='fifo' priority='1'/>
  </cputune>
</domain>

Multi-NUMA Large VM

<domain type='kvm'>
  <name>large-vm</name>
  <memory unit='GiB'>128</memory>
  <vcpu placement='static'>32</vcpu>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <cpu mode='host-model'>
    <topology sockets='2' cores='8' threads='2'/>
    <numa>
      <cell id='0' cpus='0-15' memory='64' unit='GiB'/>
      <cell id='1' cpus='16-31' memory='64' unit='GiB'/>
    </numa>
  </cpu>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='1'/>
  </numatune>
  <cputune>
    <!-- NUMA node 0 -->
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <!-- ... pins for vcpu 2-15 -->
    <vcpupin vcpu='15' cpuset='15'/>
    <!-- NUMA node 1 -->
    <vcpupin vcpu='16' cpuset='16'/>
    <vcpupin vcpu='17' cpuset='17'/>
    <!-- ... pins for vcpu 18-31 -->
    <vcpupin vcpu='31' cpuset='31'/>
  </cputune>
</domain>

Best Practices

Planning CPU Allocation

  1. Map host topology first
lscpu
numactl --hardware
lstopo
  2. Size VMs to fit NUMA nodes
  • Prefer VMs within single NUMA node
  • If multi-node needed, align with host topology
  3. Reserve CPUs for host
  • Don't pin all CPUs to VMs
  • Leave 1-2 CPUs for host processes
  4. Document pinning strategy (see the audit sketch after this list)
# Create allocation map
# Node 0: CPUs 0-15
#   - VM1: CPUs 0-3
#   - VM2: CPUs 4-7
#   - VM3: CPUs 8-11
#   - Host: CPUs 12-15
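
The allocation map is easiest to keep honest with a quick audit of what is actually pinned (a sketch over all running VMs):

# Dump the current pinning of every running VM to spot overlaps
for vm in $(virsh list --name); do
    echo "== $vm =="
    virsh vcpupin "$vm"
done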

Performance Tuning Checklist

# 1. Enable huge pages
echo 8192 > /proc/sys/vm/nr_hugepages

# 2. Set performance governor
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 3. Disable NUMA balancing
echo 0 > /proc/sys/kernel/numa_balancing

# 4. Isolate CPUs (optional)
# Add to kernel: isolcpus=0-15

# 5. Disable CPU power saving
# Add to kernel: intel_idle.max_cstate=0

# 6. Pin VMs to NUMA nodes
virsh edit vm # Add numatune

# 7. Use host-passthrough CPU mode
virsh edit vm # <cpu mode='host-passthrough'/>

# 8. Enable cache passthrough
virsh edit vm # <cache mode='passthrough'/>

Conclusion

CPU pinning and NUMA configuration are powerful techniques for optimizing virtual machine performance in KVM/QEMU environments. By ensuring proper CPU affinity, memory locality, and minimizing cross-NUMA-node access, you can achieve performance levels approaching bare-metal servers.

Key takeaways:

  • Always analyze host topology before configuring VMs
  • Prefer single NUMA node placement for best performance
  • Use CPU pinning for consistent, predictable performance
  • Enable huge pages for memory-intensive workloads
  • Monitor NUMA statistics to identify cross-node penalties
  • Isolate CPUs for latency-sensitive applications
  • Document CPU allocation strategies for maintenance

These optimizations are especially critical for:

  • High-performance databases
  • Real-time applications
  • HPC workloads
  • Latency-sensitive services
  • High-throughput applications

Proper CPU and NUMA configuration transforms virtualization from a performance compromise into a platform capable of supporting demanding production workloads with minimal overhead. Combined with other optimizations like SR-IOV networking and NVMe storage, modern virtualization can deliver performance that rivals or even exceeds traditional bare-metal deployments.