Bottleneck Identification with perf

Introduction

Linux perf (also known as perf_events) is a powerful profiling framework that provides deep insight into application and system performance. Unlike strace, which traces individual system calls, perf uses hardware counters and kernel instrumentation to identify CPU bottlenecks, memory access patterns, cache misses, and performance hotspots with minimal overhead. It is widely regarded as the gold standard for performance analysis on Linux systems.

perf can answer critical performance questions: Which functions consume the most CPU? Where are cache misses occurring? What causes context switches? Which code paths are hot? By sampling the execution stack at regular intervals, perf builds a statistical profile showing exactly where time is spent, enabling targeted optimization efforts.

This comprehensive guide covers perf installation, usage patterns, flame graph generation, bottleneck identification techniques, and real-world optimization examples. You'll learn how to systematically identify and eliminate performance bottlenecks in applications and systems.

Understanding perf

What perf Measures

CPU Performance:

  • Function execution time
  • Instruction counts
  • CPU cycles
  • Branch mispredictions

Memory Performance:

  • Cache misses (L1, L2, L3)
  • Memory loads/stores
  • TLB misses
  • Page faults

System Events:

  • Context switches
  • CPU migrations
  • System calls
  • Interrupts

How perf Works

  1. Sampling: Periodically interrupts CPU to record stack trace
  2. Hardware Counters: Uses CPU performance monitoring units (PMU)
  3. Kernel Events: Traces kernel-level activities
  4. Symbol Resolution: Maps addresses to function names
  5. Statistical Analysis: Aggregates samples to identify hotspots
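
To see these stages end to end, here is a minimal sketch you can run as-is; the sha256sum-over-/dev/zero pipeline is just a stand-in CPU-bound workload, and the exact symbols you see will depend on your coreutils build.

# Sample a small CPU-bound workload with call graphs (stages 1-2)
perf record -g -o toy.data -- sh -c 'head -c 200M /dev/zero | sha256sum'

# Resolve addresses to symbols and aggregate samples into a profile (stages 3-5)
perf report -i toy.data --stdio | head -20

# Clean up the capture file
rm -f toy.data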

Installation

# Ubuntu/Debian
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r) -y

# CentOS/Rocky Linux
dnf install perf -y

# Verify installation
perf --version

# Check available events
perf list

# Output shows hundreds of events:
# cpu-cycles OR cycles                               [Hardware event]
# instructions                                        [Hardware event]
# cache-references                                    [Hardware event]
# cache-misses                                        [Hardware event]
# branch-instructions OR branches                     [Hardware event]
# branch-misses                                       [Hardware event]
# ...

Basic Usage

CPU Profiling

# Profile application
perf record ./application

# Creates perf.data file
# View report
perf report

# Interactive TUI showing:
# Overhead  Command   Shared Object      Symbol
#  45.23%   app       app                [.] compute_results
#  18.67%   app       libc-2.31.so       [.] __memcpy_avx_unaligned
#  12.34%   app       app                [.] process_data
#   8.92%   app       app                [.] parse_input
#   ...

# Functions consuming most CPU time are at top

Real-Time Monitoring

# Top-like interface for system-wide profiling
perf top

# Shows:
# Overhead  Shared Object       Symbol
#   8.23%  [kernel]            [k] _raw_spin_lock
#   5.67%  libc-2.31.so        [.] __memcpy_avx_unaligned
#   4.12%  nginx               [.] ngx_http_process_request
#   3.45%  [kernel]            [k] copy_user_enhanced_fast_string
#   ...

# Updates in real-time
# Press 'h' for help, 'q' to quit
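
perf top can also be narrowed to a single process or event when the system-wide view is too noisy; the PID below is a placeholder for whatever you are investigating.

# Watch one process only
perf top -p <PID>

# Watch a specific event system-wide, e.g. cache misses
perf top -e cache-misses

# Include call graphs in the live view (heavier)
perf top -g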

System-Wide Profiling

# Profile entire system for 30 seconds
perf record -a -g sleep 30

# -a: all CPUs
# -g: capture call graphs (stack traces)

# Generate report
perf report

# Shows all processes and their CPU usage

Advanced Profiling Techniques

Call Graph Profiling

# Record with call graphs
perf record -g ./application

# Or with more detail
perf record -g --call-graph dwarf ./application

# View hierarchical call graph
perf report -g

# Output shows call chains:
# Children  Self  Command  Shared Object  Symbol
#  45.23%  0.00%  app      app            [.] main
#     |
#     ---main
#        |
#        |--30.45%--compute_results
#        |    |
#        |    |--20.12%--heavy_calculation
#        |    |
#        |    --10.33%--data_processing
#        |
#        --14.78%--parse_input

# Identifies call paths consuming most time

Sampling Frequency

# Default: 4000 Hz (4000 samples per second per CPU)
perf record ./app

# Higher frequency (more detail, more overhead)
perf record -F 9999 ./app  # 9999 Hz

# Lower frequency (less overhead, coarser profile)
perf record -F 99 ./app    # 99 Hz

# Odd frequencies such as 99 or 999 avoid sampling in lockstep with periodic activity

# Maximum rate the kernel allows (kernel.perf_event_max_sample_rate)
perf record -F max ./app
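
A quick sanity check that a chosen frequency produced enough data is to look at the sample count in the report header; this assumes a perf.data file from one of the commands above.

# Check how many samples were collected
perf report --stdio 2>/dev/null | grep -m1 "Samples:"

# Example: "# Samples: 3K of event 'cycles'"
# A few hundred samples makes the percentages noisy; aim for thousands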

Specific Event Profiling

# Cache miss profiling
perf record -e cache-misses ./app
perf report

# Branch misprediction profiling
perf record -e branch-misses ./app

# Multiple events
perf record -e cpu-cycles,cache-misses,branch-misses ./app

# The report keeps each recorded event separate; dump the text view
perf report --stdio | head -50

Process-Specific Profiling

# Attach to running process
perf record -p <PID> -g sleep 30

# Profile specific command with arguments
perf record ./benchmark --iterations=1000

# -p already follows all threads of the process; use -t to profile a single thread
perf record -t <TID> -g sleep 30
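
In practice you often have a service name rather than a PID; a small pgrep-based wrapper like the sketch below (the nginx name is just an example) makes attaching easier.

#!/bin/bash
# profile-service.sh - attach to a running service by name for 30 seconds
SERVICE="nginx"

PID=$(pgrep -o "$SERVICE")   # -o: oldest matching process (usually the parent)
if [ -z "$PID" ]; then
    echo "No process matching $SERVICE" >&2
    exit 1
fi

sudo perf record -p "$PID" -g -o "${SERVICE}.data" -- sleep 30
sudo perf report -i "${SERVICE}.data"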

Performance Analysis Workflow

Step 1: Identify Hotspots

# Profile application
perf record -g ./application

# Generate basic report
perf report --stdio > perf-report.txt

# Show the report header and leading functions
head -50 perf-report.txt

# Example output:
# Overhead  Command  Shared Object      Symbol
#  45.23%   app      app                [.] md5_transform
#  18.67%   app      libc-2.31.so       [.] memcpy
#  12.34%   app      app                [.] json_parse
#   8.92%   app      app                [.] string_concat
#   5.45%   app      libz.so.1          [.] inflate
#   ...

# Interpretation:
# md5_transform consuming 45% of CPU - optimization target!
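
To track hotspots across runs, it helps to reduce the report to a machine-readable list; the 5% cutoff in this sketch is an arbitrary choice.

# List symbols above a chosen self-overhead threshold
# (last field is the symbol; works for simple C symbol names)
perf report --stdio --no-children | awk '
    $1 ~ /%$/ {
        pct = $1; sub("%", "", pct)
        if (pct + 0 >= 5.0) print $1, $NF
    }'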

Step 2: Analyze Call Chains

# View call graph for specific function
perf report -g 'graph,0.5,caller' --no-children

# Focus on specific symbol
perf report --symbols md5_transform

# Shows:
# Overhead  Command  Symbol
#  45.23%   app      [.] md5_transform
#     |
#     ---md5_transform
#        |
#        |--35.12%--compute_hash
#        |          |
#        |          ---process_file
#        |             |
#        |             ---main
#        |
#        --10.11%--verify_checksum
#                  |
#                  ---validate_data
#                     |
#                     ---main

# Shows md5_transform called from two paths
# compute_hash path is hottest (35.12%)

Step 3: Annotate Source Code

# Show assembly and source with performance data
perf annotate md5_transform

# Output shows instruction-level hotspots:
#       :      Disassembly of md5_transform:
#  0.00 :      push   %r12
#  0.00 :      push   %rbx
#  0.00 :      mov    %rsi,%r12
# 12.34 :      mov    0x10(%rsi),%eax    # <-- 12% of time here
# 18.92 :      mov    0x14(%rsi),%ebx    # <-- 19% of time here
#  8.45 :      rol    $0x7,%eax           # <-- 8% of time here
#  ...

# Identifies specific instructions causing bottlenecks

Real-World Case Studies

Case Study 1: Slow Image Processing

Problem: Image processing application slow, unclear why

# Profile application
perf record -g ./process-images input/*.jpg

# View report
perf report --stdio | head -30

# Results:
# Overhead  Symbol
#  67.23%   [.] resize_bilinear
#  12.45%   [.] convert_colorspace
#   8.92%   [.] save_jpeg
#   5.34%   [.] load_jpeg
#   ...

# Bottleneck identified: resize_bilinear consuming 67% CPU

# Analyze call graph
perf report -g --symbols resize_bilinear

# Shows:
# resize_bilinear
#   |
#   |--45.67%--interpolate_pixel    # Main bottleneck
#   |
#   --21.56%--calculate_weights

# Annotate to see hotspot code
perf annotate interpolate_pixel

# Shows tight loop with expensive divisions:
# 23.45%:  divsd  %xmm1,%xmm0    # Division in hot loop!
# 19.87%:  mulsd  %xmm2,%xmm0
# 15.23%:  addsd  %xmm3,%xmm0

# Root cause: Division operation in inner loop
# Solution: Precompute division results

# After optimization:
# Overhead  Symbol
#  28.12%   [.] resize_bilinear  # 58% reduction!
#  15.34%   [.] convert_colorspace
#  12.45%   [.] save_jpeg
#   ...

# Total speedup: 2.4x faster image processing

Case Study 2: High CPU Usage Mystery

Problem: Web server using 80% CPU on idle-looking system

# System-wide profiling
sudo perf record -a -g sleep 30

# View report
sudo perf report

# Results show:
# Overhead  Command   Symbol
#  32.45%   php-fpm   [.] php_json_decode
#  18.23%   php-fpm   [.] zend_hash_find
#  12.67%   nginx     [.] ngx_http_process_request
#   8.92%   php-fpm   [.] pcre_compile
#   ...

# json_decode consuming significant CPU despite "idle" system

# Examine call graph
sudo perf report -g --symbols php_json_decode

# Shows:
# php_json_decode
#   |
#   ---zif_json_decode
#      |
#      ---execute_ex
#         |
#         ---zend_execute
#            |
#            ---check_session.php:35

# Traced to check_session.php line 35

# Investigate source:
# while (true) {
#     $session_data = json_decode(file_get_contents('/tmp/session'));
#     usleep(1000);  # Check every 1ms!
# }

# Root cause: Polling loop decoding JSON 1000 times per second
# Solution: Use proper session handling, reduce polling frequency

# After fix:
# Overhead  Symbol
#   5.12%   [.] php_json_decode  # 84% reduction
#   4.23%   [.] ngx_http_process_request
#   ...

# CPU usage dropped to 12% (85% reduction)

Case Study 3: Memory Access Patterns

Problem: Application slower than expected despite optimized algorithms

# Measure overall cache behavior first
perf stat -e cache-references,cache-misses ./app

# Shows:
#   1,234,567,890      cache-references
#     234,567,890      cache-misses     # 19% of all cache refs (very high!)

# Record cache misses to attribute them to functions
perf record -e cache-misses -g ./app
perf report --stdio

# Results:
# Overhead  Symbol
#  42.34%   [.] process_matrix
#  23.45%   [.] sort_array
#  15.67%   [.] search_tree
#   ...

# Analyze process_matrix
perf annotate process_matrix

# Shows column-major access on row-major data:
# for (int col = 0; col < COLS; col++) {
#     for (int row = 0; row < ROWS; row++) {
#         result += matrix[row][col];  # Cache-unfriendly!
#     }
# }

# Root cause: Poor cache locality due to access pattern
# Solution: Change to row-major access

# After optimization:
# cache-misses: 12,345,678 (95% reduction!)
# cache-references: 1,234,567,890
# Cache miss rate: 1% (excellent)

# Speedup: 4.2x faster execution

Flame Graphs

Generating Flame Graphs

# Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Record perf data
perf record -F 99 -a -g -- sleep 30

# Generate flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg

# Open in web browser
firefox flamegraph.svg

# Flame graph visualization:
# - Width: proportion of time
# - Height: call stack depth
# - Color: random within a warm palette (to aid visual separation)
# - Wide "plateaus": performance bottlenecks
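
The record/collapse/render steps are easy to wrap in a single script; this sketch assumes the FlameGraph repository cloned above sits in ./FlameGraph.

#!/bin/bash
# make-flamegraph.sh - one-shot system-wide flame graph
set -e

DURATION="${1:-30}"
OUT="flamegraph-$(date +%Y%m%d-%H%M%S).svg"

sudo perf record -F 99 -a -g -- sleep "$DURATION"
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > "$OUT"

echo "Wrote $OUT"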

Interpreting Flame Graphs

                     [main]                          100%
         ____________/    \__________
        /                            \
  [process_data]                [handle_requests]    60%     40%
    /         \                     /      \
[parse] [compute]             [read_db] [respond]    30% 30% 25% 15%
   |       |                     |         |
 [json] [heavy_calc]        [mysql]    [format]     30% 30% 25% 15%

# Interpretation:
# - heavy_calc is 30% wide = 30% of CPU time
# - mysql operations are 25% = database bottleneck
# - json parsing is 30% = parsing bottleneck

# Optimization targets:
# 1. heavy_calc (30%) - largest single function
# 2. mysql (25%) - database optimization potential
# 3. json (30%) - parsing optimization

Differential Flame Graphs

# Compare before and after optimization

# Before optimization
perf record -F 99 -a -g -o perf-before.data ./app
perf script -i perf-before.data | ./stackcollapse-perf.pl > before.folded

# After optimization
perf record -F 99 -a -g -o perf-after.data ./app
perf script -i perf-after.data | ./stackcollapse-perf.pl > after.folded

# Generate differential flame graph
./difffolded.pl before.folded after.folded | ./flamegraph.pl > diff.svg

# Visualization:
# - Red: increased CPU time (regressions)
# - Blue: decreased CPU time (improvements)
# - Gray: unchanged

# Shows impact of optimizations visually

Specialized Profiling

Off-CPU Analysis

# Profile scheduler events (why processes block)
perf record -e sched:sched_switch -e sched:sched_stat_sleep -a -g sleep 30

# Generate off-CPU flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl --colors=io > off-cpu.svg

# Shows:
# - I/O waits
# - Lock contention
# - Network delays
# - Sleep/yield calls

# Complements CPU profiling (total time = on-CPU time + off-CPU time)

Memory Profiling

# Profile page faults
perf record -e page-faults -g ./app

# Profile memory loads/stores
perf record -e cpu/mem-loads/ -e cpu/mem-stores/ -g ./app

# Analyze memory access patterns
perf mem record ./app
perf mem report

# Shows:
# - Load/store distribution
# - NUMA node access patterns
# - TLB miss rates

Context Switch Analysis

# Profile context switches
perf record -e context-switches -g -a sleep 30

# View report
perf report

# High context switches indicate:
# - Lock contention
# - Too many threads
# - I/O blocking
# - Improper sleep/wake patterns

# Example output:
# Overhead  Command  Symbol
#  23.45%   app      [k] schedule
#  18.92%   app      [k] futex_wait_queue_me
#  12.34%   app      [.] pthread_mutex_lock
#   ...

# Shows mutex contention causing context switches
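
perf also ships a dedicated scheduler subcommand that summarizes per-task wakeup latency, a useful complement when high context-switch counts point at contention; the 10-second window below is arbitrary.

# Record scheduler events system-wide for 10 seconds
sudo perf sched record -- sleep 10

# Per-task scheduling latency summary (runtime, switches, maximum delay)
sudo perf sched latency --sort max | head -20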

Performance Counters

Hardware Counter Analysis

# List available hardware counters
perf list hw

# Profile multiple counters
perf stat -e cycles,instructions,cache-references,cache-misses,branches,branch-misses ./app

# Output:
Performance counter stats for './app':

    2,345,678,901      cycles
    3,456,789,012      instructions        #  1.47  insn per cycle
    2,340,567,890      cache-references
       45,678,901      cache-misses        #  1.95% of all cache refs
    1,219,345,678      branches
        8,901,234      branch-misses       #  0.73% of all branches

      2.345678 seconds time elapsed

# Interpretation:
# IPC (insn per cycle): 1.47 is good (> 1.0)
# Cache miss rate: 1.95% is excellent (< 5%)
# Branch miss rate: 0.73% is good (< 2%)
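
For scripting these checks, perf stat can emit machine-readable output with -x; the sketch below derives IPC and the cache miss rate from it, assuming the usual CSV column layout (value, unit, event name, ...).

# CSV output goes to stderr, one line per counter
perf stat -x, -e cycles,instructions,cache-references,cache-misses ./app 2> counters.csv

awk -F, '
    $3 ~ /^cycles/           { cycles = $1 }
    $3 ~ /^instructions/     { insns  = $1 }
    $3 ~ /^cache-references/ { refs   = $1 }
    $3 ~ /^cache-misses/     { misses = $1 }
    END {
        printf "IPC: %.2f\n", insns / cycles
        printf "Cache miss rate: %.2f%%\n", 100 * misses / refs
    }' counters.csv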

Custom Event Groups

# Define event group
perf stat -e '{cycles,instructions,cache-misses}' ./app

# Profile with detailed breakdown
perf stat -d ./app

# Output shows:
# - L1-dcache-loads
# - L1-dcache-load-misses
# - LLC-loads
# - LLC-load-misses
# - dTLB-loads
# - dTLB-load-misses

# Comprehensive cache hierarchy analysis

Optimization Workflow

1. Baseline Measurement

# Establish baseline performance
perf stat -r 10 ./app

# -r 10: run 10 times, report average
# Gives consistent baseline before optimization

# Results:
Performance counter stats for './app' (10 runs):

      2,345.67 msec task-clock       #  0.998 CPUs utilized  ( +-  0.23% )
    2,345,678,901 cycles             #  1.000 GHz            ( +-  0.18% )
    3,456,789,012 instructions       #  1.47  insn per cycle ( +-  0.12% )

      2.3512 +- 0.0054 seconds time elapsed  ( +-  0.23% )

# Save the baseline to a file for comparison after optimization
perf stat -r 10 -o baseline.txt ./app

2. Identify Hotspots

# Profile with call graphs
perf record -g -F 999 ./app

# Generate annotated report
perf report -g --stdio > hotspots.txt

# Extract top functions
grep "^#.*Overhead" -A 30 hotspots.txt > top-functions.txt

# Prioritize based on:
# 1. High overhead (> 5%)
# 2. Accessible code (can modify)
# 3. Optimization potential (algorithm vs library)

3. Optimize and Validate

# After making optimization
perf stat -r 10 -o after-optimization.txt ./app

# Compare with baseline
echo "=== Before ==="
cat baseline.txt | grep "seconds time elapsed"

echo "=== After ==="
cat after-optimization.txt | grep "seconds time elapsed"

# Calculate improvement
# Before: 2.3512 seconds
# After: 1.1234 seconds
# Improvement: 52% faster (2.09x speedup)
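
The before/after arithmetic is easy to get wrong by hand; a bc one-liner keeps it honest (the times below are the example values from this section).

# Speedup and percentage improvement from the timings above
BEFORE=2.3512
AFTER=1.1234
echo "scale=2; $BEFORE / $AFTER" | bc                      # 2.09x speedup
echo "scale=1; 100 * ($BEFORE - $AFTER) / $BEFORE" | bc    # 52.2% faster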

4. Iterative Optimization

#!/bin/bash
# optimize-iteratively.sh

APP="./app"
ITERATIONS=10

echo "Iteration,Time(s),Cycles,Instructions,IPC" > optimization-log.csv

for i in {1..10}; do
    echo "Optimization iteration $i"

    # Measure performance (perf stat writes its summary to stderr)
    STATS=$(perf stat -r "$ITERATIONS" "$APP" 2>&1 >/dev/null)
    TIME=$(echo "$STATS" | awk '/seconds time elapsed/ {print $1}')
    CYCLES=$(echo "$STATS" | awk '$2 ~ /^cycles/ {gsub(",", "", $1); print $1; exit}')
    INSNS=$(echo "$STATS" | awk '$2 ~ /^instructions/ {gsub(",", "", $1); print $1; exit}')
    IPC=$(echo "scale=2; $INSNS / $CYCLES" | bc)

    echo "$i,$TIME,$CYCLES,$INSNS,$IPC" >> optimization-log.csv

    # Identify next hotspot
    perf record -g "$APP"
    HOTSPOT=$(perf report --stdio | awk '$1 ~ /%$/ {print $NF; exit}')

    echo "Top hotspot: $HOTSPOT"
    echo "Press Enter to optimize and continue..."
    read

    # User optimizes code, then continues loop
done

echo "Optimization complete. Review optimization-log.csv for progress."

Best Practices

1. Reduce Noise

# Bad: Profile with other processes running
perf record ./app  # System under load

# Good: Minimize background activity
# Stop unnecessary services
systemctl stop unnecessary-service

# Set CPU governor to performance (run as root, or via sudo tee)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$cpu" > /dev/null
done

# With frequency scaling pinned, profile as usual
perf record ./app
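
Pinning the workload to one CPU further reduces run-to-run variance from migrations; taskset is a standard util-linux tool, and the CPU number below is arbitrary.

# Pin the profiled workload to a single CPU
taskset -c 2 perf record -g ./app

# Or pin an already-running process before attaching
taskset -cp 2 <PID>
perf record -p <PID> -g -- sleep 30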

2. Adequate Sample Count

# Bad: Short profiling session
perf record -F 99 -a -g -- sleep 1   # Too few samples

# Good: Longer profiling for statistical significance
perf record -F 99 -a -g -- sleep 30  # 30 seconds = ~3000 samples per CPU

# Better: Profile realistic workload duration
perf record -F 99 ./run-benchmark  # Profile complete benchmark

3. Symbol Resolution

# Ensure debug symbols available
# Install debug packages
apt-get install libc6-dbg

# Compile with debug symbols (frame pointers improve the default call graphs)
gcc -g -O2 -fno-omit-frame-pointer app.c -o app

# Verify symbols present
nm app | grep function_name

# Without symbols, perf shows addresses:
# 0x00007f8b4c2d1234
# With symbols:
# compute_results

4. Focus Analysis

# Don't profile everything
# Focus on specific aspects:

# CPU bottlenecks:
perf record -e cycles -g ./app

# Memory issues:
perf record -e cache-misses -g ./app

# Branch prediction:
perf record -e branch-misses -g ./app

# Targeted analysis is more actionable

Troubleshooting

Permission Issues

# Error: Permission denied
# Solution 1: Run as root
sudo perf record ./app

# Solution 2: Adjust paranoid level
sudo sysctl -w kernel.perf_event_paranoid=-1
# Or permanently:
echo "kernel.perf_event_paranoid = -1" | sudo tee -a /etc/sysctl.conf

Missing Symbols

# Functions shown as addresses
# Solution: Install debug symbols
apt-get install <package>-dbg

# For kernel symbols (on Ubuntu the package is typically linux-image-$(uname -r)-dbgsym from the ddebs repository)
apt-get install linux-image-$(uname -r)-dbg
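
If kernel functions still appear as raw addresses, the kernel may be hiding symbol addresses from unprivileged users; relaxing kptr_restrict (a standard sysctl) usually fixes that.

# Allow perf to resolve kernel symbols from /proc/kallsyms
sudo sysctl -w kernel.kptr_restrict=0

# Confirm kernel addresses are no longer zeroed out
grep " _text" /proc/kallsyms | head -1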

High Overhead

# Profiling slowing down the application?
# Reduce sampling frequency
perf record -F 99 ./app  # Lower frequency

# Skip call-graph capture (-g adds most of the overhead)
perf record ./app

# Or use hardware-assisted call graphs on CPUs that support it
perf record --call-graph lbr ./app

Conclusion

perf is the most powerful Linux performance analysis tool:

Key Capabilities:

  • CPU profiling with minimal overhead
  • Hardware counter analysis
  • Memory access pattern analysis
  • Call graph generation
  • Flame graph visualization
  • System-wide or process-specific profiling

Common Bottlenecks Identified:

  • CPU-intensive functions (70-90% of optimization targets)
  • Cache misses (2-5x speedups possible)
  • Branch mispredictions (10-30% improvements)
  • Memory access patterns (2-10x with optimization)
  • Lock contention (context switch analysis)

Typical Results (highly workload-dependent):

  • Function-level hotspots identified: 10-50x speedup potential
  • Algorithm optimizations: 2-100x improvements
  • Cache optimization: 2-5x faster
  • Overall application speedup: 1.5-10x with systematic optimization

Workflow:

  1. Baseline measurement (perf stat)
  2. Identify hotspots (perf record + perf report)
  3. Analyze call chains (flame graphs)
  4. Annotate source (perf annotate)
  5. Optimize targeted functions
  6. Validate improvements (compare metrics)
  7. Iterate until goals met

By mastering perf, you gain the ability to systematically identify and eliminate performance bottlenecks, transforming slow applications into highly optimized, efficient systems. The data-driven approach ensures optimization efforts focus on actual bottlenecks rather than guesses, maximizing return on optimization investment.