Performance Analysis with strace and ltrace

Introduction

strace and ltrace are powerful diagnostic tools for analyzing application behavior at the system level. strace traces system calls (kernel-level operations), while ltrace traces library calls (user-space function calls). These tools are invaluable for performance analysis, debugging, and understanding why applications are slow, where they spend time, and what resources they access.

When applications perform poorly without obvious cause, strace and ltrace can reveal hidden bottlenecks: excessive system calls, slow I/O operations, failed operations retried repeatedly, or inefficient library usage. They provide visibility into the black box of application execution, showing exactly what operations are performed and how long each takes.

This comprehensive guide covers both tools' usage, performance profiling techniques, bottleneck identification, and real-world optimization examples. You'll learn how to diagnose performance issues that other tools miss.

Understanding System and Library Calls

System Calls vs Library Calls

System Calls (strace):

  • Kernel-level operations
  • Process management (fork, exec, wait)
  • File operations (open, read, write, close)
  • Network operations (socket, connect, send, recv)
  • Memory management (mmap, brk, munmap)
  • Examples: read(), write(), open(), close(), socket()

Library Calls (ltrace):

  • User-space function calls
  • Standard C library (libc)
  • Dynamic library calls
  • Examples: malloc(), printf(), strlen(), strcmp()

Why Performance Issues Happen

Common patterns that cause slowdowns:

  1. Excessive system calls: Thousands of small reads instead of buffered I/O
  2. Failed operations: Repeated failed attempts (file not found, connection refused)
  3. Blocking I/O: Waiting for slow disk or network
  4. Inefficient algorithms: Excessive allocations, string operations
  5. Lock contention: Waiting for mutexes or file locks
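Pattern 1 is easy to reproduce with dd as a stand-in workload: each input "record" dd reports corresponds to one read() system call, so shrinking the block size multiplies the syscall count. A quick sketch (running strace -c on either command would show the same counts as read calls):

```shell
# Copy the same 64 KB two ways: 1-byte reads vs a single 64 KB read.
dd if=/dev/zero of=/dev/null bs=1 count=65536 2>&1 | head -1
# 65536+0 records in
dd if=/dev/zero of=/dev/null bs=65536 count=1 2>&1 | head -1
# 1+0 records in
```

Same bytes moved, four orders of magnitude difference in system calls: this is exactly the signature buffered I/O eliminates.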

strace: System Call Tracing

Installation

# Ubuntu/Debian
apt-get install strace -y

# CentOS/Rocky Linux
dnf install strace -y

# Verify installation
strace -V

Basic Usage

# Trace a command
strace ls -la

# Output shows every system call:
# execve("/bin/ls", ["ls", "-la"], 0x7ffde2a3ef40 /* 24 vars */) = 0
# brk(NULL)                               = 0x55d8a3c4e000
# access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
# openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
# fstat(3, {st_mode=S_IFREG|0644, st_size=27002, ...}) = 0
# ...hundreds more lines...

# Attach to running process
strace -p <PID>

# Example: trace web server process
strace -p $(pidof nginx)

Performance-Focused Options

# Count system calls and time (-c)
strace -c ls -la

# Output summary:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.21    0.000234          12        19           mmap
 18.42    0.000126          16         8           read
 15.79    0.000108          14         8           openat
 10.53    0.000072          24         3           write
 ...

# Show timestamps (-t, -tt, -ttt)
strace -tt ls -la  # HH:MM:SS.microseconds

# Show time spent in each call (-T)
strace -T ls -la
# read(3, "..."..., 832) = 832 <0.000045>  # 45 microseconds

# Combine timing options
strace -c -T -p <PID>

# Trace specific system calls (-e)
strace -e trace=open,read,write ls -la
strace -e trace=network nc google.com 80
strace -e trace=file ls -la
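Once a -c summary is saved to a file, a short awk pipeline can rank the top consumers. The summary below is canned sample data standing in for real strace output (and /tmp/syscalls.txt is just an illustrative path):

```shell
# Canned `strace -c` summary, standing in for one saved with -o
cat > /tmp/syscalls.txt <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.21    0.000234          12        19           mmap
 18.42    0.000126          16         8           read
 15.79    0.000108          14         8           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000468                    35           total
EOF

# Keep only data rows (skip header, separators, total); rank by % time
awk '$1 ~ /^[0-9]/ && $NF != "total" {print $1, $NF}' /tmp/syscalls.txt | sort -rn
# Prints: "34.21 mmap", "18.42 read", "15.79 openat"
```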

Real-World Performance Analysis

Example 1: Slow File Access

# Trace file operations
strace -e trace=open,openat,read,close -T php slow-script.php

# Output reveals:
openat(AT_FDCWD, "config.php", O_RDONLY) = 3 <0.000234>
read(3, "<?php\n..."..., 4096) = 4096 <0.000045>
close(3)                         = 0 <0.000012>
openat(AT_FDCWD, "config.php", O_RDONLY) = 3 <0.000187>  # Opened again!
read(3, "<?php\n..."..., 4096) = 4096 <0.000038>
close(3)                         = 0 <0.000011>
# Repeated 500 times!

# Problem: config file opened 500 times instead of once
# Solution: Cache file contents

# After fix:
openat(AT_FDCWD, "config.php", O_RDONLY) = 3 <0.000234>
read(3, "<?php\n..."..., 4096) = 4096 <0.000045>
close(3)                         = 0 <0.000012>
# Only once! 99.8% reduction in file operations
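To quantify repeated opens like this, group the openat lines of a saved trace by path. The trace below is sample data (a real one would come from something like strace -e trace=openat -o trace.txt ...):

```shell
# Sample raw trace standing in for real strace output
cat > /tmp/trace.txt <<'EOF'
openat(AT_FDCWD, "config.php", O_RDONLY) = 3 <0.000234>
openat(AT_FDCWD, "config.php", O_RDONLY) = 3 <0.000187>
openat(AT_FDCWD, "index.php", O_RDONLY) = 4 <0.000201>
openat(AT_FDCWD, "config.php", O_RDONLY) = 3 <0.000190>
EOF

# Splitting on double quotes, field 2 is the path; count opens per path
awk -F'"' '/openat/ {print $2}' /tmp/trace.txt | sort | uniq -c | sort -rn
# config.php counted 3 times, index.php once
```

Any path with a count far above 1 is a caching candidate.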

Example 2: Database Connection Issues

# Trace network calls
strace -e trace=network -T python app.py

# Output shows:
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 <0.000145>
connect(3, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("10.0.1.5")}, 16) = -1 ETIMEDOUT <3.015234>  # 3 seconds!
close(3)                                  = 0 <0.000034>
# Connection timeout repeated 20 times = 60 seconds wasted

# Problem: Database server unreachable, no timeout configured
# Solution: Set connection timeout, add connection pooling

# After fix:
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 <0.000145>
connect(3, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000234>
# Connects immediately to local database
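A companion pipeline totals the time lost to failed connects by summing the -T durations in angle brackets. The trace below is sample data matching this example:

```shell
# Sample failed-connect lines standing in for strace -T output
cat > /tmp/net-trace.txt <<'EOF'
connect(3, {sa_family=AF_INET, sin_port=htons(3306), ...}, 16) = -1 ETIMEDOUT <3.015234>
connect(3, {sa_family=AF_INET, sin_port=htons(3306), ...}, 16) = -1 ETIMEDOUT <3.012345>
connect(3, {sa_family=AF_INET, sin_port=htons(3306), ...}, 16) = -1 ETIMEDOUT <3.011234>
EOF

# Strip everything up to "<", drop ">", and sum the remaining durations
grep ETIMEDOUT /tmp/net-trace.txt \
  | sed 's/.*<//; s/>//' \
  | awk '{sum += $1} END {printf "%.2f seconds lost\n", sum}'
# 9.04 seconds lost
```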

Profiling Script

#!/bin/bash
# profile-syscalls.sh - Analyze system call patterns

PID="$1"
DURATION=30

if [ -z "$PID" ]; then
    echo "Usage: $0 <PID>"
    exit 1
fi

echo "Profiling process $PID for $DURATION seconds..."

# Collect system call statistics
strace -c -p $PID 2>&1 &
STRACE_PID=$!

sleep $DURATION

# Stop strace
kill -INT $STRACE_PID 2>/dev/null
wait $STRACE_PID 2>/dev/null

echo
echo "Analysis complete. Look for:"
echo "1. High % time calls (slow operations)"
echo "2. High call counts (excessive operations)"
echo "3. Many errors (failed operations)"

ltrace: Library Call Tracing

Installation

# Ubuntu/Debian
apt-get install ltrace -y

# CentOS/Rocky Linux
dnf install ltrace -y

# Verify installation
ltrace -V

Basic Usage

# Trace library calls
ltrace ls -la

# Output shows library function calls:
# __libc_start_main(0x55a7c79a3170, 2, 0x7ffe32fa7598, 0x55a7c79a9e50...
# strrchr("ls", '/')                     = nil
# setlocale(LC_ALL, "")                  = "en_US.UTF-8"
# bindtextdomain("coreutils", "/usr/share/locale") = "/usr/share/locale"
# ...

# Trace specific functions (older ltrace takes a comma list; newer
# versions separate names with "+", e.g. -e 'malloc+free')
ltrace -e malloc,free ls -la

# Count calls (-c)
ltrace -c ls -la

# Show timestamps (-t, -tt, -ttt)
ltrace -tt ls -la

# Show time spent (-T)
ltrace -T ls -la

Performance Analysis with ltrace

Example 1: Memory Allocation Issues

# Trace memory operations
ltrace -e malloc,realloc,free -c python script.py

# Output summary:
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 45.67    0.234567          23     10000 malloc
 32.15    0.165234         165      1000 realloc
 22.18    0.114000          11     10000 free
------ ----------- ----------- --------- --------------------
100.00    0.513801                 21000 total

# Analysis:
# - 10,000 malloc calls (excessive)
# - 1,000 realloc calls (resizing arrays repeatedly)
# Problem: Not pre-allocating memory, causing many reallocations
# Solution: Pre-allocate with expected size

# After optimization:
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 52.34    0.023456          23      1000 malloc  # 90% reduction!
 25.87    0.011592         116       100 realloc  # 90% reduction!
 21.79    0.009762          10      1000 free
------ ----------- ----------- --------- --------------------
100.00    0.044810                  2100 total

# Result: 91% reduction in memory operations, 10x faster

Example 2: String Operations

# Trace string functions
ltrace -e 'strlen,strcpy,strcat,strcmp' -T ./app

# Output reveals:
strlen("very_long_string_here...")        = 1024 <0.000012>
strlen("very_long_string_here...")        = 1024 <0.000011>
strlen("very_long_string_here...")        = 1024 <0.000012>
# strlen() called 50,000 times on same string!

# Problem: Not caching string length
# Solution: Store length in variable

# After optimization:
strlen("very_long_string_here...")        = 1024 <0.000012>
# Only once! 99.998% reduction

Combined Analysis: strace + ltrace

Comprehensive Performance Profile

#!/bin/bash
# full-profile.sh - Complete performance analysis

APP="$1"
OUTPUT_DIR="profile-$(date +%Y%m%d-%H%M%S)"

mkdir -p "$OUTPUT_DIR"

echo "Starting comprehensive performance profile..."

# 1. System call profile
echo "Profiling system calls (30 seconds)..."
timeout 30 strace -c -o "$OUTPUT_DIR/syscalls.txt" "$APP" 2>&1

# 2. Library call profile
echo "Profiling library calls (30 seconds)..."
timeout 30 ltrace -c -o "$OUTPUT_DIR/libcalls.txt" "$APP" 2>&1

# 3. Detailed system call timing
echo "Capturing detailed system call timing..."
timeout 30 strace -T -tt -o "$OUTPUT_DIR/syscalls-detailed.txt" "$APP" 2>&1

# 4. File operations analysis
echo "Analyzing file operations..."
timeout 30 strace -e trace=file -T -tt -o "$OUTPUT_DIR/file-ops.txt" "$APP" 2>&1

# 5. Network operations analysis
echo "Analyzing network operations..."
timeout 30 strace -e trace=network -T -tt -o "$OUTPUT_DIR/network-ops.txt" "$APP" 2>&1

# Generate report
cat > "$OUTPUT_DIR/report.md" << EOF
# Performance Analysis Report

Date: $(date)
Application: $APP

## Summary

### Top System Calls by Time
\`\`\`
$(head -15 "$OUTPUT_DIR/syscalls.txt")
\`\`\`

### Top Library Calls by Time
\`\`\`
$(head -15 "$OUTPUT_DIR/libcalls.txt")
\`\`\`

## Recommendations

EOF

# Analyze patterns
echo "Analyzing for common issues..."

# Check for excessive file opens
FILE_OPENS=$(grep -c "openat" "$OUTPUT_DIR/file-ops.txt")
if [ "$FILE_OPENS" -gt 1000 ]; then
    echo "- HIGH: Excessive file opens ($FILE_OPENS). Consider caching." >> "$OUTPUT_DIR/report.md"
fi

# Check for malloc/free imbalance (grep -c itself prints 0 when nothing matches)
MALLOCS=$(grep -c "malloc" "$OUTPUT_DIR/libcalls.txt")
FREES=$(grep -c "free" "$OUTPUT_DIR/libcalls.txt")
if [ "$MALLOCS" -gt $((FREES * 2)) ]; then
    echo "- WARNING: Memory allocation/free imbalance. Potential memory leak." >> "$OUTPUT_DIR/report.md"
fi

echo
echo "Profile complete. Results in $OUTPUT_DIR/"
echo "Review report.md for findings and recommendations."

Real-World Case Studies

Case Study 1: Slow Application Startup

Problem: Application takes 15 seconds to start

# Trace application startup
strace -T -tt -o startup-trace.txt ./app

# Analyze trace
grep "<[0-9]\." startup-trace.txt | sort -t'<' -k2 -n | tail -20

# Top slow operations found:
19:45:32.123456 connect(3, ...) = -1 ETIMEDOUT <5.015234>
19:45:37.138690 connect(4, ...) = -1 ETIMEDOUT <5.012345>
19:45:42.151035 connect(5, ...) = -1 ETIMEDOUT <5.011234>
# 15 seconds wasted on 3 failed connection attempts

# Root cause: Trying to connect to unavailable service on startup
# Solution: Make service connection async or fail faster

# After fix:
# Startup time: 0.5 seconds (96% improvement)

Case Study 2: High CPU Usage

Problem: Process using 80% CPU, unclear why

# Profile running process
strace -c -p <PID> &
TRACE_PID=$!
sleep 60
kill -INT $TRACE_PID

# Results show:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- --------
 89.45   12.456789          25    498234           stat
  5.23    0.728456          18     40567           read
  3.12    0.434567          22     19823           write
  2.20    0.306789          34      9012           open

# 498,234 stat() calls in 60 seconds = 8,304 per second!
# Investigate with a detailed trace (strace writes to stderr):
strace -e trace=stat -p <PID> 2>&1 | head -100

# Shows repeatedly checking if file exists:
stat("/var/www/html/cache/user_123", {st_mode=S_IFREG|0644, ...}) = 0
stat("/var/www/html/cache/user_124", {st_mode=S_IFREG|0644, ...}) = 0
stat("/var/www/html/cache/user_125", {st_mode=S_IFREG|0644, ...}) = 0
# Thousands of times per second!

# Root cause: Cache system checking file existence before every read
# Solution: Keep file descriptors open or use different caching strategy

# After fix:
# CPU usage: 15% (81% reduction)
# stat() calls: 45 per second (99.5% reduction)
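The per-second figure above falls straight out of a -c summary. A sketch using a canned two-row summary matching this case study:

```shell
# Sample summary rows from a 60-second capture
cat > /tmp/summary.txt <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- --------
 89.45   12.456789          25    498234           stat
  5.23    0.728456          18     40567           read
EOF

# Divide the calls column by the capture duration for a rate
awk -v dur=60 '$NF == "stat" {printf "%.0f stat calls/sec\n", $4 / dur}' /tmp/summary.txt
# 8304 stat calls/sec
```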

Case Study 3: Memory Leak Detection

# Trace memory operations over time (ltrace writes to stderr, so capture with -o)
ltrace -e malloc,free -o mem-trace.txt -p <PID> &
LTRACE_PID=$!
sleep 300  # 5 minutes
kill -INT $LTRACE_PID

# Count allocations and frees
MALLOCS=$(grep -c "malloc(" mem-trace.txt)
FREES=$(grep -c "free(" mem-trace.txt)

echo "Mallocs: $MALLOCS"
echo "Frees: $FREES"
echo "Leak: $((MALLOCS - FREES)) unfreed allocations"

# Result:
# Mallocs: 45,678
# Frees: 23,456
# Leak: 22,222 unfreed allocations

# Identify leak pattern by allocation size
grep -o "malloc([0-9]*)" mem-trace.txt | sort | uniq -c | sort -rn | head -10

# Shows most common allocation sizes:
#    8543 malloc(4096)   # 4KB blocks not being freed
#    6234 malloc(1024)
#    3456 malloc(512)

# Find allocation context with full trace (again via stderr)
ltrace -T -e malloc,free -p <PID> 2>&1 | grep "malloc(4096)"

# Shows allocations from specific function
# 0x00007f8b... malloc(4096) = 0x... <0.000023>  # From cache_add_entry()

# Root cause: cache_add_entry() allocating but never freeing
# Solution: Add free() in cache_remove_entry()
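Going one step beyond counting, malloc return pointers can be matched against free arguments to list the pointers (and sizes) never freed. A minimal sketch on simplified ltrace-style lines; the pointer values are made up, and real ltrace output may prefix each line with the library name:

```shell
# Simplified sample trace: two allocations are never freed
cat > /tmp/mem-trace.txt <<'EOF'
malloc(4096) = 0x1000
malloc(1024) = 0x2000
free(0x1000) = <void>
malloc(4096) = 0x3000
EOF

# Split on parens/equals/spaces: malloc lines give size=$2, pointer=$3;
# free lines give pointer=$2. Whatever remains at the end leaked.
awk -F'[()= ]+' '
  /^malloc/ {alloc[$3] = $2}
  /^free/   {delete alloc[$2]}
  END {for (p in alloc) print p, alloc[p]}
' /tmp/mem-trace.txt | sort
# 0x2000 1024
# 0x3000 4096
```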

Advanced Techniques

Filtering and Focus

# Trace only file operations taking > 10ms
strace -e trace=file -T ./app 2>&1 | awk '$NF ~ /<0\.0[1-9]/ || $NF ~ /<0\.[1-9]/ || $NF ~ /<[1-9]/'

# Trace failures only
strace -e trace=file ./app 2>&1 | grep "= -1"

# Count failures by errno (the error name follows "-1" on each line)
strace -e trace=file ./app 2>&1 | grep "= -1" \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "-1") {print $(i + 1); break}}' \
  | sort | uniq -c | sort -rn

# Example output:
#    1234 ENOENT
#     567 EACCES
#      89 EMFILE

Automated Analysis Script

#!/bin/bash
# analyze-bottlenecks.sh - Automated bottleneck detection

PID="$1"
DURATION=60

if [ -z "$PID" ]; then
    echo "Usage: $0 <PID>"
    exit 1
fi

TMPFILE=$(mktemp)

echo "Analyzing process $PID for $DURATION seconds..."

# Collect trace (-c already aggregates timing, so -T adds nothing here)
timeout "$DURATION" strace -c -p "$PID" > "$TMPFILE" 2>&1

# Parse results
echo
echo "=== PERFORMANCE ANALYSIS ==="
echo

# Check for excessive system calls
TOTAL_CALLS=$(grep "total" "$TMPFILE" | awk '{print $4}')
CALLS_PER_SEC=$(( ${TOTAL_CALLS:-0} / DURATION ))

if [ "$CALLS_PER_SEC" -gt 10000 ]; then
    echo "WARNING: High system call rate ($CALLS_PER_SEC per second)"
    echo "Consider batching operations or using buffered I/O"
    echo
fi

# Check for slow operations
echo "Slowest Operations (% time):"
head -15 "$TMPFILE" | tail -10 | sort -k1 -rn | head -5
echo

# Check for errors (sum the errors column on rows that have one)
ERROR_COUNT=$(awk 'NF == 6 && $6 != "total" {sum += $5} END {print sum + 0}' "$TMPFILE")
if [ "$ERROR_COUNT" -gt 0 ]; then
    echo "WARNING: $ERROR_COUNT failed system calls detected"
    echo "Review errors for potential issues"
    echo
fi

# Recommendations
echo "=== RECOMMENDATIONS ==="
echo

# Check for file I/O issues
FILE_OPS=$(grep -E "read|write|open|close|stat" "$TMPFILE" | awk '{sum += $4} END {print sum + 0}')
if [ "$FILE_OPS" -gt $((TOTAL_CALLS / 2)) ]; then
    echo "- File I/O is dominant ($(echo "$FILE_OPS * 100 / $TOTAL_CALLS" | bc)% of calls)"
    echo "  Consider: caching, buffering, or async I/O"
fi

rm "$TMPFILE"

Best Practices

1. Use Appropriate Scope

# Bad: Tracing entire application (too much data)
strace ./long-running-app

# Good: Trace specific time window
timeout 30 strace -c ./long-running-app

# Better: Attach to already-running process
strace -c -p <PID> &
sleep 60
killall -INT strace

2. Focus Analysis

# Don't trace everything
# Focus on likely problem areas:

# File I/O issues:
strace -c -e trace=file ./app

# Network issues:
strace -c -e trace=network ./app

# Memory issues:
ltrace -c -e malloc,free,realloc ./app

3. Minimize Performance Impact

# strace/ltrace ADD overhead
# Don't use in production unless necessary

# Measure impact:
# Without tracing:
time ./benchmark
# real: 2.345s

# With strace:
time strace -o /dev/null ./benchmark
# real: 4.567s (95% overhead)

# With strace -c (summary only):
time strace -c -o /dev/null ./benchmark
# real: 2.678s (14% overhead)

# Use -c flag for minimal impact
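The overhead percentages quoted above are computed as (traced - untraced) / untraced * 100. A quick check with the sample timings from this section (not measurements):

```shell
awk 'BEGIN {
  untraced = 2.345; traced = 4.567; summary = 2.678
  printf "full trace overhead: %.0f%%\n", (traced - untraced) / untraced * 100
  printf "-c summary overhead: %.0f%%\n", (summary - untraced) / untraced * 100
}'
# full trace overhead: 95%
# -c summary overhead: 14%
```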

4. Combine with Other Tools

# Use strace/ltrace WITH other tools

# CPU profiling:
perf record ./app &
strace -c -p $(pidof app)

# Memory profiling:
valgrind --tool=massif ./app &
ltrace -e malloc,free -c -p $(pidof app)

# I/O profiling:
iotop -p $(pidof app) &
strace -e trace=file -T -p $(pidof app)

Troubleshooting Common Issues

Permission Denied

# Error: Operation not permitted
# Solution: Run with sudo or as root
sudo strace -p <PID>

Process Stops When Traced

# strace pauses the process via ptrace at every system call
# If strace dies abruptly, the traced process can be left stopped
# Use Ctrl+C so strace detaches cleanly and the process resumes

Too Much Output

# Redirect to file
strace ./app 2> trace.log

# Or use summary mode
strace -c ./app

# Filter output
strace ./app 2>&1 | grep -E "read|write"

Conclusion

strace and ltrace are indispensable tools for performance analysis:

Key Benefits:

  • Reveal hidden bottlenecks invisible to other tools
  • Identify excessive system/library calls
  • Detect failed operations
  • Understand application behavior
  • Validate optimization effectiveness

Common Issues Detected:

  • Excessive file opens (cache misses)
  • Repeated failed connection attempts
  • Memory allocation storms
  • Inefficient string operations
  • Missing buffering
  • Lock contention

Usage Guidelines:

  • Use for targeted analysis, not continuous monitoring
  • Focus on specific call types (-e flag)
  • Use -c for summary (less overhead)
  • Combine with other profiling tools
  • Test in development/staging first

Performance Improvements:

  • Typical findings: 50-95% reduction in system calls
  • Response time improvements: 2-10x faster
  • Resource usage: 30-80% less CPU/I/O

By mastering strace and ltrace, you gain deep visibility into application behavior, enabling identification and resolution of performance issues that are invisible to conventional monitoring tools.