Log Analysis with awk, grep, and sed

Introduction

Log analysis is a fundamental skill for system administrators, DevOps engineers, and security professionals. While graphical log analysis tools offer powerful features, command-line utilities like awk, grep, and sed provide unmatched speed, flexibility, and availability for real-time log analysis, troubleshooting, and pattern extraction directly on production servers.

These three utilities form the cornerstone of Unix text processing: grep excels at searching and filtering, sed specializes in stream editing and text transformation, and awk provides powerful data extraction and reporting capabilities. Together, they enable you to parse gigabytes of log data in seconds, extract actionable insights, identify patterns, and automate log processing tasks without installing additional software.
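
As a quick illustration of that division of labor, the one-liner below (a sketch assuming an Nginx access log in the default combined format at /var/log/nginx/access.log) uses grep to isolate server errors, awk to pull out the request path, and sed to label each result:

# grep filters, awk extracts a field, sed reshapes the output
grep " 500 " /var/log/nginx/access.log | awk '{print $7}' | sed 's/^/Failing URL: /'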

This comprehensive guide teaches you how to master log analysis using awk, grep, and sed, from basic filtering to advanced pattern matching, data extraction, statistical analysis, and automated reporting. You'll learn practical techniques for analyzing web server logs, system logs, application logs, and security logs, enabling rapid troubleshooting and deep operational insights.

Prerequisites

Before diving into log analysis with these tools, ensure you have:

  • A Linux server or workstation (any distribution)
  • Access to log files (typically in /var/log/)
  • Root or sudo access for protected log files
  • Basic understanding of regular expressions
  • Familiarity with common log formats

Required Tools: All three utilities are pre-installed on virtually every Linux distribution:

  • grep (GNU grep recommended)
  • sed (GNU sed)
  • awk (GNU awk/gawk)

Verify Installation:

grep --version
sed --version
awk --version

Understanding Log Formats

Common Log Formats

Syslog Format:

Jan 11 10:30:45 server1 sshd[1234]: Accepted password for user from 192.168.1.100 port 12345 ssh2

Apache Combined Format:

192.168.1.100 - - [11/Jan/2024:10:30:45 +0000] "GET /index.html HTTP/1.1" 200 1234 "https://example.com" "Mozilla/5.0"

Nginx Access Log:

192.168.1.100 - user [11/Jan/2024:10:30:45 +0000] "GET /api/v1/users HTTP/1.1" 200 567 "-" "curl/7.68.0"

JSON Application Log:

{"timestamp":"2024-01-11T10:30:45Z","level":"ERROR","message":"Database connection failed","user_id":123}

Mastering grep for Log Analysis

Basic grep Usage

Search for specific term:

# Find all error messages
grep "error" /var/log/syslog

# Case-insensitive search
grep -i "error" /var/log/syslog

# Search multiple files
grep "failed" /var/log/*.log

# Recursive search
grep -r "connection refused" /var/log/

Display context:

# Show 3 lines before match
grep -B 3 "error" /var/log/syslog

# Show 3 lines after match
grep -A 3 "error" /var/log/syslog

# Show 3 lines before and after
grep -C 3 "error" /var/log/syslog

# Show line numbers
grep -n "error" /var/log/syslog

Count and statistics:

# Count matching lines
grep -c "error" /var/log/syslog

# Show only matching part
grep -o "ERROR.*" /var/log/app.log

# List files containing match
grep -l "error" /var/log/*.log

# List files NOT containing match
grep -L "error" /var/log/*.log

Advanced grep Patterns

Regular expression patterns:

# Match IP addresses
grep -E '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' /var/log/syslog

# Match email addresses
grep -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' /var/log/mail.log

# Match dates (YYYY-MM-DD format)
grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' /var/log/app.log

# Match times (HH:MM:SS)
grep -E '[0-9]{2}:[0-9]{2}:[0-9]{2}' /var/log/syslog

Multiple patterns:

# Match either pattern (OR)
grep -E "error|warning|critical" /var/log/syslog

# Match multiple patterns (AND)
grep "error" /var/log/syslog | grep "database"

# Exclude pattern
grep "error" /var/log/syslog | grep -v "debug"

# Match word boundaries
grep -w "error" /var/log/syslog  # Won't match "errors"

Practical grep Examples

Find failed SSH login attempts:

grep "Failed password" /var/log/auth.log

# With IP addresses
grep "Failed password" /var/log/auth.log | grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' | sort | uniq -c | sort -rn

# Count by user
grep "Failed password" /var/log/auth.log | awk '{print $(NF-5)}' | sort | uniq -c | sort -rn

Analyze HTTP error codes:

# Find 404 errors
grep " 404 " /var/log/nginx/access.log

# Find 5xx errors
grep -E " 5[0-9]{2} " /var/log/nginx/access.log

# Count errors by type
grep -oE " [4-5][0-9]{2} " /var/log/nginx/access.log | sort | uniq -c | sort -rn

Search application errors with context:

# Find errors with stack traces
grep -A 20 "Exception" /var/log/app/error.log

# Find database errors
grep -B 5 -A 10 "SQLException" /var/log/app/app.log

Time-based filtering:

# Find logs from specific hour
grep "Jan 11 14:" /var/log/syslog

# Find logs from specific date
grep "2024-01-11" /var/log/app.log

# Find logs from today (%e space-pads single-digit days to match syslog's timestamp)
grep "$(date '+%b %e')" /var/log/syslog

Mastering sed for Log Processing

Basic sed Usage

Print specific lines:

# Print line 10
sed -n '10p' /var/log/syslog

# Print lines 10-20
sed -n '10,20p' /var/log/syslog

# Print every 10th line
sed -n '0~10p' /var/log/syslog

# Print last line
sed -n '$p' /var/log/syslog

Delete lines:

# Delete blank lines
sed '/^$/d' /var/log/syslog

# Delete lines containing pattern
sed '/debug/d' /var/log/syslog

# Delete lines 1-10
sed '1,10d' /var/log/syslog

# Delete last line
sed '$d' /var/log/syslog

Substitute text:

# Replace first occurrence
sed 's/error/ERROR/' /var/log/app.log

# Replace all occurrences (global)
sed 's/error/ERROR/g' /var/log/app.log

# Case-insensitive replacement
sed 's/error/ERROR/gi' /var/log/app.log

# Replace only on lines matching pattern
sed '/WARNING/s/error/ERROR/g' /var/log/app.log

Advanced sed Patterns

Extract specific fields:

# Extract IP addresses
sed -n 's/.*\([0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\).*/\1/p' /var/log/nginx/access.log

# Extract timestamps
sed -n 's/.*\[\([^]]*\)\].*/\1/p' /var/log/nginx/access.log

# Remove timestamps (first 3 fields)
sed 's/^[^ ]* [^ ]* [^ ]* //' /var/log/syslog

Multi-line operations:

# Join lines ending with backslash
sed -e :a -e '/\\$/N; s/\\\n//; ta' /var/log/app.log

# Add line after pattern
sed '/ERROR/a\--- Error detected ---' /var/log/app.log

# Insert line before pattern
sed '/ERROR/i\--- Warning: Error follows ---' /var/log/app.log

Conditional processing:

# Process only lines between patterns
sed -n '/START/,/END/p' /var/log/app.log

# Delete everything after first ERROR
sed '/ERROR/,$d' /var/log/app.log

# Replace only in specific line range
sed '10,20s/old/new/g' /var/log/app.log

Practical sed Examples

Clean and format logs:

# Remove ANSI color codes
sed 's/\x1b\[[0-9;]*m//g' /var/log/app.log

# Remove carriage returns
sed 's/\r$//' /var/log/app.log

# Normalize whitespace
sed 's/[[:space:]]\+/ /g' /var/log/app.log

# Add prefix to each line
sed 's/^/[APP] /' /var/log/app.log

Extract and transform data:

# Convert Apache log to CSV
sed 's/\([^ ]*\) - - \[\([^]]*\)\] "\([^"]*\)" \([0-9]*\) \([0-9]*\).*/\1,\2,\3,\4,\5/' /var/log/apache2/access.log

# Extract just URLs from access log
sed -n 's/.*"\w* \([^ ]*\) HTTP.*/\1/p' /var/log/nginx/access.log

# Extract error messages only
sed -n 's/.*ERROR - \(.*\)$/\1/p' /var/log/app.log

Filter by time range:

# Extract logs from 10:00 to 11:00 (both endpoint timestamps must actually appear in the log)
sed -n '/Jan 11 10:00/,/Jan 11 11:00/p' /var/log/syslog

# Extract logs from specific date
sed -n '/2024-01-11/,/2024-01-12/p' /var/log/app.log

Mastering awk for Log Analysis

Basic awk Usage

Print specific columns:

# Print first column
awk '{print $1}' /var/log/syslog

# Print first and fifth columns
awk '{print $1, $5}' /var/log/syslog

# Print all columns except first
awk '{$1=""; print $0}' /var/log/syslog

# Print last column
awk '{print $NF}' /var/log/syslog

# Print second to last column
awk '{print $(NF-1)}' /var/log/syslog

Pattern matching:

# Print lines matching pattern
awk '/error/ {print}' /var/log/syslog

# Print lines NOT matching pattern
awk '!/debug/ {print}' /var/log/syslog

# Print if column matches
awk '$5 == "error" {print}' /var/log/syslog

# Print if column contains
awk '$5 ~ /error/ {print}' /var/log/syslog

Field separators:

# Custom field separator (colon)
awk -F':' '{print $1, $2}' /etc/passwd

# Multiple field separators
awk -F'[: ]' '{print $1}' /var/log/syslog

# Change output separator
awk -F':' 'BEGIN{OFS=","} {print $1, $2}' /etc/passwd

Advanced awk Operations

Arithmetic and statistics:

# Count lines
awk 'END {print NR}' /var/log/syslog

# Sum column values
awk '{sum+=$1} END {print sum}' numbers.log

# Calculate average
awk '{sum+=$1; count++} END {print sum/count}' numbers.log

# Find minimum and maximum
awk 'NR==1{max=$1; min=$1} $1>max{max=$1} $1<min{min=$1} END {print "Min:", min, "Max:", max}' numbers.log

Conditional processing:

# If-else statements
awk '{if ($1 > 100) print "HIGH:", $0; else print "LOW:", $0}' numbers.log

# Multiple conditions
awk '{if ($1 > 100 && $2 == "error") print $0}' /var/log/app.log

# Ternary operator
awk '{print ($1 > 100 ? "HIGH" : "LOW")}' numbers.log

Arrays and aggregation:

# Count occurrences
awk '{count[$1]++} END {for (ip in count) print ip, count[ip]}' /var/log/nginx/access.log

# Sum by key
awk '{sum[$1]+=$2} END {for (key in sum) print key, sum[key]}' data.log

# Track unique values
awk '{if (!seen[$1]++) print $1}' /var/log/access.log

BEGIN and END blocks:

# Print header and footer
awk 'BEGIN {print "=== Log Analysis ==="} {print $0} END {print "=== Total Lines:", NR, "==="}' /var/log/syslog

# Initialize variables
awk 'BEGIN {count=0} /error/ {count++} END {print "Errors:", count}' /var/log/syslog

Practical awk Examples

Apache/Nginx log analysis:

# Count requests by IP
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

# Count by HTTP status code
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Calculate total bandwidth
awk '{sum+=$10} END {print "Total MB:", sum/1024/1024}' /var/log/nginx/access.log

# Average response time (assumes $request_time is appended as the last field in a custom log_format)
awk '{sum+=$NF; count++} END {print "Avg:", sum/count}' /var/log/nginx/access.log

# Requests per hour
awk '{print substr($4,14,2)}' /var/log/nginx/access.log | sort | uniq -c

# Top requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# 404 errors with URLs
awk '$9 == 404 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Response time percentiles (same assumption: $request_time as the last field)
awk '{print $NF}' /var/log/nginx/access.log | sort -n | awk 'BEGIN{c=0} {a[c++]=$1} END {print "50th:", a[int(c*0.5)], "90th:", a[int(c*0.9)], "99th:", a[int(c*0.99)]}'

Syslog analysis:

# Count messages by hour
awk '{print $3}' /var/log/syslog | cut -d: -f1 | sort | uniq -c

# Count by program/service
awk '{print $5}' /var/log/syslog | sed 's/\[.*\]://' | sort | uniq -c | sort -rn

# Failed services
awk '/failed|error/ {print $5}' /var/log/syslog | sort | uniq -c | sort -rn

# Extract just error messages
awk '/error|ERROR/ {for(i=6;i<=NF;i++) printf "%s ", $i; print ""}' /var/log/syslog

Authentication log analysis:

# Failed login attempts by IP
awk '/Failed password/ {print $(NF-3)}' /var/log/auth.log | sort | uniq -c | sort -rn

# Failed login attempts by user
awk '/Failed password/ {print $(NF-5)}' /var/log/auth.log | sort | uniq -c | sort -rn

# Successful logins
awk '/Accepted password/ {print $(NF-3), $(NF-5)}' /var/log/auth.log

# Count auth events by hour
awk '{print $3}' /var/log/auth.log | cut -d: -f1 | sort | uniq -c

# Sudo command usage (COMMAND= is fused to the command path, so match by prefix)
awk '/sudo.*COMMAND/ {for(i=1;i<=NF;i++) if($i ~ /^COMMAND=/) {printf "%s", substr($i,9); for(j=i+1;j<=NF;j++) printf " %s", $j; print ""}}' /var/log/auth.log

Application log analysis (JSON):

# Extract the timestamp value (assumes it is the first key, as in the sample format; jq is more robust for JSON)
awk -F'"' '/timestamp/ {print $4}' /var/log/app.json

# Count by log level (loops over fields so key order doesn't matter)
awk -F'"' '{for(i=1;i<=NF;i++) if($i=="level") print $(i+2)}' /var/log/app.json | sort | uniq -c

# Errors with message
awk -F'"' '/"level":"ERROR"/ {for(i=1;i<=NF;i++) if($i=="message") print $(i+2)}' /var/log/app.json

Combining grep, sed, and awk

Powerful Pipeline Examples

Complete access log analysis:

# Top IPs accessing specific URL
grep "/api/login" /var/log/nginx/access.log | \
    awk '{print $1}' | \
    sort | uniq -c | sort -rn | head -10

# Extract and analyze 404 errors
grep " 404 " /var/log/nginx/access.log | \
    awk '{print $7}' | \
    sort | uniq -c | sort -rn | \
    sed 's/^ *//' | \
    awk '{print $2, ":", $1, "times"}'

# Analyze slow requests (> 1s; assumes $request_time is logged as the last field)
awk '$NF > 1.0 {print $0}' /var/log/nginx/access.log | \
    sed 's/.*"\([A-Z]*\) \([^ ]*\) .*/\1 \2/' | \
    sort | uniq -c | sort -rn

Security analysis:

# Identify brute force attempts
grep "Failed password" /var/log/auth.log | \
    awk '{print $(NF-3)}' | \
    sort | uniq -c | \
    awk '$1 > 10 {print "WARNING:", $2, "attempted", $1, "times"}' | \
    sed 's/^/[SECURITY] /'

# Analyze unauthorized access attempts
grep -E "unauthorized|forbidden|denied" /var/log/syslog | \
    awk '{print $5}' | \
    sed 's/\[.*\]://' | \
    sort | uniq -c | sort -rn

# Extract suspicious commands
grep "sudo" /var/log/auth.log | \
    awk '/COMMAND=/ {for(i=1;i<=NF;i++) if($i ~ /^COMMAND=/) print substr($i,9)}' | \
    grep -vE "^/usr/bin/(ls|cat|less|grep)" | \
    sort | uniq -c

Performance analysis:

# Database query performance
grep "Query took" /var/log/app/app.log | \
    sed 's/.*Query took \([0-9.]*\)ms.*/\1/' | \
    awk '{sum+=$1; count++; if($1>max) max=$1} END {print "Avg:", sum/count, "ms, Max:", max, "ms, Total queries:", count}'

# Error rate over time (assumes an ISO timestamp like 2024-01-11T10:30:45Z in the first field)
grep "ERROR" /var/log/app/app.log | \
    awk '{print substr($1,12,5)}' | \
    uniq -c | \
    awk '{print $2, $1}' | \
    sed 's/^/Time: /' | \
    sed 's/ \([0-9]*\)$/ - Errors: \1/'

Automated Log Analysis Scripts

Comprehensive Analysis Script

#!/bin/bash
# log-analyzer.sh - Automated log analysis

LOG_FILE="/var/log/nginx/access.log"
REPORT_FILE="/tmp/log-analysis-$(date +%Y%m%d-%H%M).txt"

{
    echo "========================================="
    echo "Log Analysis Report"
    echo "Date: $(date)"
    echo "Log File: $LOG_FILE"
    echo "========================================="
    echo ""

    echo "--- Total Requests ---"
    wc -l < "$LOG_FILE"
    echo ""

    echo "--- Top 10 IP Addresses ---"
    awk '{print $1}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10
    echo ""

    echo "--- HTTP Status Code Distribution ---"
    awk '{print $9}' "$LOG_FILE" | sort | uniq -c | sort -rn
    echo ""

    echo "--- Top 20 Requested URLs ---"
    awk '{print $7}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -20
    echo ""

    echo "--- 404 Errors ---"
    grep " 404 " "$LOG_FILE" | awk '{print $7}' | sort | uniq -c | sort -rn | head -10
    echo ""

    echo "--- 5xx Errors ---"
    grep -E " 5[0-9]{2} " "$LOG_FILE" | wc -l
    if [ $(grep -cE " 5[0-9]{2} " "$LOG_FILE") -gt 0 ]; then
        grep -E " 5[0-9]{2} " "$LOG_FILE" | awk '{print $7}' | sort | uniq -c | sort -rn | head -10
    fi
    echo ""

    echo "--- Requests per Hour ---"
    awk '{print substr($4,14,2)}' "$LOG_FILE" | sort | uniq -c
    echo ""

    echo "--- User Agents (Top 10) ---"
    awk -F'"' '{print $6}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10
    echo ""

    echo "========================================="
    echo "Report generated: $(date)"
    echo "========================================="

} > "$REPORT_FILE"

echo "Analysis complete. Report saved to: $REPORT_FILE"
cat "$REPORT_FILE"

Real-Time Log Monitoring

#!/bin/bash
# realtime-monitor.sh - Real-time log monitoring with analysis

LOG_FILE="/var/log/syslog"

echo "Monitoring $LOG_FILE for errors..."
echo "Press Ctrl+C to stop"
echo ""

tail -f "$LOG_FILE" | while read line; do
    # Check for errors
    if echo "$line" | grep -qi "error"; then
        echo "[ERROR] $line" | sed 's/error/\x1b[31mERROR\x1b[0m/i'
    fi

    # Check for warnings
    if echo "$line" | grep -qi "warning"; then
        echo "[WARN] $line" | sed 's/warning/\x1b[33mWARNING\x1b[0m/i'
    fi

    # Check for failed authentication
    if echo "$line" | grep -q "Failed password"; then
        IP=$(echo "$line" | awk '{print $(NF-3)}')
        echo "[SECURITY] Failed login from $IP" | sed 's/SECURITY/\x1b[35mSECURITY\x1b[0m/'
    fi
done

Security Audit Script

#!/bin/bash
# security-audit.sh - Automated security log analysis

AUTH_LOG="/var/log/auth.log"
REPORT="/tmp/security-audit-$(date +%Y%m%d).txt"

{
    echo "Security Audit Report"
    echo "====================="
    echo "Date: $(date)"
    echo ""

    echo "--- Failed Login Attempts by IP ---"
    grep "Failed password" "$AUTH_LOG" | \
        awk '{print $(NF-3)}' | \
        sort | uniq -c | sort -rn | \
        awk '$1 > 5 {print "WARNING:", $2, "failed", $1, "times"}'
    echo ""

    echo "--- Failed Login Attempts by User ---"
    grep "Failed password" "$AUTH_LOG" | \
        awk '{print $(NF-5)}' | \
        sort | uniq -c | sort -rn
    echo ""

    echo "--- Successful Root Logins ---"
    grep "Accepted.*root" "$AUTH_LOG" | wc -l
    if [ $(grep -c "Accepted.*root" "$AUTH_LOG") -gt 0 ]; then
        grep "Accepted.*root" "$AUTH_LOG"
    fi
    echo ""

    echo "--- Sudo Commands ---"
    grep "sudo.*COMMAND" "$AUTH_LOG" | \
        awk '{for(i=1;i<=NF;i++) if($i=="USER=") print $(i+1)}' | \
        sort | uniq -c
    echo ""

    echo "--- New User Additions ---"
    grep "useradd" "$AUTH_LOG"
    echo ""

} > "$REPORT"

echo "Security audit complete. Report: $REPORT"
cat "$REPORT"

Best Practices

Performance Optimization

For large files:

# Use grep first to filter, then process
grep "ERROR" huge.log | awk '{print $5}'

# Process compressed files without decompressing
zgrep "pattern" file.log.gz
zcat file.log.gz | awk '{print $1}'

# Use mawk for better performance with large datasets
mawk '{print $1}' huge.log

# Limit processing with head
grep "ERROR" huge.log | head -1000 | awk '{print $5}'

Memory-efficient processing:

# awk streams line by line; print a progress marker every 1000 lines instead of buffering output
awk 'NR % 1000 == 0 {print "Processed", NR, "lines"}' huge.log

# Use stream processing
tail -f /var/log/app.log | grep --line-buffered "ERROR" | awk '{print $0}'

Regular Expression Tips

Common patterns:

# IP address: \b([0-9]{1,3}\.){3}[0-9]{1,3}\b
# Email: [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}
# URL: https?://[^[:space:]]+
# UUID: [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
# Date (YYYY-MM-DD): [0-9]{4}-[0-9]{2}-[0-9]{2}
# Time (HH:MM:SS): [0-9]{2}:[0-9]{2}:[0-9]{2}
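
As a quick sketch of putting these patterns to work (file paths are illustrative), the commands below pull URLs and UUIDs out of an application log and rank the most frequent ones:

# Extract all URLs referenced in a log
grep -oE 'https?://[^[:space:]]+' /var/log/app.log | sort | uniq -c | sort -rn

# Extract request/correlation UUIDs and count the busiest ones
grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' /var/log/app.log | sort | uniq -c | sort -rn | head -10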

Conclusion

Mastering grep, sed, and awk gives you powerful, flexible, and efficient tools for extracting insights from system and application logs. These utilities are fast, universally available, and capable of processing gigabytes of log data with minimal resource overhead.

Key takeaways:

  1. grep - Fast pattern searching and filtering
  2. sed - Stream editing and text transformation
  3. awk - Data extraction, analysis, and reporting
  4. Pipelines - Combine all three for powerful analysis workflows
  5. Automation - Script common analysis tasks for regular execution

Best practices:

  • Start with grep to filter large datasets
  • Use awk for structured data extraction and statistics
  • Apply sed for text transformation and cleanup
  • Combine tools in pipelines for complex analysis
  • Test patterns on small data samples first
  • Document complex one-liners for future reference
  • Consider performance with large log files
  • Automate routine analysis with scripts

While modern log analysis platforms offer advanced features, command-line tools remain indispensable for quick troubleshooting, ad-hoc analysis, and situations where installing additional software isn't feasible. These fundamental skills translate across all Unix-like systems and will serve you throughout your career in system administration and DevOps.