
Linux Performance Tuning Reference: perf, Flame Graphs, eBPF, iostat & Benchmarking

Linux performance investigation tools — from the first top command to CPU flame graphs and eBPF tracing. The Brendan Gregg-style systematic approach to finding bottlenecks.

1. The 60-Second Investigation

Run these 10 commands first — always
# Brendan Gregg's 60-second analysis checklist; run in this order:
uptime                                    # 1. load average trend (1/5/15 min)
dmesg | tail -20                          # 2. kernel errors (OOM, disk errors, network drops)
vmstat 1 5                                # 3. runnable, memory, swap, I/O, CPU
mpstat -P ALL 1 3                         # 4. per-CPU stats (imbalance? one core 100%?)
pidstat 1 3                               # 5. which processes consume CPU
iostat -xz 1 3                            # 6. disk utilisation + await times
free -m                                   # 7. free vs available memory
sar -n DEV 1 3                            # 8. network interface throughput
sar -n TCP,ETCP 1 3                       # 9. TCP retransmits (network quality)
top                                       # 10. overall picture, press 1 for per-CPU

# The USE method for every resource:
# U = Utilisation (% busy)
# S = Saturation (queue depth / wait time)
# E = Errors (reported in dmesg / /proc/net/dev)
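As a minimal illustration, U and S for the CPU resource can be read straight out of /proc; this is a hedged sketch (the "load exceeds CPU count" saturation test is the usual rule of thumb, not a hard limit):

```shell
# Sketch: USE method applied to the CPU using only /proc
cpus=$(nproc)                                      # U denominator: logical CPUs
load1=$(cut -d' ' -f1 /proc/loadavg)               # 1-minute load average
runq=$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f1)  # currently runnable tasks
util=$(awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "%.0f", l * 100 / c }')
echo "CPUs: $cpus  load1: $load1 (~${util}% of capacity)  runnable now: $runq"
if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l > c) }'; then
  echo "saturation: load exceeds CPU count, tasks are queuing"
fi
```

Errors (the E) still come from dmesg and the per-interface counters, as in the checklist above.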

2. CPU Analysis

CPU saturation, context switches, load vs utilisation
# Load average vs CPU count (note: Linux load counts runnable tasks AND tasks
# in uninterruptible I/O wait, so heavy disk I/O can raise load with idle CPUs):
# Load 4.0 on 4 cores = 100% utilised (one task per core)
# Load 8.0 on 4 cores = 200%: tasks are queuing
nproc                          # how many logical CPUs
cat /proc/loadavg              # current load averages

# Per-core utilisation:
mpstat -P ALL 1 5              # %usr %sys %iowait %irq %soft %idle per core
# iowait > 20% = I/O-bound, not CPU-bound
# sys > 20% = kernel overhead (syscall rate, interrupts, scheduling)

# What's actually running:
pidstat -u 1 5                 # CPU per process
pidstat -t -p <pid> 1 5        # per-thread breakdown
ps aux --sort=-%cpu | head -20

# Context switches (high = lock contention or too many threads):
vmstat 1 5                     # cs column = context switches/sec
pidstat -w 1 5                 # context switches per process
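The cs column vmstat prints is derived from a single kernel counter; a sketch that reads it directly:

```shell
# Sketch: system-wide context switches/sec, computed from the same
# /proc/stat "ctxt" counter that vmstat's cs column is based on
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((c2 - c1))"
```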

# CPU steal (cloud VMs):
vmstat 1 5                     # st column — if > 0, host is overcommitted
top                            # %st in CPU line

# CPU frequency/throttling:
cpupower frequency-info        # current freq, governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

3. Memory Analysis

The difference between free and available, OOM, huge pages
# Available (not free) is what matters:
free -m
# total   used   free   shared  buff/cache  available
# 15900   8200   1200    400     6500        7100
# "free" = 1.2GB, but "available" = 7.1GB (cache is reclaimable)
# PANIC if available is near zero, not if free is low
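The "alert on available, not free" rule is easy to script; a hedged sketch (the 5% threshold is an illustrative assumption, tune to your workload):

```shell
# Sketch: warn when MemAvailable (what "free" reports as "available")
# drops below 5% of MemTotal; the threshold is illustrative
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pct=$((avail_kb * 100 / total_kb))
echo "available: ${avail_kb} kB (${pct}% of MemTotal)"
if [ "$pct" -lt 5 ]; then
  echo "available near zero: expect reclaim stalls, then the OOM killer"
fi
```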

# Per-process memory:
ps aux --sort=-%mem | head -20
# VSZ = virtual, RSS = resident (what's actually in RAM)

# What's using memory in detail:
cat /proc/<pid>/status | grep -E 'Vm|Rss'
pmap -x <pid> | sort -n -k3 | tail -20    # per-mapping RSS

# Slab cache (kernel memory — can be huge on busy servers):
slabtop                        # interactive sorted by total memory
cat /proc/slabinfo | sort -k3 -rn | head -20

# Is swap being used? (if yes, performance will suffer):
vmstat 1 5                     # si/so columns = swap in/out per second
sar -W 1 5                     # swap stats

# OOM killer events:
dmesg | grep -i 'oom\|killed process'
journalctl -k | grep -i 'oom\|killed'

# Transparent huge pages (often a latency source):
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] → can cause latency spikes from defragmentation
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled  # safer
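A sketch that reports the active THP mode (the bracketed word in the sysfs file) and warns on [always]; the file is absent on kernels built without THP:

```shell
# Sketch: print current THP mode and warn if set to [always]
thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
  mode=$(grep -o '\[[a-z]*\]' "$thp" | tr -d '[]')
  echo "THP mode: $mode"
  if [ "$mode" = "always" ]; then
    echo "warning: [always] can cause latency spikes during defragmentation"
  fi
else
  echo "THP not supported on this kernel"
fi
```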

4. Disk I/O Analysis

iostat, iotop, and finding which process is causing the I/O
# iostat — the primary disk I/O tool:
iostat -xz 1 5
# Key columns:
# %util  — how busy the disk is (100% = saturated)
# await  — average I/O wait time (ms) — goal: <10ms for SSD, <20ms for HDD
# r/s w/s — reads/writes per second
# rkB/s wkB/s — throughput in KB/s
# aqu-sz: average queue depth ("avgqu-sz" in older sysstat; sustained growth = disk can't keep up)

# Which process is doing I/O:
iotop -oa                      # -o: only show active, -a: accumulated
pidstat -d 1 5                 # per-process I/O stats
ls -la /proc/<pid>/fd          # open file descriptors

# File system latency (real latency seen by apps):
# (requires BCC tools)
ext4slower 10                  # show ext4 ops > 10ms
biolatency                     # histogram of block I/O latency

# Dirty page flush storms:
cat /proc/meminfo | grep Dirty
# Dirty: large value = unflushed writes
# cat /proc/sys/vm/dirty_ratio       — max dirty pages % before sync (default 20%)
# cat /proc/sys/vm/dirty_background_ratio — background flush threshold (default 10%)
# Tune for write-heavy servers: dirty_ratio=40, dirty_background_ratio=10
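The tuning suggestion above as a persistent sysctl fragment; the filename and values are illustrative, so validate them against your own write workload before deploying:

```shell
# Illustrative /etc/sysctl.d/90-writeback.conf for a write-heavy server
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
# Apply without a reboot:
#   sysctl --system
```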

# inode exhaustion (disk full but df shows space):
df -i                          # shows inode usage per filesystem
# 100% inode = new files can't be created despite free disk space
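A sketch that flags filesystems near inode exhaustion (the 90% threshold is an illustrative assumption):

```shell
# Sketch: flag filesystems at or above 90% inode usage (POSIX output format
# so the columns are stable across platforms)
df -iP | awk 'NR > 1 && $5 + 0 >= 90 { bad++; print $6 " at " $5 " inode usage" }
              END { if (!bad) print "no filesystem above 90% inode usage" }'
```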

5. Network Performance

TCP retransmits, socket backlog, network saturation
# Interface throughput:
sar -n DEV 1 5                 # rxkB/s txkB/s per interface
ip -s link show eth0           # total bytes + errors + drops

# TCP retransmits (network quality indicator):
sar -n TCP,ETCP 1 5
# retrans/s above ~1% of oseg/s (output segments) = network issue or bufferbloat

# Socket queue depth (are connections queuing?):
ss -s                          # summary: total, timewait, close-wait counts
ss -tlnp                       # listening sockets with queue depths
# Recv-Q > 0 on listening socket = application is too slow to accept() connections
# Send-Q > 0 = remote end slow to receive (TCP window full)

# Connection state breakdown:
ss -Ht state established | wc -l   # active connections (-H skips the header line)
ss -Ht state time-wait | wc -l     # TIME_WAIT (normal, but high = many short-lived connections)
ss -Ht state close-wait | wc -l    # CLOSE_WAIT (application not closing sockets)
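The same breakdown can be read from the raw file ss parses; a sketch for IPv4 sockets (IPv6 lives in /proc/net/tcp6):

```shell
# Sketch: count TCP states straight from /proc/net/tcp; the state field is
# hex: 01=ESTABLISHED, 06=TIME_WAIT, 08=CLOSE_WAIT
awk 'NR > 1 { states[$4]++ }
     END {
       printf "established: %d\n", states["01"]
       printf "time-wait:   %d\n", states["06"]
       printf "close-wait:  %d\n", states["08"]
     }' /proc/net/tcp
```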

# Dropped packets:
netstat -s | grep -i 'drop\|overflow\|fail'
# listen queue overflow = increase /proc/sys/net/core/somaxconn
# and net.ipv4.tcp_max_syn_backlog

# Bandwidth test (between servers):
iperf3 -s                      # server
iperf3 -c <server-ip> -t 30    # client, 30-second test

6. Profiling with perf

CPU profiling, cache misses, branch mispredictions
# perf stat — hardware counter summary (instant read):
perf stat -a sleep 5           # system-wide for 5 seconds
perf stat -p <pid> sleep 5     # for a specific process

# Key counters to watch:
# task-clock: CPU time (ms)
# instructions: total instructions executed
# cycles: total CPU cycles
# IPC (instructions/cycle): < 1 = CPU stalled waiting (memory-bound), > 2 = compute-bound
# cache-misses / cache-references: high ratio = memory-bound
# branch-misses / branches: high = branch misprediction (hard to optimise)
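Computing IPC from counter output can be scripted; a sketch using `perf stat -x,` CSV output, where the two input lines are made-up sample data standing in for a real measurement:

```shell
# Sketch: derive IPC from perf stat CSV counters (sample data, not a real run)
perf_csv='4215089144,,cycles
6322633716,,instructions'
echo "$perf_csv" | awk -F, '
  $3 == "cycles"       { cycles = $1 }
  $3 == "instructions" { instr  = $1 }
  END { ipc = instr / cycles
        printf "IPC: %.2f (%s)\n", ipc,
               (ipc < 1 ? "stalled / memory-bound" : "making progress") }'
# → IPC: 1.50 (making progress)
```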

# CPU profiling — sample the call stack:
perf record -F 99 -a -g -- sleep 30    # system-wide, 30s, with stack traces
perf record -F 99 -p  -g -- sleep 30  # single process
perf report                             # interactive TUI

# perf top — real-time hot functions:
perf top -p <pid>              # which functions are consuming CPU right now
perf top -a                    # system-wide

# Count specific events:
perf stat -e cache-misses,cache-references,cycles,instructions ./my-program
perf stat -e 'syscalls:sys_enter_*' -p <pid> sleep 5   # syscall frequency
# Note: perf needs kernel.perf_event_paranoid <= 1 or root. On locked-down hosts:
# echo 1 > /proc/sys/kernel/perf_event_paranoid

7. Flame Graphs

Brendan Gregg’s flame graphs — visualise CPU hot paths
# Install:
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# CPU flame graph (most common):
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg
# Open in browser — wide bars = hot functions, tall stacks = deep call chains

# For a specific process:
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > flamegraph.svg
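The out.perf-folded intermediate above has a simple format worth knowing, since you can grep and filter it before rendering: one line per unique stack, frames joined by ";", then a sample count. A sketch with hypothetical function names:

```shell
# Sketch of the "folded" format stackcollapse-perf.pl emits and
# flamegraph.pl consumes; the function names are hypothetical
cat > out.perf-folded <<'EOF'
main;handle_request;parse_json 45
main;handle_request;db_query;wait_io 120
main;gc_cycle 12
EOF
# flamegraph.pl draws one box per frame, with width proportional to the
# trailing sample count, so db_query/wait_io would dominate this graph:
#   ./flamegraph.pl out.perf-folded > flamegraph.svg
```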

# Differential flame graph (before vs after optimisation):
./difffolded.pl baseline.folded optimised.folded | ./flamegraph.pl > diff.svg
# Red = increased CPU time, blue = decreased

# Off-CPU flame graph (time NOT on CPU — lock waits, I/O waits):
# Requires BCC tools (offcputime)
/usr/share/bcc/tools/offcputime -df -p $(pgrep myapp) 30 > out.stacks
./flamegraph.pl --color=io --title="Off-CPU" out.stacks > offcpu.svg

# Go flame graphs (native pprof):
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'   # quote the URL so the shell never globs the ?
# Served at localhost:8080 with flame graph view built in

8. BCC / eBPF Tools

Production-safe dynamic tracing — no restarts, no recompilation
# Install BCC: apt install bpfcc-tools python3-bpfcc  (Ubuntu 20.04+)
# All tools live in /usr/share/bcc/tools/; for one-liners, apt install bpftrace

# File system latency:
/usr/share/bcc/tools/ext4slower 10     # ext4 ops > 10ms
/usr/share/bcc/tools/biolatency        # block I/O latency histogram
/usr/share/bcc/tools/biotop            # top disk I/O by process

# Network:
/usr/share/bcc/tools/tcpconnect        # new TCP connections with latency
/usr/share/bcc/tools/tcpretrans        # TCP retransmits with kernel stack
/usr/share/bcc/tools/tcplife           # TCP session lifespan + bytes

# CPU / scheduling:
/usr/share/bcc/tools/runqlat           # CPU run queue latency histogram
/usr/share/bcc/tools/offcputime -p <pid> 5  # time blocked off CPU

# Memory:
/usr/share/bcc/tools/memleak           # heap memory leak detection
/usr/share/bcc/tools/oomkill           # OOM killer events with context

# bpftrace one-liners (modern alternative to BCC):
bpftrace -e 'tracepoint:syscalls:sys_enter_write { @[comm] = count(); }'  # write() per process
bpftrace -e 'tracepoint:block:block_rq_issue { @bytes = hist(args->bytes); }'  # I/O size histogram
bpftrace -e 'kprobe:vfs_read { @s[tid] = nsecs; } kretprobe:vfs_read /@s[tid]/ { @us = hist((nsecs - @s[tid]) / 1000); delete(@s[tid]); }'  # VFS read latency (µs)
BCC vs bpftrace: Use BCC tools for specific pre-built analyses. Use bpftrace for custom one-liners and scripts. Both are production-safe (read-only observation).

9. strace / ltrace

System call tracing — what the process is actually asking the kernel to do
# strace: traces system calls (kernel boundary)
# Read-only observation, but ptrace overhead can slow a syscall-heavy process severalfold or worse; keep traces brief

# Attach to running process:
strace -p <pid>                        # show all syscalls
strace -p <pid> -e trace=network       # only network syscalls
strace -p <pid> -e trace=file          # only file operations
strace -p <pid> -f                     # follow forks and threads

# Count + time syscalls (lower overhead):
timeout 30 strace -c -p <pid>          # summary: calls, errors, time per syscall
strace -c ./myprogram args             # trace from start

# Per-call timing:
strace -T -p <pid>                     # show time spent in each call
# Slow read() calls = disk I/O, slow connect() = network latency

# Find what files a process is opening:
strace -e trace=openat,open -p <pid>   # file opens
# Or with less overhead:
lsof -p <pid>                          # currently open files

# ltrace: traces library calls (libc boundary)
ltrace -p <pid>                        # all library calls
ltrace -e malloc,free -c ./myprogram   # count malloc/free calls

# Finding "why is this process slow at startup":
strace -T -e trace=all ./myprogram 2>&1 | sort -t'<' -k2 -rn | head -20
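To see how the sort in that pipeline works: it keys on the "<seconds>" suffix strace -T appends to each call. A sketch with fabricated strace output:

```shell
# Sketch: sorting strace -T output by per-call time; these three lines are
# made-up trace output for illustration
trace='openat(AT_FDCWD, "config.yaml", O_RDONLY) = 3 <0.000041>
connect(4, {sa_family=AF_INET, sin_port=htons(5432)}, 16) = 0 <0.412233>
read(3, "...", 4096) = 4096 <0.002108>'
echo "$trace" | sort -t'<' -k2 -rn | head -1
# → prints the connect() line: 0.412s of network wait dominates this startup
```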

10. Micro-benchmarking

stress-ng, fio, sysbench — isolating resource performance
# CPU benchmark:
sysbench cpu --cpu-max-prime=20000 run
stress-ng --cpu 4 --timeout 60s --metrics-brief   # saturate 4 CPU cores

# Memory bandwidth:
sysbench memory --memory-block-size=1K --memory-total-size=10G run
stress-ng --vm 2 --vm-bytes 1G --timeout 60s --metrics-brief

# Disk I/O benchmark (fio):
# Sequential write:
fio --name=seqwrite --rw=write --bs=1M --size=4G --numjobs=1 \
    --ioengine=libaio --direct=1 --iodepth=16 --filename=/tmp/testfile

# Random read IOPS:
fio --name=randread --rw=randread --bs=4k --size=4G --numjobs=4 \
    --ioengine=libaio --direct=1 --iodepth=32 --filename=/tmp/testfile

# Network latency:
ping -c 100 <host> | tail -2           # min/avg/max/mdev
hping3 -S -p 80 -c 100 <host>          # TCP SYN latency

# Context switch overhead (how expensive scheduling is):
sysbench threads --thread-yields=64 --thread-locks=2 run

# Establish baseline before changes, compare after:
# 1. Run benchmark → record to file
# 2. Make change
# 3. Run benchmark → compare
# Don't trust single runs — run 3x and take median
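The run-3x-take-median advice can be sketched as a tiny wrapper; "bench" here is a hypothetical stand-in for any command that prints a single number (e.g. a requests/sec figure), so swap in your real benchmark:

```shell
# Sketch: run a benchmark three times and report the median run
bench() { awk 'BEGIN { srand(); print 100 + int(rand() * 10) }'; }  # stand-in
for run in 1 2 3; do bench; done | sort -n | awk 'NR == 2 { print "median:", $1 }'
```

Redirect each dated result to a file so the before/after comparison in steps 1-3 has something concrete to diff.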


Related: Linux Admin & Debugging Reference | Linux Networking Reference

