Linux Performance Tuning Reference: perf, Flame Graphs, eBPF, iostat & Benchmarking
Linux performance investigation tools — from the first top command to CPU flame graphs and eBPF tracing. The Brendan Gregg-style systematic approach to finding bottlenecks.
1. The 60-Second Investigation
Run these 10 commands first — always
# Brendan Gregg's 60-second analysis — run in this order:
uptime                 # 1. load average trend (1/5/15 min)
dmesg | tail -20       # 2. kernel errors (OOM, disk errors, network drops)
vmstat 1 5             # 3. runnable, memory, swap, I/O, CPU
mpstat -P ALL 1 3      # 4. per-CPU stats (imbalance? one core at 100%?)
pidstat 1 3            # 5. which processes consume CPU
iostat -xz 1 3         # 6. disk utilisation + await times
free -m                # 7. free vs available memory
sar -n DEV 1 3         # 8. network interface throughput
sar -n TCP,ETCP 1 3    # 9. TCP retransmits (network quality)
top                    # 10. overall picture, press 1 for per-CPU

# Then apply Gregg's USE method to every resource:
#   U = Utilisation (% busy)
#   S = Saturation (queue depth / wait time)
#   E = Errors (reported in dmesg / /proc/net/dev)
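The checklist can be wrapped in a small script that labels each command's output and skips anything not installed (sar, mpstat, iostat and pidstat come from the sysstat package on most distros). A minimal sketch, not a canonical tool:

```shell
#!/bin/sh
# 60-second checklist runner — sketch only.
# Each command runs only if its binary is on PATH.
run() {
    cmd=${1%% *}                                   # first word = the binary
    if command -v "$cmd" >/dev/null 2>&1; then
        printf '\n=== %s ===\n' "$1"
        sh -c "$1"
    else
        printf '\n=== %s not installed, skipped ===\n' "$cmd"
    fi
}

checklist() {
    run "uptime"
    run "dmesg | tail -20"
    run "vmstat 1 5"
    run "mpstat -P ALL 1 3"
    run "pidstat 1 3"
    run "iostat -xz 1 3"
    run "free -m"
    run "sar -n DEV 1 3"
    run "sar -n TCP,ETCP 1 3"
}

# Execute only when invoked with "go", so the file can be sourced harmlessly:
if [ "${1:-}" = "go" ]; then
    checklist
fi
```

Run it as `sh sixty.sh go`; dmesg may still need root on hardened kernels.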
2. CPU Analysis
CPU saturation, context switches, load vs utilisation
# Load average vs CPU count:
#   Load 4.0 on 4 cores = 100% utilised (each core has 1 runnable task)
#   Load 8.0 on 4 cores = 200% = tasks are queuing
#   (Linux load also counts tasks in uninterruptible I/O wait, so high load
#   with idle CPUs can point at disk, not CPU)
nproc                          # how many logical CPUs
cat /proc/loadavg              # current load averages

# Per-core utilisation:
mpstat -P ALL 1 5              # %usr %sys %iowait %irq %soft %idle per core
# iowait > 20% = I/O-bound, not CPU-bound
# sys > 20%    = kernel overhead (syscall rate, interrupts, scheduling)

# What's actually running:
pidstat -u 1 5                 # CPU per process
pidstat -t -p <pid> 1 5        # per-thread breakdown
ps aux --sort=-%cpu | head -20

# Context switches (high = lock contention or too many threads):
vmstat 1 5                     # cs column = context switches/sec
pidstat -w 1 5                 # context switches per process

# CPU steal (cloud VMs):
vmstat 1 5                     # st column — if > 0, host is overcommitted
top                            # %st in CPU line

# CPU frequency/throttling:
cpupower frequency-info        # current freq, governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
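The load-to-core arithmetic is easy to script. A sketch with a hypothetical `load_status` helper (the name and the 70%/100% thresholds are my choices, not a standard):

```shell
#!/bin/sh
# load_status LOAD CORES — classify a 1-minute load average against the
# CPU count. Remember: Linux load also counts uninterruptible (D-state)
# tasks, so "saturated" here can mean disk wait rather than CPU.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        pct = load / cores * 100
        if (pct < 70)        verdict = "headroom"
        else if (pct <= 100) verdict = "near capacity"
        else                 verdict = "saturated - tasks are queuing"
        printf "%.0f%% of %d cores: %s\n", pct, cores, verdict
    }'
}

# Live usage on Linux: compare current load to this machine's core count
if [ -r /proc/loadavg ]; then
    load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```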
3. Memory Analysis
The difference between free and available, OOM, huge pages
# Available (not free) is what matters:
free -m
#        total   used   free   shared   buff/cache   available
#        15900   8200   1200      400         6500        7100
# "free" = 1.2GB, but "available" = 7.1GB (cache is reclaimable)
# Panic if available is near zero, not if free is low

# Per-process memory:
ps aux --sort=-%mem | head -20
# VSZ = virtual, RSS = resident (what's actually in RAM)

# What's using memory in detail:
cat /proc/<pid>/status | grep -E 'Vm|Rss'
pmap -x <pid> | sort -n -k3 | tail -20   # per-mapping RSS

# Slab cache (kernel memory — can be huge on busy servers):
slabtop                        # interactive, sorted by total memory
cat /proc/slabinfo | sort -k3 -rn | head -20

# Is swap being used? (if yes, performance will suffer):
vmstat 1 5                     # si/so columns = swap in/out per second
sar -W 1 5                     # swap stats

# OOM killer events:
dmesg | grep -i 'oom\|killed process'
journalctl -k | grep -i 'oom\|killed'

# Transparent huge pages (often a latency source):
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] → can cause latency spikes from defragmentation
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled   # safer
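The "alert on available, not free" rule can be checked mechanically by parsing /proc/meminfo. A sketch — the helper reads meminfo-format text on stdin so it can be tested on canned input or pointed at the real file; the 10% alert threshold is my assumption:

```shell
#!/bin/sh
# mem_available_pct — print MemAvailable as a whole-number percentage of
# MemTotal, reading /proc/meminfo-format text from stdin.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.0f\n", avail / total * 100 }'
}

# Live usage (Linux): warn when less than 10% of memory is available
if [ -r /proc/meminfo ]; then
    pct=$(mem_available_pct < /proc/meminfo)
    if [ "${pct:-100}" -lt 10 ]; then
        echo "WARNING: only ${pct}% of memory available"
    fi
fi
```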
4. Disk I/O Analysis
iostat, iotop, and finding which process is causing the I/O
# iostat — the primary disk I/O tool:
iostat -xz 1 5
# Key columns:
#   %util       — how busy the disk is (100% = saturated)
#   await       — average I/O wait time (ms) — goal: <10ms for SSD, <20ms for HDD
#   r/s w/s     — reads/writes per second
#   rkB/s wkB/s — throughput in KB/s
#   avgqu-sz    — queue depth (>1 = disk can't keep up; newer sysstat calls it aqu-sz)

# Which process is doing I/O:
iotop -oa                      # -o: only show active, -a: accumulated
pidstat -d 1 5                 # per-process I/O stats
ls -la /proc/<pid>/fd          # open file descriptors

# File system latency (real latency seen by apps):
# (requires BCC tools)
ext4slower 10                  # show ext4 ops > 10ms
biolatency                     # histogram of block I/O latency

# Dirty page flush storms:
cat /proc/meminfo | grep Dirty
# Dirty: large value = unflushed writes
# cat /proc/sys/vm/dirty_ratio            — max dirty % before writers block (default 20)
# cat /proc/sys/vm/dirty_background_ratio — background flush threshold (default 10)
# Tune for write-heavy servers: dirty_ratio=40, dirty_background_ratio=10

# inode exhaustion (disk full but df shows space):
df -i                          # shows inode usage per filesystem
# 100% inodes = new files can't be created despite free disk space
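Eyeballing iostat -x output against the thresholds (await over 10 ms, %util near 100) can be automated. A sketch that filters device lines on stdin; because sysstat versions lay the table out differently, the await and %util column numbers are passed in rather than hard-coded:

```shell
#!/bin/sh
# flag_slow_disks AWAIT_COL UTIL_COL — print devices whose await > 10 ms
# or %util > 80 from iostat -x style device lines on stdin.
flag_slow_disks() {
    awk -v ac="$1" -v uc="$2" '
        $1 != "Device" && NF >= uc {
            if ($ac + 0 > 10 || $uc + 0 > 80)
                printf "%s: await=%sms util=%s%%\n", $1, $ac, $uc
        }'
}
```

Usage might look like `iostat -xz 1 2 | flag_slow_disks 10 21` after checking your iostat's column order with the header line.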
5. Network Performance
TCP retransmits, socket backlog, network saturation
# Interface throughput:
sar -n DEV 1 5                 # rxkB/s txkB/s per interface
ip -s link show eth0           # total bytes + errors + drops

# TCP retransmits (network quality indicator):
sar -n TCP,ETCP 1 5
# retrans/s above ~1% of outgoing segments (oseg/s) = network issue or buffer bloat

# Socket queue depth (are connections queuing?):
ss -s                          # summary: total, timewait, close-wait counts
ss -tlnp                       # listening sockets with queue depths
# Recv-Q > 0 on a listening socket = application too slow to accept() connections
# Send-Q > 0 = remote end slow to receive (TCP window full)

# Connection state breakdown:
ss -ta state established | wc -l   # active connections
ss -ta state time-wait | wc -l     # TIME_WAIT (normal, but high = many short-lived connections)
ss -ta state close-wait | wc -l    # CLOSE_WAIT (application not closing sockets)

# Dropped packets:
netstat -s | grep -i 'drop\|overflow\|fail'
# listen queue overflow = increase /proc/sys/net/core/somaxconn
# and net.ipv4.tcp_max_syn_backlog

# Bandwidth test (between servers):
iperf3 -s                      # server
iperf3 -c <server> -t 30       # client, 30-second test
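The 1% retransmit rule of thumb is a simple ratio over two sar counters. A sketch with a hypothetical `retrans_pct` helper (name mine) taking the average retrans/s from `sar -n ETCP` and the average oseg/s from `sar -n TCP`:

```shell
#!/bin/sh
# retrans_pct RETRANS_PER_S OSEG_PER_S — retransmitted segments as a
# percentage of outgoing segments. Sustained values above ~1% usually
# mean packet loss, a congested path, or undersized buffers.
retrans_pct() {
    awk -v r="$1" -v o="$2" 'BEGIN {
        if (o > 0) printf "%.2f\n", r / o * 100
    }'
}
```

Example: `retrans_pct 5 1000` reports 0.50 (percent), comfortably under the 1% line.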
6. Profiling with perf
CPU profiling, cache misses, branch mispredictions
# perf stat — hardware counter summary (instant read):
perf stat -a sleep 5           # system-wide for 5 seconds
perf stat -p <pid> -- sleep 5  # a specific process for 5 seconds

# Key counters to watch:
#   task-clock:    CPU time (ms)
#   instructions:  total instructions executed
#   cycles:        total CPU cycles
#   IPC (instructions/cycle): <1 = CPU stalled waiting (memory-bound), >2 = compute-bound
#   cache-misses / cache-references: high ratio = memory-bound
#   branch-misses / branches: high = branch misprediction (hard to optimise)

# CPU profiling — sample the call stack:
perf record -F 99 -a -g -- sleep 30        # system-wide, 30s, with stack traces
perf record -F 99 -p <pid> -g -- sleep 30  # single process
perf report                                # interactive TUI

# perf top — real-time hot functions:
perf top -p <pid>              # which functions are consuming CPU right now
perf top -a                    # system-wide

# Count specific events:
perf stat -e cache-misses,cache-references,cycles,instructions ./my-program
perf stat -e 'syscalls:sys_enter_*' -p <pid> -- sleep 5   # syscall frequency
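The IPC heuristic is just a division over two perf counters. A sketch that applies the same rough verdicts, taking the instruction and cycle counts as arguments:

```shell
#!/bin/sh
# ipc INSTRUCTIONS CYCLES — instructions-per-cycle plus the rough verdict:
# below 1 = stalled (likely memory-bound), above 2 = compute-bound.
# Thresholds are heuristics, not hard rules — they vary by microarchitecture.
ipc() {
    awk -v i="$1" -v c="$2" 'BEGIN {
        v = i / c
        if (v < 1)      verdict = "stalled / memory-bound"
        else if (v > 2) verdict = "compute-bound"
        else            verdict = "mixed"
        printf "IPC %.2f (%s)\n", v, verdict
    }'
}
```

Feed it the `instructions` and `cycles` lines from a `perf stat` run, e.g. `ipc 48000000000 30000000000`.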
perf requires kernel.perf_event_paranoid ≤ 1 or root access. On cloud VMs, set: echo 1 > /proc/sys/kernel/perf_event_paranoid
7. Flame Graphs
Brendan Gregg’s flame graphs — visualise CPU hot paths
# Install:
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# CPU flame graph (most common):
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg
# Open in browser — wide bars = hot functions, tall stacks = deep call chains

# For a specific process:
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > flamegraph.svg

# Differential flame graph (before vs after optimisation):
./difffolded.pl baseline.folded optimised.folded | ./flamegraph.pl > diff.svg
# Red = increased CPU time, blue = decreased

# Off-CPU flame graph (time NOT on CPU — lock waits, I/O waits):
# Requires BCC tools (offcputime)
/usr/share/bcc/tools/offcputime -df -p $(pgrep myapp) 30 > out.stacks
./flamegraph.pl --color=io --title="Off-CPU" out.stacks > offcpu.svg

# Go flame graphs (native pprof):
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
# Served at localhost:8080 with flame graph view built in
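The folded format that stackcollapse-perf.pl emits (semicolon-joined stack, a space, then the sample count) is simple enough to post-process yourself before rendering. A sketch that totals samples per leaf (innermost) function — it assumes frame names contain no spaces, which holds for typical collapsed perf output:

```shell
#!/bin/sh
# leaf_totals — given folded stacks ("main;foo;bar 42") on stdin, total
# the samples attributed to each leaf function, hottest first.
leaf_totals() {
    awk '{
        count = $NF                         # last field = sample count
        n = split($1, frames, ";")          # $1 = semicolon-joined stack
        leaf[frames[n]] += count
    }
    END { for (f in leaf) print leaf[f], f }' | sort -rn
}
```

Usage: `leaf_totals < out.perf-folded | head` gives a quick "hot leaves" list without opening the SVG.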
8. BCC / eBPF Tools
Production-safe dynamic tracing — no restarts, no recompilation
# Install BCC (Ubuntu 20.04+):
apt install bpfcc-tools python3-bpfcc   # tools land in /usr/share/bcc/tools/
apt install bpftrace                    # for the bpftrace one-liners below

# File system latency:
/usr/share/bcc/tools/ext4slower 10      # ext4 ops > 10ms
/usr/share/bcc/tools/biolatency         # block I/O latency histogram
/usr/share/bcc/tools/biotop             # top disk I/O by process

# Network:
/usr/share/bcc/tools/tcpconnect         # new TCP connections with latency
/usr/share/bcc/tools/tcpretrans         # TCP retransmits with kernel stack
/usr/share/bcc/tools/tcplife            # TCP session lifespan + bytes

# CPU / scheduling:
/usr/share/bcc/tools/runqlat            # CPU run queue latency histogram
/usr/share/bcc/tools/offcputime -p <pid> 5   # time blocked off CPU, 5s

# Memory:
/usr/share/bcc/tools/memleak            # heap memory leak detection
/usr/share/bcc/tools/oomkill            # OOM killer events with context

# bpftrace one-liners (modern alternative to BCC):
bpftrace -e 'tracepoint:syscalls:sys_enter_write { @[comm] = count(); }'
# write() calls per process
bpftrace -e 'tracepoint:block:block_rq_issue { @bytes = hist(args->bytes); }'
# I/O size histogram
bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
             kretprobe:vfs_read /@start[tid]/ {
                 @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'
# VFS read latency histogram (entry-to-return timestamp delta)
9. strace / ltrace
System call tracing — what the process is actually asking the kernel to do
# strace: traces system calls (kernel boundary)
# Safe in the sense of read-only observation, but it adds serious overhead
# (often 30-100%, far more on syscall-heavy workloads) — keep sessions brief

# Attach to running process:
strace -p <pid>                    # show all syscalls
strace -p <pid> -e trace=network   # only network syscalls
strace -p <pid> -e trace=file      # only file operations
strace -p <pid> -f                 # -f: also follow child processes/threads

# Count + time syscalls (lower overhead):
strace -c -p <pid>                 # Ctrl-C after ~30s for the summary table
strace -c ./myprogram args         # trace from start
# summary table: calls, errors, time per syscall

# Per-call timing:
strace -T -p <pid>                 # show time spent in each call
# Slow read() calls = disk I/O, slow connect() = network latency

# Find what files a process is opening:
strace -e trace=openat,open -p <pid>
# Or with less overhead:
lsof -p <pid>                      # currently open files

# ltrace: traces library calls (libc boundary)
ltrace -p <pid>                    # all library calls
ltrace -e malloc,free -c ./myprogram   # count malloc/free calls

# Finding "why is this process slow at startup":
strace -T -e trace=all ./myprogram 2>&1 | sort -t'<' -k2 -rn | head -20
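The `strace -c` summary table can be post-processed the same way. A sketch that pulls out the dominant syscall — it assumes the standard `strace -c` layout (% time, seconds, usecs/call, calls, errors, syscall) and relies on strace already sorting the table by % time:

```shell
#!/bin/sh
# top_syscall — from an `strace -c` summary on stdin, print the syscall
# that consumed the most time (first data row of the sorted table).
top_syscall() {
    # Data rows start with a number; the header and ---- separator do not.
    awk '$1 ~ /^[0-9]/ { print $NF; exit }'
}
```

Usage might look like `strace -c ./myprogram 2>&1 | top_syscall`.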
10. Micro-benchmarking
stress-ng, fio, sysbench — isolating resource performance
# CPU benchmark:
sysbench cpu --cpu-max-prime=20000 run
stress-ng --cpu 4 --timeout 60s --metrics-brief # saturate 4 CPU cores
# Memory bandwidth:
sysbench memory --memory-block-size=1K --memory-total-size=10G run
stress-ng --vm 2 --vm-bytes 1G --timeout 60s --metrics-brief
# Disk I/O benchmark (fio):
# Sequential write:
fio --name=seqwrite --rw=write --bs=1M --size=4G --numjobs=1 \
--ioengine=libaio --direct=1 --iodepth=16 --filename=/tmp/testfile
# Random read IOPS:
fio --name=randread --rw=randread --bs=4k --size=4G --numjobs=4 \
--ioengine=libaio --direct=1 --iodepth=32 --filename=/tmp/testfile
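The same random-read job can also be expressed as a fio job file, which is easier to version-control and rerun than a long command line. A sketch mirroring the flags above (filename and sizes are the example values, not recommendations):

```ini
; randread.fio — job-file form of the command-line invocation above
[global]
ioengine=libaio
direct=1
size=4G
filename=/tmp/testfile

[randread]
rw=randread
bs=4k
numjobs=4
iodepth=32
```

Run it with `fio randread.fio`.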
# Network latency:
ping -c 100 <host> | tail -2 # min/avg/max/mdev
hping3 -S -p 80 -c 100 <host> # TCP SYN latency
# Context switch overhead (how expensive scheduling is):
sysbench threads --thread-yields=64 --thread-locks=2 run
# Establish baseline before changes, compare after:
# 1. Run benchmark → record to file
# 2. Make change
# 3. Run benchmark → compare
# Don't trust single runs — run 3x and take median
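The "run 3x and take the median" step is easy to script. A sketch with a hypothetical `median` helper over whitespace-separated benchmark results; the median resists a single outlier run better than the mean does:

```shell
#!/bin/sh
# median N1 N2 N3 ... — print the median of the given numeric results.
median() {
    printf '%s\n' "$@" | sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR % 2) print v[(NR + 1) / 2]           # odd count
            else printf "%.2f\n", (v[NR/2] + v[NR/2 + 1]) / 2  # even count
        }'
}
```

Usage: collect one number per run (e.g. fio IOPS), then `median 41250 40890 43100`.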