LinuxCon Europe 2014: Linux Performance Tools 2014
Talk by Brendan Gregg for LinuxCon Europe, 2014. Video: https://www.youtube.com/watch?v=SN7Z0eCn0VY
Description: "There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This talk summarizes the three types of performance tools: observability, benchmarking, and tuning, providing a tour of what exists and why they exist. Advanced tools including those based on tracepoints, kprobes, and uprobes are also included: perf_events, ktap, SystemTap, LTTng, and sysdig. You'll gain a good understanding of the performance tools landscape, knowing what to reach for to get the most out of your systems."
PDF: LinuxConEU2014_LinuxPerfTools.pdf
Keywords (from pdftotext):
slide 1:
Oct 2014
Linux Performance Tools
Brendan Gregg, Senior Performance Architect, Performance Engineering Team
bgregg@netflix.com  @brendangregg
slide 2:
A quick tour of many tools…
slide 3:
• Massive AWS EC2 Linux cloud
  – Tens of thousands of instances
  – Autoscale by ~3k each day
  – CentOS and Ubuntu
• FreeBSD for content delivery
  – Approx 33% of US Internet traffic at night
• Performance is critical
  – Customer satisfaction: >50M subscribers
  – $$$ price/performance
  – Develop tools for cloud-wide analysis; use server tools as needed
• Just launched in Europe!
slide 4:
Brendan Gregg
• Senior Performance Architect, Netflix
  – Linux and FreeBSD performance
  – Performance Engineering team (@coburnw)
• Recent work:
  – Linux perf-tools, using ftrace & perf_events
  – Systems Performance, Prentice Hall
• Previous work includes:
  – USE Method, flame graphs, utilization & latency heat maps, DTrace tools, ZFS L2ARC
• Twitter @brendangregg (these slides)
slide 5:
Agenda
• Methodologies & Tools
• Tool Types:
  – Observability: Basic, Intermediate, Advanced
  – Benchmarking
  – Tuning
  – Static
• Tracing
Aim: to show what can be done. Knowing that something can be done is more important than knowing how to do it.
slide 6:
Methodologies & Tools
slide 7:
Methodologies & Tools
• There are dozens of performance tools for Linux
  – Packages: sysstat, procps, coreutils, …
  – Commercial products
• Methodologies can provide guidance for choosing and using tools effectively
slide 8:
Anti-Methodologies
• The lack of a deliberate methodology…
• Street Light Anti-Method:
  – 1. Pick observability tools that are familiar, found on the Internet, or chosen at random
  – 2. Run tools
  – 3. Look for obvious issues
• Drunk Man Anti-Method:
  – Tune things at random until the problem goes away
slide 9:
Methodologies
• For example, the USE Method:
  – For every resource, check: Utilization, Saturation, Errors
• 5 Whys:
  – Ask “why?” 5 times
• Other methods include:
  – Workload characterization, drill-down analysis, event tracing, baseline stats, static performance tuning, …
• Start with the questions, then find the tools (see the sketch below)
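As an illustration of the USE Method in practice (my example, not from the slides), here is one possible mapping for the CPU resource onto tools covered in this talk:

$ mpstat -P ALL 1      # Utilization: per-CPU %usr, %sys
$ vmstat 1             # Saturation: "r" (runnable tasks) exceeding the CPU count
$ dmesg | grep -i mce  # Errors: eg, machine check exceptions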
slide 10:
Command Line Tools
• Useful to study even if you never use them: GUIs and commercial products often use the same interfaces (kernel /proc, /sys, …; see below)
$ vmstat 1
procs -----------memory---------- ---swap-- […]
 r  b    swpd     free   buff    cache  […]
 9  0       0 29549320  29252  9299060  […]
 2  0       0 29547876  29252  9299332  […]
 4  0       0 29548124  29252  9299460  […]
 5  0       0 29548840  29252  9299592  […]
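For instance, the counters behind free(1) and uptime(1) can be read straight from /proc, a quick illustration of that shared interface (mine, not from the slides):

$ head -4 /proc/meminfo    # memory counters, as used by free(1)
$ cat /proc/loadavg        # load averages, as used by uptime(1)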
slide 11:
Tool Types
  Type           Characteristic
  Observability  Watch activity. Safe, usually, depending on resource overhead.
  Benchmarking   Load test. Caution: production tests can cause issues due to contention.
  Tuning         Change. Danger: changes could hurt performance, now or later with load.
  Static         Check configuration. Should be safe.
slide 12:
Observability Tools
slide 13:
How do you measure these?
slide 14:
Observability Tools: Basic
• uptime
• top (or htop)
• ps
• vmstat
• iostat
• mpstat
• free
slide 15:
uptime
• One way to print load averages:
$ uptime
 07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91
• A measure of resource demand: CPUs + disks
  – Other OSes only show CPUs: easier to interpret
• Exponentially-damped moving averages with time constants of 1, 5, and 15 minutes
  – Historic trend without the line graph
• Load > # of CPUs may mean CPU saturation (quick check below)
  – Don’t spend more than 5 seconds studying these
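A quick way to make that comparison (my illustration, not from the slides):

$ uptime; nproc    # compare the 1-minute load average with the CPU count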
slide 16:
top (or htop)
• System and per-process interval summary:
$ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22
Tasks: 209 total, 1 running, 206 sleeping, 0 stopped, 2 zombie
Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
Mem: 70197156k total, 44831072k used, 25366084k free, 36360k buffers
Swap: 0k total, 0k used, 0k free, 11873356k cached
  PID USER     PR NI   VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 5738 apiprod  [...]  62.6g  29g 352m S  417 44.2  2144:15 java
 1386 apiprod  [...]  17452 1388  964 R    0  0.0  0:00.02 top
    1 root     [...]  24340 2272 1340 S    0  0.0  0:01.51 init
    2 root     [...]      0    0    0 S    0  0.0  0:00.00 kthreadd
[…]
• %CPU is summed across all CPUs
• Can miss short-lived processes (atop won’t)
• Can consume noticeable CPU to read /proc
slide 17:
htop
slide 18:
ps
• Process status listing (eg, “ASCII art forest”):
$ ps -ef f
UID    PID  PPID [...] STIME TTY      TIME CMD
root  4546     1 [...] 11:08 ?        0:00 /usr/sbin/sshd -D
root 28261  4546 [...] 17:24 ?        0:00  \_ sshd: prod [priv]
prod 28287 28261 [...] 17:24 ?        0:00      \_ sshd: prod@pts/0
prod 28288 28287 [...] 17:24 pts/0    0:00          \_ -bash
prod  3156 28288 [...] 19:15 pts/0    0:00              \_ ps -ef f
root  4965     1 [...] 11:08 ?        0:00 /bin/sh /usr/bin/svscanboot
root  4969  4965 [...] 11:08 ?        0:00  \_ svscan /etc/service
[…]
• Custom fields:
$ ps -eo user,sz,rss,minflt,majflt,pcpu,args
USER    SZ   RSS MINFLT MAJFLT %CPU COMMAND
root  6085  2272  11928     24  0.0 /sbin/init
[…]
slide 19:
vmstat
• Virtual memory statistics and more:
$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b  swpd  free  buff  cache  si  so  bi  bo  in  cs us sy id wa
 8  0  [...]
 7  0  [...]
 8  0  [...]
 8  0  [...]
[…]
• USAGE: vmstat [interval [count]]
• First output line has some summary-since-boot values (should be all; partial is confusing)
• High-level CPU summary. “r” is runnable tasks.
slide 20:
iostat
• Block I/O (disk) stats. 1st output is since boot.
$ iostat -xmdz 1
Linux 3.13.0-29 (db001-eb883efa)  08/18/2014  _x86_64_  (16 CPU)
Device:  rrqm/s   wrqm/s    r/s  w/s  rMB/s  wMB/s [...] avgqu-sz  await r_await w_await  svctm  %util
xvda       0.00     0.00  [...]                                                                   0.00
xvdb       0.00 15299.00  [...]                                                                  86.40
xvdc       0.00 15271.00  [...]                                                                  86.00
md0        0.00 31082.00  [...]                                                                   0.00
(The left columns describe the workload; the right columns, such as await and %util, show the resulting performance.)
• Very useful set of stats
slide 21:
mpstat
• Multi-processor statistics, per-CPU:
$ mpstat -P ALL 1
[…]
08:06:43 PM  CPU   %usr %nice  %sys %iowait  %irq %soft %steal %guest  %idle
08:06:44 PM  all  53.45 [...]                                          42.26
08:06:44 PM    0  49.49 [...]                                          45.45
08:06:44 PM    1  51.61 [...]                                          41.94
08:06:44 PM    2  58.16 [...]                                          33.67
08:06:44 PM    3  54.55 [...]                                          40.40
08:06:44 PM    4  47.42 [...]                                          49.48
08:06:44 PM    5  65.66 [...]                                          31.31
08:06:44 PM    6  50.00 [...]                                          47.92
[…]
• Look for unbalanced workloads, hot CPUs.
slide 22:
free
• Main memory usage:
$ free -m
             total   used   free  shared  buffers  cached
Mem:         [...]
-/+ buffers/cache:   [...]
Swap:        [...]
• buffers: block device I/O cache
• cached: virtual page cache
slide 23:
Observability Tools: Basic
slide 24:
Observability Tools: Intermediate
• strace
• tcpdump
• netstat
• nicstat
• pidstat
• swapon
• lsof
• sar (and collectl, dstat, etc.)
slide 25:
strace
• System call tracer:
$ strace -tttT -p 313
1408393285.779746 getgroups(0, NULL)               = 1 <…>
1408393285.779873 getgroups(1, [0])                = 1 <…>
1408393285.780797 close(3)                         = 0 <…>
1408393285.781338 write(1, "LinuxCon 2014!\n", 15) = 15 <…>
• Eg, -ttt: time (us) since epoch; -T: syscall time (s)
• Translates syscall args
  – Very helpful for solving system usage issues
• Currently has massive overhead (ptrace based)
  – Can slow the target by > 100x. Use extreme caution.
slide 26:
tcpdump
• Sniff network packets for post analysis:
$ tcpdump -i eth0 -w /tmp/out.tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C7985 packets captured
8996 packets received by filter
1010 packets dropped by kernel
# tcpdump -nr /tmp/out.tcpdump | head
reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
[…]
• Study packet sequences with timestamps (us)
• CPU overhead optimized (socket ring buffers), but can still be significant. Use caution.
slide 27:
netstat
• Various network protocol statistics using -s:
$ netstat -s
[…]
Tcp:
    736455 active connections openings
    176887 passive connection openings
    33 failed connection attempts
    1466 connection resets received
    3311 connections established
    91975192 segments received
    180415763 segments send out
    223685 segments retransmited
    2 bad segments received.
    39481 resets sent
[…]
TcpExt:
    12377 invalid SYN cookies received
    2982 delayed acks sent
[…]
• A multi-tool:
  – -i: interface stats
  – -r: route table
  – default: list conns
• netstat -p: shows process details!
• Per-second interval with -c
slide 28:
nicstat
• Network interface stats, iostat-like output:
$ ./nicstat 1
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util    Sat
21:21:43       lo   [...]                                                  0.00
21:21:43     eth0   [...]                                                  0.00
21:21:44       lo   [...]                                                  0.00
21:21:44     eth0   [...]                                                  0.00
21:21:45       lo   [...]                                                  0.00
21:21:45     eth0   [...]                                                  0.00
[…]
• Check network throughput and interface %util
• I wrote this years ago; Tim Cook ported it to Linux
slide 29:
pidstat
• Very useful process stats. eg, by-thread, disk I/O:
$ pidstat -t 1
Linux 3.2.0-54 (db002-91befe03)  08/18/2014  _x86_64_  (8 CPU)
08:57:52 PM  TGID   TID    %usr  %system  %guest    %CPU  CPU  Command
08:57:54 PM [...]     -  484.75    [...]    0.00  524.58  [...] java
08:57:54 PM     -  [...]  [...]                           [...] |__java
08:57:54 PM     -  [...]  [...]                           [...] |__java
08:57:54 PM     -  [...]  [...]                           [...] |__java
08:57:54 PM     -  [...]  [...]                           [...] |__java
[…]
$ pidstat -d 1
[…]
08:58:27 PM   PID  kB_rd/s  kB_wr/s  kB_ccwr/s  Command
08:58:28 PM [...]    [...]    [...]       0.00  java
[…]
• I usually prefer this over top(1)
slide 30:
swapon
• Show swap device usage:
$ swapon -s
Filename   Type       Size   Used  Priority
/dev/sda3  partition  [...]  [...]       -1
• If you have swap enabled…
slide 31:
lsof
• More a debug tool, lsof(8) shows file descriptor usage, which for some apps equals current active network connections:
# lsof -iTCP -sTCP:ESTABLISHED
COMMAND    PID  USER  FD   TYPE    DEVICE  SIZE/OFF  NODE  NAME
sshd       755  root  3r   IPv4  13576887       0t0   TCP  bgregg-test-i-f106:ssh->prod100.netflix.com:15241 (ESTABLISHED)
platforms 2614  app1  8u   IPv4     [...]       0t0   TCP  localhost:33868->localhost:5433 (ESTABLISHED)
postgres  2648  app1  7u   IPv4     [...]       0t0   TCP  localhost:5433->localhost:33868 (ESTABLISHED)
epic_plug 2857  app1  7u   IPv4     [...]       0t0   TCP  localhost:33885->localhost:5433 (ESTABLISHED)
postgres  2892  app1  7u   IPv4     [...]       0t0   TCP  localhost:5433->localhost:33885 (ESTABLISHED)
[…]
• I’d prefer to: ls /proc/PID/fd | wc -l
slide 32:
sar
• System Activity Reporter. Many stats, eg:
$ sar -n TCP,ETCP,DEV 1
Linux 3.2.55 (test-e4f1a80b)  08/18/2014  _x86_64_  (8 CPU)
09:10:43 PM  IFACE  rxpck/s  txpck/s   rxkB/s    txkB/s  rxcmp/s  txcmp/s  rxmcst/s
09:10:44 PM     lo    [...]                                                    0.00
09:10:44 PM   eth0  4114.00  4186.00  4537.46  28513.24    [...]               0.00
09:10:43 PM  active/s  passive/s   iseg/s    oseg/s
09:10:44 PM     [...]      [...]  4107.00  22511.00
09:10:43 PM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
09:10:44 PM     [...]                                     1.00
[…]
• Archive or live mode: (interval [count])
• Well designed. Header naming convention, logical groups: TCP, ETCP, DEV, EDEV, …
slide 33:
Observability: sar
slide 34:
Other Tools
• You may also use collectl, atop, dstat, or another measure-all tool
• The tool isn’t important
• It’s important to have a way to measure everything you want
• In cloud environments, you are probably using a monitoring product, developed in-house or commercial. Same method applies…
slide 35:
How does your monitoring tool measure these?
slide 36:
Observability Tools: Intermediate
slide 37:
Advanced Observability Tools
• Misc:
  – ltrace, ss, iptraf, ethtool, snmpget, lldptool, iotop, blktrace, slabtop, /proc, pcstat
• CPU Performance Counters:
  – perf_events, tiptop, rdmsr
• Advanced Tracers:
  – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng, dtrace4linux, sysdig
• Some selected demos…
slide 38:
ss
• More socket statistics:
$ ss -mop
State      Recv-Q Send-Q  Local Address:Port   Peer Address:Port
CLOSE-WAIT 1      0       127.0.0.1:42295      127.0.0.1:28527    users:(("apacheLogParser",2702,3))
     mem:(r1280,w0,f2816,t0)
ESTAB      0      0       127.0.0.1:5433       127.0.0.1:41312    timer:(keepalive,36min,0) users:(("postgres",2333,7))
     mem:(r0,w0,f0,t0)
[…]
$ ss -i
State      Recv-Q Send-Q  Local Address:Port   Peer Address:Port
CLOSE-WAIT 1      0       127.0.0.1:42295      127.0.0.1:28527
     cubic wscale:6,6 rto:208 rtt:9/6 ato:40 cwnd:10 send 145.6Mbps rcv_space:32792
ESTAB      0      0       10.144.107.101:ssh   10.53.237.72:4532
     cubic wscale:4,6 rto:268 rtt:71.5/3 ato:40 cwnd:10 send 1.5Mbps rcv_rtt:72 rcv_space:14480
[…]
slide 39:
iptraf
slide 40:
iotop
• Block device I/O (disk) by process:
$ iotop
Total DISK READ: 50.47 M/s | Total DISK WRITE: 59.21 M/s
  TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  959  be/4  root   0.00 B/s    0.00 B/s  0.00 %  99.99 %  [flush-202:1]
 6641  be/4  root  50.47 M/s   82.60 M/s  0.00 %  32.51 %  java -Dnop -X
    1  be/4  root   0.00 B/s    0.00 B/s  0.00 %   0.00 %  init
    2  be/4  root   0.00 B/s    0.00 B/s  0.00 %   0.00 %  [kthreadd]
    3  be/4  root   0.00 B/s    0.00 B/s  0.00 %   0.00 %  [ksoftirqd/0]
    4  be/4  root   0.00 B/s    0.00 B/s  0.00 %   0.00 %  [kworker/0:0]
    5  be/4  root   0.00 B/s    0.00 B/s  0.00 %   0.00 %  [kworker/u:0]
    6  rt/4  root   0.00 B/s    0.00 B/s  0.00 %   0.00 %  [migration/0]
[…]
• Needs kernel support enabled
  – CONFIG_TASK_IO_ACCOUNTING
slide 41:
slabtop
• Kernel slab allocator memory usage:
$ slabtop
 Active / Total Objects (% used)    : 4692768 / 4751161 (98.8%)
 Active / Total Slabs (% used)      : 129083 / 129083 (100.0%)
 Active / Total Caches (% used)     : 71 / 109 (65.1%)
 Active / Total Size (% used)       : 729966.22K / 738277.47K (98.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K
    OBJS   ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
 3565575  3565575 100%     0.10K  91425     [...]     365700K  buffer_head
  314916   314066  99%     0.19K  14996     [...]      59984K  dentry
  184192   183751  99%     0.06K  [...]                11512K  kmalloc-64
  138618   138618 100%     0.94K  [...]               130464K  xfs_inode
  138602   138602 100%     0.21K  [...]                29968K  xfs_ili
  102116    99012  96%     0.55K  [...]                58352K  radix_tree_node
   97482    49093  50%     0.09K  [...]                 9284K  kmalloc-96
   22695    20777  91%     0.05K  [...]                 1068K  shared_policy_node
   21312    21312 100%     0.86K  [...]                18432K  ext4_inode_cache
   16288    14601  89%     0.25K  [...]                 4072K  kmalloc-256
[…]
slide 42:
pcstat
• Show page cache residency by file:
# ./pcstat data0*
|----------+----------------+------------+-----------+---------|
| Name     | Size           | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| data00   | 104857600      | 25600      | 25600     | 100.000 |
| data01   | 104857600      | 25600      | 25600     | 100.000 |
| data02   | 104857600      | 25600      | 4080      | 015.938 |
| data03   | 104857600      | 25600      | 25600     | 100.000 |
| data04   | 104857600      | 25600      | 16010     | 062.539 |
| data05   | 104857600      | 25600      | 0         | 000.000 |
|----------+----------------+------------+-----------+---------|
• Uses the mincore(2) syscall. Useful for database performance analysis.
slide 43:
perf_events (counters)
• Performance Monitoring Counters (PMCs):
$ perf list | grep -i hardware
  cpu-cycles OR cycles                               [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  instructions                                       [Hardware event]
[…]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
[…]
  rNNN (see 'perf list --help' on how to encode it)  [Raw hardware event …]
  mem:<addr>[:access]                                [Hardware breakpoint]
• Identify CPU cycle breakdowns, esp. stall types (counting example below)
  – PMCs not enabled by-default in clouds (yet)
  – Can be time-consuming to use (CPU manuals)
• Use flame graphs to visualize sampled stack traces
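To count these events, perf can run system-wide or against one command; a minimal illustration (mine, not from the slides; "gzip file1" stands in for any command under test), using event names from the listing above:

$ perf stat -a sleep 10      # system-wide counters for 10 seconds
$ perf stat -e cycles,instructions,stalled-cycles-frontend gzip file1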
slide 44:
perf_events CPU Flame Graph
(Figure: a CPU flame graph, with annotations pointing out the kernel TCP/IP stacks, broken Java stacks due to a missing frame pointer, GC, locks, time, and the idle thread in epoll.)
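For reference, such a flame graph is built from perf_events sampled stacks using the FlameGraph tools (https://github.com/brendangregg/FlameGraph); the commands below are from that project, not from this slide:

$ perf record -F 99 -a -g -- sleep 30
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > out.svg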
slide 45:
tiptop
• IPC by process, %MISS, %BUS
• Needs some love. perfmon2 library integration?
• Still can’t use it in clouds yet (needs PMCs enabled)
slide 46:
rdmsr
• Model Specific Registers (MSRs), unlike PMCs, can be read by default in Xen guests
  – Timestamp clock, temp, power, …
  – Use rdmsr(1) from the msr-tools package to read them (example below)
  – Uses include (https://github.com/brendangregg/msr-cloud-tools):
ec2-guest# ./showboost
[...]
TIME      C0_MCYC  C0_ACYC  UTIL  RATIO   MHz
06:11:35    [...]    [...]   51%   116%  2900
06:11:40    [...]    [...]   50%   115%  2899
06:11:45    [...]    [...]   49%   115%  2899
[...]
(showboost: real CPU MHz)
ec2-guest# ./cputemp 1
CPU1 CPU2 CPU3 CPU4
  61   61   60   59
  60   61   60   60
[...]
(cputemp: CPU temperature)
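Reading a single MSR directly (my illustration, not from the slides; register 0x10 is the IA32 time-stamp counter, and MSR addresses vary by CPU):

# modprobe msr        # load the MSR driver if needed
# rdmsr -p 0 0x10     # read MSR 0x10 on processor 0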
slide 47:
More Advanced Tools…
• Some others worth mentioning:
  Tool       Description
  ltrace     Library call tracer
  ethtool    Mostly interface tuning; some stats
  snmpget    SNMP network host statistics
  lldptool   Can get LLDP broadcast stats
  blktrace   Block I/O event tracer
  /proc      Many raw kernel counters
  pmu-tools  On- and off-core CPU counter tools
slide 48:
Advanced Tracers
• Many options on Linux:
  – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng, dtrace4linux, sysdig
• Most can do static and dynamic tracing
  – Static: pre-defined events (tracepoints)
  – Dynamic: instrument any software (kprobes, uprobes). Custom metrics on-demand. Catch all.
• Many are in development.
  – I’ll summarize their state later…
slide 49:
Linux Observability Tools
slide 50:
Linux Observability Tools
slide 51:
Benchmarking Tools
slide 52:
Benchmarking Tools
• Multi:
  – UnixBench, lmbench, sysbench, perf bench (example below)
• FS/disk:
  – dd, hdparm, fio
• App/lib:
  – ab, wrk, jmeter, openssl
• Networking:
  – ping, hping3, iperf, ttcp, traceroute, mtr, pchar
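perf bench ships with the perf command and is easy to try; one of its micro-benchmarks as an illustration (mine, not from the slides):

$ perf bench sched pipe    # scheduler/IPC: two tasks passing data over a pipe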
slide 53:
Active Benchmarking
• Most benchmarks are misleading or wrong
  – You benchmark A, but actually measure B, and conclude that you measured C
• Active Benchmarking:
  – 1. Run the benchmark for hours
  – 2. While running, analyze and confirm the performance limiter using observability tools
• We just covered those tools – use them! (A sketch follows below.)
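A sketch of active benchmarking (my illustration, assuming fio from the FS/disk list on the previous slide): drive a known load, then confirm what actually limits it.

$ fio --name=seqwrite --rw=write --bs=128k --size=4g --time_based --runtime=3600 &
$ iostat -xmdz 1     # is the disk really the limiter (~100 %util)?
$ mpstat -P ALL 1    # or is a hot CPU limiting the benchmark itself?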
slide 54:
lmbench
• CPU, memory, and kernel micro-benchmarks
• Eg, memory latency by stride size:
$ lat_mem_rd 100m 128 > out.latencies
some R processing…
(Figure: memory latency vs. working set size, with steps annotated L1 cache, L2 cache, L3 cache, and Main Memory.)
slide 55:
fio
• FS or disk I/O micro-benchmarks:
$ fio --name=seqwrite --rw=write --bs=128k --size=122374m
[…]
seqwrite: (groupid=0, jobs=1): err= 0: pid=22321
  write: io=122374MB, bw=840951KB/s, iops=6569 , runt=149011msec
    clat (usec): min=41 , max=133186 , avg=148.26, stdev=1287.17
     lat (usec): min=44 , max=133188 , avg=151.11, stdev=1287.21
    bw (KB/s) : min=10746, max=1983488, per=100.18%, avg=842503.94, stdev=262774.35
  cpu : usr=2.67%, sys=43.46%, ctx=14284, majf=1, minf=24
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/978992/0, short=0/0/0
     lat (usec): 50=0.02%, 100=98.30%, 250=1.06%, 500=0.01%, 750=0.01%
     lat (usec): 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.25%, 20=0.29%, 50=0.06%
     lat (msec): 100=0.01%, 250=0.01%
• Results include basic latency distribution
slide 56:
pchar
• Traceroute with bandwidth per hop!
$ pchar 10.71.83.1
[…]
 4: 10.110.80.1 (10.110.80.1)
    Partial loss:     0 / 5 (0%)
    Partial char:     rtt = 9.351109 ms, (b = 0.004961 ms/B), r2 = 0.184105
                      stddev rtt = 4.967992, stddev b = 0.006029
    Partial queueing: avg = 0.000000 ms (0 bytes)
    Hop char:         rtt = --.--- ms, bw = 1268.975773 Kbps
    Hop queueing:     avg = 0.000000 ms (0 bytes)
 5: 10.193.43.181 (10.193.43.181)
    Partial loss:     0 / 5 (0%)
    Partial char:     rtt = 25.461597 ms, (b = 0.011934 ms/B), r2 = 0.228707
                      stddev rtt = 10.426112, stddev b = 0.012653
    Partial queueing: avg = 0.000000 ms (0 bytes)
    Hop char:         rtt = 16.110487 ms, bw = 1147.210397 Kbps
    Hop queueing:     avg = 0.000000 ms (0 bytes)
[…]
• Needs love. Based on pathchar (Linux 2.0.30).
slide 57:
Benchmarking Tools
slide 58:
Tuning Tools
slide 59:
Tuning Tools
• Generic interfaces (examples below):
  – sysctl, /sys
• Many areas have custom tuning tools:
  – Applications: their own config
  – CPU/scheduler: nice, renice, taskset, ulimit, chcpu
  – Storage I/O: tune2fs, ionice, hdparm, blockdev, …
  – Network: ethtool, tc, ip, route
  – Dynamic patching: stap, kpatch
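The generic interfaces in action (my illustration; the tunable and the sda device name are examples):

$ sysctl net.ipv4.tcp_congestion_control             # read a kernel tunable
# sysctl -w net.ipv4.tcp_congestion_control=cubic    # change it (with care)
$ cat /sys/block/sda/queue/scheduler                 # the same idea via /sys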
slide 60:
Tuning Methods
• Scientific Method: Question → Hypothesis → Prediction → Test → Analysis
• Any observational or benchmarking tests you can try before tuning?
• Consider risks, and see previous tools
slide 61:
Tuning Tools
slide 62:
Static Tools
slide 63:
Static Tools
• Static Performance Tuning: check the static state and configuration of the system (examples below)
  – CPU types
  – Storage devices
  – File system capacity
  – File system and volume configuration
  – Route table
  – State of hardware
• What can be checked on a system without load
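A few ways to check that static state (my illustrative mapping, not from the slides; eth0 is an example device):

$ lscpu           # CPU types
$ df -h           # file system capacity
$ ip route        # route table
$ ethtool eth0    # interface settings and negotiated speed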
slide 64:
Static Tools
slide 65:
Tracing
slide 66:
Tracing Frameworks: Tracepoints
• Statically placed at logical places in the kernel
• Provides key event details as a “format” string (see the listing example below)
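Tracepoints can be listed and inspected with built-in interfaces; for example (my illustration, using a real block I/O tracepoint):

$ perf list 'block:*'    # tracepoints for block device I/O
$ cat /sys/kernel/debug/tracing/events/block/block_rq_issue/format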
slide 67:
Tracing Frameworks: + probes
• kprobes: dynamic kernel tracing
  – function calls, returns, line numbers
• uprobes: dynamic user-level tracing
(see the perf probe example below)
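One way to use kprobes from perf_events (my illustration; do_sys_open is a real kernel function commonly used in perf-probe examples):

# perf probe --add do_sys_open                 # create a kprobe
# perf record -e probe:do_sys_open -a sleep 5  # trace it system-wide
# perf probe --del do_sys_open                 # clean up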
slide 68:
Tracing Tools
• Options:
  – ftrace
  – perf_events
  – eBPF
  – SystemTap
  – ktap
  – LTTng
  – dtrace4linux
  – Oracle Linux DTrace
  – sysdig
• Too many choices, and many still in development
slide 69:
Imagine Linux with Tracing
• With a programmable tracer, high level tools can be written, such as:
  – iosnoop
  – iolatency
  – opensnoop
  – …
slide 70:
iosnoop
• Block I/O (disk) events with latency:
# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM       PID    TYPE  DEV    BLOCK  BYTES  LATms
5982800.302061  5982800.302679  supervise  [...]        202,1  [...]          0.62
5982800.302423  5982800.302842  supervise  [...]        202,1  [...]          0.42
5982800.304962  5982800.305446  supervise  [...]        202,1  [...]          0.48
5982800.305250  5982800.305676  supervise  [...]        202,1  [...]          0.43
[…]
# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
   -d device    # device string (eg, "202,1)
   -i iotype    # match type (eg, '*R*' for all reads)
   -n name      # process name to match on I/O issue
   -p PID       # PID to match on I/O issue
   -Q           # include queueing time in LATms
   -s           # include start time of I/O (s)
   -t           # include completion time of I/O (s)
   -h           # this usage message
   duration     # duration seconds, and use buffers
[…]
slide 71:
iolatency
• Block I/O (disk) latency distributions:
# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 2104     |######################################|
       1 -> 2       : 280      |######                                |
       2 -> 4       : 2        |#                                     |
       4 -> 8       : 0        |                                      |
       8 -> 16      : 202      |####                                  |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1144     |######################################|
       1 -> 2       : 267      |#########                             |
       2 -> 4       : 10       |#                                     |
       4 -> 8       : 5        |#                                     |
       8 -> 16      : 248      |#########                             |
      16 -> 32      : 601      |####################                  |
      32 -> 64      : 117      |####                                  |
[…]
slide 72:
opensnoop
• Trace open() syscalls showing filenames:
# ./opensnoop -t
Tracing open()s. Ctrl-C to end.
TIMEs     COMM      PID    FD   FILE
[...]     postgres  [...]  0x8  /proc/self/oom_adj
[...]     postgres  [...]  0x5  global/pg_filenode.map
[...]     postgres  [...]  0x5  global/pg_internal.init
[...]     postgres  [...]  0x5  base/16384/PG_VERSION
[...]     postgres  [...]  0x5  base/16384/pg_filenode.map
[...]     postgres  [...]  0x5  base/16384/pg_internal.init
[...]     postgres  [...]  0x5  base/16384/11725
[...]     svstat    [...]  0x4  supervise/ok
[...]     svstat    [...]  0x4  supervise/status
[...]     stat      [...]  0x3  /etc/ld.so.cache
[...]     stat      [...]  0x3  /lib/x86_64-linux-gnu/libselinux…
[...]     stat      [...]  0x3  /lib/x86_64-linux-gnu/libc.so.6
[...]     stat      [...]  0x3  /lib/x86_64-linux-gnu/libdl.so.2
[...]     stat      [...]  0x3  /proc/filesystems
[...]     stat      [...]  0x3  /etc/nsswitch.conf
[…]
slide 73:
funcgraph
• Trace a graph of kernel code flow:
# ./funcgraph -Htp 5363 vfs_read
Tracing "vfs_read" for PID 5363... Ctrl-C to end.
# tracer: function_graph
#    TIME         CPU   DURATION            FUNCTION CALLS
4346366.073832 | [...]             |  vfs_read() {
4346366.073834 |                   |    rw_verify_area() {
4346366.073834 |                   |      security_file_permission() {
4346366.073834 |                   |        apparmor_file_permission() {
4346366.073835 |       0.153 us    |          common_file_perm();
4346366.073836 |       0.947 us    |        }
4346366.073836 |       0.066 us    |        __fsnotify_parent();
4346366.073836 |       0.080 us    |        fsnotify();
4346366.073837 |       2.174 us    |      }
4346366.073837 |       2.656 us    |    }
4346366.073837 |                   |    tty_read() {
4346366.073837 |       0.060 us    |      tty_paranoia_check();
[…]
slide 74:
kprobe
• Dynamically trace a kernel function call or return, with variables, and in-kernel filtering:
# ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'
Tracing kprobe myopen. Ctrl-C to end.
postgres-1172 [000] d... 6594028.787166: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797410: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797467: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
^C
Ending tracing...
• Add -s for stack traces; -p for PID filter in-kernel.
• Quickly confirm kernel behavior; eg: did a tunable take effect?
slide 75:
Imagine Linux with Tracing
• These tools aren’t using dtrace4linux, SystemTap, ktap, or any other add-on tracer
• These tools use existing Linux capabilities
  – No extra kernel bits, not even kernel debuginfo
  – Just Linux’s built-in ftrace profiler
  – Demoed on Linux 3.2
• Solving real issues now
slide 76:
ftrace
• Added by Steven Rostedt and others since 2.6.27
• Already enabled on our servers (3.2+)
  – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
  – Use directly via /sys/kernel/debug/tracing (raw example below)
• My front-end tools to aid usage
  – https://github.com/brendangregg/perf-tools
  – Unsupported hacks: see WARNINGs
  – Also see the trace-cmd front-end, as well as perf
• lwn.net: “Ftrace: The Hidden Light Switch”
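Using the raw ftrace interface directly (my illustration of what the perf-tools front-ends script for you):

# cd /sys/kernel/debug/tracing
# echo function_graph > current_tracer    # pick a tracer
# echo vfs_read > set_graph_function      # filter to one kernel function
# cat trace_pipe                          # stream events; Ctrl-C to stop
# echo nop > current_tracer               # clean up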
slide 77:
My perf-tools (so far…)
slide 78:
Tracing Summary
• ftrace
• perf_events
• eBPF
• SystemTap
• ktap
• LTTng
• dtrace4linux
• sysdig
slide 79:
perf_events
• aka the “perf” command
• In Linux. Add from linux-tools-common, …
• Powerful multi-tool and profiler
  – interval sampling, CPU performance counter events
  – user and kernel dynamic tracing
  – kernel line tracing and local variables (debuginfo)
  – kernel filtering, and in-kernel counts (perf stat)
• Not very programmable, yet
  – limited kernel summaries. May improve with eBPF.
slide 80:
perf_events Example
# perf record -e skb:consume_skb -ag
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.065 MB perf.data (~2851 samples) ]
# perf report
[...]
74.42%  swapper  [kernel.kallsyms]  [k] consume_skb
        |
        --- consume_skb
            arp_process
            arp_rcv
            __netif_receive_skb_core
            __netif_receive_skb
            netif_receive_skb
            virtnet_poll
            net_rx_action
            __do_softirq
            irq_exit
            do_IRQ
            ret_from_intr
            default_idle
            cpu_idle
            start_secondary
[…]
• Summarizing stack traces for a tracepoint
• perf_events can do many things – hard to pick just one example
slide 81:
eBPF
• Extended BPF: programs on tracepoints
  – High performance filtering: JIT
  – In-kernel summaries: maps
• Linux in 3.18? Enhance perf_events/ftrace/…?
# ./bitesize 1
writing bpf-5 -> /sys/kernel/debug/tracing/events/block/block_rq_complete/filter
I/O sizes:
    Kbytes        : Count
    4 -> 7        : 131
    8 -> 15       : 32
    16 -> 31      : 1
    32 -> 63      : 46
    64 -> 127     : 0
    128 -> 255    : 15
[…]
(the histogram above is an in-kernel summary)
slide 82:
SystemTap
• Fully programmable, fully featured
• Compiles tracing programs into kernel modules
  – Needs a compiler, and takes time
• “Works great on Red Hat”
  – I keep trying on other distros and have hit trouble in the past; make sure you are on the latest version.
  – I’m liking it a bit more after finding ways to use it without kernel debuginfo (a difficult requirement in our environment). Work in progress.
• Ever be mainline?
slide 83:
ktap
• Sampling, static & dynamic tracing
• Lightweight, simple. Uses bytecode.
• Suited for embedded devices
• Development appears suspended after suggestions to integrate with eBPF (which itself is in development)
• ktap + eBPF would be awesome: easy, lightweight, fast. Likely?
slide 84:
sysdig
• sysdig: innovative new tracer. Simple expressions:
sysdig fd.type=file and evt.failed=true
sysdig evt.type=open and fd.name contains /etc
sysdig -p"%proc.name %fd.name" "evt.type=accept and proc.name!=httpd"
• Replacement for strace? (or “perf trace” will)
• Programmable “chisels”. Eg, one of mine:
# sysdig -c fileslower 1
TIME                     PROCESS  TYPE  LAT(ms)  FILE
2014-04-13 20:40:43.973  cksum    read        2  /mnt/partial.0.0
2014-04-13 20:40:44.187  cksum    read        1  /mnt/partial.0.0
2014-04-13 20:40:44.689  cksum    read        2  /mnt/partial.0.0
[…]
• Currently syscalls and user-level processing only. It is optimized, but I’m not sure it can be enough for kernel tracing
slide 85:
Present & Future
• Present:
  – ftrace can serve many needs today
  – perf_events some more, esp. with debuginfo
  – ad hoc SystemTap, ktap, … as needed
• Future:
  – ftrace/perf_events/ktap with eBPF, for a fully featured and mainline tracer?
  – One of the other tracers going mainline?
slide 86:
The Tracing Landscape, Oct 2014 (my opinion)
(Figure: a plot of Ease of use, from “brutal” to “less brutal”, against Stage of Development, from alpha to mature, with Scope & Capability as a third dimension; plotted are sysdig, perf, stap, ftrace, dtrace4L., ktap, and eBPF.)
slide 87:
In Summary
slide 88:
In Summary…
• Plus diagrams for benchmarking, tuning, tracing
• Try to start with the questions (methodology), to help guide your use of the tools
• I hopefully turned some unknown unknowns into known unknowns
slide 89:
References & Links
• Systems Performance: Enterprise and the Cloud, Prentice Hall, 2014
• http://www.brendangregg.com/linuxperf.html
• http://www.brendangregg.com/perf.html#FlameGraphs
• nicstat: http://sourceforge.net/projects/nicstat/
• tiptop: http://tiptop.gforge.inria.fr/
  – Tiptop: Hardware Performance Counters for the Masses, Erven Rohou, Inria Research Report 7789, Nov 2011.
• ftrace & perf-tools:
  – https://github.com/brendangregg/perf-tools
  – http://lwn.net/Articles/608497/
• MSR tools: https://github.com/brendangregg/msr-cloud-tools
• pcstat: https://github.com/tobert/pcstat
• eBPF: http://lwn.net/Articles/603983/
• ktap: http://www.ktap.org/
• SystemTap: https://sourceware.org/systemtap/
• sysdig: http://www.sysdig.org/
• http://www.slideshare.net/brendangregg/linux-performance-analysis-and-tools
• Tux by Larry Ewing; Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
slide 90:
Thanks
• Questions?
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• bgregg@netflix.com
• @brendangregg