Velocity 2015: Linux Perf Tools
Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk
Tutorial slides for O'Reilly Velocity Santa Clara 2015, by Brendan Gregg.
Description: "There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well as observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.
This tutorial updates and extends an earlier talk that summarized the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation."
PDF: Velocity2015_LinuxPerfTools.pdf
Keywords (from pdftotext):
slide 1:
Linux Performance Tools • Brendan Gregg, Senior Performance Architect, Performance Engineering Team • bgregg@netflix.com • @brendangregg
slide 2:
This Tutorial • A tour of many Linux performance tools – To show you what can be done – With guidance for how to do it • This includes objectives, discussion, live demos – See the video of this tutorial • Observability, Benchmarking, Tuning, Static
slide 3:
• Massive AWS EC2 Linux cloud – 10s of thousands of cloud instances • FreeBSD for content delivery – ~33% of US Internet traffic at night • Over 50M subscribers – Recently launched in ANZ • Use Linux server tools as needed – After cloud monitoring (Atlas, etc.) and instance monitoring (Vector) tools
slide 4:
Agenda • Methodologies • Tools • Tool Types: – Observability – Benchmarking – Tuning – Static • Profiling • Tracing
slide 5:
Methodologies
slide 6:
Methodologies • Objectives: – Recognize the Streetlight Anti-Method – Perform the Workload Characterization Method – Perform the USE Method – Learn how to start with the questions, before using tools – Be aware of other methodologies
slide 7:
My system is slow… DEMO & DISCUSSION
slide 8:
Methodologies • There are dozens of performance tools for Linux – Packages: sysstat, procps, coreutils, … – Commercial products • Methodologies can provide guidance for choosing and using tools effectively • A starting point, a process, and an ending point
slide 9:
Anti-Methodologies • The lack of a deliberate methodology…
slide 10:
Streetlight Anti-Method
slide 11:
Drunk Man Anti-Method
slide 12:
Blame Someone Else Anti-Method
slide 13:
Actual Methodologies • Problem Statement Method • Workload Characterization Method • USE Method • Off-CPU Analysis • CPU Profile Method • RTFM Method • Active Benchmarking (covered later) • Static Performance Tuning (covered later)
slide 14:
Problem Statement Method 1. What makes you think there is a performance problem? 2. Has this system ever performed well? 3. What has changed recently? (Software? Hardware? Load?) 4. Can the performance degradation be expressed in terms of latency or run time? 5. Does the problem affect other people or applications (or is it just you)? 6. What is the environment? Software, hardware, instance types? Versions? Configuration?
slide 15:
Workload Characterization
slide 16:
The USE Method • For every resource, check: Utilization, Saturation, Errors
slide 17:
USE Method for Hardware • For every resource, check: Utilization, Saturation, Errors • Including busses & interconnects
slide 18:
Linux USE Method Example: http://www.brendangregg.com/USEmethod/use-linux.html
slide 19:
Off-CPU Analysis • On-CPU Profiling • Off-CPU Profiling (everything else) • Thread State Transitions
slide 20:
CPU Profile Method 1. Take a CPU profile 2. Understand all software in profile > 1% • Discovers a wide range of performance issues by their CPU usage • Flame Graph • Narrows software to study
slide 21:
RTFM Method • How to understand performance tools or metrics: 1. Man pages 2. Books 3. Web search 4. Co-workers 5. Prior talk slides/video (this one!) 6. Support services 7. Source code 8. Experimentation 9. Social
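For example (not from the slides), the USE Method maps directly onto commands covered later in the tutorial; a minimal sketch, assuming the sysstat package is installed:

$ mpstat -P ALL 1      # CPU utilization, per CPU
$ vmstat 1             # CPU saturation: "r" > CPU count; memory saturation: "si"/"so" swapping
$ free -m              # memory utilization
$ iostat -xz 1         # disk utilization (%util) and saturation (avgqu-sz, await)
$ sar -n DEV,EDEV 1    # network utilization (rxkB/s, txkB/s) and error counters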
slide 22:
Tools
slide 23:
Tools • Objectives: – Perform the USE Method for resource utilization – Perform Workload Characterization for disks, network – Perform the CPU Profile Method using flame graphs – Have exposure to various observability tools: • Basic: vmstat, iostat, mpstat, ps, top, … • Intermediate: tcpdump, netstat, nicstat, pidstat, sar, … • Advanced: ss, slabtop, perf_events, … – Perform Active Benchmarking – Understand tuning risks – Perform Static Performance Tuning
slide 24:
Command Line Tools • Useful to study even if you never use them: GUIs and commercial products often use the same interfaces – Kernel: /proc, /sys, … • $ vmstat 1 [output: procs r/b, memory swpd/free/buff/cache, swap si/so, … columns]
slide 25:
Tool Types • Type / Characteristic: – Observability: watch activity. Safe, usually, depending on resource overhead. – Benchmarking: load test. Caution: production tests can cause issues due to contention. – Tuning: change. Danger: changes could hurt performance, now or later with load. – Static: check configuration. Should be safe.
slide 26:
Observability Tools
slide 27:
How do you measure these?
slide 28:
Observability Tools: Basic • uptime • top (or htop) • vmstat • iostat • mpstat • free
slide 29:
uptime • Load averages > # of CPUs may mean CPU saturation – Don't spend more than 5 seconds studying these
slide 30:
top (or htop) • System and per-process interval summary: $ top [output: load average, Tasks, Cpu(s) %us/%sy/%ni/%id/%wa/%hi/%si/%st, Mem/Swap totals, then per-process PID, USER, VIRT, RES, %CPU, %MEM, TIME+, COMMAND] • %CPU is summed across all CPUs • Can miss short-lived processes (atop won't) • Can consume noticeable CPU to read /proc
slide 31:
htop
slide 32:
ps • Process status listing (eg, "ASCII art forest"): $ ps -ef f [output: UID, PID, PPID, STIME, TTY, TIME, CMD, with \_ tree indentation showing sshd → bash → ps] • Custom fields: $ ps -eo user,sz,rss,minflt,majflt,pcpu,args
slide 33:
vmstat • Virtual memory statistics and more: $ vmstat -Sm 1 [output: procs, memory, swap, io, system, cpu column groups] • USAGE: vmstat [interval [count]] • First output line has some summary since boot values – Should be all; partial is confusing • High level CPU summary – "r" is runnable tasks
slide 34:
iostat • Block I/O (disk) stats. 1st output is since boot. $ iostat -xmdz 1 [output: per-device rrqm/s, wrqm/s, r/s, w/s, rMB/s, wMB/s (the workload), then avgqu-sz, await, r_await, w_await (the resulting performance)] • Very useful set of stats
slide 35:
mpstat • Multi-processor statistics, per-CPU: $ mpstat -P ALL 1 [output: per-CPU %usr, %nice, %sys, %iowait, %irq, %soft, %steal, %guest, %idle] • Look for unbalanced workloads, hot CPUs.
slide 36:
free • Main memory usage: $ free -m [output: total, used, free, shared, buffers, cached; -/+ buffers/cache; Swap] • buffers: block device I/O cache • cached: virtual page cache
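Workload characterization for disks and network (an objective above) can be sketched with these tools plus the intermediate ones that follow; a minimal example, not from the slides:

$ iostat -xmdz 1    # disk workload: r/s, w/s, rMB/s, wMB/s; avg I/O size = MB/s ÷ IOPS
$ pidstat -d 1      # which processes are issuing the disk I/O
$ sar -n DEV 1      # network workload: rxpck/s, txpck/s, rxkB/s, txkB/s per interface
$ nicstat 1         # plus average packet sizes (rAvs, wAvs) and interface %Util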
slide 37:
Latency is now much higher… DEMO & DISCUSSION
slide 38:
Observability Tools: Basic
slide 39:
Observability Tools: Intermediate • strace • tcpdump • netstat • nicstat • pidstat • swapon • lsof • sar (and collectl, dstat, etc.)
slide 40:
strace • System call tracer: $ strace -tttT -p 313 [output: timestamped syscalls with durations, eg getgroups(), close(), write("LinuxCon 2014!\n")] • Eg, -ttt: time (us) since epoch; -T: syscall time (s) • Translates syscall args – Very helpful for solving system usage issues • Currently has massive overhead (ptrace based) – Can slow the target by > 100x. Use extreme caution.
slide 41:
tcpdump • Sniff network packets for post analysis: $ tcpdump -i eth0 -w /tmp/out.tcpdump ; # tcpdump -nr /tmp/out.tcpdump | head [output: timestamped IP packets with Flags and seq numbers] • Study packet sequences with timestamps (us) • CPU overhead optimized (socket ring buffers), but can still be significant. Use caution.
slide 42:
netstat • Various network protocol statistics using -s • A multi-tool: -i: interface stats; -r: route table; default: list conns; -p: shows process details • Per-second interval with -c • $ netstat -s [output: Tcp: active/passive connection openings, failed attempts, resets, segments received/sent/retransmitted; TcpExt: invalid SYN cookies, delayed acks, …]
slide 43:
nicstat • Network interface stats, iostat-like output: $ ./nicstat 1 [output: per-interface rKB/s, wKB/s, rPk/s, wPk/s, rAvs, wAvs, %Util, Sat] • Check network throughput and interface %util • I wrote this years ago; Tim Cook ported it to Linux
slide 44:
pidstat • Very useful process stats. eg, by-thread, disk I/O: $ pidstat -t 1 [output: per-thread TGID, TID, %usr, %system, %guest, %CPU] ; $ pidstat -d 1 [output: per-process kB_rd/s, kB_wr/s, kB_ccwr/s] • I usually prefer this over top(1)
slide 45:
swapon • Show swap device usage: $ swapon -s [output: Filename, Type, Size, Used, Priority] • If you have swap enabled…
slide 46:
lsof • More a debug tool, lsof(8) shows file descriptor usage, which for some apps, equals current active network connections: # lsof -iTCP -sTCP:ESTABLISHED [output: sshd, postgres, etc., with their ESTABLISHED TCP connections] • I'd prefer to: ls /proc/PID/fd | wc -l
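That /proc trick extends into a quick fd census across all processes; a minimal sketch (not from the slides; needs root to see other users' processes):

# count open file descriptors per process, highest first:
for p in /proc/[0-9]*; do
    echo "$(ls "$p/fd" 2>/dev/null | wc -l) ${p##*/} $(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head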
slide 47:
sar • System Activity Reporter. Many stats, eg: $ sar -n TCP,ETCP,DEV 1 [output: per-interface rxpck/s, txpck/s, rxkB/s, txkB/s, rxmcst/s; TCP active/s, passive/s, iseg/s, oseg/s; ETCP atmptf/s, estres/s, retrans/s, isegerr/s, orsts/s] • Archive or live mode: (interval [count]) • Well designed. Header naming convention, logical groups: TCP, ETCP, DEV, EDEV, …
slide 48:
Observability: sar
slide 49:
Other Tools • You may also use collectl, atop, dstat, or another measure-all tool • The tool isn't important – it's important to have a way to measure everything • In cloud environments, you are probably using a monitoring product, developed in-house or commercial. – We develop Atlas for cloud-wide monitoring, and Vector for instance-level analysis (both NetflixOSS) – Same method applies…
slide 50:
How does your monitoring tool measure these?
slide 51:
App is taking forever… DEMO & DISCUSSION
slide 52:
Observability Tools: Intermediate
slide 53:
Advanced Observability Tools • Misc: – ltrace, ss, iptraf, ethtool, snmpget, lldptool, iotop, blktrace, slabtop, /proc, pcstat • CPU Performance Counters: – perf_events, tiptop, rdmsr • Advanced Tracers: – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng, dtrace4linux, sysdig • Some selected demos…
slide 54:
ss • More socket statistics: $ ss -mop [output: per-socket State, Recv-Q, Send-Q, local/peer addresses, users:(...), mem:(...), timer:(...)] ; $ ss -i [output: TCP internals per socket: cubic, wscale, rto, rtt, ato, cwnd, send rate, rcv_space]
slide 55:
iptraf
slide 56:
iotop • Block device I/O (disk) by process: $ iotop [output: Total DISK READ/WRITE; per-thread TID, PRIO, USER, DISK READ, DISK WRITE, SWAPIN, IO%, COMMAND — eg flush-202:1 and java] • Needs kernel support enabled – CONFIG_TASK_IO_ACCOUNTING
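ss can also filter in-kernel, which beats grepping its output on busy systems; a small sketch (not from the slides; port 5433 is the postgres port seen in the ss output above):

$ ss -tnp state established '( sport = :5433 or dport = :5433 )'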
slide 57:
slabtop • Kernel slab allocator memory usage: $ slabtop [output: Active/Total Objects, Slabs, Caches, Size; per-cache OBJS, ACTIVE, USE, OBJ SIZE, SLABS, CACHE SIZE, NAME — eg buffer_head, dentry, kmalloc-64, xfs_inode, radix_tree_node, ext4_inode_cache]
slide 58:
pcstat • Show page cache residency by file: # ./pcstat data0* [table: Name, Size, Pages, Cached, Percent — eg data00 100% cached, data05 0%] • Uses the mincore(2) syscall. Useful for database performance analysis.
slide 59:
perf_events • Provides the "perf" command • In Linux source code: tools/perf – Usually pkg added by linux-tools-common, etc. • Multi-tool with many capabilities – CPU profiling – PMC profiling – Static & dynamic tracing • Covered later in Profiling & Tracing
slide 60:
slide 61:
rdmsr • Model Specific Registers (MSRs), unlike PMCs, can be read by default in Xen guests – Timestamp clock, temp, power, … – Use rdmsr(1) from the msr-tools package to read them – From https://github.com/brendangregg/msr-cloud-tools: ec2-guest# ./showboost [output: TIME, C0_MCYC, C0_ACYC, UTIL, RATIO, MHz — real CPU MHz] ; ec2-guest# ./cputemp 1 [output: per-CPU temperature]
slide 62:
More Advanced Tools… • Some others worth mentioning: – ltrace: library call tracer – ethtool: mostly interface tuning; some stats – snmpget: SNMP network host statistics
slide 63:
Advanced Tracers • Many options on Linux: – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng, dtrace4linux, sysdig • Most can do static and dynamic tracing – Static: pre-defined events (tracepoints) – Dynamic: instrument any software (kprobes, uprobes). Custom metrics on-demand. Catch all. • Many are in development
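Before picking among these tracers, it helps to check what the running kernel already supports; a minimal sketch (not from the slides; assumes root and a mounted debugfs):

# cat /sys/kernel/debug/tracing/available_tracers    # ftrace tracers in this kernel
# perf list 'block:*'                                # static tracepoints, by subsystem
# grep -E 'CONFIG_KPROBES=|CONFIG_UPROBES=' /boot/config-$(uname -r)   # dynamic probe support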
slide 64:
Linux Observability Tools
slide 65:
Linux Observability Tools
slide 66:
Benchmarking Tools
slide 67:
Benchmarking Tools • Multi: – UnixBench, lmbench, sysbench, perf bench • FS/disk: – dd, hdparm, fio • App/lib: – ab, wrk, jmeter, openssl • Networking: – ping, hping3, iperf, ttcp, traceroute, mtr, pchar
slide 68:
Benchmarking • ~100% of benchmarks are wrong • Results are usually misleading: you benchmark A, but actually measure B, and conclude you measured C • Common mistakes: – Testing the wrong target: eg, FS cache instead of disk – Choosing the wrong target: eg, disk instead of FS cache … doesn't resemble real world usage – Invalid results: eg, bugs • The energy needed to refute benchmarks is multiple orders of magnitude bigger than to run them
slide 69:
Active Benchmarking
slide 70:
lmbench • CPU, memory, and kernel micro-benchmarks • Eg, memory latency by stride size: $ lat_mem_rd 100m 128 > out.latencies [plot: latency steps at L1, L2, L3 cache and main memory]
slide 71:
fio • FS or disk I/O micro-benchmarks: $ fio --name=seqwrite --rw=write --bs=128k --size=122374m [output: bw, iops, runt, clat/lat min/max/avg/stdev, cpu, IO depths, and a latency distribution by usec/msec buckets] • Results include basic latency distribution
slide 72:
pchar • Traceroute with bandwidth per hop! $ pchar 10.71.83.1 [output: per-hop partial loss, rtt, stddev, queueing, and estimated hop bandwidth (Kbps)] • Needs love. Based on pathchar (Linux 2.0.30).
slide 73:
Benchmarking Tools
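Active Benchmarking (slide 69) means analyzing the benchmark with observability tools while it runs, to confirm what is actually being measured; a minimal sketch of the idea (not from the slides; the fio parameters are illustrative):

# terminal 1: generate the load
$ fio --name=seqwrite --rw=write --bs=128k --size=1g
# terminal 2: is it really testing the disks, or the FS cache?
$ iostat -xmdz 1    # device-level throughput and await during the run
$ vmstat 1          # or is it CPU or memory limited?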
slide 74:
Tuning Tools
slide 75:
Tuning Tools • Generic interfaces: – sysctl, /sys • Many areas have custom tuning tools: – Applications: their own config – CPU/scheduler: nice, renice, taskset, ulimit, chcpu – Storage I/O: tune2fs, ionice, hdparm, blockdev, … – Network: ethtool, tc, ip, route – Dynamic patching: stap, kpatch
slide 76:
Tuning Methods • Scientific Method: 1. Question 2. Hypothesis 3. Prediction 4. Test 5. Analysis • Any observational or benchmarking tests you can try before tuning? • Consider risks, and see previous tools
slide 77:
Tuning Tools
slide 78:
Static Performance Tuning
slide 79:
Static Performance Tuning
slide 80:
CPU Types & Flags • $ more /proc/cpuinfo [output: processor, vendor_id, cpu family, model name: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, stepping, microcode, cpu MHz: 1600.000, cache size, siblings, cpu cores, flags: fpu vme … sse4_1 sse4_2 … avx …] • Note the CPU speed shown
slide 81:
CPU Frequency Scaling • Kernel may be configured to dynamically modify CPU frequency: # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies ; # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor → ondemand – See Documentation/cpu-freq/governors.txt, and scaling_governor == performance • Not to be confused with Intel Turbo Boost (which is H/W)
slide 82:
Storage Devices • # cat /proc/scsi/scsi ; # lsscsi [output: attached devices — ATA ST3320413AS disk /dev/sda, PLDS DVD-RW DH16ABSH /dev/sr0] • Micro-benchmarking disks (not file systems!) is also useful for understanding their characteristics
slide 83:
Routing
slide 84:
etc… • System messages: dmesg • Network interface config: ifconfig -a; ip link • File system capacity: df -h • Volume config: mdadm --misc -D /dev/md0 • Storage device info: smartctl • NUMA config: numactl -s; numactl -H • PCI info: lspci • Installed kernel modules: lsmod • Root crontab config: crontab -l • Services: service --status-all
slide 85:
Static Performance Tuning
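Those static checks are easy to bundle into a single snapshot script, to diff between hosts or over time; a minimal sketch (not from the slides; the chosen tunables are illustrative):

#!/bin/bash
# one-shot static configuration snapshot
uname -a
dmesg | tail -n 20      # recent kernel messages
df -h                   # file system capacity
ip link                 # interface config
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null
sysctl vm.swappiness net.core.somaxconn    # example tunables of interest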
slide 86:
Profiling
slide 87:
Profiling • Objectives: – Profile CPU usage by stack sampling – Generate CPU flame graphs – Understand gotchas with stacks & symbols
slide 88:
CPU Profiling • Record stacks at a timed interval: simple and effective – Pros: Low (deterministic) overhead – Cons: Coarse accuracy, but usually sufficient [diagram: stack samples over time — syscall, on-CPU, off-CPU, block, interrupt]
slide 89:
perf_events • Introduced earlier: multi-tool, profiler. Provides "perf". usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS] • The most commonly used perf commands: annotate, archive, bench, buildid-cache, buildid-list, data, diff, evlist, inject, kmem, kvm, list, lock, mem, record, report, sched, script, stat, test, timechart, top, trace, probe • See 'perf help COMMAND' for more information on a specific command.
slide 90:
perf_events: CPU profiling • Sampling full stack traces at 99 Hertz, for 30 secs: # perf record -F 99 -ag -- sleep 30 ; # perf report -n --stdio [output: sample percentages with call chains, eg _raw_spin_lock ← try_to_wake_up ← … ← sock_def_readable; 78,000 lines truncated]
slide 91:
perf_events: Full "report" Output
slide 92:
… as a Flame Graph
slide 93:
perf_events: Flame Graphs • git clone --depth 1 https://github.com/brendangregg/FlameGraph ; cd FlameGraph ; perf record -F 99 -a -g -- sleep 30 ; perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg • Flame Graphs: – x-axis: alphabetical stack sort, to maximize merging – y-axis: stack depth – color: random, or hue can be a dimension (eg, diff) • Interpretation: – Top edge is on-CPU, beneath it is ancestry • Currently made from Perl + JavaScript & SVG • Easy to get working – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
slide 94:
Mysterious CPU consumer… DEMO & DISCUSSION
slide 95:
perf_events: Workflow • Typical workflow: list events (perf list) → count events (perf stat) → capture stacks (perf record) → perf.data → dump profile (perf report) or flame graph visualization
slide 96:
perf_events: CPU Flame Graph [annotated: Kernel TCP/IP, GC, Locks, epoll, Idle thread; broken Java stacks (missing frame pointer)]
slide 97:
perf_events: Mixed-Mode CPU Flame Graph [annotated: Kernel (C), JVM (C++), fixed Java stacks]
slide 98:
perf_events: Gotchas • Stack traces and symbols often don't work. – Can be a significant project to fix them. It is worth it! • C: – stacks: -fno-omit-frame-pointer • Java: – stacks on x86: -XX:+PreserveFramePointer (JDK-8068945 for JDK9, JDK-8072465 for JDK8) – symbols: perf-map-agent (other solutions exist) • Node.js: – symbols: run v8 with --perf_basic_prof • http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
slide 99:
perf_events: Counters • Performance Monitoring Counters (PMCs): $ perf list | grep -i hardware [output: cpu-cycles OR cycles, stalled-cycles-frontend, stalled-cycles-backend, instructions, branch-misses, bus-cycles, L1-dcache-loads, L1-dcache-load-misses (Hardware/Hardware cache events), rNNN raw events, mem breakpoints] • Identify CPU cycle breakdowns, esp. stall types • PMCs not enabled by-default in clouds (yet) • Can be time-consuming to use (CPU manuals) – Please develop front-ends. Eg, tiptop.
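perf stat (the "count events" step of the workflow above) is the quick way to use these counters; a minimal sketch (not from the slides; on cloud guests without PMC access these report "<not supported>"):

# perf stat -a -- sleep 10                          # default counters, system-wide, 10 secs
# perf stat -a -e cycles,instructions -- sleep 10   # enough to compute instructions per cycle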
slide 100:
Tracing
slide 101:
Tracing • Objectives: – Understand frameworks: tracepoints, kprobes, uprobes – Understand mainline tracers: ftrace, perf_events, eBPF – Awareness of other tracers: SystemTap, LTTng, ktap, sysdig – Awareness of what tracing can accomplish (eg, perf-tools)
slide 102:
Tracing Frameworks: Tracepoints • Statically placed at logical places in the kernel • Provides key event details as a "format" string
slide 103:
Tracing Frameworks: + probes • kprobes: dynamic kernel tracing – function calls, returns, line numbers • uprobes: dynamic user-level tracing
slide 104:
Linux Tracing Tools • ftrace, perf_events, eBPF, SystemTap, ktap, LTTng, dtrace4linux, OEL DTrace, sysdig • Many choices (too many?), all still in development
slide 105:
Linux Tracing is Magic! (Thanks Deirdré Straughan & General Zoi's Pony Creator)
slide 106:
Choosing a Tracer • Some companies standardize on one tracer – eg, SystemTap, LTTng, …
slide 107:
Choosing a Tracer • My approach: study what Linux already has built-in (perf_events, ftrace, eBPF?) → is it sufficient? Purpose? live tracing, counting, …
slide 108:
ftrace
slide 109:
ftrace • Added by Steven Rostedt and others since 2.6.27 • Already enabled on our servers (3.2+) – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, … – Use directly via /sys/kernel/debug/tracing: linux-4.0.0+# ls /sys/kernel/debug/tracing/ [available_events, available_filter_functions, available_tracers, buffer_size_kb, current_tracer, events, function_profile_enabled, kprobe_events, kprobe_profile, options, per_cpu, set_event, set_ftrace_filter, set_ftrace_pid, snapshot, stack_trace, trace, trace_pipe, tracing_on, …] – See Linux source: Documentation/trace/ftrace.txt
slide 110:
ftrace Front-Ends • Steven wrote a front-end: trace-cmd – Multi-tool, works well • I've developed the "perf-tools" front-ends – https://github.com/brendangregg/perf-tools – Both single & multi-purpose, Unix-like – Unsupported hacks: see WARNINGs • perf-tools: – single-purpose: iosnoop, iolatency, opensnoop – multi-tools: funccount, funcgraph, kprobe
slide 111:
iosnoop • Block I/O (disk) events with latency: # ./iosnoop -ts [output: STARTs, ENDs, COMM, PID, TYPE, DEV, BLOCK, BYTES, LATms] ; # ./iosnoop -h [USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration] — match device, I/O type, PID, or process name; include queueing time and start/completion timestamps]
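These front-ends drive the raw tracefs files listed on slide 109, which can also be used by hand; a minimal sketch (not from the slides; assumes root and debugfs mounted at the usual location):

# cd /sys/kernel/debug/tracing
# echo 1 > events/block/block_rq_insert/enable   # enable a static tracepoint
# cat trace_pipe                                 # watch events live; Ctrl-C to stop
# echo 0 > events/block/block_rq_insert/enable   # disable it again
# echo > trace                                   # clear the buffer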
slide 112:
iolatency • Block I/O (disk) latency distributions: # ./iolatency [output per interval: >=(ms) .. <(ms) buckets 0->1, 1->2, 2->4, 4->8, 8->16, 16->32, 32->64 with counts and ASCII histogram bars]
slide 113:
opensnoop • Trace open() syscalls showing filenames: # ./opensnoop -t [output: TIMEs, COMM, PID, FD, FILE — eg postgres opening pg_stat_tmp/pgstat.stat and base/16384/*, stat reading /etc/ld.so.cache, libc, /proc/filesystems, /etc/nsswitch.conf]
slide 114:
tpoint • Who is creating disk I/O, and of what type? # ./tpoint -H block:block_rq_insert [output: TASK-PID, CPU#, TIMESTAMP, block_rq_insert: dev, type (R/W), offset + size in sectors, task — eg flush-9:0 writes, java reads] • tpoint traces a given tracepoint. -H prints the header.
slide 115:
tpoint -l • # ./tpoint -l [lists available tracepoints: block:block_bio_backmerge, block:block_bio_bounce, block:block_bio_complete, …, block:block_rq_insert, block:block_rq_issue, …] ; # ./tpoint -l | wc -l → 1257
slide 116:
funccount • Count a kernel function call rate: # ./funccount -i 1 'bio_*' [output: FUNC, COUNT — eg bio_alloc 536, bio_endio 536, bio_add_page 1004] • Counts are in-kernel, for low overhead – -i: set an output interval (seconds), otherwise until Ctrl-C
slide 117:
funcgraph • Trace a graph of kernel code flow: # ./funcgraph -Htp 5363 vfs_read [function_graph output: vfs_read() { rw_verify_area() { security_file_permission() … } tty_read() { tty_paranoia_check() … } } with per-call durations in us]
slide 118:
kprobe • Dynamically trace a kernel function call or return, with variables, and in-kernel filtering: # ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"' [output: postgres opens of pg_stat_tmp/pgstat.stat] • Add -s for stack traces; -p for PID filter in-kernel. • Quickly confirm kernel behavior; eg: did a tunable take effect?
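The kprobe tool wraps the kernel's kprobe_events interface; the same probe can be set by hand, as a sketch (not from the slides; assumes root and x86-64, where %si holds the second argument register):

# cd /sys/kernel/debug/tracing
# echo 'p:myopen do_sys_open filename=+0(%si):string' >> kprobe_events
# echo 1 > events/kprobes/myopen/enable
# cat trace_pipe                        # Ctrl-C to stop
# echo 0 > events/kprobes/myopen/enable
# echo '-:myopen' >> kprobe_events      # remove the probe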
slide 119:
perf-tools (so far…)
slide 120:
perf-tools (so far…)
slide 121:
perf_events
slide 122:
perf_events • Powerful profiler (covered earlier) • … and tracer: – User-level and kernel dynamic tracing – Kernel line tracing and local variables (debuginfo) – Kernel filtering expressions – Efficient in-kernel counts (perf stat) • Not very programmable, yet – Limited kernel summaries. Should improve with eBPF.
slide 123:
perf_events Listing Events
slide 124:
Linux Event Sources
slide 125:
perf_events Tracing Tracepoints • # perf record -e block:block_rq_complete -a sleep 10 ; # perf script [output: per-event lines — COMM, PID, CPU, timestamp, block:block_rq_complete: dev, type, offset + sectors] • If -g is used in "perf record", stack traces are included • If "perf script" output is too verbose, try "perf report", or making a flame graph
slide 126:
perf_events Report • # perf record -e skb:consume_skb -ag ; # perf report -n --stdio [output: 74.42% consume_skb ← arp_process ← arp_rcv ← __netif_receive_skb ← … ← cpu_idle — summarizing stack traces for a tracepoint]
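perf's dynamic tracing (the probe subcommand) follows the same record/script workflow; a minimal sketch (not from the slides; assumes a kprobes-enabled kernel):

# perf probe --add do_sys_open                   # create a dynamic tracepoint
# perf record -e probe:do_sys_open -a -- sleep 5
# perf script                                    # dump the recorded events
# perf probe --del do_sys_open                   # clean up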
slide 127:
eBPF
slide 128:
eBPF • Extended BPF: programs on tracepoints – High performance filtering: JIT – In-kernel summaries: maps – Developed by Alexei Starovoitov (Plumgrid) • Currently being integrated in parts (Linux 3.18, 4.1, …) • # ./bitehist [in-kernel summary: I/O size (kbytes) histogram with count distribution — eg 4->7: 3395, 32->63: 738]
slide 129:
eBPF • Example in-kernel latency heat map
slide 130:
Other Tracers
slide 131:
SystemTap • Fully programmable, fully featured – Including access to user-level tracepoints • Compiles tracing programs into kernel modules – Needs a compiler, and takes time • "Works great on Red Hat" – I keep trying on other distros and have hit trouble in the past – Make sure you are on the latest version (>= 2.6) • "Works great with kernel debuginfo" – Suited for kernel developers who have it handy – A difficult requirement for our cloud environment
slide 132:
"Lightweight" SystemTap • SystemTap can be used without kernel debuginfo – Makes life harder, but some tasks are still possible – providers: nd_syscall, kprobe.function, kernel.trace, … – args via: int_arg(), uint_arg(), pointer_arg(), user_string() • Something I've experimented with. Examples: – https://github.com/brendangregg/systemtap-lwtools/ • # stap -e 'global a; probe nd_syscall.write { a… […]
slide 133:
ktap • Was a very promising new Linux tracer: – Sampling, static & dynamic tracing – Lightweight, simple. Uses bytecode. – Suited for embedded devices • Development suspended while eBPF integrates • Will it restart?
slide 134:
sysdig • sysdig: Innovative new tracer. Simple expressions: sysdig fd.type=file and evt.failed=true ; sysdig evt.type=open and fd.name contains /etc ; sysdig -p"%proc.name %fd.name" "evt.type=accept and proc.name!=httpd" • Replacement for strace? (or "perf trace" will) • Programmable "chisels". Eg, one of mine: # sysdig -c fileslower 1 [output: TIME, PROCESS, TYPE, LAT(ms), FILE — eg cksum reads of /mnt/partial.0.0 at 1–2 ms] • Currently syscalls and user-level processing only – I'm not sure it can be optimized enough for kernel tracing, unless it adopts eBPF for in-kernel processing & summaries
slide 135:
Present & Future • Present: – ftrace & perf_events solving many needs today: PMC profiling, stack profiling, tracepoint & dynamic tracing, … • Expected Future: – eBPF for kernel summaries & advanced programs – eBPF perf integration to improve ease of use • Possible Future: – eBPF high level language (ktap?) – ftrace/eBPF integration – Other tracer eBPF integration (SystemTap, LTTng, sysdig?) – One of the other tracers going mainline?
slide 136:
The Tracing Landscape, May 2015 (my opinion) [scatter plot — x-axis: Stage of Development (alpha → mature); y-axis: Ease of use (brutal → less brutal); bubble size: Scope & Capability; points: sysdig, perf, stap, ftrace, eBPF, dtrace4L., ktap]
slide 137:
In Summary…
slide 138:
Methodologies Summary • Objectives: – Recognize the Streetlight Anti-Method – Perform the Workload Characterization Method – Perform the USE Method – Be aware of other methodologies • Try to start with the questions (methodology), to help guide your use of the tools
slide 139:
Tools Summary • Objectives: – Perform the USE Method for resource utilization – Perform Workload Characterization for disks, network – Have exposure to various observability tools: • Basic: vmstat, iostat, mpstat, ps, top, … • Intermediate: tcpdump, netstat, nicstat, pidstat, sar, … • Advanced: ss, slabtop, perf_events, … – Perform Active Benchmarking – Understand tuning risks – Perform Static Performance Tuning • Print out the tools diagrams for your office wall
slide 140:
Profiling & Tracing Summary • Objectives: – Profile CPU usage by stack sampling – Generate CPU flame graphs – Understand gotchas with stacks & symbols – Understand frameworks: tracepoints, kprobes, uprobes – Understand mainline tracers: ftrace, perf_events, eBPF – Awareness of other tracers: SystemTap, LTTng, ktap, sysdig – Awareness of what tracing can accomplish (eg, perf-tools) • I've hopefully turned some unknown unknowns into known unknowns
slide 141:
References & Links • Systems Performance: Enterprise and the Cloud, Prentice Hall […]
slide 142:
Thanks • Questions? • http://slideshare.net/brendangregg • http://www.brendangregg.com • bgregg@netflix.com • @brendangregg