PerconaLive 2016: Linux Systems Performance
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0
Description: "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
PDF: Percona2016_LinuxSystemsPerf.pdf
Keywords (from pdftotext):
slide 1:
Apr, 2016 Linux Systems Performance Brendan Gregg Senior Performance Architect
slide 2:
Systems Performance in 50 mins
slide 3:
slide 4:
Agenda A brief discussion of 6 facets of Linux performance: Observability Methodologies Benchmarking Profiling Tracing Tuning Audience: Everyone (DBAs, developers, operations, …)
slide 5:
1. Observability
slide 6:
How do you measure these?
slide 7:
Linux Observability Tools
slide 8:
Observability Tools • Tools showcase common metrics – Learning Linux tools is useful even if you never use them: the same metrics are in GUIs • We usually use these metrics via: – Netflix Atlas: cloud-wide monitoring – Netflix Vector: instance analysis • Linux has many tools – Plus many extra kernel sources of data that lack tools, are harder to use, and are practically undocumented • Some tool examples…
slide 9:
uptime • One way to print load averages: $ uptime 07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91 • A measure of resource demand: CPUs + disks – Other OSes only show CPUs: easier to interpret • Exponentially-damped moving averages • Time constants of 1, 5, and 15 minutes – Historic trend without the line graph • Load > # of CPUs, may mean CPU saturation – Don’t spend more than 5 seconds studying these
slide 10:
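The 1/5/15-minute load averages above are exponentially-damped moving averages, which can be sketched in a few lines. This is a simplified model, not the kernel's fixed-point implementation; the 5-second sampling interval and the task counts fed in are illustrative:

```python
import math

def damped_averages(samples, interval=5):
    """Update 1-, 5- and 15-minute exponentially-damped moving
    averages from instantaneous run-queue samples, the way the
    kernel's load averages behave (simplified sketch)."""
    avgs = {60: 0.0, 300: 0.0, 900: 0.0}  # time constants in seconds
    for n in samples:  # n = demand (runnable tasks) at each tick
        for tc in avgs:
            decay = math.exp(-interval / tc)
            avgs[tc] = avgs[tc] * decay + n * (1 - decay)
    return avgs

# A constant demand of 4 converges toward 4.0, fastest for the
# 1-minute average -- hence "historic trend without the line graph":
avgs = damped_averages([4] * 720)   # one hour of 5-second samples
short = damped_averages([4] * 12)   # only one minute of samples
```

After one minute of constant load, the 1-minute average has moved much further toward the true value than the 15-minute one, which is exactly the trend information the three numbers encode.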
top (or htop) • System and per-process interval summary: $ top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22 Tasks: 209 total, 1 running, 206 sleeping, 0 stopped, 2 zombie Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st Mem: 70197156k total, 44831072k used, 25366084k free, 36360k buffers Swap: 0k total, 0k used, 0k free, 11873356k cached PID USER 5738 apiprod 1386 apiprod 1 root 2 root […] NI VIRT RES SHR S %CPU %MEM 0 62.6g 29g 352m S 417 44.2 0 17452 1388 964 R 0 0.0 0 24340 2272 1340 S 0 0.0 0 S 0 0.0 TIME+ COMMAND 2144:15 java 0:00.02 top 0:01.51 init 0:00.00 kthreadd • %CPU is summed across all CPUs • Can miss short-lived processes (atop won’t) • Can consume noticeable CPU to read /proc
slide 11:
htop
slide 12:
vmstat • Virtual memory statistics and more: $ vmstat -Sm 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache cs us sy id wa 8 0 12 25 34 0 0 7 0 0 205 186 46 13 0 0 8 0 8 210 435 39 21 0 0 8 0 0 218 219 42 17 0 0 […] • USAGE: vmstat [interval [count]] • First output line has some summary since boot values – Should be all; partial is confusing • High level CPU summary – “r” is runnable tasks
slide 13:
iostat • Block I/O (disk) stats. 1st output is since boot. $ iostat -xmdz 1 Linux 3.13.0-29 (db001-eb883efa) 08/18/2014 _x86_64_ (16 CPU) Device: xvda xvdb xvdc md0 rrqm/s wrqm/s r/s 0.00 15299.00 0.00 15271.00 0.00 31082.00 w/s rMB/s wMB/s \ ... 0.00 / ... 0.00 \ ... 0.01 / ... 0.01 \ ... Workload • Very useful set of stats ... \ avgqu-sz ... / ... \ ... / ... \ await r_await w_await Resulting Performance svctm %util
slide 14:
free • Main memory usage: $ free -m total Mem: -/+ buffers/cache: Swap: used free shared buffers cached • buffers: block device I/O cache • cached: virtual page cache
slide 15:
strace • System call tracer: $ strace -tttT -p 313 1408393285.779746 getgroups(0, NULL) = 1 1408393285.779873 getgroups(1, [0]) = 1 1408393285.780797 close(3) = 0 1408393285.781338 write(1, "LinuxCon 2014!\n", 15LinuxCon 2014! ) = 15 • Eg, -ttt: time (us) since epoch; -T: syscall time (s) • Translates syscall args – Very helpful for solving system usage issues • Currently has massive overhead (ptrace based) – Can slow the target by > 100x. Use extreme caution.
slide 16:
tcpdump • Sniff network packets for post analysis: $ tcpdump -i eth0 -w /tmp/out.tcpdump tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes ^C7985 packets captured 8996 packets received by filter 1010 packets dropped by kernel # tcpdump -nr /tmp/out.tcpdump | head reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet) 20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18... 20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48... 20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96... […] • Study packet sequences with timestamps (us) • CPU overhead optimized (socket ring buffers), but can still be significant. Use caution.
slide 17:
netstat • Various network protocol statistics using -s: • A multi-tool: $ netstat -s -i: interface stats -r: route table default: list conns • netstat -p: shows process details! • Per-second interval with -c […] Tcp: 736455 active connections openings 176887 passive connection openings 33 failed connection attempts 1466 connection resets received 3311 connections established 91975192 segments received 180415763 segments send out 223685 segments retransmited 2 bad segments received. 39481 resets sent […] TcpExt: 12377 invalid SYN cookies received 2982 delayed acks sent […]
slide 18:
slabtop • Kernel slab allocator memory usage: $ slabtop Active / Total Objects (% used) : 4692768 / 4751161 (98.8%) Active / Total Slabs (% used) : 129083 / 129083 (100.0%) Active / Total Caches (% used) : 71 / 109 (65.1%) Active / Total Size (% used) : 729966.22K / 738277.47K (98.9%) Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 3565575 3565575 100% 0.10K 91425 365700K buffer_head 314916 314066 99% 0.19K 14996 59984K dentry 184192 183751 99% 0.06K 11512K kmalloc-64 138618 138618 100% 0.94K 130464K xfs_inode 138602 138602 100% 0.21K 29968K xfs_ili 102116 99012 96% 0.55K 58352K radix_tree_node 97482 49093 50% 0.09K 9284K kmalloc-96 22695 20777 91% 0.05K 1068K shared_policy_node 21312 21312 100% 0.86K 18432K ext4_inode_cache 16288 14601 89% 0.25K 4072K kmalloc-256 […]
slide 19:
pcstat • Show page cache residency by file: # ./pcstat data0* |----------+----------------+------------+-----------+---------| | Name | Size | Pages | Cached | Percent | |----------+----------------+------------+-----------+---------| | data00 | 104857600 | 25600 | 25600 | 100.000 | | data01 | 104857600 | 25600 | 25600 | 100.000 | | data02 | 104857600 | 25600 | 4080 | 015.938 | | data03 | 104857600 | 25600 | 25600 | 100.000 | | data04 | 104857600 | 25600 | 16010 | 062.539 | | data05 | 104857600 | 25600 | 0 | 000.000 | |----------+----------------+------------+-----------+---------| • Uses the mincore(2) syscall. Useful for database performance analysis.
slide 20:
perf_events • Provides the "perf" command • In Linux source code: tools/perf – Usually pkg added by linux-tools-common, etc. • Multi-tool with many capabilities – CPU profiling – PMC profiling – Static & dynamic tracing • Covered later in Profiling & Tracing
slide 21:
Where do you start? ...and stop?
slide 22:
2. Methodologies
slide 23:
Anti-Methodologies • The lack of a deliberate methodology… • Street Light Anti-Method: – 1. Pick observability tools that are • Familiar • Found on the Internet • Found at random – 2. Run tools – 3. Look for obvious issues • Drunk Man Anti-Method: – Tune things at random until the problem goes away
slide 24:
Methodologies Linux Performance Analysis in 60 seconds The USE method CPU Profile Method Resource Analysis Workload Analysis Others include: – Workload characterization – Drill-down analysis – Off-CPU analysis – Static performance tuning – 5 whys – …
slide 25:
Linux Perf Analysis in 60s 1. uptime 2. dmesg | tail 3. vmstat 1 4. mpstat -P ALL 1 5. pidstat 1 6. iostat -xz 1 7. free -m 8. sar -n DEV 1 9. sar -n TCP,ETCP 1 10. top
slide 26:
Linux Perf Analysis in 60s 1. uptime 2. dmesg | tail 3. vmstat 1 4. mpstat -P ALL 1 5. pidstat 1 6. iostat -xz 1 7. free -m 8. sar -n DEV 1 9. sar -n TCP,ETCP 1 10. top load averages kernel errors overall stats by time CPU balance process usage disk I/O memory usage network I/O TCP stats check overview http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
slide 27:
The USE Method • For every resource, check: Utilization Saturation Errors X Resource Utilization (%) • Definitions: – Utilization: busy time – Saturation: queue length or queued time – Errors: easy to interpret (objective) • Helps if you have a functional (block) diagram of your system / software / environment, showing all resources Start with the questions, then find the tools
slide 28:
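The definitions above can be applied mechanically. As an example, a sketch of the CPU check using two /proc/stat-style jiffy samples and a run-queue length — the counter values and queue length below are hypothetical, chosen only to illustrate the arithmetic:

```python
def cpu_use_metrics(stat_t0, stat_t1, runq_len, ncpu):
    """USE-method CPU check from two samples of /proc/stat's 'cpu'
    line. Fields: user, nice, system, idle, iowait, irq, softirq
    (in jiffies). Returns (utilization %, saturation)."""
    deltas = [b - a for a, b in zip(stat_t0, stat_t1)]
    total = sum(deltas)
    not_busy = deltas[3] + deltas[4]          # idle + iowait
    utilization = 100.0 * (total - not_busy) / total
    saturation = max(0, runq_len - ncpu)      # runnable tasks beyond CPUs
    return utilization, saturation

# Hypothetical samples, taken 1 second apart on a 4-CPU system:
t0 = [1000, 0, 500, 2000, 100, 10, 20]
t1 = [1250, 0, 600, 2030, 100, 12, 28]
util, sat = cpu_use_metrics(t0, t1, runq_len=6, ncpu=4)
```

Here utilization is high (~92%) and two tasks are queued beyond the CPU count — both U and S flag the CPUs, so the investigation narrows there before any tool output is studied.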
USE Method for Hardware • For every resource, check: Utilization Saturation Errors • Including busses & interconnects
slide 29:
(http://www.brendangregg.com/USEmethod/use-linux.html)
slide 30:
CPU Profile Method 1. Take a CPU profile 2. Understand all software in profile > 1% • Discovers a wide range of issues by their CPU usage – Directly: CPU consumers – Indirectly: initialization of I/O, locks, times, ... • Narrows target of study to only its running components Flame Graph
slide 31:
Resource Analysis • Typical approach for system performance analysis: begin with system tools & metrics Workload • Pros: – Generic – Aids resource perf tuning Application Libraries System • Cons: – Uneven coverage – False positives System Calls Kernel Hardware Analysis
slide 32:
Workload Analysis • Begin with application metrics & context • Pros: – Accurate, proportional metrics – App context • Cons: – App specific – Difficult to dig from app to resource Workload Application Libraries System System Calls Kernel Hardware Analysis
slide 33:
3. Benchmarking
slide 34:
~100% of benchmarks are wrong
slide 35:
Benchmarking • Apart from observational analysis, benchmarking is a useful form of experimental analysis – Try observational first; benchmarks can perturb • However, benchmarking is error prone: – Testing the wrong target: eg, FS cache instead of disk – Choosing the wrong target: eg, disk instead of FS cache … doesn’t resemble real world usage – Invalid results: eg, bugs – Misleading results: you benchmark A, but actually measure B, and conclude you measured C • The energy needed to refute benchmarks is multiple orders of magnitude bigger than to run them
slide 36:
Benchmark Examples • Micro benchmarks: – File system maximum cached read operations/sec – Network maximum throughput • Macro (application) benchmarks: – Simulated application maximum request rate • Bad benchmarks: – getpid() in a tight loop – Context switch timing
slide 37:
The Benchmark Paradox • Benchmarking is used for product evaluations – Eg, evaluating cloud vendors • The Benchmark Paradox: – If your product’s chances of winning a benchmark are 50/50, you’ll usually lose – http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html • Solving this seeming paradox (and benchmarking in general)…
slide 38:
For any given benchmark result, ask: why isn’t it 10x?
slide 39:
Active Benchmarking • Root cause performance analysis while the benchmark is still running – Use the earlier observability tools – Identify the limiter (or suspected limiter) and include it with the benchmark results – Answer: why not 10x? • This takes time, but uncovers most mistakes
slide 40:
4. Profiling
slide 41:
Profiling • Can you do this? “As an experiment to investigate the performance of the resulting TCP/IP implementation ... the 11/750 is CPU saturated, but the 11/780 has about 30% idle time. The time spent in the system processing the data is spread out among handling for the Ethernet (20%), IP packet processing (10%), TCP processing (30%), checksumming (25%), and user system call handling (15%), with no single part of the handling dominating the time in the system.”
slide 42:
Profiling • Can you do this? “As an experiment to investigate the performance of the resulting TCP/IP implementation ... the 11/750 is CPU saturated, but the 11/780 has about 30% idle time. The time spent in the system processing the data is spread out among handling for the Ethernet (20%), IP packet processing (10%), TCP processing (30%), checksumming (25%), and user system call handling (15%), with no single part of the handling dominating the time in the system.” – Bill Joy, 1981, TCP-IP Digest, Vol 1 #6 https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1
slide 43:
perf_events • Introduced earlier: multi-tool, profiler. Provides "perf". usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS] The most commonly used perf commands are: annotate Read perf.data (created by perf record) and display annotated code archive Create archive with object files with build-ids found in perf.data file bench General framework for benchmark suites buildid-cache Manage build-id cache. buildid-list List the buildids in a perf.data file data Data file related processing diff Read perf.data files and display the differential profile evlist List the event names in a perf.data file inject Filter to augment the events stream with additional information kmem Tool to trace/measure kernel memory(slab) properties kvm Tool to trace/measure kvm guest os list List all symbolic event types lock Analyze lock events mem Profile memory accesses record Run a command and record its profile into perf.data report Read perf.data (created by perf record) and display the profile sched Tool to trace/measure scheduler properties (latencies) script Read perf.data (created by perf record) and display trace output stat Run a command and gather performance counter statistics test Runs sanity tests. timechart Tool to visualize total system behavior during a workload top System profiling tool. trace strace inspired tool probe Define new dynamic tracepoints See 'perf help COMMAND' for more information on a specific command.
slide 44:
perf_events: CPU profiling • Sampling full stack traces at 99 Hertz, for 30 secs: # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf report -n --stdio 1.40% java [kernel.kallsyms] [k] _raw_spin_lock --- _raw_spin_lock |--63.21%-- try_to_wake_up |--63.91%-- default_wake_function |--56.11%-- __wake_up_common __wake_up_locked ep_poll_callback __wake_up_common __wake_up_sync_key |--59.19%-- sock_def_readable […78,000 lines truncated…]
slide 45:
perf_events: Full "report" Output
slide 46:
… as a Flame Graph
slide 47:
perf_events: Flame Graphs git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a -g -- sleep 30 perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg • Flame Graphs: – x-axis: alphabetical stack sort, to maximize merging – y-axis: stack depth – color: random, or hue can be a dimension (eg, diff) • Interpretation: – top edge is on-CPU, beneath it is ancestry • Easy to get working – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html • Also in Mar/Apr issue of ACMQ
slide 48:
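The stackcollapse-perf.pl step in the pipeline above folds identical stacks into one counted line per unique stack, which is what enables the x-axis merging. A toy sketch of that folding (not the real script, which also parses perf's multi-line output format):

```python
from collections import Counter

def collapse(samples):
    """Fold sampled stacks (root-first lists of frames) into the
    'func1;func2;funcN count' lines that flamegraph.pl consumes --
    a toy sketch of what stackcollapse-perf.pl does."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Three samples, two of them sharing an identical stack:
samples = [
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "compute"],
]
folded = collapse(samples)
# folded == ["main;compute 1", "main;parse;read 2"]
```

Sorting the folded lines alphabetically is what makes frames with a common ancestry merge into wide boxes in the rendered SVG.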
• The first ever flame graph was generated for MySQL • This background is the output of a DTrace CPU profile, which only printed unique stacks with counts Size of one stack
slide 49:
1 Flame Graph: MySQL … same data
slide 50:
perf_events: Workflow Typical Workflow: list events: perf list count events: perf stat capture stacks: perf record (perf.data) text UI: perf report dump profile: perf script flame graph visualization: stackcollapse-perf.pl, flamegraph.pl
slide 51:
perf_events: Counters • Performance Monitoring Counters (PMCs): $ perf list | grep -i hardware cpu-cycles OR cycles [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] instructions [Hardware event] […] branch-misses [Hardware event] bus-cycles [Hardware event] L1-dcache-loads [Hardware cache event] L1-dcache-load-misses [Hardware cache event] […] rNNN (see 'perf list --help' on how to encode it) [Raw hardware event] mem:<addr>[:access] [Hardware breakpoint] • Identify CPU cycle breakdowns, esp. stall types • PMCs not enabled by-default in clouds (yet) • Can be time-consuming to use (CPU manuals)
slide 52:
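A first-pass use of the cycles and instructions counters above is instructions-per-cycle (IPC): a low IPC suggests stall cycles dominate and the stalled-cycles events are worth drilling into. The counts below are hypothetical, standing in for `perf stat -e cycles,instructions` output:

```python
def ipc(instructions, cycles):
    """Instructions per cycle from PMC counts. As a rough guide,
    IPC well below 1 on a modern CPU suggests memory or frontend
    stalls; higher values suggest instruction-bound work."""
    return instructions / cycles

# Hypothetical counts from a 10-second `perf stat` run:
rate = ipc(instructions=2_000_000_000, cycles=2_500_000_000)
# rate == 0.8
```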
5. Tracing
slide 53:
Linux Event Sources
slide 54:
Tracing Stack in Linux: add-on tools: trace-cmd, perf-tools, bcc, … front-end tools: perf tracing frameworks: ftrace, perf_events, BPF back-end instrumentation: tracepoints, kprobes, uprobes
slide 55:
ftrace • Added by Steven Rostedt and others since 2.6.27 – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, … – See Linux source: Documentation/trace/ftrace.txt – A collection of powerful features, good for hacking • Use directly via /sys/kernel/debug/tracing (not easy): linux-4.0.0+# ls /sys/kernel/debug/tracing/ available_events max_graph_depth available_filter_functions options available_tracers per_cpu buffer_size_kb printk_formats […] stack_max_size stack_trace stack_trace_filter trace • Or via front-ends: – Steven's trace-cmd – my perf-tools: iosnoop, iolatency, funccount, kprobe, …
slide 56:
ftrace: perf-tools iosnoop • Block I/O (disk) events with latency: # ./iosnoop -ts Tracing block I/O. Ctrl-C to end. STARTs ENDs COMM 5982800.302061 5982800.302679 supervise 5982800.302423 5982800.302842 supervise 5982800.304962 5982800.305446 supervise 5982800.305250 5982800.305676 supervise […] PID TYPE DEV 202,1 202,1 202,1 202,1 BLOCK BYTES LATms # ./iosnoop -h USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration] -d device # device string (eg, "202,1") -i iotype # match type (eg, '*R*' for all reads) -n name # process name to match on I/O issue -p PID # PID to match on I/O issue # include queueing time in LATms # include start time of I/O (s) # include completion time of I/O (s) # this usage message duration # duration seconds, and use buffers […]
slide 57:
ftrace: perf-tools iolatency • Block I/O (disk) latency distributions: # ./iolatency Tracing block I/O. Output every 1 seconds. Ctrl-C to end. >=(ms) .. <(ms) : I/O |Distribution| 0 -> 1 : 2104 |######################################| 1 -> 2 : 280 |######| 2 -> 4 : 2 |#| 4 -> 8 : 0 | | 8 -> 16 : 202 |####| >=(ms) .. <(ms) : I/O |Distribution| 0 -> 1 : 1144 |######################################| 1 -> 2 : 267 |#########| 2 -> 4 : 10 |#| 4 -> 8 : 5 |#| 8 -> 16 : 248 |#########| 16 -> 32 : 601 |####################| 32 -> 64 : 117 |####| […]
slide 58:
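The power-of-two bucketing that produces distributions like the above can be sketched in a few lines — an illustrative re-implementation of the bucketing idea only, not the ftrace-based tool, and the latency values fed in are invented:

```python
def log2_histogram(latencies_ms):
    """Bucket latencies into power-of-two ranges [hi/2, hi), the way
    iolatency (and BPF histograms) summarize distributions cheaply."""
    buckets = {}
    for ms in latencies_ms:
        hi = 1
        while ms >= hi:   # find the smallest power of two above ms
            hi *= 2
        buckets[hi] = buckets.get(hi, 0) + 1
    return {(hi // 2, hi): n for hi, n in sorted(buckets.items())}

# A bimodal distribution: cache hits near 0.5 ms, misses near 10 ms --
# the two modes land in separate buckets instead of averaging away:
hist = log2_histogram([0.4, 0.6, 0.7, 9.0, 11.0])
# hist == {(0, 1): 3, (8, 16): 2}
```

This is why histograms beat averages for latency: the outlier mode at 8–16 ms stays visible, where a mean of ~4 ms would describe no real I/O at all.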
ftrace: perf-tools funccount • Count a kernel function call rate: # ./funccount -i 1 'bio_*' Tracing "bio_*"... Ctrl-C to end. FUNC COUNT bio_attempt_back_merge bio_get_nr_vecs bio_alloc bio_alloc_bioset bio_endio bio_free bio_fs_destructor bio_init bio_integrity_enabled bio_put bio_add_page Counts are in-kernel, for low overhead [...] – -i: set an output interval (seconds), otherwise until Ctrl-C
slide 59:
ftrace: perf-tools uprobe • Dynamically trace user-level functions; eg, MySQL: # ./uprobe 'p:dispatch_command /opt/mysql/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcj +0(%dx):string' Tracing uprobe dispatch_command (p:dispatch_command /opt/mysql/bin/mysqld:0x2dbd40 +0(%dx):string). Ctrl-C to end. mysqld-2855 [001] d... 19956674.509085: dispatch_command: (0x6dbd40) arg1="show tables" mysqld-2855 [001] d... 19956675.541155: dispatch_command: (0x6dbd40) arg1="SELECT * FROM numbers where number > 32000" – Filter on string match; eg, "SELECT": # ./uprobe 'p:dispatch_command /opt/mysql/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcj cmd=+0(%dx):string' 'cmd ~ "SELECT*"' Tracing uprobe dispatch_command (p:dispatch_command /opt/mysql/bin/mysqld:0x2dbd40 cmd= +0(%dx):string). Ctrl-C to end. mysqld-2855 [001] d... 19956754.619958: dispatch_command: (0x6dbd40) cmd="SELECT * FROM numbers where number > 32000" mysqld-2855 [001] d... 19956755.060125: dispatch_command: (0x6dbd40) cmd="SELECT * FROM numbers where number > 32000" • Ok for hacking, but not friendly; need perf_events/BPF
slide 60:
perf_events • Powerful profiler (covered earlier) and tracer: – User-level and kernel dynamic tracing – Kernel line tracing and local variables (debuginfo) – Kernel filtering expressions – Efficient in-kernel counts (perf stat) • Intended as the official Linux tracer/profiler • Becoming more programmable with BPF support (2016) – Search lkml for "perf" and "BPF"
slide 61:
perf_events: Listing Events # perf list cpu-cycles OR cycles [Hardware event] instructions [Hardware event] cache-references [Hardware event] cache-misses [Hardware event] branch-instructions OR branches [Hardware event] […] skb:kfree_skb [Tracepoint event] skb:consume_skb [Tracepoint event] skb:skb_copy_datagram_iovec [Tracepoint event] net:net_dev_xmit [Tracepoint event] net:net_dev_queue [Tracepoint event] net:netif_receive_skb [Tracepoint event] net:netif_rx [Tracepoint event] […] block:block_rq_abort [Tracepoint event] block:block_rq_requeue [Tracepoint event] block:block_rq_complete [Tracepoint event] block:block_rq_insert [Tracepoint event] block:block_rq_issue [Tracepoint event] block:block_bio_bounce [Tracepoint event] block:block_bio_complete [Tracepoint event] […]
slide 62:
perf_events: Tracing Tracepoints # perf record -e block:block_rq_complete -a sleep 10 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ] # perf script run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0] run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0] run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0] swapper 0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0] swapper 0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0] supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0] supervise 30342 [000] 2083346.746663: block:block_rq_complete: 202,1 W () 12986608 + 8 [0] run 30342 [000] 2083346.747003: block:block_rq_complete: 202,1 W () 12986832 + 8 [0] [...] • If -g is used in "perf record", stack traces are included • If "perf script" output is too verbose, try "perf report", or making a flame graph
slide 63:
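Text output like the above is easy to post-process. As a sketch, a toy summarizer that counts completed block I/O per process name from perf script lines (the sample lines are taken from the slide; a real script would also parse device, sector and size fields):

```python
import re
from collections import Counter

# Lines in the form perf script emits for block:block_rq_complete:
sample = """\
run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]
run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
"""

def io_by_comm(perf_script_output):
    """Count completed block I/O per process (comm) from perf script
    tracepoint output -- a toy post-processor."""
    counts = Counter()
    pat = re.compile(r"(\S+)\s+\d+\s+\[\d+\]\s+[\d.]+: block:block_rq_complete:")
    for line in perf_script_output.splitlines():
        m = pat.match(line)
        if m:
            counts[m.group(1)] += 1
    return dict(counts)

counts = io_by_comm(sample)
# counts == {"run": 2, "supervise": 1}
```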
BPF • Enhanced Berkeley Packet Filter, now just "BPF" – Enhancements added in Linux 3.15, 3.19, 4.1, 4.5, … • Provides programmatic tracing – measure latency, custom histograms, … System dashboards of 2017+ should look very different
slide 64:
BPF: bcc ext4slower • ext4 operations slower than the threshold: # ./ext4slower 1 Tracing ext4 operations slower than 1 ms TIME COMM PID T BYTES OFF_KB 06:49:17 bash R 128 06:49:17 cksum R 39552 06:49:17 cksum R 96 06:49:17 cksum R 96 06:49:17 cksum R 10320 06:49:17 cksum R 65536 06:49:17 cksum R 55400 06:49:17 cksum R 36792 06:49:17 cksum R 15008 […] LAT(ms) FILENAME 7.75 cksum 1.34 [ 5.36 2to3-2.7 14.94 2to3-3.4 6.82 411toppm 4.01 a2p 8.77 ab 16.34 aclocal-1.14 19.31 acpi_listen • Better indicator of application pain than disk I/O • Measures & filters in-kernel for efficiency using BPF – https://github.com/iovisor/bcc
slide 65:
BPF: bcc tools (early 2016)
slide 66:
6. Tuning
slide 67:
Ubuntu Trusty Tuning: Early 2016 (1/2) CPU schedtool -B PID disable Ubuntu apport (crash reporter) also upgrade to Xenial (better HT scheduling) Virtual Memory vm.swappiness = 0 # from 60 kernel.numa_balancing = 0 # temp workaround Huge Pages echo never > /sys/kernel/mm/transparent_hugepage/enabled File System vm.dirty_ratio = 80 # from 40 vm.dirty_background_ratio = 5 # from 10 vm.dirty_expire_centisecs = 12000 # from 3000 mount -o defaults,noatime,discard,nobarrier … Storage I/O /sys/block/*/queue/rq_affinity /sys/block/*/queue/scheduler noop /sys/block/*/queue/nr_requests
slide 68:
Ubuntu Trusty Tuning: Early 2016 (2/2) Storage (continued) /sys/block/*/queue/read_ahead_kb 256 mdadm --chunk=64 … Networking net.core.somaxconn = 1000 net.core.netdev_max_backlog = 5000 net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.ipv4.tcp_wmem = 4096 12582912 16777216 net.ipv4.tcp_rmem = 4096 12582912 16777216 net.ipv4.tcp_max_syn_backlog = 8096 net.ipv4.tcp_slow_start_after_idle = 0 net.ipv4.tcp_tw_reuse = 1 net.ipv4.ip_local_port_range = 10240 65535 net.ipv4.tcp_abort_on_overflow = 1 # maybe Hypervisor (Xen) echo tsc > /sys/devices/…/current_clocksource Plus PVHVM (HVM), SR-IOV, …
slide 69:
Summary A brief discussion of 6 facets of Linux performance: Observability Methodologies Benchmarking Profiling Tracing Tuning
slide 70:
Takeaways Some things to print out for your office wall: uptime dmesg -T | tail vmstat 1 mpstat -P ALL 1 pidstat 1 iostat -xz 1 free -m sar -n DEV 1 sar -n TCP,ETCP 1 top
slide 71:
More Links Netflix Tech Blog on Linux: http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html Linux Performance: http://www.brendangregg.com/linuxperf.html Linux perf_events: https://perf.wiki.kernel.org/index.php/Main_Page http://www.brendangregg.com/perf.html Linux ftrace: https://www.kernel.org/doc/Documentation/trace/ftrace.txt https://github.com/brendangregg/perf-tools USE Method Linux: http://www.brendangregg.com/USEmethod/use-linux.html Flame Graphs: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html http://queue.acm.org/detail.cfm?id=2927301
slide 72:
Thanks Questions? http://slideshare.net/brendangregg http://www.brendangregg.com bgregg@netflix.com @brendangregg