USENIX ATC 2017: Visualizing Performance with Flame Graphs
Video: https://www.youtube.com/watch?v=D53T1Ejig1QTalk by Brendan Gregg for USENIX ATC 2017.
"Flame graphs are a simple stack trace visualization that helps answer an everyday problem: how is software consuming resources, especially CPUs, and how did this change since the last software version? Flame graphs have been adopted by many languages, products, and companies, including Netflix, and have become a standard tool for performance analysis. They were published in "The Flame Graph" article in the June 2016 issue of Communications of the ACM, by their creator, Brendan Gregg.
This talk describes the background for this work, and the challenges encountered when profiling stack traces and resolving symbols for different languages, including for just-in-time compiler runtimes. Instructions will be included generating mixed-mode flame graphs on Linux, and examples from our use at Netflix with Java. Advanced flame graph types will be described, including differential, off-CPU, chain graphs, memory, and TCP events. Finally, future work and unsolved problems in this area will be discussed."
next prev 1/66 | |
next prev 2/66 | |
next prev 3/66 | |
next prev 4/66 | |
next prev 5/66 | |
next prev 6/66 | |
next prev 7/66 | |
next prev 8/66 | |
next prev 9/66 | |
next prev 10/66 | |
next prev 11/66 | |
next prev 12/66 | |
next prev 13/66 | |
next prev 14/66 | |
next prev 15/66 | |
next prev 16/66 | |
next prev 17/66 | |
next prev 18/66 | |
next prev 19/66 | |
next prev 20/66 | |
next prev 21/66 | |
next prev 22/66 | |
next prev 23/66 | |
next prev 24/66 | |
next prev 25/66 | |
next prev 26/66 | |
next prev 27/66 | |
next prev 28/66 | |
next prev 29/66 | |
next prev 30/66 | |
next prev 31/66 | |
next prev 32/66 | |
next prev 33/66 | |
next prev 34/66 | |
next prev 35/66 | |
next prev 36/66 | |
next prev 37/66 | |
next prev 38/66 | |
next prev 39/66 | |
next prev 40/66 | |
next prev 41/66 | |
next prev 42/66 | |
next prev 43/66 | |
next prev 44/66 | |
next prev 45/66 | |
next prev 46/66 | |
next prev 47/66 | |
next prev 48/66 | |
next prev 49/66 | |
next prev 50/66 | |
next prev 51/66 | |
next prev 52/66 | |
next prev 53/66 | |
next prev 54/66 | |
next prev 55/66 | |
next prev 56/66 | |
next prev 57/66 | |
next prev 58/66 | |
next prev 59/66 | |
next prev 60/66 | |
next prev 61/66 | |
next prev 62/66 | |
next prev 63/66 | |
next prev 64/66 | |
next prev 65/66 | |
next prev 66/66 |
PDF: USENIX_ATC2017_flamegraphs.pdf
Keywords (from pdftotext):
slide 1:
2017 USENIX Annual Technical Conference Visualizing Performance with Flame Graphs Brendan Gregg Senior Performance Architect Jul 2017slide 2:
Visualize CPU -me consumed by all so5ware Kernel Java User-levelslide 3:
Agenda 1. CPU Flame graphs 2. Fixing Stacks & Symbols 3. Advanced flame graphsslide 4:
Take aways 1. Interpret CPU flame graphs 2. Understand piHalls with stack traces and symbols 3. Discover opportuniKes for future developmentslide 5:
slide 6:
Case Study Exception handling consuming CPUslide 7:
Summary CPU PROFILINGslide 8:
CPU Profiling • Record stacks at a timed interval: simple and effective – Pros: Low (deterministic) overhead – Cons: Coarse accuracy, but usually sufficient stack samples: syscall on-CPU time off-CPU block interruptslide 9:
Stack Traces • A code path snapshot. e.g., from jstack(1): $ jstack 1819 […] "main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000] java.lang.Thread.State: RUNNABLE at Func_abc.func_c(Func_abc.java:6) at Func_abc.func_b(Func_abc.java:16) at Func_abc.func_a(Func_abc.java:23) at Func_abc.main(Func_abc.java:27) running parent g.parent g.g.parentslide 10:
System Profilers • Linux – perf_events (aka "perf") • Oracle Solaris – DTrace • OS X – Instruments • Windows – XPerf, WPA (which now has flame graphs!) • And many others…slide 11:
Linux perf_events • Standard Linux profiler – Provides the perf command (mulK-tool) – Usually pkg added by linux-tools-common, etc. • Many event sources: Timer-based sampling Hardware events Tracepoints Dynamic tracing • Can sample stacks of (almost) everything on CPU Can miss hard interrupt ISRs, but these should be near-zero. They can be measured if needed (I wrote my own tools).slide 12:
perf Profiling # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf report -n -stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. 20.42% bash [kernel.kallsyms] [k] xen_hypercall_xen_version --- xen_hypercall_xen_version check_events |--44.13%-- syscall_trace_enter tracesys |--35.58%-- __GI___libc_fcntl |--65.26%-- do_redirection_internal do_redirections execute_builtin_or_function execute_simple_command [… ~13,000 lines truncated …] call tree summaryslide 13:
Full perf report Outputslide 14:
… as a Flame Graphslide 15:
Flame Graph Summary • Visualizes a collecKon of stack traces – x-axis: alphabeKcal stack sort, to maximize merging – y-axis: stack depth – color: random (default), or a dimension • Currently made from Perl + SVG + JavaScript – hBps://github.com/brendangregg/FlameGraph – Takes input from many different profilers – MulKple d3 versions are being developed • References: hcp://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html hcp://queue.acm.org/detail.cfm?id=2927301 "The Flame Graph" CACM, June 2016slide 16:
Flame Graph InterpretaKon g() e() f() d() c() i() b() h() a()slide 17:
Flame Graph InterpretaKon (1/3) Top edge shows who is running on-CPU, and how much (width) g() e() f() d() c() i() b() h() a()slide 18:
Flame Graph InterpretaKon (2/3) Top-down shows ancestry e.g., from g(): g() e() f() d() c() i() b() h() a()slide 19:
Flame Graph InterpretaKon (3/3) Widths are proporKonal to presence in samples e.g., comparing b() to h() (incl. children) g() e() f() d() c() i() b() h() a()slide 20:
Mixed-Mode Flame Graphs • Hues: – green == JIT (eg, Java) – aqua == inlined • if included – red == user-level* – orange == kernel – yellow == C++ • Intensity: – Randomized to differenKate frames – Or hashed on funcKon name * new palece uses red for kernel modules too Mixed-Mode Java JVM (C++) Kernelslide 21:
DifferenKal Flame Graphs • Hues: Differential – red == more samples – blue == less samples • Intensity: – Degree of difference • Compares two profiles • Can show other metrics: e.g., CPI • Other types exist – flamegraphdiff more lessslide 22:
Icicle Graph top (leaf) mergeslide 23:
Flame Graph Search • Color: magenta to show matched frames search buttonslide 24:
Flame Charts • Final note: these are useful, but are not flame graphs • Flame charts: x-axis is time • Flame graphs: x-axis is population (maximize merging) from Chrome dev toolsslide 25:
PiHalls and fixes STACK TRACINGslide 26:
Broken Stack Traces are Common Because: A. Profilers use frame pointer walking by default B. Compilers reuse the frame pointer register as a general purpose register: a (usually very small) performance opKmizaKon. # perf record –F 99 –a –g – sleep 30 # perf script […] java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) […]slide 27:
… as a Flame Graph Broken Java stacks (missing frame pointer)slide 28:
Fixing Stack Walking A. Frame pointer-based – Fix by disabling that compiler opKmizaKon: gcc's -fno-omit-frame-pointer – Pros: simple, supported by many tools – Cons: might cost a licle extra CPU B. Debug info (DWARF) walking Cons: costs disk space, and not supported by all profilers. Even possible with JIT? C. JIT runKme walkers – Pros: include more internals, such as inlined frames – Cons: limited to applicaKon internals - no kernel D. Last branch recordslide 29:
Fixing Java Stack Traces # perf script […] java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/… java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/… a2d53351ff7da603 [unknown] ([unkn… […] I prototyped JVM frame pointers. Oracle rewrote it and added it to Java as -XX:+PreserveFramePointer (JDK 8 u60b19) # perf script […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,… 7fd3171dce44 JavaCalls::call_virtual(JavaValue*… 7fd3171dd43a JavaCalls::call_virtual(JavaValue*… 7fd31721b6ce thread_entry(JavaThread*, Thread*)… 7fd3175389e0 JavaThread::thread_main_inner() (/… 7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf… 7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/… 7fd317a7e182 start_thread (/lib/x86_64-linux-gn…slide 30:
Fixed Stacks Flame Graph Java stacks (but no symbols, yet)slide 31:
Inlining • Many frames may be missing (inlined) – Flame graph may sKll make enough sense • Inlining can oqen be be tuned – e.g. Java's -XX:-Inline to disable, but can be 80% slower – Java's -XX:MaxInlineSize and -XX:InlineSmallCode can be tuned a licle to reveal more frames: can even improve performance! • RunKmes can un-inline on demand – So that excepKon stack traces make sense – e.g. Java's perf-map-agent can un-inline (unfoldall opKon)slide 32:
Stack Depth • perf had a 127 frame limit • Now tunable in Linux 4.8 – sysctl -w kernel.perf_event_max_stack=512 – Thanks Arnaldo Carvalho de Melo! broken stacks A Java microservice with a stack depth of >gt; 900 perf_event_max_stack=1024slide 33:
Fixing SYMBOLSslide 34:
Fixing NaKve Symbols A. Add a -dbgsym package, if available B. Recompile from sourceslide 35:
Fixing JIT Symbols (Java, Node.js, …) • Just-in-Kme runKmes don't have a pre-compiled symbol table • So Linux perf looks for an externally provided JIT symbol file: /tmp/perf-PID.map # perf script Failed to open /tmp/perf-8131.map, continuing without symbols […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) […] • This can be created by runKmes; eg, Java's perf-map-agentslide 36:
Fixed Stacks & Symbols Java Mixed-Mode Flame Graph Kernel Java JVMslide 37:
Stacks & Symbols (zoom)slide 38:
Symbol Churn • For JIT runKmes, symbols can change during a profile • Symbols may be mistranslated by perf's map snapshot • SoluKons: A. Take a before & aqer snapshot, and compare B. perf's new support for Kmestamped symbol logsslide 39:
Containers • perf can't find any symbol sources – Unless you copy them into the host • I'm tesKng Krister Johansen's fix, hopefully for Linux 4.13 – lkml: "[PATCH Kp/perf/core 0/7] namespace tracing improvements"slide 40:
For Linux INSTRUCTIONSslide 41:
Linux CPU Flame Graphs Linux 2.6+, via perf.data and perf script: git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a –g -- sleep 30 perf script | ./stackcollapse-perf.pl |./flamegraph.pl >gt; perf.svg Linux 4.5+ can use folded output Skips the CPU-costly stackcollapse-perf.pl step; see: hcp://www.brendangregg.com/blog/2016-04-30/linux-perf-folded.html Linux 4.9+, via BPF: git clone --depth 1 https://github.com/brendangregg/FlameGraph git clone --depth 1 https://github.com/iovisor/bcc ./bcc/tools/profile.py -dF 99 30 | ./FlameGraph/flamegraph.pl >gt; perf.svg Most efficient: no perf.data file, summarizes in-kernelslide 42:
Linux Profiling OpKmizaKons Linux 2.6 Linux 4.5 Linux 4.9 capture stacks capture stacks count stacks (BPF) perf record perf record profile.py write samples write samples perf.data perf.data reads samples reads samples perf script write text stackcollapse-perf.pl folded output flamegraph.pl perf report –g folded folded output folded report awk folded output flamegraph.pl flamegraph.plslide 43:
Language/RunKme InstrucKons • Each may have special stack/symbol instrucKons – Java, Node.js, Python, Ruby, C++, Go, … • I'm documenKng some on: – hcp://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html – Also try an Internet searchslide 44:
GUI AutomaKon Eg, NeHlix Vector (self-service UI): Flame Graphs Should be open sourced; you may also build/buy your ownslide 45:
Future Work ADVANCED FLAME GRAPHSslide 46:
Flame graphs can be generated for stack traces from any Linux event sourceslide 47:
Page Faults • Show what triggered main memory (resident) to grow: # perf record -e page-faults -p PID -g -- sleep 120 • "fault" as (physical) main memory is allocated on-demand, when a virtual page is first populated • Low overhead tool to solve some types of memory leak RES column in top(1) grows becauseslide 48:
Other Memory Sources hcp://www.brendangregg.com/FlameGraphs/memoryflamegraphs.htmlslide 49:
Context Switches • Show why Java blocked and stopped running on-CPU: # perf record -e context-switches -p PID -g -- sleep 5 • IdenKfies locks, I/O, sleeps If code path shouldn't block and looks random, it's an involuntary context switch. I could filter these, but you should have solved them beforehand (CPU load). • e.g., was used to understand framework differences: rxNetty Tomcat epoll futex sys_poll futexslide 50:
Disk I/O Requests • Shows who issued disk I/O (sync reads & writes): # perf record -e block:block_rq_insert -a -g -- sleep 60 • e.g.: page faults in GC? This JVM has swapped out!:slide 51:
TCP Events • TCP transmit, using dynamic tracing: # perf probe tcp_sendmsg # perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace >gt; out.stacks # perf probe --del tcp_sendmsg • Note: can be high overhead for high packet rates – For the current perf trace, dump, post-process cycle • Can also trace TCP connect & accept – Lower frequency, therefore lower overhead • TCP receive is async – Could trace via socket read TCP sendsslide 52:
CPU Cache Misses • In this example, sampling via Last Level Cache loads: # perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso >gt; out.stacks • -c is the count (samples once per count) • Use other CPU counters to sample hits, misses, stallsslide 53:
CPI Flame Graph • Cycles Per InstrucKon – red == instrucKon heavy – blue == cycle heavy (likely memory stall cycles) zoomed:slide 54:
Off-CPU Analysis Off-CPU analysis is the study of blocking states, or the code-path (stack trace) that led to these statesslide 55:
Off-CPU Time Flame Graph Stack depth Off-CPU time More info hcp://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.htmlslide 56:
Off-CPU Time (zoomed): tar(1) directory read from disk file read from disk fstat from disk path read from disk pipe write Only showing kernel stacks in this exampleslide 57:
CPU + Off-CPU Flame Graphs: See Everything CPU Off-CPU hcp://www.brendangregg.com/flamegraphs.htmlslide 58:
Off-CPU Time (zoomed): gzip(1) The off-CPU stack trace often doesn't show the root cause of latency. What is gzip blocked on?slide 59:
Off-Wake Time Flame Graph Uses Linux enhanced BPF to merge off-CPU and waker stack in kernel contextslide 60:
Off-Wake Time Flame Graph (zoomed) Waker task Waker stack Stack DirecKon Wokeup Blocked stack Blocked taskslide 61:
Chain Graphs Walking the chain of wakeup stacks to reach root causeslide 62:
Hot Cold Flame Graphs Includes both CPU & Off-CPU (or chain) stacks in one flame graph However, Off-CPU Kme oqen dominates: threads waiKng or polling hcp://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.htmlslide 63:
Flame Graph Diff hcps://github.com/corpaul/flamegraphdiffslide 64:
Take aways 1. Interpret CPU flame graphs 2. Understand piHalls with stack traces and symbols 3. Discover opportuniKes for future developmentslide 65:
Links & References Flame Graphs – "The Flame Graph" Communica-ons of the ACM, Vol. 56, No. 6 (June 2016) – hcp://queue.acm.org/detail.cfm?id=2927301 – hcp://www.brendangregg.com/flamegraphs.html ->gt; hBp://www.brendangregg.com/flamegraphs.html#Updates – hcp://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html – hcp://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html – hcp://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html – hcp://techblog.neHlix.com/2015/07/java-in-flames.html – hcp://techblog.neHlix.com/2016/04/saving-13-million-computaKonal-minutes.html – hcp://techblog.neHlix.com/2014/11/nodejs-in-flames.html – hcp://www.brendangregg.com/blog/2014-11-09/differenKal-flame-graphs.html – hcp://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html – hcp://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html – hcp://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html – hcp://corpaul.github.io/flamegraphdiff/ Linux perf_events – hcps://perf.wiki.kernel.org/index.php/Main_Page – hcp://www.brendangregg.com/perf.html NeHlix Vector hcps://github.com/neHlix/vector hcp://techblog.neHlix.com/2015/04/introducing-vector-neHlixs-on-host.htmlslide 66:
2017 USENIX Annual Technical Conference Thank You – QuesKons? – hcp://www.brendangregg.com – hcp://slideshare.net/brendangregg – bgregg@neHlix.com – @brendangregg Next topic: Performance Superpowers with Enhanced BPF