USENIX LISA 2013: Blazing Performance with Flame Graphs

Delivered as plenary talk at USENIX LISA 2013 by Brendan Gregg.

Video: https://www.youtube.com/watch?v=nZfNehCzGdw and https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg

Description: "'How did we ever analyze performance before Flame Graphs?' This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency."


PDF: LISA13_Flame_Graphs.pdf

Keywords (from pdftotext):

slide 1:
    Blazing Performance
    with
    Flame Graphs
    Brendan Gregg
    
slide 2:
    An Interactive Visualization for Stack Traces
    
slide 3:
    My Previous Visualizations Include
    • Latency Heat Maps (and other heat map types), including:
    • Quotes from LISA'13 yesterday:
    • "Heat maps are a wonderful thing, use them" – Caskey Dickson
    • "If you do distributed systems, you need this" – Theo Schlossnagle
    • I did heat maps and visualizations in my LISA'10 talk
    
slide 4:
    Audience
    • This is for developers, sysadmins, support staff, and
    performance engineers
    • This is a skill-up for everyone: beginners to experts
    • This helps analyze all software: kernels and applications
    
slide 5:
    whoami
    • G’Day, I’m Brendan
    • Recipient of the LISA 2013 Award
    for Outstanding Achievement
    in System Administration!
    (Thank you!)
    • Work/Research: tools,
    methodologies, visualizations
    • Author of Systems Performance,
    primary author of DTrace
    (Prentice Hall, 2011)
    • Lead Performance Engineer
    @joyent; also teach classes:
    Cloud Perf coming up: http://www.joyent.com/developers/training-services
    
slide 6:
    Joyent
    • High-Performance Cloud Infrastructure
    • Public/private cloud provider
    • OS-Virtualization for bare metal performance
    • KVM for Linux guests
    • Core developers of
    SmartOS and node.js
    • Office walls decorated
    with Flame Graphs:
    
slide 7:
    Agenda: Two Talks in One
    • 1. CPU Flame Graphs
    • Example
    • Background
    • Flame Graphs
    • Generation
    • Types: CPU
    • 2. Advanced Flame Graphs
    • Types: Memory, I/O, Off-CPU, Hot/Cold, Wakeup
    • Developments
    • SVG demos: https://github.com/brendangregg/FlameGraph/demos
    
slide 8:
    CPU Flame Graphs
    
slide 9:
    Example
    
slide 10:
    Example
    • As a short example, I’ll describe the real world performance
    issue that led me to create flame graphs
    • Then I’ll explain them in detail
    
slide 11:
    Example: The Problem
    • A production MySQL database had poor performance
    • It was a heavy CPU consumer, so I used a CPU profiler to see
    why. It sampled stack traces at timed intervals
    • The profiler condensed its output by only printing unique
    stacks along with their occurrence counts, sorted by count
    • The following shows the profiler command and the two most
    frequently sampled stacks...
    
slide 12:
    Example: CPU Profiling
    # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
    dtrace: description 'profile-997 ' matched 2 probes
    CPU     ID                    FUNCTION:NAME
      1  75195                         :tick-60s
    [...]
    libc.so.1`__priocntlset+0xa
    libc.so.1`getparam+0x83
    libc.so.1`pthread_getschedparam+0x3c
    libc.so.1`pthread_setschedprio+0x1f
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
    mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    
slide 13:
    Example: CPU Profiling
    (same output as the previous slide, with annotations)
    Profiling Command (DTrace):
    # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
    dtrace: description 'profile-997 ' matched 2 probes
    CPU     ID                    FUNCTION:NAME
      1  75195                         :tick-60s
    [...]
    Stack Trace:
    libc.so.1`__priocntlset+0xa
    libc.so.1`getparam+0x83
    libc.so.1`pthread_getschedparam+0x3c
    libc.so.1`pthread_setschedprio+0x1f
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
    mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    (annotation: # of occurrences follows each stack)
    
slide 14:
    Example: Profile Data
    • Over 500,000 lines were elided from that output (“[...]”)
    • Full output looks like this...
    
slide 15:
    Example: Profile Data
    60 seconds of on-CPU MySQL
    
slide 16:
    Example: Profile Data
    (screenshot annotations: First Stack, Size of One Stack, Last Stack)
    27,053 Unique Stacks
    60 seconds of on-CPU MySQL
    
slide 17:
    Example: Profile Data
    • The most frequent stack, printed last, shows CPU usage in
    add_to_status(), which is from the “show status” command.
    Is that to blame?
    • Hard to tell – it only accounts for 
slide 18:
    Example: Visualizations
    • To understand this profile data quickly, I created a visualization
    that worked very well, named “Flame Graph” for its
    resemblance to fire (also as it was showing a “hot” CPU issue)
    Profile Data.txt → some Perl → Flame Graph.svg
    
slide 19:
    Example: Flame Graph
    Same profile data
    
slide 20:
    Example: Flame Graph
    Same profile data
    Where CPU is really consumed
    The "show status" Stack
    All Stack Samples
    One Stack Sample
    
slide 21:
    Example: Flame Graph
    • All data in one picture
    • Interactive using JavaScript and a browser: mouse overs
    • Stack elements that are frequent can be seen, read, and
    compared visually. Frame width is relative to sample count
    • CPU usage was now understood properly and quickly,
    leading to a 40% performance win
    
slide 22:
    Background
    
slide 23:
    Background: Stack Frame
    • A stack frame shows a location in code
    • Profilers usually show them on a single line. Eg:
    libc.so.1`mutex_trylock_adaptive+0x112
    module
    function
    instruction
    offset
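
The module/function/offset fields above can be pulled apart mechanically. A minimal shell sketch, using the frame string from the slide (the variable names are mine, for illustration):

```shell
# Split a DTrace-style frame "module`function+offset" into its parts.
frame='libc.so.1`mutex_trylock_adaptive+0x112'
module=${frame%%\`*}    # module: text before the backtick
rest=${frame#*\`}       # function+offset: text after the backtick
func=${rest%%+*}        # function name
offset=${rest#*+}       # instruction offset
printf '%s %s %s\n' "$module" "$func" "$offset"
```

This prints `libc.so.1 mutex_trylock_adaptive 0x112`; the stackcollapse scripts covered later do the equivalent in Perl.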
    
slide 24:
    Background: Stack Trace
    • A stack trace is a list of frames. Their index is the stack depth:
    current:      libc.so.1`mutex_trylock_adaptive+0x112
    parent:       libc.so.1`mutex_lock_impl+0x165
    grandparent:  libc.so.1`mutex_lock+0xc
    [...]
                  libc.so.1`_lwp_start
    (frame index = stack depth)
    
slide 25:
    Background: Stack Trace
    • One full stack:
    libc.so.1`mutex_trylock_adaptive+0x112
    libc.so.1`mutex_lock_impl+0x165
    libc.so.1`mutex_lock+0xc
    mysqld`key_cache_read+0x741
    mysqld`_mi_fetch_keypage+0x48
    mysqld`w_search+0x84
    mysqld`_mi_ck_write_btree+0xa5
    mysqld`mi_write+0x344
    mysqld`ha_myisam::write_row+0x43
    mysqld`handler::ha_write_row+0x8d
    mysqld`end_write+0x1a3
    mysqld`evaluate_join_record+0x11e
    mysqld`sub_select+0x86
    mysqld`do_select+0xd9
    mysqld`JOIN::exec+0x482
    mysqld`mysql_select+0x30e
    mysqld`handle_select+0x17d
    mysqld`execute_sqlcom_select+0xa6
    mysqld`mysql_execute_command+0x124b
    mysqld`mysql_parse+0x3e1
    mysqld`dispatch_command+0x1619
    mysqld`do_handle_one_connection+0x1e5
    mysqld`handle_one_connection+0x4c
    libc.so.1`_thrp_setup+0xbc
    libc.so.1`_lwp_start
    
slide 26:
    Background: Stack Trace
    • Read top-down or bottom-up, and look for key functions
    libc.so.1`mutex_trylock_adaptive+0x112
    libc.so.1`mutex_lock_impl+0x165
    libc.so.1`mutex_lock+0xc
    mysqld`key_cache_read+0x741
    mysqld`_mi_fetch_keypage+0x48
    mysqld`w_search+0x84
    mysqld`_mi_ck_write_btree+0xa5
    mysqld`mi_write+0x344
    mysqld`ha_myisam::write_row+0x43
    mysqld`handler::ha_write_row+0x8d
    mysqld`end_write+0x1a3
    mysqld`evaluate_join_record+0x11e
    mysqld`sub_select+0x86
    mysqld`do_select+0xd9
    mysqld`JOIN::exec+0x482
    mysqld`mysql_select+0x30e
    mysqld`handle_select+0x17d
    mysqld`execute_sqlcom_select+0xa6
    mysqld`mysql_execute_command+0x124b
    mysqld`mysql_parse+0x3e1
    mysqld`dispatch_command+0x1619
    mysqld`do_handle_one_connection+0x1e5
    mysqld`handle_one_connection+0x4c
    libc.so.1`_thrp_setup+0xbc
    libc.so.1`_lwp_start
    Ancestry
    Code Path
    
slide 27:
    Background: Stack Modes
    • Two types of stacks can be profiled:
    • user-level for applications (user mode)
    • kernel-level for the kernel (kernel mode)
    • During a system call, an application may have both
    
slide 28:
    Background: Software Internals
    • You don’t need to be a programmer to understand stacks.
    • Some function names are self explanatory, others require
    source code browsing (if available). Not as bad as it sounds:
    • MySQL has ~15,000 functions in > 0.5 million lines of code
    • The earlier stack has 20 MySQL functions. To understand
    them, you may need to browse only 0.13%
    (20 / 15000) of the code. Might take hours, but it is doable.
    • If you have C++ signatures, you can use a demangler first:
    mysqld`_ZN4JOIN4execEv+0x482
    → (c++filt, demangler.com) →
    mysqld`JOIN::exec()+0x482
    
slide 29:
    Background: Stack Visualization
    • Stack frames can be visualized as rectangles (boxes)
    • Function names can be truncated to fit
    • In this case, color is chosen randomly (from a warm palette)
    to differentiate adjacent frames
    libc.so.1`mutex_trylock_adaptive+0x112  →  libc.so.1`mutex_trylock_...
    libc.so.1`mutex_lock_impl+0x165         →  libc.so.1`mutex_lock_imp...
    libc.so.1`mutex_lock+0xc                →  libc.so.1`mutex_lock+0xc
    mysqld`key_cache_read+0x741             →  mysqld`key_cache_read+0x741
    • A stack trace becomes a column of colored rectangles
    
slide 30:
    Background: Time Series Stacks
    • Time series ordering allows time-based pattern identification
    • However, stacks can change thousands of times per second
    Stack
    Depth
    Time (seconds)
    
slide 31:
    Background: Time Series Stacks
    • Time series ordering allows time-based pattern identification
    • However, stacks can change thousands of times per second
    One Stack
    Sample
    Stack
    Depth
    Time (seconds)
    
slide 32:
    Background: Frame Merging
    • When zoomed out, stacks appear as narrow stripes
    • Adjacent identical functions can be merged to improve
    readability, eg:
    [figure: columns of narrow truncated frames (mu..., ge..., ke...)
    merged into wider boxes: mutex_tryl..., mutex_lock_impl(),
    mutex_lock(), key_cache_read()]
    • This sometimes works: eg, a repetitive single threaded app
    • Often does not (previous slide already did this), due to code
    execution between samples or parallel thread execution
    
slide 33:
    Background: Frame Merging
    • Time-series ordering isn’t necessary for the primary use case:
    identify the most common (“hottest”) code path or paths
    • By using a different x-axis sort order, frame merging can be
    greatly improved...
    
slide 34:
    Flame Graphs
    
slide 35:
    Flame Graphs
    • Flame Graphs sort stacks alphabetically. This sort is applied
    from the bottom frame upwards. This increases merging and
    visualizes code paths.
    Stack
    Depth
    Alphabet
    
slide 36:
    Flame Graphs: Definition
    • Each box represents a function (a merged stack frame)
    • y-axis shows stack depth
    • top function led directly to the profiling event
    • everything beneath it is ancestry (explains why)
    • x-axis spans the sample population, sorted alphabetically
    • Box width is proportional to the total time a function was
    profiled directly or its children were profiled
    • All threads can be shown in the same Flame Graph (the
    default), or as separate per-thread Flame Graphs
    • Flame Graphs can be interactive: mouse over for details
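
The width rule can be checked numerically from folded output (format covered under Generation): a function's box width is proportional to the sum of counts of every stack containing it, whether it was on-CPU itself or an ancestor. A sketch with awk over made-up data:

```shell
# Made-up folded stacks: "path count"
cat > widths.folded <<'EOF'
a;b;c;e;f 50
a;b;c;e 10
a;b;d 30
a;g;h 10
EOF

# Sum sample counts for every frame of every stack; each total is
# proportional to that function's box width in the flame graph.
awk '{ n = split($1, f, ";"); for (i = 1; i <= n; i++) w[f[i]] += $2 }
     END { for (x in w) print x, w[x] }' widths.folded | sort
```

Here a (100 of 100 samples) spans the full width; f was on-CPU for 50 samples, while e's width (60) includes f's.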
    
slide 37:
    Flame Graphs: Variations
    • Profile data can be anything: CPU, I/O, memory, ...
    • Naming suggestion: [event] [units] Flame Graph
    • Eg: "FS Latency Flame Graph"
    • By default, Flame Graphs == CPU Sample Flame Graphs
    • Colors can be used for another dimension
    • by default, random colors are used to differentiate boxes
    • --hash for hash-based on function name
    • Distribution applications can be shown in the same Flame
    Graph (merge samples from multiple systems)
    
slide 38:
    Flame Graphs: A Simple Example
    • A CPU Sample Flame Graph:
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    • I’ll illustrate how these are read by posing various questions
    
slide 39:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    • Q: which function is on-CPU the most?
    
slide 40:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    f()
    d()
    top edge shows
    who is on-CPU
    directly
    e()
    c()
    h()
    b()
    g()
    a()
    • Q: which function is on-CPU the most?
    • A: f()
    e() is on-CPU a
    little, but its runtime
    is mostly spent in f(),
    which is on-CPU directly
    
slide 41:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    • Q: why is f() on-CPU?
    
slide 42:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    f() was called by e()
    e() was called by c()
    f()
    d()
    e()
    ancestry
    c()
    h()
    b()
    g()
    a()
    • Q: why is f() on-CPU?
    • A: a() → b() → c() → e() → f()
    
slide 43:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    • Q: how does b() compare to g()?
    
slide 44:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    visually compare
    lengths
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    • Q: how does b() compare to g()?
    • A: b() looks like it is running (present) about 10 times more
    often than g()
    
slide 45:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    ... or mouse over
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    status line
    or tool tip:
    b() is 90%
    • Q: how does b() compare to g()?
    • A: for interactive Flame Graphs, mouse over shows b() is
    90%, g() is 10%
    
slide 46:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    ... or mouse over
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    status line
    or tool tip:
    g() is 10%
    • Q: how does b() compare to g()?
    • A: for interactive Flame Graphs, mouse over shows b() is
    90%, g() is 10%
    
slide 47:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    • Q: why are we running f()?
    
slide 48:
    Flame Graphs: How to Read
    • A CPU Sample Flame Graph:
    look for
    branches
    f()
    d()
    e()
    c()
    h()
    b()
    g()
    a()
    • Q: why are we running f()?
    • A: code path branches can reveal key functions:
    • a() chose the b() path
    • c() chose the e() path
    
slide 49:
    Flame Graphs: Example 1
    • Customer alerting software periodically checks a log, however,
    it is taking too long (minutes).
    • It includes grep(1) of an ~18 Mbyte log file, which takes
    around 10 minutes!
    • grep(1) appears to be on-CPU for this time. Why?
    
slide 50:
    Flame Graphs: Example 1
    • CPU Sample Flame Graph for grep(1) user-level stacks:
    
slide 51:
    Flame Graphs: Example 1
    • CPU Sample Flame Graph for grep(1) user-level stacks:
    UTF8?
    • 82% of samples are in check_multibyte_string() or its children.
    This seems odd as the log file is plain ASCII.
    • And why is UTF8 on the scene? ... Oh, LANG=en_US.UTF-8
    
slide 52:
    Flame Graphs: Example 1
    • CPU Sample Flame Graph for grep(1) user-level stacks:
    UTF8?
    • Switching to LANG=C improved performance by 2000x
    • A simple example, but I did spot this from the raw profiler text
    before the Flame Graph. You really need Flame Graphs when
    the text gets too long and unwieldy.
    
slide 53:
    Flame Graphs: Example 2
    • A potential customer benchmarks disk I/O on a cloud instance.
    The performance is not as fast as hoped.
    • The host has new hardware and software. Issues with the new
    type of disks is suspected.
    
slide 54:
    Flame Graphs: Example 2
    • A potential customer benchmarks disk I/O on a cloud instance.
    The performance is not as fast as hoped.
    • The host has new hardware and software. Issues with the new
    type of disks is suspected.
    • I take a look, and notice CPU time in the kernel is modest.
    • I’d normally assume this was I/O overheads and not profile it
    yet, instead beginning with I/O latency analysis.
    • But Flame Graphs make it easy, and it may be useful to see
    what code paths (illumos kernel) are on the table.
    
slide 55:
    Flame Graphs: Example 2
    
slide 56:
    Flame Graphs: Example 2
    • 24% in tsc_read()? Time Stamp Counter? Checking ancestry...
    
slide 57:
    Flame Graphs: Example 2
    • 62% in zfs_zone_io_throttle? Oh, we had forgotten that this
    new platform had ZFS I/O throttles turned on by default!
    
slide 58:
    Flame Graphs: Example 3
    • Application performance is about half that of a competitor
    • Everything is believed identical (H/W, application, config,
    workload) except for the OS and kernel
    • Application is CPU busy, nearly 100% in user-mode. How can
    the kernel cause a 2x delta when the app isn't in kernel-mode?
    • Flame graphs on both platforms for user-mode were created:
    • Linux, using perf
    • SmartOS, using DTrace
    • Added flamegraph.pl --hash option for consistent function
    colors (not random), aiding comparisons
    
slide 59:
    Flame Graphs: Example 3
    Extra Function:
    UnzipDocid()
    Linux
    SmartOS
    • Function label formats are different, but that's just due to
    different profilers/stackcollapse.pl's (should fix this)
    • Widths slightly different, but we already know perf differs
    • Extra function? This is executing different application software!
    SphDocID_t UnzipDocid() { return UnzipOffset(); }
    • Actually, a different compiler option was eliding this function
    
slide 60:
    Flame Graphs: More Examples
    • Flame Graphs are typically
    more detailed, like the earlier
    MySQL example
    • Next, how to generate them,
    then more examples
    
slide 61:
    Generation
    
slide 62:
    Generation
    • I’ll describe the original Perl version I wrote and shared on
    github:
    • https://github.com/brendangregg/FlameGraph
    • There are other great Flame Graph implementations with
    different features and usage, which I’ll cover in the last section
    
slide 63:
    Generation: Steps
    • 1. Profile event of interest
    • 2. stackcollapse.pl
    • 3. flamegraph.pl
    
slide 64:
    Generation: Overview
    • Full command line example. This uses DTrace for CPU
    profiling of the kernel:
    # dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks
    # stackcollapse.pl out.stacks > out.folded
    # flamegraph.pl out.folded > out.svg
    • Then, open out.svg in a browser
    • Intermediate files could be avoided (piping), but they can be
    handy for some manual processing if needed (eg, using vi)
    
slide 65:
    Generation: Profiling Data
    • The profile data, at a minimum, is a series of stack traces
    • These can also include stack trace counts. Eg:
    mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
    mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    # of occurrences for this stack
    • This example is from DTrace, which prints a series of these.
    The format of each group is: stack, count, newline
    • Your profiler needs to print full (not truncated) stacks, with
    symbols. This may be step 0: get the profiler to work!
    
slide 66:
    Generation: Profiling Tools
    • Solaris/FreeBSD/SmartOS/...:
    • DTrace
    • Linux:
    • perf, SystemTap
    • OS X:
    • Instruments
    • Windows:
    • Xperf.exe
    
slide 67:
    Generation: Profiling Examples: DTrace
    • CPU profile kernel stacks at 997 Hertz, for 60 secs:
    # dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.kern_stacks
    • CPU profile user-level stacks for PID 12345 at 99 Hertz, 60s:
    # dtrace -x ustackframes=100 -n 'profile-99 /pid == 12345 && arg1/ {
    @[ustack()] = count(); } tick-60s { exit(0); }' -o out.user_stacks
    • Should also work on Mac OS X, but fixes are pending for issues
    preventing stack walking (use Instruments instead)
    • Should work for Linux one day with the DTrace ports
    
slide 68:
    Generation: Profiling Examples: perf
    • CPU profile full stacks at 97 Hertz, for 60 secs:
    # perf record -a -g -F 97 sleep 60
    # perf script > out.stacks
    • Need debug symbol packages installed (*dbgsym), otherwise
    stack frames may show as hexadecimal
    • May need compilers to cooperate (-fno-omit-frame-pointer)
    • Has both user and kernel stacks, and the kernel idle thread.
    Can filter the idle thread after stackcollapse-perf.pl using:
    # stackcollapse-perf.pl 
slide 69:
    Generation: Profiling Examples: SystemTap
    • CPU profile kernel stacks at 100 Hertz, for 60 secs:
    # stap -s 32 -D MAXTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
    probe end { foreach (i in s+) { print_stack(i);
    printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
    > out.kern_stacks
    • Need debug symbol packages installed (*dbgsym), otherwise
    stack frames may show as hexadecimal
    • May need compilers to cooperate (-fno-omit-frame-pointer)
    
slide 70:
    Generation: Dynamic Languages
    • C or C++ is usually easy to profile; runtime environments
    (JVM, node.js, ...) usually are not, needing a way to show
    program stacks and not just runtime internals.
    • Eg, DTrace’s ustack helper for node.js:
    0xfc618bc0
    0xfc61bd62
    0xfe870841
    0xfc61c1f3
    0xfc617685
    0xfe870841
    0xfc6154d7
    0xfe870e1a
    [...]
    libc.so.1`gettimeofday+0x7
    Date at position
    (anon) as exports.active at timers.js position 7590
    (anon) as Socket._write at net.js position 21336
    (anon) as Socket.write at net.js position 19714
    (anon) as OutgoingMessage._writeRaw at http.js p...
    (anon) as OutgoingMessage._send at http.js posit...
    (anon) as OutgoingMessage.end at http.js pos...
    [...]
    http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/
    
slide 71:
    Generation: stackcollapse.pl
    • Converts profile data into single-line records
    • Variants exist for DTrace, perf, SystemTap, Instruments, Xperf
    • Eg, DTrace:
    unix`i86_mwait+0xd
    unix`cpu_idle_mwait+0xf1
    unix`idle+0x114
    unix`thread_start+0x8
    # stackcollapse.pl out.stacks > out.folded
    unix`thread_start;unix`idle;unix`cpu_idle_mwait;unix`i86_mwait 19486
    
slide 72:
    Generation: stackcollapse.pl
    • Converts profile data into single-line records
    • Variants exist for DTrace, perf, SystemTap, Instruments, Xperf
    • Eg, DTrace:
    unix`i86_mwait+0xd
    unix`cpu_idle_mwait+0xf1
    unix`idle+0x114
    unix`thread_start+0x8
    # stackcollapse.pl out.stacks > out.folded
    unix`thread_start;unix`idle;unix`cpu_idle_mwait;unix`i86_mwait 19486
    stack trace, frames are ‘;’ delimited
    count
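
In spirit, this conversion is a short text transformation. A minimal awk sketch for the DTrace format (indented frame lines, then a count line), not a substitute for the real stackcollapse.pl:

```shell
# One raw DTrace aggregation group: indented frames, then the count.
cat > raw.stacks <<'EOF'
              unix`i86_mwait+0xd
              unix`cpu_idle_mwait+0xf1
              unix`idle+0x114
              unix`thread_start+0x8
            19486
EOF

# Buffer frames; on a bare-number (count) line, strip offsets, reverse
# to root-first order, join with ';', and print "stack count".
awk '/^ *[0-9]+$/ {
         s = ""
         for (i = n; i >= 1; i--) s = s (s == "" ? "" : ";") f[i]
         print s, $1; n = 0; next
     }
     NF { sub(/^ +/, ""); sub(/\+0x[0-9a-fA-F]+$/, ""); f[++n] = $0 }' raw.stacks
```

This prints the same folded line shown above: `unix`thread_start;unix`idle;unix`cpu_idle_mwait;unix`i86_mwait 19486`.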
    
slide 73:
    Generation: stackcollapse.pl
    • Full output is many lines, one line per stack
    • Bonus: can be grepped
    # ./stackcollapse-stap.pl out.stacks | grep ext4fs_dirhash
    system_call_fastpath;sys_getdents;vfs_readdir;ext4_readdir;ext4_htree_fill_tree;htree_dirblock_to_tree;ext4fs_dirhash 100
    system_call_fastpath;sys_getdents;vfs_readdir;ext4_readdir;ext4_htree_fill_tree;htree_dirblock_to_tree;ext4fs_dirhash;half_md4_transform 505
    system_call_fastpath;sys_getdents;vfs_readdir;ext4_readdir;ext4_htree_fill_tree;htree_dirblock_to_tree;ext4fs_dirhash;str2hashbuf_signed 353
    [...]
    • That shows all stacks containing ext4fs_dirhash(); useful
    debug aid by itself
    • grep can also be used to filter stacks before Flame Graphs
    • eg: grep -v cpu_idle
    
slide 74:
    Generation: Final Output
    • Desires:
    • Full control of output
    • High density detail
    • Portable: easily viewable
    • Interactive
    
slide 75:
    Generation: Final Output
    • Desires:
    • Full control of output
    • High density detail
    • Portable: easily viewable
    • Interactive
    (formats compared: PNG vs SVG+JS)
    • SVG+JS: Scalable Vector Graphics with embedded JavaScript
    • Common standards, and supported by web browsers
    • Can print poster size (scalable); but loses interactivity!
    • Can be emitted by a simple Perl program...
    
slide 76:
    Generation: flamegraph.pl
    • Converts folded stacks into an interactive SVG. Eg:
    # flamegraph.pl --titletext="Flame Graph: MySQL" out.folded > graph.svg
    • Options:
    --titletext    change the title text (default is “Flame Graph”)
    --width        width of image (default is 1200)
    --height       height of each frame (default is 16)
    --minwidth     omit functions smaller than this width (default is 0.1 pixels)
    --fonttype     font type (default “Verdana”)
    --fontsize     font size (default 12)
    --countname    count type label (default “samples”)
    --nametype     name type label (default “Function:”)
    --colors       color palette: "hot", "mem", "io"
    --hash         colors are keyed by function name hash
    
slide 77:
    Types
    
slide 78:
    Types
    • CPU
    • Memory
    • Off-CPU
    • More
    
slide 79:
    CPU
    
slide 80:
    CPU
    • Measure code paths that consume CPU
    • Helps us understand and optimize CPU usage, improving
    performance and scalability
    • Commonly performed by sampling CPU stack traces at a
    timed interval (eg, 100 Hertz, ie every 10 ms), on all CPUs
    • DTrace/perf/SystemTap examples shown earlier
    • Can also be performed by tracing function execution
    
slide 81:
    CPU: Sampling
    CPU stack sampling:
    syscall
    On-CPU
    Off-CPU
    block . . . . . . . . . interrupt
    user-level
    kernel
    
slide 82:
    CPU: Tracing
    CPU function tracing:
    syscall
    On-CPU
    Off-CPU
    block . . . . . . . . . interrupt
    user-level
    kernel
    
slide 83:
    CPU: Profiling
    • Sampling:
    • Coarse but usually effective
    • Can also be low overhead, depending on the stack type
    and sample rate, which is fixed (eg, 100 Hz x CPU count)
    • Tracing:
    • Overheads can be too high, distorting results and hurting
    the target (eg, millions of trace events per second)
    • Most Flame Graphs are generated using stack sampling
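
The fixed-rate point can be made concrete: the sample count is bounded by rate × CPUs × duration, independent of load. For an assumed 8-CPU system sampled at 97 Hertz for 60 seconds:

```shell
# Upper bound on stacks collected: rate (Hz) x CPUs x seconds
hz=97 cpus=8 secs=60
echo "$(( hz * cpus * secs )) samples"   # bounded regardless of load
```

This prints `46560 samples`; by contrast, function tracing has no such bound, since its event rate follows the workload.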
    
slide 84:
    CPU: Profiling Results
    • Example results. Could you do this?
    As an experiment to investigate the performance of the
    resulting TCP/IP implementation ... the 11/750 is CPU
    saturated, but the 11/780 has about 30% idle time. The time
    spent in the system processing the data is spread out among
    handling for the Ethernet (20%), IP packet processing (10%),
    TCP processing (30%), checksumming (25%), and user
    system call handling (15%), with no single part of the handling
    dominating the time in the system.
    
slide 85:
    CPU: Profiling Results
    • Example results. Could you do this?
    As an experiment to investigate the performance of the
    resulting TCP/IP implementation ... the 11/750 is CPU
    saturated, but the 11/780 has about 30% idle time. The time
    spent in the system processing the data is spread out among
    handling for the Ethernet (20%), IP packet processing (10%),
    TCP processing (30%), checksumming (25%), and user
    system call handling (15%), with no single part of the handling
    dominating the time in the system.
    – Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
    • An impressive report that even today would be difficult to produce
    • Flame Graphs make this a lot easier
    
slide 86:
    CPU: Another Example
    • A file system is archived using tar(1).
    • The files and directories are cached, and the run time is
    mostly on-CPU in the kernel (Linux). Where exactly?
    
slide 87:
    CPU: Another Example
    
slide 88:
    CPU: Another Example
    • 20% for reading directories
    
slide 89:
    CPU: Another Example
    • 54% for file statistics
    
slide 90:
    CPU: Another Example
    • Also good for learning kernel internals: browse the active code
    
slide 91:
    CPU: Recognition
    • Once you start profiling a target, you begin to recognize the
    common stacks and patterns
    • Linux getdents() ext4 path:
    • The next slides show similar
    example kernel-mode CPU
    Sample Flame Graphs
    
slide 92:
    CPU: Recognition: illumos localhost TCP
    • From a TCP localhost latency issue (illumos kernel):
    illumos
    fused-TCP
    receive
    illumos
    fused-TCP
    send
    
slide 93:
    CPU: Recognition: illumos IP DCE issue
    DCE
    lookup
    DCE
    lookup
    DCE
    lookup
    
slide 94:
    CPU: Recognition: Linux TCP send
    • Profiled from a KVM guest:
    Linux TCP
    sendmsg
    
slide 95:
    CPU: Recognition: Syscall Towers
    
slide 96:
    CPU: Recognition: Syscall Towers
    lstat()
    open()
    close()
    writev()
    pollsys()
    read()
    write()
    stat()
    stat64()
    bnx
    xmit
    sendfile()
    bnx
    recv
    ip fanout
    receive
    
slide 97:
    CPU: Both Stacks
    • Apart from showing either user- or kernel-level stacks, both
    can be included by stacking kernel on top of user
    • Linux perf does this by default
    • DTrace can by aggregating @[stack(), ustack()]
    • The different stacks can be highlighted in different ways:
    • different colors or hues
    • separator: flamegraph.pl will color gray any functions
    called "-", which can be inserted as stack separators
    • Kernel stacks are only present during syscalls or interrupts
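    The separator convention can be sketched on a made-up sample: the user and kernel halves of one folded stack are joined with a "-" frame, which flamegraph.pl draws gray (the stacks and count here are invented):

    ```shell
    # Join a user stack and a kernel stack for one sample with a "-"
    # separator frame; flamegraph.pl colors any frame named "-" gray.
    user='bash;read'            # user-level half (made up)
    kernel='sys_read;vfs_read'  # kernel half (made up)
    printf '%s;-;%s %d\n' "$user" "$kernel" 42
    ```

    This prints the single folded line `bash;read;-;sys_read;vfs_read 42`.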
    
slide 98:
    CPU: Both Stacks Example: KVM/qemu
    user
    only
    kernel
    stack
    user
    stack
    
slide 99:
    Advanced Flame Graphs
    
slide 100:
    Other Targets
    • Apart from CPU samples, stack traces can be collected for
    any event; eg:
    • disk, network, or FS I/O
    • CPU events, including cache misses
    • lock contention and holds
    • memory allocation
    • Other values, instead of sample counts, can also be used:
    • latency
    • bytes
    • The next sections demonstrate memory allocation, I/O tracing,
    and then all blocking types via off-CPU tracing
    
slide 101:
    Memory
    
slide 102:
    Memory
    • Analyze memory growth or leaks by tracing one of the
    following memory events:
    • 1. Allocator functions: malloc(), free()
    • 2. brk() syscall
    • 3. mmap() syscall
    • 4. Page faults
    • Instead of stacks and
    sample counts,
    measure stacks
    with byte counts
    • Merging shows total bytes by code path
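    That merge step can be sketched on synthetic folded input, with byte values in place of sample counts (the stacks and sizes are made up):

    ```shell
    # Sum bytes per identical allocation code path, so flame graph width
    # shows total bytes by path (synthetic folded input).
    printf 'mysqld;alloc_root;malloc 4096\nmysqld;alloc_root;malloc 8192\nmysqld;open_table;malloc 512\n' |
    awk '{ bytes[$1] += $2 } END { for (s in bytes) print s, bytes[s] }' | sort
    ```

    The two alloc_root allocations merge into one 12288-byte path, while open_table stays separate.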
    
slide 103:
    Memory: Four Targets
    
slide 104:
    Memory: Allocator
    • Trace malloc(), free(), realloc(), calloc(), ...
    • These operate on virtual memory
    • *alloc() stacks show why memory was first allocated (as
    opposed to populated): Memory Allocation Flame Graphs
    • With free()/realloc()/..., suspected memory leaks during tracing
    can be identified: Memory Leak Flame Graphs!
    • Down side: allocator functions are frequent, so tracing can
    slow the target somewhat (eg, 25%)
    • For comparison: Valgrind memcheck is more thorough, but its
    CPU simulation can slow the target 20 - 30x
    
slide 105:
    Memory: Allocator: malloc()
    • As a simple example, just tracing malloc() calls with user-level
    stacks and bytes requested, using DTrace:
    # dtrace -x ustackframes=100 -n 'pid$target::malloc:entry {
    @[ustack()] = sum(arg0); } tick-60s { exit(0); }' -p 529 -o out.malloc
    • malloc() Bytes Flame Graph:
    # stackcollapse.pl out.malloc | flamegraph.pl --title="malloc() bytes" \
    --countname="bytes" --colors=mem > out.malloc.svg
    • The options customize the title, countname, and color palette
    
slide 106:
    Memory: Allocator: malloc()
    
slide 107:
    Memory: Allocator: Leaks
    • Yichun Zhang developed Memory Leak Flame Graphs using
    SystemTap to trace allocator functions, and applied them to
    leaks in Nginx (web server):
    
slide 108:
    Memory: brk()
    • Many apps grow their virtual memory size using brk(), which
    sets the heap pointer
    • A stack trace on brk() shows what triggered growth
    • Eg, this script (brkbytes.d) traces brk() growth for “mysqld”:
    #!/usr/sbin/dtrace -s
    inline string target = "mysqld";
    uint brk[int];
    syscall::brk:entry /execname == target/ { self->p = arg0; }
    syscall::brk:return /arg0 == 0 && self->p && brk[pid]/ {
        @[ustack()] = sum(self->p - brk[pid]); }
    syscall::brk:return /arg0 == 0 && self->p/ { brk[pid] = self->p; }
    syscall::brk:return /self->p/ { self->p = 0; }
    
slide 109:
    Memory: brk(): Heap Expansion
    # ./brkbytes.d -n 'tick-60s { exit(0); }' > out.brk
    # stackcollapse.pl out.brk | flamegraph.pl --countname="bytes" \
    --title="Heap Expansion Flame Graph" --colors=mem > out.brk.svg
    
slide 110:
    Memory: brk()
    • brk() tracing has low overhead: these calls are typically
    infrequent
    • Reasons for brk():
    • A memory growth code path
    • A memory leak code path
    • An innocent application code path that happened to spill over the current heap size
    • Asynchronous allocator code path, that grew the
    application in response to diminishing free space
    
slide 111:
    Memory: mmap()
    • mmap() may be used by the application or its user-level
    allocator to map in large regions of virtual memory
    • It may be followed by munmap() to free the area, which can
    also be traced
    • Eg, mmap() tracing, similar to brk tracing, to show bytes and
    the stacks responsible:
    # dtrace -n 'syscall::mmap:entry /execname == "mysqld"/ {
    @[ustack()] = sum(arg1); }' -o out.mmap
    # stackcollapse.pl out.mmap | flamegraph.pl --countname="bytes" \
    --title="mmap() bytes Flame Graph" --colors=mem > out.mmap.svg
    • This should be low overhead – depends on the frequency
    
slide 112:
    Memory: Page Faults
    • brk() and mmap() expand virtual memory
    • Page faults expand physical memory (RSS). This is demand-based
    allocation, deferring mapping until the actual write
    • Tracing page faults shows the stacks responsible for consuming
    (writing to) memory:
    # dtrace -x ustackframes=100 -n 'vminfo:::as_fault /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }' > out.fault
    # stackcollapse.pl out.fault | flamegraph.pl --countname=pages \
    --title="Page Fault Flame Graph" --colors=mem > mysqld_fault.svg
    
slide 113:
    Memory: Page Faults
    
slide 114:
    I/O
    
slide 115:
    I/O
    • Show time spent in I/O, eg, storage I/O
    • Measure I/O completion events with stacks and their latency;
    merging to show total time waiting by code path
    Application
    system calls
    VFS
    Logical I/O:
    Measure here for user stacks,
    and real application latency
    Block Device Interface
    Disks
    Physical I/O:
    Measure here for kernel stacks,
    and disk I/O latency
    
slide 116:
    I/O: Logical I/O Latency
    • For example, ZFS call latency using DTrace (zfsustacks.d):
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    #pragma D option ustackframes=100
    fbt::zfs_read:entry, fbt::zfs_write:entry,
    fbt::zfs_readdir:entry, fbt::zfs_getattr:entry,
    fbt::zfs_setattr:entry
    {
    	self->start = timestamp;	/* timestamp from function start (entry) */
    }
    fbt::zfs_read:return, fbt::zfs_write:return,
    fbt::zfs_readdir:return, fbt::zfs_getattr:return,
    fbt::zfs_setattr:return
    /self->start/
    {
    	this->time = timestamp - self->start;	/* ... to function end (return) */
    	@[ustack(), execname] = sum(this->time);
    	self->start = 0;
    }
    dtrace:::END
    {
    	printa("%k%s\n%@d\n", @);
    }
    
slide 117:
    I/O: Logical I/O Latency
    • Making an I/O Time Flame Graph:
    # ./zfsustacks.d -n 'tick-10s { exit(0); }' -o out.iostacks
    # stackcollapse.pl out.iostacks | awk '{ print $1, $2 / 1000000 }' | \
    flamegraph.pl --title="FS I/O Time Flame Graph" --colors=io \
    --countname=ms --width=500 > out.iostacks.svg
    • DTrace script measures all processes, for 10 seconds
    • awk to convert ns to ms
    
slide 118:
    I/O: Time Flame Graph: gzip
    • gzip(1) waits more time in write()s than read()s
    
slide 119:
    I/O: Time Flame Graph: MySQL
    
slide 120:
    I/O: Flame Graphs
    • I/O latency tracing: hugely useful
    • But once you pick an I/O type, there usually aren't that many
    different code paths calling it
    • Flame Graphs are nice, but often not necessary
    
slide 121:
    Off-CPU
    
slide 122:
    Off-CPU
    Off-CPU tracing:
    off-CPU
    on-CPU
    syscall
    On-CPU
    Off-CPU
    block . . . . . . . . . interrupt
    user-level
    kernel
    
slide 123:
    Off-CPU: Performance Analysis
    • Generic approach for all blocking events, including I/O
    • An advanced performance analysis methodology:
    • http://dtrace.org/blogs/brendan/2011/07/08/off-cpu-performance-analysis/
    • Counterpart to (on-)CPU profiling
    • Measure time a thread spent off-CPU, along with stacks
    • Off-CPU reasons:
    • Waiting (sleeping) on I/O, locks, timers
    • Runnable waiting for CPU
    • Runnable waiting for page/swap-ins
    • The stack trace will explain which
    
slide 124:
    Off-CPU: Time Flame Graphs
    • Off-CPU profiling data (durations and stacks) can be rendered
    as Off-CPU Time Flame Graphs
    • As this involves many more code paths, Flame Graphs are
    usually really useful
    • Yichun Zhang created these, and has been using them on
    Linux with SystemTap to collect the profile data. See:
    • http://agentzh.org/misc/slides/off-cpu-flame-graphs.pdf
    • Which describes their uses for Nginx performance analysis
    
slide 125:
    Off-CPU: Profiling
    • Example of off-CPU profiling for the bash shell:
    # dtrace -x ustackframes=100 -n '
    sched:::off-cpu /execname == "bash"/ { self->ts = timestamp; }
    sched:::on-cpu /self->ts/ {
    @[ustack()] = sum(timestamp - self->ts); self->ts = 0; }
    tick-30s { exit(0); }' -o out.offcpu
    • Traces time from when a thread switches off-CPU to when it
    returns on-CPU, with user-level stacks. ie, time blocked or
    sleeping
    • Off-CPU Time Flame Graph:
    # stackcollapse.pl out.offcpu | awk '{ print $1, $2 / 1000000 }' | \
    flamegraph.pl --title="Off-CPU Time Flame Graph" --countname=ms > out.offcpu.svg
    • This uses awk to convert nanoseconds into milliseconds
    
slide 126:
    Off-CPU: Bash Shell
    
slide 127:
    Off-CPU: Bash Shell
    waiting for
    child processes
    waiting for
    keystrokes
    
slide 128:
    Off-CPU: Bash Shell
    • For that simple example, the trace
    data was so short it could have
    just been read (54 lines, 4 unique
    stacks):
    • For multithreaded applications,
    idle thread time can dominate
    • For example, an idle MySQL
    server...
    libc.so.1`__forkx+0xb
    libc.so.1`fork+0x1d
    bash`make_child+0xb5
    bash`execute_simple_command+0xb02
    bash`execute_command_internal+0xae6
    bash`execute_command+0x45
    bash`reader_loop+0x240
    bash`main+0xaff
    bash`_start+0x83
    libc.so.1`syscall+0x13
    bash`file_status+0x19
    bash`find_in_path_element+0x3e
    bash`find_user_command_in_path+0x114
    bash`find_user_command_internal+0x6f
    bash`search_for_command+0x109
    bash`execute_simple_command+0xa97
    bash`execute_command_internal+0xae6
    bash`execute_command+0x45
    bash`reader_loop+0x240
    bash`main+0xaff
    bash`_start+0x83
    libc.so.1`__waitid+0x15
    libc.so.1`waitpid+0x65
    bash`waitchld+0x87
    bash`wait_for+0x2ce
    bash`execute_command_internal+0x1758
    bash`execute_command+0x45
    bash`reader_loop+0x240
    bash`main+0xaff
    bash`_start+0x83
    libc.so.1`__read+0x15
    bash`rl_getc+0x2b
    bash`rl_read_key+0x22d
    bash`readline_internal_char+0x113
    bash`readline+0x49
    bash`yy_readline_get+0x52
    bash`shell_getc+0xe1
    bash`read_token+0x6f
    bash`yyparse+0x4b9
    bash`parse_command+0x67
    bash`read_command+0x52
    bash`reader_loop+0xa5
    bash`main+0xaff
    bash`_start+0x83
    
slide 129:
    Off-CPU: MySQL Idle
    
slide 130:
    Off-CPU: MySQL Idle
    Columns from
    _thrp_setup
    are threads or
    thread groups
    MySQL gives thread routines
    descriptive names (thanks!)
    Mouse over each to identify
    (profiling time was 30s)
    
slide 131:
    Off-CPU: MySQL Idle
    buf_flush_page_cleaner_thread
    mysqld_main
    dict_stats_thread
    srv_monitor_thread
    fts_optimize_thread
    srv_master_thread
    io_handler_thread
    srv_error_monitor_thread
    lock_wait_timeout_thread pfs_spawn_thread
    mysqld Threads
    
slide 132:
    Off-CPU: MySQL Idle
    • Some thread columns are wider than the measurement time:
    evidence of multiple threads
    • This can be shown a number of ways. Eg, adding process
    name, PID, and TID to the top of each user stack:
    #!/usr/sbin/dtrace -s
    #pragma D option ustackframes=100
    sched:::off-cpu /execname == "mysqld"/ { self->ts = timestamp; }
    sched:::on-cpu
    /self->ts/
    {
    	@[execname, pid, curlwpsinfo->pr_lwpid, ustack()] =
    	    sum(timestamp - self->ts);
    	self->ts = 0;
    }
    dtrace:::END { printa("\n%s-%d/%d%k%@d\n", @); }
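    The effect of the name-PID/TID prefix on the folded output can be sketched with made-up data: with "name-PID/TID" as the first frame, identical per-thread code paths stay as separate towers when stacks are merged, instead of collapsing into one:

    ```shell
    # Merge folded stacks (made-up PIDs/TIDs): the per-thread prefix frame
    # keeps otherwise-identical code paths distinct.
    printf 'mysqld-1201/5;thread_main;do_work 300\nmysqld-1201/6;thread_main;do_work 300\n' |
    awk '{ sum[$1] += $2 } END { for (s in sum) print s, sum[s] }' | sort
    ```

    Both lines survive the merge, one tower per thread.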
    
slide 133:
    Off-CPU: MySQL Idle
    1 thread
    many threads
    2 threads
    4 threads doing work
    (less idle)
    thread ID (TID)
    
slide 134:
    Off-CPU: Challenges
    • Including multiple threads in one Flame Graph might still be
    confusing. Separate Flame Graphs for each can be created
    • Off-CPU stacks often don't explain themselves:
    • This is blocked on a conditional variable. The real reason it is
    blocked and taking time isn't visible here
    • Now lets look at a busy MySQL server, which presents
    another challenge...
    
slide 135:
    Off-CPU: MySQL Busy
    net_read_packet() -> pollsys()
    idle threads
    
slide 136:
    Off-CPU: MySQL Busy
    random narrow
    stacks during
    work, with no
    reason to
    sleep?
    
slide 137:
    Off-CPU: MySQL Busy
    • Those were user-level stacks only. The kernel-level stack,
    which can be included, will usually explain what happened
    • eg, involuntary context switch due to time slice expired
    • Those paths are likely hot in the CPU Sample Flame Graph
    
slide 138:
    Hot/Cold
    
slide 139:
    Hot/Cold: Profiling
    On-CPU
    Profiling
    Off-CPU
    Profiling
    (everything else)
    Thread State Transition Diagram
    
slide 140:
    Hot/Cold: Profiling
    • Profiling both on-CPU and off-CPU stacks shows everything
    • In my LISA'12 talk I called this the Stack Profile Method:
    profile all stacks
    • Both on-CPU ("hot") and off-CPU ("cold") stacks can be
    included in the same Flame Graph, colored differently:
    Hot Cold Flame Graphs!
    • Merging multiple threads gets even weirder. Creating a
    separate graph per-thread makes much more sense, allowing
    comparison of how a thread's time is divided between
    on- and off-CPU activity
    • For example, a single web server thread with kernel stacks...
    
slide 141:
    Hot/Cold: Flame Graphs
    
slide 142:
    Hot/Cold: Flame Graphs
    On-CPU (!?)
    Off-CPU
    
slide 143:
    Hot/Cold: Challenges
    • Sadly, this often doesn't work well for two reasons:
    • 1. On-CPU time columns get compressed by off-CPU time
    • Previous example dominated by the idle path – waiting for
    a new connection – which is not very interesting!
    • Works better with zoomable Flame Graphs, but then we've
    lost the ability to see key details on first glance
    • Pairs of on-CPU and off-CPU Flame Graphs may be the
    best approach, giving both the full width
    • 2. Has the same challenge from off-CPU Flame Graphs:
    real reason for blocking may not be visible
    
slide 144:
    State of the Art
    • That was the end of Flame Graphs, but I can't stop here –
    we're so close
    • On + Off-CPU Flame Graphs can attack any issue
    • 1. The compressed problem is solvable via one or more of:
    • zoomable Flame Graphs
    • separate on- and off-CPU Flame Graphs
    • per-thread Flame Graphs
    • 2. How do we show the real reason for blocking?
    
slide 145:
    Wakeup Tracing
    Wakeup tracing:
    sleep
    wakeup
    user-level
    kernel
    On-CPU
    Off-CPU
    block . . . . . . . . . . . . . wakeup
    
slide 146:
    Tracing Wakeups
    • The system knows who woke up whom
    • Tracing who performed the wakeup – and their stack – can
    show the real reason for waiting
    • Wakeup Latency Flame Graph
    • Advanced activity
    • Consider overheads – might trace too much
    • Eg, consider ssh, starting with the Off-CPU Time Flame Graph
    
slide 147:
    Off-CPU Time Flame Graph: ssh
    Waiting on a conditional variable
    But why did we wait this long?
    Object sleeping on
    
slide 148:
    Wakeup Latency Flame Graph: ssh
    
slide 149:
    Wakeup Latency Flame Graph: ssh
    These code paths,
    ... woke up
    these objects
    
slide 150:
    Tracing Wakeup, Example (DTrace)
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    #pragma D option ustackframes=100
    #pragma D option stackframes=100
    int related[uint64_t];
    
    /* This example targets sshd (the previous example also matched vmstat,
       after discovering that sshd was blocked on vmstat, which it was:
       "vmstat 1") */
    sched:::sleep
    /execname == "sshd"/
    {
    	ts[curlwpsinfo->pr_addr] = timestamp;
    }
    
    /* time from sleep to wakeup */
    sched:::wakeup
    /ts[args[0]->pr_addr]/
    {
    	this->d = timestamp - ts[args[0]->pr_addr];
    	/* stack traces of who is doing the waking */
    	@[args[1]->pr_fname, args[1]->pr_pid, args[0]->pr_lwpid, args[0]->pr_wchan,
    	    stack(), ustack(), execname, pid, curlwpsinfo->pr_lwpid] = sum(this->d);
    	ts[args[0]->pr_addr] = 0;
    }
    
    dtrace:::END
    {
    	printa("\n%s-%d/%d-%x%k-%k%s-%d/%d\n%@d\n", @);
    }
    /* aggregate if possible instead of dumping, to minimize overheads */
    
slide 151:
    Following Stack Chains
    • 1st level of wakeups often not enough
    • Would like to programmatically follow multiple chains of
    wakeup stacks, and visualize them
    • I've discussed this with others before – it's a hard problem
    • The following is in development!: Chain Graph
    
slide 152:
    Chain Graph
    
slide 153:
    Chain Graph
    Wakeup Thread 2
    I woke up
    Wakeup Thread 1
    I woke up
    Wakeup Stacks
    why I waited
    Off CPU Stacks:
    why I blocked
    
slide 154:
    Chain Graph Visualization
    • New, experimental; check for later improvements
    • Stacks associated based on sleeping object address
    • Retains the value of relative widths equals latency
    • Wakeup stacks frames can be listed in reverse (may be less
    confusing when following towers bottom-up)
    • Towers can get very tall, tracing wakeups through different
    software threads, back to metal
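    The frame-reversal step can be sketched on a synthetic folded line (the stack here is made up):

    ```shell
    # Reverse the frame order of one folded wakeup stack, so towers read
    # bottom-up when stacked on top of the off-CPU stacks.
    printf 'main;poll_loop;cv_signal 10\n' |
    awk '{
        n = split($1, f, ";")
        s = ""
        for (i = n; i >= 1; i--) s = s (s == "" ? "" : ";") f[i]
        print s, $2
    }'
    ```

    The line comes out as `cv_signal;poll_loop;main 10`, with the value column untouched.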
    
slide 155:
    Following Wakeup Chains, Example (DTrace)
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    #pragma D option ustackframes=100
    #pragma D option stackframes=100
    int related[uint64_t];
    sched:::sleep
    /execname == "sshd" || related[curlwpsinfo->gt;pr_addr]/
    ts[curlwpsinfo->gt;pr_addr] = timestamp;
    sched:::wakeup
    /ts[args[0]->gt;pr_addr]/
    this->gt;d = timestamp - ts[args[0]->gt;pr_addr];
    @[args[1]->gt;pr_fname, args[1]->gt;pr_pid, args[0]->gt;pr_lwpid, args[0]->gt;pr_wchan,
    stack(), ustack(), execname, pid, curlwpsinfo->gt;pr_lwpid] = sum(this->gt;d);
    ts[args[0]->gt;pr_addr] = 0;
    related[curlwpsinfo->gt;pr_addr] = 1;
    dtrace:::END
    printa("\n%s-%d/%d-%x%k-%k%s-%d/%d\n%@d\n", @);
    Also follow who
    wakes up the waker
    
slide 156:
    Developments
    
slide 157:
    Developments
    • There have been many other great developments in the world
    of Flame Graphs. The following is a short tour.
    
slide 158:
    node.js Flame Graphs
    • Dave Pacheco developed the DTrace ustack helper for v8,
    and created Flame Graphs with node.js functions
    http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/
    
slide 159:
    OS X Instruments Flame Graphs
    • Mark Probst developed a
    way to produce Flame
    Graphs from Instruments
    1. Use the Time Profile instrument
    2. Instrument -> Export Track
    3. stackcollapse-instruments.pl
    4. flamegraphs.pl
    http://schani.wordpress.com/2012/11/16/flame-graphs-for-instruments/
    
slide 160:
    Ruby Flame Graphs
    • Sam Saffron developed Flame Graphs with the Ruby
    MiniProfiler
    • These stacks are very
    deep (many frames),
    so the function names
    have been dropped
    and only the rectangles
    are drawn
    • This preserves the
    value of seeing the
    big picture at first
    glance!
    http://samsaffron.com/archive/2013/03/19/flame-graphs-in-ruby-miniprofiler
    
slide 161:
    Windows Xperf Flame Graphs
    • Bruce Dawson developed Flame Graphs from Xperf data, and
    an xperf_to_collapsedstacks.py script
    Visual Studio CPU Usage
    http://randomascii.wordpress.com/2013/03/26/summarizing-xperf-cpu-usage-with-flame-graphs/
    
slide 162:
    WebKit Web Inspector Flame Charts
    • Available in Google Chrome developer tools, these show
    JavaScript CPU stacks as colored rectangles
    • Inspired by Flame Graphs but
    not the same: they show the
    passage of time on the x-axis!
    • This generally works here as:
    • the target is single threaded
    apps often with repetitive
    code paths
    • ability to zoom
    • Can a "Flame Graph" mode be
    provided for the same data?
    https://bugs.webkit.org/show_bug.cgi?id=111162
    
slide 163:
    Perl Devel::NYTProf Flame Graphs
    • Tim Bunce has been adding Flame Graph features, and
    included them in the Perl profiler: Devel::NYTProf
    http://blog.timbunce.org/2013/04/08/nytprof-v5-flaming-precision/
    
slide 164:
    Leak and Off-CPU Time Flame Graphs
    • Yichun Zhang (agentzh) has created Memory Leak and Off-CPU Time Flame Graphs, and has given good talks to explain
    how Flame Graphs work
    http://agentzh.org/#Presentations
    http://agentzh.org/misc/slides/yapc-na-2013-flame-graphs.pdf
    http://www.youtube.com/watch?v=rxn7HoNrv9A
    http://agentzh.org/misc/slides/off-cpu-flame-graphs.pdf
    http://agentzh.org/misc/flamegraph/nginx-leaks-2013-10-08.svg
    https://github.com/agentzh/nginx-systemtap-toolkit
    ... these
    also provide
    examples of using
    SystemTap on Linux
    
slide 165:
    Color Schemes
    • Colors can be used to convey data, instead of the default
    random color scheme. This example from Dave Pacheco
    colors each function by its degree of direct on-CPU execution
    • A Flame Graph
    tool could let you
    select different
    color schemes
    • Another can be:
    color by a hash on
    the function name,
    so colors are
    consistent
    https://npmjs.org/package/stackvis
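    A hypothetical sketch of hash-keyed coloring (the palette math is invented, and cksum merely stands in for a hash function): the same function name always yields the same color, across graphs and across runs:

    ```shell
    # Derive a stable color from a checksum of each function name, so
    # colors are consistent between flame graphs (made-up palette math).
    for f in do_command net_read do_command; do
        h=$(printf '%s' "$f" | cksum | awk '{ print $1 }')   # stable CRC
        printf '%s rgb(%d,%d,0)\n' "$f" $((200 + h % 56)) $((h % 230))
    done
    ```

    The first and third lines (both do_command) get identical colors, which is the whole point of keying by name.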
    
slide 166:
    Zoomable Flame Graphs
    • Dave Pacheco has also used d3 to provide click to zoom!
    Zoom
    https://npmjs.org/package/stackvis
    
slide 167:
    Flame Graph Differentials
    • Robert Mustacchi has been experimenting with showing the
    difference between two Flame Graphs, as a Flame Graph.
    Great potential for non-regression testing, and comparisons!
    
slide 168:
    Flame Graphs as a Service
    • Pedro Teixeira has a project for node.js Flame Graphs as a
    service: automatically generated for each github push!
    http://www.youtube.com/watch?v=sMohaWP5YqA
    
slide 169:
    References & Acknowledgements
    • Neelakanth Nadgir (realneel): developed SVGs using Ruby
    and JavaScript of time-series function trace data with stack
    levels, inspired by Roch's work
    • Roch Bourbonnais: developed Call Stack Analyzer, which
    produced similar time-series visualizations
    • Edward Tufte: inspired
    me to explore
    visualizations that show
    all the data at once, as
    Flame Graphs do
    • Thanks to all who have
    developed Flame
    Graphs further!
    realneel's function_call_graph.rb visualization
    
slide 170:
    Thank you!
    • Questions?
    • Homepage: http://www.brendangregg.com (links to everything)
    • Resources and further reading:
    • http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/: see "Updates"
    • http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flamegraphs/
    • http://dtrace.org/blogs/brendan/2013/08/16/memory-leak-growth-flame-graphs/
    • http://dtrace.org/blogs/brendan/2011/07/08/off-cpu-performance-analysis/
    • http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/