Kernel Recipes 2017: Using Linux perf at Netflix

Video: https://www.youtube.com/watch?v=UVM3WX8Lq2k

Talk for Kernel Recipes 2017 by Brendan Gregg.

Description: "Linux perf is a crucial performance analysis tool at Netflix, and is used by a self-service GUI for generating CPU flame graphs and other reports. This sounds like an easy task, however, getting perf to work properly in VM guests running Java, Node.js, containers, and other software, has been at times a challenge. This talk summarizes Linux perf, how we use it at Netflix, the various gotchas we have encountered, and a summary of advanced features."


PDF: KernelRecipes_Perf_Events.pdf

Keywords (from pdftotext):

slide 1:
    Using Linux perf
    at Netflix
    Brendan Gregg
    Senior Performance Architect
    Sep 2017
    
slide 2:
    Case Study: ZFS is eating my CPU
    • Easy to debug using Netflix Vector & flame graphs
    • How I expected it to look:
    
slide 3:
    Case Study: ZFS is eating my CPU (cont.)
    • How it really looked:
    Application (truncated)
    38% kernel time (why?)
    
slide 4:
    Case Study: ZFS is eating my CPU (cont.)
    Zoomed:
    • ZFS ARC (adaptive replacement cache) reclaim.
    • But… ZFS is not in use. No pools, datasets, or ARC buffers.
    • CPU time is in random entropy, picking which (empty) list to evict.
    Bug: https://github.com/zfsonlinux/zfs/issues/6531
    
slide 5:
slide 6:
    Agenda
    1. Why Netflix Needs Linux Profiling
    2. perf Basics
    3. CPU Profiling & Gotchas
    Stacks (gcc, Java)
    Symbols (Node.js, Java)
    Guest PMCs
    PEBS
    Overheads
    4. perf Advanced
    
slide 7:
    1. Why Netflix Needs Linux Profiling
    
slide 8:
    Understand CPU usage quickly and completely
    
slide 9:
    Quickly
    Eg, Netflix Vector (self-service UI):
    Flame Graphs
    Heat Maps
    
slide 10:
    Completely
    CPU Flame Graph
    Kernel
    (C)
    User
    (C)
    JVM
    (C++)
    Java
    
slide 11:
    Why Linux perf?
    • Available
    – Linux, open source
    • Low overhead
    – Tunable sampling, ring buffers
    • Accurate
    – Application-based samplers don't know what's really RUNNING; eg, Java and epoll
    • No blind spots
    – See user, library, kernel with CPU sampling
    – With some work: hardirqs & SMI as well
    • No sample skew
    – Unlike Java safety point skew
    
slide 12:
    Why is this so important
    • We typically scale microservices based on %CPU
    – Small %CPU improvements can mean big $avings
    • CPU profiling is used by many activities
    Explaining regressions in new software versions
    Incident response
    3rd party software evaluations
    Identify performance tuning targets
    Part of CPU workload characterization
    • perf does lots more, but we spend ~95% of our time looking
    at CPU profiles, and 5% on everything else
    – With new BPF capabilities (off-CPU analysis), that might go from 95 to 90%
    
slide 13:
    CPU profiling should be easy, but…
    JIT runtimes
    no frame pointers
    no debuginfo
    stale symbol maps
    container namespaces
    
slide 14:
    2. perf Basics
    
slide 15:
    perf (aka "perf_events")
    • The official Linux profiler
    – In the linux-tools-common package
    – Source code & docs in Linux: tools/perf
    • Supports many profiling/tracing features:
    CPU Performance Monitoring Counters (PMCs)
    Statically defined tracepoints
    User and kernel dynamic tracing
    Kernel line and local variable tracing
    Efficient in-kernel counts and filters
    Stack tracing, libunwind
    Code annotation
    • Some bugs in the past; has been stable for us
    perf_events
    ponycorn
    
slide 16:
    A Multitool of Subcommands
    # perf
    usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
    The most commonly used perf commands are:
    annotate
    Read perf.data (created by perf record) and display annotated code
    archive
    Create archive with object files with build-ids found in perf.data file
    bench
    General framework for benchmark suites
    buildid-cache
    Manage build-id cache.
    buildid-list
    List the buildids in a perf.data file
    c2c
    Shared Data C2C/HITM Analyzer.
    config
    Get and set variables in a configuration file.
    data
    Data file related processing
    diff
    Read perf.data files and display the differential profile
    evlist
    List the event names in a perf.data file
    ftrace
    simple wrapper for kernel's ftrace functionality
    inject
    Filter to augment the events stream with additional information
    kallsyms
    Searches running kernel for symbols
    kmem
    Tool to trace/measure kernel memory properties
    kvm
    Tool to trace/measure kvm guest os
    list
    List all symbolic event types
    lock
    Analyze lock events
    mem
    Profile memory accesses
    record
    Run a command and record its profile into perf.data
    report
    Read perf.data (created by perf record) and display the profile
    sched
    Tool to trace/measure scheduler properties (latencies)
    script
    Read perf.data (created by perf record) and display trace output
    stat
    Run a command and gather performance counter statistics
    test
    Runs sanity tests.
    timechart
    Tool to visualize total system behavior during a workload
    top
    System profiling tool.
    probe
    Define new dynamic tracepoints
    trace
    strace inspired tool
    See 'perf help COMMAND' for more information on a specific command.
    from Linux 4.13
    
slide 17:
    perf Basic Workflow
    list   -> find events
    stat   -> count them
    record -> write event data to file
    report -> browse summary
    script -> event dump for post processing
    
slide 18:
    Basic Workflow Example
    # perf list sched:*
    […]
    sched:sched_process_exec
    [Tracepoint event]
    […]
    # perf stat -e sched:sched_process_exec -a -- sleep 10
    Performance counter stats for 'system wide':
    sched:sched_process_exec
    1. found an event of interest
    2. 19 per 10 sec is a very low
    rate, so safe to record
    10.001327817 seconds time elapsed
    # perf record -e sched:sched_process_exec -a -g -- sleep 10
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.212 MB perf.data (21 samples) ]
    # perf report -n --stdio
    # Children   Self     Samples   Trace output
    # ........ ........ ............ .................................................
    4.76%   4.76%   1   filename=/bin/bash pid=7732 old_pid=7732
    ---_start
    return_from_SYSCALL_64
    do_syscall_64
    sys_execve
    do_execveat_common.isra.35
    […]
    # perf script
    sleep 7729 [003] 632804.699184: sched:sched_process_exec: filename=/bin/sleep pid=7729 old_pid=7729
    44b97e do_execveat_common.isra.35 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    44bc01 sys_execve (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    203acb do_syscall_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    acd02b return_from_SYSCALL_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    c30 _start (/lib/x86_64-linux-gnu/ld-2.23.so)
    […]
    3. 21 samples captured
    4. summary style may be
    sufficient, or,
    5. script output in time order
    
slide 19:
    perf stat/record Format
    • These have three main parts: action, event, scope.
    • e.g., profiling on-CPU stack traces:
    Action: record stack traces
    perf record -F 99 -a -g -- sleep 10
    Scope: all CPUs
    Event: 99 Hertz
    Note: sleep 10 is a dummy command to set the duration
    
slide 20:
    perf Actions
    • Count events (perf stat …)
    – Uses an efficient in-kernel counter, and prints the results
    • Sample events (perf record …)
    – Records details of every event to a dump file (perf.data)
    • Timestamp, CPU, PID, instruction pointer, …
    – This incurs higher overhead, relative to the rate of events
    – Include the call graph (stack trace) using -g
    • Other ac/ons include:
    List events (perf list)
    Report from a perf.data file (perf report)
    Dump a perf.data file as text (perf script)
    top style profiling (perf top)
    
slide 21:
    perf Events
    • Custom timers
    – e.g., 99 Hertz (samples per second)
    • Hardware events
    – CPU Performance Monitoring Counters (PMCs)
    • Tracepoints
    – Statically defined in software
    • Dynamic tracing
    – Created using uprobes (user) or kprobes (kernel)
    – Can do kernel line tracing with local variables (needs kernel debuginfo)
    
slide 22:
    perf Events: Map
    
slide 23:
    perf Events: List
    # perf list
    List of pre-defined events (to be used in -e):
    cpu-cycles OR cycles                               [Hardware event]
    instructions                                       [Hardware event]
    cache-references                                   [Hardware event]
    cache-misses                                       [Hardware event]
    branch-instructions OR branches                    [Hardware event]
    branch-misses                                      [Hardware event]
    bus-cycles                                         [Hardware event]
    stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
    stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
    […]
    cpu-clock                                          [Software event]
    task-clock                                         [Software event]
    page-faults OR faults                              [Software event]
    context-switches OR cs                             [Software event]
    cpu-migrations OR migrations                       [Software event]
    […]
    L1-dcache-loads                                    [Hardware cache event]
    L1-dcache-load-misses                              [Hardware cache event]
    L1-dcache-stores                                   [Hardware cache event]
    […]
    skb:kfree_skb                                      [Tracepoint event]
    skb:consume_skb                                    [Tracepoint event]
    skb:skb_copy_datagram_iovec                        [Tracepoint event]
    net:net_dev_xmit                                   [Tracepoint event]
    net:net_dev_queue                                  [Tracepoint event]
    net:netif_receive_skb                              [Tracepoint event]
    net:netif_rx                                       [Tracepoint event]
    […]
    
slide 24:
    perf Scope
    System-wide: all CPUs (-a)
    Target PID (-p PID)
    Target command (…)
    Specific CPUs (-c …)
    User-level only (<event>:u)
    Kernel-level only (<event>:k)
    A custom filter to match variables (--filter …)
    This cgroup (container) only (--cgroup …)
    
slide 25:
    One-Liners: Listing Events
    # Listing all currently known events:
    perf list
    # Searching for "sched" tracepoints:
    perf list | grep sched
    # Listing sched tracepoints:
    perf list 'sched:*'
    Dozens of perf one-liners:
    http://www.brendangregg.com/perf.html#OneLiners
    
slide 26:
    One-Liners: Counting Events
    # CPU counter statistics for the specified command:
    perf stat command
    # CPU counter statistics for the entire system, for 5 seconds:
    perf stat -a sleep 5
    # Detailed CPU counter statistics for the specified PID, until Ctrl-C:
    perf stat -dp PID
    # Various CPU last level cache statistics for the specified command:
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
    # Count system calls for the specified PID, until Ctrl-C:
    perf stat -e 'syscalls:sys_enter_*' -p PID
    # Count block device I/O events for the entire system, for 10 seconds:
    perf stat -e 'block:*' -a sleep 10
    # Show system calls by process, refreshing every 2 seconds:
    perf top -e raw_syscalls:sys_enter -ns comm
    
slide 27:
    One-Liners: Profiling Events
    # Sample on-CPU functions for the specified command, at 99 Hertz:
    perf record -F 99 command
    # Sample CPU stack traces for the specified PID, at 99 Hertz, for 10 seconds:
    perf record -F 99 -p PID -g -- sleep 10
    # Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds:
    perf record -F 99 -ag -- sleep 10
    # Sample CPU stacks, once every 10,000 Level 1 data cache misses, for 5 secs:
    perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
    # Sample CPU stack traces, once every 100 last level cache misses, for 5 secs:
    perf record -e LLC-load-misses -c 100 -ag -- sleep 5
    # Sample on-CPU kernel instructions, for 5 seconds:
    perf record -e cycles:k -a -- sleep 5
    # Sample on-CPU user instructions, for 5 seconds:
    perf record -e cycles:u -a -- sleep 5
    
slide 28:
    One-Liners: Repor/ng
    # Show perf.data in an ncurses browser (TUI) if possible:
    perf report
    # Show perf.data with a column for sample count:
    perf report -n
    # Show perf.data as a text report, with data coalesced and percentages:
    perf report --stdio
    # List all raw events from perf.data:
    perf script
    # List all raw events from perf.data, with customized fields:
    perf script -f comm,tid,pid,time,cpu,event,ip,sym,dso
    # Dump raw contents from perf.data as hex (for debugging):
    perf script -D
    # Disassemble and annotate instructions with percentages (needs debuginfo):
    perf annotate --stdio
    
slide 29:
    3. CPU Profiling
    
slide 30:
    CPU Profiling
    • Record stacks at a timed interval: simple and effective
    – Pros: Low (deterministic) overhead
    – Cons: Coarse accuracy, but usually sufficient
    stack
    samples:
    syscall
    on-CPU
    time
    off-CPU
    block
    interrupt
    
slide 31:
    perf Record
    # perf record -F 99 -ag -- sleep 30
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
    # perf report -n --stdio
    1.40%
    java [kernel.kallsyms]
    [k] _raw_spin_lock
    --- _raw_spin_lock
    |--63.21%-- try_to_wake_up
    |--63.91%-- default_wake_function
    |--56.11%-- __wake_up_common
    __wake_up_locked
    ep_poll_callback
    __wake_up_common
    __wake_up_sync_key
    |--59.19%-- sock_def_readable
    […78,000 lines truncated…]
    Sampling full
    stack traces
    at 99 Hertz
    
slide 32:
    perf Reporting
    • perf report summarizes by combining common paths
    • Previous output truncated 78,000 lines of summary
    • The following is what a mere 8,000 lines looks like…
    
slide 33:
    perf report
    
slide 34:
    … as a Flame Graph
    
slide 35:
    Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension
    • e.g., software type, or difference between two profiles for
    non-regression testing ("differential flame graphs")
    – interpretation: top edge is on-CPU, beneath it is ancestry
    • Just a Perl program to convert perf stacks into SVG
    – Includes JavaScript: open in a browser for interactivity
    • Easy to get working: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
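The intermediate "folded" format that stackcollapse-perf.pl emits and flamegraph.pl reads is worth knowing when debugging this pipeline. A minimal sketch (the stacks and counts below are made up for illustration):

```shell
# Hedged sketch of the folded format: one line per unique stack, frames
# joined by ";" (root first), then a space and the sample count.
cat > folded.txt <<'EOF'
java;start_thread;JavaMain;Foo.work 45
java;start_thread;JavaMain;Foo.idle 5
swapper;cpu_idle 50
EOF
# The counts should sum to the profile's total sample count:
awk '{ sum += $NF } END { print sum }' folded.txt
```

Because the format is plain text, folded files can be filtered (e.g., grep for one process name) before rendering, and flamegraph.pl will happily draw the subset.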
    
slide 36:
    flamegraph.pl Options
    $ flamegraph.pl --help
    USAGE: flamegraph.pl [options] infile > outfile.svg
    --title TEXT
    # change title text
    --subtitle TEXT # second level title (optional)
    --width NUM
    # width of image (default 1200)
    --height NUM
    # height of each frame (default 16)
    --minwidth NUM
    # omit smaller functions (default 0.1 pixels)
    --fonttype FONT # font type (default "Verdana")
    --fontsize NUM
    # font size (default 12)
    --countname TEXT # count type label (default "samples")
    --nametype TEXT # name type label (default "Function:")
    --colors PALETTE # set color palette. choices are: hot (default), mem,
    # io, wakeup, chain, java, js, perl, red, green, blue,
    # aqua, yellow, purple, orange
    --hash
    # colors are keyed by function name hash
    --cp
    # use consistent palette (palette.map)
    --reverse
    # generate stack-reversed flame graph
    --inverted
    # icicle graph
    --negate
    # switch differential hues (blue<->red)
    --notes TEXT
    # add notes comment in SVG (for debugging)
    --help
    # this message
    eg,
    flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
    
slide 37:
    perf Flame Graph Workflow (Linux 2.6+)
    list events
    count events
    capture stacks
    perf list
    perf stat
    perf record
    Typical
    Workflow
    perf.data
    text UI
    dump profile
    perf report
    flame graph
    visualization
    perf script
    stackcollapse-perf.pl
    flamegraph.pl
    
slide 38:
    perf Flame Graph Workflow (Linux 4.5+)
    list events
    count events
    capture stacks
    perf list
    perf stat
    perf record
    Typical
    Workflow
    perf.data
    text UI
    dump
    perf report
    perf script
    flame graph
    visualization
    summary
    perf report
    -g folded
    awk
    flamegraph.pl
    
slide 39:
    Flame Graph Optimizations
    Linux 2.6
    Linux 4.5
    Linux 4.9
    capture stacks
    capture stacks
    count stacks (BPF)
    perf record
    perf record
    profile.py
    write samples
    perf.data
    reads samples
    perf script
    write text
    stackcollapse-perf.pl
    folded output
    flamegraph.pl
    write samples
    not perf
    perf.data
    reads samples
    perf report
    -g folded
    folded
    output
    folded report
    awk
    folded output
    flamegraph.pl
    flamegraph.pl
    
slide 40:
    Gotchas
    
slide 41:
    When we've tried to use perf
    Stacks don't work (missing)
    Symbols don't work (hex numbers)
    Instruction profiling looks bogus
    PMCs don't work in VM guests
    Containers break things
    Overhead is too high
    
slide 42:
    How to really get started
    1. Get "perf" to work
    2. Get stack walking to work
    3. Fix symbol translation
    4. Get IPC to work
    5. Test perf under load
    Install linux-tools-common and
    linux-tools-`uname -r` packages;
    Or compile in the Linux source:
    tools/perf
    The "gotchas"…
    
slide 43:
    Gotcha #1 Broken Stacks
    perf record -F 99 -a -g -- sleep 30
    perf report -n --stdio
    1. Take a CPU profile
    2. Run perf report
    3. If stacks are often
slide 44:
    Identifying Broken Stacks
    28.10%
    sed
    libc-2.19.so
    --- re_search_internal
    |--12.25%-- 0x3
    0x100007
    |--96.78%-- re_search_stub
    rpl_re_search
    match_regex
    do_subst
    execute_program
    process_files
    main
    __libc_start_main
    --3.22%-- rpl_re_search
    match_regex
    do_subst
    execute_program
    process_files
    main
    __libc_start_main
    [.] re_search_internal
    broken
    not broken
    
slide 45:
    Identifying Broken Stacks
    78.50%
    sed
    libc-2.19.so
    |--3.65%-- 0x7f2516d5d10d
    |--2.19%-- 0x7f2516d0332f
    |--1.22%-- 0x7f2516cffbd2
    |--1.22%-- 0x7f2516d5d5ad
    |--11.65%-- 0x40a447
    0x40659a
    0x408dd8
    0x408ed1
    0x402689
    0x7fa1cd08aec5
    |--1.33%-- 0x40a4a1
    |--60.01%-- 0x40659a
    0x408dd8
    0x408ed1
    0x402689
    0x7fa1cd08aec5
    [.] 0x00000000000dd7d4
    broken
    probably not broken
    missing symbols, but
    that's another problem
    
slide 46:
    Broken Stacks Flame Graph
    Broken Java stacks
    (missing frame pointer)
    Java == green
    system == red
    C++ == yellow
    
slide 47:
    Fixing Broken Stacks
    • Either:
    • Fix frame pointer-based stack walking (the default)
    Pros: simple, supports any system stack walker
    Cons: might cost a little extra CPU to make available
    • Use libunwind and DWARF: perf record -g dwarf
    Pros: more debug info
    Cons: not on older kernels, and inflates instance size
    … there's also ORC on the latest kernel
    • Application support
    https://github.com/jvm-profiling-tools/async-profiler
    • Our current preference is (A), but (C) is also promising
    So how do we fix the frame pointer…
    
slide 48:
    gcc -fno-omit-frame-pointer
    • Once upon a time, x86 had fewer registers, and the frame
    pointer register was reused for general purpose to improve
    performance. This breaks system stack walking.
    • gcc provides -fno-omit-frame-pointer to fix this
    – Please make this the default in gcc!
    
slide 49:
    Java -XX:+PreserveFramePointer
    • I hacked frame pointers in the JVM (JDK-8068945) and Oracle rewrote
    it as -XX:+PreserveFramePointer. Lets perf do FP stack walks of Java.
    --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
    +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
    @@ -5236,6 +5236,7 @@
    // We always push rbp, so that on return to interpreter rbp, will be
    // restored correctly and we can correct the stack.
    push(rbp);
    mov(rbp, rsp);
    // Remove word for ebp
    framesize -= wordSize;
    Involved changes like this:
    fixing x86-64 function
    prologues
    --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
    +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
    [...]
    • Costs some overhead to use. Usually 
slide 50:
    Broken Java Stacks
    # perf script
    […]
    java 4579 cpu-clock:
    ffffffff8172adff tracesys ([kernel.kallsyms])
    7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
    java
    4579 cpu-clock:
    7f417908c10b [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101c97 [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f41792fc65f [unknown] (/tmp/perf-4458.map)
    a2d53351ff7da603 [unknown] ([unknown])
    java
    4579 cpu-clock:
    7f4179349aec [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101d0f [unknown] (/tmp/perf-4458.map)
    […]
    Check with "perf script" to see
    stack samples
    These are 1 or 2 levels deep (junk
    values)
    
slide 51:
    Fixed Java Stacks
    # perf script
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    7fd30184def8 [unknown] (/tmp/perf-8131.map)
    7fd30174f544 [unknown] (/tmp/perf-8131.map)
    7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
    7fd30166d51c [unknown] (/tmp/perf-8131.map)
    7fd301750f34 [unknown] (/tmp/perf-8131.map)
    7fd3016c2280 [unknown] (/tmp/perf-8131.map)
    7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
    7fd3016f9888 [unknown] (/tmp/perf-8131.map)
    7fd3016ece04 [unknown] (/tmp/perf-8131.map)
    7fd30177783c [unknown] (/tmp/perf-8131.map)
    7fd301600aa8 [unknown] (/tmp/perf-8131.map)
    7fd301a4484c [unknown] (/tmp/perf-8131.map)
    7fd3010072e0 [unknown] (/tmp/perf-8131.map)
    7fd301007325 [unknown] (/tmp/perf-8131.map)
    7fd301007325 [unknown] (/tmp/perf-8131.map)
    7fd3010004e7 [unknown] (/tmp/perf-8131.map)
    7fd3171df76a JavaCalls::call_helper(JavaValue*,…
    7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
    7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
    7fd31721b6ce thread_entry(JavaThread*, Thread*)…
    7fd3175389e0 JavaThread::thread_main_inner() (/…
    7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
    7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
    7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
    With -XX:+PreserveFramePointer
    stacks are full, and go all the way to
    start_thread()
    This is what the CPUs are really
    running: inlined frames are not
    present
    
slide 52:
    Fixed Stacks Flame Graph
    Java
    (no symbols)
    
slide 53:
    Gotcha #2 Missing Symbols
    • Missing symbols should be obvious in perf report/script:
    71.79%
    12.06%
    sed
    sed
    |--11.65%-- 0x40a447
    0x40659a
    0x408dd8
    0x408ed1
    0x402689
    0x7fa1cd08aec5
    sed
    sed
    --- re_search_internal
    |--96.78%-- re_search_stub
    rpl_re_search
    match_regex
    do_subst
    execute_program
    process_files
    main
    __libc_start_main
    [.] 0x000000000001afc1
    broken
    [.] re_search_internal
    not broken
    
slide 54:
    Fixing Symbols
    • For installed packages:
    A. Add a -dbgsym package, if available
    B. Recompile from source
    • For JIT (Java, Node.js, …):
    A. Create a /tmp/perf-PID.map file. perf already looks for this
    Map format is "START SIZE symbolname"
    B. Or use a symbol logger. Eg tools/perf/jvmti/.
    # perf script
    Failed to open /tmp/perf-8131.map, continuing without symbols
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    […]
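The map-file approach from (A) can be sketched concretely. The addresses and symbol names below are illustrative only; in real use the JIT runtime (or an attached agent) writes the file for its own PID:

```shell
# Hedged sketch of /tmp/perf-PID.map, the "START SIZE symbolname" format
# perf looks for: hex start address, hex size, then the symbol name
# (which may itself contain spaces).
pid=$$   # demo only; normally the JIT process's own PID
printf '%s\n' \
  '7f5e1c000000 40 Interpreter' \
  '7f5e1c000040 1a0 Lorg/example/Foo;::work' > "/tmp/perf-$pid.map"
cat "/tmp/perf-$pid.map"
```

perf resolves a sample's instruction pointer by finding the entry whose [START, START+SIZE) range contains it, so ranges should not overlap.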
    
slide 55:
    Java Symbols
    • perf-map-agent
    – Agent attaches and writes the map file on demand (previous versions attached on Java
    start, and wrote continually)
    – https://github.com/jvm-profiling-tools/perf-map-agent
    (was https://github.com/jrudolph/perf-map-agent)
    • Automation: jmaps
    – We use scripts to find Java processes and dump their map files, paying attention to file
    ownership etc
    – https://github.com/brendangregg/FlameGraph/blob/master/jmaps
    – Needs to run as close as possible to the profile, to minimize symbol churn
    # perf record -F 99 -a -g -- sleep 30; jmaps
    
slide 56:
    Java Flame Graph: Stacks & Symbols
    Kernel
    (C)
    flamegraph.pl --color=java
    Java
    User
    (C)
    JVM
    (C++)
    
slide 57:
    Java: Inlining
    A. Disabling inlining:
    -XX:-Inline
    Many more Java frames
    80% slower (in this case)
    May not be necessary: inlined flame
    graphs often make enough sense
    – Or tune -XX:MaxInlineSize and -XX:InlineSmallCode to reveal more
    frames, without costing much perf: can
    even go faster!
    B. Symbol agents can uninline
    – perf-map-agent unfoldall
    – We sometimes need and use this
    No inlining
    
slide 58:
    Node.js: Stacks & Symbols
    • Frame pointer stacks work
    • Symbols currently via a logger
    – --perf-basic-prof: everything. We found it can log over 1 Gbyte/day.
    – --perf-basic-prof-only-functions: tries to only log symbols we care about.
    • perf may not use the most recent symbol in the log
    – We tidy logs before using them:
    https://raw.githubusercontent.com/brendangregg/Misc/master/perf_events/perfmaptidy.pl
    • Future V8 versions may support on-demand symbol dumps
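The tidying step can be sketched as follows (entries are illustrative; the real script is the perfmaptidy.pl linked above): when the same code address is re-logged because a method was recompiled or moved, keep only the newest entry so perf resolves the current symbol.

```shell
# Build a toy symbol log where address 3f8a100 is logged twice (the method
# was recompiled from ~foo to optimized *foo); keep only the later entry.
printf '%s\n' \
  '3f8a100 40 LazyCompile:~foo script.js:1' \
  '3f8a200 60 LazyCompile:~bar script.js:9' \
  '3f8a100 40 LazyCompile:*foo script.js:1' > map.log
# Reverse the log, keep the first (i.e., most recent) entry per address,
# then reverse back to restore the original ordering:
tac map.log | awk '!seen[$1]++' | tac > map.tidy
cat map.tidy
```

This is a deduplicate-by-first-field idiom; the real tidy script also has to cope with overlapping address ranges, not just exact repeats.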
    
slide 59:
    Gotcha #3 Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
    Percent |
    Source code & Disassembly of noplooper
    -------------------------------------------------------:
    Disassembly of section .text:
    00000000004004ed <…>:
    0.00 :
    4004ed:
    push
    %rbp
    0.00 :
    4004ee:
    mov
    %rsp,%rbp
    20.86 :
    4004f1:
    nop
    0.00 :
    4004f2:
    nop
    0.00 :
    4004f3:
    nop
    0.00 :
    4004f4:
    nop
    19.84 :
    4004f5:
    nop
    0.00 :
    4004f6:
    nop
    0.00 :
    4004f7:
    nop
    0.00 :
    4004f8:
    nop
    18.73 :
    4004f9:
    nop
    0.00 :
    4004fa:
    nop
    0.00 :
    4004fb:
    nop
    0.00 :
    4004fc:
    nop
    19.08 :
    4004fd:
    nop
    0.00 :
    4004fe:
    nop
    0.00 :
    4004ff:
    nop
    0.00 :
    400500:
    nop
    21.49 :
    400501:
    jmp
    4004f1 <…>
    16 NOPs in a loop
    Let's profile instructions
    to see which are hot!
    (have I lost my mind?)
    
slide 60:
    Instruction Profiling
    • Even distribution (A)? Or something else?
    (A)
    (B)
    (C)
    (D)
    
slide 61:
    Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
    Percent |
    Source code & Disassembly of noplooper
    -------------------------------------------------------:
    Disassembly of section .text:
    00000000004004ed <…>:
    0.00 :
    4004ed:
    push
    %rbp
    0.00 :
    4004ee:
    mov
    %rsp,%rbp
    20.86 :
    4004f1:
    nop
    0.00 :
    4004f2:
    nop
    0.00 :
    4004f3:
    nop
    0.00 :
    4004f4:
    nop
    19.84 :
    4004f5:
    nop
    0.00 :
    4004f6:
    nop
    0.00 :
    4004f7:
    nop
    0.00 :
    4004f8:
    nop
    18.73 :
    4004f9:
    nop
    0.00 :
    4004fa:
    nop
    0.00 :
    4004fb:
    nop
    0.00 :
    4004fc:
    nop
    19.08 :
    4004fd:
    nop
    0.00 :
    4004fe:
    nop
    Go home instruction pointer, you're drunk
    0.00 :
    4004ff:
    nop
    0.00 :
    400500:
    nop
    21.49 :
    400501:
    jmp
    4004f1 <…>
    
slide 62:
    PEBS
    • I believe this is due to parallel and out-of-order execution of
    micro-ops: the sampled IP is the resumption instruction, not
    what is currently executing. And skid.
    • PEBS may help: Intel's Precise Event Based Sampling
    • perf_events has support:
    – perf record -e cycles:pp
    – The 'p' can be specified multiple times:
    0 - SAMPLE_IP can have arbitrary skid
    1 - SAMPLE_IP must have constant skid
    2 - SAMPLE_IP requested to have 0 skid
    3 - SAMPLE_IP must have 0 skid
    – … from tools/perf/Documentation/perf-list.txt
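As a small illustration of the modifier syntax (pure string construction, no perf invoked; the event name and level are examples), precision level N is written as N ':p' characters appended to the event:

```shell
# Hedged sketch: build a perf event spec with the requested number of
# precision modifiers (cycles, cycles:p, cycles:pp, cycles:ppp).
event="cycles"
precision=2          # request "0 skid" sampling (level 2 above)
mods=""
i=0
while [ "$i" -lt "$precision" ]; do
    mods="${mods}p"
    i=$((i + 1))
done
spec="$event${mods:+:$mods}"
echo "$spec"
```

The resulting string is what you would pass as `perf record -e cycles:pp`, matching the slide's example.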
    
slide 63:
    Gotcha #4 VM Guests
    • Using PMCs from most VM guests:
    # perf stat -a -d sleep 5
    Performance counter stats for 'system wide':
    10003.718595 task-clock (msec)
    323 context-switches
    17 cpu-migrations
    233 page-faults
    <not supported> cycles
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    <not supported> instructions
    <not supported> branches
    <not supported> branch-misses
    <not supported> L1-dcache-loads
    <not supported> L1-dcache-load-misses
    <not supported> LLC-loads
    <not supported> LLC-load-misses
    5.001607197 seconds time elapsed
    2.000 CPUs utilized
    0.032 K/sec
    0.002 K/sec
    0.023 K/sec
    [100.00%]
    [100.00%]
    [100.00%]
    
slide 64:
    VM Guest PMCs
    • Without PMCs, %CPU is ambiguous. We need IPC.
    – Can't measure instructions per cycle (IPC), cache hits/misses, MMU/TLB events, etc.
    • Is fixable: eg, Xen can enable PMCs (vpmu boot option)
    – I added vpmu support for subsets, eg, vpmu=arch for Intel architectural set (7 PMCs only)
    – http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
    architectural
    set
    – Now available on the largest AWS EC2 instance types
    
slide 65:
    VM Guest MSRs
    • Model Specific Registers (MSRs) may be exposed when PMCs are not
    • Better than nothing. Can solve some issues.
    # ./showboost
    CPU MHz
    : 2500
    Turbo MHz
    : 2900 (10 active)
    Turbo Ratio : 116% (10 active)
    CPU 0 summary every 5 seconds...
    TIME
    17:28:03
    17:28:08
    17:28:13
    17:28:18
    17:28:23
    [...]
    C0_MCYC
    C0_ACYC
    UTIL
    33%
    35%
    36%
    37%
    38%
    RATIO
    116%
    116%
    116%
    115%
    115%
    MHz
    – showboost is from my msr-cloud-tools collection (on github)
    
slide 66:
    VM Guest PEBS
    • Not possible yet in Xen
    – please fix
    • Ditto for LBR, BTS, processor trace
    
slide 67:
    Gotcha #5 Containers
    • perf from the host can't find symbol files in different mount namespaces
    • We currently work around it
    – http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
    • Should be fixed in 4.14
    – Krister Johansen's patches
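    The workaround linked above boils down to a PID translation: host perf looks for `/tmp/perf-<hostpid>.map`, while a containerized JVM writes `/tmp/perf-<nspid>.map` inside its own mount namespace. A minimal sketch of the idea, with made-up PIDs (the actual copy step needs root and `nsenter`):

    ```shell
    # Sketch of the pre-4.14 workaround: map the host PID to the in-container
    # PID, then copy the perf map file across namespaces. PIDs are illustrative;
    # the in-container PID can be read from the NSpid line of /proc/<pid>/status.
    host_pid=12345   # PID as seen from the host
    ns_pid=7         # same process's PID inside the container
    echo "copy /tmp/perf-$ns_pid.map (container) to /tmp/perf-$host_pid.map (host)"
    # Actual copy (requires root):
    #   nsenter -t $host_pid -m cat /tmp/perf-$ns_pid.map > /tmp/perf-$host_pid.map
    ```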
    
slide 68:
    Gotcha #6 Overhead
    • Overhead is relative to the rate of events instrumented
    • perf stat does in-kernel counts: relatively low overhead
    • perf record writes perf.data, which has slightly higher
    CPU overhead, plus file system and disk I/O
    • Test before use
    – In the lab
    – Run perf stat to understand rate, before perf record
    • Also consider --filter, to filter events in-kernel
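    The "perf stat before perf record" advice can be sketched as a pre-flight rate check. The counts and the 10,000 events/sec threshold below are illustrative placeholders, not numbers from the talk; in practice the count would come from `perf stat -e <event> -a -- sleep 5`:

    ```shell
    # Minimal sketch: decide whether perf record looks safe, given an event
    # count measured over an interval. The limit is an illustrative number.
    safe_to_record() {
        count=$1; secs=$2; limit=10000
        rate=$((count / secs))
        if [ "$rate" -le "$limit" ]; then
            echo "ok: ${rate}/sec, proceed with perf record"
        else
            echo "too high: ${rate}/sec, consider --filter or lowering scope"
        fi
    }
    safe_to_record 1500 5        # e.g., a low-rate tracepoint
    safe_to_record 2000000 5     # e.g., a hot scheduler event
    ```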
    
slide 69:
    4. perf Advanced
    
slide 70:
    perf for Tracing Events
    
slide 71:
    Tracepoints
    # perf record -e block:block_rq_insert -a
    ^C[ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]
    # perf script
    […]
    java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
    java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
    […]
    process PID [CPU] timestamp: eventname: format string
    The format string comes from the kernel source, which may be the only docs:
    include/trace/events/block.h:
    DECLARE_EVENT_CLASS(block_rq,
    [...]
        TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
                  MAJOR(__entry->dev), MINOR(__entry->dev),
                  __entry->rwbs, __entry->bytes, __get_str(cmd),
                  (unsigned long long)__entry->sector,
                  __entry->nr_sector, __entry->comm)
    Also see: cat /sys/kernel/debug/tracing/events/block/block_rq_insert/format
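    Since perf script emits plain text, it is easy to post-process with awk. A small sketch, reusing the sample lines above, that sums the requested I/O size per process (nr_sector is the "+ N" field; sectors are 512 bytes — field positions assume the default perf script format shown above):

    ```shell
    # Sum requested I/O bytes per process from perf script output.
    # Sample lines copied from the slide above.
    cat > /tmp/perf_script_sample.txt <<'EOF'
    java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
    java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
    EOF
    # $1 is the process name; $(NF-1) is nr_sector (second-to-last field)
    awk '{ bytes[$1] += $(NF-1) * 512 }
         END { for (p in bytes) printf "%s %d bytes\n", p, bytes[p] }' \
        /tmp/perf_script_sample.txt
    ```

    For these four requests (88 + 88 + 24 + 88 sectors) that totals 147456 bytes for java.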
    
slide 72:
    One-Liners: Static Tracing
    # Trace new processes, until Ctrl-C:
    perf record -e sched:sched_process_exec -a
    # Trace all context-switches with stack traces, for 1 second:
    perf record -e context-switches -ag -- sleep 1
    # Trace CPU migrations, for 10 seconds:
    perf record -e migrations -a -- sleep 10
    # Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
    perf record -e syscalls:sys_enter_connect -ag
    # Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
    perf record -e block:block_rq_insert -ag
    # Trace all block device issues and completions (has timestamps), until Ctrl-C:
    perf record -e block:block_rq_issue -e block:block_rq_complete -a
    # Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'nr_sector > 200'
    # Trace all block completions, synchronous writes only, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
    # Trace all block completions, all types of writes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
    # Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
    perf record -e 'ext4:*' -o /tmp/perf.data -a
    
slide 73:
    One-Liners: Dynamic Tracing
    # Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional):
    perf probe --add tcp_sendmsg
    # Remove the tcp_sendmsg() tracepoint (or use --del):
    perf probe -d tcp_sendmsg
    # Add a tracepoint for the kernel tcp_sendmsg() function return:
    perf probe 'tcp_sendmsg%return'
    # Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo):
    perf probe -V tcp_sendmsg --externs
    # Show available line probes for tcp_sendmsg() (needs debuginfo):
    perf probe -L tcp_sendmsg
    # Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo):
    perf probe 'tcp_sendmsg:81 seglen'
    # Add a tracepoint for do_sys_open() with the filename as a string (debuginfo):
    perf probe 'do_sys_open filename:string'
    # Add a tracepoint for myfunc() return, and include the retval as a string:
    perf probe 'myfunc%return +0($retval):string'
    # Add a tracepoint for the user-level malloc() function from libc:
    perf probe -x /lib64/libc.so.6 malloc
    # List currently available dynamic probes:
    perf probe -l
    
slide 74:
    One-Liners: Advanced Dynamic Tracing
    # Add a tracepoint for tcp_sendmsg(), with three entry regs (platform specific):
    perf probe 'tcp_sendmsg %ax %dx %cx'
    # Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register:
    perf probe 'tcp_sendmsg bytes=%cx'
    # Trace previously created probe when the bytes (alias) var is greater than 100:
    perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
    # Add a tracepoint for tcp_sendmsg() return, and capture the return value:
    perf probe 'tcp_sendmsg%return $retval'
    # Add a tracepoint for tcp_sendmsg(), and "size" entry argument (debuginfo):
    perf probe 'tcp_sendmsg size'
    # Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo):
    perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
    # Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo):
    perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
    • Kernel debuginfo is an onerous requirement for the Netflix cloud
    • We can use registers instead (as above). But which registers?
    
slide 75:
    The Rosetta Stone of Registers
    One server instance with kernel debuginfo, and -nv (dry run, verbose):
    # perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
    […]
    Writing event: p:probe/tcp_sendmsg tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8
    Added new event:
    probe:tcp_sendmsg (on tcp_sendmsg with size skc_state=sk->__sk_common.skc_state)
    You can now use it in all perf tools, such as:
    perf record -e probe:tcp_sendmsg -aR sleep 1
    All other instances (of the same kernel version):
    Copy-n-paste!
    # perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8'
    Failed to find path of kernel module.
    Added new event:
    probe:tcp_sendmsg
    (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8)
    You can now use it in all perf tools, such as:
    perf record -e probe:tcp_sendmsg -aR sleep 1
    Masami Hiramatsu was investigating a way to better automate this
    
slide 76:
    perf Visualizations: Block I/O Latency Heat Map
    • We automated this for analyzing disk I/O latency issues
    [Heat map screenshots: SSD I/O (fast, with queueing); HDD I/O (random, modes)]
    http://www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html
    
slide 77:
    There's still a lot more to perf…
    Using PMCs
    perf scripting interface
    perf + eBPF
    perf sched
    perf timechart
    perf trace
    perf c2c (new!)
    perf ftrace (new!)
    
slide 78:
    Links & References
    perf_events
    Kernel source: tools/perf/Documentation
    https://perf.wiki.kernel.org/index.php/Main_Page
    http://www.brendangregg.com/perf.html
    http://web.eece.maine.edu/~vweaver/projects/perf_events/
    Mailing list: http://vger.kernel.org/vger-lists.html#linux-perf-users
    perf-tools: https://github.com/brendangregg/perf-tools
    PMU tools: https://github.com/andikleen/pmu-tools
    perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
    Java frame pointer patch
    http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html
    https://bugs.openjdk.java.net/browse/JDK-8068945
    Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
    Methodology: http://www.brendangregg.com/methodology.html
    Flame graphs: http://www.brendangregg.com/flamegraphs.html
    Heat maps: http://www.brendangregg.com/heatmaps.html
    eBPF: http://lwn.net/Articles/603983/
    
slide 79:
    Thank You
    – Questions?
    – http://www.brendangregg.com
    – http://slideshare.net/brendangregg
    – bgregg@netflix.com
    – @brendangregg