Systems Performance 2nd Ed.



BPF Performance Tools book

Recent posts:
Blog index
About
RSS

USENIX LISA 2014: Linux Performance Analysis, New Tools and Old Secrets

Video: https://www.youtube.com/watch?v=R4IKeMQhM0Y

Talk for USENIX/LISA2014 by Brendan Gregg, Netflix.

Description: "At Netflix performance is crucial, and we use many high to low level tools to analyze our stack in different ways. In this talk, I will introduce new system observability tools we are using at Netflix, which I've ported from my DTraceToolkit, and are intended for our Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these are solving issues on current versions of Linux, I'll also briefly summarize the future in this space: eBPF, ktap, SystemTap, sysdig, etc."

next
prev
1/75
next
prev
2/75
next
prev
3/75
next
prev
4/75
next
prev
5/75
next
prev
6/75
next
prev
7/75
next
prev
8/75
next
prev
9/75
next
prev
10/75
next
prev
11/75
next
prev
12/75
next
prev
13/75
next
prev
14/75
next
prev
15/75
next
prev
16/75
next
prev
17/75
next
prev
18/75
next
prev
19/75
next
prev
20/75
next
prev
21/75
next
prev
22/75
next
prev
23/75
next
prev
24/75
next
prev
25/75
next
prev
26/75
next
prev
27/75
next
prev
28/75
next
prev
29/75
next
prev
30/75
next
prev
31/75
next
prev
32/75
next
prev
33/75
next
prev
34/75
next
prev
35/75
next
prev
36/75
next
prev
37/75
next
prev
38/75
next
prev
39/75
next
prev
40/75
next
prev
41/75
next
prev
42/75
next
prev
43/75
next
prev
44/75
next
prev
45/75
next
prev
46/75
next
prev
47/75
next
prev
48/75
next
prev
49/75
next
prev
50/75
next
prev
51/75
next
prev
52/75
next
prev
53/75
next
prev
54/75
next
prev
55/75
next
prev
56/75
next
prev
57/75
next
prev
58/75
next
prev
59/75
next
prev
60/75
next
prev
61/75
next
prev
62/75
next
prev
63/75
next
prev
64/75
next
prev
65/75
next
prev
66/75
next
prev
67/75
next
prev
68/75
next
prev
69/75
next
prev
70/75
next
prev
71/75
next
prev
72/75
next
prev
73/75
next
prev
74/75
next
prev
75/75

PDF: LISA2014_LinuxPerfAnalysisNewTools.pdf

Keywords (from pdftotext):

slide 1:
    Linux Performance Analysis
    New Tools and Old Secrets
    Brendan Gregg
    Senior Performance Architect
    Performance Engineering Team
    bgregg@netflix.com
    @brendangregg
    
slide 2:
    Porting these to Linux…
    
slide 3:
    • Massive Amazon EC2 Linux cloud
    – Tens of thousands of instances
    – Autoscale by ~3k each day
    – CentOS and Ubuntu
    • FreeBSD for content delivery
    – Approx 33% of US Internet traffic at night
    • Performance is critical
    – Customer satisfaction: >gt;50M subscribers
    – $$$ price/performance
    – Develop tools for cloud-wide analysis; use
    server tools as needed
    
slide 4:
    Brendan Gregg
    • Senior Performance Architect, Netflix
    – Linux and FreeBSD performance
    – Performance Engineering team (@coburnw)
    • Recent work:
    – Linux perf-tools: ftrace & perf_events
    – Testing of other tracers: eBPF
    • Previously:
    – Performance of Linux, Solaris, ZFS, DBs,
    TCP/IP, hypervisors, …
    – Flame graphs, heat maps, methodologies,
    DTrace tools, DTraceToolkit
    
slide 5:
    Agenda
    Some one-liners
    Background
    Technology
    Tools
    
slide 6:
    1. Some one-liners
    (cut	
      to	
      the	
      chase!)	
      
    
slide 7:
    tpoint for disk I/O
    • Who is creating disk I/O, and of what type?
    # ./tpoint -H block:block_rq_insert!
    Tracing block:block_rq_insert. Ctrl-C to end.!
    # tracer: nop!
    TASK-PID
    CPU#
    TIMESTAMP FUNCTION!
    | |
    flush-9:0-9318 [013] 1936182.007914: block_rq_insert: 202,16 W 0 () 160186560 + 8 [flush-9:0]!
    flush-9:0-9318 [013] 1936182.007939: block_rq_insert: 202,16 W 0 () 280100936 + 8 [flush-9:0]!
    java-9469 [014] 1936182.316184: block_rq_insert: 202,1 R 0 () 1319592 + 72 [java]!
    java-9469 [000] 1936182.331270: block_rq_insert: 202,1 R 0 () 1125744 + 8 [java]!
    java-9469 [000] 1936182.341418: block_rq_insert: 202,1 R 0 () 2699008 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699096 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699184 + 32 [java]!
    java-9469 [000] 1936182.345870: block_rq_insert: 202,1 R 0 () 1320304 + 24 [java]!
    java-9469 [000] 1936182.351590: block_rq_insert: 202,1 R 0 () 1716848 + 16 [java]!
    ^C!
    Ending tracing...!
    • tpoint traces a given tracepoint. -H prints the header.
    
slide 8:
    tpoint for disk I/O
    • Who is creating disk I/O, and of what type?
    # ./tpoint -H block:block_rq_insert!
    tracepoint	
      
    Tracing block:block_rq_insert. Ctrl-C to end.!
    type	
      
    size	
      (sectors)	
      
    # tracer: nop!
    dev	
      
    offset	
      
    TASK-PID
    CPU#
    TIMESTAMP FUNCTION!
    | |
    flush-9:0-9318 [013] 1936182.007914: block_rq_insert: 202,16 W 0 () 160186560 + 8 [flush-9:0]!
    flush-9:0-9318 [013] 1936182.007939: block_rq_insert: 202,16 W 0 () 280100936 + 8 [flush-9:0]!
    java-9469 [014] 1936182.316184: block_rq_insert: 202,1 R 0 () 1319592 + 72 [java]!
    java-9469 [000] 1936182.331270: block_rq_insert: 202,1 R 0 () 1125744 + 8 [java]!
    java-9469 [000] 1936182.341418: block_rq_insert: 202,1 R 0 () 2699008 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699096 + 88 [java]!
    java-9469 [000] 1936182.341419: block_rq_insert: 202,1 R 0 () 2699184 + 32 [java]!
    java-9469 [000] 1936182.345870: block_rq_insert: 202,1 R 0 () 1320304 + 24 [java]!
    java-9469 [000] 1936182.351590: block_rq_insert: 202,1 R 0 () 1716848 + 16 [java]!
    ^C!
    Ending tracing...!
    • tpoint traces a given tracepoint. -H prints the header.
    
slide 9:
    tpoint -l
    # ./tpoint -l!
    block:block_bio_backmerge!
    block:block_bio_bounce!
    block:block_bio_complete!
    block:block_bio_frontmerge!
    block:block_bio_queue!
    block:block_bio_remap!
    block:block_getrq!
    block:block_plug!
    block:block_rq_abort!
    block:block_rq_complete!
    block:block_rq_insert!
    block:block_rq_issue!
    block:block_rq_remap!
    block:block_rq_requeue!
    […]!
    # ./tpoint –l | wc –l!
    1257!
    Lis7ng	
      tracepoints	
      	
      
    • 1,257 tracepoints for this Linux kernel
    
slide 10:
    tpoint -h
    # ./tpoint -h!
    USAGE: tpoint [-hHsv] [-d secs] [-p PID] tracepoint [filter]!
    tpoint -l!
    -d seconds
    # trace duration, and use buffers!
    -p PID
    # PID to match on I/O issue!
    # view format file (don't trace)!
    # include column headers!
    # list all tracepoints!
    # show kernel stack traces!
    # this usage message!
    eg,!
    tpoint -l | grep open!
    # find tracepoints containing "open"!
    tpoint syscalls:sys_enter_open!
    # trace open() syscall entry!
    tpoint block:block_rq_issue!
    # trace block I/O issue!
    tpoint -s block:block_rq_issue!
    # show kernel stacks!
    See the man page and example file for more info.!
    
slide 11:
    Some tpoint One-Liners
    # List tracepoints!
    tpoint -l!
    # Show usage message!
    tpoint -h!
    # Trace disk I/O device issue with details:!
    tpoint block:block_rq_issue!
    # Trace disk I/O queue insertion with details:!
    tpoint block:block_rq_insert!
    # Trace disk I/O queue insertion with details, and include header:!
    tpoint -H block:block_rq_insert!
    # Trace disk I/O queue insertion, with kernel stack trace:!
    tpoint -s block:block_rq_insert!
    # Trace disk I/O queue insertion, for reads only:!
    tpoint -s block:block_rq_insert 'rwbs ~ "*R*"'!
    # Trace 1,000 disk I/O device issues:!
    tpoint block:block_rq_issue | head -1000!
    
slide 12:
    DEMO
    
slide 13:
    One-Liners
    • Useful
    – Keep a collection, copy-n-paste when needed
    • Instructive
    – Teaches tool usage by-example
    – Can also show what use cases are most useful
    • Intuitive
    – Follows Unix/POSIX/IEEE traditions/standards
    – getopts, -h for help, Ctrl-C to end, etc.
    • Competitive
    – Why this tool? Demonstrate key, competitive, features.
    
slide 14:
    DTrace One-Liners (Wikipedia)
    
slide 15:
    DTrace One-Liners (Wikipedia)
    • Good examples: Useful, Instructive, Intuitive
    • Taken from a longer list:
    – http://www.brendangregg.com/dtrace.html
    – (I wish they'd have included latency quantize as well)
    • And, competitive
    – Linux couldn't do these in 2005
    
slide 16:
    Linux One-Liners
    • Porting them to Linux:
    # New processes with arguments!
    execsnoop!
    # Files opened by process!
    opensnoop!
    # Syscall count by program!
    syscount!
    # Syscall count by syscall!
    funccount 'sys_*'!
    # Syscall count by process!
    syscount -v!
    # Disk size by process!
    iosnoop -Q!
    # Pages paged in by process!
    iosnoop –Qi '*R*'!
    
slide 17:
    perf-tools
    • These Linux one-liners (and tpoint) are from my
    collection of Linux performance analysis tools
    – https://github.com/brendangregg/perf-tools
    • New tools for old Linux secrets
    – Designed to work on 3.2+ kernels
    – Uses ftrace & perf_events, which have existed for years
    
slide 18:
    2. Background
    
slide 19:
    Background
    • Linux tracing is:
    ftrace
    perf_events (perf)
    eBPF
    SystemTap
    ktap
    LTTng
    dtrace4linux
    Oracle Linux DTrace
    sysdig
    • Understanding these is time consuming & complex
    – May be best told through personal experience...
    
slide 20:
    Personal Experience
    • Became a systems performance expert
    – Understood tools, metrics, inference,
    interpretation, OS internals
    • Became a DTrace expert (2005)
    – Wrote scripts, books, blogs, courses
    – Helped Sun compete with Linux
    • Began analyzing Linux perf (2011)
    – Tried SystemTap, dtrace4linux, ktap, …
    – Limited success, much pain & confusion
    • Switched to Linux (2014)
    – And expected it to be hell (bring it on!)
    
slide 21:
    The one that got away…
    • Early on at Netflix, I had a disk I/O issue
    – Only 5 minutes to debug, then load is migrated
    – Collected iostat/sar, but needed a trace
    • No time to install any tracers (system was too slow)
    – Failed to solve the issue. Furious at Linux and myself.
    – Noticed the system did have this thing called ftrace…
    • Ftrace?
    – Part of the Linux kernel
    – /sys interface for static and dynamic tracing
    – Already enabled on all our Linux 3.2 & 3.13 servers
    
slide 22:
    WHY AM I NOT
    USING FTRACE ALREADY?
    WHY IS NO ONE
    USING FTRACE ALREADY?
    
slide 23:
    Linux Secrets
    • Re-focused on what Linux already has in-tree
    – ftrace & perf_events
    – These seem to be well-kept secrets: no marketing
    • Clears up some confusion (and pain)
    – Instead of comparing 9 tracing options, it’s now:
    1. In-tree tools (currently: ftrace, perf_events)
    2. Everything else
    – Works for us; you may prefer picking one tracer
    • Many of our tracing needs can now be met
    – Linux has been closing the tracing gap
    It’s not 2005 anymore
    
slide 24:
    A Tracing Timeline
    • 1990’s: Static tracers, prototype dynamic tracers
    • 2004: Linux kprobes (2.6.9)
    – Dynamic kernel tracing, difficult interface
    • 2005: Solaris DTrace (s10)
    – Static & dynamic tracing, user & kernel level, production ready, easy
    to use, far better than anything of the time, and, marketed
    • 2008: Linux ftrace (2.6.27)
    • 2009: Linux perf (2.6.31)
    • 2009: Linux tracepoints (2.6.32)
    – Static kernel tracing
    • 2010-2014: ftrace & perf_events enhancements
    • 2014: eBPF patches
    – Fast (JIT’d) in kernel aggregations and programs
    
slide 25:
    3. Technology
    
slide 26:
    Tracing Sources
    • Linux provides three tracing sources
    – tracepoints: kernel static tracing
    – kprobes: kernel dynamic tracing
    – uprobes: user-level dynamic tracing
    
slide 27:
    Tracepoints
    • Statically placed at logical places in the kernel
    • Provides key event details as a “format” string
    – more on this later…
    
slide 28:
    Probes
    • kprobes: dynamic kernel tracing
    – function calls, returns, line numbers
    • uprobes: dynamic user-level tracing
    
slide 29:
    Tracers
    • Tracing sources are used by the tracers
    – In-tree: ftrace, perf_events, eBPF (soon?)
    – Out-of-tree: SystemTap, ktap, LTTng, dtrace4linux, Oracle
    Linux DTrace, sysdig
    
slide 30:
    ftrace
    • A collection of tracing capabilities
    – Tracing, counting, graphing (latencies), filters
    – Uses tracepoints, kprobes
    – Not programmable (yet)
    • Use via /sys/kernel/debug/tracing/…
    – Or use via front-end tools
    • Added by Steven Rostedt and others since 2.6.27
    – Out of necessity for Steven’s real time work
    • Can solve many perf issues
    
slide 31:
    ftrace Interface
    • Static tracing of block_rq_insert tracepoint
    # cd /sys/kernel/debug/tracing!
    # echo 1 >gt; events/block/block_rq_insert/enable!
    # cat trace_pipe!
    # echo 0 >gt; events/block/block_rq_insert/enable !
    • Dynamic function tracing of tcp_retransmit_skb():
    # cd /sys/kernel/debug/tracing!
    # echo tcp_retransmit_skb >gt; set_ftrace_filter!
    # echo function >gt; current_tracer !
    # cat trace_pipe!
    # echo nop >gt; current_tracer !
    # echo >gt; set_ftrace_filter!
    • Available tracing capabilities:
    # cat available_tracers !
    blk function_graph mmiotrace wakeup_rt wakeup function nop!
    
slide 32:
    I Am SysAdmin (And So Can You!)
    • What would a sysadmin do?
    # cd /sys/kernel/debug/tracing!
    # echo tcp_retransmit_skb >gt; set_ftrace_filter!
    # echo function >gt; current_tracer !
    # cat trace_pipe!
    # echo nop >gt; current_tracer !
    # echo >gt; set_ftrace_filter!
    • Automate:
    # functrace tcp_retransmit_skb!
    • Document:
    # man functrace!
    […]!
    SYNOPSIS!
    functrace [-hH] [-p PID] [-d secs] funcstring!
    […]!
    
slide 33:
    ftrace Interface
    • Plus many more capabilities
    – buffered (trace) or live tracing (trace_pipe)
    – filters for conditional tracing
    – stack traces on events
    – function triggers to enable/disable tracing
    – functions with arguments (via kprobes)
    • See Documentation/trace/ftrace.txt
    
slide 34:
    perf_events
    • Use via the “perf” command
    • Add from linux-tools-common, …
    – Source code is in Linux: tools/perf
    • Powerful multi-tool and profiler
    – interval sampling, CPU performance counter events
    – user and kernel dynamic tracing
    – kernel line tracing and local variables (debuginfo)
    – kernel filtering, and in-kernel counts (perf stat)
    • Not very programmable, yet
    – limited kernel summaries. May improve with eBPF.
    
slide 35:
    perf_events tracing
    • Static tracing of block_rq_insert tracepoint:
    # perf record -e block:block_rq_insert -a!
    ^C[ perf record: Woken up 1 times to write data ]!
    [ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]!
    # perf script!
    # ========!
    trace,	
      dump,	
      post-­‐process	
      
    # captured on: Wed Nov 12 20:50:05 2014!
    # hostname : bgregg-test-i-92b81f78!
    […]!
    # ========!
    java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]!
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]!
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]!
    java 9940 [000] 1199510.065194: block_rq_insert: 202,1 R 0 () 4864000 + 88 [java]!
    java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]!
    java 9940 [000] 1199510.065196: block_rq_insert: 202,1 R 0 () 4864176 + 80 [java]!
    java 9940 [000] 1199510.083745: block_rq_insert: 202,1 R 0 () 4864344 + 88 [java]!
    […]!
    
slide 36:
    perf_events One-Liners
    • Great one-liners. From http://www.brendangregg.com/perf.html:
    # List all currently known events:!
    perf list!
    # Various basic CPU statistics, system wide, for 10 seconds:!
    perf stat -e cycles,instructions,cache-references,cache-misses -a sleep 10!
    # Count ext4 events for the entire system, for 10 seconds:!
    perf stat -e 'ext4:*' -a sleep 10!
    # Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds:!
    perf record -F 99 -ag -- sleep 10!
    # Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:!
    perf record -e LLC-load-misses -c 100 -ag -- sleep 5 !
    # Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:!
    perf record -e block:block_rq_issue –ag!
    # Add a tracepoint for the kernel tcp_sendmsg() function return:!
    perf probe 'tcp_sendmsg%return'!
    # Add a tracepoint for tcp_sendmsg, with size and socket state (needs debuginfo):!
    perf probe 'tcp_sendmsg size sk->gt;__sk_common.skc_state'!
    # Show perf.data as a text report, with data coalesced and percentages:!
    perf report –n --stdio!
    
slide 37:
    eBPF
    • Extended BPF: programs on tracepoints
    – High performance filtering: JIT
    – In-kernel summaries: maps
    Time	
      
    • eg, in-kernel latency heat map (showing bimodal):
    Low	
      
    latency	
      
    cache	
      
    hits	
      
    High	
      
    latency	
      
    device	
      
    I/O	
      
    
slide 38:
    eBPF
    • Created by Alexei Starovoitov
    • Gradually being included in Linux (see lkml)
    • Has been difficult to program directly
    – Other tools can become front-ends: ftrace, perf_events,
    SystemTap, ktap?
    
slide 39:
    Other Tracers
    • Discussion:
    – SystemTap
    – ktap
    – LTTng
    – DTrace ports
    – sysdig
    
slide 40:
    The Tracing Landscape, Nov 2014
    (less	
      brutal)	
      
    (my	
      opinion)	
      
    Ease	
      of	
      use	
      
    sysdig	
      
    perf	
      
    stap	
      
    Irace	
      
    (alpha)	
      
    (brutal)	
      
    dtrace4L.
    ktap	
      
    (mature)	
      
    Stage	
      of	
      
    Development	
      
    eBPF	
      
    Scope	
      &	
      Capability	
      
    
slide 41:
    4. Tools
    
slide 42:
    Tools
    one-­‐liners:	
      
    many	
      
    front-­‐end	
      tools:	
      
    perf,	
      trace-cmd,	
      perf-­‐tools	
      
    tracing	
      frameworks:	
      
    Irace,	
      perf_events,	
      eBPF,	
      …	
      
    back-­‐end	
      instrumenta7on:	
      
    tracepoints,	
      kprobes,	
      uprobes	
      
    
slide 43:
    Front-end Tools
    • For ftrace
    – trace-cmd by Steven Rostedt
    – perf-tools: tpoint, iosnoop, execsnoop, kprobe, …
    • For perf_events
    – perf (how perf_events is commonly used)
    – perf-tools: eg, syscount, bitesize, …
    • For eBPF
    – still evolving
    – Could be used via ftrace, perf_events, SystemTap, ktap?
    
slide 44:
    Tool Types
    • Multi-tools
    – perf!
    – trace-cmd!
    – perf-tools: tpoint, kprobe, funccount, …
    – Narrow audience: engineers & developers who can take
    the time to learn them; others via canned one-liners
    • Single purpose tools
    – perf-tools: iosnoop, execsnoop, bitesize, ...
    – Wide audience: sysadmins, developers, everyone
    – Unix philosophy: do one thing and do it well
    
slide 45:
    perf-tools
    • A collection of tools for both ftrace and perf_events
    – https://github.com/brendangregg/perf-tools
    • Each tool has:
    – The tool itself
    – A man page
    – An examples file
    – A symlink under /bin
    perf-tools>gt; ls -l execsnoop bin/execsnoop man/man8/execsnoop.8 \!
    examples/execsnoop_example.txt !
    lrwxr-xr-x 1 bgregg 1001
    12 Jul 26 16:35 bin/execsnoop@ ->gt; ../execsnoop!
    -rw-r--r--+ 1 bgregg 1001 2533 Jul 31 15:34 examples/execsnoop_example.txt!
    -rwxrwxr-x+ 1 bgregg 1001 8529 Jul 31 15:36 execsnoop*!
    -rw-r--r--+ 1 bgregg 1001 3497 Jul 31 22:40 man/man8/execsnoop.8!
    
slide 46:
    perf-tools
    • WARNING: These are unsupported hacks
    – May not work on some kernel versions without tweaking
    • I've tried to use stable approaches as much as possible, but
    it isn't always possible
    – May have higher overhead than expected
    • Extreme case: slow target app by 5x
    • See the "OVERHEAD" section in the man pages
    • If this is a problem, re-implement tool in C using perf_events
    style interface (dynamic buffered)
    – Over time this will improve as Linux includes more tracing
    features, and workarounds can be rewritten
    
slide 47:
    Dependencies
    • Depends on your Linux distribution
    – On our Ubuntu servers, perf-tools just works
    • Might need
    – mount -t debugfs none /sys/kernel/debug
    – CONFIG_DEBUG_FS, CONFIG_FUNCTION_PROFILER,
    CONFIG_FTRACE, CONFIG_KPROBES, …
    – awk (awk, mawk, or gawk), perl
    
slide 48:
    perf-tools
    • Current single purpose tools (Nov 2014):
    Tool	
      
    Descrip,on	
      
    iosnoop	
      
    trace	
      disk	
      I/O	
      with	
      details	
      including	
      latency	
      
    iolatency	
      
    summarize	
      disk	
      I/O	
      latency	
      as	
      a	
      histogram	
      
    execsnoop	
      
    trace	
      process	
      exec()	
      with	
      command	
      line	
      argument	
      details	
      
    opensnoop	
      
    trace	
      open()	
      syscalls	
      showing	
      filenames	
      
    killsnoop	
      
    trace	
      kill()	
      signals	
      showing	
      process	
      and	
      signal	
      details	
      
    syscount	
      
    count	
      syscalls	
      by	
      syscall	
      or	
      process	
      
    disk/bitesize	
      
    histogram	
      summary	
      of	
      disk	
      I/O	
      size	
      
    net/tcpretrans	
      
    show	
      TCP	
      retransmits,	
      with	
      address	
      and	
      other	
      details	
      
    tools/reset-­‐Irace	
       reset	
      Irace	
      state	
      if	
      needed	
      
    
slide 49:
    perf-tools
    • Current multi-tools (Nov 2014):
    Tool	
      
    Descrip,on	
      
    system/tpoint	
      
    trace	
      a	
      given	
      tracepoint	
      
    kernel/funccount	
       count	
      kernel	
      func7on	
      calls,	
      matching	
      a	
      string	
      
    kernel/functrace	
      
    trace	
      kernel	
      func7on	
      calls,	
      matching	
      a	
      string	
      
    kernel/funcslower	
       trace	
      kernel	
      func7ons	
      slower	
      than	
      a	
      threshold	
      
    kernel/funcgraph	
       graph	
      kernel	
      func7on	
      calls,	
      showing	
      children	
      and	
      7mes	
      
    kernel/kprobe	
      
    dynamically	
      trace	
      a	
      kernel	
      func7on	
      call	
      or	
      its	
      return,	
      with	
      
    variables	
      
    
slide 50:
    perf-tools (so far…)
    
slide 51:
    perf-tools (so far…)
    
slide 52:
    DEMO
    
slide 53:
    iosnoop
    • Block I/O (disk) events with latency:
    # ./iosnoop –ts!
    Tracing block I/O. Ctrl-C to end.!
    STARTs
    ENDs
    COMM
    5982800.302061 5982800.302679 supervise
    5982800.302423 5982800.302842 supervise
    5982800.304962 5982800.305446 supervise
    5982800.305250 5982800.305676 supervise
    […]!
    PID
    TYPE DEV
    202,1
    202,1
    202,1
    202,1
    BLOCK
    BYTES LATms!
    0.62!
    0.42!
    0.48!
    0.43!
    # ./iosnoop –h!
    USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]!
    -d device
    # device string (eg, "202,1)!
    -i iotype
    # match type (eg, '*R*' for all reads)!
    -n name
    # process name to match on I/O issue!
    -p PID
    # PID to match on I/O issue!
    # include queueing time in LATms!
    # include start time of I/O (s)!
    # include completion time of I/O (s)!
    # this usage message!
    duration
    # duration seconds, and use buffers!
    […]!
    
slide 54:
    iolatency
    • Block I/O (disk) latency distributions:
    # ./iolatency !
    Tracing block I/O. Output every 1 seconds. Ctrl-C to end.!
    >gt;=(ms) .. gt; 1
    : 1144
    |######################################|!
    1 ->gt; 2
    : 267
    |#########
    2 ->gt; 4
    : 10
    4 ->gt; 8
    : 5
    8 ->gt; 16
    : 248
    |#########
    16 ->gt; 32
    : 601
    |####################
    32 ->gt; 64
    : 117
    |####
    […]!
    • User-level processing sometimes can’t keep up
    – Over 50k IOPS. Could buffer more workaround, but would
    prefer in-kernel aggregations
    
slide 55:
    opensnoop
    • Trace open() syscalls showing filenames:
    # ./opensnoop -t!
    Tracing open()s. Ctrl-C to end.!
    TIMEs
    COMM
    PID
    postgres
    postgres
    postgres
    postgres
    postgres
    postgres
    postgres
    svstat
    svstat
    stat
    stat
    stat
    stat
    stat
    stat
    […]!
    FD FILE!
    0x8 /proc/self/oom_adj!
    0x5 global/pg_filenode.map!
    0x5 global/pg_internal.init!
    0x5 base/16384/PG_VERSION!
    0x5 base/16384/pg_filenode.map!
    0x5 base/16384/pg_internal.init!
    0x5 base/16384/11725!
    0x4 supervise/ok!
    0x4 supervise/status!
    0x3 /etc/ld.so.cache!
    0x3 /lib/x86_64-linux-gnu/libselinux…!
    0x3 /lib/x86_64-linux-gnu/libc.so.6!
    0x3 /lib/x86_64-linux-gnu/libdl.so.2!
    0x3 /proc/filesystems!
    0x3 /etc/nsswitch.conf!
    
slide 56:
    funcgraph
    • Trace a graph of kernel code flow:
    # ./funcgraph -Htp 5363 vfs_read!
    Tracing "vfs_read" for PID 5363... Ctrl-C to end.!
    # tracer: function_graph!
    TIME
    CPU DURATION
    FUNCTION CALLS!
    4346366.073832 |
    | vfs_read() {!
    4346366.073834 |
    rw_verify_area() {!
    4346366.073834 |
    security_file_permission() {!
    4346366.073834 |
    apparmor_file_permission() {!
    4346366.073835 |
    0.153 us
    common_file_perm();!
    4346366.073836 |
    0.947 us
    4346366.073836 |
    0.066 us
    __fsnotify_parent();!
    4346366.073836 |
    0.080 us
    fsnotify();!
    4346366.073837 |
    2.174 us
    4346366.073837 |
    2.656 us
    4346366.073837 |
    tty_read() {!
    4346366.073837 |
    0.060 us
    tty_paranoia_check();!
    […]!
    
slide 57:
    funccount
    • Count a kernel function call rate:
    # ./funccount -i 1 'bio_*'!
    Tracing "bio_*"... Ctrl-C to end.!
    FUNC
    COUNT!
    bio_attempt_back_merge
    26!
    bio_get_nr_vecs
    361!
    bio_alloc
    536!
    bio_alloc_bioset
    536!
    bio_endio
    536!
    bio_free
    536!
    bio_fs_destructor
    536!
    bio_init
    536!
    bio_integrity_enabled
    536!
    bio_put
    729!
    bio_add_page
    1004!
    [...]!
    Counts	
      are	
      in-­‐kernel,	
      
    for	
      low	
      overhead	
      
    – -i: set an output interval (seconds), otherwise until Ctrl-C
    
slide 58:
    kprobe
    • Just wrapping capabilities eases use. Eg, kprobes:
    # ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'!
    Tracing kprobe myopen. Ctrl-C to end.!
    postgres-1172 [000] d... 6594028.787166: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"!
    postgres-1172 [001] d... 6594028.797410: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"!
    postgres-1172 [001] d... 6594028.797467: open: (do_sys_open
    +0x0/0x220) filename="pg_stat_tmp/pgstat.stat”!
    ^C!
    Ending tracing...!
    • By some definition of “ease”. Would like easier symbol usage,
    instead of +0(%si).
    
slide 59:
    tpoint One-Liners
    # List tracepoints!
    tpoint -l!
    # Trace disk I/O device issue with details:!
    tpoint block:block_rq_issue!
    # Trace disk I/O queue insertion, with kernel stack trace:!
    tpoint -s block:block_rq_insert!
    # Show output format string and filter variables:!
    tpoint -v block:block_rq_insert!
    # Trace disk I/O queue insertion, for reads only:!
    tpoint block:block_rq_insert 'rwbs ~ "*R*"'!
    # Trace 1,000 disk I/O device issues:!
    tpoint block:block_rq_issue | head -1000!
    # Trace syscall open():!
    tpoint syscalls:sys_enter_open!
    
slide 60:
    Tracepoint Format Strings
    # ./tpoint -H block:block_rq_insert!
    Tracing block:block_rq_insert. Ctrl-C to end.!
    # tracer: nop!
    TASK-PID
    CPU#
    TIMESTAMP FUNCTION!
    | |
    java-9469 [000] 1936182.331270: block_rq_insert: 202,1 R 0 () 1125744 + 8 [java]!
    Comes	
      from	
      
    include/trace/events/block.h:!
    DECLARE_EVENT_CLASS(block_rq,!
    [...]!
    TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",!
    MAJOR(__entry->gt;dev), MINOR(__entry->gt;dev),!
    __entry->gt;rwbs, __entry->gt;bytes, __get_str(cmd),!
    (unsigned long long)__entry->gt;sector,!
    __entry->gt;nr_sector, __entry->gt;comm)!
    – Kernel source may be the only docs for tracepoints
    
slide 61:
    Tracepoint Format Strings
    • Can also use tpoint -v for reminders:
    # ./tpoint -v block:block_rq_issue!
    name: block_rq_issue!
    ID: 942!
    format:!
    field:unsigned short common_type;
    offset:0;
    size:2; signed:0;!
    field:unsigned char common_flags;
    offset:2;
    size:1; signed:0;!
    field:unsigned char common_preempt_count;
    offset:3;
    size:1; signed:0;!
    field:int common_pid;
    offset:4;
    size:4; signed:1;!
    field:dev_t dev;
    offset:8;
    size:4; signed:0;!
    field:sector_t sector; offset:16;
    size:8; signed:0;!
    field:unsigned int nr_sector;
    offset:24;
    size:4; signed:0;!
    field:unsigned int bytes;
    offset:28;
    size:4; signed:0;!
    field:char rwbs[8];
    offset:32;
    size:8; signed:1;!
    field:char comm[16];
    offset:40;
    size:16;
    signed:1;!
    field:__data_loc char[] cmd;
    offset:56;
    size:4; signed:1;!
    print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->gt;dev) >gt;>gt; 20)),
    ((unsigned int) ((REC->gt;dev) & ((1U gt;rwbs, REC->gt;bytes, __get_str(cmd),
    (unsigned long long)REC->gt;sector, REC->gt;nr_sector, REC->gt;comm!
    – Fields can be used in filters. Eg:
    • tpoint block:block_rq_insert 'rwbs ~ "*R*"'!
    
slide 62:
    funccount One-Liners
    # Count all block I/O functions:!
    funccount 'bio_*'!
    # Count all block I/O functions, print every 1 second:!
    funccount -i 1 'bio_*'!
    # Count all vfs functions for 5 seconds:!
    funccount -t 5 'vfs*'!
    # Count all "tcp_" functions, printing the top 5 every 1 second:!
    funccount -i 1 -t 5 'tcp_*'!
    # Count all "ext4*" functions for 10 seconds, print the top 25:!
    funccount -t 25 -d 10 'ext4*'!
    # Check which I/O scheduler is in use:!
    funccount -i 1 'deadline*'!
    funccount -i 1 'noop*'!
    # Count syscall types, summarizing every 1 second (one of):!
    funccount -i 1 'sys_*'!
    funccount -i 1 'SyS_*'!
    
slide 63:
    kprobe One-Liners
    # Trace calls to do_sys_open():!
    kprobe p:do_sys_open!
    # Trace returns from do_sys_open(), and include column header:!
    kprobe -H r:do_sys_open!
    # Trace do_sys_open() return as "myopen" alias, with return value:!
    kprobe 'r:myopen do_sys_open $retval'!
    # Trace do_sys_open() calls, and print register %cx as uint16 "mode":!
    kprobe 'p:myopen do_sys_open mode=%cx:u16'!
    # Trace do_sys_open() calls, with register %si as a "filename" string:!
    kprobe 'p:myopen do_sys_open filename=+0(%si):string'!
    # Trace do_sys_open() filenames, when they match "*stat":!
    kprobe 'p:myopen do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'!
    # Trace tcp_init_cwnd() with kernel call stack:!
    kprobe -s 'p:tcp_init_cwnd'!
    # Trace tcp_sendmsg() for PID 81 and for 5 seconds (buffered):!
    kprobe -p 81 -d 5 'p:tcp_sengmsg'!
    
slide 64:
    perf-tools Internals
    • If you ever read the tool source code…
    – They are designed to be:
    A. As stable as possible
    B. Use fewest dependencies
    C. Short, temporary, programs
    – They may be rewritten when newer tracing features exist
    D. Mindful of overheads
    – C implementations, like perf_event's dynamic buffered
    approach, would be better. But see (C).
    – Many tools have:
    • SIGPIPE handling, so "| head -100" etc.
    • -d duration, which buffers output, lowering overhead
    • In order of preference: bash, awk, perl5/python/…
    
slide 65:
    The AWK Wars
    • Tools may make use of gawk, mawk, or awk
    – Will check what is available, and pick the best option
    – mawk is faster, but (currently) less featured
    • Example issues encountered:
    – gawk has strftime(), mawk doesn't
    • Test: awk 'BEGIN { print strftime("%H:%M:%S") }'
    – gawk honors fflush(), mawk doesn't
    – mawk's "-W interactive" flushes too often: every column
    – gawk and mawk have inconsistent handlings of hex numbers:
    • prints "16 0" in mawk : mawk 'BEGIN { printf "%d %d\n", "0x10", 0x10 }'
    • prints "0 16" in gawk : gawk 'BEGIN { printf "%d %d\n", "0x10", 0x10 }'
    • prints "16 16" in gawk: gawk --non-decimal-data 'BEGIN { printf "%d %d
    \n", "0x10", 0x10 }'
    
slide 66:
    Much more to do… Porting more DTrace scripts
    
slide 67:
    Some Visual Tools
    • kernelshark
    – For ftrace
    • Trace Compass
    – To visualize LTTng (and more) time series trace data
    • Flame graphs
    – For any profiles with stack traces
    • Heat maps
    – To show distributions over time
    
slide 68:
    Kernelshark
    preemp7on	
      latency	
      
    wakeup	
      latency	
      
    
slide 69:
    Trace Compass
    
slide 70:
    perf CPU Flame Graph
    Kernel	
      
    TCP/IP	
      
    Broken	
      
    Java	
      stacks	
      
    (missing	
      
    frame	
      
    pointer)	
      
    GC	
      
    Locks	
      
    Time	
      
    Idle	
      
    thread	
      
    epoll	
      
    
slide 71:
    perf Block I/O Latency Heat Map
    
slide 72:
    Summary
    Some one-liners
    Background
    Technology
    Tools
    Most important take away: Linux can serve many
    tracing needs today with ftrace & perf_events
    
slide 73:
    Acks
    http://en.wikipedia.org/wiki/DTrace
    http://generalzoi.deviantart.com/art/Pony-Creator-v3-397808116 and
    Deirdré Straughan for the tracing pony mascots
    I Am SysAdmin (And So Can You!), Ben Rockwood, LISA14
    http://people.redhat.com/srostedt/kernelshark/HTML/ kernelshark
    screenshot
    https://projects.eclipse.org/projects/tools.tracecompass Trace Compass
    screenshot
    
slide 74:
    Links
    perf-tools
    • https://github.com/brendangregg/perf-tools
    • http://lwn.net/Articles/608497/
    perf_events
    • https://perf.wiki.kernel.org/index.php/Main_Page
    • http://www.brendangregg.com/perf.html
    perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
    eBPF: http://lwn.net/Articles/603983/
    ktap: http://www.ktap.org/
    SystemTap: https://sourceware.org/systemtap/
    sysdig: http://www.sysdig.org/
    Kernelshark: http://people.redhat.com/srostedt/kernelshark/HTML/
    Trace Compass: https://projects.eclipse.org/projects/tools.tracecompass
    Flame graphs: http://www.brendangregg.com/flamegraphs.html
    Heat maps: http://www.brendangregg.com/heatmaps.html
    
slide 75:
    Thanks
    • Questions?
    • http://slideshare.net/brendangregg
    • http://www.brendangregg.com
    • bgregg@netflix.com
    • @brendangregg