QCon SF 2015: Broken Performance Tools

Video: https://www.infoq.com/presentations/broken-performance-tools

Talk for QCon San Francisco 2015 by Brendan Gregg.

Description: "Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of system performance tools, touring common problems with system metrics, monitoring, statistics, visualizations, measurement overhead, and benchmarks. This will likely involve some unlearning, as you discover tools you have been using for years, are in fact, misleading, dangerous, or broken.

The speaker, Brendan Gregg, has given many popular talks on operating system performance tools. This is an anti-version of these talks, to focus on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice and methodologies for verifying new performance tools, understanding how they work, and using them successfully."


PDF: QCon2015_Broken_Performance_Tools.pdf

Keywords (from pdftotext):

slide 1:
    Nov 2015
    Broken Performance Tools
    Brendan Gregg
    Senior Performance Architect, Netflix
    
slide 2:
    CAUTION: PERFORMANCE TOOLS
    
slide 3:
    Over 60 million subscribers
    AWS EC2 Linux cloud
    FreeBSD CDN
    Awesome place to work
    
slide 4:
    This Talk
    • Observability, benchmarking, anti-patterns, and lessons
    • Broken and misleading things that are surprising
    Note: problems with current implementations are discussed, which
    may be fixed/improved in the future
    
slide 5:
    Observability:
    System Metrics
    
slide 6:
    LOAD AVERAGES
    
slide 7:
    RFC 546
    
slide 8:
    Load Averages (1, 5, 15 min)
    $ uptime
     22:08:07 up  9:05,  1 user,  load average: 11.42, 11.87, 12.12
    • "load"
    – Usually CPU demand (scheduler run queue length/latency)
    – On Linux, task demand: CPU + uninterruptible disk I/O (?)
    • "average"
    – Exponentially damped moving sum
    • "1, 5, and 15 minutes"
    – Constants used in the equation
    • Don't study these for longer than 10 seconds
    
slide 9:
    Figure: load begins at t=0 (1 thread); at 1 minute, the 1-minute average is ~0.62
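    A quick check on that ~0.62 (my arithmetic, using the textbook exponential form rather than the
    kernel's fixed-point constants): after one minute of a constant load of 1, the 1-minute average
    has reached
        $1 - e^{-60/60} = 1 - e^{-1} \approx 0.63$
    which is consistent with the ~0.62 shown.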
    
slide 10:
    Load Average
    "1 minute load average"
    really means…
    "The exponentially damped moving sum of
    CPU + uninterruptible disk I/O that uses
    a value of 60 seconds in its equation"
    
slide 11:
    TOP %CPU
    
slide 12:
    top %CPU
    $ top - 20:15:55 up 19:12, 1 user, load average: 7.96, 8.59, 7.05
    Tasks: 470 total,  1 running, 468 sleeping,  0 stopped,  1 zombie
    %Cpu(s): 28.1 us, 0.4 sy, 0.0 ni, 71.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.1 st
    KiB Mem:  61663100 total, 61342588 used,   320512 free,     9544 buffers
    KiB Swap:        0 total,        0 used,        0 free.  3324696 cached Mem
      PID USER          VIRT     RES    SHR S  %CPU %MEM     TIME+ COMMAND
    11959 apiprod  0 81.731g  0.053t  14476 S 935.8 92.1  13568:22 java
    12595 snmp          3256    1392         S   3.6  0.0   2:37.23 snmp-pass
    10447 snmp          6028    1432         S   2.0  0.0   2:12.12 snmpd
    18463 apiprod       1972    1176         R   0.7  0.0   0:00.07 top
    […]
    • Who is consuming CPU?
    • And by how much?
    
slide 13:
    top: Missing %CPU
    • Short-lived processes can be missing entirely
    – Process creates and exits in-between sampling /proc.
    e.g., software builds.
    – Try atop(1), or sampling using perf(1)
    • Stop clearing the screen!
    – No option to turn this off. Your eyes can miss updates.
    – I often use pidstat(1) on Linux instead. Scroll back for history.
    
slide 14:
    top: Misinterpreting %CPU
    • Different top(1)s use different calculations
    - On different OSes, check the man page, and run a test!
    • %CPU can mean:
    – A) Sum of per-CPU percents (0-Ncpu x 100%) consumed
    during the last interval
    – B) Percentage of total CPU capacity (0-100%) consumed
    during the last interval
    – C) (A) but historically damped (like load averages)
    – D) (B) but historically damped
    
slide 15:
    top: %Cpu vs %CPU
    $ top - 15:52:58 up 10 days, 21:58, 2 users, load average: 0.27, 0.53, 0.41
    Tasks: 180 total,  1 running, 179 sleeping,  0 stopped,  0 zombie
    %Cpu(s): 1.2 us, 24.5 sy, 0.0 ni, 67.2 id, 0.2 wa, 0.0 hi, 6.6 si, 0.4 st
    KiB Mem:   2872448 total,  2778160 used,    94288 free,    31424 buffers
    KiB Swap:  4151292 total,       76 used,  4151216 free.  2411728 cached Mem
      PID USER      VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
    12678 root                   912 S 100.4  0.0  0:23.52 iperf
    12675 root                   904 S  88.8  0.0  0:20.83 iperf
      215 root                     0 S   0.3  0.0  0:27.73 jbd2/sda1-8
    […]
    • This 4 CPU system is consuming:
    – 130% total CPU, via %Cpu(s)
    – 190% total CPU, via %CPU
    • Which one is right? Is either?
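    A rough cross-check of those two totals from the output above (my arithmetic): %Cpu(s) reports
    67.2% idle, so (100 - 67.2) x 4 CPUs ≈ 131% busy; summing the per-process %CPU column gives
    100.4 + 88.8 + 0.3 ≈ 190%.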
    
slide 16:
    CPU Summary Statistics
    • %Cpu row is from /proc/stat
    • linux/Documentation/cpu-load.txt:
    In most cases the `/proc/stat' information reflects
    the reality quite closely, however due to the nature
    of how/when the kernel collects this data
    sometimes it can not be trusted at all.
    • /proc/stat is used by everything for CPU stats
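    A sketch of how those percentages are derived (my example; the /proc/stat "cpu" fields are
    cumulative jiffies in the order user, nice, system, idle, iowait, irq, softirq, steal, guest,
    guest_nice, so tools take two samples and divide the deltas, e.g. %idle = Δidle / Δtotal):
      $ head -1 /proc/stat; sleep 1; head -1 /proc/stat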
    
slide 17:
    %CPU
    
slide 18:
    What is %CPU anyway?
    • "Good" %CPU:
    – Retiring instructions (provided they aren't a spin loop)
    – High IPC (Instructions-Per-Cycle)
    • "Bad" %CPU:
    – Stall cycles waiting on resources, usually memory I/O
    – Low IPC
    – Buying faster processors may make little difference
    • %CPU alone is ambiguous
    – Would love top(1) to split %CPU into cycles retiring vs stalled
    – Although, it gets worse…
    
slide 19:
    A CPU Mystery…
    • As load increased, CPU ms per request lowered (blue)
    – up to 1.84x faster
    • Was it due to:
    - Cache warmth? no
    - Different code? no
    - Turbo boost? no
    • (Same test, but
    problem fixed, is
    shown in red)
    
slide 20:
    CPU Speed Variation
    • Clock speed can vary thanks to:
    – Intel Turbo Boost: by hardware, based on power, temp, etc
    – Intel Speed Step: by software, controlled by the kernel
    • %CPU is still ambiguous, given IPC. Need to know the
    clock speed as well
    • CPU counters nowadays have "reference cycles"
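    A sketch for checking clock speed alongside %CPU (Linux perf event names; availability varies
    by CPU): ref-cycles ticks at the nominal rate, so a cycles/ref-cycles ratio above 1.0 suggests
    turbo, and below 1.0 suggests down-clocking.
      # perf stat -e cycles,ref-cycles,instructions -a sleep 10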
    
slide 21:
    Out-of-order Execution
    • CPUs execute uops out-of-order and in parallel across
    multiple functional units
    • %CPU doesn't account for
    how many units are active
    • Accounting each cycle as
    "stalled" or "retiring" is a
    simplification
    • Nowadays it's a lot of work
    to truly understand what
    CPUs are doing
    https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
      
    
slide 22:
    I/O WAIT
    
slide 23:
    I/O Wait
    $ mpstat -P ALL 1
    08:06:43 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
    08:06:44 PM  all  53.45   […]
    […]
    • Suggests system is disk I/O bound, but often misleading
    • Comparing I/O wait between system A and B:
    - higher might be bad: slower disks, more blocking
    - lower might be bad: slower processor and architecture
    consumes more CPU, obscuring I/O wait
    • Solaris implementation was also broken and later
    hardwired to zero
    • Can be very useful when understood: another idle state
    
slide 24:
    I/O Wait Venn Diagram
    Figure: two overlapping sets per CPU, "CPU" and "Waiting for disk I/O";
    the regions are reported as "CPU" (CPU busy, with or without a disk wait),
    "I/O Wait" (waiting for disk only), and "Idle" (neither).
    
slide 25:
    FREE MEMORY
    
slide 26:
    Free Memory
    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:             …          …          0          …          …          …
    -/+ buffers/cache:          …          …
    Swap:            …          …          …
    • "free" is near-zero: I'm running
    out of memory!
    - No, it's in the file system cache,
    and is still free for apps to use
    • Linux free(1) explains it, but
    other tools, e.g. vmstat(1), don't
    • Some file systems (e.g., ZFS)
    may not be shown in the
    system's cached metrics at all
    www.linuxatemyram.com
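    A quick way to see where the memory actually went (field names are from /proc/meminfo on Linux):
      $ grep -E '^(MemFree|Buffers|Cached):' /proc/meminfo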
      
    
slide 27:
    VMSTAT
    
slide 28:
    vmstat(1)
    $ vmstat -Sm 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     8  0   …   12 25 34 0 0
     7  0   …   0 205 186 46 13 0 0
     8  0   …   8 210 435 39 21 0 0
     8  0   …   0 218 219 42 17 0 0
    […]
      
    • Linux: the first line has some summary-since-boot values —
    confusing!
    • Other implementations:
    - "r" may be sampled once per second. Almost useless.
    - Columns like "de" for deficit, making much less sense for non-page-scanned situations
    
slide 29:
    NETSTAT -S
    
slide 30:
    netstat -s
    $ netstat -s
    Ip:
    7962754 total packets received
    8 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    7962746 incoming packets delivered
    8019427 requests sent out
    Icmp:
    382 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    destination unreachable: 125
    timeout in transit: 257
    3410 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
    destination unreachable: 3410
    IcmpMsg:
    InType3: 125
    InType11: 257
    OutType3: 3410
    Tcp:
    17337 active connections openings
    395515 passive connection openings
    8953 failed connection attempts
    240214 connection resets received
    3 connections established
    7198375 segments received
    7504939 segments send out
    62696 segments retransmited
    10 bad segments received.
    1072 resets sent
    InCsumErrors: 5
    Udp:
    759925 packets received
    3412 packets to unknown port received.
    0 packet receive errors
    784370 packets sent
    UdpLite:
    TcpExt:
    858 invalid SYN cookies received
    8951 resets received for embryonic SYN_RECV sockets
    14 packets pruned from receive queue because of socket buffer overrun
    6177 TCP sockets finished time wait in fast timer
    293 packets rejects in established connections because of timestamp
    733028 delayed acks sent
    89 delayed acks further delayed because of locked socket
    Quick ack mode was activated 13214 times
    336520 packets directly queued to recvmsg prequeue.
    43964 packets directly received from backlog
    11406012 packets directly received from prequeue
    1039165 packets header predicted
    7066 packets header predicted and directly queued to user
    1428960 acknowledgments not containing data received
    1004791 predicted acknowledgments
    1 times recovered from packet loss due to fast retransmit
    5044 times recovered from packet loss due to SACK data
    2 bad SACKs received
    Detected reordering 4 times using SACK
    Detected reordering 11 times using time stamp
    13 congestion windows fully recovered
    11 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 39
    2384 congestion windows recovered after partial ack
    228 timeouts after SACK recovery
    100 timeouts in loss state
    5018 fast retransmits
    39 forward retransmits
    783 retransmits in slow start
    32455 other TCP timeouts
    TCPLossProbes: 30233
    TCPLossProbeRecovery: 19070
    992 sack retransmits failed
    18 times receiver scheduled too late for direct processing
    705 packets collapsed in receive queue due to low socket buffer
    13658 DSACKs sent for old packets
    8 DSACKs sent for out of order packets
    13595 DSACKs received
    33 DSACKs for out of order packets received
    32 connections reset due to unexpected data
    108 connections reset due to early user close
    1608 connections aborted due to timeout
    TCPSACKDiscard: 4
    TCPDSACKIgnoredOld: 1
    TCPDSACKIgnoredNoUndo: 8649
    TCPSpuriousRTOs: 445
    TCPSackShiftFallback: 8588
    TCPRcvCoalesce: 95854
    TCPOFOQueue: 24741
    TCPOFOMerge: 8
    TCPChallengeACK: 1441
    TCPSYNChallenge: 5
    TCPSpuriousRtxHostQueues: 1
    TCPAutoCorking: 4823
    IpExt:
    InOctets: 1561561375
    OutOctets: 1509416943
    InNoECTPkts: 8201572
    InECT1Pkts: 2
    InECT0Pkts: 3844
    InCEPkts: 306
    
slide 31:
    netstat -s
    […]
    Tcp:
    17337 active connections openings
    395515 passive connection openings
    8953 failed connection attempts
    240214 connection resets received
    3 connections established
    7198870 segments received
    7505329 segments send out
    62697 segments retransmited
    10 bad segments received.
    1072 resets sent
    InCsumErrors: 5
    […]
    
slide 32:
    netstat -s
    • Many metrics on Linux (can be over 200)
    • Still doesn't include everything: getting better, but don't
    assume everything is there
    • Includes typos & inconsistencies
    • Might be more readable to:
    cat /proc/net/snmp /proc/net/netstat
    • Totals since boot can be misleading
    • On Linux, -s needs -c support
    • Often no documentation outside kernel source code
    • Requires expertise to comprehend
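    Since the totals are since boot, per-interval deltas are usually what you want; a crude sketch
    (take two samples and compare the lines that changed):
      $ netstat -s > /tmp/netstat.1; sleep 10; netstat -s > /tmp/netstat.2
      $ diff /tmp/netstat.1 /tmp/netstat.2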
    
slide 33:
    DISK METRICS
    
slide 34:
    Disk Metrics
    • All disk metrics are misleading
    • Disk %utilization / %busy
    – Logical devices (volume managers) can process requests in
    parallel, and may accept more I/O at 100%
    • Disk IOPS
    – High IOPS is "bad"? That depends…
    • Disk latency
    – Does it matter? File systems and volume managers try hard
    to hide latency and make latency asynchronous
    – Better measuring latency via application->FS calls
    
slide 35:
    Rules for Metrics Makers
    • They must work
    – As well as possible. Clearly document caveats.
    • They must be useful
    – Document a real use case (eg, my example.txt files).
    If you get stuck, it's not useful – ditch it.
    • Aim to be intuitive
    – Document it. If it's too weird to explain, redo it.
    • As few as possible
    – Respect end-user's time
    • Good system examples:
    – iostat -x: workload columns, then resulting perf columns
    – Linux sar: consistency, units on columns, logical groups
    
slide 36:
    Observability:
    Profilers
    
slide 37:
    PROFILERS
    
slide 38:
    Linux perf
    • Can sample stack traces and summarize output:
    # perf report -n --stdio
    […]
    # Overhead       Samples  Command  Shared Object      Symbol
    # ........  ............  .......  .................  .............................
    #
        20.42%          […]   bash     [kernel.kallsyms]  [k] xen_hypercall_xen_version
                    --- xen_hypercall_xen_version
                        check_events
                        |--44.13%-- syscall_trace_enter
                        |           tracesys
                        |--35.58%-- __GI___libc_fcntl
                        |--65.26%-- do_redirection_internal
                        |           do_redirections
                        |           execute_builtin_or_function
                        |           execute_simple_command
    [… ~13,000 lines truncated …]
    
slide 39:
    Too Much Output
    
slide 40:
    … as a Flame Graph
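    For reference, the usual perf-to-flame-graph pipeline (a sketch using the FlameGraph scripts
    from https://github.com/brendangregg/FlameGraph; the output filename is arbitrary):
      # git clone https://github.com/brendangregg/FlameGraph
      # perf record -F 99 -a -g -- sleep 30
      # perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > out.svg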
    
slide 41:
    PROFILER VISIBILITY
    
slide 42:
    System Profilers with Java
    Figure: system profile of a Java app, with regions labeled Kernel, TCP/IP, JVM, Locks,
    epoll, Idle thread, and Java, over a Time axis
    
slide 43:
    System Profilers with Java
    • e.g., Linux perf
    • Visibility
    – JVM (C++)
    – GC (C++)
    – libraries (C)
    – kernel (C)
    • Typical problems (x86):
    – Stacks missing for Java and other runtimes
    – Symbols missing for Java methods
    • Profile everything except Java and similar runtimes
    
slide 44:
    Java Profilers
    Figure: Java profiler view, regions labeled "Kernel, libraries, JVM" and "Java"
    
slide 45:
    Java Profilers
    • Visibility
    – Java method execution
    – Object usage
    – GC logs
    – Custom Java context
    • Typical problems:
    – Sampling often happens at safety/yield points (skew)
    – Method tracing has massive observer effect
    – Misidentifies RUNNING as on-CPU (e.g., epoll)
    – Doesn't include or profile GC or JVM CPU time
    – Tree views not quick (proportional) to comprehend
    • Inaccurate (skewed) and incomplete profiles
    
slide 46:
    COMPILER OPTIMIZATIONS
    
slide 47:
    Broken System Stack Traces
    • Profiling Java on
    x86 using perf
    • The stacks are
    1 or 2 levels
    deep, and have
    junk values
    # perf record -F 99 -a -g -- sleep 30
    # perf script
    […]
    java  4579 cpu-clock:
        ffffffff8172adff tracesys ([kernel.kallsyms])
            7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
    java  4579 cpu-clock:
            7f417908c10b [unknown] (/tmp/perf-4458.map)
    java  4579 cpu-clock:
            7f4179101c97 [unknown] (/tmp/perf-4458.map)
    java  4579 cpu-clock:
            7f41792fc65f [unknown] (/tmp/perf-4458.map)
        a2d53351ff7da603 [unknown] ([unknown])
    java  4579 cpu-clock:
            7f4179349aec [unknown] (/tmp/perf-4458.map)
    java  4579 cpu-clock:
            7f4179101d0f [unknown] (/tmp/perf-4458.map)
    […]
    
slide 48:
    Why Stacks are Broken
    • On x86 (x86_64), hotspot uses the frame pointer register
    (RBP) as general purpose
    • This "compiler optimization" breaks stack walking
    • Once upon a time, x86 had fewer registers, and this made
    much more sense
    • gcc provides -fno-omit-frame-pointer to avoid
    doing this
    – JDK8u60+ now has this as -XX:+PreserveFramePointer
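    A sketch of applying both flags (option names as above; myapp.c and the java command line are
    placeholders):
      # native code: keep the frame pointer so stack walkers can use it
      gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c
      # JDK8u60+: keep RBP as the frame pointer so perf can walk Java frames
      java -XX:+PreserveFramePointer ...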
    
slide 49:
    Missing Symbols
    • Missing symbols may show up as hex; e.g., Linux perf:
    71.79%  sed  sed              [.] 0x000000000001afc1      ← broken
        |--11.65%-- 0x40a447
                    0x40659a
                    0x408dd8
                    0x408ed1
                    0x402689
                    0x7fa1cd08aec5
    12.06%  sed  sed              [.] re_search_internal      ← not broken
        --- re_search_internal
           |--96.78%-- re_search_stub
                       rpl_re_search
                       match_regex
                       do_subst
                       execute_program
                       process_files
                       main
                       __libc_start_main
    
slide 50:
    Fixing Symbols
    For applications, install debug symbol package
    For JIT'd code, Linux perf already looks for an
    externally provided symbol file: /tmp/perf-PID.map
    # perf script
    Failed to open /tmp/perf-8131.map, continuing without symbols
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    […]
    Find a way to create this for your runtime
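    For reference, the map file format is one JIT'd symbol per line: start address (hex), size (hex),
    symbol name. The addresses and methods below are made up for illustration:
      $ head -3 /tmp/perf-8131.map
      7fd3017a2060 140 java/lang/String.hashCode
      7fd3017a2320 2a0 java/util/HashMap.get
      7fd3017a2900 1e0 java/lang/String.equals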
    
slide 51:
    INSTRUCTION PROFILING
    
slide 52:
    Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
     Percent |  Source code & Disassembly of noplooper
    -------------------------------------------------------:
    Disassembly of section .text:

    00000000004004ed <main>:
       0.00 :   4004ed:  push   %rbp
       0.00 :   4004ee:  mov    %rsp,%rbp
      20.86 :   4004f1:  nop
       0.00 :   4004f2:  nop
       0.00 :   4004f3:  nop
       0.00 :   4004f4:  nop
      19.84 :   4004f5:  nop
       0.00 :   4004f6:  nop
       0.00 :   4004f7:  nop
       0.00 :   4004f8:  nop
      18.73 :   4004f9:  nop
       0.00 :   4004fa:  nop
       0.00 :   4004fb:  nop
       0.00 :   4004fc:  nop
      19.08 :   4004fd:  nop
       0.00 :   4004fe:  nop
       0.00 :   4004ff:  nop
       0.00 :   400500:  nop
      21.49 :   400501:  jmp    4004f1 <main+0x4>
    • Often broken nowadays due
    to skid, out-of-order
    execution, and sampling the
    resumption instruction
    • Better with PEBS support
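    Where the hardware supports it, precise sampling helps; a sketch with Linux perf (the :pp event
    modifier requests precise, PEBS-based samples):
      # perf record -F 99 -a -e cycles:pp -- sleep 10
      # perf annotate --stdio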
    
slide 53:
    Observability:
    Overhead
    
slide 54:
    TCPDUMP
    
slide 55:
    tcpdump
    $ tcpdump -i eth0 -w /tmp/out.tcpdump
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    ^C7985 packets captured
    8996 packets received by filter
    1010 packets dropped by kernel
    • Packet tracing doesn't scale. Overheads:
    – CPU cost of per-packet tracing (improved by [e]BPF)
    • Consider CPU budget per-packet at 10/40/100 GbE
    – Transfer to user-level (improved by ring buffers)
    – File system storage (more CPU, and disk I/O)
    – Possible additional network transfer
    • Can also drop packets when overloaded
    • You should only trace send/receive as a last resort
    – I solve problems by tracing lower frequency TCP events
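    One example of that lower-frequency approach (a sketch; tcpretrans is from the perf-tools
    collection at https://github.com/brendangregg/perf-tools):
      # ./tcpretrans        # one line of output per TCP retransmit, not per packet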
    
slide 56:
    STRACE
    
slide 57:
    strace
    • Before:
    $ dd if=/dev/zero of=/dev/null bs=1 count=500k
    […]
    512000 bytes (512 kB) copied, 0.103851 s, 4.9 MB/s
    • After:
    $ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
    […]
    512000 bytes (512 kB) copied, 45.9599 s, 11.1 kB/s
    • 442x slower. This is worst case.
    • strace(1) pauses the process twice for each syscall.
    This is like putting metering lights on your app.
    – "BUGS: A traced process runs slowly." – strace(1) man page
    – Use buffered tracing / in-kernel counters instead, e.g. DTrace
    
slide 58:
    DTRACE
    
slide 59:
    DTrace
    • Overhead often negligible, but not always
    • Before:
    # time wc systemlog
    262600 2995200 23925200 systemlog
    real    0m1.098s
    user    0m1.085s
    sys     0m0.012s
    • After:
    # time dtrace -n 'pid$target:::entry { @[probefunc] = count(); }' -c 'wc systemlog'
    dtrace: description 'pid$target:::entry ' matched 3756 probes
    262600 2995200 23925200 systemlog
    […]
    real    7m2.896s
    user    7m2.650s
    sys     0m0.572s
    • 384x slower. Fairly worst case: frequent pid probes.
    
slide 60:
    Tracing Dangers
    • Overhead potential exists for all tracers
    – Overhead = event instrumentation cost X frequency of event
    • Costs
    – Lower: counters, in-kernel aggregations
    – Higher: event dumps, stack traces, string copies, copyin/outs
    • Frequencies
    – Lower: process creation & destruction, disk I/O (usually), …
    – Higher: instructions, functions in I/O hot path, malloc/free,
    Java methods, …
    • Advice
    – > 100,000 events/sec, overhead may start to be measurable
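    To make that concrete (my arithmetic, with round numbers): at 100,000 events/sec, an
    instrumentation cost of 1 µs per event adds 100,000 × 1 µs = 0.1 s of CPU per second,
    i.e. roughly 10% of one CPU.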
    
slide 61:
    DTraceToolkit
    • My own tools that can cause massive overhead:
    – dapptrace/dappprof: can trace all native functions
    – Java/j_flow.d, ...: can trace all Java methods with
    +ExtendedDTraceProbes
    # j_flow.d
      PID TIME(us)                 -- CLASS.METHOD
    0 311403 4789112583163         -> java/lang/Object.<…>
    0 311403 4789112583207           -> java/lang/Object.registerNatives
    0 311403 4789112583323           <…>
    0 311403 4789112583333         -> java/lang/String.<…>
    0 311403 4789112583343         -> java/lang/String$CaseInsensitiveComparator.<…>
    0 311403 4789112583732         -> java/lang/String$CaseInsensitiveComparator.<…>
    0 311403 4789112583743         -> java/lang/Object.<…>
    0 311403 4789112583752         <…>
    [...]
    • Useful for debugging, but should warn about overheads
    
slide 62:
    VALGRIND
    
slide 63:
    Valgrind
    • A suite of tools including an extensive leak detector
    "Your	
      program	
      will	
      run	
      much	
      slower	
      
    (eg.	
      20	
      to	
      30	
      Imes)	
      than	
      normal"	
      
    	
      
    –	
      h/p://valgrind.org/docs/manual/quick-­‐start.html	
      
    • To its credit it does warn the end user
    
slide 64:
    JAVA PROFILERS
    
slide 65:
    Java Profilers
    • Some Java profilers have two modes:
    – Sampling stacks: eg, at 100 Hertz
    – Tracing methods: instrumenting and timing every method
    • Method timing has been described as "highly accurate",
    despite slowing the target by up to 1000x!
    • Issues & advice already covered at QCon:
    – Nitsan Wakart "Profilers are Lying Hobbitses" earlier today
    – Java track tomorrow
    
slide 66:
    Observability:
    Monitoring
    
slide 67:
    MONITORING
    
slide 68:
    Monitoring
    • By now you should recognize these pathologies:
    Let's just graph the system metrics!
    • That's not the problem that needs solving
    Let's just trace everything and post process!
    • Now you have one million problems per second
    • Monitoring adds additional problems:
    – Let's have a cloud-wide dashboard update per-second!
    • From every instance? Packet overheads?
    – Now we have billions of metrics!
    
slide 69:
    Observability:
    Statistics
    
slide 70:
    STATISTICS
    
slide 71:
    Statistics
    "Then	
      there	
      is	
      the	
      man	
      who	
      drowned	
      crossing	
      a	
      
    stream	
      with	
      an	
      average	
      depth	
      of	
      six	
      inches."	
      	
      	
      
    	
      
    –	
      W.I.E.	
      Gates	
      
    
slide 72:
    Statistics
    • Averages can be misleading
    – Hide latency outliers
    – Per-minute averages can hide multi-second issues
    • Percentiles can be misleading
    – Probability of hitting 99.9th latency may be more than 1/1000
    after many dependency requests
    • Show the distribution:
    – Summarize: histogram, density plot, frequency trail
    – Over-time: scatter plot, heat map
    • See Gil Tene's "How Not to Measure Latency" QCon talk
    from earlier today
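    A worked example of the percentile point above (my arithmetic): if a request fans out to 100
    dependencies, the chance that at least one of them exceeds its 99.9th percentile latency is
    1 - 0.999^100 ≈ 9.5%, far more than 1 in 1000.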
    
slide 73:
    Average Latency
    • When the index of central tendency isn't…
    
slide 74:
    Observability:
    Visualizations
    
slide 75:
    VISUALIZATIONS
    
slide 76:
    Tachometers
    …especially with arbitrary color highlighting
    
slide 77:
    Pie Charts
    usr   sys   wait   idle
    …for real-time metrics
    
slide 78:
    Doughnuts
    usr   sys   wait   idle
    …like pie charts but worse
    
slide 79:
    Traffic Lights
    RED == BAD (usually)
    GREEN == GOOD (hopefully)
    …when used for subjective metrics
    These can be used for objective metrics
    
slide 80:
    Benchmarking
    
slide 81:
    BENCHMARKING
    
slide 82:
    ~100% of benchmarks are wrong
    
slide 83:
    "Most popular benchmarks are flawed"
    Source: Traeger, A., E. Zadok, N. Joukov, and C. Wright.
    “A Nine Year Study of File System and Storage
    Benchmarking,” ACM Transactions on Storage, 2008.
    Not only can a popular benchmark be broken, but so can all
    alternates.
    
slide 84:
    REFUTING BENCHMARKS
    
slide 85:
    The energy needed
    to refute benchmarks
    is multiple orders of magnitude
    bigger than to run them
    It can take 1-2 weeks of senior performance engineering
    time to debug a single benchmark.
    
slide 86:
    Benchmarking
    • Benchmarking is a useful form of experimental analysis
    – Try observational first; benchmarks can perturb
    • Accurate and realistic benchmarking is vital for technical
    investments that improve our industry
    • However, benchmarking is error prone
    
slide 87:
    COMMON MISTAKES
    
slide 88:
    Common Mistakes
    1. Testing the wrong target
    – eg, FS cache instead of disk; misconfiguration
    2. Choosing the wrong target
    – eg, disk instead of FS cache … doesn’t resemble real world
    3. Invalid results
    – benchmark software bugs
    4. Ignoring errors
    – error path may be fast!
    5. Ignoring variance or perturbations
    – real workload isn't steady/consistent, which matters
    6. Misleading results
    – you benchmark A, but actually measure B, and conclude you
    measured C
    
slide 89:
    PRODUCT EVALUATIONS
    
slide 90:
    Product Evaluations
    • Benchmarking is used for product evaluations & sales
    • The Benchmark Paradox:
    – If your product’s chances of winning a benchmark are
    50/50, you’ll usually lose
    – To justify a product switch, a customer may run several
    benchmarks, and expect you to win them all
    – May mean winning a coin toss at least 3 times in a row
    – http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
    • Solving this seeming paradox (and benchmarking):
    – Confirm benchmark is relevant to intended workload
    – Ask: why isn't it 10x?
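    On the coin-toss point above: the chance of winning three fair tosses in a row is
    0.5^3 = 12.5%, which is roughly the bar a product with 50/50 odds faces.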
    
slide 91:
    Active Benchmarking
    • Root cause performance analysis while the
    benchmark is still running
    – Use observability tools
    – Identify the limiter (or suspected limiter) and include it with the
    benchmark results
    – Answer: why not 10x?
    • This takes time, but uncovers most mistakes
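    A minimal active-benchmarking sketch (tool choice is mine; run_benchmark.sh is a placeholder
    for whatever benchmark is under test):
      # terminal 1: the benchmark
      $ ./run_benchmark.sh
      # terminal 2: observe while it runs, and include the findings with the results
      $ mpstat -P ALL 1       # CPU balance, %usr/%sys/%iowait
      $ iostat -x 1           # disk workload and resulting utilization/await
      $ pidstat 1             # which processes are on-CPU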
    
slide 92:
    MICRO BENCHMARKS
    
slide 93:
    Micro Benchmarks
    • Test a specific function in isolation. e.g.:
    – File system maximum cached read operations/sec
    – Network maximum throughput
    • Examples of bad microbenchmarks:
    – getpid() in a tight loop
    – speed of /dev/zero and /dev/null
    • Common problems:
    – Testing a workload that is not very relevant
    – Missing other workloads that are relevant
    
slide 94:
    MACRO BENCHMARKS
    
slide 95:
    Macro Benchmarks
    • Simulate application user load. e.g.:
    – Simulated web client transaction
    • Common problems:
    – Misplaced trust: believed to be realistic, but misses variance,
    errors, perturbations, etc.
    – Complex to debug, verify, and root cause
    
slide 96:
    KITCHEN SINK BENCHMARKS
    
slide 97:
    Kitchen Sink Benchmarks
    • Run everything!
    – Mostly random benchmarks found on the Internet, where
    most are broken or irrelevant
    – Developers focus on collecting more benchmarks than
    verifying or fixing the existing ones
    • Myth that more benchmarks == greater accuracy
    – No, use active benchmarking (analysis)
    
slide 98:
    AUTOMATION
    
slide 99:
    Automated Benchmarks
    • Completely automated procedure. e.g.:
    – Cloud benchmarks: spin up an instance, benchmark, destroy.
    Automate.
    • Little or no provision for debugging
    • Automation is only part of the solution
    
slide 100:
    Benchmarking:
    More Examples
    
slide 101:
    BONNIE++
    
slide 102:
    bonnie++
    • "simple tests of hard drive and file system performance"
    • First metric printed by (thankfully) older versions:
    per character sequential output
    • What was actually tested:
    – 1 byte writes to libc (via putc())
    – 4 Kbyte writes from libc -> FS (depends on OS; see setbuffer())
    – 128 Kbyte async writes to disk (depends on storage stack)
    – Any file system throttles that may be present (eg, ZFS)
    – C++ code, to some extent (bonnie++ 10% slower than Bonnie)
    • Actual limiter:
    – Single threaded write_block_putc() and putc() calls
    
slide 103:
    APACHE BENCH
    
slide 104:
    Apache Bench
    • HTTP web server benchmark
    • Single thread limited (use wrk for multi-threaded)
    • Keep-alive option (-k):
    – without: Can become an unrealistic TCP session benchmark
    – with: Can become an unrealistic server throughput test
    • Performance issues of ab's own code
    
slide 105:
    UNIXBENCH
    
slide 106:
    UnixBench
    • The original kitchen-sink micro benchmark from 1984,
    published in BYTE magazine
    • Innovative & useful for the time, but that time has passed
    • More problems than I can shake a stick at
    • Starting with…
    
slide 107:
    COMPILERS
    
slide 108:
    UnixBench Makefile
    • Default (by ./Run) for Linux. Would you edit it? Then what?
    ## Very generic
    #OPTON = -O
    ## For Linux 486/Pentium, GCC 2.7.x and 2.8.x
    #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
    # -m486 -malign-loops=2 -malign-jumps=2 -malign-functions=2
    ## For Linux, GCC previous to 2.7.0
    #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math -m486
    #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
    # -m386 -malign-loops=1 -malign-jumps=1 -malign-functions=1
    ## For Solaris 2, or general-purpose GCC 2.7.x
    OPTON = -O2 -fomit-frame-pointer -fforce-addr -ffast-math -Wall
    ## For Digital Unix v4.x, with DEC cc v5.x
    #OPTON = -O4
    #CFLAGS = -DTIME -std1 -verbose -w0
    
slide 109:
    UnixBench Makefile
    • "Fixing" the Makefile improved the first result, Dhrystone
    2, by 64%
    • Is everyone "fixing" it the same way, or not? Are they
    using the same compiler version? Same OS? (No.)
    
slide 110:
    UnixBench Documentation
    "The results will depend not only on your
    hardware, but on your operating system,
    libraries, and even compiler."
    "So you may want to make sure that all your
    test systems are running the same version of
    the OS; or at least publish the OS and
    compuiler versions with your results."
    
slide 111:
    SYSTEM MICROBENCHMARKS
    
slide 112:
    UnixBench Tests
    • Results summarized as "The BYTE Index". From USAGE:
    system:
      dhry2reg          Dhrystone 2 using register variables
      whetstone-double  Double-Precision Whetstone
      syscall           System Call Overhead
      pipe              Pipe Throughput
      context1          Pipe-based Context Switching
      spawn             Process Creation
      execl             Execl Throughput
      fstime-w          File Write 1024 bufsize 2000 maxblocks
      fstime-r          File Read 1024 bufsize 2000 maxblocks
      fstime            File Copy 1024 bufsize 2000 maxblocks
      fsbuffer-w        File Write 256 bufsize 500 maxblocks
      fsbuffer-r        File Read 256 bufsize 500 maxblocks
      fsbuffer          File Copy 256 bufsize 500 maxblocks
      fsdisk-w          File Write 4096 bufsize 8000 maxblocks
      fsdisk-r          File Read 4096 bufsize 8000 maxblocks
      fsdisk            File Copy 4096 bufsize 8000 maxblocks
      shell1            Shell Scripts (1 concurrent) (runs "looper 60 multi.sh 1")
      shell8            Shell Scripts (8 concurrent) (runs "looper 60 multi.sh 8")
      shell16           Shell Scripts (8 concurrent) (runs "looper 60 multi.sh 16")
    • What can go wrong? Everything.
    
slide 113:
    Anti-Patterns
    
slide 114:
    ANTI-PATTERNS
    
slide 115:
    Street Light Anti-Method
    1. Pick observability tools that are:
    – Familiar
    – Found on the Internet
    – Found at random
    2. Run tools
    3. Look for obvious issues
    
slide 116:
    Blame Someone Else Anti-Method
    1. Find a system or environment component you are not
    responsible for
    2. Hypothesize that the issue is with that component
    3. Redirect the issue to the responsible team
    4. When proven wrong, go to 1
    
slide 117:
    Performance Tools Team
    • Having a separate performance tools team, who creates
    tools but doesn't use them (no production exposure)
    • At Netflix:
    – The performance engineering team builds tools and uses
    tools for both service consulting and live production triage
    • Mogul, Vector, …
    – Other teams (CORE, traffic, …) also build performance tools
    and use them during issues
    • Good performance tools are built out of necessity
    
slide 118:
    Messy House Fallacy
    • Fallacy: my code is a mess, I bet yours is immaculate,
    therefore the bug must be mine
    • Reality: everyone's code is terrible and buggy
    • When analyzing performance, don't overlook the system:
    kernel, libraries, etc.
    
slide 119:
    Lessons
    
slide 120:
    PERFORMANCE TOOLS
    
slide 121:
    Observability
    • Trust nothing, verify everything
    – Cross-check with other observability tools
    – Write small "known" workloads, and confirm metrics match
    – Find other sanity tests: e.g. check known system limits
    – Determine how metrics are calculated, averaged, updated
    • Find metrics to solve problems
    – Instead of understanding hundreds of system metrics
    – What problems do you want to observe? What metrics would
    be sufficient? Find, verify, and use those. e.g., USE Method.
    – The metric you want may not yet exist
    • File bugs, get these fixed/improved
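    One way to run that "small known workload" check (a sketch; the workload and expected numbers
    are mine):
      # known workload: exactly 2 CPU-bound shell loops
      $ for i in 1 2; do (while :; do :; done) & done
      # verify: mpstat -P ALL should show two CPUs near 100% usr, and the 1-minute
      # load average should settle near 2 (kill the loops when done)
      $ mpstat -P ALL 1
      $ uptime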
    
slide 122:
    Observe Everything
    • Use functional diagrams to pose Q's and find missing metrics:
    
slide 123:
    Profile Everything
    Figure: Java Mixed-Mode Flame Graph, regions labeled Kernel, Java, JVM
    
slide 124:
    Visualize Everything
    
slide 125:
    Benchmark Nothing
    • Trust nothing, verify everything
    • Do Active Benchmarking:
    1. Configure the benchmark to run in steady state, 24x7
    2. Do root-cause analysis of benchmark performance
    3. Answer: why is it not 10x?
    
slide 126:
    Links & References
    https://www.rfc-editor.org/rfc/rfc546.pdf
    https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
    http://www.linuxatemyram.com/
    Traeger, A., E. Zadok, N. Joukov, and C. Wright. “A Nine Year Study of File System
    and Storage Benchmarking,” ACM Transactions on Storage, 2008.
    http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
    http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
    http://www.brendangregg.com/ActiveBenchmarking/bonnie++.html
    https://blogs.oracle.com/roch/entry/decoding_bonnie
    http://www.brendangregg.com/blog/2014-05-02/compilers-love-messing-with-benchmarks.html
    https://code.google.com/p/byte-unixbench/
    https://qconsf.com/sf2015/presentation/how-not-measure-latency
    https://qconsf.com/sf2015/presentation/profilers-lying
    Caution signs drawn by me, inspired by real-world signs
    
slide 127:
    THANKS
    
slide 128:
    Nov 2015
    Thanks
    Questions?
    http://techblog.netflix.com
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg