SCaLE14x (2016): Broken Linux Performance Tools

Video: https://www.youtube.com/watch?v=OPio8V-z03c

Talk by Brendan Gregg for the Southern California Linux Expo 14x (2016).

Description: "Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of Linux performance tools, touring common problems with system tools, metrics, statistics, visualizations, measurement overhead, and benchmarks. You might discover that tools you have been using for years, are in fact, misleading, dangerous, or broken.

The speaker, Brendan Gregg, has given many talks on tools that work, including giving the Linux Performance Tools talk originally at SCALE. This is an anti-version of that talk, to focus on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice for verifying new performance tools, understanding how they work, and using them successfully."


PDF: SCALE2016_Broken_Linux_Performance_Tools.pdf

Keywords (from pdftotext):

slide 1:
    Jan 2016

    Broken Linux Performance Tools
    Brendan Gregg
    Senior Performance Architect, Netflix
    
slide 2:
    Previously (SCaLE11x)
    Working Linux performance tools:
    
slide 3:
    This Talk (SCaLE14x)
    Broken Linux performance tools:
    Observability
    Benchmarking
    Objectives:
    – Bust assumptions about tools and metrics
    – Learn how to verify and find missing metrics
    – Avoid the common mistakes when benchmarking
    Note: Current software is discussed, which could be fixed in the future (by you!)
    
slide 4:
slide 5:
    OBSERVABILITY
    Load Averages
    top %CPU
    iowait
    vmstat
    Overhead
    strace
    Java Profilers
    Monitoring
    
slide 6:
    LOAD AVERAGES
    
slide 7:
    Load Averages (1, 5, 15 min)
    $ uptime
    22:08:07 up 9:05, 1 user, load average: 11.42, 11.87, 12.12
    • "load"
    – Usually CPU demand (run queue length/latency)
    – On Linux: CPU + uninterruptible I/O (e.g., disk)
    • "average"
    – Exponentially damped moving sum
    • "1, 5, and 15 minutes"
    – Constants used in the equation
    • Don't study these for longer than 10 seconds
    
slide 8:
    [Graph: load begins at t=0 (1 thread); at 1 min, the 1 min avg =~ 0.62]
slide 9:
    TOP %CPU
    
slide 10:
    top %CPU
    $ top - 20:15:55 up 19:12, 1 user, load average: 7.96, 8.59, 7.05
    Tasks: 470 total,   1 running, 468 sleeping,   0 stopped,   1 zombie
    %Cpu(s): 28.1 us,  0.4 sy,  0.0 ni, 71.2 id,  0.0 wa,  0.0 hi,  0.1 si,  0.1 st
    KiB Mem:  61663100 total, 61342588 used,   320512 free,     9544 buffers
    KiB Swap:        0 total,        0 used,        0 free.  3324696 cached Mem

      PID USER     …    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    11959 apiprod  … 81.731g 0.053t 14476 S 935.8 92.1  13568:22 java
    12595 snmp     …           3256   1392 S   3.6  0.0   2:37.23 snmp-pass
    10447 snmp     …           6028   1432 S   2.0  0.0   2:12.12 snmpd
    18463 apiprod  …           1972   1176 R   0.7  0.0   0:00.07 top
    […]
    • Who is consuming CPU?
    • And by how much?
    
slide 11:
    top: Missing %CPU
    • Short-lived processes can be missing entirely
    – Process creates and exits in-between sampling /proc.
    e.g., software builds.
    – Try atop(1), or sampling using perf(1)
    • Short-lived processes may vanish on screen updates
    – I often use pidstat(1) on Linux instead, for concise scroll back
    
slide 12:
    top: Misinterpreting %CPU
    • Different top(1)s use different calculations
    - On different OSes, check the man page, and run a test!
    • %CPU can mean:
    – A) Sum of per-CPU percents (0-Ncpu x 100%) consumed
    during the last interval
    – B) Percentage of total CPU capacity (0-100%) consumed
    during the last interval
    – C) (A) but historically damped (like load averages)
    – D) (B), but historically damped (like load averages)
    
slide 13:
    top: %Cpu vs %CPU
    $ top - 15:52:58 up 10 days, 21:58, 2 users, load average: 0.27, 0.53, 0.41
    Tasks: 180 total,   1 running, 179 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  1.2 us, 24.5 sy,  0.0 ni, 67.2 id,  0.2 wa,  0.0 hi,  6.6 si,  0.4 st
    KiB Mem:   2872448 total,  2778160 used,    94288 free,    31424 buffers
    KiB Swap:  4151292 total,       76 used,  4151216 free.  2411728 cached Mem

      PID USER  …  SHR S  %CPU %MEM    TIME+ COMMAND
    12678 root  …  912 S 100.4  0.0  0:23.52 iperf
    12675 root  …  904 S  88.8  0.0  0:20.83 iperf
      215 root  …    0 S   0.3  0.0  0:27.73 jbd2/sda1-8
    […]
    • This 4 CPU system is consuming:
    – 130% total CPU, via %Cpu(s)
    – 190% total CPU, via %CPU
    • Which one is right? Is either?
    – "A man with one watch knows the time; with two he's never sure"
    
slide 14:
    CPU Summary Statistics
    • %Cpu row is from /proc/stat
    • linux/Documentation/cpu-load.txt:
    In most cases the `/proc/stat' information reflects
    the reality quite closely, however due to the nature
    of how/when the kernel collects this data
    sometimes it can not be trusted at all.
    • /proc/stat is used by everything for CPU stats
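    As a cross-check of what such tools report, a minimal sketch (mine, not top's actual
    code) that derives a busy percentage from two samples of the aggregate cpu line in
    /proc/stat; it assumes the usual field order user, nice, system, idle, iowait, irq,
    softirq, steal:

        import time

        def cpu_ticks():
            with open('/proc/stat') as f:
                return [int(x) for x in f.readline().split()[1:]]  # aggregate "cpu" line

        a = cpu_ticks(); time.sleep(1); b = cpu_ticks()
        d = [y - x for x, y in zip(a, b)]        # ticks elapsed in each state
        busy = sum(d) - d[3] - d[4]              # exclude idle and iowait
        print('%.1f%% of total capacity busy' % (100.0 * busy / sum(d)))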
    
slide 15:
    %CPU
    
slide 16:
    What is %CPU anyway?
    • "Good" %CPU:
    – Retiring instructions (provided they aren't a spin loop)
    – High IPC (Instructions-Per-Cycle)
    • "Bad" %CPU:
    – Stall cycles waiting on resources, usually memory I/O
    – Low IPC
    – Buying faster processors may make little difference
    • %CPU alone is ambiguous
    – Would love top(1) to split %CPU into cycles retiring vs stalled
    – Although, it gets worse…
    
slide 17:
    CPU Speed Variation
    • Clock speed can vary thanks to:
    – Intel Turbo Boost: by hardware, based on power, temp, etc
    – Intel Speed Step: by software, controlled by the kernel
    • %CPU is still ambiguous, given IPC
    80% CPU (1.6 IPC)  may not  ==  4 x 20% CPU (1.6 IPC)
    • Need to know the clock speed as well
    – 80% CPU (@3000MHz) != 4 x 20% CPU (@1600MHz)
    • CPU counters nowadays have "reference cycles"
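    One way to disambiguate %CPU is to read the counters directly; for example, perf
    can report cycles, reference cycles, and instructions (and hence IPC) system-wide.
    Event availability varies by kernel and CPU, so treat this as a sketch:

        # perf stat -e cycles,ref-cycles,instructions -a -- sleep 10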
    
slide 18:
    Out-of-order Execution
    • CPUs execute uops out-of-order and in parallel across multiple functional units
    • %CPU doesn't account for how many units are active
    • Accounting each cycle as "stalled" or "retiring" is a simplification
    https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
      
    
slide 19:
    I/O WAIT
    
slide 20:
    I/O Wait
    $ mpstat -P ALL 1
    08:06:43 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
    08:06:44 PM  all  53.45    …
    […]
    • Suggests system is disk I/O bound, but often misleading
    • Comparing I/O wait between system A and B:
    - higher might be bad: slower disks, more blocking
    - lower might be bad: slower processor and architecture
    consumes more CPU, obscuring I/O wait
    • Can be very useful when understood: another idle state
    
slide 21:
    I/O Wait Venn Diagram
    [Venn diagram, per CPU: circles "CPU" and "Waiting for disk I/O"; regions labeled
    "CPU", "I/O Wait", and "Idle"]
    
slide 22:
    FREE MEMORY
    
slide 23:
    Free Memory
    $ free -m
                 total    used    free  shared  buffers  cached
    Mem:           …
    -/+ buffers/cache:
    Swap:            0
    • "free" is near-zero: I'm running out of memory!
    - No, it's in the file system cache, and is still free for apps to use
    • Linux free(1) explains it, but other tools, e.g. vmstat(1), don't
    • Some file systems (e.g., ZFS) may not be shown in the system's cached metrics at all
    www.linuxatemyram.com
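    On newer kernels (roughly 3.14 onward), /proc/meminfo also exposes a MemAvailable
    estimate that accounts for reclaimable cache; a quick way to compare it with the
    raw free and cached figures:

        $ grep -E 'MemFree|MemAvailable|^Cached' /proc/meminfo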
      
    
slide 24:
    VMSTAT
    
slide 25:
    vmstat(1)
    $ vmstat -Sm 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     8  0    …
     7  0    …
     8  0    …
     8  0    …
    […]
    • Linux: first line has some summary since boot values — confusing!
    • This system-wide summary is missing networking
    
slide 26:
    NETSTAT -S
    
slide 27:
    netstat -s
    $ netstat -s
    Ip:
    7962754 total packets received
    8 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    7962746 incoming packets delivered
    8019427 requests sent out
    Icmp:
    382 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    destination unreachable: 125
    timeout in transit: 257
    3410 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
    destination unreachable: 3410
    IcmpMsg:
    InType3: 125
    InType11: 257
    OutType3: 3410
    Tcp:
    17337 active connections openings
    395515 passive connection openings
    8953 failed connection attempts
    240214 connection resets received
    3 connections established
    7198375 segments received
    7504939 segments send out
    62696 segments retransmited
    10 bad segments received.
    1072 resets sent
    InCsumErrors: 5
    Udp:
    759925 packets received
    3412 packets to unknown port received.
    0 packet receive errors
    784370 packets sent
    UdpLite:
    TcpExt:
    858 invalid SYN cookies received
    8951 resets received for embryonic SYN_RECV sockets
    14 packets pruned from receive queue because of socket buffer overrun
    6177 TCP sockets finished time wait in fast timer
    293 packets rejects in established connections because of timestamp
    733028 delayed acks sent
    89 delayed acks further delayed because of locked socket
    Quick ack mode was activated 13214 times
    336520 packets directly queued to recvmsg prequeue.
    43964 packets directly received from backlog
    11406012 packets directly received from prequeue
    1039165 packets header predicted
    7066 packets header predicted and directly queued to user
    1428960 acknowledgments not containing data received
    1004791 predicted acknowledgments
    1 times recovered from packet loss due to fast retransmit
    5044 times recovered from packet loss due to SACK data
    2 bad SACKs received
    Detected reordering 4 times using SACK
    Detected reordering 11 times using time stamp
    13 congestion windows fully recovered
    11 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 39
    2384 congestion windows recovered after partial ack
    228 timeouts after SACK recovery
    100 timeouts in loss state
    5018 fast retransmits
    39 forward retransmits
    783 retransmits in slow start
    32455 other TCP timeouts
    TCPLossProbes: 30233
    TCPLossProbeRecovery: 19070
    992 sack retransmits failed
    18 times receiver scheduled too late for direct processing
    705 packets collapsed in receive queue due to low socket buffer
    13658 DSACKs sent for old packets
    8 DSACKs sent for out of order packets
    13595 DSACKs received
    33 DSACKs for out of order packets received
    32 connections reset due to unexpected data
    108 connections reset due to early user close
    1608 connections aborted due to timeout
    TCPSACKDiscard: 4
    TCPDSACKIgnoredOld: 1
    TCPDSACKIgnoredNoUndo: 8649
    TCPSpuriousRTOs: 445
    TCPSackShiftFallback: 8588
    TCPRcvCoalesce: 95854
    TCPOFOQueue: 24741
    TCPOFOMerge: 8
    TCPChallengeACK: 1441
    TCPSYNChallenge: 5
    TCPSpuriousRtxHostQueues: 1
    TCPAutoCorking: 4823
    IpExt:
    InOctets: 1561561375
    OutOctets: 1509416943
    InNoECTPkts: 8201572
    InECT1Pkts: 2
    InECT0Pkts: 3844
    InCEPkts: 306
    
slide 28:
    netstat -s
    • Many metrics on Linux (can be over 200)
    • Still doesn't include everything: getting better, but don't
    assume everything is there
    • Includes typos & inconsistencies
    • Might be more readable to:
    cat /proc/net/snmp /proc/net/netstat
    • Totals since boot can be misleading
    • On Linux, -s needs -c support
    • Often no documentation outside kernel source code
    • Requires expertise to comprehend
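    Since the totals are cumulative since boot, per-interval deltas are usually what
    you want. A minimal sketch, assuming the Tcp: header in /proc/net/snmp includes
    OutSegs and RetransSegs (it normally does):

        import time

        def tcp_counters():
            with open('/proc/net/snmp') as f:
                rows = [line.split() for line in f if line.startswith('Tcp:')]
            return dict(zip(rows[0][1:], map(int, rows[1][1:])))  # header -> values

        prev = tcp_counters()
        while True:
            time.sleep(1)
            cur = tcp_counters()
            print('OutSegs/s: %d  RetransSegs/s: %d' % (
                  cur['OutSegs'] - prev['OutSegs'],
                  cur['RetransSegs'] - prev['RetransSegs']))
            prev = cur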
    
slide 29:
    DISK METRICS
    
slide 30:
    Disk Metrics
    • All disk metrics are misleading
    • Disk %utilization / %busy
    – Logical devices (volume managers) and individual disks can
    process I/O in parallel, and may accept more I/O at 100%
    • Disk IOPS
    – High IOPS is "bad"? That depends…
    • Disk latency
    – Does it matter? File systems and volume managers try hard
    to hide latency and make it asynchronous
    – Better measuring latency via application->FS calls
    
slide 31:
    FS CACHE METRICS
    
slide 32:
    FS Cache Metrics
    • Size metrics exist: free -m
    • Activity metrics are missing: e.g., hit/miss ratio
    • Hacking stats using ftrace (/eBPF):
    # ./cachestat 1
    Counting cache functions... Output every 1 seconds.
        HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
    [only the RATIO column survived extraction: 19.5%, 23.9%, 25.2%, 21.1%, 34.3%]
    [...]
    
slide 33:
    What You Can Do
    • Verify and understand existing metrics
    – Even %CPU can be misleading
    – Cross check with another tool & backend
    – Test with known workloads
    – Read the source, including comments
    – Use "known to be good" metrics to sanity test others
    • Find missing metrics
    – Follow the USE Method, and other methodologies
    – Draw a functional diagram
    • Burn it all down and start again from scratch?
    
slide 34:
    PROFILERS
    
slide 35:
    Linux perf
    • Can sample stack traces and summarize output:
    # perf report -n --stdio
    […]
    # Overhead  Samples  Command  Shared Object      Symbol
    # ........  .......  .......  .................  .............................
      20.42%             bash     [kernel.kallsyms]  [k] xen_hypercall_xen_version
    --- xen_hypercall_xen_version
    check_events
    |--44.13%-- syscall_trace_enter
    tracesys
    |--35.58%-- __GI___libc_fcntl
    |--65.26%-- do_redirection_internal
    do_redirections
    execute_builtin_or_function
    execute_simple_command
    [… ~13,000 lines truncated …]
    
slide 36:
    Too Much Output
    
slide 37:
    … as a Flame Graph
    
slide 38:
    PROFILER VISIBILITY
    
slide 39:
    Java Profilers
    [Diagram: CPU flame graph — Java profiler view shows "Java (+object stats)";
    "Kernel, libraries, JVM" sit outside its view]
    
slide 40:
    Java Profilers
    • Typical problems:
    – Sampling at safepoints (skew)
    – Method tracing observer effect
    – RUNNING != on-CPU (e.g., epoll)
    – Missing GC or JVM CPU time
    • Inaccurate (skewed) and incomplete profiles
    • Let's try a system profiler?
    
slide 41:
    System Profilers with Java (x86)
    [Flame graph labels: Java (missing stacks & symbols), Kernel TCP/IP, JVM,
    compiler optimization #fail, Locks, Time, epoll, Idle thread]
    
slide 42:
    COMPILER OPTIMIZATIONS
    
slide 43:
    Broken System Stack Traces
    • Broken stacks (1 or 2 levels deep, junk values):
    # perf record -F 99 -a -g -- sleep 30; perf script
    […]
    java  4579 cpu-clock:
        ffffffff8172adff tracesys ([kernel.kallsyms])
            7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
    java  4579 cpu-clock:
            7f417908c10b [unknown] (/tmp/perf-4458.map)
    java  4579 cpu-clock:
            7f4179101c97 [unknown] (/tmp/perf-4458.map)
    • On x86 (x86_64), hotspot reuses the frame pointer register (RBP) as general
    purpose (a "compiler optimization"), which once upon a time made sense
    • gcc has -fno-omit-frame-pointer to avoid this
    – JDK8u60+ now has this as -XX:+PreserveFramePointer
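    For your own native code, the equivalent fix is to keep frame pointers so perf's
    default stack walker can follow them; for example with gcc (myapp.c is just a
    placeholder, and flags vary by toolchain):

        $ gcc -O2 -fno-omit-frame-pointer -g -o myapp myapp.c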
    
slide 44:
    Missing Symbols
    • Missing symbols may show up as hex; e.g., Linux perf:
    # perf script
    Failed to open /tmp/perf-8131.map, continuing without symbols
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    […]
    For applications, install debug symbol package
    For JIT'd code, Linux perf already looks for an
    externally provided symbol file: /tmp/perf-PID.map
    – Find a way to do this for your runtime
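    The /tmp/perf-PID.map format itself is simple: one "STARTADDR SIZE symbolname"
    line per JIT'd symbol, with hex addresses. A minimal sketch of emitting it from a
    hypothetical runtime (the symbol below is made up for illustration):

        import os

        # (address, size, name) tuples would come from your runtime's JIT
        symbols = [(0x7fd301861e00, 0x200, 'MyClass::myMethod')]

        with open('/tmp/perf-%d.map' % os.getpid(), 'w') as f:
            for addr, size, name in symbols:
                f.write('%x %x %s\n' % (addr, size, name))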
    
slide 45:
    INSTRUCTION PROFILING
    
slide 46:
    Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
    Percent |   Source code & Disassembly of noplooper
    ---------------------------------------------------
            :   Disassembly of section .text:
            :   00000000004004ed <…>:
       0.00 :     4004ed:  push %rbp
       0.00 :     4004ee:  mov  %rsp,%rbp
      20.86 :     4004f1:  nop
       0.00 :     4004f2:  nop
       0.00 :     4004f3:  nop
       0.00 :     4004f4:  nop
      19.84 :     4004f5:  nop
       0.00 :     4004f6:  nop
       0.00 :     4004f7:  nop
       0.00 :     4004f8:  nop
      18.73 :     4004f9:  nop
       0.00 :     4004fa:  nop
       0.00 :     4004fb:  nop
       0.00 :     4004fc:  nop
      19.08 :     4004fd:  nop
       0.00 :     4004fe:  nop
       0.00 :     4004ff:  nop
       0.00 :     400500:  nop
      21.49 :     400501:  jmp  4004f1 <…>
    • Often broken nowadays due to
    skid, out-of-order execution, and
    sampling the resumption instruction
    • Better with PEBS support
    
slide 47:
    What You Can Do
    • Do stack trace profiling
    – Get stack traces to work
    – Get symbols to work
    – This all may be a lot of work. It's worth it!
    • Make CPU flame graphs!
    
slide 48:
    OVERHEAD
    
slide 49:
    tcpdump
    $ tcpdump -i eth0 -w /tmp/out.tcpdump
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    ^C7985 packets captured
    8996 packets received by filter
    1010 packets dropped by kernel
    • Packet tracing doesn't scale. Overheads:
    – CPU cost of per-packet tracing (improved by [e]BPF)
    • Consider CPU budget per-packet at 10/40/100 GbE
    – Transfer to user-level (improved by ring buffers)
    – File system storage (more CPU, and disk I/O)
    – Possible additional network transfer
    • Can also drop packets when overloaded
    • You should only trace send/receive as a last resort
    – I solve problems by tracing lower frequency TCP events
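    To put the per-packet CPU budget in perspective, a rough worked example (assuming
    full-sized 1500-byte frames and a single 3 GHz core; real traffic, smaller packets,
    and NIC offloads change the numbers):

        10 Gbit/s / (1500 bytes x 8 bits)        =~ 833,000 packets/s
        3,000,000,000 cycles/s / 833,000 pkts/s  =~ 3,600 cycles per packet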
    
slide 50:
    STRACE
    
slide 51:
    strace
    • Before:
    $ dd if=/dev/zero of=/dev/null bs=1 count=500k
    […]
    512000 bytes (512 kB) copied, 0.103851 s, 4.9 MB/s
    • After:
    $ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
    […]
    512000 bytes (512 kB) copied, 45.9599 s, 11.1 kB/s
    • 442x slower. This is worst case.
    • strace(1) pauses the process twice for each syscall.
    This is like putting metering lights on your app.
    – "BUGS: A traced process runs slowly." – strace(1) man page
    
slide 52:
    PERF_EVENTS
    
slide 53:
    perf_events
    • Buffered tracing helps, but you can still trace too much:
    # perf record -e sched:sched_switch -a -g -- sleep 1
    [ perf record: Woken up 3 times to write data ]
    [ perf record: Captured and wrote 100.212 MB perf.data (486550 samples) ]
    • Overhead = event instrumentation cost X event frequency
    • Costs
    – Higher: event dumps (perf.data), stack traces, copyin/outs
    – Lower: counters, in-kernel aggregations (ftrace, eBPF)
    • Frequencies
    – Higher: instructions, scheduler, malloc/free, Java methods
    – Lower: process creation & destruction, disk I/O (usually)
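    A rough worked example of that overhead equation (the per-event cost is a
    hypothetical figure; measure your own):

        overhead = instrumentation cost x event frequency
                 = 1 us/event x 100,000 events/s
                 = 0.1 CPU-seconds per second (~10% of one CPU)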
    
slide 54:
    VALGRIND
    
slide 55:
    Valgrind
    • A suite of tools including an extensive leak detector
    "Your	
      program	
      will	
      run	
      much	
      slower	
      
    (eg.	
      20	
      to	
      30	
      Omes)	
      than	
      normal"	
      
    	
      
    –	
      h:p://valgrind.org/docs/manual/quick-­‐start.html	
      
    • To its credit it does warn the end user
    
slide 56:
    JAVA PROFILERS
    
slide 57:
    Java Profilers
    • Some Java profilers have two modes:
    – Sampling stacks: eg, at 100 Hertz
    – Tracing methods: instrumenting and timing every method
    • Method timing has been described as "highly accurate",
    despite slowing the target by up to 1000x!
    • For more about Java profiler issues, see Nitsan Wakart's
    QCon2015 talk "Profilers are Lying Hobbitses"
    
slide 58:
    What You Can Do
    • Understand how the profiler works
    – Measure overhead
    – Know the frequency of instrumented events
    • Use in-kernel summaries (ftrace, eBPF)
    – > 100,000 events/sec, overhead may start to be measurable
    
slide 59:
    MONITORING
    
slide 60:
    Monitoring
    • By now you should recognize these pathologies:
    Let's just graph the system metrics!
    • That's not the problem that needs solving
    Let's just trace everything and post process!
    • Now you have one million problems per second
    • Monitoring adds additional problems:
    – Let's have a cloud-wide dashboard update per-second!
    • From every instance? Packet overheads?
    – Now we have billions of metrics!
    
slide 61:
    STATISTICS
    "Then	
      there	
      is	
      the	
      man	
      
    who	
      drowned	
      crossing	
      
    a	
      stream	
      with	
      an	
      
    average	
      depth	
      
    of	
      six	
      inches."	
      	
      	
      
    –	
      	
      
    W.I.E.	
      Gates	
      
    
slide 62:
    Statistics
    • Averages can be misleading
    – Hide latency outliers
    – Per-minute averages can hide multi-second issues
    • Percentiles can be misleading
    – Probability of hitting 99.9th latency may be more than 1/1000
    after many dependency requests
    • Show the distribution:
    – Summarize: histogram, density plot, frequency trail
    – Over-time: scatter plot, heat map
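    A worked example for the percentile point above, assuming independent dependency
    requests that each face the same latency distribution:

        P(at least one of N requests exceeds the 99.9th percentile) = 1 - 0.999^N
        N = 1:    0.1%
        N = 100:  1 - 0.999^100 =~ 9.5%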
    
slide 63:
    Average Latency
    • When the index of central tendency isn't…
    
slide 64:
    VISUALIZATIONS
    
slide 65:
    Traffic Lights
    RED == bad, GREEN == good
    …misleading for subjective metrics
    Better suited for objective metrics
    
slide 66:
    Tachometers
    …especially with arbitrary color highlighting
    
slide 67:
    Pie Charts
    [Pie chart slices: usr, sys, wait, idle]
    …for real-time metrics
    
slide 68:
    What You Can Do
    • Monitoring:
    – Verify metrics, test overhead (same as tools)
    • Statistics:
    – Ask how is this calculated?
    – Study the full distribution
    • Visualizations:
    – Use histograms, heat maps, flame graphs
    
slide 69:
    BENCHMARKING
    Benchmarks
    Macro
    Common Mistakes
    Kitchen-Sink
    bonnie++
    Micro
    Apache Bench
    
slide 70:
    BENCHMARKS
    
slide 71:
    ~100% of Benchmarks are Wrong
    • "Most popular benchmarks are flawed"
    – Traeger, A., E. Zadok, N. Joukov, and C. Wright. "A Nine Year Study of File
    System and Storage Benchmarking," ACM Transactions on Storage, 2008.
    • All alternates can also be flawed
    
slide 72:
    COMMON MISTAKES
    
slide 73:
    Common Mistakes
    1. Testing the wrong target
    – eg, FS cache instead of disk; misconfiguration
    2. Choosing the wrong target
    – eg, disk instead of FS cache … doesn’t resemble real world
    3. Invalid results
    – benchmark software bugs
    4. Ignoring errors
    – error path may be fast!
    5. Ignoring variance or perturbations
    – real workload isn't steady/consistent, which matters
    6. Misleading results
    – Casual benchmarking: you benchmark A, but actually
    measure B, and conclude you measured C
    
slide 74:
    MICRO BENCHMARKS
    
slide 75:
    Micro Benchmarks
    • Test a specific function in isolation. e.g.:
    – File system maximum cached read ops/sec
    – Network maximum throughput
    • Examples of bad microbenchmarks:
    – getpid() in a tight loop
    – speed of /dev/zero and /dev/null
    • Common problems:
    – Testing a workload that is not very relevant
    – Missing other workloads that are relevant
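    For illustration, a deliberately narrow micro benchmark of the kind warned about
    above; it mostly measures getpid() call overhead on one libc/kernel combination,
    not anything resembling a real workload:

        import os, time

        n = 1_000_000
        start = time.time()
        for _ in range(n):
            os.getpid()
        print('%.0f getpid() calls/sec' % (n / (time.time() - start)))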
    
slide 76:
    MACRO BENCHMARKS
    
slide 77:
    Macro Benchmarks
    • Simulate application user load. e.g.:
    – Simulated web client transaction
    • Common problems:
    – Misplaced trust: believed to be realistic, but misses variance,
    errors, perturbations, etc.
    – Complex to debug, verify, and root cause
    
slide 78:
    KITCHEN SINK BENCHMARKS
    
slide 79:
    Kitchen Sink Benchmarks
    • Run everything!
    – Mostly random benchmarks found on the Internet, where
    most are broken or irrelevant
    – Developers focus on collecting more benchmarks than
    verifying or fixing the existing ones
    • Myth that more benchmarks == greater accuracy
    – No, use active benchmarking (analysis)
    
slide 80:
    BONNIE++
    
slide 81:
    bonnie++
    • "simple tests of hard drive and file system performance"
    • First metric printed: per character sequential output
    • What I found it actually tested:
    – 1 byte writes to libc (via putc())
    – 4 Kbyte writes from libc -> FS (depends on OS; see setbuffer())
    – 128 Kbyte async writes to disk (depends on storage stack)
    – Any file system throttles that may be present (eg, ionice)
    – C++ code, to some extent (bonnie++ 10% slower than Bonnie)
    • Actual limiter:
    – Single threaded write_block_putc() and putc() calls
    • Now thankfully fixed
    
slide 82:
    APACHE BENCH
    
slide 83:
    Apache Bench
    • HTTP web server benchmark
    • Single thread limited (use wrk for multi-threaded)
    • Keep-alive option (-k):
    – without: Can become an unrealistic TCP session benchmark
    – with: Can become an unrealistic server throughput test
    • Performance issues of ab's own code
    
slide 84:
    UNIXBENCH
    
slide 85:
    UnixBench
    • The original kitchen-sink micro benchmark from 1984,
    published in BYTE magazine
    • Results summarized as "The BYTE Index". Including:
    system:
    dhry2reg          Dhrystone 2 using register variables
    whetstone-double  Double-Precision Whetstone
    syscall           System Call Overhead
    pipe              Pipe Throughput
    context1          Pipe-based Context Switching
    spawn             Process Creation
    execl             Execl Throughput
    fstime-w          File Write 1024 bufsize 2000 maxblocks
    fstime-r          File Read 1024 bufsize 2000 maxblocks
    fstime            File Copy 1024 bufsize 2000 maxblocks
    fsbuffer-w        File Write 256 bufsize 500 maxblocks
    fsbuffer-r        File Read 256 bufsize 500 maxblocks
    fsbuffer          File Copy 256 bufsize 500 maxblocks
    fsdisk-w          File Write 4096 bufsize 8000 maxblocks
    […]
    • Many problems, starting with…
    
slide 86:
    UnixBench Makefile
    • Default (by ./Run) for Linux. Would you edit it? Then what?
    • I "fixed" it and "improved" Dhrystone 2 performance by 64%
    ## Very generic
    #OPTON = -O
    ## For Linux 486/Pentium, GCC 2.7.x and 2.8.x
    #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
    # -m486 -malign-loops=2 -malign-jumps=2 -malign-functions=2
    ## For Linux, GCC previous to 2.7.0
    #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math -m486
    #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
    # -m386 -malign-loops=1 -malign-jumps=1 -malign-functions=1
    ## For Solaris 2, or general-purpose GCC 2.7.x
    OPTON = -O2 -fomit-frame-pointer -fforce-addr -ffast-math -Wall
    ## For Digital Unix v4.x, with DEC cc v5.x
    #OPTON = -O4
    #CFLAGS = -DTIME -std1 -verbose -w0
    
slide 87:
    UnixBench Documentation
    "The results will depend not only on your
    hardware, but on your operating system,
    libraries, and even compiler."
    "So you may want to make sure that all your
    test systems are running the same version of
    the OS; or at least publish the OS and
    compiler versions with your results."
    … UnixBench was innovative & useful, but its time has passed
    
slide 88:
    What You Can Do
    • Match the benchmark to your workload
    • Active Benchmarking
    1. Configure the benchmark to run in steady state, 24x7
    2. Do root-cause analysis of benchmark performance
    3. Answer: why X and not 10X? Limiting factor?
    It can take 1-2 weeks to debug a single benchmark
    
slide 89:
    Summary
    
slide 90:
    Observe Everything
    • Trust nothing. Verify. Write small tests.
    • Pose Q's first then find the metrics. e.g., functional diagrams:
    Reference: http://www.brendangregg.com/linuxperf.html
    
slide 91:
    Profile Everything
    • e.g., Java Mixed-Mode Flame Graphs:
    Kernel
    Java
    JVM
    Reference: http://www.brendangregg.com/linuxperf.html
    
slide 92:
    Visualize Everything
    • Full distributions of latency. e.g., heat maps:
    Reference: http://queue.acm.org/detail.cfm?id=1809426
    
slide 93:
    Benchmark Nothing!
    (if you must, use Active Benchmarking)
    
slide 94:
    Links & References
    Things that aren't broken:
    http://www.brendangregg.com/linuxperf.html
    References:
    https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
    http://www.linuxatemyram.com/
    Traeger, A., E. Zadok, N. Joukov, and C. Wright. “A Nine Year Study of File System
    and Storage Benchmarking,” ACM Transactions on Storage, 2008.
    http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
    http://www.brendangregg.com/activebenchmarking.html
    https://blogs.oracle.com/roch/entry/decoding_bonnie
    http://www.brendangregg.com/blog/2014-05-02/compilers-love-messing-with-benchmarks.html
    https://code.google.com/p/byte-unixbench/
    https://qconsf.com/sf2015/presentation/how-not-measure-latency
    https://qconsf.com/system/files/presentation-slides/profilers_are_lying_hobbitses.pdf
    Caution signs drawn by me, inspired by real-world signs
    
slide 95:
    Jan 2016
      
    Thanks
    Questions?
    http://techblog.netflix.com
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg