FISL13 (2012): Performance Analysis, The USE Method

Delivered at the FISL13 conference in Brazil, by Brendan Gregg.

Video: http://www.youtube.com/watch?v=K9w2cipqfvc

Description: "This talk introduces the USE Method: a simple strategy for performing a complete check of system performance health, identifying common bottlenecks and errors. This methodology can be used early in a performance investigation to quickly identify the most severe system performance issues, and is a methodology the speaker has used successfully for years in both enterprise and cloud computing environments. Checklists have been developed to show how the USE Method can be applied to Solaris/illumos-based and Linux-based systems.

Many hardware and software resource types have been commonly overlooked, including memory and I/O busses, CPU interconnects, and kernel locks. Any of these can become a system bottleneck. The USE Method provides a way to find and identify these.

This approach focuses on the questions to ask of the system, before reaching for the tools. Tools that are ultimately used include all the standard performance tools (vmstat, iostat, top), and more advanced tools, including dynamic tracing (DTrace), and hardware performance counters.

Other performance methodologies are included for comparison: the Problem Statement Method, Workload Characterization Method, and Drill-Down Analysis Method."


PDF: FISL13_USE_Method.pdf

Keywords (from pdftotext):

slide 1:
    Performance Analysis:
    The USE Method
    Brendan Gregg
    Lead Performance Engineer, Joyent
    brendan.gregg@joyent.com
    Saturday, July 28, 2012
    
slide 2:
    whoami
    • I work at the top of the performance support chain
    • I also write open source performance tools
    out of necessity to solve issues
    • http://github.com/brendangregg
    • http://www.brendangregg.com/#software
    • And books (DTrace, Solaris Performance and Tools)
    • Was Brendan @ Sun Microsystems, Oracle,
    now Joyent
    
slide 3:
    Joyent
    • Cloud computing provider
    • Cloud computing software
    • SmartOS
    • host OS, and guest via OS virtualization
    • Linux, Windows
    • guest via KVM
    
slide 4:
    Agenda
    • Example Problem
    • Performance Methodology
    • Problem Statement
    • The USE Method
    • Workload Characterization
    • Drill-Down Analysis
    • Specific Tools
    
slide 5:
    Example Problem
    • Recent cloud-based performance issue
    • Customer problem statement:
    • “Database response time sometimes take multiple
    seconds. Is the network dropping packets?”
    • Tested network using traceroute, which showed some
    packet drops
    
slide 6:
    Example: Support Path
    • Performance Analysis
    [diagram: support path, bottom to top — Customer Issues → 1st Level → 2nd Level → Top]
    
slide 7:
    Example: Support Path
    • Performance Analysis
    [diagram: same support path, annotated — Customer: “network drops?”; 1st Level: “ran traceroute, can’t reproduce”; 2nd Level: “network looks ok, CPU also ok”; Top: “my turn”]
    
slide 8:
    Example: Network Drops
    • Old fashioned: network packet capture (sniffing)
    • Performance overhead during capture (CPU, storage)
    and post-processing (wireshark)
    • Time consuming to analyze: not real-time
    
slide 9:
    Example: Network Drops
    • New: dynamic tracing
    • Efficient: only drop/retransmit paths traced
    • Context: kernel state readable
    • Real-time: analysis and summaries
    # ./tcplistendrop.d
    TIME                  SRC-IP   PORT       DST-IP            PORT
    2012 Jan 19 01:22:49  [...]    25691  ->  192.192.240.212   [...]
    2012 Jan 19 01:22:49  [...]    18423  ->  192.192.240.212   [...]
    2012 Jan 19 01:22:49  [...]    38883  ->  192.192.240.212   [...]
    2012 Jan 19 01:22:49  [...]    10739  ->  192.192.240.212   [...]
    2012 Jan 19 01:22:49  [...]    27988  ->  192.192.240.212   [...]
    2012 Jan 19 01:22:49  [...]    28824  ->  192.192.240.212   [...]
    2012 Jan 19 01:22:49  [...]    65070  ->  192.192.240.212   [...]
    [...]
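
For readers on Linux without DTrace, a rough analogue of the counter behind this output can be watched from /proc/net/netstat (TcpExt “ListenDrops”). This is not the tcplistendrop.d script above — it reports per-interval drop counts, not per-packet addresses — and the script name is hypothetical; a minimal Python sketch:

    #!/usr/bin/env python3
    # listendrops.py (hypothetical name): watch Linux TcpExt "ListenDrops"
    # from /proc/net/netstat, printing the per-second increase when non-zero.
    import time

    def listen_drops():
        with open("/proc/net/netstat") as f:
            lines = f.readlines()
        # The file is pairs of lines: "TcpExt: <names>" then "TcpExt: <values>".
        for i in range(0, len(lines) - 1, 2):
            if lines[i].startswith("TcpExt:"):
                keys = lines[i].split()[1:]
                vals = [int(x) for x in lines[i + 1].split()[1:]]
                return dict(zip(keys, vals)).get("ListenDrops", 0)
        return 0

    prev = listen_drops()
    while True:
        time.sleep(1)
        cur = listen_drops()
        if cur != prev:
            print(time.strftime("%Y %b %d %H:%M:%S"), "listen drops:", cur - prev)
        prev = cur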
    
slide 10:
    Example: Methodology
    • Instead of network drop analysis, I began with the
    USE method to check system health
    
slide 11:
    Example: Methodology
    • Instead of network drop analysis, I began with the
    USE method to check system health
    • In 
slide 12:
    Example: Other Methodologies
    • Customer was surprised (are you sure?). I used
    latency analysis to confirm. Details (if interesting):
    • memory: using both microstate accounting and
    dynamic tracing to confirm that anonymous page-ins
    were hurting the database; worst case app thread
    spent 97% of time waiting on disk (data faults).
    • disk: using dynamic tracing to confirm latency at the
    application / file system interface; included up to
    1000ms fsync() calls.
    • Different methodology, smaller audience (expertise),
    more time (1 hour).
    
slide 13:
    Example: Summary
    • What happened:
    • customer, 1st and 2nd level support spent much time
    chasing network packet drops.
    • What could have happened:
    • customer or 1st level follows the USE method and
    quickly discovers memory and disk issues
    • memory: fixable by customer reconfig
    • disk: could go back to 1st or 2nd level support for confirmation
    • Faster resolution, frees time
    
slide 14:
    Performance Methodology
    • Not a tool
    • Not a product
    • Is a procedure (documentation)
    
slide 15:
    Performance Methodology
    • Not a tool -> but tools can be written to help
    • Not a product -> could be in monitoring solutions
    • Is a procedure (documentation)
    
slide 16:
    Why Now: past
    • Performance analysis circa ‘90s, metric-orientated:
    • Vendor creates metrics and performance tools
    • Users develop methods to interpret metrics
    • Common method: “Tools Method”
    • List available performance tools
    • For each tool, list useful metrics
    • For each metric, determine interpretation
    • Problematic: vendors often don’t provide the best
    metrics; can be blind to issue types
    
slide 17:
    Why Now: changes
    • Open Source
    • Dynamic Tracing
    • See anything, not just what the vendor gave you
    • Only practical on open source software
    • Hardest part is knowing what questions to ask
    
slide 18:
    Why Now: present
    • Performance analysis now (post dynamic tracing),
    question-orientated:
    • Users pose questions
    • Check if vendor has provided metrics
    • Develop custom metrics using dynamic tracing
    • Methodologies pose the questions
    • What would previously be an academic exercise is
    now practical
    
slide 19:
    Methodology Audience
    • Beginners: provides a starting point
    • Experts: provides a checklist/reminder
    
slide 20:
    Performance Methodologies
    • Suggested order of execution:
    1. Problem Statement
    2. The USE Method
    3. Workload Characterization
    4. Drill-Down Analysis (Latency)
    
slide 21:
    Problem Statement
    • Typical support procedure (1st Methodology):
    1. What makes you think there is a problem?
    2. Has this system ever performed well?
    3. What changed? Software? Hardware? Load?
    4. Can the performance degradation be expressed in terms of latency or run time?
    5. Does the problem affect other people or applications?
    6. What is the environment? What software and hardware is used? Versions? Configuration?
    
slide 22:
    The USE Method
    • Quick System Health Check (2nd Methodology):
    • For every resource, check:
    • Utilization
    • Saturation
    • Errors
    
slide 23:
    The USE Method
    • Quick System Health Check (2nd Methodology):
    • For every resource, check:
    • Utilization: time resource was busy, or degree used
    • Saturation: degree of queued extra work
    • Errors: any errors
    [diagram: a resource annotated with Utilization, Saturation, and Errors]
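
To make the three checks concrete, here is a minimal Python sketch of a USE-style CPU check, assuming a Linux /proc filesystem (the talk's own examples use Solaris/illumos tools and DTrace); the script name is hypothetical, utilization comes from /proc/stat deltas and saturation from the kernel's runnable-thread count:

    #!/usr/bin/env python3
    # use_cpu.py (hypothetical name): one-shot USE-style CPU check on Linux.
    import os, time

    def cpu_times():
        # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    def runnable():
        # /proc/stat "procs_running": threads currently running or runnable.
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("procs_running"):
                    return int(line.split()[1])
        return 0

    t1 = cpu_times()
    time.sleep(1)
    t2 = cpu_times()
    delta = [b - a for a, b in zip(t1, t2)]
    busy = sum(delta) - delta[3] - delta[4]          # subtract idle and iowait
    ncpu = os.cpu_count() or 1

    print("Utilization: %.1f%% busy" % (100.0 * busy / (sum(delta) or 1)))
    print("Saturation:  %d runnable threads vs %d CPUs" % (runnable(), ncpu))
    # Errors: no generic /proc counter here; use CPC/perf or fault tools (later slides).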
    
slide 24:
    The USE Method: Hardware
    Resources
    • CPUs
    • Main Memory
    • Network Interfaces
    • Storage Devices
    • Controllers
    • Interconnects
    
slide 25:
    The USE Method: Hardware
    Resources
    • A great way to determine resources is to find (or
    draw) the server functional diagram
    • The hardware team at vendors should have these
    • Analyze every component in the data path
    
slide 26:
    The USE Method: Functional
    Diagrams, Generic Example
    [functional diagram (generic example): DRAM, memory buses, CPUs, CPU interconnect, I/O bus, I/O bridge, I/O controller, expander interconnect, interface transports, disks, network controller, ports]
    
slide 27:
    The USE Method: Resource
    Types
    • There are two different resource types, each of which defines utilization differently:
    • I/O Resource: eg, network interface
    • utilization: time resource was busy.
    current IOPS / max or current throughput / max
    can be used in some cases
    • Capacity Resource: eg, main memory
    • utilization: space consumed
    • Storage devices act as both resource types
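
A small sketch of the two utilization definitions, with hypothetical numbers:

    # Two utilization definitions (all numbers hypothetical):

    # I/O resource (eg, network interface): utilization = busy time,
    # or current throughput / max in some cases.
    current_tput = 95e6                  # bytes/s measured over an interval
    max_tput = 125e6                     # 1 Gbit/s link ~= 125 MB/s
    io_util = current_tput / max_tput    # 0.76 -> 76%

    # Capacity resource (eg, main memory): utilization = space consumed.
    used_bytes, total_bytes = 6.5e9, 8e9
    capacity_util = used_bytes / total_bytes     # ~0.81 -> 81%

    print("I/O util: %.0f%%, capacity util: %.0f%%" % (io_util * 100, capacity_util * 100))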
    
slide 28:
    The USE Method: Software
    Resources
    • Mutex Locks
    • Thread Pools
    • Process/Thread Capacity
    • File Descriptor Capacity
    
slide 29:
    The USE Method: Flow Diagram
    [flow diagram: Choose Resource → Errors Present? → High Utilization? → Saturation? → Problem Identified]
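
The flow can also be read as plain code. A minimal Python sketch, with hypothetical inputs and a placeholder 70% threshold (see the interpretation slides that follow):

    # The USE flow for one resource, as a plain function (illustrative only).
    def use_check(resource, errors, utilization, saturation, util_threshold=0.70):
        findings = []
        if errors > 0:                        # Errors present?
            findings.append("%s: %d errors" % (resource, errors))
        if utilization >= util_threshold:     # High utilization?
            findings.append("%s: %.0f%% utilized" % (resource, utilization * 100))
        if saturation > 0:                    # Saturation?
            findings.append("%s: saturation %s" % (resource, saturation))
        return findings                       # non-empty -> problem identified

    # Hypothetical measurements for one device:
    print(use_check("disk sd0", errors=0, utilization=0.92, saturation=4))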
    
slide 30:
    The USE Method: Interpretation
    • Utilization
    • 100% usually a bottleneck
    • 70%+ often a bottleneck for I/O resources, especially
    when high priority work cannot easily interrupt lower
    priority work (eg, disks)
    • Beware of time intervals. 60% utilized over 5 minutes
    may mean 100% utilized for 3 minutes then idle
    • Best examined per-device (unbalanced workloads)
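
The time-interval caveat as a worked example (numbers hypothetical):

    # A 5-minute interval reported as 60% utilized...
    interval_s, reported_util = 300, 0.60
    busy_s = interval_s * reported_util      # 180 seconds of busy time
    # ...is consistent with 180 s at 100% (with queueing) followed by 120 s idle.
    print("%.0f s busy out of %d s" % (busy_s, interval_s))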
    
slide 31:
    The USE Method: Interpretation
    • Saturation
    • Any non-zero value adds latency
    • Errors
    • Should be obvious
    
slide 32:
    The USE Method: Easy
    Combinations
    Resource             Type          Metric
    CPU                  utilization
    CPU                  saturation
    Memory               utilization
    Memory               saturation
    Network Interface    utilization
    Storage Device I/O   utilization
    Storage Device I/O   saturation
    Storage Device I/O   errors
    
slide 33:
    The USE Method: Easy
    Combinations
    Resource             Type          Metric
    CPU                  utilization   CPU utilization
    CPU                  saturation    run-queue length
    Memory               utilization   available memory
    Memory               saturation    paging or swapping
    Network Interface    utilization   RX/TX tput/bandwidth
    Storage Device I/O   utilization   device busy percent
    Storage Device I/O   saturation    wait queue length
    Storage Device I/O   errors        device errors
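
As one concrete instance of the memory rows, here is a minimal Python sketch that reads utilization and saturation on Linux (assumes /proc/meminfo with MemAvailable and the pswpin/pswpout counters in /proc/vmstat; script name hypothetical). The illumos/Linux checklists near the end of the deck list the standard tools for the same metrics:

    #!/usr/bin/env python3
    # mem_use.py (hypothetical name): memory utilization and saturation on Linux.
    import time

    def meminfo():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, val = line.split(":")
                info[key] = int(val.split()[0])      # values are in kB
        return info

    def swap_counters():
        wanted = ("pswpin", "pswpout")
        counters = {}
        with open("/proc/vmstat") as f:
            for line in f:
                k, v = line.split()
                if k in wanted:
                    counters[k] = int(v)
        return counters

    m = meminfo()
    avail = m.get("MemAvailable", m["MemFree"])
    util = 100.0 * (m["MemTotal"] - avail) / m["MemTotal"]

    before = swap_counters()
    time.sleep(1)
    after = swap_counters()
    swapped = sum(after[k] - before.get(k, 0) for k in after)

    print("Memory utilization: %.1f%% (MemTotal - MemAvailable)" % util)
    print("Memory saturation:  %d pages swapped in/out in the last second" % swapped)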
    
slide 34:
    The USE Method: Harder
    Combinations
    Resource             Type          Metric
    CPU                  errors
    Network              saturation
    Storage Controller   utilization
    CPU Interconnect     utilization
    Mem. Interconnect    saturation
    I/O Interconnect     saturation
    
slide 35:
    The USE Method: Harder
    Combinations
    Resource             Type          Metric
    CPU                  errors        eg, correctable CPU cache ECC events
    Network              saturation    “nocanputs”, buffering
    Storage Controller   utilization   active vs max controller IOPS and tput
    CPU Interconnect     utilization   per port tput / max bandwidth
    Mem. Interconnect    saturation    memory stall cycles
    I/O Interconnect     saturation    bus throughput / max bandwidth
    
slide 36:
    The USE Method: tools
    • To be thorough, you will need to use:
    • CPU performance counters
    • For bus and interconnect activity; eg, perf events, cpustat
    • Dynamic Tracing
    • For missing saturation and error metrics; eg, DTrace
    • Both can get tricky; tools can be developed to help
    • Please, no more top variants! ... unless it is
    interconnect-top or bus-top
    • I’ve written dozens of open source tools for both CPC
    and DTrace; much more can be done
    
slide 37:
    Workload Characterization
    • May use as a 3rd Methodology
    • Characterize workload by:
    • who is causing the load? PID, UID, IP addr, ...
    • why is the load called? code path
    • what is the load? IOPS, tput, type
    • how is the load changing over time?
    • Best performance wins are from eliminating
    unnecessary work
    • Identifies class of issues that are load-based, not
    architecture-based
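
For the “who is causing the load?” question, a minimal Linux sketch that ranks processes by cumulative bytes read plus written, from /proc/PID/io (script name hypothetical; reading other users’ /proc/PID/io may require privileges, and for a rate you would sample twice and diff):

    #!/usr/bin/env python3
    # whoio.py (hypothetical name): rank processes by cumulative read+write bytes.
    import glob

    def io_bytes(pid_dir):
        try:
            stats = {}
            with open(pid_dir + "/io") as f:
                for line in f:
                    k, v = line.split(":")
                    stats[k] = int(v)
            with open(pid_dir + "/comm") as f:
                name = f.read().strip()
        except (OSError, ValueError):
            return None                  # process exited, or no permission
        return name, stats.get("read_bytes", 0) + stats.get("write_bytes", 0)

    rows = []
    for d in glob.glob("/proc/[0-9]*"):
        r = io_bytes(d)
        if r:
            rows.append((r[1], d.split("/")[-1], r[0]))

    # Cumulative since process start; sample twice and diff for a rate.
    for total, pid, name in sorted(rows, reverse=True)[:10]:
        print("%15d bytes  PID %-7s %s" % (total, pid, name))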
    
slide 38:
    Drill-Down Analysis
    • May use as a 4th Methodology
    • Peel away software layers to drill down on the issue
    • Eg, software stack I/O latency analysis:
    Application
    System Call Interface
    File System
    Block Device Interface
    Storage Device Drivers
    Storage Devices
    
slide 39:
    Drill-Down Analysis:
    Open Source
    • With Dynamic Tracing, all function entry & return
    points can be traced, with nanosecond timestamps.
    • One strategy is to measure latency pairs, to search for the source; eg, A->B & C->D:
    static int
    arc_cksum_equal(arc_buf_t *buf)
    {
            zio_cksum_t zc;
            int equal;

            mutex_enter(&buf->b_hdr->b_freeze_lock);
            fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);    /* C ... D */
            equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
            mutex_exit(&buf->b_hdr->b_freeze_lock);

            return (equal);
    }
    
slide 40:
    Other Methodologies
    • Method R
    • A latency-based analysis approach for Oracle
    databases. See “Optimizing Oracle Performance" by
    Cary Millsap and Jeff Holt (2003)
    • Experimental approaches
    • Can be very useful: eg, validating network throughput
    using iperf
    
slide 41:
    Specific Tools for the USE
    Method
    
slide 42:
    illumos-based
    • http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist/
    Resource / Type: Metric
    CPU / Utilization: per-cpu: mpstat 1, “idl”; system-wide: vmstat 1, “id”; per-process: prstat -c 1 (“CPU” == recent), prstat -mLc 1 (“USR” + “SYS”); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
    CPU / Saturation: system-wide: uptime, load averages; vmstat 1, “r”; DTrace dispqlen.d (DTT) for a better “vmstat r”; per-process: prstat -mLc 1, “LAT”
    CPU / Errors: fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
    Memory / Saturation: system-wide: vmstat 1, “sr” (bad now), “w” (was very bad); vmstat -p 1, “api” (anon page-ins == pain), “apo”; per-process: prstat -mLc 1, “DFL”; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
    • ... etc for all combinations (would span a dozen slides)
    
slide 43:
    Linux-based
    • http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
    Resource / Type: Metric
    CPU / Utilization: per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”; system-wide: vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”; per-process: top, “%CPU”; htop, “CPU%”; ps -o pcpu; pidstat 1, “%CPU”; per-kernel-thread: top/htop (“K” to toggle), where VIRT == 0 (heuristic). [1]
    CPU / Saturation: system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” > CPU count; dstat -p, “run” > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows “Average” and “Maximum” delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp “queued(us)” [3]
    CPU / Errors: perf (LPE) if processor specific error events (CPC) are available; eg, AMD64’s “04Ah Single-bit ECC Errors Recorded by Scrubber” [4]
    • ... etc for all combinations (would span a dozen slides)
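
The per-process saturation metric in that table (/proc/PID/schedstat, 2nd field) is easy to read directly; a minimal Python sketch (script name hypothetical):

    #!/usr/bin/env python3
    # runqdelay.py (hypothetical name): per-process run-queue delay growth over 1 s,
    # from the 2nd field of /proc/PID/schedstat (sched_info.run_delay, in nanoseconds).
    import sys, time

    def run_delay_ns(pid):
        with open("/proc/%d/schedstat" % pid) as f:
            return int(f.read().split()[1])

    pid = int(sys.argv[1])
    a = run_delay_ns(pid)
    time.sleep(1)
    b = run_delay_ns(pid)
    print("PID %d waited %.3f ms for CPU in the last second" % (pid, (b - a) / 1e6))

Usage: run it with a PID argument; a steadily growing delay while the CPUs are busy is CPU saturation as experienced by that process.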
    
slide 44:
    Products
    • Earlier I said methodologies could be supported by
    monitoring solutions
    • At Joyent we develop Cloud Analytics:
    
slide 45:
    Future
    • Methodologies for advanced performance issues
    • I recently worked a complex KVM bandwidth issue where
    no current methodologies really worked
    • Innovative methods based on open source +
    dynamic tracing
    • Less performance mystery. Less guesswork.
    • Better use of resources (price/performance)
    • Easier for beginners to get started
    
slide 46:
    Thank you
    • Resources:
    • http://dtrace.org/blogs/brendan
    • http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/
    • http://dtrace.org/blogs/brendan/tag/usemethod/
    • http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/ - ideas if you are a monitoring solution developer
    • brendan@joyent.com