FISL 2013: Performance Analysis, The USE Method
Delivered at the FISL13 conference in Brazil by Brendan Gregg. Video: http://www.youtube.com/watch?v=K9w2cipqfvc
Description: "This talk introduces the USE Method: a simple strategy for performing a complete check of system performance health, identifying common bottlenecks and errors. This methodology can be used early in a performance investigation to quickly identify the most severe system performance issues, and is a methodology the speaker has used successfully for years in both enterprise and cloud computing environments. Checklists have been developed to show how the USE Method can be applied to Solaris/illumos-based and Linux-based systems.
Many hardware and software resource types are commonly overlooked, including memory and I/O buses, CPU interconnects, and kernel locks. Any of these can become a system bottleneck. The USE Method provides a way to find them.
This approach focuses on the questions to ask of the system, before reaching for the tools. Tools that are ultimately used include the standard performance tools (vmstat, iostat, top) and more advanced tools, including dynamic tracing (DTrace) and hardware performance counters.
Other performance methodologies are included for comparison: the Problem Statement Method, Workload Characterization Method, and Drill-Down Analysis Method."
PDF: FISL13_USE_Method.pdf
Keywords (from pdftotext):
slide 1:
Performance Analysis: The USE Method
Brendan Gregg
Lead Performance Engineer, Joyent
brendan.gregg@joyent.com
slide 2:
whoami
• I work at the top of the performance support chain
• I also write open source performance tools out of necessity to solve issues
  • http://github.com/brendangregg
  • http://www.brendangregg.com/#software
• And books (DTrace, Solaris Performance and Tools)
• Was Brendan @ Sun Microsystems, Oracle, now Joyent
slide 3:
Joyent
• Cloud computing provider
• Cloud computing software
• SmartOS
  • host OS, and guest via OS virtualization
• Linux, Windows
  • guest via KVM
slide 4:
Agenda
• Example Problem
• Performance Methodology
• Problem Statement
• The USE Method
• Workload Characterization
• Drill-Down Analysis
• Specific Tools
slide 5:
Example Problem
• Recent cloud-based performance issue
• Customer problem statement:
  • “Database response time sometimes take multiple seconds. Is the network dropping packets?”
• Tested network using traceroute, which showed some packet drops
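A quick reproduction of that first test looks like the following (a minimal sketch; the database hostname is hypothetical, and loss reported at intermediate hops is often just routers rate-limiting their ICMP replies rather than real end-to-end packet loss):

  # trace the route to the database host; -n avoids slow reverse-DNS lookups
  traceroute -n db.customer.example.com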
slide 6:
Example: Support Path
[diagram: the support path — Customer Issues → 1st Level → 2nd Level → Top (Performance Analysis)]
slide 7:
Example: Support Path
[diagram: the same support path, annotated —
  Top: my turn
  2nd Level: “network looks ok, CPU also ok”
  1st Level: “ran traceroute, can’t reproduce”
  Customer: “network drops?”]
slide 8:
Example: Network Drops
• Old-fashioned: network packet capture (sniffing)
  • Performance overhead during capture (CPU, storage) and post-processing (wireshark)
  • Time consuming to analyze: not real-time
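For reference, that packet-capture workflow is typically something like this (a sketch only; the interface name, port, and filename are assumptions):

  # capture full packets on the suspect interface to a file for offline analysis
  tcpdump -i eth0 -s 0 -w drops.pcap port 5432
  # ...then open drops.pcap in wireshark and look for retransmits and drops

The capture itself and the later wireshark pass are where the CPU, storage, and analysis-time costs mentioned above come from.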
slide 9:
Example: Network Drops
• New: dynamic tracing
  • Efficient: only drop/retransmit paths traced
  • Context: kernel state readable
  • Real-time: analysis and summaries

  # ./tcplistendrop.d
  TIME                   SRC-IP  PORT     DST-IP           PORT
  2012 Jan 19 01:22:49   ...     25691 -> 192.192.240.212  ...
  2012 Jan 19 01:22:49   ...     18423 -> 192.192.240.212  ...
  2012 Jan 19 01:22:49   ...     38883 -> 192.192.240.212  ...
  2012 Jan 19 01:22:49   ...     10739 -> 192.192.240.212  ...
  2012 Jan 19 01:22:49   ...     27988 -> 192.192.240.212  ...
  2012 Jan 19 01:22:49   ...     28824 -> 192.192.240.212  ...
  2012 Jan 19 01:22:49   ...     65070 -> 192.192.240.212  ...
  [...]
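A quick, hedged sanity check before (or alongside) writing a tracer like the above is to read the kernel's own summary counters; the exact counter names differ between illumos and Linux, so the grep pattern here is only a guess to be adjusted:

  # look for TCP listen-drop / listen-queue-overflow counters
  netstat -s | grep -i listen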
slide 10:
Example: Methodology
• Instead of network drop analysis, I began with the USE method to check system health
slide 11:
Example: Methodology
• Instead of network drop analysis, I began with the USE method to check system health
• In [...]
slide 12:
Example: Other Methodologies
• Customer was surprised (are you sure?) I used latency analysis to confirm. Details (if interesting):
  • memory: using both microstate accounting and dynamic tracing to confirm that anonymous page-ins were hurting the database; worst case app thread spent 97% of time waiting on disk (data faults).
  • disk: using dynamic tracing to confirm latency at the application / file system interface; included up to 1000ms fsync() calls.
• Different methodology, smaller audience (expertise), more time (1 hour).
slide 13:
Example: Summary
• What happened:
  • customer, 1st and 2nd level support spent much time chasing network packet drops.
• What could have happened:
  • customer or 1st level follows the USE method and quickly discovers memory and disk issues
    • memory: fixable by customer reconfig
    • disk: could go back to 1st or 2nd level support for confirmation
  • Faster resolution, frees time
slide 14:
Performance Methodology
• Not a tool
• Not a product
• Is a procedure (documentation)
slide 15:
Performance Methodology
• Not a tool -> but tools can be written to help
• Not a product -> could be in monitoring solutions
• Is a procedure (documentation)
slide 16:
Why Now: past
• Performance analysis circa ’90s, metric-orientated:
  • Vendor creates metrics and performance tools
  • Users develop methods to interpret metrics
• Common method: “Tools Method”
  • List available performance tools
  • For each tool, list useful metrics
  • For each metric, determine interpretation
• Problematic: vendors often don’t provide the best metrics; can be blind to issue types
slide 17:
Why Now: changes
• Open Source
• Dynamic Tracing
  • See anything, not just what the vendor gave you
  • Only practical on open source software
  • Hardest part is knowing what questions to ask
slide 18:
Why Now: present
• Performance analysis now (post dynamic tracing), question-orientated:
  • Users pose questions
  • Check if vendor has provided metrics
  • Develop custom metrics using dynamic tracing
• Methodologies pose the questions
• What would previously be an academic exercise is now practical
slide 19:
Methodology Audience
• Beginners: provides a starting point
• Experts: provides a checklist/reminder
slide 20:
Performance Methodologies
• Suggested order of execution:
  1. Problem Statement
  2. The USE Method
  3. Workload Characterization
  4. Drill-Down Analysis (Latency)
slide 21:
Problem Statement
• Typical support procedure (1st Methodology):
  1. What makes you think there is a problem?
  2. Has this system ever performed well?
  3. What changed? Software? Hardware? Load?
  4. Can the performance degradation be expressed in terms of latency or run time?
  5. Does the problem affect other people or applications?
  6. What is the environment? What software and hardware is used? Versions? Configuration?
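Slide 18's point about developing custom metrics with dynamic tracing is usually illustrated with a one-liner of this kind (a sketch; it requires DTrace, i.e. illumos/Solaris, SmartOS, or another OS that ships it):

  # count system calls by process name; prints a summary on Ctrl-C
  dtrace -n 'syscall:::entry { @[execname] = count(); }'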
slide 22:
The USE Method
• Quick System Health Check (2nd Methodology):
• For every resource, check:
  • Utilization
  • Saturation
  • Errors
slide 23:
The USE Method
• Quick System Health Check (2nd Methodology):
• For every resource, check:
  • Utilization: time resource was busy, or degree used
  • Saturation: degree of queued extra work
  • Errors: any errors
[diagram: a resource with Utilization, Saturation, and Errors labelled]
slide 24:
The USE Method: Hardware Resources
• CPUs
• Main Memory
• Network Interfaces
• Storage Devices
• Controllers
• Interconnects
slide 25:
The USE Method: Hardware Resources
• A great way to determine resources is to find (or draw) the server functional diagram
• The hardware team at vendors should have these
• Analyze every component in the data path
slide 26:
The USE Method: Functional Diagrams, Generic Example
[diagram: generic server — CPUs joined by a CPU interconnect, each with DRAM over a memory bus; an I/O bus to the I/O bridge; an expander interconnect to the I/O and network controllers; network interfaces and ports; disks attached over transports]
slide 27:
The USE Method: Resource Types
• There are two different resource types, each defines utilization differently:
  • I/O Resource: eg, network interface
    • utilization: time resource was busy. current IOPS / max or current throughput / max can be used in some cases
  • Capacity Resource: eg, main memory
    • utilization: space consumed
• Storage devices act as both resource types
slide 28:
The USE Method: Software Resources
• Mutex Locks
• Thread Pools
• Process/Thread Capacity
• File Descriptor Capacity
slide 29:
The USE Method: Flow Diagram
[flow diagram: Choose Resource -> Errors Present? -> High Utilization? -> Saturation? -> Problem Identified]
slide 30:
The USE Method: Interpretation
• Utilization
  • 100% usually a bottleneck
  • 70%+ often a bottleneck for I/O resources, especially when high priority work cannot easily interrupt lower priority work (eg, disks)
  • Beware of time intervals. 60% utilized over 5 minutes may mean 100% utilized for 3 minutes then idle
  • Best examined per-device (unbalanced workloads)
slide 31:
The USE Method: Interpretation
• Saturation
  • Any non-zero value adds latency
• Errors
  • Should be obvious
slide 32:
The USE Method: Easy Combinations
Resource            Type         Metric
CPU                 utilization
CPU                 saturation
Memory              utilization
Memory              saturation
Network Interface   utilization
Storage Device I/O  utilization
Storage Device I/O  saturation
Storage Device I/O  errors
slide 33:
The USE Method: Easy Combinations
Resource            Type         Metric
CPU                 utilization  CPU utilization
CPU                 saturation   run-queue length
Memory              utilization  available memory
Memory              saturation   paging or swapping
Network Interface   utilization  RX/TX tput/bandwidth
Storage Device I/O  utilization  device busy percent
Storage Device I/O  saturation   wait queue length
Storage Device I/O  errors       device errors
slide 34:
The USE Method: Harder Combinations
Resource            Type         Metric
CPU                 errors
Network             saturation
Storage Controller  utilization
CPU Interconnect    utilization
Mem. Interconnect   saturation
I/O Interconnect    saturation
slide 35:
The USE Method: Harder Combinations
Resource            Type         Metric
CPU                 errors       eg, correctable CPU cache ECC events
Network             saturation   “nocanputs”, buffering
Storage Controller  utilization  active vs max controller IOPS and tput
CPU Interconnect    utilization  per port tput / max bandwidth
Mem. Interconnect   saturation   memory stall cycles
I/O Interconnect    saturation   bus throughput / max bandwidth
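The easy combinations above map directly onto standard tools. The following is a minimal sketch of one USE pass on a Linux system; it assumes the sysstat package (mpstat, iostat, sar) is installed, and exact column names vary between sysstat versions:

  #!/bin/sh
  # use-check.sh: a quick, coarse USE pass over the common resources

  echo "=== CPU ==="
  mpstat -P ALL 1 1        # utilization: 100 - %idle, per CPU
  vmstat 1 2 | tail -1     # saturation: "r" greater than CPU count

  echo "=== Memory ==="
  free -m                  # utilization: available memory
  vmstat 1 2 | tail -1     # saturation: non-zero "si"/"so" (swapping)

  echo "=== Storage devices ==="
  iostat -x 1 2            # utilization: %util; saturation: queue size / await

  echo "=== Network interfaces ==="
  sar -n DEV 1 1           # utilization: rxkB/s + txkB/s vs interface bandwidth

Error checks are left out of the sketch; as the harder combinations suggest, they tend to need resource-specific sources (eg, interface error counters, ECC event counts).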
slide 36:
The USE Method: tools
• To be thorough, you will need to use:
  • CPU performance counters
    • For bus and interconnect activity; eg, perf events, cpustat
  • Dynamic Tracing
    • For missing saturation and error metrics; eg, DTrace
• Both can get tricky; tools can be developed to help
• Please, no more top variants! ... unless it is interconnect-top or bus-top
• I’ve written dozens of open source tools for both CPC and DTrace; much more can be done
slide 37:
Workload Characterization
• May use as a 3rd Methodology
• Characterize workload by:
  • who is causing the load? PID, UID, IP addr, ...
  • why is the load called? code path
  • what is the load? IOPS, tput, type
  • how is the load changing over time?
• Best performance wins are from eliminating unnecessary work
• Identifies class of issues that are load-based, not architecture-based
slide 38:
Drill-Down Analysis
• May use as a 4th Methodology
• Peel away software layers to drill down on the issue
• Eg, software stack I/O latency analysis:
  Application
  System Call Interface
  File System
  Block Device Interface
  Storage Device Drivers
  Storage Devices
slide 39:
Drill-Down Analysis: Open Source
• With Dynamic Tracing, all function entry & return points can be traced, with nanosecond timestamps.
• One strategy is to measure latency pairs, to search for the source; eg, A->B & C->D:

  static int
  arc_cksum_equal(arc_buf_t *buf)
  {
          zio_cksum_t zc;
          int equal;

          mutex_enter(&buf->b_hdr->b_freeze_lock);
  C       fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
  D       equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
          mutex_exit(&buf->b_hdr->b_freeze_lock);

          return (equal);
  }
slide 40:
Other Methodologies
• Method R
  • A latency-based analysis approach for Oracle databases. See “Optimizing Oracle Performance” by Cary Millsap and Jeff Holt (2003)
• Experimental approaches
  • Can be very useful: eg, validating network throughput using iperf
slide 41:
Specific Tools for the USE Method
slide 42:
illumos-based
• http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist/
  CPU / Utilization: per-cpu: mpstat 1, “idl”; system-wide: vmstat 1, “id”; per-process: prstat -c 1 (“CPU” == recent), prstat -mLc 1 (“USR” + “SYS”); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
  CPU / Saturation: system-wide: uptime, load averages; vmstat 1, “r”; DTrace dispqlen.d (DTT) for a better “vmstat r”; per-process: prstat -mLc 1, “LAT”
  CPU / Errors: fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
  Memory / Saturation: system-wide: vmstat 1, “sr” (bad now), “w” (was very bad); vmstat -p 1, “api” (anon page ins == pain), “apo”; per-process: prstat -mLc 1, “DFL”; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
• ... etc for all combinations (would span a dozen slides)
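The “DTrace profile stack()” entry in the checklist above is commonly run as a one-liner along these lines (a sketch; the 997 Hz rate and the 10-second window are arbitrary choices):

  # sample on-CPU kernel stacks at 997 Hz for 10 seconds, then print counts
  dtrace -n 'profile-997 { @[stack()] = count(); } tick-10s { exit(0); }'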
slide 43:
Linux-based
• http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
  CPU / Utilization: per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”; system-wide: vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”; per-process: top, “%CPU”; htop, “CPU%”; ps -o pcpu; pidstat 1, “%CPU”; per-kernel-thread: top/htop (“K” to toggle), where VIRT == 0 (heuristic). [1]
  CPU / Saturation: system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” > CPU count; dstat -p, “run” > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows “Average” and “Maximum” delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp “queued(us)” [3]
  CPU / Errors: perf (LPE) if processor specific error events (CPC) are available; eg, AMD64’s “04Ah Single-bit ECC Errors Recorded by Scrubber” [4]
• ... etc for all combinations (would span a dozen slides)
slide 44:
Products
• Earlier I said methodologies could be supported by monitoring solutions
• At Joyent we develop Cloud Analytics:
slide 45:
Future
• Methodologies for advanced performance issues
  • I recently worked a complex KVM bandwidth issue where no current methodologies really worked
• Innovative methods based on open source + dynamic tracing
• Less performance mystery. Less guesswork.
• Better use of resources (price/performance)
• Easier for beginners to get started
slide 46:
Thank you
• Resources:
  • http://dtrace.org/blogs/brendan
  • http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/
  • http://dtrace.org/blogs/brendan/tag/usemethod/
  • http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/ - ideas if you are a monitoring solution developer
• brendan@joyent.com