Performance Analysis Methodology
A performance analysis methodology is a procedure that you can follow to analyze system or application performance. Methodologies generally provide a starting point and then guidance toward the root cause or causes. Different methodologies suit different classes of issues, and you may try more than one before accomplishing your goal.
Analysis without a methodology can become a fishing expedition, where metrics are examined ad hoc, until the issue is found – if it is at all.
Methodologies documented in more detail on this site are:
- The USE Method: for finding resource bottlenecks
- The TSA Method: for analyzing application time
- Off-CPU Analysis: for analyzing any type of thread wait latency
- Active Benchmarking: for accurate and successful benchmarking
The following briefly summarizes methodologies I've either created or encountered. You can print these all out as a cheat sheet or reminder.
Summaries
I first summarized and named various performance methodologies (mostly developed by me) for my USENIX LISA 2012 talk: Performance Analysis Methodology (PDF, slideshare, youtube, USENIX), then later documented them in my Systems Performance book, and the ACMQ article Thinking Methodically about Performance, which was also published in Communications of the ACM, Feb 2013. More detailed references are at the end of this page.
The following is my most up-to-date summary list, with methodologies enumerated. These begin with anti-methods, which are included for comparison and are not to be followed.
Anti-Methodologies
Blame-Someone-Else Anti-Method
1. Find a system or environment component you are not responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1
Streetlight Anti-Method
- Pick observability tools that are:
  - familiar
  - found on the Internet
  - found at random
- Run tools
- Look for obvious issues
Drunk Man Anti-Method
- Change things at random until the problem goes away
Random Change Anti-Method
1. Measure a performance baseline
2. Pick a random attribute to change (eg, a tunable)
3. Change it in one direction
4. Measure performance
5. Change it in the other direction
6. Measure performance
7. Were the step 4 or 6 results better than the baseline? If so, keep the change; if not, revert
8. Go to step 1
Passive Benchmarking Anti-Method
- Pick a benchmark tool
- Run it with a variety of options
- Make a slide deck of the results
- Hand the slides to management
Traffic Light Anti-Method
- Open dashboard
- All green? Assume everything is good.
- Something red? Assume that's a problem.
Methodologies
Ad Hoc Checklist Method
- 1..N. Run A, if B, do C
Problem Statement Method
- What makes you think there is a performance problem?
- Has this system ever performed well?
- What has changed recently? (Software? Hardware? Load?)
- Can the performance degradation be expressed in terms of latency or run time?
- Does the problem affect other people or applications (or is it just you)?
- What is the environment? What software and hardware is used? Versions? Configuration?
RTFM Method
(Read The Fine Manual) How to research performance tools and metrics:
- Man pages
- Books
- Web search
- Co-workers
- Prior talk slides/video
- Support services
- Source code
- Experimentation
- Social Media
Scientific Method
- Question
- Hypothesis
- Prediction
- Test
- Analysis
OODA Loop
- Observe
- Orient
- Decide
- Act
Workload Characterization Method
- Who is causing the load? (PID, UID, IP addr, ...)
- Why is the load called? (code path)
- What is the load? (IOPS, tput, type)
- How is the load changing over time? (time series line graph)
Drill-Down Analysis Method
1. Start at highest level
2. Examine next-level details
3. Pick most interesting breakdown
4. If problem unsolved, go to 2
Process of Elimination
1. Divide the target into components
2. Choose a test which:
   - Can exonerate many untested components (ideally, half of those remaining)
   - Is quick to perform
3. Perform test
4. Were the tested components exonerated?
   - Yes: go to 2
   - No: problem found?
     - Yes: done
     - No: how many components were tested?
       - one: target = tested component; go to 1
       - multiple: go to 2
   - Not sure: consider components untested; go to 2 and choose a different test
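The halving idea behind this method can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: exactly one component is at fault, the list is not empty, and the test always gives a clear answer (the "not sure" branch is left out). The names `eliminate` and `problem_in`, and the example layers, are hypothetical.

```python
from typing import Callable, Sequence


def eliminate(components: Sequence[str],
              problem_in: Callable[[Sequence[str]], bool]) -> str:
    """Narrow a non-empty list of components down to one suspect by halving.

    problem_in(subset) is a hypothetical test: it returns True when the
    problem is found within `subset` (the subset is NOT exonerated).
    """
    remaining = list(components)
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]
        if problem_in(half):
            remaining = half                    # not exonerated: keep this half
        else:
            remaining = remaining[len(half):]   # exonerated: discard this half
    return remaining[0]


# Assumed example: the (unknown to us) culprit is the file system layer.
layers = ["application", "libraries", "syscalls", "file system", "block device"]
culprit = eliminate(layers, lambda subset: "file system" in subset)
print(culprit)
```

Each test exonerates about half of the remaining components, so the number of tests grows with the logarithm of the component count rather than linearly.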
Time Division Method
- Measure operation time (or latency)
- Divide time into logical synchronous components
- Continue division until latency origin is identified
- Quantify: estimate speedup if problem fixed
(I previously called this the "Latency Analysis Method")
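As a rough illustration of the final "quantify" step, the sketch below takes one level of time division and estimates the best-case speedup if the largest component were eliminated entirely. The component names and timings are invented for the example, and the components are assumed to be synchronous and non-overlapping so that they sum to the operation time.

```python
def estimate_speedup(components: dict[str, float]) -> tuple[str, float]:
    """Return the largest component and the speedup if it took zero time."""
    total = sum(components.values())
    name = max(components, key=components.get)
    remaining = total - components[name]
    speedup = total / remaining if remaining > 0 else float("inf")
    return name, speedup


# Hypothetical breakdown of one request, in milliseconds.
request_ms = {"dns": 2.0, "connect": 3.0, "server": 40.0, "transfer": 5.0}
name, speedup = estimate_speedup(request_ms)
print(f"{name}: up to {speedup:.1f}x faster if eliminated")
```

This is an upper bound: a real fix usually shrinks a component rather than removing it, so the achieved speedup will be smaller.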
5 Whys Performance Method
- 1. Given delivered performance, ask, "why?", then answer this question
- 2..5. Given previous answer, ask, "why?", then answer this question
By-Layer Method
Measure latency in detail (eg, as a histogram) from:
- Dynamic languages
- Executable
- Libraries
- Syscalls
- Kernel: FS, network
- Device drivers
Investigate the lowest layer at which latency is introduced
Tools Method
- List available performance tools (optionally add more)
- For each tool, list its useful metrics
- For each metric, list possible interpretation
- Run selected tools and interpret selected metrics.
USE Method
For every resource, check:
- Utilization
- Saturation
- Errors
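A minimal sketch of the loop this implies, assuming the three metrics have already been collected for each resource. The `Resource` type, the sample values, and the 70% utilization threshold are illustrative assumptions, not part of the method's definition.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: float   # queued or waiting work, eg run-queue length
    errors: int         # error event count


def use_check(resources, util_limit=0.7):
    """Return (resource, finding) pairs; util_limit is an assumed threshold."""
    findings = []
    for r in resources:
        if r.errors > 0:
            findings.append((r.name, "errors"))
        if r.saturation > 0:
            findings.append((r.name, "saturation"))
        if r.utilization >= util_limit:
            findings.append((r.name, "utilization"))
    return findings


# Invented sample metrics for three resources.
resources = [
    Resource("cpu", utilization=0.95, saturation=4.0, errors=0),
    Resource("disk", utilization=0.20, saturation=0.0, errors=3),
    Resource("network", utilization=0.10, saturation=0.0, errors=0),
]
findings = use_check(resources)
for name, finding in findings:
    print(name, finding)
```

The value of the method is in iterating over every resource, so the hard part in practice is building the complete resource list and finding a metric for each cell.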
RED Method
For every service or microservice, check:
- Request rate
- Errors
- Duration
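The three RED metrics can be derived from a log of requests. The sketch below assumes each request is recorded as a hypothetical `(duration_seconds, ok)` tuple over a known time window; real services would more likely export these as counters and histograms.

```python
import statistics


def red_summary(requests, window_seconds):
    """requests: list of (duration_seconds, ok) tuples seen in the window."""
    durations = [d for d, _ in requests]
    failures = sum(1 for _, ok in requests if not ok)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": failures / len(requests) if requests else 0.0,
        "median_duration_s": statistics.median(durations) if durations else 0.0,
        "max_duration_s": max(durations, default=0.0),
    }


# Invented sample: four requests observed over a two-second window.
sample = [(0.010, True), (0.012, True), (0.250, False), (0.011, True)]
summary = red_summary(sample, window_seconds=2.0)
print(summary)
```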
CPU Profile Method
- Take a CPU profile (especially a flame graph)
- Understand all software in profile > 1%
Off-CPU Analysis
- Profile per-thread off-CPU time with stack traces
- Coalesce times with like stacks
- Study stacks from longest to shortest time
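The coalesce and ordering steps can be sketched as follows, assuming the off-CPU profile has already been collected as `(stack, seconds)` pairs. The stack frames and times are made up for illustration, and collecting the profile itself is outside this sketch.

```python
from collections import defaultdict


def coalesce(samples):
    """samples: iterable of (stack, off_cpu_seconds); stack is a sequence of frames.

    Sums time for identical stacks and returns them ordered by total time,
    largest first.
    """
    totals = defaultdict(float)
    for stack, seconds in samples:
        totals[tuple(stack)] += seconds
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented off-CPU events from two code paths.
samples = [
    (("main", "read_config", "read"), 0.5),
    (("main", "worker", "lock_acquire"), 1.25),
    (("main", "read_config", "read"), 0.25),
    (("main", "worker", "lock_acquire"), 2.0),
]
ranked = coalesce(samples)
for stack, seconds in ranked:
    print(f"{seconds:6.2f}s  {';'.join(stack)}")
```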
Stack Profile Method
- Profile thread stack traces, on- and off-CPU
- Coalesce
- Study stacks bottom-up
TSA Method
- For each thread of interest, measure time in operating system thread states. Eg:
  - Executing
  - Runnable
  - Swapping
  - Sleeping
  - Lock
  - Idle
- Investigate states from most to least frequent, using appropriate tools
Active Benchmarking Method
- Configure the benchmark to run for a long duration
- While running, analyze performance using other tools, and determine limiting factors
Method R
1. Select user actions that matter for the business workload
2. Measure causes of response time for user actions
3. Calculate best net-payoff optimization activity
4. If sufficient gain, tune
5. If insufficient gain, suspend tuning until something changes
6. Go to 1
Performance Evaluation Steps
- State the goals of the study and define system boundaries
- List system services and possible outcomes
- Select performance metrics
- List system and workload parameters
- Select factors and their values
- Select the workload
- Design the experiments
- Analyze and interpret the data
- Present the results
- If necessary, start over
Capacity Planning Process
- Instrument the system
- Monitor system usage
- Characterize workload
- Predict performance under different alternatives
- Select the lowest cost, highest performance alternative
Intel Hierarchical Top-Down Performance Characterization Methodology
- Are UOPs issued?
  - If yes:
    - Are UOPs retired?
      - If yes: retiring (good)
      - If no: investigate bad speculations
  - If no:
    - Allocation stall?
      - If yes: investigate back-end stalls
      - If no: investigate front-end stalls
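The decision tree above maps directly onto a small function. This sketch only encodes the branching as listed here. The function name and boolean parameters are hypothetical, and in practice the answers come from hardware performance counters rather than simple flags.

```python
def top_down_category(uops_issued: bool,
                      uops_retired: bool = False,
                      allocation_stall: bool = False) -> str:
    """Classify a pipeline slot by following the decision tree."""
    if uops_issued:
        # Issued UOPs either retire (useful work) or are discarded.
        return "retiring" if uops_retired else "bad speculation"
    # Nothing issued: find out which side of the pipeline stalled.
    return "back-end stall" if allocation_stall else "front-end stall"


print(top_down_category(uops_issued=True, uops_retired=True))
print(top_down_category(uops_issued=False, allocation_stall=True))
```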
Performance Mantras
- Don't do it
- Do it, but don't do it again
- Do it less
- Do it later
- Do it when they're not looking
- Do it concurrently
- Do it cheaper
Benchmarking Checklist
- Why not double?
- Did it break limits?
- Did it error?
- Does it reproduce?
- Does it matter?
- Did it even happen?
References
- Blame-Someone-Else Anti-Method and the USE Method were developed by me and first in print in [Gregg 13a]: Gregg, B., "Thinking Methodically about Performance", Communications of the ACM, Volume 56 Issue 2, Feb 2013. This article also included the Streetlight Anti-Method and the Problem Statement Method for the first time in print, also developed by me, but based on earlier work (the streetlight effect, https://en.wikipedia.org/wiki/Streetlight_effect, and initial performance checklists used by Sun Microsystems support).
- Random Change Anti-Method, Passive Benchmarking Anti-Method, Ad Hoc Checklist Method, Tools Method, TSA Method, and Active Benchmarking, were developed by me and first in print in [Gregg 13b]: Gregg, B., Systems Performance: Enterprise and the Cloud, Prentice Hall, Oct 2013.
- My USENIX LISA 2012 talk: "Performance Analysis Methodology", first summarized and named many of these (before they were in print).
- Method R is from [Millsap 03]: Millsap, C., Holt, J., Optimizing Oracle Performance, O'Reilly, 2003.
- Performance Evaluation Steps and Capacity Planning Process are from pages 26 and 124 of [Jain 91]: Jain, R., The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, 1991.
- Workload Characterization Method, and Drill-Down Analysis Method, were documented as specific methodologies in [Gregg 13a] and [Gregg 13b], but the general process has been known in IT for many years. You could glean it from [Jain 91], at least. I don't yet have an earlier reference.
- Intel Hierarchical Top-Down Performance Characterization Methodology was developed by Ahmad Yasin at Intel and documented in B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual, 248966-030, Sep 2014.
- OODA Loop is from John Boyd of the US Air Force, and popularized in technology by Roy Rapoport of Netflix.
- Performance Mantras are from Craig Hanson and Pat Crain.
- Benchmarking Checklist was developed by me.
- RED Method is from Tom Wilkie.