
USENIX SREcon 2016: Performance Checklists for SREs

Talk from SREcon 2016 by Brendan Gregg.

Video: https://www.youtube.com/watch?v=zxCWXNigDpA

USENIX talk page: https://www.usenix.org/conference/srecon16/program/presentation/gregg

Description: "There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.

In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying."


PDF: SREcon_2016_perf_checklists.pdf

Keywords (from pdftotext):

slide 1:
    Performance Checklists for SREs

    Brendan Gregg
    Senior Performance Architect
    
slide 2:
    Performance Checklists

    per instance:
    1. uptime
    2. dmesg -T | tail
    3. vmstat 1
    4. mpstat -P ALL 1
    5. pidstat 1
    6. iostat -xz 1
    7. free -m
    8. sar -n DEV 1
    9. sar -n TCP,ETCP 1
    10. top

    cloud wide:
    1. RPS, CPU
    2. Volume
    3. Instances
    4. Scaling
    5. CPU/RPS
    6. Load Avg
    7. Java Heap
    8. ParNew
    9. Latency
    10. 99th percentile
      
    
slide 3:
slide 4:
    Brendan the SRE

    • On the Perf Eng team & primary on-call rotation for Core: our central SRE team
    – we get paged on SPS dips (starts per second) & more
    • In this talk I'll condense some perf engineering into SRE timescales (minutes) using checklists
    
slide 5:
    Performance Engineering != SRE Performance Incident Response
      
    
slide 6:
    Performance Engineering

    • Aim: best price/performance possible
    – Can be endless: continual improvement
    • Fixes can take hours, days, weeks, months
    – Time to read docs & source code, experiment
    – Can take on large projects no single team would staff
    • Usually no prior "good" state
    – No spot the difference. No starting point.
    – Is now "good" or "bad"? Experience/instinct helps
    • Solo/team work
    At Netflix: the Performance Engineering team, with help from developers
    
slide 7:
    Performance Engineering
      
    
slide 8:
    Performance Engineering
      
    stat tools
    tracers
    benchmarks
    monitoring
    dashboards
    documentation
    source code
    tuning
    PMCs
    profilers
    flame graphs
    
slide 9:
    SRE Perf Incident Response

    • Aim: resolve issue in minutes
    – Quick resolution is king. Can scale up, roll back, redirect traffic.
    – Must cope under pressure, and at 3am
    • Previously was in a "good" state
    – Spot the difference with historical graphs
    • Get immediate help from all staff
    – Must be social
    • Reliability & perf issues often related
    At Netflix: the Core team (5 SREs), with immediate help from developers and performance engineers
    
slide 10:
    SRE Perf Incident Response
    
slide 11:
    SRE Perf Incident Response

      
    custom dashboards
    central event logs
    distributed system tracing
    chat rooms
    pager
    ticket system
    
slide 12:
    Netflix Cloud Analysis Process

    In summary… an example SRE response path, enumerated.
    (Flow diagram, with tools labeled beside each step:)
    Atlas Alerts → 1. Check Issue (Atlas Dashboards; ICE for cost)
    → 2. Check Events (Chronos)
    → 3. Drill Down (Atlas Metrics)
    → 4. Check Dependencies (Mogul, Salp)
    → 5. Root Cause (SSH, instance tools)
    Side paths: Create New Alert; Redirected to a new Target.
    Plus some other tools not pictured.
      
    
slide 13:
    The Need for Checklists

    Speed
    Completeness
    A Starting Point
    An Ending Point
    Reliability
    Training
    Perf checklists have historically been created for perf engineering (hours), not SRE response (minutes).
    More on checklists: Gawande, A., The Checklist Manifesto. Metropolitan Books, 2009.
    (Figure: Boeing 707 Emergency Checklist, 1969)
      
    
slide 14:
    SRE Checklists at Netflix

      
    • Some shared docs
    – PRE Triage Methodology
    – go/triage: a checklist of dashboards
    • Most "checklists" are really custom dashboards
    – Selected metrics for both reliability and performance
    • I maintain my own per-service and per-device checklists
    
slide 15:
    SRE Performance Checklists

    The following are:
    • Cloud performance checklists/dashboards
    • SSH/Linux checklists (lowest common denominator)
    • Methodologies for deriving cloud/instance checklists
    Ad Hoc
    Methodology
    Checklists
    Dashboards
    Including aspirational: what we want to do & build as dashboards
    
slide 16:
    1. PRE Triage Checklist

    Our initial checklist. Netflix specific.
      
    
slide 17:
    PRE Triage Checklist

    • Performance and Reliability Engineering checklist
    – Shared doc with a hierarchical checklist, 66 steps total
    1. Initial Impact
       – record timestamp
       – quantify: SPS, signups, support calls
       – check impact: regional or global?
       – check devices: device specific?
    2. Time Correlations
       1. pretriage dashboard
          1. check for suspect NIWS client: error rates
          2. check for source of error/request rate change
          3. […dashboard specifics…]
    Confirms, quantifies, & narrows the problem. Helps you reason about the cause.
    
slide 18:
    PRE Triage Checklist, cont.

    • 3. Evaluate Service Health
    – perfvitals dashboard
    – mogul dependency correlation
    – by cluster/asg/node:
    • latency: avg, 90th percentile
    • request rate
    • CPU: utilization, sys/user
    • Java heap: GC rate, leaks
    • memory
    • load average
    • thread contention (from Java)
    • JVM crashes
    • network: tput, sockets
    • […]
    custom dashboards
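
For the Java heap/GC items above, a quick instance-level spot check is possible with stock JDK tooling; a hedged sketch (assumes a JDK is on the instance; pgrep -o picks the oldest java process):

    $ jstat -gcutil "$(pgrep -o java)" 1000 5   # heap space % used plus YGC/FGC counts, 1s x 5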
    
slide 19:
    2. predash

    Initial dashboard. Netflix specific.
      
    
slide 20:
    predash
      
    Performance and Reliability Engineering dashboard
    A list of selected dashboards suited for incident response
    
slide 21:
    predash
      
    List of dashboards is its own checklist:
    1. Overview
    2. Client stats
    3. Client errors & retries
    4. NIWS HTTP errors
    5. NIWS Errors by code
    6. DRM request overview
    7. DoS attack metrics
    8. Push map
    9. Cluster status
    
slide 22:
    3. perfvitals

    Service dashboard
      
    
slide 23:
    perfvitals

    1. RPS, CPU
    2. Volume
    3. Instances
    4. Scaling
    5. CPU/RPS
    6. Load Avg
    7. Java Heap
    8. ParNew
    9. Latency
    10. 99th percentile
      
    
slide 24:
    4. Cloud Application Performance Dashboard

    A generic example
      
    
slide 25:
    Cloud App Perf Dashboard

      
    1. Load
    2. Errors
    3. Latency
    4. Saturation
    5. Instances
    
slide 26:
    Cloud App Perf Dashboard

    1. Load: problem of load applied? req/sec, by type
    2. Errors: errors, timeouts, retries
    3. Latency: response time average, 99th percentile, distribution
    4. Saturation: CPU load averages, queue length/time
    5. Instances: scale up/down? count, state, version
    All time series, for every application, and dependencies.
    Draw a functional diagram with the entire data path.
    Same as Google's "Four Golden Signals" (Latency, Traffic, Errors, Saturation), with instances added due to cloud
    – Beyer, B., Jones, C., Petoff, J., Murphy, N. Site Reliability Engineering. O'Reilly, Apr 2016
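
As a toy illustration of the first two signals, load and error counts can even be pulled from an access log when no metrics system is at hand; a sketch (the log path and the status-code field position are hypothetical):

    $ tail -10000 /var/log/app/access.log | \
        awk '{ n++ } $9 >= 500 { e++ } END { if (n) printf "requests=%d errors=%d (%.2f%%)\n", n, e, 100 * e / n }'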
    
slide 27:
    5. Bad Instance Dashboard

    An Anti-Methodology
      
    
slide 28:
    Bad Instance Dashboard

    Plot request time per-instance
    Find the bad instance
    Terminate bad instance
    Someone else's problem now!
    In SRE incident response, if it works, do it.
    (Figure: 95th percentile latency in Atlas Exploder; a bad instance stands out, then: Terminate!)
      
    
slide 29:
    Lots More Dashboards

    We have countless more, mostly app specific and reliability focused
    • Most reliability incidents involve time correlation with a central log system
    Sometimes, dashboards & monitoring aren't enough. Time for SSH.
    (Figure: NIWS HTTP errors, broken down by Error Types, Regions, Apps, Time)
      
    
slide 30:
    6. Linux Performance Analysis in 60,000 milliseconds
      
    
slide 31:
    Linux Perf Analysis in 60s

      
    1. uptime
    2. dmesg -T | tail
    3. vmstat 1
    4. mpstat -P ALL 1
    5. pidstat 1
    6. iostat -xz 1
    7. free -m
    8. sar -n DEV 1
    9. sar -n TCP,ETCP 1
    10. top
    
slide 32:
    Linux Perf Analysis in 60s

    1. uptime: load averages
    2. dmesg -T | tail: kernel errors
    3. vmstat 1: overall stats by time
    4. mpstat -P ALL 1: CPU balance
    5. pidstat 1: process usage
    6. iostat -xz 1: disk I/O
    7. free -m: memory usage
    8. sar -n DEV 1: network I/O
    9. sar -n TCP,ETCP 1: TCP stats
    10. top: check overview
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
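
For convenience during an incident, the ten commands can be wrapped in a small script. A minimal sketch (the script name is hypothetical; it assumes the sysstat package provides mpstat, pidstat, iostat, and sar, and uses short interval/count arguments so one pass stays near 60 seconds):

    #!/bin/bash
    # perf-60s.sh: run the 60-second analysis checklist in order (a sketch).
    set -u
    run() { echo; echo "### $*"; "$@"; }

    run uptime                            # load averages
    run bash -c 'dmesg -T | tail'         # recent kernel errors
    run vmstat 1 5                        # overall stats by time
    run mpstat -P ALL 1 5                 # CPU balance
    run pidstat 1 5                       # per-process usage
    run iostat -xz 1 5                    # disk I/O
    run free -m                           # memory usage
    run sar -n DEV 1 5                    # network I/O
    run sar -n TCP,ETCP 1 5               # TCP stats
    run bash -c 'top -b -n 1 | head -20'  # overview snapshot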
      
    
slide 33:
    60s: uptime, dmesg, vmstat

    $ uptime
     23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

    $ dmesg | tail
    [1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
    [...]
    [1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
    [1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
    [2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

    $ vmstat 1
    (columnar vmstat output garbled by pdftotext; it showed r ≈ 32-34 runnable threads, ~200 GB free memory, and CPUs ~98-99% busy in user time)
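
A quick way to judge whether load averages like those above mean CPU saturation is to compare them with the CPU count; a hedged one-liner sketch:

    $ awk -v ncpu="$(nproc)" '{ s = ($1+0 > ncpu+0) ? "yes" : "no";
        printf "load1=%s ncpu=%d cpu-saturated=%s\n", $1, ncpu, s }' /proc/loadavg
    # note: Linux load averages also include tasks in uninterruptible sleep,
    # so this is a rough signal, not proof of CPU saturation.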
    
slide 34:
    60s: mpstat

    $ mpstat -P ALL 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
    07:38:49 PM  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
    (per-CPU rows garbled by pdftotext)
    [...]
    
slide 35:
    60s: pidstat

    $ pidstat 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
    07:41:02 PM  UID  PID  %usr  %system  %guest  %CPU  CPU  Command
    (rows garbled by pdftotext; they listed rcuos/0, mesos-slave, pidstat, snmp-pass, and two java processes, PIDs 6521 and 6564, each near 1580-1598 %CPU, i.e. ~16 CPUs busy)
    
slide 36:
    60s: iostat

    $ iostat -xmdz 1
    Linux 3.13.0-29 (db001-eb883efa)  08/18/2014  _x86_64_  (16 CPU)
    Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s ... avgqu-sz  await  r_await  w_await  svctm  %util
    (rows for xvda, xvdb, xvdc, md0 garbled by pdftotext; xvdb and xvdc were each doing ~15,000 r/s, md0 ~31,000 r/s)
    Annotations: the left columns describe the applied workload; the right columns (await, svctm, %util) describe the resulting performance.
    
slide 37:
    60s: free, sar -n DEV

    $ free -m
                 total  used  free  shared  buffers  cached
    (values garbled by pdftotext)

    $ sar -n DEV 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
    12:16:48 AM  IFACE  rxpck/s  txpck/s  rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
    (rows for eth0 and docker0 garbled by pdftotext; eth0 was receiving ~19,000 pck/s and ~21 MB/s)
    
slide 38:
    60s: sar -n TCP,ETCP

    $ sar -n TCP,ETCP 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
    TCP columns:  active/s  passive/s  iseg/s  oseg/s
    ETCP columns: atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
    (rows garbled by pdftotext)
    
slide 39:
    60s: top

    $ top
    top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
    Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
    %Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
    KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem
      PID USER  PR  NI  VIRT  RES  SHR  S  %CPU  %MEM  TIME+  COMMAND
    (process rows garbled by pdftotext; the top process was java at ~3090 %CPU, followed by mesos-slave at ~23.5%, with top, java, init, kthreadd, ksoftirqd/0, kworker and rcu_sched threads below)
    
slide 40:
    Other Analysis in 60s

      
    • We need such checklists for:
    – Java
    – Cassandra
    – MySQL
    – Nginx
    – etc…
    • Can follow a methodology:
    – Process of elimination
    – Workload characterization
    – Differential diagnosis
    – Some summaries: http://www.brendangregg.com/methodology.html
    • Turn checklists into dashboards (many do exist)
    
slide 41:
    7. Linux Disk Checklist
      
    
slide 42:
slide 43:
    Linux Disk Checklist

    iostat -xnz 1
    vmstat 1
    df -h
    ext4slower 10
    bioslower 10
    ext4dist 1
    biolatency 1
    cat /sys/devices/…/ioerr_cnt
    smartctl -l error /dev/sda1
    
slide 44:
    Linux Disk Checklist

    iostat -xnz 1: any disk I/O? if not, stop looking
    vmstat 1: is this swapping? or, high sys time?
    df -h: are file systems nearly full?
    ext4slower 10: (zfs*, xfs*, etc.) slow file system I/O?
    bioslower 10: if so, check disks
    ext4dist 1: check distribution and rate
    biolatency 1: if interesting, check disks
    cat /sys/devices/…/ioerr_cnt: (if available) errors
    smartctl -l error /dev/sda1: (if available) errors

    Another short checklist. Won't solve everything. FS focused.
    ext4slower/dist, bioslower, are from bcc/BPF tools.
    
slide 45:
    ext4slower

    • ext4 operations slower than the threshold:
    # ./ext4slower 1
    Tracing ext4 operations slower than 1 ms
    TIME      COMM   PID  T  BYTES  OFF_KB  LAT(ms)  FILENAME
    06:49:17  bash   ...  R  128    ...     7.75     cksum
    06:49:17  cksum  ...  R  39552  ...     1.34     [
    06:49:17  cksum  ...  R  96     ...     5.36     2to3-2.7
    06:49:17  cksum  ...  R  96     ...     14.94    2to3-3.4
    06:49:17  cksum  ...  R  10320  ...     6.82     411toppm
    06:49:17  cksum  ...  R  65536  ...     4.01     a2p
    06:49:17  cksum  ...  R  55400  ...     8.77     ab
    06:49:17  cksum  ...  R  36792  ...     16.34    aclocal-1.14
    06:49:17  cksum  ...  R  15008  ...     19.31    acpi_listen
    […]
    (PID and OFF_KB column values lost in extraction)
    • Better indicator of application pain than disk I/O
    • Measures & filters in-kernel for efficiency using BPF
    – From https://github.com/iovisor/bcc
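
To try these bcc tools they must first be installed; a hedged sketch for Ubuntu (package and tool names vary by distro and release; on Ubuntu the tools carry a -bpfcc suffix):

    $ sudo apt-get install -y bpfcc-tools    # Ubuntu packaging of iovisor/bcc
    $ sudo ext4slower-bpfcc 10               # ext4 ops slower than 10 ms
    $ sudo biolatency-bpfcc 1                # block I/O latency histograms, 1s intervals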
    
slide 46:
    BPF is coming…

    Free your mind
      
    
slide 47:
    BPF

    • That file system checklist should be a dashboard:
    – FS & disk latency histograms, heatmaps, IOPS, outlier log
    • Now possible with enhanced BPF (Berkeley Packet Filter)
    – Built into Linux 4.x: dynamic tracing, filters, histograms
    System dashboards of 2017+ should look very different
    
slide 48:
    8. Linux Network Checklist
      
    
slide 49:
    Linux Network Checklist

      
    1. sar -n DEV,EDEV 1
    2. sar -n TCP,ETCP 1
    3. cat /etc/resolv.conf
    4. mpstat -P ALL 1
    5. tcpretrans
    6. tcpconnect
    7. tcpaccept
    8. netstat -rnv
    9. check firewall config
    10. netstat -s
    
slide 50:
    Linux Network Checklist

    1. sar -n DEV,EDEV 1: at interface limits? or use nicstat
    2. sar -n TCP,ETCP 1: active/passive load, retransmit rate
    3. cat /etc/resolv.conf: it's always DNS
    4. mpstat -P ALL 1: high kernel time? single hot CPU?
    5. tcpretrans: what are the retransmits? state?
    6. tcpconnect: connecting to anything unexpected?
    7. tcpaccept: unexpected workload?
    8. netstat -rnv: any inefficient routes?
    9. check firewall config: anything blocking/throttling?
    10. netstat -s: play 252 metric pickup
    tcp*, are from bcc/BPF tools
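
If sar isn't installed, the same TCP counters can be read directly; a minimal sketch (assumes the standard /proc/net/snmp Tcp: field order, where field 12 is OutSegs and field 13 is RetransSegs):

    #!/bin/bash
    # retrans-rate.sh: 1-second TCP retransmit rate from /proc/net/snmp (a sketch).
    read o1 r1 < <(awk '/^Tcp: [0-9]/ { print $12, $13 }' /proc/net/snmp)
    sleep 1
    read o2 r2 < <(awk '/^Tcp: [0-9]/ { print $12, $13 }' /proc/net/snmp)
    echo "oseg/s: $((o2 - o1))   retrans/s: $((r2 - r1))"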
      
    
slide 51:
    tcpretrans

    • Just trace kernel TCP retransmit functions for efficiency:
    # ./tcpretrans
    TIME      PID  IP  LADDR:LPORT         T>  RADDR:RPORT         STATE
    01:55:05  0    4   10.153.223.157:22   R>  69.53.245.40:34619  ESTABLISHED
    01:55:05  0    4   10.153.223.157:22   R>  69.53.245.40:34619  ESTABLISHED
    01:55:17  0    4   10.153.223.157:22   R>  69.53.245.40:22957  ESTABLISHED
    […]
    • From either bcc (BPF) or perf-tools (ftrace, older kernels)
    
slide 52:
    9. Linux CPU Checklist
      
    
slide 53:
    (too many lines – should be a utilization heat map)
    
slide 54:
    http://www.brendangregg.com/HeatMaps/subsecondoffset.html
    
slide 55:
    $ perf script
    […]
    java 14327 [022] 252764.179741: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
    java 14315 [014] 252764.183517: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
    java 14310 [012] 252764.185317: cycles:  7f36570a4932 SpinPause (/usr/lib/jvm/java-8
    java 14332 [015] 252764.188720: cycles:  7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
    java 14341 [019] 252764.191307: cycles:  7f3656d150c8 ClassLoaderDataGraph::do_unloa
    java 14341 [019] 252764.198825: cycles:  7f3656d140b8 ClassLoaderData::free_dealloca
    java 14341 [019] 252764.207057: cycles:  7f3657192400 nmethod::do_unloading(BoolObje
    java 14341 [019] 252764.215962: cycles:  7f3656ba807e Assembler::locate_operand(unsi
    java 14341 [019] 252764.225141: cycles:  7f36571922e8 nmethod::do_unloading(BoolObje
    java 14341 [019] 252764.234578: cycles:  7f3656ec4960 CodeHeap::block_start(void*) c
    […]
    
slide 56:
    Linux CPU Checklist

      
    uptime
    vmstat 1
    mpstat -P ALL 1
    pidstat 1
    CPU flame graph
    CPU subsecond offset heat map
    perf stat -a -- sleep 10
    
slide 57:
    Linux CPU Checklist

    1. uptime: load averages
    2. vmstat 1: system-wide utilization, run q length
    3. mpstat -P ALL 1: CPU balance
    4. pidstat 1: per-process CPU
    5. CPU flame graph: CPU profiling
    6. CPU subsecond offset heat map: look for gaps
    7. perf stat -a -- sleep 10: IPC, LLC hit ratio
    htop can do 1-4
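
A hedged sketch of deriving IPC and LLC hit ratio from item 7, using generic perf event names (the exact counters available vary by CPU and kernel):

    $ perf stat -a -e cycles,instructions,LLC-loads,LLC-load-misses -- sleep 10
    # IPC = instructions / cycles (higher generally means less stalled)
    # LLC hit ratio = 1 - (LLC-load-misses / LLC-loads)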
    
slide 58:
    htop
      
    
slide 59:
    CPU Flame Graph
      
    
slide 60:
    perf_events CPU Flame Graphs

    • We have this automated in Netflix Vector:
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame graph interpretation:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension (eg, diff)
    – Top edge is on-CPU, beneath it is ancestry
    • Can also do Java & Node.js. Differentials.
    • We're working on a d3 version for Vector
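
A variation for profiling a single process rather than the whole system (a sketch; assumes the target is the oldest java process):

    $ perf record -F 99 -p "$(pgrep -o java)" -g -- sleep 30
    $ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf-java.svg
    # note: JIT'd Java frames need extra help to resolve fully
    # (e.g. -XX:+PreserveFramePointer plus a JIT symbol map).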
    
slide 61:
    10. Tools Method

    An Anti-Methodology
      
    
slide 62:
    Tools Method

    1. RUN EVERYTHING AND HOPE FOR THE BEST
    For SRE response: a mental checklist to see what might have been missed (no time to run them all)
    
slide 63:
    Linux Perf Observability Tools
      
    
slide 64:
    Linux Static Performance Tools
      
    
slide 65:
    Linux perf-tools (ftrace, perf)
      
    
slide 66:
    Linux bcc tools (BPF)

    Needs Linux 4.x
    CONFIG_BPF_SYSCALL=y
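
A quick way to check both requirements on a given instance (a sketch; kernel config file locations vary by distro):

    $ uname -r                                      # want 4.x or newer
    $ grep CONFIG_BPF_SYSCALL "/boot/config-$(uname -r)" 2>/dev/null \
        || { zcat /proc/config.gz 2>/dev/null | grep CONFIG_BPF_SYSCALL; }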
      
    
slide 67:
    11. USE Method

    A Methodology
      
    
slide 68:
    The USE Method

    • For every resource, check:
    Utilization
    Saturation
    Errors
    (Diagram: a resource X, with its utilization shown in %)
    • Definitions:
    – Utilization: busy time
    – Saturation: queue length or queued time
    – Errors: easy to interpret (objective)
    Used to generate checklists. Starts with the questions, then finds the tools.
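
For example, a hedged sketch of applying USE to one resource (CPUs) with stock Linux tools:

    $ nproc                         # CPU count, for judging saturation
    $ mpstat -P ALL 1 5             # Utilization: per-CPU busy %
    $ vmstat 1 5                    # Saturation: "r" run-queue length vs CPU count
    $ dmesg -T | grep -i mce        # Errors: machine-check exceptions, if any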
    
slide 69:
    USE Method for Hardware

    • For every resource, check:
    Utilization
    Saturation
    Errors
    • Including busses & interconnects
    
slide 70:
    (http://www.brendangregg.com/USEmethod/use-linux.html)
      
    
slide 71:
    USE Method for Distributed Systems

      
    • Draw a service diagram, and for every service:
    Utilization: resource usage (CPU, network)
    Saturation: request queueing, timeouts
    Errors
    • Turn into a dashboard
    
slide 72:
    Netflix Vector

      
    • Real time instance analysis tool
    – https://github.com/netflix/vector
    – http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
    • USE method-inspired metrics
    – More in development, incl. flame graphs
    
slide 73:
    Netflix Vector
      
    
slide 74:
    Netflix Vector

    (Dashboard screenshot: per-resource metric columns showing utilization, saturation, and load for CPU, Memory, Network, and Disk)
    
slide 75:
    12. Bonus: External Factor Checklist
      
    
slide 76:
    External Factor Checklist

      
    1. Sports ball?
    2. Power outage?
    3. Snow storm?
    4. Internet/ISP down?
    5. Vendor firmware update?
    6. Public holiday/celebration?
    7. Chaos Kong?
    Social media searches (Twitter) often useful
    – Can also be NSFW
    
slide 77:
    Takeaways

    • Checklists are great
    – Speed, Completeness, Starting/Ending Point, Training
    – Can be ad hoc, or from a methodology (USE method)
    • Service dashboards
    – Serve as checklists
    – Metrics: Load, Errors, Latency, Saturation, Instances
    • System dashboards with Linux BPF
    – Latency histograms & heatmaps, etc. Free your mind.
    Please create and share more checklists
    
slide 78:
    References

    Netflix Tech Blog:
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
    http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
    Linux Performance & BPF tools:
    http://www.brendangregg.com/linuxperf.html
    https://github.com/iovisor/bcc#tools
    USE Method Linux:
    http://www.brendangregg.com/USEmethod/use-linux.html
    Flame Graphs:
    http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
    Heat maps:
    http://cacm.acm.org/magazines/2010/7/95062-visualizing-system-latency/fulltext
    http://www.brendangregg.com/heatmaps.html
    Books:
    Beyer, B., et al. Site Reliability Engineering. O'Reilly, Apr 2016
    Gawande, A. The Checklist Manifesto. Metropolitan Books, 2009
    Gregg, B. Systems Performance. Prentice Hall, 2013 (more checklists & methods!)
    Thanks: Netflix Perf & Core teams for predash, pretriage, Vector, etc.
    
slide 79:
    Thanks

      
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg
    Netflix is hiring SREs!