YOW! 2018: Cloud Performance Root Cause Analysis at Netflix

Keynote by Brendan Gregg for YOW! 2018.

Video: https://www.youtube.com/watch?v=tAY8PnfrS_k

Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."


PDF: YOW2018_CloudPerfRCANetflix.pdf

Keywords (from pdftotext):

slide 1:
    Cloud Performance
    Root Cause Analysis
    at Netflix
    Brendan Gregg
    Senior Performance Architect
    Cloud and Platform Engineering
    YOW! Conference Australia
    Nov-Dec 2018
    
slide 2:
    Experience: CPU Dips
    
slide 3:
slide 4:
    # perf record -F99 -a
    # perf script
    […]
    java 14327 [022] 252764.179741: cycles:
    java 14315 [014] 252764.183517: cycles:
    java 14310 [012] 252764.185317: cycles:
    java 14332 [015] 252764.188720: cycles:
    java 14341 [019] 252764.191307: cycles:
    java 14341 [019] 252764.198825: cycles:
    java 14341 [019] 252764.207057: cycles:
    java 14341 [019] 252764.215962: cycles:
    java 14341 [019] 252764.225141: cycles:
    java 14341 [019] 252764.234578: cycles:
    […]
    7f36570a4932 SpinPause (/usr/lib/jvm/java-8
    7f36570a4932 SpinPause (/usr/lib/jvm/java-8
    7f36570a4932 SpinPause (/usr/lib/jvm/java-8
    7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
    7f3656d150c8 ClassLoaderDataGraph::do_unloa
    7f3656d140b8 ClassLoaderData::free_dealloca
    7f3657192400 nmethod::do_unloading(BoolObje
    7f3656ba807e Assembler::locate_operand(unsi
    7f36571922e8 nmethod::do_unloading(BoolObje
    7f3656ec4960 CodeHeap::block_start(void*) c
    
slide 5:
slide 6:
slide 7:
    Observability
    Methodology
    Velocity
    
slide 8:
    Root Cause Analysis at Netflix
    Devices
    gRPC
    Zuul 2
    Load
    Ribbon
    Hystrix
    Eureka
    Service
    Tomcat
    JVM
    Instances (Linux)
    AZ 3
    AZ 1
    AZ 2
    ASG 1
    ELB
    ASG Cluster
    Application
    Netflix
    Roots
    ASG 2
    Atlas
    Chronos
    Zipkin
    Vector
    sar, *stat
    ftrace
    bcc/eBPF
    bpftrace
    PMCs, MSRs
    
slide 9:
    Agenda
    1. The Netflix Cloud
    2. Methodology
    3. Cloud Analysis
    4. Instance Analysis
    
slide 10:
    Since 2014
    Asgard → Spinnaker
    Salp → Zipkin
    gRPC adoption
    New Atlas UI & Lumen
    Java frame pointer
    eBPF bcc & bpftrace
    PMCs in EC2
    From Clouds to Roots (2014 presentation): Old Atlas UI
    
slide 11:
    >150k AWS EC2 server instances
    ~34% US Internet traffic at night
    >130M members
    Performance is customer satisfaction & Netflix cost
    
slide 12:
    Acronyms
    AWS: Amazon Web Services
    EC2: AWS Elastic Compute 2 (cloud instances)
    S3: AWS Simple Storage Service (object store)
    ELB: AWS Elastic Load Balancers
    SQS: AWS Simple Queue Service
    SES: AWS Simple Email Service
    CDN: Content Delivery Network
    OCA: Netflix Open Connect Appliance (streaming CDN)
    QoS: Quality of Service
    AMI: Amazon Machine Image (instance image)
    ASG: Auto Scaling Group
    AZ: Availability Zone
    NIWS: Netflix Internal Web Service framework (Ribbon)
    gRPC: gRPC Remote Procedure Calls
    MSR: Model Specific Register (CPU info register)
    PMC: Performance Monitoring Counter (CPU perf counter)
    eBPF: extended Berkeley Packet Filter (kernel VM)
    
slide 13:
    1. The Netflix Cloud
    Overview
    
slide 14:
    The Netflix Cloud
    EC2
    ELB
    Cassandra
    Applications
    (Services)
    Elasticsearch
    EVCache
    SES
    SQS
    
slide 15:
    Netflix
    Microservices
    Authentication
    Web Site API
    User Data
    Personalization
    EC2
    Client
    Devices
    Streaming API
    Viewing Hist.
    DRM
    QoS Logging
    OCA CDN
    CDN Steering
    Encoding
    
slide 16:
    Freedom and Responsibility
    Culture deck memo is true
    https://jobs.netflix.com/culture
    Deployment freedom
    Purchase and use cloud instances without approvals
    Netflix environment changes fast!
    
slide 17:
    Cloud Technologies
    Usually open source
    Linux, Java, Cassandra,
    Node.js, …
    http://netflix.github.io/
    
slide 18:
    Cloud Instances
    Linux (Ubuntu)
    Optional Apache,
    memcached, non-Java apps (incl.
    Node.js, golang)
    Atlas monitoring,
    S3 log rotation,
    ftrace, perf,
    bcc/eBPF
    Java (JDK 8)
    GC and
    thread
    dump
    logging
    Tomcat
    Application war files, base
    servlet, platform, hystrix,
    health check, metrics (Servo)
    Typical BaseAMI
    
slide 19:
    5 Key Issues
    And How the Netflix Cloud is
    Architected to Solve Them
    
slide 20:
    1. Load Increases → Auto Scaling Groups
    Instances automatically
    added or removed by a
    custom scaling policy
    Alerts & monitoring used
    to check scaling is sane
    Good for customers: Fast workaround
    Good for engineers: Fix later, 9-5
    ASG
    CloudWatch, Servo
    Scaling Policy
    loadavg, latency, …
    Instance
    Instance
    Instance
    Instance
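
    A minimal sketch of such a policy with the AWS CLI (the group and policy
    names and the adjustment are hypothetical; Netflix uses custom policies):

        # hypothetical simple-scaling policy: add 2 instances when triggered
        aws autoscaling put-scaling-policy \
            --auto-scaling-group-name myapp-v010 \
            --policy-name scale-up \
            --adjustment-type ChangeInCapacity \
            --scaling-adjustment 2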
    
slide 21:
    2. Bad Push → ASG Cluster Rollback
    ASG red black clusters: how code
    versions are deployed
    Fast rollback for issues
    Traffic managed by Elastic Load
    Balancers (ELBs)
    Automated Canary Analysis (ACA)
    for testing
    ASG
    Cluster
    prod1
    ELB
    Canary
    ASG-v010
    ASG-v011
    Instance
    Instance
    Instance
    Instance
    Instance
    Instance
    
slide 22:
    3. Instance Failure → Hystrix Timeouts
    Hystrix: latency and fault tolerance
    for dependency services
    Fallbacks, degradation, fast fail and rapid
    recovery, timeouts, load shedding, circuit
    breaker, realtime monitoring
    Plus Ribbon or gRPC for more fault tolerance
    Tomcat
    Application
    get A
    Hystrix
    >100ms
    Dependency
    Dependency
    
slide 23:
    4. Region Failure → Zuul 2 Reroute Traffic
    All device traffic goes through the Zuul 2 proxy: dynamic routing, monitoring,
    resiliency, security
    Region or AZ failure: reroute traffic to another region
    Zuul 2, DNS
    Region 1
    Region 2
    Monitoring
    Region 3
    
slide 24:
    5. Overlooked Issue → Chaos Engineering
    Instances: termination
    (Resilience)
    Availability Zones: artificial failures
    Latency: artificial delays by ChAP
    Conformity: kills non-best-practices instances
    Doctor: health checks
    Janitor: kills unused instances
    Security: checks violations
    10-18: geographic issues
    
slide 25:
    A Resilient Architecture
    Devices
    gRPC
    Zuul 2
    Load
    Ribbon
    Hystrix
    Eureka
    Service
    Tomcat
    JVM
    Instances (Linux)
    AZ 3
    AZ 1
    AZ 2
    ASG 1
    ELB
    Some services vary:
    - Apache Web Server
    - Node.js & Prana
    - golang
    ASG Cluster
    Application
    Netflix
    Chaos Engineering
    ASG 2
    
slide 26:
    2. Methodology
    Cloud & Instance
    
slide 27:
    Why Do Root Cause Perf Analysis?
    Netflix
    Application
    ASG Cluster
    … ASG 2
    Often for:
    High latency
    Growth
    Upgrades
    ELB
    ASG 1
    AZ 3
    AZ 2
    AZ 1
    Instances (Linux)
    JVM
    Tomcat
    Service
    
slide 28:
    Cloud Methodologies
    Resource Analysis
    Metric and event correlations
    Latency Drilldowns
    RED Method
    For each microservice, check:
    - Rate
    - Errors
    - Duration
    Service A
    Service C
    Service B
    Service D
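
    As a sketch, RED-style checks can be pulled from Atlas's graph API (the
    metric names and host here are hypothetical):

        # rate, errors, duration for one microservice
        curl 'http://atlas.example.com/api/v1/graph?q=name,requestRate,:eq,:sum'
        curl 'http://atlas.example.com/api/v1/graph?q=name,errorRate,:eq,:sum'
        curl 'http://atlas.example.com/api/v1/graph?q=name,latency,:eq,:avg'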
    
slide 29:
    Instance Methodologies
    Log Analysis
    Micro-benchmarking
    Drill-down analysis
    USE Method
    For each resource, check:
    - Utilization
    - Saturation
    - Errors
    CPU
    Memory
    Disk
    Controller
    Network
    Controller
    Disk
    Net
    Disk
    Net
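
    A minimal USE-method pass with standard Linux tools, checking utilization,
    saturation, and errors per resource:

        mpstat -P ALL 1      # CPU: per-CPU utilization
        vmstat 1             # CPU saturation: r > CPU count; memory: si/so swapping
        free -m              # memory utilization
        iostat -x 1          # disk: %util; saturation: avgqu-sz
        sar -n DEV,EDEV 1    # network: throughput vs line rate, errors
        sar -n TCP,ETCP 1    # TCP: retransmits as a saturation/error signal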
    
slide 30:
    Bad Instance Anti-Method
    1. Plot request latency per-instance
    2. Find the bad instance
    3. Terminate it
    4. Someone else’s problem now!
    Bad instance latency
    Terminate!
    Could be an early warning of a bigger issue
    
slide 31:
    3. Cloud Analysis
    Atlas, Lumen, Chronos, ...
    
slide 32:
    Netflix Cloud Analysis Process
    Example path
    enumerated
    Atlas Alerts
    PICSOU
    Slack
    Cost
    Chat
    1. Check Issue
    Atlas/Lumen Dashboards
    2. Check Events
    Chronos
    Create
    New Alert
    Plus some other
    tools not pictured
    Redirected to
    a new Target
    3. Drill Down
    Atlas Metrics
    4. Check Dependencies
    5. Root Cause
    Instance Analysis
    Slalom
    Zipkin
    
slide 33:
    Atlas: Alerts
    Custom alerts on streams per second (SPS) changes, CPU usage, latency, ASG
    growth, client errors, …
    
slide 34:
slide 35:
    Winston: Automated Diagnostics & Remediation
    Links to Atlas
    Dashboards & Metrics
    Chronos: Possible Related Events
    
slide 36:
    Atlas: Dashboards
    
slide 37:
    Atlas: Dashboards
    Netflix perf vitals dashboard
    1. RPS, CPU
    2. Volume
    3. Instances
    4. Scaling
    5. CPU/RPS
    6. Load avg
    7. Java heap
    8. ParNew
    9. Latency
    10. 99th tile
    
slide 38:
    Atlas & Lumen: Custom Dashboards
    Dashboards are a checklist methodology: what to show first,
    second, third...
    Starting point for issues
    1. Confirm and quantify issue
    2. Check historic trend
    3. Atlas metrics to drill down
    Lumen: more flexible dashboards
    eg, go/burger
    
slide 39:
    Atlas: Metrics
    
slide 40:
    Atlas: Metrics
    Region
    Application
    Metrics
    Presentation
    Interactive
    graph
    Summary
    statistics
    Time range
    
slide 41:
    Atlas: Metrics
    All metrics in one system
    System metrics: CPU usage, disk I/O, memory, …
    Application metrics: latency percentiles, errors, …
    Filters or breakdowns by region,
    application, ASG, metric, instance
    URL has session state: shareable
    
slide 42:
    Chronos: Change Tracking
    
slide 43:
    Chronos: Change Tracking
    Scope
    Time Range
    Event Log
    
slide 44:
    Slalom: Dependency Graphing
    
slide 45:
    Slalom: Dependency Graphing
    Dependency
    App
    Traffic Volume
    
slide 46:
    Zipkin UI: Dependency Tracing
    Dependency
    Latency
    
slide 47:
    PICSOU: AWS Usage
    Breakdowns
    Cost per hour
    Details (redacted)
    
slide 48:
    Slack: Chat
    Latency is high in us-east-1
    Sorry
    We just did a bad push
    
slide 49:
    Netflix Cloud Analysis Process
    Example path
    enumerated
    Atlas Alerts
    PICSOU
    Slack
    Cost
    Chat
    1. Check Issue
    Atlas/Lumen Dashboards
    2. Check Events
    Chronos
    Create
    New Alert
    Plus some other
    tools not pictured
    Redirected to
    a new Target
    3. Drill Down
    Atlas Metrics
    4. Check Dependencies
    5. Root Cause
    Instance Analysis
    Slalom
    Zipkin
    
slide 50:
    Generic Cloud Analysis Process
    Example path
    enumerated
    Alerts
    Usage Reports
    1. Check Issue
    Cost
    Custom Dashboards
    2. Check Events
    Change Tracking
    Create
    New Alert
    Plus other tools
    as needed
    Messaging
    Redirected to
    a new Target
    3. Drill Down
    Metric Analysis
    4. Check Dependencies
    5. Root Cause
    Instance Analysis
    Chat
    Dependency Analysis
    
slide 51:
    4. Instance Analysis
    1. Statistics
    2. Profiling
    3. Tracing
    4. Processor Analysis
    
slide 52:
slide 53:
slide 54:
    1. Statistics
    
slide 55:
    Linux Tools
    vmstat, pidstat, sar, etc, used mostly normally
    $ sar -n TCP,ETCP,DEV 1
    Linux 4.15.0-1027-aws (xxx)   12/03/2018   _x86_64_   (48 CPU)

    09:43:53 PM  IFACE  rxpck/s   txpck/s   rxkB/s    txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
    09:43:54 PM  eth0   33744.00  19361.43  28065.36  [...]

    09:43:53 PM  active/s  passive/s  iseg/s  oseg/s
    09:43:54 PM  [...]

    09:43:53 PM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
    09:43:54 PM  [...]
    […]
    Micro-benchmarking can be used to investigate hypervisor
    behavior that can’t be observed directly
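
    For example (a sketch; compare results across instance types rather than
    trusting absolute numbers):

        perf bench sched pipe                            # context-switch cost
        dd if=/dev/zero of=/dev/null bs=1M count=10000   # syscall + memory path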
    
slide 56:
    Exception: Containers
    Most Linux tools are still not container aware
    From the container, will show the full host
    We expose cgroup metrics in our cloud GUIs: Vector
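
    Those cgroup (v1) metrics can also be read directly; a sketch of the raw
    sources such GUIs can draw from:

        cat /sys/fs/cgroup/cpuacct/cpuacct.usage          # CPU time consumed
        cat /sys/fs/cgroup/cpu/cpu.stat                   # nr_throttled, throttled_time
        cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # memory used vs limit
        cat /sys/fs/cgroup/memory/memory.limit_in_bytes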
    
slide 57:
    Vector: Instance/Container Analysis
    
slide 58:
    2. Profiling
    
slide 59:
    Experience:
    “ZFS is eating my CPUs”
    
slide 60:
    CPU Mixed-Mode Flame Graph
    Application (truncated)
    38% kernel time (why?)
    
slide 61:
    Zoomed
    
slide 62:
    2014: Java Profiling
    Java Profilers
    System Profilers
    
slide 63:
    2018: Java Profiling
    Kernel
    Java
    JVM
    CPU Mixed-mode Flame Graph
    
slide 64:
    CPU Flame Graph
    
slide 65:
    CPU Flame Chart (same data)
    
slide 66:
    CPU Flame Graphs
    [diagram: example flame graph built from stacks of functions a() through i()]
    
slide 67:
    CPU Flame Graphs
    Y-axis: stack depth
      0 at bottom
      0 at top == icicle graph
    X-axis: alphabet
      Time == flame chart
    Top edge: who is running on CPU, and how much (width)
    Color: random
      Hues often used for language types
      Can be a dimension, eg, CPI
    Ancestry: [diagram: same example flame graph, functions a() through i()]
    
slide 68:
    Application Profiling
    Primary approach:
    CPU mixed-mode flame graphs (eg, via Linux perf)
    May need frame pointers (eg, Java -XX:+PreserveFramePointer)
    May need a symbol file (eg, Java perf-map-agent, Node.js --perf-basic-prof)
    Secondary:
    Application profiler (eg, via Lightweight Java Profiler)
    Application logs
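
    Putting that together for Java, the typical workflow is (a sketch; script
    locations vary by install):

        java -XX:+PreserveFramePointer ...    # run the JVM with frame pointers
        perf record -F 49 -a -g -- sleep 30   # sample all CPUs at 49 Hertz
        # dump JIT symbols with perf-map-agent, then render:
        perf script | ./stackcollapse-perf.pl | ./flamegraph.pl --color=java > cpu.svg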
    
slide 69:
    Vector: Push-button Flame Graphs
    
slide 70:
    Future: eBPF-based Profiling
    Linux 2.6+: perf record → perf.data → perf script → stackcollapse-perf.pl → flamegraph.pl
    Linux 4.9+: profile.py → flamegraph.pl
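
    As commands (a sketch; profile is the bcc tool, which aggregates stacks in
    kernel context and emits folded output with -f):

        # Linux 2.6+: sample, dump, post-process
        perf record -F 99 -a -g -- sleep 30
        perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg
        # Linux 4.9+: eBPF aggregation, folded output straight to flamegraph.pl
        /usr/share/bcc/tools/profile -af 30 | ./flamegraph.pl > cpu.svg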
    
slide 71:
    3. Tracing
    
slide 72:
slide 73:
    Core Linux Tracers
    ftrace     2.6.27+   Tracing views
    perf       2.6.31+   Official profiler & tracer
    eBPF       4.9+      Programmatic engine
      bcc       -        Complex tools
      bpftrace  -        Short scripts
    Plus other kernel tech: kprobes, uprobes
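
    One-liners at each layer (a sketch, using block I/O as the example; tool
    paths vary by install):

        /apps/perf-tools/bin/iosnoop             # ftrace, via perf-tools
        perf record -e block:block_rq_issue -a   # perf tracing
        /usr/share/bcc/tools/biosnoop            # eBPF via bcc
        bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'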
    
slide 74:
    Experience: Disk %Busy
    
slide 75:
    # iostat -x 1
    […]
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              [...]

    Device:  rrqm/s  wrqm/s  r/s     w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    xvda     [...]
    xvdb     [...]
    xvdj     0.00    0.00    139.00  0.00  1056.00  0.00   [...]     [...]     [...]  [...]    [...]    6.30   87.60
    […]
    
slide 76:
    # /apps/perf-tools/bin/iolatency 10
    Tracing block I/O. Output every 10 seconds. Ctrl-C to end.

      >=(ms) .. <(ms)   : I/O      |Distribution                          |
           0 -> 1       : 421      |######################################|
           1 -> 2       : 95       |#########                             |
           2 -> 4       : 48       |#####                                 |
           4 -> 8       : 108      |##########                            |
           8 -> 16      : 363      |#################################     |
          16 -> 32      : 66       |######                                |
          32 -> 64      : 3        |                                      |
          64 -> 128     : 7        |#                                     |
    
slide 77:
    # /apps/perf-tools/bin/iosnoop
    Tracing block I/O. Ctrl-C to end.
    COMM   PID    TYPE   DEV      BLOCK       BYTES  LATms
    java   30603  RM     202,144  1670768496  [...]  [...]
    cat    [...]  [...]  202,0    [...]       [...]  [...]
    cat    [...]  [...]  202,0    [...]       [...]  [...]
    cat    [...]  [...]  202,0    [...]       [...]  [...]
    java   30603  RM     202,144  620864512   [...]  [...]
    java   30603  RM     202,144  584767616   [...]  [...]
    java   30603  RM     202,144  601721984   [...]  [...]
    java   30603  RM     202,144  603721568   [...]  [...]
    java   30603  RM     202,144  61067936    [...]  [...]
    java   30603  RM     202,144  1678557024  [...]  [...]
    java   30603  RM     202,144  55299456    [...]  [...]
    java   30603  RM     202,144  1625084928  [...]  [...]
    java   30603  RM     202,144  618895408   [...]  [...]
    java   30603  RM     202,144  581318480   [...]  [...]
    java   30603  RM     202,144  1167348016  [...]  [...]
    java   30603  RM     202,144  51561280    [...]  [...]
    [...]
    
slide 78:
    # perf record -e block:block_rq_issue --filter 'rwbs ~ "*M*"' -g -a
    # perf report -n --stdio
    [...]
    # Overhead  Samples  Command  Shared Object      Symbol
    # ........  .......  .......  .................  ....................
        70.70%    [...]  java     [kernel.kallsyms]  [k] blk_peek_request
    --- blk_peek_request
    do_blkif_request
    __blk_run_queue
    queue_unplugged
    blk_flush_plug_list
    blk_finish_plug
    _xfs_buf_ioapply
    xfs_buf_iorequest
    |--88.84%-- _xfs_buf_read
    xfs_buf_read_map
    |--87.89%-- xfs_trans_read_buf_map
    |--97.96%-- xfs_imap_to_bp
    xfs_iread
    xfs_iget
    xfs_lookup
    xfs_vn_lookup
    lookup_real
    __lookup_hash
    lookup_slow
    path_lookupat
    filename_lookup
    user_path_at_empty
    user_path_at
    vfs_fstatat
    |--99.48%-- SYSC_newlstat
    sys_newlstat
    system_call_fastpath
    __lxstat64
    |Lsun/nio/fs/UnixNativeDispatcher;.lstat0
    0x7f8f963c847c
    
slide 79:
slide 80:
    # /usr/share/bcc/tools/biosnoop
    TIME(s)  COMM  PID    DISK  SECTOR  BYTES  LAT(ms)
    [...]    tar   [...]  xvda  [...]   [...]  [...]
    [...]    tar   [...]  xvda  [...]   [...]  [...]
    [...]    tar   [...]  xvda  [...]   [...]  [...]
    [...]
    
slide 81:
    eBPF
    
slide 82:
    eBPF: extended Berkeley Packet Filter
    User-Defined BPF Programs: SDN Configuration, DDoS Mitigation,
    Intrusion Detection, Container Security, Observability,
    Firewalls (bpfilter), Device Drivers
    Kernel: Runtime (verifier, BPF, BPF actions)
    Event Targets: sockets, kprobes, uprobes, tracepoints, perf_events
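
    The programmatic model in one line: attach a small program to an event and
    aggregate in kernel context. For example, counting syscalls by process name:

        bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'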
    
slide 83:
slide 84:
    bcc
    # /usr/share/bcc/tools/tcplife
    PID    COMM        LADDR      LPORT  RADDR            RPORT  TX_KB  RX_KB  MS
    2509   java        [...]      8078   100.82.130.159   [...]  [...]  0      5.44
    2509   java        [...]      8078   100.82.78.215    [...]  [...]  0      135.32
    2509   java        [...]      60778  100.82.207.252   [...]  [...]  13     15126.87
    2509   java        [...]      38884  100.82.208.178   [...]  [...]  0      15568.25
    2509   java        [...]      4243   127.0.0.1        [...]  [...]  0      0.61
    12030  upload-mes  127.0.0.1  34020  127.0.0.1        [...]  [...]  0      3.38
    12030  upload-mes  127.0.0.1  21196  127.0.0.1        [...]  [...]  0      12.61
    3964   mesos-slav  127.0.0.1  7101   127.0.0.1        [...]  [...]  0      12.64
    12021  upload-sys  127.0.0.1  34022  127.0.0.1        [...]  [...]  0      15.28
    2509   java        [...]      8078   127.0.0.1        [...]  [...]  372    15.31
    2235   dockerd     [...]      13730  100.82.136.233   [...]  [...]  4      18.50
    2235   dockerd     [...]      34314  100.82.64.53     [...]  [...]  8      56.73
    [...]
    
slide 85:
    bpftrace
    # biolatency.bt
    Attaching 3 probes...
    Tracing block device I/O... Hit Ctrl-C to end.
    @usecs:
    [256, 512)         2 |                                                    |
    [512, 1K)         10 |@                                                   |
    [1K, 2K)         426 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [2K, 4K)         230 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
    [4K, 8K)           9 |@                                                   |
    [8K, 16K)        128 |@@@@@@@@@@@@@@@                                     |
    [16K, 32K)        68 |@@@@@@@@                                            |
    [32K, 64K)         0 |                                                    |
    [64K, 128K)        0 |                                                    |
    [128K, 256K)      10 |@                                                   |
    
slide 86:
    bpftrace: biolatency.bt
    #!/usr/local/bin/bpftrace

    BEGIN
    {
        printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
    }

    kprobe:blk_account_io_start
    {
        @start[arg0] = nsecs;
    }

    kprobe:blk_account_io_completion
    /@start[arg0]/
    {
        @usecs = hist((nsecs - @start[arg0]) / 1000);
        delete(@start[arg0]);
    }
    
slide 87:
    Future: eBPF GUIs
    
slide 88:
    4. Processor Analysis
    
slide 89:
    What “90% CPU Utilization” might suggest:
    What it typically means on the Netflix cloud:
    
slide 90:
    PMCs
    Performance Monitoring Counters help you analyze stalls
    Some instances (eg. Xen-based m4.16xl) have the architectural set:
    
slide 91:
    Instructions Per Cycle (IPC)
    IPC >2.0: instruction bound ("good*")
    Low IPC: stall bound ("bad")
    * probably; exception: spin locks
    
slide 92:
    PMCs: EC2 Xen Hypervisor
    # perf stat -a -- sleep 30
    Performance counter stats for 'system wide':

            1,103,112      task-clock (msec)         #   64.034 CPUs utilized    (100.00%)
              189,173      context-switches          #    0.574 K/sec
                4,044      cpu-migrations            #    0.098 K/sec
                [...]      page-faults               #    0.002 K/sec
    2,057,164,531,949      cycles                    #    1.071 GHz              (100.00%)
      <not supported>      stalled-cycles-frontend
      <not supported>      stalled-cycles-backend
    1,357,979,592,699      instructions              #    0.66  insns per cycle  (75.01%)
      243,244,156,173      branches                  #  126.617 M/sec            (74.99%)
        4,391,259,112      branch-misses             #    1.81% of all branches  (75.00%)

         30.001112466 seconds time elapsed

    # ./pmcarch 1
    CYCLES  INSTRUCTIONS  IPC   BR_RETIRED  BR_MISPRED  BMR%  LLCREF     LLCMISS  LLC%
    [...]   [...]         0.66  4692322525  [...]       1.95  780435112  [...]    [...]
    [...]   [...]         0.65  5286747667  [...]       1.81  751335355  [...]    [...]
    [...]   [...]         0.70  4616980753  [...]       1.87  709841242  [...]    [...]
    [...]   [...]         0.69  5055959631  [...]       1.83  787333902  [...]    [...]
    [...]
    
slide 93:
    PMCs: EC2 Nitro Hypervisor
    Some instance types (large, Nitro-based) support most PMCs!
    Meltdown KPTI patch TLB miss analysis on a c5.9xl:
    nopti:
    # tlbstat -C0 1
    K_CYCLES
    K_INSTR
    [...]
    pti, nopcid:
    # tlbstat -C0 1
    K_CYCLES
    K_INSTR
    [...]
    IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC
    0.86 565
    0.86 950
    0.86 396
    K_ITLBCYC
    DTLB% ITLB%
    0.00 0.00
    0.00 0.00
    0.00 0.00
    IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC
    0.10 89709496
    0.10 88829158
    0.10 89683045
    0.10 79055465
    K_ITLBCYC
    DTLB% ITLB%
    27.40 22.63
    27.28 22.52
    27.29 22.55
    27.40 22.63
    worst case
    
slide 94:
    MSRs
    Model Specific Registers
    System config info, including current clock rate:
    # showboost
    Base CPU MHz : 2500
    Set CPU MHz  : 2500
    Turbo MHz(s) : 3100 3200 3300 3500
    Turbo Ratios : 124% 128% 132% 140%
    CPU 0 summary every 1 seconds...

    TIME      C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
    23:39:07  [...]    [...]    64%   [...]  [...]
    23:39:08  [...]    [...]    70%   [...]  [...]
    23:39:09  [...]    [...]    99%   [...]  [...]
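
    MSRs can also be read directly with msr-tools (the register shown, 0x10,
    is the TSC, picked as an example; addresses are model specific):

        modprobe msr       # expose /dev/cpu/*/msr
        rdmsr -p 0 0x10    # read the TSC MSR on CPU 0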
    
slide 95:
    Summary
    Take-aways
    
slide 96:
    Take Aways
    1. Get push-button CPU flame graphs: kernel & user
    2. Check out eBPF perf tools: bcc, bpftrace
    3. Measure IPC as well as CPU utilization using PMCs
    90% CPU busy:
    … really means:
    
slide 97:
    Observability
    Methodology
    Velocity
    
slide 98:
    Observability
    Statistics, Flame Graphs, eBPF Tracing, Cloud PMCs
    Methodology
    USE method, RED method, Drill-down Analysis, …
    Velocity
    Self-service GUIs: Vector, FlameScope, …
    
slide 99:
    Resources
    2014 talk From Clouds to Roots: http://www.slideshare.net/brendangregg/netflix-from-clouds-to-roots
    http://www.youtube.com/watch?v=H-E0MQTID0g
    Chaos: https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f https://principlesofchaos.org/
    Atlas: https://github.com/Netflix/Atlas
    Atlas: https://medium.com/netflix-techblog/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
    RED method: https://thenewstack.io/monitoring-microservices-red-method/
    USE method: https://queue.acm.org/detail.cfm?id=2413037
    Winston: https://medium.com/netflix-techblog/introducing-winston-event-driven-diagnostic-and-remediation-platform-46ce39aa81cc
    Lumen: https://medium.com/netflix-techblog/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c
    Flame graphs: http://www.brendangregg.com/flamegraphs.html
    Java flame graphs: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
    Vector: http://vectoross.io https://github.com/Netflix/Vector
    FlameScope: https://github.com/Netflix/FlameScope
    Tracing ponies: thanks Deirdré Straughan & General Zoi's Pony Creator
    ftrace: http://lwn.net/Articles/608497/ - usually already in your kernel
    perf: http://www.brendangregg.com/perf.html - perf is usually packaged in linux-tools-common
    tcplife: https://github.com/iovisor/bcc - often available as a bcc or bcc-tools package
    bpftrace: https://github.com/iovisor/bpftrace
    pmcarch: https://github.com/brendangregg/pmc-cloud-tools
    showboost: https://github.com/brendangregg/msr-cloud-tools - also try turbostat
    
slide 100:
    Netflix Tech Blog
    
slide 101:
    Thank you.
    Brendan Gregg
    @brendangregg