Surge 2014: Netflix, From Clouds to Roots

Video: http://www.youtube.com/watch?v=H-E0MQTID0g

From Clouds to Roots: root cause performance analysis at Netflix. Talk at Surge 2014 by Brendan Gregg.

Description: "At Netflix, high scale and fast deployment rule. The possibilities for failure are endless, and the environment excels at handling this, regularly tested and exercised by the simian army. But, when this environment automatically works around systemic issues that aren’t root-caused, they can grow over time. This talk describes the challenge of not just handling failures of scale on the Netflix cloud, but also new approaches and tools for quickly diagnosing their root cause in an ever changing environment."


PDF: Surge2014_CloudsToRoots.pdf

Keywords (from pdftotext):

slide 1:
    From Clouds to Roots
    Brendan Gregg
    Senior Performance Architect
    Performance Engineering Team
    bgregg@netflix.com, @brendangregg
    September, 2014
    
slide 2:
    Root Cause Analysis at Netflix
    (diagram labels:) Devices, Ribbon, Hystrix, Zuul, Service, Tomcat, Roots, JVM, Load,
    Instances (Linux), AZ 1, AZ 2, AZ 3, ASG 1, ASG 2, ELB, ASG Cluster, SG, Application, Netflix,
    Atlas, Chronos, Mogul, Vector, sar, *stat, stap, ftrace, rdmsr, …
    
slide 3:
    • Massive AWS EC2 Linux cloud
      – Tens of thousands of server instances
      – Autoscale by ~3k each day
      – CentOS and Ubuntu
    • FreeBSD for content delivery
      – Approx 33% of US Internet traffic at night
    • Performance is critical
      – Customer satisfaction: >50M subscribers
      – $$$ price/performance
      – Develop tools for cloud-wide analysis
    
slide 4:
    Brendan Gregg
    • Senior Performance Architect, Netflix
      – Linux and FreeBSD performance
      – Performance Engineering team (@coburnw)
    • Recent work:
      – Linux perf-tools, using ftrace & perf_events
      – Systems Performance, Prentice Hall
    • Previous work includes:
      – USE Method, flame graphs, latency & utilization heat maps,
        DTraceToolkit, iosnoop and others on OS X, ZFS L2ARC
    • Twitter: @brendangregg
      
    
slide 5:
    Last year at Surge…
    • I saw a great Netflix talk by Coburn Watson:
      https://www.youtube.com/watch?v=7-13wV3WO8Q
    • He's now my manager (and also still hiring!)
      
    
slide 6:
    Agenda
    • The Netflix Cloud
      – How it works: ASG clusters, Hystrix, monkeys
      – And how it may fail
    • Root Cause Performance Analysis
      – Why it's still needed
    • Cloud analysis
    • Instance analysis
      
    
slide 7:
    Terms
    AWS:  Amazon Web Services
    EC2:  AWS Elastic Compute 2 (cloud instances)
    S3:   AWS Simple Storage Service (object store)
    ELB:  AWS Elastic Load Balancers
    SQS:  AWS Simple Queue Service
    SES:  AWS Simple Email Service
    CDN:  Content Delivery Network
    OCA:  Netflix Open Connect Appliance (streaming CDN)
    QoS:  Quality of Service
    AMI:  Amazon Machine Image (instance image)
    ASG:  Auto Scaling Group
    AZ:   Availability Zone
    NIWS: Netflix Internal Web Service framework (Ribbon)
    MSR:  Model Specific Register (CPU info register)
    PMC:  Performance Monitoring Counter (CPU perf counter)
      
    
slide 8:
    The Netflix Cloud
    
slide 9:
    The Netflix Cloud
    • Tens of thousands of cloud instances on AWS EC2,
      with S3 and Cassandra for storage
    • Netflix is implemented by multiple logical services
    (diagram labels:) EC2, ELB, Cassandra, Applications (Services), S3, Elasticsearch,
    EVCache, SES, SQS
      
    
slide 10:
    Netflix Services
    • Open Connect Appliances used for content delivery
    (diagram labels:) Client Devices, Authentication, Web Site API, Streaming API, User Data,
    Personalization, Viewing Hist., DRM, QoS Logging, OCA CDN, CDN Steering, Encoding, …
      
    
slide 11:
    Freedom and Responsibility
    • Culture deck is true
      – http://www.slideshare.net/reed2001/culture-1798664 (9M views!)
    • Deployment freedom
      – Service teams choose their own tech & schedules
      – Purchase and use cloud instances without approvals
      – Netflix environment changes fast!
      
    
slide 12:
    Cloud Technologies
    • Numerous open source technologies are in use:
      – Linux, Java, Cassandra, Node.js, …
    • Netflix also open sources: netflix.github.io
      
    
slide 13:
    Cloud Instances
    • Base server instance image + customizations by service teams (BaseAMI). Typically:
      – Linux (CentOS or Ubuntu)
      – Optional Apache, memcached, non-Java apps (incl. Node.js)
      – Atlas monitoring, S3 log rotation, ftrace, perf, stap, custom perf tools
      – Java (JDK 7 or 8)
      – GC and thread dump logging
      – Tomcat
      – Application war files, base servlet, platform, hystrix, health check, metrics (Servo)
      
    
slide 14:
    Scalability and Reliability
    #  Problem                           Solution
    1  Load increases                    Auto scale with ASGs
    2  Poor performing code push         Rapid rollback with red/black ASG clusters
    3  Instance failure                  Hystrix timeouts and secondaries
    4  Zone/Region failure               Zuul to reroute traffic
    5  Overlooked and unhandled issues   Simian army
    6  Poor performance                  Atlas metrics, alerts, Chronos
      
    
slide 15:
    1. Auto Scaling Groups (ASG)
    • Instances automatically added or removed by a custom scaling policy
      – A broken policy could cause false scaling
    • Alerts & audits used to check scaling is sane
    (diagram labels:) CloudWatch, Servo, Cloud Configuration Management, Scaling Policy
    (loadavg, latency, …), Instance, Instance, Instance
      
    
slide 16:
    2. ASG Clusters
    • How code versions are really deployed
    • Traffic managed by Elastic Load Balancers (ELBs)
    • Fast rollback if issues are found
      – Might rollback undiagnosed issues
    • Canaries can also be used for testing (and automated)
    (diagram labels:) ASG Cluster, prod1, ELB, Canary, ASG-v010, ASG-v011, Instance ×6, …
      
    
slide 17:
    3. Hystrix
    • A library for latency and fault tolerance for dependency services
      – Fallbacks, degradation, fast fail and rapid recovery
      – Supports timeouts, load shedding, circuit breaker
      – Uses thread pools for dependency services
      – Realtime monitoring
    • Plus the Ribbon IPC library (NIWS), which adds even more fault tolerance
    (diagram labels:) Tomcat, Application, get A, Hystrix, >100ms, Dependency A1, Dependency A2
      
    
slide 18:
    4. Redundancy
    • All device traffic goes through the Zuul proxy:
      – dynamic routing, monitoring, resiliency, security
    • Availability Zone failure: run from 2 of 3 zones
    • Region failure: reroute traffic
    (diagram labels:) Monitoring, Zuul, AZ1, AZ2, AZ3
      
    
slide 19:
    5. Simian Army
    • Ensures cloud handles failures through regular testing
    • Monkeys:
      – Latency: artificial delays
      – Conformity: kills non-best-practices instances
      – Doctor: health checks
      – Janitor: unused instances
      – Security: checks violations
      – 10-18: geographic issues
      – Chaos Gorilla: AZ failure
    • We're hiring Chaos Engineers!
      
    
slide 20:
    6. Atlas, alerts, Chronos
    • Atlas: Cloud-wide monitoring tool
      – Millions of metrics, quick rollups, custom dashboards
    • Alerts: Custom, using Atlas metrics
      – In particular, error & timeout rates on client devices
    • Chronos: Change tracking
      – Used during incident investigations
      
    
slide 21:
    In Summary
    • Netflix is very good at automatically handling failure
      – Issues often lead to rapid instance growth (ASGs)
    • Good for customers
      – Fast workaround
    • Good for engineers
      – Fix later, 9-5
    #  Problem                           Solution
    1  Load increases                    ASGs
    2  Poor performing code push         ASG clusters
    3  Instance issue                    Hystrix
    4  Zone/Region issue                 Zuul
    5  Overlooked and unhandled issues   Monkeys
    6  Poor performance                  Atlas, alerts, Chronos
      
    
slide 22:
    Typical Netflix Stack
    Problems/solutions enumerated on the stack diagram:
    (diagram labels:) Devices, Zuul (4.), Ribbon, Hystrix (3.), Dependencies, Atlas (monitoring) (6.),
    Discovery, Service, Tomcat, Monkeys (5.), JVM, Load, Instances (Linux), AZ 1, AZ 2, AZ 3,
    ASG 1 (1.), ASG 2, ELB, ASG Cluster (2.), SG, Application, Netflix, …
      
    
slide 23:
    * Exceptions
    • Apache Web Server
    • Node.js
    • …
      
    
slide 24:
    Root Cause Performance Analysis
    
slide 25:
    Root Cause Performance Analysis
    • Conducted when:
      – Growth becomes a cost problem
      – More instances or roll backs don't work
        • Eg: dependency issue, networking, …
      – A fix is needed for forward progress
        • "But it's faster on Linux 2.6.21 m2.xlarge!"
        • Staying on older versions for an undiagnosed (and fixable) reason
          prevents gains from later improvements
      – To understand scalability factors
    • Identifies the origin of poor performance
      
    
slide 26:
    Root Cause Analysis Process
    • From cloud to instance:
    (diagram labels:) Netflix, Application, ASG Cluster, ASG 1, ASG 2, ELB, SG,
    AZ 1, AZ 2, AZ 3, Instances (Linux), Tomcat, Service, JVM, …
      
    
slide 27:
    Cloud Methodologies
    • Resource Analysis
      – Any resources exhausted? CPU, disk, network
    • Metric and event correlations
      – When things got bad, what else happened?
      – Correlate with distributed dependencies
    • Latency Drilldowns
      – Trace origin of high latency from request down through dependencies
    • USE Method
      – For every service, check: utilization, saturation, errors
      
    
slide 28:
    Instance Methodologies
    • Log Analysis
      – dmesg, GC, Apache, Tomcat, custom
    • USE Method
      – For every resource, check: utilization, saturation, errors
        (see the example checklist after this list)
    • Micro-benchmarking
      – Test and measure components in isolation
    • Drill-down analysis
      – Decompose request latency, repeat
    • And other system performance methodologies
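
    For example, a minimal first pass of the USE Method on a Linux instance can be done with the
    standard tools named above; the command selection below is an illustrative sketch, not from
    the original slides:

    # Utilization: CPUs, memory, disks, network interfaces
    mpstat -P ALL 1        # per-CPU utilization (user/sys/idle)
    free -m                # memory in use vs available
    iostat -xz 1           # per-disk utilization (%util) and service times
    sar -n DEV 1           # NIC throughput vs known line rate

    # Saturation: queued or waiting work
    vmstat 1               # run-queue length (r) and swapping (si/so)
    iostat -xz 1           # disk request queue length (avgqu-sz)
    sar -n EDEV 1          # NIC drops and overruns

    # Errors
    dmesg | tail           # recent kernel errors (OOM, I/O, NIC)
    netstat -s | grep -i retrans   # TCP retransmits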
      
    
slide 29:
    Bad Instances
    • Not all issues root caused
      – "bad instance" != root cause
    • Sometimes efficient to just kill "bad instances"
      – They could be a lone hardware issue, which could take days for you to analyze
    • But they could also be an early warning of a global issue.
      If you kill them, you don't know.
    (diagram labels:) Instance, Bad Instance
      
    
slide 30:
    Bad Instance Anti-Method
    1. Plot request latency per-instance
    2. Find the bad instance
    3. Terminate bad instance
    4. Someone else's problem now!
    (screenshot: 95th percentile latency per instance, Atlas Exploder; "Bad instance" → "Terminate!")
      
    
slide 31:
    Cloud Analysis
    
slide 32:
    Cloud Analysis
    • Cloud analysis tools made and used at Netflix include:
      Tool      Purpose
      Atlas     Metrics, dashboards, alerts
      Chronos   Change tracking
      Mogul     Metric correlation
      Salp      Dependency graphing
      ICE       Cloud usage dashboard
    • Monitor everything: you can't tune what you can't see
      
    
slide 33:
    Netflix Cloud Analysis Process
    (flow diagram, example path enumerated:)
    Atlas Alerts → 1. Check Issue (Atlas Dashboards) → 2. Check Events (Chronos) →
    3. Drill Down (Atlas Metrics) → 4. Check Dependencies (Mogul, Salp) →
    5. Root Cause (Instance Analysis)
    Also: ICE (Cost), Create New Alert, Redirected to a new Target
      
    
slide 34:
    Atlas: Alerts
    • Custom alerts based on the Atlas metrics
      – CPU usage, latency, instance count growth, …
    • Usually email or pager
      – Can also deactivate instances, terminate, reboot
    • Next step: check the dashboards
      
    
slide 35:
    Atlas: Dashboards
    
slide 36:
    Atlas: Dashboards
    (screenshot callouts:) Custom Graphs, Set Time, Breakdowns, Interactive,
    Click Graphs for More Metrics
      
    
slide 37:
    Atlas: Dashboards
    • Cloud wide and per-service (all custom)
    • Starting point for issue investigations
      1. Confirm and quantify issue
      2. Check historic trend
      3. Launch Atlas metrics view to drill down
    Cloud wide: streams per second (SPS) dashboard
      
    
slide 38:
    Atlas: Metrics
    
slide 39:
    Atlas: Metrics
    (screenshot callouts:) Region, App, Breakdowns, Interactive Graph, Metrics, Options,
    Summary Statistics
      
    
slide 40:
    Atlas: Metrics
    • All metrics in one system
    • System metrics:
      – CPU usage, disk I/O, memory, …
    • Application metrics:
      – latency percentiles, errors, …
    • Filters or breakdowns by region, application, ASG, metric, instance, …
      – Quickly narrow an investigation
    • URL contains session state: sharable
      
    
slide 41:
    Chronos: Change Tracking
      
    
slide 42:
    Chronos: Change Tracking
    (screenshot callouts:) Breakdown, Legend, Historic, Criticality, App, Event List
      
    
slide 43:
    Chronos: Change Tracking
    • Quickly filter uninteresting events
    • Performance issues often coincide with changes
    • The size and velocity of Netflix engineering makes Chronos crucial
      for communicating change
      
    
slide 44:
    Mogul: Correlations
    • Comparing performance with per-resource demand
      
    
slide 45:
    Mogul: Correlations
    • Comparing performance with per-resource demand
    (screenshot callouts:) Latency, Throughput, Correlation, App, Resource Demand
      
    
slide 46:
    Mogul: Correlations
    • Measures demand using Little's Law
      – D = R * X
        D = Demand (in seconds per second)
        R = Average Response Time
        X = Throughput
        (a worked example follows)
    • Discover unexpected problem dependencies
      – That aren't on the service dashboards
    • Mogul checks many other correlations
      – Weeds through thousands of application metrics, showing you the
        most related/interesting ones
      – (Scott/Martin should give a talk just on these)
    • Bearing in mind correlation is not causation
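
    As an illustration of D = R * X (the numbers here are invented for the sketch, not from the
    talk): if a dependency averages R = 20 ms per request and serves X = 300 requests/second, its
    demand is D = 0.020 × 300 = 6 seconds of work per second, i.e. roughly 6 threads or resources
    kept busy on average. A quick shell check:

    $ R_ms=20 X_rps=300
    $ awk -v r=$R_ms -v x=$X_rps 'BEGIN { printf "D = %.1f seconds per second\n", (r/1000)*x }'
    D = 6.0 seconds per second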
      
    
slide 47:
    Salp: Dependency Graphing
    • Dependency graphs based on live trace data
    • Interactive
    • See architectural issues
      
    
slide 48:
    Salp: Dependency Graphing
    (screenshot callouts:) App, Dependencies, Their Dependencies, …
    • Dependency graphs based on live trace data
    • Interactive
    • See architectural issues
      
    
slide 49:
    ICE: AWS Usage
      
    
slide 50:
    ICE: AWS Usage
    (screenshot callouts:) Cost per hour, Services
      
    
slide 51:
    ICE: AWS Usage
    • Cost per hour by AWS service, and Netflix application (service team)
      – Identify issues of slow growth
    • Directs engineering effort to reduce cost
      
    
slide 52:
    Netflix Cloud Analysis Process
    In summary… (same flow as before, plus some other tools not pictured)
    Atlas Alerts → 1. Check Issue (Atlas Dashboards) → 2. Check Events (Chronos) →
    3. Drill Down (Atlas Metrics) → 4. Check Dependencies (Mogul, Salp) →
    5. Root Cause (Instance Analysis)
    Also: ICE (Cost), Create New Alert, Redirected to a new Target
      
    
slide 53:
    Generic Cloud Analysis Process
    (flow diagram, example path enumerated:)
    Alerts → 1. Check Issue (Custom Dashboards) → 2. Check Events (Change Tracking) →
    3. Drill Down (Metric Analysis) → 4. Check Dependencies (Dependency Analysis) →
    5. Root Cause (Instance Analysis)
    Also: Usage Reports (Cost), Create New Alert, Redirected to a new Target
      
    
slide 54:
    Instance Analysis
    
slide 55:
    Instance Analysis
    Locate, quantify, and fix performance issues anywhere in the system
      
    
slide 56:
    Instance Tools
    • Linux
      – top, ps, pidstat, vmstat, iostat, mpstat, netstat, nicstat, sar, strace, tcpdump, ss, …
    • System Tracing
      – ftrace, perf_events, SystemTap
    • CPU Performance Counters
      – perf_events, rdmsr
    • Application Profiling
      – application logs, perf_events, Google Lightweight Java Profiler (LJP),
        Java Flight Recorder (JFR)
      
    
slide 57:
    Tools in an AWS EC2 Linux Instance
      
    
slide 58:
    Linux Performance Analysis
    • vmstat, pidstat, sar, etc, used mostly normally

    $ sar -n TCP,ETCP,DEV 1
    Linux 3.2.55 (test-e4f1a80b)    08/18/2014      _x86_64_        (8 CPU)

    09:10:43 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
    09:10:44 PM      eth0   4114.00   4186.00   4537.46  28513.24      0.00      0.00       ...

    09:10:43 PM  active/s passive/s    iseg/s    oseg/s
    09:10:44 PM       ...       ...   4107.00  22511.00

    09:10:43 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
    09:10:44 PM       ...       ...       ...       ...      1.00
    [...]

    • Micro benchmarking can be used to investigate hypervisor behavior
      that can't be observed directly (see the sketch below)
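
    A rough sketch of the kind of micro-benchmark meant here (illustrative, not from the slides):
    run the same small, isolated operation on the instance types or kernels being compared, and
    attribute the difference to the hypervisor or platform:

    # scheduler/IPC overhead (context switches via a pipe); compare across instance types
    perf bench sched pipe

    # check which clocksource the guest is using; hypervisor choices here change timer costs
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource

    # crude per-syscall overhead proxy: one million tiny read+write syscalls
    time dd if=/dev/zero of=/dev/null bs=1 count=1000000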
      
    
slide 59:
    Instance Challenges
    • Application Profiling
      – For Java, Node.js
    • System Tracing
      – On Linux
    • Accessing CPU Performance Counters
      – From cloud guests
      
    
slide 60:
    Application Profiling
    • We've found many tools are inaccurate or broken
      – Eg, those based on java hprof
    • Stack profiling can be problematic:
      – Linux perf_events: frame pointer for the JVM is often missing (by hotspot),
        breaking stacks. Also needs perf-map-agent loaded for symbol translation.
      – DTrace: jstack() also broken by missing FPs
        https://bugs.openjdk.java.net/browse/JDK-6276264, 2005
    • Flame graphs are solving many performance issues.
      These need working stacks.
      
    
slide 61:
    Application Profiling: Java
    • Java Flight Recorder
      – CPU & memory profiling. Oracle. $$$
    • Google Lightweight Java Profiler
      – Basic, open source, free, asynchronous CPU profiler
      – Uses an agent that dumps hprof-like output
        (see the example pipeline after this list)
      • https://code.google.com/p/lightweight-java-profiler/wiki/GettingStarted
      • http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
    • Plus others at various times (YourKit, …)
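
    A sketch of the LJP-to-flame-graph pipeline described in the java-flame-graphs blog post above,
    assuming the agent path, the traces.txt output filename, and that your FlameGraph checkout
    includes stackcollapse-ljp.awk (all assumptions for this example; see the post for specifics):

    # run the JVM with the LJP agent attached (writes its sampled stacks on exit)
    java -agentpath:/path/to/liblagent.so -jar app.jar

    # fold the LJP output and render a CPU flame graph with the FlameGraph tools
    git clone https://github.com/brendangregg/FlameGraph
    ./FlameGraph/stackcollapse-ljp.awk < traces.txt | ./FlameGraph/flamegraph.pl > java-cpu.svg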
      
    
slide 62:
    LJP CPU Flame Graph (Java)
      
    
slide 63:
    LJP CPU Flame Graph (Java)
    (screenshot callouts:) Stack frame, Ancestry, Mouse-over frames to quantify
      
    
slide 64:
    Linux System Profiling
    • Previous profilers only show Java CPU time
    • We use perf_events (aka the "perf" command) to sample everything else:
      – JVM internals & libraries
      – The Linux kernel
      – Other apps, incl. Node.js
    • perf CPU Flame graphs:

    # git clone https://github.com/brendangregg/FlameGraph
    # cd FlameGraph
    # perf record -F 99 -ag -- sleep 60
    # perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    
slide 65:
    perf CPU Flame Graph
      
    
slide 66:
    perf CPU Flame Graph
    (screenshot callouts:) Kernel TCP/IP, Broken Java stacks (missing frame pointer),
    GC, Locks, Time, Idle thread, epoll
      
    
slide 67:
    Application Profiling: Node.js
    • Performance analysis on Linux a growing area
      – Eg, new postmortem tools from 2 weeks ago:
        https://github.com/tjfontaine/lldb-v8
    • Flame graphs are possible using Linux perf_events (perf)
      and v8 --perf_basic_prof (node v0.11.13+)
      – Although there is currently a map growth bug; see:
        http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html
      – (an example command sequence follows)
    • Also do heap analysis
      – node-heapdump
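
    A sketch of that --perf_basic_prof workflow (the app name, sample rate, and durations are
    placeholders; see the node-flame-graphs blog post above for the full procedure):

    # run node with the v8 perf map enabled (node v0.11.13+), so perf can resolve JIT symbols
    node --perf_basic_prof app.js &

    # sample on-CPU stacks for 30 seconds, then render a flame graph
    perf record -F 99 -p $(pgrep -n node) -g -- sleep 30
    perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > node-cpu.svg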
      
    
slide 68:
    Flame Graphs
    • CPU sample flame graphs solve many issues
      – We're automating their collection
      – If you aren't using them yet, you're missing out on low hanging fruit!
    • Other flame graph types useful as well
      – Disk I/O, network I/O, memory events, etc
      – Any profile that includes more stacks than can be quickly read
      
    
slide 69:
    Linux Tracing
    • ... now for something more challenging
      
    
slide 70:
    Linux Tracing
    • Too many choices, and many are still in-development:
      – ftrace
      – perf_events
      – eBPF
      – SystemTap
      – ktap
      – LTTng
      – dtrace4linux
      – sysdig
      
    
slide 71:
    Linux Tracing
    • A system tracer is needed to root cause many issues: kernel, library, app
      – (There's a pretty good book covering use cases)
    • DTrace is awesome, but the Linux ports are incomplete
    • Linux does have ftrace and perf_events in the kernel source,
      which – it turns out – can satisfy many needs already!
      
    
slide 72:
    Linux Tracing: ftrace
    • Added by Steven Rostedt and others since 2.6.27
    • Already enabled on our servers (3.2+)
      – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
      – Use directly via /sys/kernel/debug/tracing (see the example below)
    • Front-end tools to aid usage: perf-tools
      – https://github.com/brendangregg/perf-tools
      – Unsupported hacks: see WARNINGs
      – Also see the trace-cmd front-end, as well as perf
    • lwn.net: "Ftrace: The Hidden Light Switch"
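
    A minimal sketch of driving the /sys/kernel/debug/tracing interface by hand (the traced
    function is an arbitrary example; run as root):

    cd /sys/kernel/debug/tracing
    echo do_sys_open > set_ftrace_filter     # trace only this kernel function
    echo function > current_tracer           # enable the function tracer
    cat trace_pipe                           # watch events live; Ctrl-C to stop
    echo nop > current_tracer                # disable tracing when done
    echo > set_ftrace_filter                 # clear the filter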
      
    
slide 73:
    perf-tools: iosnoop
    • Block I/O (disk) events with latency:

    # ./iosnoop -ts
    Tracing block I/O. Ctrl-C to end.
    STARTs          ENDs            COMM         PID   TYPE DEV    BLOCK       BYTES  LATms
    5982800.302061  5982800.302679  supervise    ...   ...  202,1  ...         ...     0.62
    5982800.302423  5982800.302842  supervise    ...   ...  202,1  ...         ...     0.42
    5982800.304962  5982800.305446  supervise    ...   ...  202,1  ...         ...     0.48
    5982800.305250  5982800.305676  supervise    ...   ...  202,1  ...         ...     0.43
    [...]

    # ./iosnoop -h
    USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
       -d device    # device string (eg, "202,1)
       -i iotype    # match type (eg, '*R*' for all reads)
       -n name      # process name to match on I/O issue
       -p PID       # PID to match on I/O issue
       -Q           # include queueing time in LATms
       -s           # include start time of I/O (s)
       -t           # include completion time of I/O (s)
       -h           # this usage message
       duration     # duration seconds, and use buffers
    [...]
    
slide 74:
    perf-tools: iolatency
    • Block I/O (disk) latency distributions:

    # ./iolatency
    Tracing block I/O. Output every 1 seconds. Ctrl-C to end.

      >=(ms) .. <(ms)   : I/O      |Distribution                          |
           0 -> 1       : 2104     |######################################|
           1 -> 2       : 280      |######                                |
           2 -> 4       : 2        |                                      |
           4 -> 8       : 0        |                                      |
           8 -> 16      : 202      |####                                  |

      >=(ms) .. <(ms)   : I/O      |Distribution                          |
           0 -> 1       : 1144     |######################################|
           1 -> 2       : 267      |#########                             |
           2 -> 4       : 10       |                                      |
           4 -> 8       : 5        |                                      |
           8 -> 16      : 248      |#########                             |
          16 -> 32      : 601      |####################                  |
          32 -> 64      : 117      |####                                  |
    [...]
    
slide 75:
    perf-tools: opensnoop
    • Trace open() syscalls showing filenames:

    # ./opensnoop -t
    Tracing open()s. Ctrl-C to end.
    TIMEs          COMM       PID    FD  FILE
    ...            postgres   ...   0x8  /proc/self/oom_adj
    ...            postgres   ...   0x5  global/pg_filenode.map
    ...            postgres   ...   0x5  global/pg_internal.init
    ...            postgres   ...   0x5  base/16384/PG_VERSION
    ...            postgres   ...   0x5  base/16384/pg_filenode.map
    ...            postgres   ...   0x5  base/16384/pg_internal.init
    ...            postgres   ...   0x5  base/16384/11725
    ...            svstat     ...   0x4  supervise/ok
    ...            svstat     ...   0x4  supervise/status
    ...            stat       ...   0x3  /etc/ld.so.cache
    ...            stat       ...   0x3  /lib/x86_64-linux-gnu/libselinux…
    ...            stat       ...   0x3  /lib/x86_64-linux-gnu/libc.so.6
    ...            stat       ...   0x3  /lib/x86_64-linux-gnu/libdl.so.2
    ...            stat       ...   0x3  /proc/filesystems
    ...            stat       ...   0x3  /etc/nsswitch.conf
    [...]
    
slide 76:
    perf-tools: funcgraph
    • Trace a graph of kernel code flow:

    # ./funcgraph -Htp 5363 vfs_read
    Tracing "vfs_read" for PID 5363... Ctrl-C to end.
    # tracer: function_graph
    #     TIME          CPU  DURATION    FUNCTION CALLS
     4346366.073832 |       |           |  vfs_read() {
     4346366.073834 |       |           |    rw_verify_area() {
     4346366.073834 |       |           |      security_file_permission() {
     4346366.073834 |       |           |        apparmor_file_permission() {
     4346366.073835 |       |  0.153 us |          common_file_perm();
     4346366.073836 |       |  0.947 us |        }
     4346366.073836 |       |  0.066 us |        __fsnotify_parent();
     4346366.073836 |       |  0.080 us |        fsnotify();
     4346366.073837 |       |  2.174 us |      }
     4346366.073837 |       |  2.656 us |    }
     4346366.073837 |       |           |    tty_read() {
     4346366.073837 |       |  0.060 us |      tty_paranoia_check();
    [...]
    
slide 77:
    perf-tools: kprobe
    • Dynamically trace a kernel function call or return, with variables,
      and in-kernel filtering:

    # ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'
    Tracing kprobe myopen. Ctrl-C to end.
     postgres-1172 [000] d... 6594028.787166: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
     postgres-1172 [001] d... 6594028.797410: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
     postgres-1172 [001] d... 6594028.797467: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
    ^C
    Ending tracing...

    • Add -s for stack traces; -p for PID filter in-kernel.
    • Quickly confirm kernel behavior; eg: did a tunable take effect?
      
    
slide 78:
    perf-tools (so far…)
      
    
slide 79:
    Heat Maps
    • ftrace or perf_events for tracing disk I/O and other latencies as a heat map
      (an example pipeline follows)
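
    One way to build such a latency heat map, sketched here with iosnoop and the trace2heatmap.pl
    tool from the HeatMap repo (the filenames and the --units flags are written from memory and
    should be checked against the repo; this is an illustration, not the talk's exact pipeline):

    # capture block I/O events with start/end timestamps and per-I/O latency
    ./perf-tools/iosnoop -ts > out.iosnoop

    # keep "completion-time latency" pairs and render an interactive SVG heat map
    git clone https://github.com/brendangregg/HeatMap
    awk '$1 ~ /^[0-9]/ { print $2, $NF }' out.iosnoop | \
        ./HeatMap/trace2heatmap.pl --unitstime=s --unitslabel=ms > diskio-heatmap.svg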
      
    
slide 80:
    Other Tracing Options
    • SystemTap
      – The most powerful of the system tracers
      – We'll use it as a last resort: deep custom tracing
      – I've historically had issues with panics and freezes
        • Still present in the latest version?
        • The Netflix fault tolerant architecture makes panics much less of a problem
          (that was the panic monkey)
    • Instance canaries with DTrace are possible too
      – OmniOS
      – FreeBSD
      
    
slide 81:
    Linux Tracing Future
    • ftrace + perf_events cover much, but not custom in-kernel aggregations
    • eBPF may provide this missing feature
      – eg, in-kernel latency heat map (showing bimodal):
      
    
slide 82:
    Linux Tracing Future
    • ftrace + perf_events cover much, but not custom in-kernel aggregations
    • eBPF may provide this missing feature
      – eg, in-kernel latency heat map (showing bimodal):
    (heat map callouts:) Time, Low latency: cache hits, High latency: device I/O
      
    
slide 83:
    CPU Performance Counters
    • … is this even possible from a cloud guest?
      
    
slide 84:
    CPU Performance Counters
    • Model Specific Registers (MSRs)
      – Basic details: timestamp clock, temperature, power
      – Some are available in EC2
    • Performance Monitoring Counters (PMCs)
      – Advanced details: cycles, stall cycles, cache misses, …
      – Not available in EC2 (by default)
    • Root cause CPU usage at the cycle level
      – Eg, higher CPU usage due to more memory stall cycles
      
    
slide 85:
    msr-cloud-tools
    • Uses the msr-tools package and rdmsr(1)
      – https://github.com/brendangregg/msr-cloud-tools

    ec2-guest# ./cputemp 1
    CPU1 CPU2 CPU3 CPU4
      61   61   60   59          <- CPU Temperature
      60   61   60   60
    [...]

    ec2-guest# ./showboost
    CPU MHz     : 2500
    Turbo MHz   : 2900 (10 active)
    Turbo Ratio : 116% (10 active)
    CPU 0 summary every 5 seconds...

    TIME       C0_MCYC      C0_ACYC   UTIL  RATIO   MHz
    06:11:35       ...          ...    51%   116%  2900   <- Real CPU MHz
    06:11:40       ...          ...    50%   115%  2899
    06:11:45       ...          ...    49%   115%  2899
    06:11:50       ...          ...    49%   116%  2900
    [...]
    
slide 86:
    MSRs: CPU Temperature
    • Useful to explain variation in turbo boost (if seen)
    • Temperature for a synthetic workload:
      
    
slide 87:
    MSRs: Intel Turbo Boost
    • Can dynamically increase CPU speed up to 30+%
    • This can mess up all performance comparisons
    • Clock speed can be observed from MSRs using:
      – IA32_MPERF: Bits 63:0 is TSC Frequency Clock Counter C0_MCNT (TSC relative)
      – IA32_APERF: Bits 63:0 is TSC Frequency Clock Counter C0_ACNT (actual clocks)
    • This is how msr-cloud-tools showboost works (a manual rdmsr sketch follows)
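
    The same APERF/MPERF ratio can be sampled by hand with rdmsr(1) from msr-tools; a rough
    sketch (MSR addresses 0xE7 = IA32_MPERF and 0xE8 = IA32_APERF; the delta arithmetic is
    illustrative, run as root):

    modprobe msr                        # expose /dev/cpu/*/msr to userspace
    m0=$(rdmsr -p 0 -d 0xE7); a0=$(rdmsr -p 0 -d 0xE8)
    sleep 5
    m1=$(rdmsr -p 0 -d 0xE7); a1=$(rdmsr -p 0 -d 0xE8)
    # APERF/MPERF delta ratio > 1.0 means turbo boost was active on CPU 0
    awk -v m=$((m1 - m0)) -v a=$((a1 - a0)) 'BEGIN { printf "ratio: %.2f\n", a / m }'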
      
    
slide 88:
    PMCs	
      
    
slide 89:
    PMCs
    • Needed for remaining low-level CPU analysis:
      – CPU stall cycles, and stall cycle breakdowns
      – L1, L2, L3 cache hit/miss ratio
      – Memory, CPU Interconnect, and bus I/O
    • Not enabled by default in EC2. Is possible, eg:

    # perf stat -e cycles,instructions,r0480,r01A2 -p `pgrep -n java` sleep 10

     Performance counter stats for process id '17190':

        71,208,028,133 cycles                   #  0.000 GHz              [100.00%]
        41,603,452,060 instructions             #  0.58 insns per cycle   [100.00%]
        23,489,032,742 r0480                                              [100.00%]   (ICACHE.IFETCH_STALL)
        20,241,290,520 r01A2                                                          (RESOURCE_STALLS.ANY)

          10.000894718 seconds time elapsed
    
slide 90:
    Using Advanced Perf Tools
    • Everyone doesn't need to learn these
    • Reality:
      – A. Your company has one or more people for advanced perf analysis (perf team). Ask them.
      – B. You are that person
      – C. You buy a product that does it. Ask them.
    • If you aren't the advanced perf engineer, you need to know what to ask for
      – Flame graphs, latency heat maps, ftrace, PMCs, etc…
    • At Netflix, we're building the (C) option: Vector
      
    
slide 91:
    Future Work: Vector
      
    
slide 92:
    Future Work: Vector
    (screenshot callouts:) Utilization, Saturation, Errors, Per device, Breakdowns
      
    
slide 93:
    Future Work: Vector
    • Real-time, per-second, instance metrics
    • On-demand CPU flame graphs, heat maps, ftrace metrics, and SystemTap metrics
    • Analyze from clouds to roots quickly, and from a web interface
    • Scalable: other teams can use it easily
    (diagram labels:) Atlas Alerts, ICE, Atlas Dashboards, Chronos, Atlas Metrics, Mogul, Salp, Vector
      
    
slide 94:
    In Summary
    • 1. Netflix architecture
      – Fault tolerance: ASGs, ASG clusters, Hystrix (dependency API),
        Zuul (proxy), Simian army (testing)
      – Reduces the severity and urgency of issues
    • 2. Cloud Analysis
      – Atlas (alerts/dashboards/metrics), Chronos (event tracking),
        Mogul & Salp (dependency analysis), ICE (AWS usage)
      – Quickly narrow focus from cloud to ASG to instance
    • 3. Instance Analysis
      – Linux tools (*stat, sar, …), perf_events, ftrace, perf-tools, rdmsr,
        msr-cloud-tools, Vector
      – Read logs, profile & trace all software, read CPU counters
      
    
slide 95:
    References & Links
    https://netflix.github.io/#repo
    http://techblog.netflix.com/2012/01/auto-scaling-in-amazon-cloud.html
    http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
    http://www.slideshare.net/benjchristensen/performance-and-fault-tolerance-for-the-netflix-api-qcon-sao-paulo
    http://www.slideshare.net/adrianco/netflix-nosql-search
    http://www.slideshare.net/ufried/resilience-with-hystrix
    https://github.com/Netflix/Hystrix, https://github.com/Netflix/Zuul
    http://techblog.netflix.com/2011/07/netflix-simian-army.html
    http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
    http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
    http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html
    Systems Performance: Enterprise and the Cloud, Prentice Hall, 2014
    http://sourceforge.net/projects/nicstat/
    perf-tools: https://github.com/brendangregg/perf-tools
    Ftrace: The Hidden Light Switch: http://lwn.net/Articles/608497/
    msr-cloud-tools: https://github.com/brendangregg/msr-cloud-tools
      
    
slide 96:
    Thanks
    Coburn Watson, Adrian Cockcroft
    Atlas: Insight Engineering (Roy Rapoport, etc.)
    Mogul: Performance Engineering (Scott Emmons, Martin Spier)
    Vector: Performance Engineering (Martin Spier, Amer Ather)
      
    
slide 97:
    Thanks
    • Questions?
    • http://techblog.netflix.com
    • http://slideshare.net/brendangregg
    • http://www.brendangregg.com
    • bgregg@netflix.com
    • @brendangregg