Surge 2014: Netflix, From Clouds to Roots
Video: http://www.youtube.com/watch?v=H-E0MQTID0g
From Clouds to Roots: root cause performance analysis at Netflix. Talk at Surge 2014 by Brendan Gregg.
Description: "At Netflix, high scale and fast deployment rule. The possibilities for failure are endless, and the environment excels at handling this, regularly tested and exercised by the simian army. But, when this environment automatically works around systemic issues that aren’t root-caused, they can grow over time. This talk describes the challenge of not just handling failures of scale on the Netflix cloud, but also new approaches and tools for quickly diagnosing their root cause in an ever changing environment."
PDF: Surge2014_CloudsToRoots.pdf
Keywords (from pdftotext):
slide 1:
From Clouds to Roots. Brendan Gregg, Senior Performance Architect, Performance Engineering Team. bgregg@netflix.com, @brendangregg. September, 2014
slide 2:
Root Cause Analysis at Netflix. [Stack diagram: Devices -> Zuul -> Ribbon/Hystrix -> Tomcat Service (Application, JVM) -> Instances (Linux) -> ASG Cluster (ASG 1, ASG 2, ...), AZ 1-3, ELB, SG; analysis tools from clouds to roots: Netflix Atlas, Chronos, Mogul, Vector, sar, *stat, stap, ftrace, rdmsr, ...]
slide 3:
• Massive AWS EC2 Linux cloud – Tens of thousands of server instances – Autoscale by ~3k each day – CentOS and Ubuntu • FreeBSD for content delivery – Approx 33% of US Internet traffic at night • Performance is critical – Customer satisfaction: >50M subscribers – $$$ price/performance – Develop tools for cloud-wide analysis
slide 4:
Brendan Gregg • Senior Performance Architect, Netflix – Linux and FreeBSD performance – Performance Engineering team (@coburnw) • Recent work: – Linux perf-tools, using ftrace & perf_events – Systems Performance, Prentice Hall • Previous work includes: – USE Method, flame graphs, latency & utilization heat maps, DTraceToolkit, iosnoop and others on OS X, ZFS L2ARC • Twitter @brendangregg
slide 5:
Last year at Surge... • I saw a great Netflix talk by Coburn Watson: https://www.youtube.com/watch?v=7-13wV3WO8Q • He's now my manager (and also still hiring!)
slide 6:
Agenda • The Netflix Cloud – How it works: ASG clusters, Hystrix, monkeys – And how it may fail • Root Cause Performance Analysis – Why it's still needed • Cloud analysis • Instance analysis
slide 7:
Terms: AWS: Amazon Web Services. EC2: AWS Elastic Compute 2 (cloud instances). S3: AWS Simple Storage Service (object store). ELB: AWS Elastic Load Balancers. SQS: AWS Simple Queue Service. SES: AWS Simple Email Service. CDN: Content Delivery Network. OCA: Netflix Open Connect Appliance (streaming CDN). QoS: Quality of Service. AMI: Amazon Machine Image (instance image). ASG: Auto Scaling Group. AZ: Availability Zone. NIWS: Netflix Internal Web Service framework (Ribbon). MSR: Model Specific Register (CPU info register). PMC: Performance Monitoring Counter (CPU perf counter).
slide 8:
The Netflix Cloud
slide 9:
The Netflix Cloud • Tens of thousands of cloud instances on AWS EC2, with S3 and Cassandra for storage • Netflix is implemented by multiple logical services. [Diagram: ELB -> EC2 Applications (Services) -> Cassandra, S3, Elasticsearch, EVCache, SES, SQS]
slide 10:
Netflix Services • Open Connect Appliances used for content delivery. [Diagram: Client Devices -> Authentication, Web Site, API, Streaming API, User Data, Personalization, Viewing Hist., DRM, QoS Logging, CDN Steering, Encoding; OCA CDN]
slide 11:
Freedom and Responsibility • Culture deck is true – http://www.slideshare.net/reed2001/culture-1798664 (9M views!) • Deployment freedom – Service teams choose their own tech & schedules – Purchase and use cloud instances without approvals – The Netflix environment changes fast!
slide 12:
Cloud Technologies • Numerous open source technologies are in use: – Linux, Java, Cassandra, Node.js, ... • Netflix also open sources: netflix.github.io
slide 13:
Cloud Instances • Base server instance image + customizations by service teams (BaseAMI). Typically: Linux (CentOS or Ubuntu); optional Apache, memcached, non-Java apps (incl. Node.js); Atlas monitoring, S3 log rotation, ftrace, perf, stap, custom perf tools; Java (JDK 7 or 8) with GC and thread dump logging; Tomcat with application war files, base servlet, platform, Hystrix, health check, metrics (Servo).
slide 14:
Scalability and Reliability
# | Problem                         | Solution
1 | Load increases                  | Auto scale with ASGs
2 | Poor performing code push       | Rapid rollback with red/black ASG clusters
3 | Instance failure                | Hystrix timeouts and secondaries
4 | Zone/Region failure             | Zuul to reroute traffic
5 | Overlooked and unhandled issues | Simian army
6 | Poor performance                | Atlas metrics, alerts, Chronos
slide 15:
1. Auto Scaling Groups (ASG) • Instances automatically added or removed by a custom scaling policy – A broken policy could cause false scaling • Alerts & audits used to check scaling is sane. [Diagram: CloudWatch/Servo metrics (loadavg, latency, ...) -> Scaling Policy -> Cloud Configuration Management -> Instances]
slide 16:
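The threshold logic behind such a scaling policy can be sketched in a few lines. This is a hypothetical illustration only (the thresholds, step sizing, and function name are invented here, not Netflix's actual policy, which runs via CloudWatch/Servo with cooldowns and audits):

```python
def scaling_decision(load_avg, latency_ms, instance_count,
                     load_high=4.0, latency_high=200.0, load_low=1.0,
                     min_instances=2, max_instances=1000):
    """Return an instance-count delta for one evaluation period.

    Hypothetical threshold policy: scale out by ~10% on high load or
    high latency, scale in by one instance on low load. Real policies
    also apply cooldowns, and are audited -- a broken policy causes
    false scaling.
    """
    if load_avg > load_high or latency_ms > latency_high:
        step = max(1, instance_count // 10)               # ~10% scale-out
        return min(step, max_instances - instance_count)  # respect ceiling
    if load_avg < load_low and instance_count > min_instances:
        return -1                                         # gentle scale-in
    return 0
```

A broken version of this function (say, an inverted comparison) would add instances indefinitely, which is why the slide stresses alerts and audits on scaling behavior.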
2. ASG Clusters • How code versions are really deployed • Traffic managed by Elastic Load Balancers (ELBs) • Fast rollback if issues are found – Might roll back undiagnosed issues • Canaries can also be used for testing (and automated). [Diagram: ELB -> ASG Cluster prod1 (ASG-v010, ASG-v011, Canary), each with instances]
slide 17:
3. Hystrix • A library for latency and fault tolerance for dependency services – Fallbacks, degradation, fast fail and rapid recovery – Supports timeouts, load shedding, circuit breaker – Uses thread pools for dependency services – Realtime monitoring • Plus the Ribbon IPC library (NIWS), which adds even more fault tolerance. [Diagram: Tomcat Application -> get A -> Hystrix (>100ms) -> Dependency A1, Dependency A2]
slide 18:
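Hystrix itself is a Java library; as a rough illustration of the fallback/fast-fail/circuit-breaker idea it implements, here is a minimal Python sketch (names and thresholds are invented for this example; real Hystrix adds thread pools, load shedding, timeouts, and realtime metrics):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after repeated dependency
    failures, skip the dependency entirely (fast fail to a fallback),
    then retry after a cooldown (rapid recovery)."""

    def __init__(self, failure_threshold=5, reset_timeout=10.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means circuit closed (requests flow)

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()       # fail fast: don't touch the dependency
            self.opened_at = None       # half-open: try the dependency again
            self.failures = 0
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # open the circuit
            return fallback()           # degrade gracefully
```

The key property for the talk's argument: requests still succeed (via fallbacks) while the underlying dependency issue remains, which is exactly how un-root-caused problems can hide and grow.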
4. Redundancy • All device traffic goes through the Zuul proxy: – dynamic routing, monitoring, resiliency, security • Availability Zone failure: run from 2 of 3 zones • Region failure: reroute traffic. [Diagram: Monitoring, Zuul -> AZ1, AZ2, AZ3]
slide 19:
5. Simian Army • Ensures cloud handles failures through regular testing • Monkeys: – Latency: artificial delays – Conformity: kills non-best-practices instances – Doctor: health checks – Janitor: unused instances – Security: checks violations – 10-18: geographic issues – Chaos Gorilla: AZ failure • We're hiring Chaos Engineers!
slide 20:
6. Atlas, alerts, Chronos • Atlas: Cloud-wide monitoring tool – Millions of metrics, quick rollups, custom dashboards • Alerts: Custom, using Atlas metrics – In particular, error & timeout rates on client devices • Chronos: Change tracking – Used during incident investigations
slide 21:
In Summary • Netflix is very good at automatically handling failure – Issues often lead to rapid instance growth (ASGs) • Good for customers – Fast workaround • Good for engineers – Fix later, 9-5. [Table recap: 1 Load increases -> ASGs; 2 Poor performing code push -> ASG clusters; 3 Instance issue -> Hystrix; 4 Zone/Region issue -> Zuul; 5 Overlooked and unhandled issues -> Monkeys; 6 Poor performance -> Atlas, alerts, Chronos]
slide 22:
Typical Netflix Stack • Problems/solutions enumerated on the stack diagram: Devices -> 4. Zuul -> Ribbon, 3. Hystrix -> 6. Service (Tomcat, JVM, Application) -> Instances (Linux) -> AZ 1-3 -> 1. ASG 1, ASG 2 -> ELB, 2. ASG Cluster, SG; 5. Monkeys; plus dependencies, Atlas (monitoring), Discovery, ...
slide 23:
* Exceptions • Apache Web Server • Node.js • ...
slide 24:
Root Cause Performance Analysis
slide 25:
Root Cause Performance Analysis • Conducted when: – Growth becomes a cost problem – More instances or roll backs don't work • Eg: dependency issue, networking, ... – A fix is needed for forward progress • "But it's faster on Linux 2.6.21 m2.xlarge!" • Staying on older versions for an undiagnosed (and fixable) reason prevents gains from later improvements – To understand scalability factors • Identifies the origin of poor performance
slide 26:
Root Cause Analysis Process • From cloud to instance: [Diagram: ELB -> Netflix Application -> ASG Cluster -> ASG 1, ASG 2 -> AZ 1-3 -> Instances (Linux) -> JVM -> Tomcat -> Service -> SG ...]
slide 27:
Cloud Methodologies • Resource Analysis – Any resources exhausted? CPU, disk, network • Metric and event correlations – When things got bad, what else happened? – Correlate with distributed dependencies • Latency Drilldowns – Trace origin of high latency from request down through dependencies • USE Method – For every service, check: utilization, saturation, errors
slide 28:
Instance Methodologies • Log Analysis – dmesg, GC, Apache, Tomcat, custom • USE Method – For every resource, check: utilization, saturation, errors • Micro-benchmarking – Test and measure components in isolation • Drill-down analysis – Decompose request latency, repeat • And other system performance methodologies
slide 29:
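The USE Method check is mechanical enough to sketch directly: for every resource, examine utilization, saturation, and errors. A minimal illustration (the metrics dict shape and the 90% utilization threshold are assumptions for this example):

```python
def use_check(resources, util_high=0.9):
    """USE Method sketch: flag each resource whose utilization,
    saturation, or error count looks suspect. The thresholds and
    input format are illustrative, not from any specific tool."""
    findings = []
    for name, m in resources.items():
        if m["utilization"] > util_high:
            findings.append((name, "utilization", m["utilization"]))
        if m["saturation"] > 0:
            findings.append((name, "saturation", m["saturation"]))
        if m["errors"] > 0:
            findings.append((name, "errors", m["errors"]))
    return findings
```

The same loop applies at the cloud level by iterating over services instead of hardware resources, as the previous slide notes.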
Bad Instances • Not all issues are root caused – "bad instance" != root cause • Sometimes efficient to just kill "bad instances" – They could be a lone hardware issue, which could take days for you to analyze • But they could also be an early warning of a global issue. If you kill them, you don't know.
slide 30:
Bad Instance Anti-Method: 1. Plot request latency per-instance 2. Find the bad instance 3. Terminate bad instance 4. Someone else's problem now! [Screenshot: 95th percentile latency per instance (Atlas Exploder); the bad instance is terminated]
slide 31:
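Steps 1-2 of the anti-method amount to outlier detection on per-instance latency. A toy sketch (the 3x-median cutoff and function name are invented heuristics for illustration):

```python
def find_bad_instances(p95_latency_ms, factor=3.0):
    """Flag instances whose 95th percentile latency is far above the
    fleet median (hypothetical cutoff). Note the anti-method's flaw:
    terminating these instances destroys the evidence needed to tell
    a lone hardware fault from the early signs of a global issue."""
    values = sorted(p95_latency_ms.values())
    median = values[len(values) // 2]
    return [i for i, v in p95_latency_ms.items() if v > factor * median]
```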
Cloud Analysis
slide 32:
Cloud Analysis • Cloud analysis tools made and used at Netflix include:
Tool    | Purpose
Atlas   | Metrics, dashboards, alerts
Chronos | Change tracking
Mogul   | Metric correlation
Salp    | Dependency graphing
ICE     | Cloud usage dashboard
• Monitor everything: you can't tune what you can't see
slide 33:
Netflix Cloud Analysis Process (example path enumerated): 1. Check Issue (Atlas Alerts -> Atlas Dashboards; ICE for cost) 2. Check Events (Chronos) 3. Drill Down (Atlas Metrics) 4. Check Dependencies (Mogul, Salp) 5. Root Cause -> Instance Analysis. Steps may also redirect to a new target or create a new alert.
slide 34:
Atlas: Alerts • Custom alerts based on the Atlas metrics – CPU usage, latency, instance count growth, ... • Usually email or pager – Can also deactivate instances, terminate, reboot • Next step: check the dashboards
slide 35:
Atlas: Dashboards
slide 36:
Atlas: Dashboards [Screenshot: custom graphs, set time, breakdowns, interactive; click graphs for more metrics]
slide 37:
Atlas: Dashboards • Cloud wide and per-service (all custom) • Starting point for issue investigations: 1. Confirm and quantify issue 2. Check historic trend 3. Launch Atlas metrics view to drill down. Cloud wide: streams per second (SPS) dashboard
slide 38:
Atlas: Metrics
slide 39:
Atlas: Metrics [Screenshot: region breakdowns, app, interactive graph, metrics options, summary statistics]
slide 40:
Atlas: Metrics • All metrics in one system • System metrics: – CPU usage, disk I/O, memory, ... • Application metrics: – latency percentiles, errors, ... • Filters or breakdowns by region, application, ASG, metric, instance, ... – Quickly narrow an investigation • URL contains session state: sharable
slide 41:
Chronos: Change Tracking
slide 42:
Chronos: Change Tracking [Screenshot: breakdown, legend, historic criticality, app, event list]
slide 43:
Chronos: Change Tracking • Quickly filter uninteresting events • Performance issues often coincide with changes • The size and velocity of Netflix engineering makes Chronos crucial for communicating change
slide 44:
Mogul: Correlations • Comparing performance with per-resource demand
slide 45:
Mogul: Correlations • Comparing performance with per-resource demand [Screenshot: latency, throughput, correlation, app, resource demand]
slide 46:
Mogul: Correlations • Measures demand using Little's Law – D = R * X, where D = Demand (in seconds per second), R = Average Response Time, X = Throughput • Discover unexpected problem dependencies – That aren't on the service dashboards • Mogul checks many other correlations – Weeds through thousands of application metrics, showing you the most related/interesting ones – (Scott/Martin should give a talk just on these) • Bearing in mind correlation is not causation
slide 47:
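Little's Law makes the demand computation itself trivial; a sketch of the calculation described above (function name is illustrative, not Mogul's API):

```python
def demand(avg_response_time_s, throughput_per_s):
    """Little's Law demand, D = R * X: seconds of dependency time
    consumed per second of wall clock."""
    return avg_response_time_s * throughput_per_s

# 2 ms average response time at 400 requests/sec means the dependency
# is doing 0.8 seconds of work per second on this service's behalf:
d = demand(0.002, 400)
```

Demand near 1.0 seconds-per-second per serving thread is a saturation signal, which is what makes this a useful correlation target against service latency.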
Salp: Dependency Graphing • Dependency graphs based on live trace data • Interactive • See architectural issues
slide 48:
Salp: Dependency Graphing [Screenshot: app -> dependencies -> their dependencies ...] • Dependency graphs based on live trace data • Interactive • See architectural issues
slide 49:
ICE: AWS Usage
slide 50:
ICE: AWS Usage [Screenshot: cost per hour, by service]
slide 51:
ICE: AWS Usage • Cost per hour by AWS service, and Netflix application (service team) – Identify issues of slow growth • Directs engineering effort to reduce cost
slide 52:
Netflix Cloud Analysis Process, in summary (example path enumerated; plus some other tools not pictured): 1. Check Issue (Atlas Alerts -> Atlas Dashboards; ICE for cost) 2. Check Events (Chronos) 3. Drill Down (Atlas Metrics) 4. Check Dependencies (Mogul, Salp) 5. Root Cause -> Instance Analysis. Steps may also redirect to a new target or create a new alert.
slide 53:
Generic Cloud Analysis Process (example path enumerated): 1. Check Issue (alerts -> custom dashboards; usage reports for cost) 2. Check Events (change tracking) 3. Drill Down (metric analysis) 4. Check Dependencies (dependency analysis) 5. Root Cause -> Instance Analysis. Steps may also redirect to a new target or create a new alert.
slide 54:
Instance Analysis
slide 55:
Instance Analysis: Locate, quantify, and fix performance issues anywhere in the system
slide 56:
Instance Tools • Linux – top, ps, pidstat, vmstat, iostat, mpstat, netstat, nicstat, sar, strace, tcpdump, ss, ... • System Tracing – ftrace, perf_events, SystemTap • CPU Performance Counters – perf_events, rdmsr • Application Profiling – application logs, perf_events, Google Lightweight Java Profiler (LJP), Java Flight Recorder (JFR)
slide 57:
Tools in an AWS EC2 Linux Instance
slide 58:
Linux Performance Analysis • vmstat, pidstat, sar, etc, used mostly normally:
$ sar -n TCP,ETCP,DEV 1
Linux 3.2.55 (test-e4f1a80b)  08/18/2014  _x86_64_  (8 CPU)
09:10:43 PM  IFACE  rxpck/s  txpck/s  rxkB/s   txkB/s    ...  rxmcst/s
09:10:44 PM  eth0   4114.00  4186.00  4537.46  28513.24  ...  0.00
09:10:43 PM  active/s  passive/s  iseg/s   oseg/s
09:10:44 PM  ...       ...        4107.00  22511.00
09:10:43 PM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
09:10:44 PM  ...       ...       1.00       ...        ...
[...]
• Micro-benchmarking can be used to investigate hypervisor behavior that can't be observed directly
slide 59:
Instance Challenges • Application Profiling – For Java, Node.js • System Tracing – On Linux • Accessing CPU Performance Counters – From cloud guests
slide 60:
Application Profiling • We've found many tools are inaccurate or broken – Eg, those based on java hprof • Stack profiling can be problematic: – Linux perf_events: the frame pointer for the JVM is often missing (omitted by hotspot), breaking stacks. Also needs perf-map-agent loaded for symbol translation. – DTrace: jstack() also broken by missing FPs: https://bugs.openjdk.java.net/browse/JDK-6276264, 2005 • Flame graphs are solving many performance issues. These need working stacks.
slide 61:
Application Profiling: Java • Java Flight Recorder – CPU & memory profiling. Oracle. $$$ • Google Lightweight Java Profiler – Basic, open source, free, asynchronous CPU profiler – Uses an agent that dumps hprof-like output • https://code.google.com/p/lightweight-java-profiler/wiki/GettingStarted • http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html • Plus others at various times (YourKit, ...)
slide 62:
LJP CPU Flame Graph (Java)
slide 63:
LJP CPU Flame Graph (Java) [Screenshot: stack frames show ancestry; mouse-over frames to quantify]
slide 64:
Linux System Profiling • Previous profilers only show Java CPU time • We use perf_events (aka the "perf" command) to sample everything else: – JVM internals & libraries – The Linux kernel – Other apps, incl. Node.js • perf CPU Flame graphs:
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# perf record -F 99 -ag -- sleep 60
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
slide 65:
perf CPU Flame Graph
slide 66:
perf CPU Flame Graph [Screenshot annotations: kernel TCP/IP; broken Java stacks (missing frame pointer); GC; locks; idle thread; epoll; time]
slide 67:
Application Profiling: Node.js • Performance analysis on Linux is a growing area – Eg, new postmortem tools from 2 weeks ago: https://github.com/tjfontaine/lldb-v8 • Flame graphs are possible using Linux perf_events (perf) and v8 --perf_basic_prof (node v0.11.13+) – Although there is currently a map growth bug; see: http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html • Also do heap analysis – node-heapdump
slide 68:
Flame Graphs • CPU sample flame graphs solve many issues – We're automating their collection – If you aren't using them yet, you're missing out on low hanging fruit! • Other flame graph types useful as well – Disk I/O, network I/O, memory events, etc – Any profile that includes more stacks than can be quickly read
slide 69:
Linux Tracing • ... now for something more challenging
slide 70:
Linux Tracing • Too many choices, and many are still in development: – ftrace – perf_events – eBPF – SystemTap – ktap – LTTng – dtrace4linux – sysdig
slide 71:
Linux Tracing • A system tracer is needed to root cause many issues: kernel, library, app – (There's a pretty good book covering use cases) • DTrace is awesome, but the Linux ports are incomplete • Linux does have ftrace and perf_events in the kernel source, which – it turns out – can satisfy many needs already!
slide 72:
Linux Tracing: ftrace • Added by Steven Rostedt and others since 2.6.27 • Already enabled on our servers (3.2+) – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, ... – Use directly via /sys/kernel/debug/tracing • Front-end tools to aid usage: perf-tools – https://github.com/brendangregg/perf-tools – Unsupported hacks: see WARNINGs – Also see the trace-cmd front-end, as well as perf • lwn.net: "Ftrace: The Hidden Light Switch"
slide 73:
perf-tools: iosnoop • Block I/O (disk) events with latency:
# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM       PID  TYPE  DEV    BLOCK  BYTES  LATms
5982800.302061  5982800.302679  supervise  ...  ...   202,1  ...    ...    0.62
5982800.302423  5982800.302842  supervise  ...  ...   202,1  ...    ...    0.42
5982800.304962  5982800.305446  supervise  ...  ...   202,1  ...    ...    0.48
5982800.305250  5982800.305676  supervise  ...  ...   202,1  ...    ...    0.43
[...]
# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
  -d device   # device string (eg, "202,1")
  -i iotype   # match type (eg, '*R*' for all reads)
  -n name     # process name to match on I/O issue
  -p PID      # PID to match on I/O issue
  -Q          # include queueing time in LATms
  -s          # include start time of I/O (s)
  -t          # include completion time of I/O (s)
  -h          # this usage message
  duration    # duration seconds, and use buffers
[...]
slide 74:
perf-tools: iolatency • Block I/O (disk) latency distributions:
# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
  >=(ms) .. <(ms)  : I/O   |Distribution                          |
       0 -> 1      : 2104  |######################################|
       1 -> 2      : 280   |######                                |
       2 -> 4      : 2     |                                      |
       4 -> 8      : 0     |                                      |
       8 -> 16     : 202   |####                                  |

  >=(ms) .. <(ms)  : I/O   |Distribution                          |
       0 -> 1      : 1144  |######################################|
       1 -> 2      : 267   |#########                             |
       2 -> 4      : 10    |                                      |
       4 -> 8      : 5     |                                      |
       8 -> 16     : 248   |#########                             |
      16 -> 32     : 601   |####################                  |
      32 -> 64     : 117   |####                                  |
[...]
slide 75:
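iolatency's power-of-two buckets are easy to reproduce. A sketch of the bucketing logic (in Python for illustration; the actual tool is a shell script over ftrace):

```python
def latency_histogram(latencies_ms):
    """Bucket latencies into power-of-two ranges, like iolatency's
    ">=(ms) .. <(ms)" output rows. Returns {(lo, hi): count}."""
    buckets = {}
    for ms in latencies_ms:
        lo, hi = 0, 1
        while ms >= hi:          # walk up: 0-1, 1-2, 2-4, 4-8, ...
            lo, hi = hi, hi * 2
        buckets[(lo, hi)] = buckets.get((lo, hi), 0) + 1
    return buckets
```

Distributions like this are what reveal the bimodal disk latency (fast cache hits vs slow device I/O) mentioned later in the talk, which averages alone hide.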
perf-tools: opensnoop • Trace open() syscalls showing filenames:
# ./opensnoop -t
Tracing open()s. Ctrl-C to end.
TIMEs  COMM      PID  FD   FILE
...    postgres  ...  0x8  /proc/self/oom_adj
...    postgres  ...  0x5  global/pg_filenode.map
...    postgres  ...  0x5  global/pg_internal.init
...    postgres  ...  0x5  base/16384/PG_VERSION
...    postgres  ...  0x5  base/16384/pg_filenode.map
...    postgres  ...  0x5  base/16384/pg_internal.init
...    postgres  ...  0x5  base/16384/11725
...    svstat    ...  0x4  supervise/ok
...    svstat    ...  0x4  supervise/status
...    stat      ...  0x3  /etc/ld.so.cache
...    stat      ...  0x3  /lib/x86_64-linux-gnu/libselinux...
...    stat      ...  0x3  /lib/x86_64-linux-gnu/libc.so.6
...    stat      ...  0x3  /lib/x86_64-linux-gnu/libdl.so.2
...    stat      ...  0x3  /proc/filesystems
...    stat      ...  0x3  /etc/nsswitch.conf
[...]
slide 76:
perf-tools: funcgraph • Trace a graph of kernel code flow:
# ./funcgraph -Htp 5363 vfs_read
Tracing "vfs_read" for PID 5363... Ctrl-C to end.
# tracer: function_graph
     TIME        CPU  DURATION   FUNCTION CALLS
4346366.073832   |    |          vfs_read() {
4346366.073834   |    |            rw_verify_area() {
4346366.073834   |    |              security_file_permission() {
4346366.073834   |    |                apparmor_file_permission() {
4346366.073835   |    0.153 us         common_file_perm();
4346366.073836   |    0.947 us       }
4346366.073836   |    0.066 us       __fsnotify_parent();
4346366.073836   |    0.080 us       fsnotify();
4346366.073837   |    2.174 us     }
4346366.073837   |    2.656 us   }
4346366.073837   |    |          tty_read() {
4346366.073837   |    0.060 us     tty_paranoia_check();
[...]
slide 77:
perf-tools: kprobe • Dynamically trace a kernel function call or return, with variables, and in-kernel filtering:
# ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'
Tracing kprobe myopen. Ctrl-C to end.
postgres-1172 [000] d... 6594028.787166: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797410: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797467: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
^C
Ending tracing...
• Add -s for stack traces; -p for PID filter in-kernel • Quickly confirm kernel behavior; eg: did a tunable take effect?
slide 78:
perf-tools (so far...)
slide 79:
Heat Maps • ftrace or perf_events for tracing disk I/O and other latencies as a heat map:
slide 80:
Other Tracing Options • SystemTap – The most powerful of the system tracers – We'll use it as a last resort: deep custom tracing – I've historically had issues with panics and freezes • Still present in the latest version? • The Netflix fault tolerant architecture makes panics much less of a problem (that was the panic monkey) • Instance canaries with DTrace are possible too – OmniOS – FreeBSD
slide 81:
Linux Tracing Future • ftrace + perf_events cover much, but not custom in-kernel aggregations • eBPF may provide this missing feature – eg, in-kernel latency heat map (showing bimodal):
slide 82:
Linux Tracing Future • ftrace + perf_events cover much, but not custom in-kernel aggregations • eBPF may provide this missing feature – eg, in-kernel latency heat map (showing bimodal): [heat map over time: low latency cache hits vs high latency device I/O]
slide 83:
CPU Performance Counters • ... is this even possible from a cloud guest?
slide 84:
CPU Performance Counters • Model Specific Registers (MSRs) – Basic details: timestamp clock, temperature, power – Some are available in EC2 • Performance Monitoring Counters (PMCs) – Advanced details: cycles, stall cycles, cache misses, ... – Not available in EC2 (by default) • Root cause CPU usage at the cycle level – Eg, higher CPU usage due to more memory stall cycles
slide 85:
msr-cloud-tools • Uses the msr-tools package and rdmsr(1) – https://github.com/brendangregg/msr-cloud-tools
ec2-guest# ./cputemp 1
CPU1 CPU2 CPU3 CPU4
  61   61   60   59     <- CPU temperature
  60   61   60   60
[...]
ec2-guest# ./showboost
CPU MHz     : 2500
Turbo MHz   : 2900 (10 active)
Turbo Ratio : 116% (10 active)
CPU 0 summary every 5 seconds...
TIME      C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
06:11:35  ...      ...      51%   116%   2900
06:11:40  ...      ...      50%   115%   2899
06:11:45  ...      ...      49%   115%   2899
06:11:50  ...      ...      49%   116%   2900
[...]
(the MHz column is the real CPU MHz)
slide 86:
MSRs: CPU Temperature • Useful to explain variation in turbo boost (if seen) • Temperature for a synthetic workload:
slide 87:
MSRs: Intel Turbo Boost • Can dynamically increase CPU speed up to 30+% • This can mess up all performance comparisons • Clock speed can be observed from MSRs: – IA32_MPERF: Bits 63:0 is TSC Frequency Clock Counter C0_MCNT (TSC relative) – IA32_APERF: Bits 63:0 is TSC Frequency Clock Counter C0_ACNT (actual clocks) • This is how msr-cloud-tools' showboost works
slide 88:
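Given two samples of those counters, the turbo ratio is just the APERF delta over the MPERF delta. A sketch of the arithmetic behind showboost (the base frequency and delta values here are illustrative; real deltas come from two spaced rdmsr(1) reads of IA32_MPERF and IA32_APERF):

```python
def turbo_ratio(mperf_delta, aperf_delta, base_mhz=2500):
    """Estimate real clock speed from MPERF/APERF deltas.

    MPERF ticks at the base (TSC) frequency and APERF at the actual
    frequency, so: actual MHz = base MHz * APERF_delta / MPERF_delta.
    base_mhz=2500 matches the instance type shown on the earlier
    slide; it is an input, not something this math derives.
    """
    ratio = aperf_delta / mperf_delta
    return ratio, base_mhz * ratio
```

A ratio of 1.16 on a 2500 MHz base reproduces the "Turbo Ratio: 116%, 2900 MHz" readings in the showboost output, and explains why benchmark comparisons that ignore turbo state can mislead.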
PMCs
slide 89:
PMCs • Needed for remaining low-level CPU analysis: – CPU stall cycles, and stall cycle breakdowns – L1, L2, L3 cache hit/miss ratio – Memory, CPU Interconnect, and bus I/O • Not enabled by default in EC2. Is possible, eg:
# perf stat -e cycles,instructions,r0480,r01A2 -p `pgrep -n java` sleep 10
Performance counter stats for process id '17190':
    71,208,028,133  cycles        #  0.000 GHz               [100.00%]
    41,603,452,060  instructions  #  0.58  insns per cycle   [100.00%]
    23,489,032,742  r0480         #  ICACHE.IFETCH_STALL     [100.00%]
    20,241,290,520  r01A2         #  RESOURCE_STALLS.ANY
    10.000894718 seconds time elapsed
slide 90:
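From raw counter values like these, the useful derived numbers are instructions per cycle and the stalled fraction of cycles. A sketch using the counts from the perf stat output (the helper function is illustrative, not part of perf):

```python
def cycle_analysis(cycles, instructions, stall_cycles):
    """Derive instructions-per-cycle and the fraction of cycles
    spent stalled, from raw PMC counts."""
    return instructions / cycles, stall_cycles / cycles

# Using the slide's counts, with RESOURCE_STALLS.ANY as the stall count:
ipc, stall_frac = cycle_analysis(71_208_028_133, 41_603_452_060,
                                 20_241_290_520)
# ipc reproduces perf's reported "0.58 insns per cycle"; stall_frac
# shows roughly what share of CPU time was stalls rather than work.
```

This is the cycle-level root causing the earlier slide mentions: two workloads with identical %CPU can differ hugely in IPC, and stall breakdowns say why.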
Using Advanced Perf Tools • Everyone doesn't need to learn these • Reality: – A. Your company has one or more people for advanced perf analysis (perf team). Ask them. – B. You are that person – C. You buy a product that does it. Ask them. • If you aren't the advanced perf engineer, you need to know what to ask for – Flame graphs, latency heat maps, ftrace, PMCs, etc... • At Netflix, we're building the (C) option: Vector
slide 91:
Future Work: Vector
slide 92:
Future Work: Vector [Screenshot: utilization, saturation, errors; per device; breakdowns]
slide 93:
Future Work: Vector • Real-time, per-second, instance metrics • On-demand CPU flame graphs, heat maps, ftrace metrics, and SystemTap metrics • Analyze from clouds to roots quickly, and from a web interface • Scalable: other teams can use it easily [Diagram: Atlas Alerts, ICE, Atlas Dashboards, Chronos, Atlas Metrics, Mogul, Salp -> Vector]
slide 94:
In Summary • 1. Netflix architecture – Fault tolerance: ASGs, ASG clusters, Hystrix (dependency API), Zuul (proxy), Simian army (testing) – Reduces the severity and urgency of issues • 2. Cloud Analysis – Atlas (alerts/dashboards/metrics), Chronos (event tracking), Mogul & Salp (dependency analysis), ICE (AWS usage) – Quickly narrow focus from cloud to ASG to instance • 3. Instance Analysis – Linux tools (*stat, sar, ...), perf_events, ftrace, perf-tools, rdmsr, msr-cloud-tools, Vector – Read logs, profile & trace all software, read CPU counters
slide 95:
References & Links
https://netflix.github.io/#repo
http://techblog.netflix.com/2012/01/auto-scaling-in-amazon-cloud.html
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
http://www.slideshare.net/benjchristensen/performance-and-fault-tolerance-for-the-netflix-api-qcon-sao-paulo
http://www.slideshare.net/adrianco/netflix-nosql-search
http://www.slideshare.net/ufried/resilience-with-hystrix
https://github.com/Netflix/Hystrix, https://github.com/Netflix/Zuul
http://techblog.netflix.com/2011/07/netflix-simian-army.html
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html
Systems Performance: Enterprise and the Cloud, Prentice Hall, 2014
http://sourceforge.net/projects/nicstat/
perf-tools: https://github.com/brendangregg/perf-tools
Ftrace: The Hidden Light Switch: http://lwn.net/Articles/608497/
msr-cloud-tools: https://github.com/brendangregg/msr-cloud-tools
slide 96:
Thanks • Coburn Watson, Adrian Cockcroft • Atlas: Insight Engineering (Roy Rapoport, etc.) • Mogul: Performance Engineering (Scott Emmons, Martin Spier) • Vector: Performance Engineering (Martin Spier, Amer Ather)
slide 97:
Thanks • Questions? • http://techblog.netflix.com • http://slideshare.net/brendangregg • http://www.brendangregg.com • bgregg@netflix.com • @brendangregg