Surge 2014: Netflix, From Clouds to Roots
Video: http://www.youtube.com/watch?v=H-E0MQTID0g
From Clouds to Roots: root cause performance analysis at Netflix. Talk at Surge 2014 by Brendan Gregg.
Description: "At Netflix, high scale and fast deployment rule. The possibilities for failure are endless, and the environment excels at handling this, regularly tested and exercised by the simian army. But, when this environment automatically works around systemic issues that aren’t root-caused, they can grow over time. This talk describes the challenge of not just handling failures of scale on the Netflix cloud, but also new approaches and tools for quickly diagnosing their root cause in an ever changing environment."
PDF: Surge2014_CloudsToRoots.pdf
Keywords (from pdftotext):
slide 1:
From Clouds to Roots. Brendan Gregg, Senior Performance Architect, Performance Engineering Team. bgregg@netflix.com, @brendangregg. September, 2014
slide 2:
Root Cause Analysis at Netflix. [Stack diagram: Devices -> Zuul -> Ribbon/Hystrix -> Tomcat Service (Application, JVM) -> Instances (Linux) -> ASG Cluster (ASG 1, ASG 2, ...), AZ 1-3, ELB, SG; analysis tools from clouds to roots: Netflix Atlas, Chronos, Mogul, Vector, sar, *stat, stap, ftrace, rdmsr, ...]
slide 3:
• Massive AWS EC2 Linux cloud – Tens of thousands of server instances – Autoscale by ~3k each day – CentOS and Ubuntu • FreeBSD for content delivery – Approx 33% of US Internet traffic at night • Performance is critical – Customer satisfaction: >50M subscribers – $$$ price/performance – Develop tools for cloud-wide analysis
slide 4:
Brendan Gregg • Senior Performance Architect, Netflix – Linux and FreeBSD performance – Performance Engineering team (@coburnw) • Recent work: – Linux perf-tools, using ftrace & perf_events – Systems Performance, Prentice Hall • Previous work includes: – USE Method, flame graphs, latency & utilization heat maps, DTraceToolkit, iosnoop and others on OS X, ZFS L2ARC • Twitter @brendangregg
slide 5:
Last year at Surge... • I saw a great Netflix talk by Coburn Watson: https://www.youtube.com/watch?v=7-13wV3WO8Q • He's now my manager (and also still hiring!)
slide 6:
Agenda • The Netflix Cloud – How it works: ASG clusters, Hystrix, monkeys – And how it may fail • Root Cause Performance Analysis – Why it's still needed • Cloud analysis • Instance analysis
slide 7:
Terms: AWS: Amazon Web Services. EC2: AWS Elastic Compute 2 (cloud instances). S3: AWS Simple Storage Service (object store). ELB: AWS Elastic Load Balancers. SQS: AWS Simple Queue Service. SES: AWS Simple Email Service. CDN: Content Delivery Network. OCA: Netflix Open Connect Appliance (streaming CDN). QoS: Quality of Service. AMI: Amazon Machine Image (instance image). ASG: Auto Scaling Group. AZ: Availability Zone. NIWS: Netflix Internal Web Service framework (Ribbon). MSR: Model Specific Register (CPU info register). PMC: Performance Monitoring Counter (CPU perf counter).
slide 8:
The Netflix Cloud
slide 9:
The Netflix Cloud • Tens of thousands of cloud instances on AWS EC2, with S3 and Cassandra for storage • Netflix is implemented by multiple logical services. [Diagram: ELB -> EC2 Applications (Services) -> Cassandra, S3, Elasticsearch, EVCache, SES, SQS]
slide 10:
Netflix Services • Open Connect Appliances used for content delivery. [Diagram: Client Devices -> Authentication, Web Site, API, Streaming API, User Data, Personalization, Viewing Hist., DRM, QoS Logging, CDN Steering, Encoding; OCA CDN]
slide 11:
Freedom and Responsibility • Culture deck is true – http://www.slideshare.net/reed2001/culture-1798664 (9M views!) • Deployment freedom – Service teams choose their own tech & schedules – Purchase and use cloud instances without approvals – The Netflix environment changes fast!
slide 12:
Cloud Technologies • Numerous open source technologies are in use: – Linux, Java, Cassandra, Node.js, ... • Netflix also open sources: netflix.github.io
slide 13:
Cloud Instances • Base server instance image + customizations by service teams (BaseAMI). Typically: Linux (CentOS or Ubuntu); optional Apache, memcached, non-Java apps (incl. Node.js); Atlas monitoring, S3 log rotation, ftrace, perf, stap, custom perf tools; Java (JDK 7 or 8) with GC and thread dump logging; Tomcat with application war files, base servlet, platform, Hystrix, health check, metrics (Servo).
slide 14:
Scalability and Reliability
# | Problem                         | Solution
1 | Load increases                  | Auto scale with ASGs
2 | Poor performing code push       | Rapid rollback with red/black ASG clusters
3 | Instance failure                | Hystrix timeouts and secondaries
4 | Zone/Region failure             | Zuul to reroute traffic
5 | Overlooked and unhandled issues | Simian army
6 | Poor performance                | Atlas metrics, alerts, Chronos
slide 15:
1. Auto Scaling Groups (ASG) • Instances automatically added or removed by a custom scaling policy – A broken policy could cause false scaling • Alerts & audits used to check scaling is sane. [Diagram: CloudWatch/Servo metrics (loadavg, latency, ...) -> Scaling Policy -> Cloud Configuration Management -> Instances]
slide 16:
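The threshold logic behind such a scaling policy can be sketched in a few lines. This is a hypothetical illustration only (the thresholds, step sizing, and function name are invented here, not Netflix's actual policy, which runs via CloudWatch/Servo with cooldowns and audits):

```python
def scaling_decision(load_avg, latency_ms, instance_count,
                     load_high=4.0, latency_high=200.0, load_low=1.0,
                     min_instances=2, max_instances=1000):
    """Return an instance-count delta for one evaluation period.

    Hypothetical threshold policy: scale out by ~10% on high load or
    high latency, scale in by one instance on low load. Real policies
    also apply cooldowns, and are audited -- a broken policy causes
    false scaling.
    """
    if load_avg > load_high or latency_ms > latency_high:
        step = max(1, instance_count // 10)               # ~10% scale-out
        return min(step, max_instances - instance_count)  # respect ceiling
    if load_avg < load_low and instance_count > min_instances:
        return -1                                         # gentle scale-in
    return 0
```

A broken version of this function (say, an inverted comparison) would add instances indefinitely, which is why the slide stresses alerts and audits on scaling behavior.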
2. ASG Clusters • How code versions are really deployed • Traffic managed by Elastic Load Balancers (ELBs) • Fast rollback if issues are found – Might roll back undiagnosed issues • Canaries can also be used for testing (and automated). [Diagram: ELB -> ASG Cluster prod1 (ASG-v010, ASG-v011, Canary), each with instances]
slide 17:
3. Hystrix • A library for latency and fault tolerance for dependency services – Fallbacks, degradation, fast fail and rapid recovery – Supports timeouts, load shedding, circuit breaker – Uses thread pools for dependency services – Realtime monitoring • Plus the Ribbon IPC library (NIWS), which adds even more fault tolerance. [Diagram: Tomcat Application -> get A -> Hystrix (>100ms) -> Dependency A1, Dependency A2]
slide 18:
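Hystrix itself is a Java library; as a rough illustration of the fallback/fast-fail/circuit-breaker idea it implements, here is a minimal Python sketch (names and thresholds are invented for this example; real Hystrix adds thread pools, load shedding, timeouts, and realtime metrics):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after repeated dependency
    failures, skip the dependency entirely (fast fail to a fallback),
    then retry after a cooldown (rapid recovery)."""

    def __init__(self, failure_threshold=5, reset_timeout=10.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means circuit closed (requests flow)

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()       # fail fast: don't touch the dependency
            self.opened_at = None       # half-open: try the dependency again
            self.failures = 0
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # open the circuit
            return fallback()           # degrade gracefully
```

The key property for the talk's argument: requests still succeed (via fallbacks) while the underlying dependency issue remains, which is exactly how un-root-caused problems can hide and grow.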
4. Redundancy • All device traffic goes through the Zuul proxy: – dynamic routing, monitoring, resiliency, security • Availability Zone failure: run from 2 of 3 zones • Region failure: reroute traffic. [Diagram: Monitoring, Zuul -> AZ1, AZ2, AZ3]
slide 19:
5. Simian Army • Ensures cloud handles failures through regular testing • Monkeys: – Latency: artificial delays – Conformity: kills non-best-practices instances – Doctor: health checks – Janitor: unused instances – Security: checks violations – 10-18: geographic issues – Chaos Gorilla: AZ failure • We're hiring Chaos Engineers!
slide 20:
6. Atlas, alerts, Chronos • Atlas: Cloud-wide monitoring tool – Millions of metrics, quick rollups, custom dashboards • Alerts: Custom, using Atlas metrics – In particular, error & timeout rates on client devices • Chronos: Change tracking – Used during incident investigations
slide 21:
In Summary • Netflix is very good at automatically handling failure – Issues often lead to rapid instance growth (ASGs) • Good for customers – Fast workaround • Good for engineers – Fix later, 9-5. [Table recap: 1 Load increases -> ASGs; 2 Poor performing code push -> ASG clusters; 3 Instance issue -> Hystrix; 4 Zone/Region issue -> Zuul; 5 Overlooked and unhandled issues -> Monkeys; 6 Poor performance -> Atlas, alerts, Chronos]
slide 22:
Typical Netflix Stack • Problems/solutions enumerated on the stack diagram: Devices -> 4. Zuul -> Ribbon, 3. Hystrix -> 6. Service (Tomcat, JVM, Application) -> Instances (Linux) -> AZ 1-3 -> 1. ASG 1, ASG 2 -> ELB, 2. ASG Cluster, SG; 5. Monkeys; plus dependencies, Atlas (monitoring), Discovery, ...
slide 23:
* Exceptions • Apache Web Server • Node.js • ...
slide 24:
Root Cause Performance Analysis
slide 25:
Root Cause Performance Analysis • Conducted when: – Growth becomes a cost problem – More instances or roll backs don't work • Eg: dependency issue, networking, ... – A fix is needed for forward progress • "But it's faster on Linux 2.6.21 m2.xlarge!" • Staying on older versions for an undiagnosed (and fixable) reason prevents gains from later improvements – To understand scalability factors • Identifies the origin of poor performance
slide 26:
Root Cause Analysis Process • From cloud to instance: [Diagram: ELB -> Netflix Application -> ASG Cluster -> ASG 1, ASG 2 -> AZ 1-3 -> Instances (Linux) -> JVM -> Tomcat -> Service -> SG ...]
slide 27:
Cloud Methodologies • Resource Analysis – Any resources exhausted? CPU, disk, network • Metric and event correlations – When things got bad, what else happened? – Correlate with distributed dependencies • Latency Drilldowns – Trace origin of high latency from request down through dependencies • USE Method – For every service, check: utilization, saturation, errors
slide 28:
Instance Methodologies • Log Analysis – dmesg, GC, Apache, Tomcat, custom • USE Method – For every resource, check: utilization, saturation, errors • Micro-benchmarking – Test and measure components in isolation • Drill-down analysis – Decompose request latency, repeat • And other system performance methodologies
slide 29:
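The USE Method check is mechanical enough to sketch directly: for every resource, examine utilization, saturation, and errors. A minimal illustration (the metrics dict shape and the 90% utilization threshold are assumptions for this example):

```python
def use_check(resources, util_high=0.9):
    """USE Method sketch: flag each resource whose utilization,
    saturation, or error count looks suspect. The thresholds and
    input format are illustrative, not from any specific tool."""
    findings = []
    for name, m in resources.items():
        if m["utilization"] > util_high:
            findings.append((name, "utilization", m["utilization"]))
        if m["saturation"] > 0:
            findings.append((name, "saturation", m["saturation"]))
        if m["errors"] > 0:
            findings.append((name, "errors", m["errors"]))
    return findings
```

The same loop applies at the cloud level by iterating over services instead of hardware resources, as the previous slide notes.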
Bad Instances • Not all issues are root caused – "bad instance" != root cause • Sometimes efficient to just kill "bad instances" – They could be a lone hardware issue, which could take days for you to analyze • But they could also be an early warning of a global issue. If you kill them, you don't know.
slide 30:
Bad Instance Anti-Method: 1. Plot request latency per-instance 2. Find the bad instance 3. Terminate bad instance 4. Someone else's problem now! [Screenshot: 95th percentile latency per instance (Atlas Exploder); the bad instance is terminated]
slide 31:
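Steps 1-2 of the anti-method amount to outlier detection on per-instance latency. A toy sketch (the 3x-median cutoff and function name are invented heuristics for illustration):

```python
def find_bad_instances(p95_latency_ms, factor=3.0):
    """Flag instances whose 95th percentile latency is far above the
    fleet median (hypothetical cutoff). Note the anti-method's flaw:
    terminating these instances destroys the evidence needed to tell
    a lone hardware fault from the early signs of a global issue."""
    values = sorted(p95_latency_ms.values())
    median = values[len(values) // 2]
    return [i for i, v in p95_latency_ms.items() if v > factor * median]
```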
Cloud Analysis
slide 32:
Cloud Analysis • Cloud analysis tools made and used at Netflix include:
Tool    | Purpose
Atlas   | Metrics, dashboards, alerts
Chronos | Change tracking
Mogul   | Metric correlation
Salp    | Dependency graphing
ICE     | Cloud usage dashboard
• Monitor everything: you can't tune what you can't see
slide 33:
Netflix Cloud Analysis Process (example path enumerated): 1. Check Issue (Atlas Alerts -> Atlas Dashboards; ICE for cost) 2. Check Events (Chronos) 3. Drill Down (Atlas Metrics) 4. Check Dependencies (Mogul, Salp) 5. Root Cause -> Instance Analysis. Steps may also redirect to a new target or create a new alert.
slide 34:
Atlas: Alerts • Custom alerts based on the Atlas metrics – CPU usage, latency, instance count growth, ... • Usually email or pager – Can also deactivate instances, terminate, reboot • Next step: check the dashboards
slide 35:
Atlas: Dashboards
slide 36:
Atlas: Dashboards [Screenshot: custom graphs, set time, breakdowns, interactive; click graphs for more metrics]
slide 37:
Atlas: Dashboards • Cloud wide and per-service (all custom) • Starting point for issue investigations: 1. Confirm and quantify issue 2. Check historic trend 3. Launch Atlas metrics view to drill down. Cloud wide: streams per second (SPS) dashboard
slide 38:
Atlas: Metrics
slide 39:
Atlas: Metrics [Screenshot: region breakdowns, app, interactive graph, metrics options, summary statistics]
slide 40:
Atlas: Metrics • All metrics in one system • System metrics: – CPU usage, disk I/O, memory, ... • Application metrics: – latency percentiles, errors, ... • Filters or breakdowns by region, application, ASG, metric, instance, ... – Quickly narrow an investigation • URL contains session state: sharable
slide 41:
Chronos: Change Tracking
slide 42:
Chronos: Change Tracking [Screenshot: breakdown, legend, historic criticality, app, event list]
slide 43:
Chronos: Change Tracking • Quickly filter uninteresting events • Performance issues often coincide with changes • The size and velocity of Netflix engineering makes Chronos crucial for communicating change
slide 44:
Mogul: Correlations • Comparing performance with per-resource demand
slide 45:
Mogul: Correlations • Comparing performance with per-resource demand [Screenshot: latency, throughput, correlation, app, resource demand]
slide 46:
Mogul: Correlations • Measures demand using Little's Law – D = R * X, where D = Demand (in seconds per second), R = Average Response Time, X = Throughput • Discover unexpected problem dependencies – That aren't on the service dashboards • Mogul checks many other correlations – Weeds through thousands of application metrics, showing you the most related/interesting ones – (Scott/Martin should give a talk just on these) • Bearing in mind correlation is not causation
slide 47:
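Little's Law makes the demand computation itself trivial; a sketch of the calculation described above (function name is illustrative, not Mogul's API):

```python
def demand(avg_response_time_s, throughput_per_s):
    """Little's Law demand, D = R * X: seconds of dependency time
    consumed per second of wall clock."""
    return avg_response_time_s * throughput_per_s

# 2 ms average response time at 400 requests/sec means the dependency
# is doing 0.8 seconds of work per second on this service's behalf:
d = demand(0.002, 400)
```

Demand near 1.0 seconds-per-second per serving thread is a saturation signal, which is what makes this a useful correlation target against service latency.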
Salp: Dependency Graphing • Dependency graphs based on live trace data • Interactive • See architectural issues
slide 48:
Salp: Dependency Graphing [Screenshot: app -> dependencies -> their dependencies ...] • Dependency graphs based on live trace data • Interactive • See architectural issues
slide 49:
ICE: AWS Usage
slide 50:
ICE: AWS Usage [Screenshot: cost per hour, by service]
slide 51:
ICE: AWS Usage • Cost per hour by AWS service, and Netflix application (service team) – Identify issues of slow growth • Directs engineering effort to reduce cost
slide 52:
Netflix Cloud Analysis Process, in summary (example path enumerated; plus some other tools not pictured): 1. Check Issue (Atlas Alerts -> Atlas Dashboards; ICE for cost) 2. Check Events (Chronos) 3. Drill Down (Atlas Metrics) 4. Check Dependencies (Mogul, Salp) 5. Root Cause -> Instance Analysis. Steps may also redirect to a new target or create a new alert.
slide 53:
Generic Cloud Analysis Process (example path enumerated): 1. Check Issue (alerts -> custom dashboards; usage reports for cost) 2. Check Events (change tracking) 3. Drill Down (metric analysis) 4. Check Dependencies (dependency analysis) 5. Root Cause -> Instance Analysis. Steps may also redirect to a new target or create a new alert.
slide 54:
Instance Analysis
slide 55:
Instance Analysis: Locate, quantify, and fix performance issues anywhere in the system
slide 56:
Instance Tools • Linux – top, ps, pidstat, vmstat, iostat, mpstat, netstat, nicstat, sar, strace, tcpdump, ss, ... • System Tracing – ftrace, perf_events, SystemTap • CPU Performance Counters – perf_events, rdmsr • Application Profiling – application logs, perf_events, Google Lightweight Java Profiler (LJP), Java Flight Recorder (JFR)
slide 57:
Tools in an AWS EC2 Linux Instance
slide 58:
Linux Performance Analysis • vmstat, pidstat, sar, etc, used mostly normally:
$ sar -n TCP,ETCP,DEV 1
Linux 3.2.55 (test-e4f1a80b)  08/18/2014  _x86_64_  (8 CPU)
09:10:43 PM  IFACE  rxpck/s  txpck/s  rxkB/s   txkB/s    ...  rxmcst/s
09:10:44 PM  eth0   4114.00  4186.00  4537.46  28513.24  ...  0.00
09:10:43 PM  active/s  passive/s  iseg/s   oseg/s
09:10:44 PM  ...       ...        4107.00  22511.00
09:10:43 PM  atmptf/s  estres/s  retrans/s  isegerr/s  orsts/s
09:10:44 PM  ...       ...       1.00       ...        ...
[...]
• Micro-benchmarking can be used to investigate hypervisor behavior that can't be observed directly
slide 59:
Instance Challenges • Application Profiling – For Java, Node.js • System Tracing – On Linux • Accessing CPU Performance Counters – From cloud guests
slide 60:
Application Profiling • We've found many tools are inaccurate or broken – Eg, those based on java hprof • Stack profiling can be problematic: – Linux perf_events: the frame pointer for the JVM is often missing (omitted by hotspot), breaking stacks. Also needs perf-map-agent loaded for symbol translation. – DTrace: jstack() also broken by missing FPs: https://bugs.openjdk.java.net/browse/JDK-6276264, 2005 • Flame graphs are solving many performance issues. These need working stacks.
slide 61:
Application Profiling: Java • Java Flight Recorder – CPU & memory profiling. Oracle. $$$ • Google Lightweight Java Profiler – Basic, open source, free, asynchronous CPU profiler – Uses an agent that dumps hprof-like output • https://code.google.com/p/lightweight-java-profiler/wiki/GettingStarted • http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html • Plus others at various times (YourKit, ...)
slide 62:
LJP CPU Flame Graph (Java)
slide 63:
LJP CPU Flame Graph (Java) [Screenshot: stack frames show ancestry; mouse-over frames to quantify]
slide 64:
Linux System Profiling • Previous profilers only show Java CPU time • We use perf_events (aka the "perf" command) to sample everything else: – JVM internals & libraries – The Linux kernel – Other apps, incl. Node.js • perf CPU Flame graphs:
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# perf record -F 99 -ag -- sleep 60
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
slide 65:
perf CPU Flame Graph
slide 66:
perf CPU Flame Graph [Screenshot annotations: kernel TCP/IP; broken Java stacks (missing frame pointer); GC; locks; idle thread; epoll; time]
slide 67:
Application Profiling: Node.js • Performance analysis on Linux is a growing area – Eg, new postmortem tools from 2 weeks ago: https://github.com/tjfontaine/lldb-v8 • Flame graphs are possible using Linux perf_events (perf) and v8 --perf_basic_prof (node v0.11.13+) – Although there is currently a map growth bug; see: http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html • Also do heap analysis – node-heapdump
slide 68:
Flame Graphs • CPU sample flame graphs solve many issues – We're automating their collection – If you aren't using them yet, you're missing out on low hanging fruit! • Other flame graph types useful as well – Disk I/O, network I/O, memory events, etc – Any profile that includes more stacks than can be quickly read
slide 69:
Linux Tracing • ... now for something more challenging
slide 70:
Linux Tracing • Too many choices, and many are still in development: – ftrace – perf_events – eBPF – SystemTap – ktap – LTTng – dtrace4linux – sysdig
slide 71:
Linux Tracing • A system tracer is needed to root cause many issues: kernel, library, app – (There's a pretty good book covering use cases) • DTrace is awesome, but the Linux ports are incomplete • Linux does have ftrace and perf_events in the kernel source, which – it turns out – can satisfy many needs already!
slide 72:
Linux Tracing: ftrace • Added by Steven Rostedt and others since 2.6.27 • Already enabled on our servers (3.2+) – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, ... – Use directly via /sys/kernel/debug/tracing • Front-end tools to aid usage: perf-tools – https://github.com/brendangregg/perf-tools – Unsupported hacks: see WARNINGs – Also see the trace-cmd front-end, as well as perf • lwn.net: "Ftrace: The Hidden Light Switch"
slide 73:
perf-tools: iosnoop • Block I/O (disk) events with latency:
# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM       PID  TYPE  DEV    BLOCK  BYTES  LATms
5982800.302061  5982800.302679  supervise  ...  ...   202,1  ...    ...    0.62
5982800.302423  5982800.302842  supervise  ...  ...   202,1  ...    ...    0.42
5982800.304962  5982800.305446  supervise  ...  ...   202,1  ...    ...    0.48
5982800.305250  5982800.305676  supervise  ...  ...   202,1  ...    ...    0.43
[...]
# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
  -d device   # device string (eg, "202,1")
  -i iotype   # match type (eg, '*R*' for all reads)
  -n name     # process name to match on I/O issue
  -p PID      # PID to match on I/O issue
  -Q          # include queueing time in LATms
  -s          # include start time of I/O (s)
  -t          # include completion time of I/O (s)
  -h          # this usage message
  duration    # duration seconds, and use buffers
[...]
slide 74:
perf-tools: iolatency • Block I/O (disk) latency distributions:
# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
  >=(ms) .. <(ms)  : I/O   |Distribution                          |
       0 -> 1      : 2104  |######################################|
       1 -> 2      : 280   |######                                |
       2 -> 4      : 2     |                                      |
       4 -> 8      : 0     |                                      |
       8 -> 16     : 202   |####                                  |

  >=(ms) .. <(ms)  : I/O   |Distribution                          |
       0 -> 1      : 1144  |######################################|
       1 -> 2      : 267   |#########                             |
       2 -> 4      : 10    |                                      |
       4 -> 8      : 5     |                                      |
       8 -> 16     : 248   |#########                             |
      16 -> 32     : 601   |####################                  |
      32 -> 64     : 117   |####                                  |
[...]
slide 75:
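iolatency's power-of-two buckets are easy to reproduce. A sketch of the bucketing logic (in Python for illustration; the actual tool is a shell script over ftrace):

```python
def latency_histogram(latencies_ms):
    """Bucket latencies into power-of-two ranges, like iolatency's
    ">=(ms) .. <(ms)" output rows. Returns {(lo, hi): count}."""
    buckets = {}
    for ms in latencies_ms:
        lo, hi = 0, 1
        while ms >= hi:          # walk up: 0-1, 1-2, 2-4, 4-8, ...
            lo, hi = hi, hi * 2
        buckets[(lo, hi)] = buckets.get((lo, hi), 0) + 1
    return buckets
```

Distributions like this are what reveal the bimodal disk latency (fast cache hits vs slow device I/O) mentioned later in the talk, which averages alone hide.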
perf-tools: opensnoop • Trace open() syscalls showing filenames:
# ./opensnoop -t
Tracing open()s. Ctrl-C to end.
TIMEs  COMM      PID  FD   FILE
...    postgres  ...  0x8  /proc/self/oom_adj
...    postgres  ...  0x5  global/pg_filenode.map
...    postgres  ...  0x5  global/pg_internal.init
...    postgres  ...  0x5  base/16384/PG_VERSION
...    postgres  ...  0x5  base/16384/pg_filenode.map
...    postgres  ...  0x5  base/16384/pg_internal.init
...    postgres  ...  0x5  base/16384/11725
...    svstat    ...  0x4  supervise/ok
...    svstat    ...  0x4  supervise/status
...    stat      ...  0x3  /etc/ld.so.cache
...    stat      ...  0x3  /lib/x86_64-linux-gnu/libselinux...
...    stat      ...  0x3  /lib/x86_64-linux-gnu/libc.so.6
...    stat      ...  0x3  /lib/x86_64-linux-gnu/libdl.so.2
...    stat      ...  0x3  /proc/filesystems
...    stat      ...  0x3  /etc/nsswitch.conf
[...]
slide 76:
perf-tools: funcgraph • Trace a graph of kernel code flow:
# ./funcgraph -Htp 5363 vfs_read
Tracing "vfs_read" for PID 5363... Ctrl-C to end.
# tracer: function_graph
     TIME        CPU  DURATION   FUNCTION CALLS
4346366.073832   |    |          vfs_read() {
4346366.073834   |    |            rw_verify_area() {
4346366.073834   |    |              security_file_permission() {
4346366.073834   |    |                apparmor_file_permission() {
4346366.073835   |    0.153 us         common_file_perm();
4346366.073836   |    0.947 us       }
4346366.073836   |    0.066 us       __fsnotify_parent();
4346366.073836   |    0.080 us       fsnotify();
4346366.073837   |    2.174 us     }
4346366.073837   |    2.656 us   }
4346366.073837   |    |          tty_read() {
4346366.073837   |    0.060 us     tty_paranoia_check();
[...]
slide 77:
perf-tools: kprobe • Dynamically trace a kernel function call or return, with variables, and in-kernel filtering:
# ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'
Tracing kprobe myopen. Ctrl-C to end.
postgres-1172 [000] d... 6594028.787166: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797410: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797467: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
^C
Ending tracing...
• Add -s for stack traces; -p for PID filter in-kernel • Quickly confirm kernel behavior; eg: did a tunable take effect?
slide 78:
perf-tools (so far...)
slide 79:
Heat Maps • ftrace or perf_events for tracing disk I/O and other latencies as a heat map:
slide 80:
Other Tracing Options • SystemTap – The most powerful of the system tracers – We'll use it as a last resort: deep custom tracing – I've historically had issues with panics and freezes • Still present in the latest version? • The Netflix fault tolerant architecture makes panics much less of a problem (that was the panic monkey) • Instance canaries with DTrace are possible too – OmniOS – FreeBSD
slide 81:
Linux Tracing Future • ftrace + perf_events cover much, but not custom in-kernel aggregations • eBPF may provide this missing feature – eg, in-kernel latency heat map (showing bimodal):
slide 82:
Linux Tracing Future • ftrace + perf_events cover much, but not custom in-kernel aggregations • eBPF may provide this missing feature – eg, in-kernel latency heat map (showing bimodal): [heat map over time: low latency cache hits vs high latency device I/O]
slide 83:
CPU Performance Counters • ... is this even possible from a cloud guest?
slide 84:
CPU Performance Counters • Model Specific Registers (MSRs) – Basic details: timestamp clock, temperature, power – Some are available in EC2 • Performance Monitoring Counters (PMCs) – Advanced details: cycles, stall cycles, cache misses, ... – Not available in EC2 (by default) • Root cause CPU usage at the cycle level – Eg, higher CPU usage due to more memory stall cycles
slide 85:
msr-cloud-tools • Uses the msr-tools package and rdmsr(1) – https://github.com/brendangregg/msr-cloud-tools
ec2-guest# ./cputemp 1
CPU1 CPU2 CPU3 CPU4
  61   61   60   59     <- CPU temperature
  60   61   60   60
[...]
ec2-guest# ./showboost
CPU MHz     : 2500
Turbo MHz   : 2900 (10 active)
Turbo Ratio : 116% (10 active)
CPU 0 summary every 5 seconds...
TIME      C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
06:11:35  ...      ...      51%   116%   2900
06:11:40  ...      ...      50%   115%   2899
06:11:45  ...      ...      49%   115%   2899
06:11:50  ...      ...      49%   116%   2900
[...]
(the MHz column is the real CPU MHz)
slide 86:
MSRs: CPU Temperature • Useful to explain variation in turbo boost (if seen) • Temperature for a synthetic workload:
slide 87:
MSRs: Intel Turbo Boost • Can dynamically increase CPU speed up to 30+% • This can mess up all performance comparisons • Clock speed can be observed from MSRs: – IA32_MPERF: Bits 63:0 is TSC Frequency Clock Counter C0_MCNT (TSC relative) – IA32_APERF: Bits 63:0 is TSC Frequency Clock Counter C0_ACNT (actual clocks) • This is how msr-cloud-tools' showboost works
slide 88:
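Given two samples of those counters, the turbo ratio is just the APERF delta over the MPERF delta. A sketch of the arithmetic behind showboost (the base frequency and delta values here are illustrative; real deltas come from two spaced rdmsr(1) reads of IA32_MPERF and IA32_APERF):

```python
def turbo_ratio(mperf_delta, aperf_delta, base_mhz=2500):
    """Estimate real clock speed from MPERF/APERF deltas.

    MPERF ticks at the base (TSC) frequency and APERF at the actual
    frequency, so: actual MHz = base MHz * APERF_delta / MPERF_delta.
    base_mhz=2500 matches the instance type shown on the earlier
    slide; it is an input, not something this math derives.
    """
    ratio = aperf_delta / mperf_delta
    return ratio, base_mhz * ratio
```

A ratio of 1.16 on a 2500 MHz base reproduces the "Turbo Ratio: 116%, 2900 MHz" readings in the showboost output, and explains why benchmark comparisons that ignore turbo state can mislead.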
PMCs
slide 89:
PMCs • Needed for remaining low-level CPU analysis: – CPU stall cycles, and stall cycle breakdowns – L1, L2, L3 cache hit/miss ratio – Memory, CPU Interconnect, and bus I/O • Not enabled by default in EC2. Is possible, eg:
# perf stat -e cycles,instructions,r0480,r01A2 -p `pgrep -n java` sleep 10
Performance counter stats for process id '17190':
    71,208,028,133  cycles        #  0.000 GHz               [100.00%]
    41,603,452,060  instructions  #  0.58  insns per cycle   [100.00%]
    23,489,032,742  r0480         #  ICACHE.IFETCH_STALL     [100.00%]
    20,241,290,520  r01A2         #  RESOURCE_STALLS.ANY
    10.000894718 seconds time elapsed
slide 90:
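From raw counter values like these, the useful derived numbers are instructions per cycle and the stalled fraction of cycles. A sketch using the counts from the perf stat output (the helper function is illustrative, not part of perf):

```python
def cycle_analysis(cycles, instructions, stall_cycles):
    """Derive instructions-per-cycle and the fraction of cycles
    spent stalled, from raw PMC counts."""
    return instructions / cycles, stall_cycles / cycles

# Using the slide's counts, with RESOURCE_STALLS.ANY as the stall count:
ipc, stall_frac = cycle_analysis(71_208_028_133, 41_603_452_060,
                                 20_241_290_520)
# ipc reproduces perf's reported "0.58 insns per cycle"; stall_frac
# shows roughly what share of CPU time was stalls rather than work.
```

This is the cycle-level root causing the earlier slide mentions: two workloads with identical %CPU can differ hugely in IPC, and stall breakdowns say why.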
Using Advanced Perf Tools • Everyone doesn't need to learn these • Reality: – A. Your company has one or more people for advanced perf analysis (perf team). Ask them. – B. You are that person – C. You buy a product that does it. Ask them. • If you aren't the advanced perf engineer, you need to know what to ask for – Flame graphs, latency heat maps, ftrace, PMCs, etc... • At Netflix, we're building the (C) option: Vector
slide 91:
Future Work: Vector
slide 92:
Future Work: Vector [Screenshot: utilization, saturation, errors; per device; breakdowns]
slide 93:
Future Work: Vector • Real-time, per-second, instance metrics • On-demand CPU flame graphs, heat maps, ftrace metrics, and SystemTap metrics • Analyze from clouds to roots quickly, and from a web interface • Scalable: other teams can use it easily [Diagram: Atlas Alerts, ICE, Atlas Dashboards, Chronos, Atlas Metrics, Mogul, Salp -> Vector]
slide 94:
In Summary • 1. Netflix architecture – Fault tolerance: ASGs, ASG clusters, Hystrix (dependency API), Zuul (proxy), Simian army (testing) – Reduces the severity and urgency of issues • 2. Cloud Analysis – Atlas (alerts/dashboards/metrics), Chronos (event tracking), Mogul & Salp (dependency analysis), ICE (AWS usage) – Quickly narrow focus from cloud to ASG to instance • 3. Instance Analysis – Linux tools (*stat, sar, ...), perf_events, ftrace, perf-tools, rdmsr, msr-cloud-tools, Vector – Read logs, profile & trace all software, read CPU counters
slide 95:
References & Links
https://netflix.github.io/#repo
http://techblog.netflix.com/2012/01/auto-scaling-in-amazon-cloud.html
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
http://www.slideshare.net/benjchristensen/performance-and-fault-tolerance-for-the-netflix-api-qcon-sao-paulo
http://www.slideshare.net/adrianco/netflix-nosql-search
http://www.slideshare.net/ufried/resilience-with-hystrix
https://github.com/Netflix/Hystrix, https://github.com/Netflix/Zuul
http://techblog.netflix.com/2011/07/netflix-simian-army.html
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html
Systems Performance: Enterprise and the Cloud, Prentice Hall, 2014
http://sourceforge.net/projects/nicstat/
perf-tools: https://github.com/brendangregg/perf-tools
Ftrace: The Hidden Light Switch: http://lwn.net/Articles/608497/
msr-cloud-tools: https://github.com/brendangregg/msr-cloud-tools
slide 96:
Thanks • Coburn Watson, Adrian Cockcroft • Atlas: Insight Engineering (Roy Rapoport, etc.) • Mogul: Performance Engineering (Scott Emmons, Martin Spier) • Vector: Performance Engineering (Martin Spier, Amer Ather)
slide 97:
Thanks • Questions? • http://techblog.netflix.com • http://slideshare.net/brendangregg • http://www.brendangregg.com • bgregg@netflix.com • @brendangregg