Monitorama 2015: Netflix Instance Analysis
Video: https://vimeo.com/131484323
Monitorama 2015 talk by Brendan Gregg, Netflix.
Description: "With our large and ever-changing cloud environment, it can be vital to debug instance-level performance quickly. There are many instance monitoring solutions, but few come close to meeting our requirements, so we've been building our own and open sourcing them. In this talk, I will discuss our real-world requirements for instance-level analysis and monitoring: not just the metrics and features we desire, but the methodologies we'd like to apply. I will also cover the new and novel solutions we have been developing ourselves to meet these needs and desires, which include use of advanced Linux performance technologies (eg, ftrace, perf_events), and on-demand self-service analysis (Vector)."
PDF: Monitorama2015_NetflixInstanceAnalysis.pdf
Keywords (from pdftotext):
slide 1:
Jun 2015 Netflix Instance Performance Analysis Requirements Brendan Gregg Senior Performance Architect Performance Engineering Team bgregg@netflix.com @brendangregg
slide 2:
Monitoring companies are selling faster horses I want to buy a car
slide 3:
Server/Instance Analysis Potential In the last 10 years… More Linux More Linux metrics Better visualizations Containers Conditions ripe for innovation: where is our Henry Ford?
slide 4:
This Talk • Instance analysis: system resources, kernel, processes – For customers: what you can ask for – For vendors: our desirables & requirements – What we are building (and open sourcing) at Netflix to modernize instance performance analysis (Vector, …)
slide 5:
Over 60M subscribers FreeBSD CDN for content delivery Massive AWS EC2 Linux cloud Many monitoring/analysis tools Awesome place to work
slide 6:
Agenda Desirables Undesirables Requirements Methodologies Our Tools
slide 7:
1. Desirables
slide 8:
Line Graphs
slide 9:
Historical Data
slide 10:
Summary Statistics
slide 11:
Histograms … or a density plot
slide 12:
Heat Maps
slide 13:
slide 14:
Frequency Trails
slide 15:
Waterfall Charts
slide 16:
Directed Graphs
slide 17:
Flame Graphs
slide 18:
Flame Charts
slide 19:
Full System Coverage
slide 20:
… Without Running All These
slide 21:
Deep System Coverage
slide 22:
Other Desirables Safe for production use Easy to use: self service [Near] Real Time Ad hoc / custom instrumentation Complete documentation Graph labels and units Open source Community
slide 23:
2. Undesirables
slide 24:
Tachometers …especially with arbitrary color highlighting
slide 25:
Pie Charts usr sys wait idle …for real-time metrics
slide 26:
Doughnuts usr sys wait idle …like pie charts but worse
slide 27:
Traffic Lights RED == BAD (usually) GREEN == GOOD (hopefully) …when used for subjective metrics These can be used for objective metrics For subjective metrics (eg, IOPS/latency) try weather icons instead
slide 28:
3. Requirements
slide 29:
Acceptable T&Cs • Probably acceptable: XXX, Inc. shall have a royalty-free, worldwide, transferable, and perpetual license to use or incorporate into the Service any suggestions, ideas, enhancement requests, feedback, or other information provided by you or any Authorized User relating to the Service. • Probably not acceptable: By submitting any Ideas, Customer and Authorized Users agree that: ... (iii) all right, title and interest in and to the Ideas, including all associated IP Rights, shall be, and hereby are, assigned to [us] • Check with your legal team
slide 30:
Acceptable Technical Debt • It must be worth the … • Extra complexity when debugging • Time to explain to others • Production reliability risk • Security risk • There is no such thing as a free trial
slide 31:
Known Overhead • Overhead must be known to be managed – T&Cs should not prohibit its measurement or publication • Sources of overhead: – CPU cycles – File system I/O – Network I/O – Installed software size • We will measure it
slide 32:
Low Overhead • Overhead should also be the lowest possible – 1% CPU overhead means 1% more instances, and $$$ • Things we try to avoid – Tracing every function/method call – Needless kernel/user data transfers – strace (ptrace), tcpdump, libpcap, … • Event logging doesn't scale
slide 33:
Scalable • Can the product scale to (say) 100,000 instances? – Atlas, our cloud-wide analysis tool, can – We tend to kill other monitoring tools that attempt this • Real-time dashboards showing all instances: – How does that work? Can it scale to 1k? … 100k? – Adrian Cockcroft's spigo can simulate protocols at scale • High overhead might be worth it: on-demand only
slide 34:
Useful An instance analysis solution must provide actionable information that helps us improve performance
slide 35:
4. Methodologies
slide 36:
Methodologies Methodologies pose the questions for metrics to answer Good monitoring/analysis tools should support performance analysis methodologies
slide 37:
Drunk Man Anti-Method • Tune things at random until the problem goes away
slide 38:
Workload Characterization Study the workload applied: Who Why What How Workload Target
slide 39:
Workload Characterization Eg, for CPUs: Who: which PIDs, programs, users Why: code paths, context What: CPU instructions, cycles How: changing over time Workload Target
slide 40:
CPUs Who Why How What
slide 41:
CPUs Who: top, htop Why: perf record -g, flame graphs How: monitoring What: perf stat -a -d
slide 42:
Most Monitoring Products Today Who: top, htop Why: perf record -g, flame graphs How: monitoring What: perf stat -a -d
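A rough sketch of the commands named on these slides, showing how each quadrant of the CPU characterization can be answered (the 10-second durations are illustrative):
$ top                              # Who: which PIDs, users, and programs are on-CPU
$ perf stat -a -d -- sleep 10      # What: system-wide cycles, instructions, IPC over 10 seconds
$ perf record -a -g -- sleep 10    # Why: sample stacks system-wide to see code paths
$ perf report                      # (or render the samples as a flame graph)
The "How" (change over time) is what the monitoring product itself should provide.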
slide 43:
The USE Method • For every resource, check: Utilization Saturation Errors Resource Utilization (%) • Saturation is queue length or queued time • Start by drawing a functional (block) diagram of your system / software / environment
slide 44:
USE Method for Hardware Include busses & interconnects!
slide 45:
http://www.brendangregg.com/USEmethod/use-linux.html
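To make the checklist concrete, a minimal sketch of applying USE to one resource (CPUs) with stock Linux tools; these are generic examples, not necessarily the exact commands from the linked checklist:
$ mpstat -P ALL 1      # Utilization: per-CPU busy time, every second
$ vmstat 1             # Saturation: the "r" run-queue column vs the CPU count
$ dmesg | tail         # Errors: look for machine check / hardware error messages
The same three questions then repeat for memory, disks, network interfaces, and interconnects.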
slide 46:
Most Monitoring Products Today • Showing what is and is not commonly measured • Score: 8 out of 33 (24%) • We can do better… (U S E grid of resources)
slide 47:
Other Methodologies • There are many more: – Drill-Down Analysis Method – Time Division Method – Stack Profile Method – Off-CPU Analysis – … – I've covered these in previous talks & books
slide 48:
5. Our Tools Atlas
slide 49:
BaseAMI • Many sources for instance metrics & analysis – Atlas, Vector, sar, perf-tools (ftrace, perf_events), … • Currently not using 3rd party monitoring vendor tools Linux (usually Ubuntu) Optional Apache, memcached, Node.js, … Atlas, S3 log rotation, sar, ftrace, perf, stap, perf-tools Vector, pcp Java (JDK 7 or 8) GC and thread dump logging Tomcat Application war files, platform, base servlet hystrix, metrics (Servo), health checks
slide 50:
Netflix Atlas
slide 51:
Netflix Atlas Select Metrics Select Instance Historical Metrics
slide 52:
Netflix Vector
slide 53:
Netflix Vector Select Instance Select Metrics Flame Graphs Near real-time, per-second metrics
slide 54:
Java CPU Flame Graphs
slide 55:
Java CPU Flame Graphs Needs -XX:+PreserveFramePointer and perf-map-agent Kernel Java JVM
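A sketch of the usual workflow for these Java CPU flame graphs (the 99 Hertz rate, 30-second duration, and script locations are illustrative; perf-map-agent and the FlameGraph scripts are separate installs):
$ java -XX:+PreserveFramePointer ...        # run the JVM so perf can walk Java stacks
$ perf record -F 99 -a -g -- sleep 30       # sample on-CPU stacks system-wide for 30 seconds
# generate /tmp/perf-<pid>.map Java symbols using perf-map-agent, then:
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > java-cpu.svg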
slide 56:
sar • System Activity Reporter. Archive of metrics, eg:
$ sar -n DEV
Linux 3.13.0-49-generic (prod0141)   06/06/2015   _x86_64_   (16 CPU)
12:00:01 AM   IFACE   rxpck/s    txpck/s    rxkB/s     txkB/s     rxcmp/s  txcmp/s  rxmcst/s  %ifutil
12:05:01 AM   eth0    919.57     15706.14
12:05:01 AM   lo      23913.29   23913.29   17677.23   17677.23
12:15:01 AM   eth0    909.03     12481.74
12:15:01 AM   lo      23456.94   23456.94   14424.28   14424.28
12:25:01 AM   eth0    10372.37   1219.22    27788.19
12:25:01 AM   lo      25725.15   25725.15   29372.20   29372.20
12:35:01 AM   eth0    914.74     12773.97
12:35:01 AM   lo      23943.61   23943.61   14740.62   14740.62
[…]
• Metrics are also in Atlas and Vector • Linux sar is well designed: units, groups
slide 57:
sar Observability
slide 58:
perf-tools • Some front-ends to Linux ftrace & perf_events – Advanced, custom kernel observability when needed (rare) – https://github.com/brendangregg/perf-tools – Unsupported hacks: see WARNINGs • ftrace – First added to Linux 2.6.27 – A collection of capabilities, used via /sys/kernel/debug/tracing/ • perf_events – First added to Linux 2.6.31 – Tracer/profiler multi-tool, used via "perf" command
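For context on what these front-ends wrap, a minimal sketch of driving ftrace by hand (the block_rq_issue tracepoint is just an example; root access and a mounted debugfs are assumed):
# cd /sys/kernel/debug/tracing
# echo 1 > events/block/block_rq_issue/enable    # enable a block I/O tracepoint
# cat trace_pipe                                 # stream events as they occur (Ctrl-C to stop)
# echo 0 > events/block/block_rq_issue/enable    # disable when done
The perf-tools scripts automate this kind of setup and clean up after themselves.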
slide 59:
perf-tools: funccount • Eg, count a kernel function call rate:
# ./funccount -i 1 'bio_*'
Tracing "bio_*"... Ctrl-C to end.
FUNC                      COUNT
bio_attempt_back_merge       26
bio_get_nr_vecs             361
bio_alloc                   536
bio_alloc_bioset            536
bio_endio                   536
bio_free                    536
bio_fs_destructor           536
bio_init                    536
bio_integrity_enabled       536
bio_put                     729
bio_add_page               1004
[...]
Counts are in-kernel, for low overhead • Other perf-tools can then instrument these in more detail
slide 60:
perf-tools (so far…)
slide 61:
eBPF • Currently being integrated. Efficient (JIT) in-kernel maps. • Measure latency, heat maps, …
slide 62:
eBPF eBPF will make a profound difference to monitoring on Linux systems There will be an arms race to support it, post Linux 4.1+ If it's not on your roadmap, it should be
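As one hedged illustration of the in-kernel aggregation idea: the bcc front-end for eBPF (which matured after this talk) ships command-line tools such as funccount and funclatency; exact tool names and availability depend on the bcc version installed:
# funclatency do_sys_open      # latency histogram for a kernel function, aggregated in-kernel
Only the summarized histogram is copied to user space, avoiding the per-event logging overhead described earlier.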
slide 63:
Summary
slide 64:
Requirements Acceptable T&Cs Acceptable technical debt Known overhead Low overhead Scalable Useful
slide 65:
Methodologies Support for: • Workload Characterization • The USE Method • … Not starting with metrics in search of uses
slide 66:
Desirables
slide 67:
Instrument These With full eBPF support Linux has awesome instrumentation: use it!
slide 68:
Links & References
Netflix Vector
– https://github.com/netflix/vector
– http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
Netflix Atlas
– http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html
Heat Maps
– http://www.brendangregg.com/heatmaps.html
– http://www.brendangregg.com/HeatMaps/latency.html
Flame Graphs
– http://www.brendangregg.com/flamegraphs.html
– http://techblog.netflix.com/2014/11/nodejs-in-flames.html
Frequency Trails: http://www.brendangregg.com/frequencytrails.html
Methodology
– http://www.brendangregg.com/methodology.html
– http://www.brendangregg.com/USEmethod/use-linux.html
perf-tools: https://github.com/brendangregg/perf-tools
eBPF: http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html
Images:
– horse: Microsoft Powerpoint clip art
– gauge: https://github.com/thlorenz/d3-gauge
– eBPF ponycorn: Deirdré Straughan & General Zoi's Pony Creator
slide 69:
Jun 2015 Thanks Questions? http://techblog.netflix.com http://slideshare.net/brendangregg http://www.brendangregg.com bgregg@netflix.com @brendangregg