eBPF Summit 2023: Fast by Friday: Why eBPF is Essential
Keynote by Brendan Gregg for eBPF Summit 2023 (online). Video: https://www.youtube.com/watch?v=s1mobd8t_u0
Description: "It is not ok that we spend weeks, even months, trying to solve why software is slow. It should not take more than a week to identify the root cause or causes for a performance issue, such that any performance issue reported on a Monday should be solved by Friday, or sooner. This talk explores the dream of "fast by Friday," and shows how kernel technologies like eBPF, and performance methodologies, can get us there. The end goal is not more tools and metrics or having everyone learn eBPF bytecode. It's about efficient computing, and solving inefficiencies as quickly as possible to save cycles and carbon."
PDF: eBPFSummit2023_FastByFriday.pdf
Keywords (from pdftotext):
slide 1:
Fast by Friday
Why eBPF is Essential
Brendan Gregg
eBPF Summit 2023 Fast by Friday: Why eBPF is Essential
slide 2:
What would it take to solve any computer performance issue in 5 days?
slide 3:
Imagine solving the performance of anything
Operating systems, kernels, web browsers, phones, applications, websites, microservices, processors, AI, etc., …
Examples: Linux, Windows, Firefox, Google docs, Minecraft, Amazon.com, Intel GPUs, pytorch, etc., …
Websites should load in the blink of an eye.
slide 4:
A vision:
"Fast by Friday": Any computer performance issue reported on Monday should be solved by Friday (or sooner)
slide 5:
"Fast by Friday" is…
A vision
A way of thinking
A call to action
A methodology
A practical deadline
I want to completely understand the performance of everything…in 5 days
slide 6:
Why
slide 7:
Expected performance improvement for computing products
Figure: Product Performance: Hypothetical (axis: Performance)
slide 8:
Example reality
Figure: Product Performance: Actual (axis: Performance)
slide 9:
Example reality
Figure: Product Performance: Actual (axis: Performance)
Bottleneck not found in time
Not enough time to properly analyze performance of all new software/hardware options
New bottleneck not found in time
We, engineers, have to fix this!
slide 10:
Problem: Computers are getting increasingly complex
Just one example (computer hardware) of increasing complexity. Software is worse!
Performance issues can now go unsolved for weeks, months, years
Product decisions miss improvements as analysis and tuning takes too long
slide 11:
A common scenario at product vendors
Your product is probably the fastest
But there's likely some config/tunable error
It's the final week of the customer eval
You have to make it fast by friday
slide 12:
Why this matters
Timely performance analysis allows faster and more efficient software/hardware/tuning options to be adopted
Good for the environment: Less cycles, energy, carbon
Good for innovation: Rewards investment in engineering
Good for companies: Less compute expense
Good for end-users: Lower latency, cheaper products
slide 13:
How
slide 14:
Definitions
"Fast by Friday": Any computer performance issue reported on Monday should be solved by Friday (or sooner)
Issues: bottlenecks, evaluations, etc.
Solved by friday: root cause(s) known
slide 15:
"Fast by Friday": Proposed Agenda
Prior weeks: Preparation
Monday: Quantify, static tuning, load
Tuesday: Checklists, elimination
Wednesday: Profiling
Thursday: Latency, logs, critical path
Friday: Efficiency, algorithms
Post weeks: Case study, retrospective
slide 16:
Prior weeks: Preparation
Everything must work on Monday!
Critical analysis tools ("crisis tools") must be preinstalled
  E.g., Linux: procps, sysstat, linux-tools-common, bcc-tools, bpftrace, …
Stack tracing and symbols should work for the kernel, libraries, and applications
Tracing (host & distributed) must work
The performance engineers must already have host SSH root access
A functional diagram of the system must be known
Source code should be available
Figure: Example functional diagram (Source: "Lunar Module - LM10 Through LM14 Familiarization Manual" (1969))
Current industry status: 1 out of 5
slide 17:
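For the preparation slide (slide 16), a minimal Python sketch of a "crisis tools" pre-check. The mapping from package to a representative command is an assumption for illustration (package contents differ between distributions), and bcc-tools is left out because its install path and command names vary. The function name `missing_tools` is hypothetical.

```python
import shutil

# Assumed mapping: package named on the slide -> one command it usually ships.
# Adjust for your distribution; this is illustrative, not authoritative.
CRISIS_COMMANDS = {
    "procps": "vmstat",
    "sysstat": "iostat",
    "linux-tools-common": "perf",
    "bpftrace": "bpftrace",
}


def missing_tools(commands=CRISIS_COMMANDS, which=shutil.which):
    """Return the sorted package names whose command is not found on PATH."""
    return sorted(pkg for pkg, cmd in commands.items() if which(cmd) is None)
```

Running `print(missing_tools())` on a host before an incident lists the packages that still need installing. The `which` parameter exists only so the lookup can be replaced in tests.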
Monday: Quantify, static tuning, load
1. Quantify the problem
  Problem statement method
2. Static performance tuning
  The system without load
  Check all hardware and software versions, past errors, config
  Covered in sysperf
3. Load vs implementation
  Just a problem of load?
  Usually solved via basic monitoring and line charts
Figure: Problem Statement method (Source: Systems Performance 2nd edition, page 44)
Figure: A familiar pattern of load (Source: https://www.brendangregg.com/Slides/SREcon_2016_perf_checklists)
Current industry status: 4 out of 5
slide 18:
Tuesday: Checklists, elimination
1. Recent issue checklist
  Often need new tools for ad hoc checks
  Can now be automated by AI auto-tuners (e.g., Intel Granulate)
2. Elimination: Subsystems it isn't
  It's impossible to deep-dive everything in one week, need to narrow down
  New tools to exonerate components
  Dashboards of health check traffic lights
  Include experiments: microbenchmarks
Figure: Generic system diagram
Current industry status: 2 out of 5
slide 19:
New observability tools often need eBPF
eBPF is a superpower that can answer any software performance question, in-situ and immediately
slide 20:
Tuesday (cont.): eBPF Tools
Current eBPF tools
  *snoop, *top, *stat, *count, *slower, *dist
  Supports later methodologies
  Workload characterization, latency analysis, off-CPU analysis, USE method, etc.
Future elimination tools
  *health, *diagnosis
  Supports "fast by friday"
  Open source & in the target code repo
  They should not be in bcc/bpftrace or proprietary. Linux subsystem health tools should be in Linux, like unit tests, ideally written by the developers!
Figure: Current eBPF performance tools (Source: BPF Performance Tools, cover art [Gregg 2019])
slide 21:
Wednesday: Profiling
1. CPU Flame Graphs
  More efficient with eBPF
  eBPF runtime stack walkers
2. Off-CPU Flame Graphs
  Impractical without eBPF
Solves most performance issues
Needs preparation!
Figure: CPU flame graph
Figure: Off-CPU/waker time flame graph
Current industry status: 3 out of 5
slide 22:
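For the Wednesday profiling slide (slide 21), a small Python sketch of the aggregation step behind a flame graph: identical stack samples are counted and written in the "folded" form (frames joined by semicolons, root first, followed by a count). The sample stacks and the function name `fold_stacks` are made up for illustration; in practice the samples would come from a profiler.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples into folded lines: 'root;...;leaf count'."""
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Made-up samples: each list is one sampled stack, root frame first.
samples = [
    ["main", "handle_request", "parse"],
    ["main", "handle_request", "parse"],
    ["main", "handle_request", "compress"],
    ["main", "idle"],
]

for line in fold_stacks(samples):
    print(line)
```

Each output line corresponds to one tower in the flame graph, and the count determines its width.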
Thursday: Latency, logs, critical path
1. Latency drilldowns
  Latency histograms
  Latency heat maps
  Latency outliers
  Drill down to origin of latency
2. Logs, event tracing
  Custom event logs
3. Critical path analysis
  Multi-threaded tracing
  Distributed tracing across a distributed environment
Figure: Latency heat maps (Source: https://www.brendangregg.com/HeatMaps/latency.html)
Figure: Distributed tracing (Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis)
Current industry status: 3 out of 5
slide 23:
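The latency histograms listed on the Thursday slides can be illustrated with a short Python sketch that counts latencies in power-of-two buckets, similar in spirit to the histograms that bpftrace's hist() prints. The function names and the sample values are hypothetical, and integer microseconds are assumed as the unit.

```python
from collections import Counter


def log2_histogram(latencies_us):
    """Count integer latencies in buckets [2^k, 2^(k+1)); zero goes to [0, 1)."""
    buckets = Counter()
    for value in latencies_us:
        if value < 0:
            raise ValueError("latency must be non-negative")
        # Lower bound of the bucket: the highest power of two <= value.
        low = 0 if value == 0 else 1 << (int(value).bit_length() - 1)
        buckets[low] += 1
    return dict(sorted(buckets.items()))


def render(hist, width=40):
    """Return one text line per bucket with a bar scaled to the largest count."""
    if not hist:
        return []
    peak = max(hist.values())
    lines = []
    for low, count in hist.items():
        high = 1 if low == 0 else low * 2
        bar = "@" * max(1, round(width * count / peak))
        lines.append(f"[{low}, {high}) {count:6d} |{bar}")
    return lines
```

Calling `print("\n".join(render(log2_histogram([0, 1, 3, 900, 1000, 5000]))))` shows how outliers land in their own high buckets, which is the starting point for drilling down to the origin of the latency.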
Thursday: Latency, logs, critical path (eBPF Tools)
1. Latency drilldowns
  Latency histograms
  Latency heat maps
  Latency outliers
  Drill down to origin of latency
  eBPF tools: *dist, *slower
2. Logs, event tracing
  Custom event logs
  eBPF tools: *snoop, bpftrace
3. Critical path analysis
  Multi-threaded tracing
  Distributed tracing across a distributed environment
  "Zero instrumentation"
Figure: Latency heat maps (Source: https://www.brendangregg.com/HeatMaps/latency.html)
Figure: Distributed tracing (Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis)
Current industry status: 3 out of 5 (when faster uprobes is done; current status: https://dont-ship.it)
slide 24:
Friday: Efficiency, algorithms
1. Is the target efficient?
  A largely unsolved problem
  Cycles/carbon per request
  Compare with similar products
  New efficiency tools (eBPF?)
  Modeling, theory
2. Use faster algorithms?
  Big O Notation
Figure: Example efficiency comparisons (made up): Cycles(k) per 1k read, by protocol (CIFS, iSCSI, FTP, NFSv3, NFSv4)
Source: Systems Performance 2nd Edition, page 175
Current industry status: 1 out of 5
slide 25:
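For the Friday efficiency slide (slide 24), a rough Python sketch of a "cycles per request" estimate. It assumes CPU time and a fixed clock rate are a good enough proxy for cycles, which ignores frequency scaling and idle cycles; hardware cycle counters would be more accurate. The function name and all numbers are made up for illustration.

```python
def cycles_per_request(cpu_seconds, clock_hz, requests):
    """Estimate CPU cycles spent per request from total CPU time and clock rate."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return cpu_seconds * clock_hz / requests


# Made-up example: 12 CPU-seconds at 3 GHz while serving 1,000,000 requests.
estimate = cycles_per_request(cpu_seconds=12, clock_hz=3e9, requests=1_000_000)
print(estimate)  # 36000.0
```

Computing the same figure for a comparable product under the same workload gives the kind of side-by-side comparison the slide describes.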
Post weeks: Case study, retrospective
1. Document as a case study
  JIRA, wiki, gist
  External blog/talk
  Repetition? Add to Tuesday's "Recent issue checklist"
2. Retrospective
  How to debug it faster by friday?
  A new way of thinking: If it took over 1 week to solve, that's a failure.
Example blog post: https://www.brendangregg.com/blog
Current industry status: 1 out of 5
slide 26:
"Fast by Friday": My current industry ratings (5 == best)
Prior weeks: Preparation
Monday: Quantify, static tuning, load
Tuesday: Checklists, elimination
Wednesday: Profiling
Thursday: Latency, logs, critical path
Friday: Efficiency, algorithms
Post weeks: Case study, retrospective
We are not currently good at this
slide 27:
"Fast by Friday": eBPF is Essential
Prior weeks: Preparation
Monday: Quantify, static tuning, load
Tuesday: Checklists, elimination
Wednesday: Profiling
Thursday: Latency, logs, critical path
Friday: Efficiency, algorithms
Post weeks: Case study, retrospective
slide 28:
Take Aways
"Fast by Friday": Any computer performance issue reported on Monday should be solved by Friday (or sooner)
eBPF is essential for such fast in-situ production analysis
It will take all of us many years: OS changes, kernel support, new tools, methodologies
slide 29:
Thanks
Jesper Dangaard Brouer
eBPF: Alexei Starovoitov (Meta), Daniel Borkmann (Isovalent), David S. Miller (Red Hat), Jakub Kicinski (Meta), Yonghong Song (Meta), Andrii Nakryiko (Meta), Martin KaFai Lau (Meta), John Fastabend (Isovalent), Quentin Monnet (Isovalent), Jesper Dangaard Brouer (Red Hat), Andrey Ignatov (Meta), Stanislav Fomichev (Google), Joe Stringer (Isovalent), KP Singh (Google), Dave Thaler (Microsoft), Chris Wright (Red Hat), Linus Torvalds, and many more in the BPF community