Kernel Recipes 2023: Fast By Friday: Why Kernel Superpowers are Essential
Talk by Brendan Gregg for Kernel Recipes 2023
Video: https://www.youtube.com/watch?v=XudHNF4k_x0
Description: "It is not ok that we spend weeks, even months, trying to solve why software is slow. Companies waste money on compute costs, users are unhappy with latency, and product evaluations run out of investigation time. It should not take more than a week to identify the root cause or causes for a performance issue, such that any performance issue reported on a Monday should be solved by Friday, or sooner. The kernel superpowers we have been building are essential for this dream, and allow us to explore performance analysis methodologies to achieve this that were previously a fantasy.
This talk explores the dream of "fast by Friday," and shows how kernel technologies like eBPF, and performance methodologies, can get us there. The end goal is not more tools and metrics or having everyone learn eBPF bytecode. It's about efficient computing, and solving inefficiencies as quickly as possible. It's about saving cycles and carbon.
To be fast by Friday requires observability tools to work on Monday, and right now for many Linux environments that means /proc based tools and Ftrace, sometimes perf, and rarely the eBPF tracing tools: bcc and bpftrace. This and other current and future technical challenges will be discussed, including eBPF stack walking, runtime behavior and uprobes, compiler optimization defaults, OS default packages, and non-CPU targets (GPUs, accelerators)."
PDF: KernelRecipes2023_FastByFriday.pdf
Keywords (from pdftotext):
slide 1:
Fast by Friday
Why Kernel Superpowers are Essential
Brendan Gregg
Kernel Recipes 2023 Fast by Friday: Why Kernel Superpowers are Essential
slide 2:
What would it take to solve any computer performance issue in 5 days?
slide 3:
Imagine solving the performance of anything
- Operating systems, kernels, web browsers, phones, applications, websites, microservices, processors, AI, etc.
- Examples: Linux, Windows, Firefox, Google docs, Minecraft, Amazon.com, Intel GPUs, pytorch, etc.
Websites should load in the blink of an eye.
slide 4:
Why
Timely performance analysis allows faster and more efficient software/hardware/tuning options to be adopted
- Good for the environment: Less cycles, energy, carbon
- Good for innovation: Rewards investment in engineering
- Good for companies: Less compute expense
- Good for end-users: Lower latency, cheaper products
slide 5:
A vision
"Fast by Friday": Any computer performance issue reported on Monday should be solved by Friday (or sooner)
slide 6:
Definitions
"Fast by Friday": Any computer performance issue reported on Monday should be solved by Friday (or sooner)
- Issues: any performance analysis task, especially SW/HW evaluations
- Solved by Friday: doesn't mean fixed, it means root cause(s) known
slide 7:
"Fast by Friday" is…
- A vision
- A way of thinking
- A call to action
- A methodology
- A practical deadline
I want to completely understand the performance of everything…in 5 days
slide 8:
The first of three activities
1. Found: Performance root cause(s) known
2. Fixed: Fix developed
3. Deployed: Fixed everywhere
"Fast by Friday" focuses on (1) as it's often the biggest obstacle. Yes, even for the Linux kernel. Show me a 2x perf fix and I'll show you companies running it by Friday. If the wasted cores paper was widely applicable, I'd have a pretty good example.
slide 9:
The Problem
slide 10:
Expected performance improvement for computing products
[Chart: Performance. Product Performance: Hypothetical]
slide 11:
Example reality
[Chart: Performance. Product Performance: Actual]
slide 12:
Example reality: 3 issues
[Chart: Performance. Product Performance: Actual. Annotation: Amount of lost performance]
- Not enough time to properly analyze all new software/hardware/compiler options (e.g., icx!)
- Regression not solved in time
- Bottleneck not found in time
We, engineers, have to fix this!
slide 13:
Problem: Computers are getting increasingly complex
Just one example (computer hardware) of increasing complexity. Software is worse!
- Performance issues can now go unsolved for weeks, months, years
- Product decisions miss improvements as analysis and tuning takes too long
slide 14:
Analogy: Car performance
You build the world's fastest car, but the customer says: "it isn't"
You investigate and discover:
- They were sent the wrong car
  … with flat tires
  … unbalanced wheels
  … a minor engine issue
  … and older firmware
- They also weren't told how to drive it
  … and left economy enabled
  … and didn't use the turbo button
This may take too long to debug and the customer may leave. Computers are like this too!
slide 15:
A common scenario at product vendors
- Your product is probably the fastest
- But there's likely some config/tunable error
- It's the final week of the customer eval
- You have to make it fast by Friday
slide 16:
How
slide 17:
"Fast by Friday": Proposed Agenda
Prior weeks: Preparation
Monday: Quantify, static tuning, load
Tuesday: Checklists, elimination
Wednesday: Profiling
Thursday: Latency, logs, critical path
Friday: Efficiency, algorithms
Post weeks: Case study, retrospective
slide 18:
Prior weeks: Preparation
Everything must work on Monday!
- Critical analysis tools ("crisis tools") must be preinstalled; e.g., Linux: procps, sysstat, linux-tools-common, bcc-tools, bpftrace, …
- Stack tracing and symbols should work for the kernel, libraries, and applications
- Tracing (host & distributed) must work
- The performance engineers must already have host SSH root access
- A functional diagram of the system must be known
- Source code should be available
[Figure: Example functional diagram. Source: "Lunar Module - LM10 Through LM14 Familiarization Manual" (1969)]
Current industry status: 1 out of 5
slide 19:
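The "crisis tools must be preinstalled" requirement from the preparation slide can be checked ahead of time, so that a missing tool is found before Monday rather than during the crisis. The following is a minimal sketch under stated assumptions: the package-to-command table is illustrative (command names differ across distros, and bcc-tools is left out because its binaries are packaged under different names), and `CRISIS_TOOLS` and `missing_tools` are hypothetical names, not part of any existing tool.

```python
import shutil

# Assumption: one or more representative commands per package named on the
# slide. Adjust for your distro; this table is illustrative, not authoritative.
CRISIS_TOOLS = {
    "procps": ["ps", "vmstat", "free"],
    "sysstat": ["iostat", "mpstat", "sar"],
    "linux-tools-common": ["perf"],
    "bpftrace": ["bpftrace"],
}


def missing_tools(tools=CRISIS_TOOLS, which=shutil.which):
    """Return {package: [commands not found on PATH]} for incomplete packages.

    `which` is injectable so the check can be exercised without touching
    the real PATH.
    """
    missing = {}
    for package, commands in tools.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[package] = absent
    return missing


if __name__ == "__main__":
    report = missing_tools()
    if not report:
        print("all listed crisis tools found")
    for package, commands in report.items():
        print(f"{package}: missing {', '.join(commands)}")
```

Running a check like this from configuration management or a periodic job is one way to keep the answer current; it only confirms that the commands exist, not that stack walking or symbols work.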
Prior weeks: "Crisis Tools"
No time to "apt-get update; apt-get install…" during a perf crisis.
Ftrace is great as it's usually there; my Ftrace/perf tools: https://github.com/brendangregg/perf-tools
Source: Systems Performance 2nd Edition, page 131-132
slide 20:
Monday: Quantify, static tuning, load
1. Quantify the problem
- Problem statement method
2. Static performance tuning
- The system without load
- Check all hardware, software versions, past errors, config
- Covered in sysperf
3. Load vs implementation
- Just a problem of load?
- Usually solved via basic monitoring and line charts
Current industry status: 4 out of 5
[Figure: Problem Statement method. Source: Systems Performance 2nd edition, page 44]
[Figure: A familiar pattern of load. Source: https://www.brendangregg.com/Slides/SREcon_2016_perf_checklists]
slide 21:
Monday (cont.): End-of-day Status
If still unsolved, we now know:
- It’s a real issue, of this magnitude, affecting these systems
- It’s not just config
- It’s not just load
slide 22:
Tuesday: Checklists, elimination
1. Recent issue checklist
- Often need new tools for ad hoc checks
- Can now be automated by AI auto-tuners (e.g., Intel Granulate)
2. Elimination: Subsystems it isn't
- It's impossible to deep-dive everything in one week, need to narrow down
- New tools to exonerate components
- Dashboards of health check traffic lights
- Include experiments: microbenchmarks
[Figure: Generic system diagram]
Current industry status: 2 out of 5
slide 23:
New observability tools often need kernel superpowers
We need new tools for broad and deep custom performance analysis, ideally that can be developed and run in-situ by Friday. No restarts. eBPF is a kernel superpower that makes this possible. (e.g., show me how much workload A queued behind workload B: This is not just queue latency histograms, but needs programmatic filters.)
Ftrace/perf/perf+eBPF also have kernel superpowers in the hands of wizards.
[Figure labels: eBPF, Ftrace, perf]
slide 24:
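The "how much workload A queued behind workload B" question from slide 23 can be sketched in user space to show what a programmatic filter adds over a plain queue latency histogram. This is not eBPF code, and it is not taken from the talk: it assumes a hypothetical trace of per-request records (`Request`, with enqueue, service start, and service end times) from a single-server queue where only one request is in service at a time, and it attributes each waiter's queue time to whichever workload was being serviced.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    workload: str   # hypothetical workload tag, e.g. "A" or "B"
    enqueue: float  # time the request joined the queue
    start: float    # time service began
    end: float      # time service finished


def time_queued_behind(requests: Iterable[Request], waiter: str, blocker: str) -> float:
    """Total time `waiter` requests spent queued while a `blocker` request was in service.

    Assumes a single server, so service windows never overlap and no waiting
    time is counted twice.
    """
    requests = list(requests)
    total = 0.0
    for w in requests:
        if w.workload != waiter:
            continue
        for b in requests:
            if b.workload != blocker:
                continue
            # Overlap of w's wait window [enqueue, start) with b's
            # service window [start, end).
            overlap = min(w.start, b.end) - max(w.enqueue, b.start)
            if overlap > 0:
                total += overlap
    return total


if __name__ == "__main__":
    trace = [
        Request("B", 0.0, 0.0, 5.0),
        Request("A", 1.0, 5.0, 6.0),
        Request("A", 2.0, 6.0, 7.0),
    ]
    print("A queued behind B:", time_queued_behind(trace, "A", "B"))
    print("A queued behind A:", time_queued_behind(trace, "A", "A"))
```

A histogram of queue latency would show that both A requests waited 4 time units, but not that 7 of those 8 units were spent behind B; the per-pair filter is what answers the question as posed.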
Tuesday (cont.): eBPF Tools
Current eBPF tools
- *snoop, *top, *stat, *count, *slower, *dist
- Supports later methodologies: Workload characterization, latency analysis, off-CPU analysis, USE method, etc.
Future elimination tools
- *health, *diagnosis
- Supports "Fast by Friday"
- Analyzes existing dynamic workload
- Open source & in the target code repo
E.g., Linux subsystem tools should be in Linux, like unit tests, accepted by maintainers, and ideally written by the developers! E.g., dctcphealth should ideally be written by the dctcp author: Daniel Borkmann! This ensures they are accurate and maintained. They should not be in bcc/bpftrace or proprietary.
[Figure: Current eBPF performance tools. Source: BPF Performance Tools, cover art [Gregg 2019]]
slide 25:
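As a rough illustration of what a *health tool might look like, here is a sketch of the two-stage shape that the L2ARC example on the following slides arrives at: a safe counter check against handpicked thresholds that reports "good" or "maybe issue", then an optional invasive test that disables the component while measuring throughput. Everything named here is hypothetical (`HealthReport`, `counter_check`, `invasive_check`, `health_check`, and the counter names in the usage note); the counters and the disable/enable hooks would have to come from the real subsystem.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class HealthReport:
    verdict: str  # "good", "maybe issue", or "bad"
    detail: str


def counter_check(counters: Dict[str, float], thresholds: Dict[str, float]) -> HealthReport:
    """Stage 1 (safe): compare counters against handpicked upper limits."""
    exceeded = sorted(name for name, limit in thresholds.items()
                      if counters.get(name, 0.0) > limit)
    if exceeded:
        return HealthReport("maybe issue", "over threshold: " + ", ".join(exceeded))
    return HealthReport("good", "all counters within thresholds")


def invasive_check(measure_throughput: Callable[[], float],
                   disable: Callable[[], None],
                   enable: Callable[[], None]) -> HealthReport:
    """Stage 2 (invasive): disable the component, compare throughput, restore it.

    Assumes the baseline throughput is greater than zero.
    """
    before = measure_throughput()
    disable()
    try:
        after = measure_throughput()
    finally:
        enable()  # always restore the original state
    change = (after - before) / before * 100.0
    verdict = "bad" if after > before else "good"
    return HealthReport(verdict, f"throughput {change:+.1f}% with component disabled")


def health_check(counters, thresholds, confirm, measure_throughput, disable, enable):
    """Run stage 1; only progress to stage 2 if needed and the operator agrees."""
    report = counter_check(counters, thresholds)
    if report.verdict == "maybe issue" and confirm(report.detail):
        return invasive_check(measure_throughput, disable, enable)
    return report
```

A caller would pass something like `health_check({"scan_cpu_pct": 30.0}, {"scan_cpu_pct": 10.0}, confirm, measure, disable, enable)`, where `confirm` prompts the operator before anything invasive runs. The `try`/`finally` reflects the "safe -> violent, only progress if needed" ordering: the invasive stage always restores the component, even if the measurement fails.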
Tuesday (cont.): Health Tool Example 1/2
I wrote the ZFS L2ARC (second level cache) so I should write the health check tool, or at least share thoughts for others to follow:
I designed it to either help or do nothing, so shouldn’t be an issue, but...
It could burn CPU for scanning, memory for metadata, and disk I/O throughput for caching, and not providing a net win, especially if someone set the record size to very small. Plus there could be outright bugs by now: There was that ARC bug I talked about at the last KR.
- Experimental is easiest: It’s a cache, so turn it off! Are things now faster or slower?
- Accurate observability is hard: Measure CPU burn (profiling or eBPF tracing), disk I/O, and impact of L2ARC kernel metadata preventing app WSS from caching, but measuring WSS is hard, and my website is overdue an update www.brendangregg.com/wss.html
- Rough observability: From kernel counters: Is the L2ARC in use? Is the recsize
slide 26:
Tuesday (cont.): Health Tool Example 2/2
In summary, a practical L2ARC health tool could:
1. Use kernel counters to check for possible resource contention versus handpicked thresholds, and report “good” or “maybe issue”.
2. If maybe, prompt for an invasive test that disables the L2ARC while monitoring systemic throughput. Report “good” or “bad” and quantify.
slide 27:
Tuesday (cont.): Health Tool Points
- An ugly half-good tool is better than nothing
- Sharing thoughts can let others write it (Documentation/*/health.txt)
- Reporting "maybe" is ok
- Not a C64 diagnostics cart: Has to analyze existing workloads
- Test hierarchy: safe -> violent, only progress if needed, can prompt
- Be pragmatic: eBPF, perf, Ftrace, /proc, use anything
Current tools: "Here's data, you figure it out"
Health tools: "I figured it out"
slide 28:
Tuesday (cont.): End-of-day Status
If still unsolved, we now know:
- It’s a real issue, of this magnitude, affecting these systems
- It’s not just config
- It’s not just load
- It’s not a recent issue
- It’s caused by these components
slide 29:
Wednesday: Profiling
1. CPU Flame Graphs
- More efficient with eBPF
- eBPF runtime stack walkers
2. CPI Flame Graphs
- Needs PMCs
- PEBS on Intel for accuracy
3. Off-CPU Flame Graphs
- Impractical without eBPF
[Figure: CPU flame graph]
Solves most performance issues
Needs preparation!
Current industry status: 3 out of 5
[Figure: Off-CPU/waker time flame graph]
slide 30:
Wednesday (cont.): End-of-day Status
If still unsolved, we now know:
- It’s a real issue, of this magnitude, affecting these systems
- It’s not just config
- It’s not just load
- It’s not a recent issue
- It’s caused by these components
- It’s caused by these codepaths
slide 31:
Thursday: Latency, logs, critical path, HW
1. Latency drilldowns
- Latency histograms
- Latency heat maps
- Latency outliers
2. Logs, event tracing
- Custom event logs
3. Critical path analysis
- Multi-threaded tracing
- Distributed tracing across a distributed environment
4. Hardware counters
Current industry status: 3 out of 5
[Figure: Latency heat maps. Source: https://www.brendangregg.com/HeatMaps/latency.html]
[Figure: Distributed tracing. Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis]
slide 32:
Thursday: Latency, logs, critical path, HW
1. Latency drilldowns
- Latency histograms
- Latency heat maps
- Latency outliers
- eBPF tools: *dist, *slower
2. Logs, event tracing
- Custom event logs
- eBPF tools: *snoop, bpftrace
3. Critical path analysis
- Multi-threaded tracing
- Distributed tracing across a distributed environment
- eBPF tools: "Zero instrumentation" (when faster uprobes is done; currently: https://dont-ship.it)
4. Hardware counters
- perf & its subcommands
Current industry status: 3 out of 5
[Figure: Latency heat maps. Source: https://www.brendangregg.com/HeatMaps/latency.html]
[Figure: Distributed tracing. Source: https://www.brendangregg.com/Slides/Monitorama2015_NetflixInstanceAnalysis]
slide 33:
Thursday (cont.): End-of-day Status
If still unsolved, we now know:
- It’s a real issue, of this magnitude, affecting these systems
- It’s not just config
- It’s not just load
- It’s not a recent issue
- It’s caused by these components
- It’s caused by these codepaths
- Latency has this distribution, over time, and these outliers
- Latency is coming from this specific component
- It's not a low-level hardware issue
slide 34:
Friday: Efficiency, algorithms
1. Is the target efficient?
- A largely unsolved problem
- Cycles/carbon per request
- Compare with similar products
- New efficiency tools (eBPF?)
- System efficiency equals the least efficient component
- Modeling, theory
2. Use faster algorithms?
- Big O Notation
Current industry status: 1 out of 5
[Table: Example efficiency comparisons (made up). Protocol (CIFS, iSCSI, FTP, NFSv3, NFSv4) vs. Cycles(k) per 1k read]
[Figure: Big O Notation. Source: Systems Performance 2nd Edition, page 175]
slide 35:
Friday (cont.): End-of-day Status
If still unsolved, we now know:
- It’s a real issue, of this magnitude, affecting these systems
- It’s not just config
- It’s not just load
- It’s not a recent issue
- It’s caused by this component
- It’s caused by these codepaths
- Latency has this distribution, over time, and these outliers
- Latency is coming from this specific component
- It's not a low-level hardware issue
- The code is efficient already. There is no “problem”!
slide 36:
Post weeks: Case study, retrospective
1. Document as a case study
- JIRA, wiki, gist
- External blog/talk
- Including (redacted) flame graphs is great: You may find overlooked perf issues years later from them.
- Repetition? Add to Tuesday's "Recent issue checklist"
2. Retrospective
- How to debug it faster by Friday?
Current industry status: 1 out of 5
Example blog post: https://www.brendangregg.com/blog
slide 37:
"Fast by Friday": My current industry ratings (5 == best)
Prior weeks: Preparation
Monday: Quantify, static tuning, load
Tuesday: Checklists, elimination
Wednesday: Profiling
Thursday: Latency, logs, critical path
Friday: Efficiency, algorithms
Post weeks: Case study, retrospective
We are not currently good at this
slide 38:
"Fast by Friday": Linux Kernel Superpowers
Prior weeks: Preparation
Monday: Quantify, static tuning, load
Tuesday: Checklists, elimination
Wednesday: Profiling
Thursday: Latency, logs, critical path
Friday: Efficiency, algorithms
Post weeks: Case study, retrospective
[Figure labels: eBPF, perf, Ftrace]
slide 39:
What Needs to Change
slide 40:
A way of thinking, a call for action
- Consider perf wins that took weeks as room for improvement
- New tracing tools needed: *diagnose, *health
- Crisis tools should be installed by default in enterprise distros
- Stack walking should work by default for everything
slide 41:
Stack walking, frame pointers, and eBPF walking
- Frame pointers already enabled at major companies. Fedora first distro to offer it?
- Can't we be smarter if needed? NOP/__fentry__ style rewrites (Rostedt)? Options with LD/ELF.
- Reasons FPs were disabled in 2004:
  - i386
  - gdb doesn't need them
  - gcc vs icc
- eBPF custom runtime stack walkers (Java, etc.)
Yes, multiple people are doing this. They should ship as open source with the runtime code.
https://gcc.gnu.org/legacy-ml/gcc-patches/2004-08/msg01033.html
slide 42:
Summary
slide 43:
"Fast by Friday" Summary
Prior weeks: Preparation
Day 1: Quantify, static tuning, load
Day 2: Checklists, elimination
Day 3: Profiling
Day 4: Latency, logs, critical path
Day 5: Efficiency, algorithms
Post weeks: Case study, retrospective
Fast by Friday: Any computer performance issue reported on Monday should be solved by Friday (or sooner)
slide 44:
"Fixed by Friday" (a different talk) sample
Performance Mantras:
- Don't do it
- Do it, but don't do it again
- Do it less
- Do it later
- Do it when they're not looking
- Do it concurrently
- Do it cheaper
Fixed by Friday: Any known performance bug reported on Monday should have a fix by Friday (or sooner)
AFAIK these mantras are from Craig Hanson and Pat Crain (I'm still looking for a reference)
slide 45:
Take Aways
- "Fast by Friday": Any computer performance issue reported on Monday should be solved by Friday (or sooner)
- Kernel superpowers, especially eBPF, are essential for such fast in-situ production analysis
- It will take all of us many years: OS changes, kernel support, new tools, methodologies.
How can you help? One step at a time!
slide 46:
Q&A
slide 47:
Thanks
Jesper Dangaard Brouer
eBPF: Alexei Starovoitov (Meta), Daniel Borkmann (Isovalent), David S. Miller (Red Hat), Jakub Kicinski (Meta), Yonghong Song (Meta), Andrii Nakryiko (Meta), Thomas Graf (Isovalent), Martin KaFai Lau (Meta), John Fastabend (Isovalent), Quentin Monnet (Isovalent), Jesper Dangaard Brouer (Red Hat), Andrey Ignatov (Meta), Stanislav Fomichev (Google), Joe Stringer (Isovalent), KP Singh (Google), Dave Thaler (Microsoft), Liz Rice (Isovalent), Chris Wright (Red Hat), Linus Torvalds, and many more in the BPF community
Ftrace: Steven Rostedt (Google) and the Ftrace community
Perf: Arnaldo Carvalho de Melo (Red Hat) and the perf community
Kernel Recipes 10th edition!