Tracing Summit Europe 2014: From DTrace to Linux
Video: https://www.youtube.com/watch?v=TMXZcgnhXvg&list=UU3S1jlPpZUyx8ND3KKPhPvw
Talk at the Tracing Summit 2014, Düsseldorf, by Brendan Gregg.
Description: "What can Linux learn from DTrace: what went well, and what didn't go well, on its path to success? This talk will discuss not just the DTrace software, but lessons from the marketing and adoption of a system tracer, and an inside look at how DTrace was really deployed and used in production environments. It will also cover ongoing problems with DTrace, and how Linux may surpass them and continue to advance the field of system tracing. A world expert and core contributor to DTrace, Brendan now works at Netflix on Linux performance with the various Linux tracers (ftrace, perf_events, eBPF, SystemTap, ktap, sysdig, LTTng, and the DTrace Linux ports), and will summarize his experiences and suggestions for improvements. He has also been contributing to various tracers: recently promoting ftrace and perf_events adoption through articles and front-end scripts, and testing eBPF."
PDF: TracingSummit2014_FromDTraceToLinux.pdf
Keywords (from pdftotext):
slide 1:
TRACING SUMMIT EUROPE
Oct, 2014
From DTrace To Linux: What can Linux learn from DTrace?
Brendan Gregg, Senior Performance Architect
bgregg@netflix.com
@brendangregg
slide 2:
Brendan Gregg
• DTrace contributions include:
  – Primary author of the DTrace book
  – DTraceToolkit
  – dtrace-cloud-tools
  – DTrace network providers
• I now work on Linux at Netflix
  – using: ftrace, perf_events, SystemTap, ktap, eBPF, …
  – created: perf-tools, msr-cloud-tools
• Opinions in this talk are my own
slide 3:
Agenda
1. DTrace
  – What is DTrace, really?
  – Who is DTrace for, really?
  – Why doesn’t Linux have DTrace?
  – What worked well?
  – What didn’t work well?
2. Linux Tracers
  – ftrace, perf_events, eBPF, …
Topics include adoption, marketing, technical challenges, and our usage at Netflix.
slide 4:
What is DTrace, really?
slide 5:
Technology + Marketing
(Like many other company products)
slide 6:
Prior Technology
Kerninst: kernel dynamic tracing, Solaris 2.5.1, 1999
slide 7:
Prior Technology
Early dynamic tracers weren’t safe
slide 8:
Prior Technology
• Also:
  – Sun’s TNF
  – DProbes: user + kernel dynamic tracing
  – Linux Trace Toolkit (LTT)
  – Others, including offline binary instrumentation
• DProbes and LTT were combined in Nov 2000, but not integrated into the Linux kernel [1]
• Sun set forth to produce a production-safe tool
[1] http://lkml.iu.edu/hypermail/linux/kernel/0011.3/0183.html
slide 9:
slide 10:
Technology
• DTrace:
  – Safe for production use
    • You might step on your foot (overhead), but you won’t shoot it off
  – Dynamic tracing, static tracing, and profiling
  – User- and kernel-level, unified
  – Programmatic: filters and summaries
  – Solved countless issues in dev and prod
• That’s what DTrace is for me
  – An awesome technology, often needed to root cause kernel & app issues
• But for most people….
slide 11:
A Typical Conversation…
“Does Linux have DTrace yet?”
“No.”
“That’s a pity”
“Why?”
“DTrace is awesome!”
“Why, specifically?”
“I’m not sure”
“Have you used it?”
“No.”
slide 12:
Marketing
slide 13:
Early Marketing
slide 14:
Early Marketing
• DTrace had awesome marketing
  – People still want it but don’t really know why
• Early marketing: traditional, $$$
  – Great marketing product managers
    • 10 Moves Ahead campaign: airports, stations, etc.
  – Sun sales staff pitched DTrace directly
  – Sun technology evangelists
• Benefits
  – Not another Sun tech no one knew about
  – Compelled people to learn more, try it out
slide 15:
Marketing Evolved
• Sun marketing became innovative
  – Engineering blogs, BigAdmin
  – Marketing staff who used and understood DTrace
    • Who could better articulate its value
  – Marketing more directly from the engineers
slide 16:
Later Marketing
slide 17:
Later Marketing
• Many initiatives by Deirdré Straughan:
  – Social media, blogs, events, the ponycorn mascot, ...
  – Video and share everything: all meetups, talks
• Blogs:
  – including http://dtrace.org; my own > 1M views
• Books:
  – my own > 30k sold
• Videos:
  – me shouting while DTracing disks, ~1M views
• Language support exposed new communities to DTrace
slide 18:
slide 19:
slide 20:
???
slide 21:
Who is DTrace for, really?
slide 22:
DTrace end-users: Current
Estimated
DTrace guide users: ~100
Script end-users: ~5,000
Note: 91.247% of statistics are made up
slide 23:
DTrace end-users: Current
• DTrace guide users: ~100
  – Understand the ~400 page Dynamic Tracing Guide
  – Develop their own scripts from scratch
  – Understand overhead intuitively
• Script end-users: ~5000
  – DTraceToolkit, Google
  – Run scripts. Some tweaks/customizations.
slide 24:
DTrace end-users: Future
eg, Oracle ZFS Storage Appliance Analytics
slide 25:
DTrace end-users: Future
Possible Future
DTrace guide users: ~100
Script end-users: ~5,000
GUI end-users: >50,000
slide 26:
Company Usage
slide 27:
Company Usage
• Practical usage for most companies:
  – A) A performance team (or person)
    • Acquires useful scripts
    • Develops custom scripts
  – B) The rest of the company asks (A) for script/help
    • They need to know what’s possible, to know to ask
  – Or, you buy/develop a GUI that everyone can use
• There are some exceptions
slide 28:
Why doesn’t Linux have DTrace?
slide 29:
Why doesn’t Linux have a DTrace equivalent?
4 Answers…
slide 30:
1. It does (sort of)
ftrace, perf_events
slide 31:
1. It does (sort of)
• Linux has changed
  – In 2005, numerous Linux issues were difficult or impossible to solve. Linux needed a DTrace equivalent.
  – By 2014, many of these are now solvable, especially using ftrace, perf_events, kprobes, uprobes: all part of the Linux kernel
slide 32:
2. Technical
semantic error: missing x86_64 kernel/module debuginfo
slide 33:
2. Technical
• Linux is a more difficult environment
  – Solaris always has symbols, via CTF, which DTrace uses for dynamic tracing
  – Linux doesn’t always have symbols/debuginfo
slide 34:
3. Linux isn’t a Company
“All the wood behind one arrow” – Scott McNealy, CEO, Sun Microsystems
slide 35:
3. Linux isn’t a Company
• Linus can refuse patches, but can’t stop projects
  – The tracing wood is split between many arrows
    • ftrace, perf_events, eBPF, SystemTap, ktap, LTTng, …
  – And we are a small community: there’s not much wood to go around!
slide 36:
4. No Trace Race
slide 37:
4. No Trace Race
• Post 2001, Solaris was losing ground to Linux. Sun desperately needed differentiators to survive
  – Three top Sun engineers spent years on DTrace
  – Sun marketing gave it their best shot…
• This circumstance will never exist again
  – For Linux today, it would be like having Linus, Ingo, and Steven do tracing full-time for three years, followed by a major marketing campaign
• There may never be another trace race. Unless…
slide 38:
Why doesn’t Linux have DTrace itself?
2 Answers…
slide 39:
1. The CDDL
From: Claire Giordano
To: license-discuss@opensource.org [Open Source Initiative]
Subject: For Approval: Common Development and Distribution License (CDDL)
Date: Wed, 01 Dec 2004 19:47:39 -0800
[…]
Like the MPL, the CDDL is not expected to be compatible with the GPL, since it contains requirements that are not in the GPL (for example, the "patent peace" provision in section 6). Thus, it is likely that files released under the CDDL will not be able to be combined with files released under the GPL to create a larger program.
[…]
CDDL Team, Sun Microsystems
Source: http://lwn.net/Articles/114840/
slide 40:
1. The CDDL
• Linux traditionally includes the tracer/profiler in the (GPL) kernel, but the DTrace license is CDDL
  – Out-of-tree projects have maintenance difficulties
  – Oracle (who own the DTrace copyrights) could relicense it as GPL, but haven’t, and may never do this
• Note that ZFS on Linux is doing well, despite being CDDL, and out of tree
slide 41:
2. DTrace ports
slide 42:
2. DTrace ports
• There are two ports, but both currently incomplete
• A) https://github.com/dtrace4linux/linux:
  – Mostly one UK developer, Paul Fox, as a hobby since 2008 (when he isn’t developing on the Raspberry Pi)
• B) Oracle Linux DTrace:
  – Open source kernel, closed source user-level ($)
    • We pay for monitoring tools; why not this too?
  – Experienced engineers, test suite focused
  – Had been good progress, but no updates for months
slide 43:
What with DTrace worked well?
5 Key items…
slide 44:
1. Production Safety
slide 45:
1. Production Safety
• DTrace architecture
  – Restricted probe context: no kernel facility calls, restricted instructions, no backwards branches, restricted loads/stores
  – Heartbeat: aborted due to systemic unresponsiveness
• DTrace Test Suite
  – Hundreds of tests
• Linux is learning this:
  – Oracle Linux DTrace is taking the test suite seriously
  – ftracetest
slide 46:
2. All the wood behind one arrow
DTrace
slide 47:
2. All the wood behind one arrow
• Can Linux learn this?
  – Can we vote some off the Linux tracing island?
• At least, no new tracers in 2015, please!
slide 48:
3. In-Kernel Aggregations
           value  ------------- Distribution ------------- count
            4096 |
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@               1085
           16384 |@@@@@@@@@@@                              443
           32768 |@@                                       98
           65536 |
          131072 |
          262144 |
          524288 |
         1048576 |                                         11
         2097152 |
slide 49:
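The distribution shown on slide 48 has the shape of a DTrace quantize() aggregation: each value is counted in a power-of-two bucket, and only the per-bucket counts are kept rather than every event. As a rough illustration only (this is not DTrace's implementation), here is a minimal Python sketch of that bucketing. The function names and the sample values are hypothetical, chosen for this example.

```python
from collections import Counter


def quantize_bucket(value: int) -> int:
    """Return the power-of-two lower bound of the bucket holding value."""
    if value <= 0:
        return 0
    return 1 << (value.bit_length() - 1)


def quantize(values):
    """Count values per power-of-two bucket, keeping only the counts."""
    counts = Counter()
    for v in values:
        counts[quantize_bucket(v)] += 1
    return counts


def render(counts, width=40):
    """Draw one text row per bucket, bar length proportional to its share."""
    total = sum(counts.values())
    lines = []
    for bucket in sorted(counts):
        bar = "@" * (counts[bucket] * width // total)
        lines.append(f"{bucket:>10} |{bar:<{width}} {counts[bucket]}")
    return "\n".join(lines)


# Hypothetical sample values (for example, I/O sizes in bytes).
samples = [9000, 10000, 12000, 20000, 40000, 1500000]
print(render(quantize(samples)))
```

The point of the design is that the summary stays small no matter how many events arrive, which is what makes it cheap to keep in the kernel and copy out occasionally.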
3. In-Kernel Aggregations
• Changed how investigations are conducted
  – rapid, live analysis
• Low overhead:
  – per-CPU storage, asynchronous kernel->user transfers
• Key uses:
  – summary statistics: count, avg, min, max
  – histograms: quantize, lquantize
  – by-key: execname, kernel stacks, user stacks
• Linux can learn:
  – Need aggregations (eBPF maps, SystemTap, ktap?)
slide 50:
4. Many Example Scripts
slide 51:
4. Many Example Scripts
slide 52:
4. Many Example Scripts
• Scripts serve many needs:
  – tools: ready to run
  – examples: learn by-example
  – marketing: each is a use case
• DTrace book scripts
  – 150+ short examples
• DTraceToolkit scripts
  – 230 more scripts
  – all have man pages, example files, and are tested
  – An essential factor in DTrace’s adoption
slide 53:
4. Many Example Scripts
• Linux can learn:
  – Many users will just run scripts, not write them
  – People want good short examples
  – If they aren’t tested, they don’t work
    • It’s easy to generate metrics that kind-of work; it’s hard to make them reliable for different workloads.
  – Maintenance of dynamic tracing scripts is painful
    • The instrumented code can change
    • Need more static tracepoints
slide 54:
5. Marketing
slide 55:
5. Marketing
• DTrace was effectively marketed in many ways
  – Traditional, social, blogs, scripts, ponycorn, …
• Linux has virtually no marketing for its tracers
  – ftrace is great, if you ever discover it; etc.
  – Marketing spend is on commercial products instead
• Linux can learn to market what it has
  – Tracers may also benefit from “a great name and a cute logo” [1]
  – “eBPF” is not catchy, and doesn’t convey meaning
[1] http://thenewstack.io/why-did-docker-catch-on-quickly-and-why-is-it-so-interesting/
slide 56:
Cute Tracing Logos
ftrace, ktap, perf_events, LTTng, SystemTap, dtrace4linux
Ponies by Deirdré Straughan, using: http://generalzoi.deviantart.com pony creator
slide 57:
Other Things
• Programmable/scriptable
• Built-in stability semantics
slide 58:
What with DTrace didn’t work well?
slide 59:
What with DTrace didn’t work well?
5 Key Issues…
slide 60:
1. Adoption
slide 61:
1. Adoption
• Few customers ever wrote DTrace scripts
  – DTrace should have been used more than it was
  – Sun’s “killer” tool just wasn’t
  – Better pickup rate with developers, not sysadmins
• Many customers just ran my scripts
  – Not ideal, but better than nothing
  – This wasn’t what many at Sun dreamed
• Internal adoption was slow, limited
  – Sun could have done much more, but didn’t
• The problem was knowing what to do with it
  – The syntax was the easy part
slide 62:
1. Adoption
• Linux can learn:
  – Adoption is about more than just the technology
    • Documentation, marketing, training, community
  – Teaching what it does is more important than how
    • Everyone needs to know when to ask for it, not necessarily how to use it
  – Needs an adoption curve (not a step function)
    • Tools, one-liners, short scripts, …
slide 63:
2. Training
This is to certify that Brendan Gregg Has Completed the Sun Educational course
DTrace is a Solaris differentiator
slide 64:
2. Training
• Early training was not very effective
  – Sun began including the DTraceToolkit in courses, with better success
• It gradually improved
  – The last courses I developed and taught (after Sun) used simulated problems for the students to solve on their own with DTrace
• Linux can learn:
  – Lab-based training is most effective. Online tutorials?
slide 65:
3. GUIs
slide 66:
3. GUIs
• Dozens of performance monitoring products, but almost no meaningful DTrace support
• A couple of exceptions:
  – Oracle ZFS Storage Appliance Analytics
    • Formerly the Sun Storage 7000 Analytics
    • Should be generalized. Oracle Solaris 11.3?
  – Joyent Cloud Analytics
slide 67:
3. GUIs
• Linux can learn:
  – Real adoption possible through scripts & GUIs
  – Use the GUI to add value to the data
    • Heat maps: latency, utilization, offset
    • Flame graphs
    • Time series thread visualizations (Trace Compass)
    • ie, not just line graphs!
  – Commercial GUI products have marketing budget
    • Application perf monitoring was $2.4B in 2013 [1]
[1] https://www.gartner.com/doc/2752217/market-share-analysis-application-performance
slide 68:
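A latency heat map of the kind listed on slide 67 is built by counting events into cells indexed by time (column) and latency range (row); the count in each cell then sets its color. The Python sketch below shows only that counting step, with a crude text rendering in place of color. The function names, step sizes, and sample events are assumptions made for this example, not taken from any particular tool.

```python
from collections import Counter


def heatmap(events, time_step_s=1.0, lat_step_ms=1.0):
    """Count (timestamp_s, latency_ms) events into (column, row) cells."""
    cells = Counter()
    for t, lat in events:
        cells[(int(t // time_step_s), int(lat // lat_step_ms))] += 1
    return cells


def render(cells):
    """Text rendering: one digit per cell, highest latency row first."""
    cols = max(c for c, _ in cells) + 1
    rows = max(r for _, r in cells) + 1
    lines = []
    for r in reversed(range(rows)):
        # Cap at 9 so each cell stays one character wide.
        lines.append("".join(str(min(cells.get((c, r), 0), 9)) for c in range(cols)))
    return "\n".join(lines)


# Hypothetical bimodal workload: fast hits near 0 ms, slow I/O near 9 ms.
events = [(0.1, 0.2), (0.5, 0.4), (0.7, 8.3), (1.2, 0.3), (1.9, 9.1), (1.95, 9.7)]
print(render(heatmap(events)))
```

Unlike an average on a line graph, the two bands stay visible as separate rows, which is why bimodal latency shows up clearly in this form.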
3. GUIs
• Heat maps are an example must-have use case for trace data
slide 69:
4. Overheads
While the DTrace technology is awesome, it does have some minor technical challenges as well
slide 70:
4. Overheads
• While optimized, for many targets the DTrace CPU overheads can still be too high
  – Scheduler tracing, memory allocation tracing
  – User-level dynamic tracing (fast trap)
  – VM probes (eg, Java disables some probes by default)
  – 10 GbE Network I/O, etc…
• In some cases it doesn’t matter
  – Desperation: system already melting down
  – Troubleshooting in dev: speed not a concern
• Linux can learn:
  – Speed can matter, faster makes more possible
slide 71:
5. Syscall Provider
slide 72:
5. Syscall Provider
• Solaris DTrace instrumented the trap table, and called it the syscall provider
  – Which is actually an unstable interface
  – Breaks between Solaris versions
    • And really broke in Oracle Solaris 11
  – Other weird caveats
• Linux can learn:
  – syscalls are the #1 target for users learning system tracers. The API should be easy and stable.
slide 73:
Other Issues
• The lack of:
  – Bounded loops (like SystemTap)
  – Kernel instruction tracing (like perf_events)
  – Easy PMC interface (like perf stat)
  – Aggregation key/value access (stap, ktap, eBPF)
  – Kernel source (issue for Oracle Solaris only)
• 4+ second startup times
  – Several Linux tracers start instantly
slide 74:
From DTrace to Linux Tracers (2014)
slide 75:
• Massive AWS EC2 Linux cloud, with FreeBSD appliances for content delivery
• Performance is critical: >50M subscribers
• Just launched in Europe!
slide 76:
System Tracing at Netflix
• Present:
  – ftrace can serve many needs
  – perf_events some more, esp. with debuginfo
  – SystemTap as needed, esp. for Java
  – ad hoc other tools
• Future:
  – ftrace/perf_events/ktap with eBPF, for a fully featured and mainline tracer?
  – One of the other tracers going mainline?
• Summarizing 4 tracers…
slide 77:
1. ftrace
slide 78:
1. ftrace
• Tracing and profiling: /sys/kernel/debug/tracing
  – added by Steven Rostedt and others since 2.6.27, and already enabled on our servers (3.2+)
• Experiences:
  – very useful capabilities: tracing, counting
  – surprising features: graphing (latencies), filters
• Front-end tools to ease use
  – https://github.com/brendangregg/perf-tools
  – WARNING: these are unsupported hacks
  – There’s also the trace-cmd front-end by Steven
• 4 examples…
slide 79:
perf-tools: iosnoop
• Block I/O (disk) events with latency:
# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM       PID  TYPE  DEV    BLOCK  BYTES  LATms
5982800.302061  5982800.302679  supervise             202,1                0.62
5982800.302423  5982800.302842  supervise             202,1                0.42
5982800.304962  5982800.305446  supervise             202,1                0.48
5982800.305250  5982800.305676  supervise             202,1                0.43
[…]
# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
    -d device    # device string (eg, "202,1)
    -i iotype    # match type (eg, '*R*' for all reads)
    -n name      # process name to match on I/O issue
    -p PID       # PID to match on I/O issue
    -Q           # include queueing time in LATms
    -s           # include start time of I/O (s)
    -t           # include completion time of I/O (s)
    -h           # this usage message
    duration     # duration seconds, and use buffers
[…]
slide 80:
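The LATms column in the iosnoop output on slide 79 is the I/O completion time minus its start time. As a worked check, this small Python sketch recomputes LATms from the four STARTs/ENDs pairs shown there, then places each result in a power-of-two millisecond bucket of the 0 -> 1, 1 -> 2, 2 -> 4 kind. The helper names are hypothetical, and this is not how the real tool is written (it is an ftrace front-end); it only illustrates the arithmetic.

```python
# (STARTs, ENDs) pairs copied from the iosnoop output on slide 79.
events = [
    (5982800.302061, 5982800.302679),
    (5982800.302423, 5982800.302842),
    (5982800.304962, 5982800.305446),
    (5982800.305250, 5982800.305676),
]


def latency_ms(start_s: float, end_s: float) -> float:
    """Latency in milliseconds from start and end timestamps in seconds."""
    return (end_s - start_s) * 1000.0


def ms_bucket(lat_ms: float):
    """Return (low, high) of the power-of-two ms bucket: 0->1, 1->2, 2->4, ..."""
    whole = int(lat_ms)
    if whole < 1:
        return (0, 1)
    low = 1 << (whole.bit_length() - 1)
    return (low, low * 2)


latencies = [latency_ms(s, e) for s, e in events]
for lat in latencies:
    low, high = ms_bucket(lat)
    print(f"{lat:.2f} ms falls in bucket {low} -> {high}")
```

All four I/Os complete in under a millisecond, so they would all be counted in the first row of a latency distribution.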
perf-tools: iolatency
• Block I/O (disk) latency distributions:
# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1144     |######################################|
       1 -> 2       : 267      |#########
       2 -> 4       : 10
       4 -> 8       : 5
       8 -> 16      : 248      |#########
      16 -> 32      : 601      |####################
      32 -> 64      : 117      |####
[…]
• User-level processing sometimes can’t keep up
  – Over 50k IOPS. Could buffer more as a workaround, but would prefer in-kernel aggregations
slide 81:
perf-tools: opensnoop
• Trace open() syscalls showing filenames:
# ./opensnoop -t
Tracing open()s. Ctrl-C to end.
TIMEs  COMM      PID  FD   FILE
       postgres       0x8  /proc/self/oom_adj
       postgres       0x5  global/pg_filenode.map
       postgres       0x5  global/pg_internal.init
       postgres       0x5  base/16384/PG_VERSION
       postgres       0x5  base/16384/pg_filenode.map
       postgres       0x5  base/16384/pg_internal.init
       postgres       0x5  base/16384/11725
       svstat         0x4  supervise/ok
       svstat         0x4  supervise/status
       stat           0x3  /etc/ld.so.cache
       stat           0x3  /lib/x86_64-linux-gnu/libselinux…
       stat           0x3  /lib/x86_64-linux-gnu/libc.so.6
       stat           0x3  /lib/x86_64-linux-gnu/libdl.so.2
       stat           0x3  /proc/filesystems
       stat           0x3  /etc/nsswitch.conf
[…]
slide 82:
perf-tools: kprobe
• Just wrapping capabilities eases use. Eg, kprobes:
# ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'
Tracing kprobe myopen. Ctrl-C to end.
postgres-1172 [000] d... 6594028.787166: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797410: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797467: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
^C
Ending tracing...
• By some definition of “ease”. Would like easier symbol usage, instead of +0(%si).
slide 83:
1. ftrace
• Suggestions:
  – I’m blogging and so can you!
  – Function profiler:
    • Can these in-kernel counts be used for other vars? Eg, associative array or histogram of %dx
  – Function grapher:
    • Can the timing be exposed by some vars? Picture histogram of latency
  – Multi-user access possible?
slide 84:
2. perf_events
slide 85:
2. perf_events
• In-kernel, tools/perf, multi-tool, “perf” command
• Experiences:
  – Stable, powerful, reliable
  – The sub options can feel inconsistent (perf bench?)
  – Amazing with kernel debuginfo, when we have it
  – We use it for CPU stack profiles all the time
    • And turn them into flame graphs, which have solved numerous issues so far…
slide 86:
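Turning CPU stack profiles into flame graphs, as mentioned on slide 85, usually starts by collapsing the sampled stacks into one line per unique stack with a sample count, in the semicolon-separated "folded" form that flamegraph.pl reads. Below is a minimal Python sketch of that folding step only. It assumes each sample has already been parsed into a root-first list of frame names; the function name and the frames themselves are made up for the example.

```python
from collections import Counter


def fold(samples):
    """Collapse stack samples into {"root;...;leaf": sample_count}."""
    folded = Counter()
    for stack in samples:
        folded[";".join(stack)] += 1
    return folded


# Hypothetical samples, each listed root frame first.
samples = [
    ["main", "run", "read"],
    ["main", "run", "read"],
    ["main", "run", "parse"],
    ["main", "idle"],
]

# One folded line per unique stack: frames, a space, then the count.
for stack, count in sorted(fold(samples).items()):
    print(f"{stack} {count}")
```

Because identical stacks merge into a single counted line, the width of each box in the resulting graph reflects how often that code path was on-CPU.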
perf CPU Flame Graph
[Figure: CPU flame graph; annotations as extracted: Kernel TCP/IP, Broken Java stacks (missing frame pointer), GC, Locks, Time, Idle thread, epoll]
slide 87:
2. perf_events
• Suggestions:
  – Support for function argument symbols without a full debuginfo
  – Rework scripting framework (eg, try porting iosnoop)
    • eg, “perf record” may need a tunable timeout to trigger data writes, for efficient interactive scripts
  – Break up the multi-tool a bit (separate perf bench)
  – eBPF integration for custom aggregations?
slide 88:
3. SystemTap
slide 89:
3. SystemTap
• The most powerful of the tracers
• Used for the deepest custom tracing
  – Especially Java hotspot probes
• Experiences:
  – Undergoing a reset. Switching to the latest SystemTap version, and a newer kernel. So far, so good.
  – Trying out nd_syscall for debuginfo-less tracing
• Suggestions:
  – More non-debuginfo tapset functionality
slide 90:
4. eBPF
slide 91:
4. eBPF
• Extended BPF: programs on tracepoints
  – High performance filtering: JIT
  – In-kernel summaries: maps
    • eg, in-kernel latency heat map (showing bimodal):
[Figure: latency heat map over Time, annotated "Low latency cache hits" and "High latency device I/O"]
slide 92:
4. eBPF
• Experiences:
  – Can have lower CPU overhead than DTrace
  – Very powerful: really custom maps
  – Assembly version very hard to use; C is better, but still not easy
• Suggestions:
  – Integrate: custom in-kernel aggregations is the missing piece
slide 93:
Other Tracers
• Experiences and suggestions:
  – ktap
  – LTTng
  – Oracle Linux DTrace
  – dtrace4linux
  – sysdig
slide 94:
The Tracing Landscape, Oct 2014 (my opinion)
[Figure: tracers placed by Ease of use (brutal to less brutal), Stage of Development (alpha to mature), and Scope & Capability; tracers shown: sysdig, perf, stap, ftrace, dtrace4L., ktap, eBPF]
slide 95:
Summary
• DTrace is an awesome technology
  – Which has also had awesome marketing
    • Traditional, social, sales, blogs, …
  – Most people won’t use it directly, and that’s ok
    • Drive usage via GUIs and scripts
• Linux Tracers are catching up, and may surpass
  – It’s not 2005 anymore
    • Now we have ftrace, perf_events, kprobes, uprobes, …
  – Speed and aggregations matter
    • If DTrace is Kitty Hawk, eBPF is a jet engine
slide 96:
Acks
• dtrace.conf X-ray pony art by substack
• http://www.raspberrypi.org/ Raspberry Pi image
• http://en.wikipedia.org/wiki/Crash_test_dummy photo by Brady Holt
• https://findery.com/johnfox/notes/all-the-wood-behind-one-arrow
• http://en.wikipedia.org/wiki/Early_flying_machines hang glider image
• http://www.beginningwithi.com/2010/09/12/how-the-dtrace-book-got-done/
• http://www.cafepress.com/joyentsmartos.724465338
• http://generalzoi.deviantart.com/art/Pony-Creator-v3-397808116
• Tux by Larry Ewing; Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
• Thanks Dominic Kay and Deirdré Straughan for feedback
slide 97:
Links
• https://www.usenix.org/legacy/event/usenix04/tech/general/full_papers/cantrill/cantrill.pdf
• ftrace & perf-tools
  • https://github.com/brendangregg/perf-tools
  • http://lwn.net/Articles/608497/
• eBPF: http://lwn.net/Articles/603983/
• ktap: http://www.ktap.org/
• SystemTap: https://sourceware.org/systemtap/
• sysdig: http://www.sysdig.org/
• http://lwn.net/Articles/114840/ CDDL
• http://dtrace.org/blogs/ahl/2011/10/05/dtrace-for-linux-2/
• ftp://ftp.cs.wisc.edu/paradyn/papers/Tamches99Using.pdf
• http://www.brendangregg.com/heatmaps.html
• http://lkml.iu.edu/hypermail/linux/kernel/0011.3/0183.html LTT + DProbes
slide 98:
Thanks
• Questions?
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• bgregg@netflix.com
• @brendangregg