LSFMM 2019: BPF Observability
Keynote for the Linux Storage, File System, and Memory Management Summit (LSFMM) 2019, by Brendan Gregg.
PDF: LSFMM2019_BPF_Observability.pdf
Slide text (from pdftotext):
slide 1:
BPF Observability

# readahead.bt
Attaching 5 probes...
Readahead unused pages: 128
Readahead used page age (ms):
@age_ms:
[1]          2455 |@@@@@@@@@@@@@@@                                     |
[2, 4)       8424 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)       4417 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[8, 16)      7680 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[16, 32)     4352 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[32, 64)        0 |                                                    |
[64, 128)       0 |                                                    |
[128, 256)    384 |@@                                                  |

Brendan Gregg, LSFMM, Apr 2019
slide 2:
eBPF
slide 3:
Superpowers Demo
slide 4:
eBPF: extended Berkeley Packet Filter
[Diagram: user-defined BPF programs (SDN configuration, DDoS mitigation, intrusion detection, container security, firewalls (bpfilter), device drivers, BPF observability) are loaded through the verifier into the kernel runtime, where BPF actions attach to event targets: sockets, kprobes, uprobes, tracepoints, perf_events]
slide 5:
>150k AWS EC2 server instances
~34% US Internet traffic at night
>130M members
Performance is customer satisfaction & Netflix cost
slide 6:
Experience: ps(1) failure
(This is from last week...)
slide 7:
Experience: ps(1) failure

# wait for $pid to finish:
while ps -p $pid >/dev/null; do
    sleep 1
done
# do stuff...
slide 8:
Experience: ps(1) failure

# wait for $pid to finish:
while ps -p $pid >/dev/null; do
    sleep 1
done
# do stuff...

Problem: ps(1) sometimes fails to find the process
Hypothesis: kernel bug!
slide 9:
Experience: ps(1) failure
Which syscall is abnormally failing (without strace(1))?

# bpftrace -e 't:syscalls:sys_exit_* /comm == "ps"/ {
    @[probe, args->ret > 0 ? 0 : - args->ret] = count(); }'
Attaching 316 probes...
[...]
@[tracepoint:syscalls:sys_exit_openat, 2]: 120
@[tracepoint:syscalls:sys_exit_newfstat, 0]: 180
@[tracepoint:syscalls:sys_exit_mprotect, 0]: 230
@[tracepoint:syscalls:sys_exit_rt_sigaction, 0]: 240
@[tracepoint:syscalls:sys_exit_mmap, 0]: 350
@[tracepoint:syscalls:sys_exit_newstat, 0]: 5000
@[tracepoint:syscalls:sys_exit_read, 0]: 10170
@[tracepoint:syscalls:sys_exit_close, 0]: 10190
@[tracepoint:syscalls:sys_exit_openat, 0]: 10190
slide 10:
Experience: ps(1) failure
Which syscall is abnormally failing (without multi-probe)?

# bpftrace -e 't:raw_syscalls:sys_exit /comm == "ps"/ {
    @[args->id, args->ret > 0 ? 0 : - args->ret] = count(); }'
Attaching 1 probe...
[...]
@[21, 2]: 120
@[5, 0]: 180
@[10, 0]: 230
@[13, 0]: 240
@[9, 0]: 350
@[4, 0]: 5000
@[0, 0]: 10170
@[3, 0]: 10190
@[257, 0]: 10190
slide 11:
Experience: ps(1) failure
Which syscall is abnormally failing (without multi-probe)?

# bpftrace -e 't:raw_syscalls:sys_exit /comm == "ps"/ {
    @[ksym(*(kaddr("sys_call_table") + args->id * 8)),
      args->ret > 0 ? 0 : - args->ret] = count(); }'
[...]
@[sys_brk, 0]: 8202
@[sys_ioctl, 25]: 8203          <- ioctl() was a dead end
@[sys_access, 2]: 32808
@[SyS_openat, 2]: 32808
@[sys_newfstat, 0]: 49213
@[sys_newstat, 2]: 60820        <- caught 1 extra failure
@[sys_mprotect, 0]: 62882
[...]
slide 12:
Experience: ps(1) failure
Which syscall is successfully failing?

# bpftrace -e 't:syscalls:sys_exit_getdents /comm == "ps"/ {
    printf("ret: %d\n", args->ret); }'
[...]
ret: 9192
ret: 0
ret: 9216
ret: 0
ret: 9168
ret: 0
ret: 5640
ret: 0
slide 13:
Experience: ps(1) failure
Which syscall is successfully failing?

# bpftrace -e 't:syscalls:sys_enter_getdents /comm == "ps"/ { @start[tid] = nsecs; }
    t:syscalls:sys_exit_getdents /@start[tid]/ {
        printf("%8d us, ret: %d\n", (nsecs - @start[tid]) / 1000, args->ret);
        delete(@start[tid]); }'
[...]
     559 us, ret: 9640
       3 us, ret: 0
     516 us, ret: 9576
       3 us, ret: 0
     373 us, ret: 7720
       2 us, ret: 0
slide 14:
Experience: ps(1) failure
/proc debugging

# funccount '*proc*'
Tracing "*proc*"... Ctrl-C to end. ^C
FUNC                   COUNT
[…]
proc_readdir           ...
proc_readdir_de        ...
proc_root_getattr      ...
process_measurement    ...
kick_process           ...
wake_up_process        ...
proc_pid_readdir       ...
proc_root_readdir      ...
proc_fill_cache        ...
slide 15:
Experience: ps(1) failure
Some quick dead ends:

# bpftrace -e 'kr:proc_fill_cache /comm == "ps"/ { @[retval] = count(); }'
# bpftrace -e 'kr:nr_processes /comm == "ps"/ { printf("%d\n", retval); }'
# bpftrace -e 'kr:proc_readdir_de /comm == "ps"/ { printf("%d\n", retval); }'
# bpftrace -e 'kr:proc_root_readdir /comm == "ps"/ { printf("%d\n", retval); }'

Note: this is all in production
slide 16:
Experience: ps(1) failure
Getting closer to the cause:

# bpftrace -e 'k:find_ge_pid /comm == "ps"/ { printf("%d\n", arg0); }'
[output not recovered; annotated on the slide as success vs. failure cases]
slide 17:
Experience: ps(1) failure
find_ge_pid() entry argument & return value:

# bpftrace -e 'k:find_ge_pid /comm == "ps"/ { @nr[tid] = arg0; }
    kr:find_ge_pid /@nr[tid]/ {
        printf("%d: %llx\n", @nr[tid], retval); delete(@nr[tid]); }'
[…]
15561: ffff8a3ee70ad280
15564: ffff8a400244bb80
15569: ffff8a3f6f1a1840
15570: ffff8a3ffe890c00
15571: ffff8a3ffd23bdc0
15575: ffff8a40024fdd80
15576: 0
slide 18:
Experience: ps(1) failure
Kernel source:

struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
{
        return idr_get_next(&ns->idr, &nr);
}
[…]
void *idr_get_next(struct idr *idr, int *nextid)
{
        […]
        slot = radix_tree_iter_find(&idr->idr_rt, &iter, id);

Subject: [RFC 2/2] pid: Replace PID bitmap implementation with IDR API
Date: Sat, 9 Sep 2017 18:03:17 +0530
[…]
slide 19:
Experience: ps(1) failure

So far we have moved from: "ps(1) sometimes fails. Kernel bug!"
To: "find_ge_pid() sometimes returns NULL instead of the next struct pid *"

I'll keep digging after this keynote.
slide 20:
Takeaway: BPF enables better bug reports
slide 21:
bpftrace: BPF observability front-end

# Files opened by process
bpftrace -e 't:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)) }'

# Read size distribution by process
bpftrace -e 't:syscalls:sys_exit_read { @[comm] = hist(args->ret) }'

# Count VFS calls
bpftrace -e 'kprobe:vfs_* { @[func]++ }'

# Show vfs_read latency as a histogram
bpftrace -e 'k:vfs_read { @[tid] = nsecs }
    kr:vfs_read /@[tid]/ { @ns = hist(nsecs - @[tid]); delete(@[tid]) }'

# Trace user-level function
bpftrace -e 'uretprobe:bash:readline { printf("%s\n", str(retval)) }'

Linux 4.9+, https://github.com/iovisor/bpftrace
slide 22:
Raw BPF
samples/bpf/sock_example.c
[87 lines truncated]
slide 23:
C/BPF
samples/bpf/tracex1_kern.c
[58 lines truncated]
slide 24:
bcc/BPF (C & Python)
bcc examples/tracing/bitehist.py
(entire program)
slide 25:
bpftrace/BPF

bpftrace -e 'kr:vfs_read { @ = hist(retval); }'
(entire program)

https://github.com/iovisor/bpftrace
slide 26:
The Tracing Landscape, Apr 2019 (my opinion)
[Chart: tools plotted by ease of use, from (brutal) to (less brutal), against scope & capability, and annotated by stage of development, from (alpha) to (mature): ftrace (hist triggers), perf (many recent changes), stap, LTTng, sysdig, ply/BPF, bpftrace (eBPF, 0.9), bcc/BPF, C/BPF, Raw BPF]
slide 27:
Experience: readahead
slide 28:
Experience: readahead
Is readahead polluting the cache?
slide 29:
Experience: readahead
Is readahead polluting the cache?

# readahead.bt
Attaching 5 probes...
Readahead unused pages: 128
Readahead used page age (ms):
@age_ms:
[1]          2455 |@@@@@@@@@@@@@@@                                     |
[2, 4)       8424 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)       4417 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[8, 16)      7680 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[16, 32)     4352 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[32, 64)        0 |                                                    |
[64, 128)       0 |                                                    |
[128, 256)    384 |@@                                                  |
slide 30:
#!/usr/local/bin/bpftrace

kprobe:__do_page_cache_readahead    { @in_readahead[tid] = 1; }
kretprobe:__do_page_cache_readahead { @in_readahead[tid] = 0; }

kretprobe:__page_cache_alloc
/@in_readahead[tid]/
{
    @birth[retval] = nsecs;
    @rapages++;
}

kprobe:mark_page_accessed
/@birth[arg0]/
{
    @age_ms = hist((nsecs - @birth[arg0]) / 1000000);
    delete(@birth[arg0]);
    @rapages--;
}

END
{
    printf("\nReadahead unused pages: %d\n", @rapages);
    printf("\nReadahead used page age (ms):\n");
    print(@age_ms); clear(@age_ms);
    clear(@birth); clear(@in_readahead); clear(@rapages);
}
slide 31:
Takeaway: bpftrace is good for short tools
slide 32:
bpftrace Syntax

bpftrace -e 'k:do_nanosleep /pid > 100/ { @[comm]++ }'

Probe:             k:do_nanosleep
Filter (optional): /pid > 100/
Action:            { @[comm]++ }
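A hedged illustration of the same probe/filter/action anatomy (this example is not from the deck; it assumes the stock sched:sched_switch tracepoint):

# Count context switches by process name, for non-root tasks only
bpftrace -e 't:sched:sched_switch /uid != 0/ { @[comm] = count(); }'

Here t:sched:sched_switch is the probe, /uid != 0/ the filter, and { @[comm] = count(); } the action.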
slide 33:
Probes
slide 34:
Probe Type Shortcuts

tracepoint   Kernel static tracepoints
usdt         User-level statically defined tracing
kprobe       Kernel function tracing
kretprobe    Kernel function returns
uprobe       User-level function tracing
uretprobe    User-level function returns
profile      Timed sampling across all CPUs
interval     Interval output
software     Kernel software events
hardware     Processor hardware events
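A sketch combining a few of these types (illustrative, not from the deck): profile for timed sampling, interval for periodic output.

# Sample kernel stacks at 99 Hertz; print and reset the counts every 5s
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }
    interval:s:5 { print(@); clear(@); }'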
slide 35:
Filters

/pid == 181/
/comm != "sshd"/
/@ts[tid]/
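Filters can also combine conditions with boolean operators; a hedged example (not from the deck):

# Count vfs_read() calls for PID 181, unless the caller is sshd
bpftrace -e 'k:vfs_read /pid == 181 && comm != "sshd"/ { @reads = count(); }'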
slide 36:
Actions

Per-event output: printf(), system(), join(), time()
Map summaries: @ = count() or @++, @ = hist(), ...

The following are documented in the reference guide:
https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
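To illustrate the difference (a sketch, not from the deck): the first one-liner prints a line per event, the second folds the same events into an in-kernel map summary that is printed once on exit.

# Per-event output: one printf() per do_nanosleep() entry
bpftrace -e 'k:do_nanosleep { printf("%s is sleeping\n", comm); }'

# Map summary: counts aggregated in kernel context, printed on Ctrl-C
bpftrace -e 'k:do_nanosleep { @[comm] = count(); }'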
slide 37:
Functions

printf(fmt, ...)          Print formatted
print(@x[, top[, div]])   Print map
delete(@x)                Delete map element
clear(@x)                 Delete all keys/values
reg(n)                    Register lookup
join(a)                   Join string array
str(s)                    String
ksym(p)                   Resolve kernel addr
usym(p)                   Resolve user addr
kaddr(n)                  Resolve kernel symbol
uaddr(n)                  Resolve user symbol
hist(n)                   Log2 histogram
lhist(n, min, max, step)  Linear hist.
count()                   Count events
sum(n)                    Sum value
min(n)                    Minimum value
max(n)                    Maximum value
avg(n)                    Average value
stats(n)                  Statistics
time(fmt)                 Print formatted time
system(fmt)               Run shell command
cat(file)                 Print file contents
exit()                    Quit bpftrace
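A hedged sketch combining several of these functions (it assumes the count argument of the syscalls:sys_enter_read tracepoint):

# Requested read sizes as a linear histogram (0-16KB, 1KB steps),
# plus an event count per process name
bpftrace -e 't:syscalls:sys_enter_read {
    @req_bytes = lhist(args->count, 0, 16384, 1024);
    @calls[comm] = count(); }'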
slide 38:
Variable Types

Basic variables:
    @global
    @thread_local[tid]
    $scratch

Associative arrays:
    @array[key] = value

Builtins:
    pid, ...
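A sketch showing the three scopes together (illustrative, not from the deck):

bpftrace -e 'k:vfs_read {
    @total = count();       // @global: one value for the whole trace
    @start[tid] = nsecs;    // thread-local map keyed on thread ID
}
kr:vfs_read /@start[tid]/ {
    $dur = nsecs - @start[tid];  // $scratch: this probe firing only
    @ns = hist($dur);
    delete(@start[tid]);
}'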
slide 39:
Builtin Variables

pid              Process ID (kernel tgid)
tid              Thread ID (kernel pid)
cgroup           Current Cgroup ID
uid              User ID
gid              Group ID
nsecs            Nanosecond timestamp
cpu              Processor ID
comm             Process name
kstack           Kernel stack trace
ustack           User stack trace
arg0, arg1, ...  Function args
retval           Return value
args             Tracepoint args
func             Function name
probe            Full probe name
curtask          Curr task_struct (u64)
rand             Random number (u32)
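For example (an illustrative one-liner, not from the deck), printing several builtins on each do_nanosleep() entry:

# pid/tid/comm/cpu describe the current task; func is the traced function
bpftrace -e 'k:do_nanosleep { printf("%d/%d %s in %s on CPU %d\n",
    pid, tid, comm, func, cpu); }'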
slide 40:
bpftrace: biolatency

#!/usr/local/bin/bpftrace

BEGIN
{
    printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
}

kprobe:blk_account_io_completion
/@start[arg0]/
{
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}
slide 41:
Experience: superping!
slide 42:
Experience: superping
How much is scheduler latency?

# ping 172.20.0.1
PING 172.20.0.1 (172.20.0.1) 56(84) bytes of data.
64 bytes from 172.20.0.1: icmp_seq=1 ttl=64 time=2.87 ms
64 bytes from 172.20.0.1: icmp_seq=2 ttl=64 time=1.66 ms
64 bytes from 172.20.0.1: icmp_seq=3 ttl=64 time=1.55 ms
64 bytes from 172.20.0.1: icmp_seq=4 ttl=64 time=1.11 ms
64 bytes from 172.20.0.1: icmp_seq=5 ttl=64 time=2.48 ms
64 bytes from 172.20.0.1: icmp_seq=6 ttl=64 time=2.39 ms
[...]
slide 43:
Experience: superping
How much is scheduler latency?

# ./superping.bt
Attaching 6 probes...
Tracing ICMP echo request latency. Hit Ctrl-C to end.
IPv4 ping, ID 9827 seq 1: 2883 us
IPv4 ping, ID 9827 seq 2: 1682 us
IPv4 ping, ID 9827 seq 3: 1568 us
IPv4 ping, ID 9827 seq 4: 1078 us
IPv4 ping, ID 9827 seq 5: 2486 us
IPv4 ping, ID 9827 seq 6: 2394 us
[...]
slide 44:
#!/usr/local/bin/bpftrace

#include <linux/skbuff.h>
#include <linux/icmp.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/in.h>

slide 45:
BEGIN
{
    printf("Tracing ICMP echo request latency. Hit Ctrl-C to end.\n");
}

kprobe:ip_send_skb
{
    $skb = (struct sk_buff *)arg1;
    // get IPv4 header; see skb_network_header():
    $iph = (struct iphdr *)($skb->head + $skb->network_header);
    if ($iph->protocol == IPPROTO_ICMP) {
        // get ICMP header; see skb_transport_header():
        $icmph = (struct icmphdr *)($skb->head + $skb->transport_header);
        if ($icmph->type == ICMP_ECHO) {
            $id = $icmph->un.echo.id;
            $seq = $icmph->un.echo.sequence;
            @start[$id, $seq] = nsecs;
        }
    }
}
[...]

Note: no debuginfo required
Takeaway: BPF tracing can walk structs
slide 46:
bpftrace Internals
slide 47:
bpftrace Development
[Timeline: Dec 2016 ... Oct 2018 ... v0.80 (Jan 2019) ... v0.90 (Mar 2019) ... v0.10 (?Jun 2019) ... v1.0 (?2019); work items across this span: major features (v1), known bug fixes, minor features (v1), packaging, stable docs, API stability, more bug fixes]
slide 48:
bpftrace Tools
Linux 4.9+
https://github.com/iovisor/bpftrace
slide 49:
Experience: tcplife
slide 50:
Experience: tcplife
Which processes are connecting to which port?
slide 51:
Experience: tcplife
Which processes are connecting to which port?

# ./tcplife
PID   COMM       LADDR      LPORT RADDR          RPORT TX_KB RX_KB MS
22597 recordProg 127.0.0.1  46644 127.0.0.1      28527     0     0 0.23
3277  redis-serv 127.0.0.1  28527 127.0.0.1      46644     0     0 0.28
22598 curl       ...        61620 52.205.89.26     ...   ...     1 91.79
22604 curl       ...        44400 52.204.43.121    ...   ...     1 121.38
22624 recordProg 127.0.0.1  46648 127.0.0.1      28527     0     0 0.22
3277  redis-serv 127.0.0.1  28527 127.0.0.1      46648     0     0 0.27
22647 recordProg 127.0.0.1  46650 127.0.0.1      28527     0     0 0.21
3277  redis-serv 127.0.0.1  28527 127.0.0.1      46650     0     0 0.26
[...]
slide 52:
bcc
Linux 4.4+
https://github.com/iovisor/bcc
slide 53:
bcc: tcplife

int kprobe__tcp_set_state(struct pt_regs *ctx, struct sock *sk, int state)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    // lport is either used in a filter here, or later
    u16 lport = sk->__sk_common.skc_num;
    […]
    struct tcp_sock *tp = (struct tcp_sock *)sk;
    rx_b = tp->bytes_received;
    tx_b = tp->bytes_acked;
    u16 family = sk->__sk_common.skc_family;
    […]
    if (family == AF_INET) {
        struct ipv4_data_t data4 = {};
        data4.span_us = delta_us;
        data4.rx_b = rx_b;
        data4.tx_b = tx_b;
        data4.ts_us = bpf_ktime_get_ns() / 1000;
        data4.saddr = sk->__sk_common.skc_rcv_saddr;
        data4.daddr = sk->__sk_common.skc_daddr;
        [...]
slide 54:
Experience: tcplife
From kprobes to tracepoints

# bpftrace -lv t:tcp:tcp_set_state          (Linux 4.15)
tracepoint:tcp:tcp_set_state
    const void * skaddr;
    int oldstate;
    int newstate;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];

# bpftrace -lv t:sock:inet_sock_set_state   (Linux 4.16+)
tracepoint:sock:inet_sock_set_state
    const void * skaddr;
    int oldstate;
    int newstate;
    __u16 sport;
    __u16 dport;
    __u16 family;
    __u8 protocol;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
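A hedged sketch of using the newer tracepoint (argument names as listed above; 7 is TCP_CLOSE in the kernel's TCP state enum):

# Print ports whenever a socket enters TCP_CLOSE
bpftrace -e 't:sock:inet_sock_set_state /args->newstate == 7/ {
    printf("%s %d -> %d\n", comm, args->sport, args->dport); }'

Note that the state change can occur in a kernel context where comm is not the owning process, which is one reason tcplife records the PID earlier in the connection's life.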
slide 55:
Takeaways:
- bcc for complex tools
- kprobes can prototype tracepoints
- tracepoints can change (best effort)
slide 56:
Experience: cachestat
slide 57:
Experience: cachestat
What is the page cache hit ratio?
slide 58:
Experience: cachestat
What is the page cache hit ratio?

# cachestat
    HITS   MISSES  DIRTIES HITRATIO  BUFFERS_MB  CACHED_MB
[...]
[row values garbled in extraction; HITRATIO was mostly 100.00%, with one interval at 55.51%]
slide 59:
b.attach_kprobe(event="add_to_page_cache_lru", fn_name="do_count")
b.attach_kprobe(event="mark_page_accessed", fn_name="do_count")
b.attach_kprobe(event="account_page_dirtied", fn_name="do_count")
b.attach_kprobe(event="mark_buffer_dirty", fn_name="do_count")
[…]
# total = total cache accesses without counting dirties
# misses = total of add to lru because of read misses
total = mpa - mbd
misses = apcl - apd
if misses < 0:
    misses = 0

slide 60:
Takeaway: BPF tracing can prototype /proc stats

slide 61:
Reality Check
Many of our perf wins are from CPU flame graphs, not CLI tracing

slide 62:
CPU Flame Graphs
[Flame graph figure: y-axis is stack depth (0 - max), x-axis is alphabetical frame sort (A - Z); frames grouped as Kernel, Java, and JVM]

slide 63:
BPF-based CPU Flame Graphs

Linux 2.6: perf record -> perf.data -> perf script -> stackcollapse-perf.pl -> flamegraph.pl
Linux 4.9: profile.py -> flamegraph.pl

slide 64:
Takeaway: BPF all the things!

slide 65:
Take Aways
- BPF observability:
  - bpftrace: one-liners, short scripts
  - bcc: complex tools
  - production safe, and no debuginfo needed
- kprobe tools can prototype tracepoints, /proc stats
- I'm OK with "tracepoints are best effort"
- BPF all the things!

slide 66:
URLs
https://github.com/iovisor/bpftrace
https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md
https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
https://github.com/iovisor/bcc
https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md
http://www.brendangregg.com/ebpf.html

Update: this keynote was summarized by https://lwn.net/Articles/787131/

slide 67:
Thanks

bpftrace:
- Alastair Robertson (creator)
- Netflix: myself, Mary Marchini
- Sthima: Willian Gaspar
- Facebook: Jon Haslam, Dan Xu
- Augusto Mecking Caringi, Dale Hamel, ...

eBPF & bcc:
- Facebook: Alexei Starovoitov, Teng Qin, Yonghong Song, Martin Lau, Mark Drayton, ...
- Netflix: myself
- VMware: Brenden Blanco
- Daniel Borkmann, David S. Miller, Sasha Goldshtein, Paul Chaignon, ...