ATO: Linux Performance 2018

Talk by Brendan Gregg for All Things Open 2018.

Description: "At over one thousand code commits per week, it's hard to keep up with Linux developments. This keynote will summarize recent Linux performance features, for a wide audience: the KPTI patches for Meltdown, eBPF for performance observability and the new open source tools that use it, Kyber for disk I/O scheduling, BBR for TCP congestion control, and more. This is about exposure: knowing what exists, so you can learn and use it later when needed. Get the most out of your systems with the latest Linux kernels and exciting features."

PDF: ATO2018_Linux_Performance_2018.pdf

Keywords (from pdftotext):

slide 1:
    Linux Performance
    Brendan Gregg
    Senior Performance Architect
    Oct 2018
    
slide 2:
    http://neuling.org/linux-next-size.html
    
slide 3:
    Post frequency:
    4 per year: https://kernelnewbies.org/Linux_4.18
    4 per week: https://lwn.net/Kernel/
    400 per day: LKML, http://vger.kernel.org/vger-lists.html#linux-kernel
    
slide 4:
    https://meltdownattack.com/
    
slide 5:
    KPTI
    Linux 4.15 & backports
    Cloud Hypervisor (patches)
    Linux Kernel (KPTI)
    Application (retpoline)
    CPU (microcode)
    
slide 6:
    Server A: 31353 MySQL queries/sec
    serverA# mpstat 1
    Linux 4.14.12-virtual (bgregg-c5.9xl-i-xxx)  02/09/2018  _x86_64_  (36 CPU)
    01:09:13 AM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
    01:09:14 AM  all  86.89   0.00  13.08
    01:09:15 AM  all  86.77   0.00  13.23
    01:09:16 AM  all  86.93   0.00  13.02
    [...]
    Server B: 22795 queries/sec (27% slower)
    serverB# mpstat 1
    Linux 4.14.12-virtual (bgregg-c5.9xl-i-xxx)  02/09/2018  _x86_64_  (36 CPU)
    01:09:44 AM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
    01:09:45 AM  all  82.94   0.00  17.06
    01:09:46 AM  all  82.78   0.00  17.22
    01:09:47 AM  all  83.14   0.00  16.86
    [...]
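
The "27% slower" figure follows from the two query rates quoted on this slide. A minimal sketch of that arithmetic (the function name is illustrative, not from the talk):

```python
def percent_slower(baseline: float, measured: float) -> float:
    """Relative throughput drop of `measured` versus `baseline`, in percent."""
    return (baseline - measured) / baseline * 100.0

server_a_qps = 31353  # Server A, from the slide
server_b_qps = 22795  # Server B, from the slide

drop = percent_slower(server_a_qps, server_b_qps)
print(f"Server B is {drop:.1f}% slower")
```

The same run also shows %sys rising from about 13% to about 17%, which is consistent with extra kernel-side overhead on Server B.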
    
slide 7:
    Linux KPTI patches for Meltdown flush the Translation Lookaside Buffer
    [Diagram labels: CPU, Virtual Address, MMU, TLB (hit / miss (walk)), Page Table, Main Memory, Physical Address]
    
slide 8:
    Server A: TLB miss walks 3.5%
    serverA# ./tlbstat 1
    K_CYCLES  K_INSTR  IPC   DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
    [...]              1.04  86588626    115441706   1507279    [...]       1.57   1.92
    [...]              1.04  86281319    115306404   1507472    [...]       1.57   1.92
    [...]              1.04  86564448    115555259   1511158    [...]       1.58   1.93
    [...]              1.04  86187531    115292395   1508524    [...]       1.57   1.92
    Server B: TLB miss walks 19.2% (16 percentage points higher)
    serverB# ./tlbstat 1
    K_CYCLES  K_INSTR  IPC   DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
    [...]              0.84  911337888   719553692   10476524   [...]      10.92   8.19
    [...]              0.84  913726197   721751988   10518488   [...]      10.96   8.25
    [...]              0.84  912994135   721492911   10524675   [...]      10.97   8.26
    [...]              0.84  912009660   720027006   10501926   [...]      10.93   8.24
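
The headline percentages on this slide appear to be the sum of the DTLB% and ITLB% columns. A small sketch under that assumption, averaging the four rows shown for each server (the helper name is made up for illustration):

```python
def mean_tlb_walk_percent(rows):
    """Average of DTLB% + ITLB% over a list of (dtlb_pct, itlb_pct) rows."""
    return sum(d + i for d, i in rows) / len(rows)

# (DTLB%, ITLB%) pairs copied from the tlbstat output on the slide.
server_a_rows = [(1.57, 1.92), (1.57, 1.92), (1.58, 1.93), (1.57, 1.92)]
server_b_rows = [(10.92, 8.19), (10.96, 8.25), (10.97, 8.26), (10.93, 8.24)]

server_a = mean_tlb_walk_percent(server_a_rows)
server_b = mean_tlb_walk_percent(server_b_rows)

print(f"Server A: {server_a:.1f}%")
print(f"Server B: {server_b:.1f}%")
print(f"Difference: {server_b - server_a:.1f} percentage points")
```

The difference is an absolute gap in percentage points, not a relative increase; as a ratio, Server B spends several times as many cycles in TLB walks as Server A.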
    
slide 9:
    http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
    
slide 10:
    Enhanced BPF
    also known as just "BPF"
    Linux 4.*
    User-Defined BPF Programs: SDN Configuration, DDoS Mitigation, Intrusion Detection, Container Security, Observability, Firewalls (bpfilter), Device Drivers
    Kernel Runtime: verifier, BPF, BPF actions
    Event Targets: sockets, kprobes, uprobes, tracepoints, perf_events
    
slide 11:
    eBPF is solving new things: off-CPU + wakeup analysis
    
slide 12:
    eBPF bcc
    Linux 4.4+
    https://github.com/iovisor/bcc
    
slide 13:
    e.g., identify multimodal disk I/O latency and outliers with bcc/eBPF biolatency
    # biolatency -mT 10
    Tracing block device I/O... Hit Ctrl-C to end.

    19:19:04
         msecs               : count     distribution
             0 -> 1          : 238      |*********
             2 -> 3          : 424      |*****************
             4 -> 7          : 834      |*********************************
             8 -> 15         : 506      |********************
            16 -> 31         : 986      |****************************************|
            32 -> 63         : 97       |***
            64 -> 127        : 7
           128 -> 255        : 27

    19:19:14
         msecs               : count     distribution
             0 -> 1          : 427      |*******************
             2 -> 3          : 424      |******************
    […]
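
biolatency reports latency in power-of-two buckets (0 -> 1, 2 -> 3, 4 -> 7, and so on), which is what makes multiple modes and outliers visible in a few lines. The following is a rough user-space sketch of that bucketing, not the bcc implementation; the sample latencies are invented, and unlike bcc it prints only non-empty buckets.

```python
from collections import Counter

def log2_bucket(value):
    """Return the (low, high) power-of-two range containing a non-negative int."""
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, 2 * low - 1)

def print_log2_hist(values, label="msecs", width=40):
    """Print counts per power-of-two bucket, with bars scaled to the largest count."""
    counts = Counter(log2_bucket(v) for v in values)
    peak = max(counts.values())
    print(f"{label:>10}          : count     distribution")
    for low, high in sorted(counts):
        n = counts[(low, high)]
        stars = "*" * (n * width // peak)
        print(f"{low:>10} -> {high:<8} : {n:<8} |{stars:<{width}}|")

# Made-up latencies in milliseconds, for illustration only.
sample_ms = [0, 1, 1, 2, 3, 5, 6, 7, 9, 12, 20, 25, 130]
print_log2_hist(sample_ms)
```

A single slow request (130 ms here) lands in its own bucket far from the rest, which is how an outlier shows up in this kind of histogram.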
    
slide 14:
    bcc/eBPF programs are laborious: biolatency
    # define BPF program
    bpf_text = """
    #include <uapi/linux/ptrace.h>
    #include <linux/blkdev.h>

    typedef struct disk_key {
        char disk[DISK_NAME_LEN];
        u64 slot;
    } disk_key_t;
    BPF_HASH(start, struct request *);
    STORAGE

    // time block I/O
    int trace_req_start(struct pt_regs *ctx, struct request *req)
    {
        u64 ts = bpf_ktime_get_ns();
        start.update(&req, &ts);
        return 0;
    }

    // output
    int trace_req_completion(struct pt_regs *ctx, struct request *req)
    {
        u64 *tsp, delta;

        // fetch timestamp and calculate delta
        tsp = start.lookup(&req);
        if (tsp == 0) {
            return 0;   // missed issue
        }
        delta = bpf_ktime_get_ns() - *tsp;
        FACTOR

        // store as histogram
        STORE

        start.delete(&req);
        return 0;
    }
    """

    # code substitutions
    if args.milliseconds:
        bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;')
        label = "msecs"
    else:
        bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;')
        label = "usecs"
    if args.disks:
        bpf_text = bpf_text.replace('STORAGE',
            'BPF_HISTOGRAM(dist, disk_key_t);')
        bpf_text = bpf_text.replace('STORE',
            'disk_key_t key = {.slot = bpf_log2l(delta)}; ' +
            'void *__tmp = (void *)req->rq_disk->disk_name; ' +
            'bpf_probe_read(&key.disk, sizeof(key.disk), __tmp); ' +
            'dist.increment(key);')
    else:
        bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
        bpf_text = bpf_text.replace('STORE',
            'dist.increment(bpf_log2l(delta));')
    if debug or args.ebpf:
        print(bpf_text)
        if args.ebpf:
            exit()

    # load BPF program
    b = BPF(text=bpf_text)
    if args.queued:
        b.attach_kprobe(event="blk_account_io_start", fn_name="trace_req_start")
    else:
        b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
        b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
    b.attach_kprobe(event="blk_account_io_completion",
        fn_name="trace_req_completion")

    print("Tracing block device I/O... Hit Ctrl-C to end.")

    # output
    exiting = 0 if args.interval else 1
    dist = b.get_table("dist")
    while (1):
        try:
            sleep(int(args.interval))
        except KeyboardInterrupt:
            exiting = 1

        print()
        if args.timestamp:
            print("%-8s\n" % strftime("%H:%M:%S"), end="")

        dist.print_log2_hist(label, "disk")
        dist.clear()

        countdown -= 1
        if exiting or countdown == 0:
            exit()
    
slide 15:
    … rewritten in bpftrace (launched Oct 2018)!
    #!/usr/local/bin/bpftrace
    BEGIN
    {
        printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
    }

    kprobe:blk_account_io_start
    {
        @start[arg0] = nsecs;
    }

    kprobe:blk_account_io_completion
    /@start[arg0]/
    {
        @usecs = hist((nsecs - @start[arg0]) / 1000);
        delete(@start[arg0]);
    }
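
Both the bcc and bpftrace versions rely on the same pattern: save a timestamp keyed by the request when I/O is issued, then look it up at completion, record the delta, and delete the entry. The following is a plain-Python model of that pattern only (no BPF involved); the class and method names are hypothetical.

```python
import time

class LatencyTracker:
    """User-space model of the start-map / completion-delta pattern."""

    def __init__(self, clock=time.monotonic_ns):
        self.clock = clock
        self.start = {}          # request id -> issue timestamp (ns)
        self.latencies_us = []   # latencies of completed requests (microseconds)

    def on_start(self, req_id):
        self.start[req_id] = self.clock()

    def on_completion(self, req_id):
        ts = self.start.pop(req_id, None)
        if ts is None:
            return  # missed the issue event, so no delta can be computed
        self.latencies_us.append((self.clock() - ts) // 1000)

# Drive the model with a fake clock so the result is deterministic.
fake_times = iter([1_000, 6_000])
tracker = LatencyTracker(clock=lambda: next(fake_times))
tracker.on_start("req-1")        # issued at t=1000 ns
tracker.on_completion("req-2")   # never seen at issue: ignored
tracker.on_completion("req-1")   # completed at t=6000 ns
print(tracker.latencies_us)
```

The early return for an unknown request mirrors the `tsp == 0` check in the bcc tool and the `/@start[arg0]/` filter in the bpftrace script.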
    
slide 16:
    eBPF bpftrace (aka BPFtrace)
    Linux 4.9+
    # Syscall count by program
    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    # Read size distribution by process
    bpftrace -e 'tracepoint:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'
    # Files opened by process
    bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'
    # Trace kernel function
    bpftrace -e 'kprobe:do_nanosleep { printf("sleep by %s", comm); }'
    # Trace user-level function
    bpftrace -e 'uretprobe:/bin/bash:readline { printf("%s\n", str(retval)); }'
    Good for one-liners & short scripts; bcc is good for complex tools
    https://github.com/iovisor/bpftrace
    
slide 17:
    bpftrace Internals
    
slide 18:
    eBPF XDP
    Linux 4.8+
    https://www.netronome.com/blog/frnog-30-faster-networking-la-francaise/
    
slide 19:
    eBPF bpfilter
    Linux 4.18+
    ipfwadm (1.2.1)
    ipchains (2.2.10)
    iptables
    nftables (3.13)
    bpfilter (4.18+)
    jit-compiled
    NIC offloading
    https://lwn.net/Articles/747551/
    
slide 20:
    BBR
    Linux 4.9
    TCP congestion control algorithm
    Bottleneck Bandwidth and RTT
    1% packet loss: we see 3x better throughput
    https://twitter.com/amernetflix/status/892787364598132736
    https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/
    https://queue.acm.org/detail.cfm?id=3022184
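
On Linux, the congestion control algorithm can be chosen per socket through the `TCP_CONGESTION` socket option, which Python exposes as `socket.TCP_CONGESTION`. The sketch below is one way to try BBR on a single socket. It assumes the kernel has BBR available and otherwise reports whatever algorithm remains in effect; the helper name is made up.

```python
import socket

def try_set_congestion_control(sock, name="bbr"):
    """Try to select a TCP congestion control algorithm for one socket.

    Returns the algorithm in effect afterwards, or None if it cannot be
    queried on this platform.
    """
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return None  # option not exposed on this platform
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, name.encode())
    except OSError:
        pass  # algorithm unavailable (e.g. module not loaded) or not permitted
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return None
    return raw.split(b"\x00", 1)[0].decode()

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    result = try_set_congestion_control(s)
    print(result)
```

The system-wide default is controlled separately by the `net.ipv4.tcp_congestion_control` sysctl.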
    
slide 21:
    Kyber
    Linux 4.12
    Multiqueue block I/O scheduler
    Tune target read & write latency
    Up to 300x lower 99th percentile latencies in our testing
    [Diagram: Kyber (simplified). Labels: reads (sync), writes (async), dispatch, queue size adjust, completions]
    https://lwn.net/Articles/720675/
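
The I/O scheduler for a block device is listed in `/sys/block/<dev>/queue/scheduler`, with the active one in square brackets. When Kyber is active, its latency targets are exposed as `read_lat_nsec` and `write_lat_nsec` under `queue/iosched/`. A small sketch for inspecting the scheduler line follows; the sample string and the function names are illustrative.

```python
from pathlib import Path

def parse_schedulers(text):
    """Parse a line like '[mq-deadline] kyber bfq none'.

    Returns (active, available): the bracketed name and the full list.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available

def read_schedulers(device):
    """Read the scheduler line for a device such as 'sda' (Linux sysfs)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())

# Sample line for illustration; real contents depend on kernel and device.
active, available = parse_schedulers("mq-deadline [kyber] bfq none")
print(active, available)
```

Switching scheduler is done by writing a name from the available list back to the same file, which requires root.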
    
slide 22:
    Hist Triggers
    Linux 4.17
    ftrace advanced summaries
    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
    # trigger info: hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
    […]
    { stacktrace:
         __kmalloc+0x11b/0x1b0
         seq_buf_alloc+0x1b/0x50
         seq_read+0x2cc/0x370
         proc_reg_read+0x3d/0x80
         __vfs_read+0x28/0xe0
         vfs_read+0x86/0x140
         SyS_read+0x46/0xb0
         system_call_fastpath+0x12/0x6a
    } hitcount: 19133  bytes_req: 78368768  bytes_alloc:
    https://www.kernel.org/doc/html/latest/trace/histogram.html
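
The "trigger info" line in the output above is the trigger definition itself: colon-separated `keys=`, `vals=`, `sort=`, and `size=` fields after the `hist` keyword. As a rough sketch, such a string can be assembled programmatically; the helper below is hypothetical and only builds the string.

```python
def hist_trigger(keys, vals=(), sort=None, size=None):
    """Build an ftrace hist trigger string from its parts."""
    parts = ["hist", "keys=" + ",".join(keys)]
    if vals:
        parts.append("vals=" + ",".join(vals))
    if sort:
        parts.append("sort=" + sort)
    if size:
        parts.append(f"size={size}")
    return ":".join(parts)

trigger = hist_trigger(
    keys=["stacktrace"],
    vals=["bytes_req", "bytes_alloc"],
    sort="bytes_alloc",
    size=2048,
)
print(trigger)

# To install it (needs root and a mounted tracing filesystem), the string is
# written to the event's trigger file, for example:
#   /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
```

Reading the neighbouring `hist` file then returns the aggregated summary, as in the slide.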
    
slide 23:
    Linux 4.?
    not merged yet
    PSI
    Pressure Stall Information
    More saturation metrics!
    /proc/pressure/cpu
    /proc/pressure/memory
    /proc/pressure/io
    10-, 60-, and 300-second averages
    [Diagram: The USE Method. Labels: Resource, Utilization (%), Saturation, Errors]
    https://lwn.net/Articles/759781/
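
As a sketch of how these files could be consumed, the parser below assumes the line format of the interface as it was later merged in Linux 4.20: a `some` (and, for memory and io, `full`) line carrying `avg10`, `avg60`, `avg300`, and a cumulative `total` in microseconds. The sample values are invented.

```python
def parse_pressure(text):
    """Parse PSI text into {kind: {field: number}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        stats = {}
        for field in fields:
            name, value = field.split("=", 1)
            stats[name] = int(value) if name == "total" else float(value)
        result[kind] = stats
    return result

# Made-up sample in the format of /proc/pressure/memory.
sample = """\
some avg10=1.25 avg60=0.50 avg300=0.10 total=123456
full avg10=0.75 avg60=0.25 avg300=0.05 total=65432
"""
psi = parse_pressure(sample)
print(psi["some"]["avg10"], psi["full"]["total"])
```

On a kernel with PSI enabled, the same function can be fed the contents of `/proc/pressure/cpu`, `/proc/pressure/memory`, or `/proc/pressure/io`.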
    
slide 24:
    More perf 4.4 - 4.19 (2016 - 2018)
    TCP listener lockless (4.4)
    copy_file_range() (4.5)
    madvise() MADV_FREE (4.5)
    epoll multithread scalability (4.5)
    Kernel Connection Multiplexor (4.6)
    Writeback management (4.10)
    Hybrid block polling (4.10)
    BFQ I/O scheduler (4.12)
    Async I/O improvements (4.13)
    In-kernel TLS acceleration (4.13)
    Socket MSG_ZEROCOPY (4.14)
    Asynchronous buffered I/O (4.14)
    Longer-lived TLB entries with PCID (4.14)
    mmap MAP_SYNC (4.15)
    Software-interrupt context hrtimers (4.16)
    Idle loop tick efficiency (4.17)
    perf_event_open() [ku]probes (4.17)
    AF_XDP sockets (4.18)
    Block I/O latency controller (4.19)
    CAKE for bufferbloat (4.19)
    New async I/O polling (4.19)
    … and many minor improvements to:
    perf
    CPU scheduling
    futexes
    NUMA
    Huge pages
    Slab allocation
    TCP, UDP
    Drivers
    Processor support
    GPUs
    
slide 25:
    Take Aways
    1. Run latest
    2. Browse major features
    e.g., https://kernelnewbies.org/Linux_4.19
    
slide 26:
    Some Linux perf Resources
    http://www.brendangregg.com/linuxperf.html
    https://kernelnewbies.org/LinuxChanges
    https://lwn.net/Kernel
    https://github.com/iovisor/bcc
    http://blog.stgolabs.net/search/label/linux
    http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html