illumosday 2012: DTracing the Cloud

Video: http://www.youtube.com/watch?v=uZTMuXXFG38

A talk for illumosday 2012 by Brendan Gregg.

Description: "Cloud computing facilitates rapid deployment and scaling, often pushing high load at applications under continual development. DTrace allows immediate analysis of issues on live production systems even in these demanding environments – no need to restart or run a special debug kernel.

For the illumos kernel, DTrace has been enhanced to support cloud computing, providing more observation capabilities to zones as used by Joyent SmartMachine customers. DTrace is also frequently used by the cloud operators to analyze systems and verify performance isolation of tenants.

This talk covers DTrace in the illumos-based cloud, showing examples of real-world performance wins."


PDF: dtracecloud2012.pdf

Keywords (from pdftotext):

slide 1:
    DTracing the Cloud
    Brendan Gregg
    Lead Performance Engineer
    brendan@joyent.com
    @brendangregg
    October, 2012
    
slide 2:
    DTracing the Cloud
    
slide 3:
    whoami
    • G’Day, I’m Brendan
    • These days I do performance analysis of the cloud
    • I use the right tool for the job; sometimes traditional, often DTrace.
    [figure labels: "Traditional + some DTrace", "All DTrace"]
    
slide 4:
    DTrace
    • DTrace is a magician that conjures up rainbows, ponies and unicorns —
    and does it all entirely safely and in production!
    
slide 5:
    DTrace
    • Or, the version with fewer ponies:
    • DTrace is a performance analysis and troubleshooting tool
    • Instruments all software, kernel and user-land.
    • Production safe. Designed for minimum overhead.
    • Default in SmartOS, Oracle Solaris, Mac OS X and FreeBSD.
    Two Linux ports are in development.
    • There’s a couple of awesome books about it.
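    As a minimal illustration of this kind of instrumentation (not from the
    slides), a one-liner such as the following counts system calls by process
    name until Ctrl-C:
    # dtrace -n 'syscall:::entry { @[execname] = count(); }'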
    
slide 6:
    illumos
    • Joyent’s SmartOS uses (and contributes to) the illumos kernel.
    • illumos is the most DTrace-featured kernel
    • illumos community includes Bryan Cantrill & Adam Leventhal,
    DTrace co-inventors (pictured on right).
    
slide 7:
    Agenda
    • Theory
    • Cloud types and DTrace visibility
    • Reality
    • DTrace and Zones
    • DTrace Wins
    • Tools
    • DTrace Cloud Tools
    • Cloud Analytics
    • Case Studies
    
slide 8:
    Theory
    
slide 9:
    Cloud Types
    • We deploy two types of virtualization on SmartOS/illumos:
    • Hardware Virtualization: KVM
    • OS-Virtualization: Zones
    
slide 10:
    Cloud Types, cont.
    • Both virtualization types can co-exist:
    [diagram: Linux and Windows cloud tenants (Apps, Guest Kernel, Virtual Device
    Drivers) alongside a SmartOS cloud tenant (Apps), all running on the SmartOS
    host kernel]
    
slide 11:
    Cloud Types, cont.
    • KVM
    • Used for Linux and Windows guests
    • Legacy apps
    • Zones
    • Used for SmartOS guests (zones) called SmartMachines
    • Preferred over Linux:
    • Bare-metal performance
    • Less memory overhead
    • Better visibility (debugging)
    • Global Zone == host, Non-Global Zone == guest
    • Also used to encapsulate KVM guests (double-hull security)
    
slide 12:
    Cloud Types, cont.
    • DTrace can be used for:
    • Performance analysis: user- and kernel-level
    • Troubleshooting
    • Specifically, for the cloud:
    • Performance effects of multi-tenancy
    • Effectiveness and troubleshooting of performance isolation
    • Four contexts:
    • KVM host, KVM guest, Zones host, Zones guest
    • FAQ: What can DTrace see in each context?
    
slide 13:
    Hardware Virtualization: DTrace Visibility
    • As the cloud operator (host):
    [diagram: three KVM cloud tenants (Linux, Linux, Windows), each with Apps and a
    Guest Kernel, over Virtual Device Drivers and the SmartOS host kernel; the
    operator's view]
    
slide 14:
    Hardware Virtualization: DTrace Visibility
    • Host can see:
    • Entire host: kernel, apps
    • Guest disk I/O (block-interface-level)
    • Guest network I/O (packets)
    • Guest CPU MMU context register
    • Host can’t see:
    • Guest kernel
    • Guest apps
    • Guest disk/network context (kernel stack)
    • ... unless the guest has DTrace, and access (SSH) is allowed
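    For example (a sketch, not from the slides), the io provider can summarize
    guest disk I/O from the host at the block-interface level; the assumption
    here is that each KVM guest's I/O is attributed to its qemu process:
    # dtrace -n 'io:::start { @[execname, args[1]->dev_statname] = sum(args[0]->b_bcount); }'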
    
slide 15:
    Hardware Virtualization: DTrace Visibility
    • As a tenant (guest):
    [diagram: the same KVM stack, with one tenant running "An OS with DTrace"
    alongside Linux and Windows guests; the tenant's view]
    
slide 16:
    Hardware Virtualization: DTrace Visibility
    • Guest can see:
    • Guest kernel, apps, provided DTrace is available
    • Guest can’t see:
    • Other guests
    • Host kernel, apps
    
slide 17:
    OS Virtualization: DTrace Visibility
    • As the cloud operator (host):
    [diagram: three SmartOS cloud tenants (Apps only, no guest kernels) running
    directly on the SmartOS host kernel; the operator's view]
    
slide 18:
    OS Virtualization: DTrace Visibility
    • Host can see:
    • Entire host: kernel, apps
    • Entire guests: apps
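    A sketch of this host-wide visibility (not from the slides): the zonename
    built-in lets the operator break activity down by tenant, e.g. syscall counts
    by zone and process:
    # dtrace -n 'syscall:::entry { @[zonename, execname] = count(); }'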
    
slide 19:
    OS Virtualization: DTrace Visibility
    • Operators can trivially see the entire cloud
    • Direct visibility from host of all tenant processes
    • Each blob is a tenant. The background shows one entire data
    center (availability zone).
    
slide 20:
    OS Virtualization: DTrace Visibility
    • Zooming in, 1 host, 10 guests:
    • All can be examined with 1 DTrace invocation; don’t need multiple SSH or
    API logins per-guest. Reduces observability framework overhead by a factor
    of 10 (guests/host)
    • This pic was just created from a process snapshot (ps)
    http://dtrace.org/blogs/brendan/2011/10/04/visualizing-the-cloud/
    
slide 21:
    OS Virtualization: DTrace Visibility
    • As a tenant (guest):
    [diagram: three SmartOS cloud tenants (Apps) on the SmartOS host kernel; the
    tenant's view]
    
slide 22:
    OS Virtualization: DTrace Visibility
    • Guest can see:
    • Guest apps
    • Some host kernel (in guest context), as configured by
    DTrace zone privileges
    • Guest can’t see:
    • Other guests
    • Host kernel (in non-guest context), apps
    
slide 23:
    OS Stack DTrace Visibility
    • Entire operating system stack (example):
    Applications
    DBs, all server types, ...
    Virtual Machines
    System Libraries
    System Call Interface
    VFS
    Sockets
    UFS/...
    ZFS
    TCP/UDP
    Volume Managers
    Block Device Interface
    Ethernet
    Device Drivers
    Devices
    
slide 24:
    OS Stack DTrace Visibility
    • Entire operating system stack (example):
    [figure: the same stack as the previous slide, annotated with the user/kernel
    boundary (user: Applications, Virtual Machines, System Libraries; kernel:
    System Call Interface down to Device Drivers and Devices), with DTrace
    spanning both]
    
slide 25:
    Reality
    
slide 26:
    DTrace and Zones
    • DTrace and Zones were developed in parallel for Solaris 10, and
    then integrated.
    • DTrace functionality for the Global Zone (GZ) was added first.
    • This is the host context, and allows operators to use DTrace to
    inspect all tenants.
    • DTrace functionality for the Non-Global Zone (NGZ) was harder,
    and some capabilities were added later (2006):
    • Providers: syscall, pid, profile
    • This is the guest context, and allows customers to use DTrace
    to inspect themselves only (can’t see neighbors).
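    For instance (an illustrative sketch, not from the slides), a customer inside
    their zone can use the pid provider on their own processes; here 12345 is a
    hypothetical PID of one of their apps:
    # dtrace -n 'pid$target:libc::entry { @[probefunc] = count(); }' -p 12345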
    
slide 27:
    DTrace and Zones, cont.
    
slide 28:
    DTrace and Zones, cont.
    • GZ DTrace works well.
    • We found many issues in practice with NGZ DTrace:
    • Can’t read fds[] to translate file descriptors. Makes using the
    syscall provider more difficult.
    # dtrace -n 'syscall::read:entry /fds[arg0].fi_fs == "zfs"/ { @ = quantize(arg2); }'
    dtrace: description 'syscall::read:entry ' matched 1 probe
    dtrace: error on enabled probe ID 1 (ID 4: syscall::read:entry): invalid kernel access in predicate at DIF offset 64
    dtrace: error on enabled probe ID 1 (ID 4: syscall::read:entry): invalid kernel access in predicate at DIF offset 64
    dtrace: error on enabled probe ID 1 (ID 4: syscall::read:entry): invalid kernel access in predicate at DIF offset 64
    dtrace: error on enabled probe ID 1 (ID 4: syscall::read:entry): invalid kernel access in predicate at DIF offset 64
    [...]
    
slide 29:
    DTrace and Zones, cont.
    • Can’t read curpsinfo, curlwpsinfo, which breaks many scripts
    (eg, curpsinfo->pr_psargs, or curpsinfo->pr_dmodel)
    # dtrace -n 'syscall::exec*:return { trace(curpsinfo->pr_psargs); }'
    dtrace: description 'syscall::exec*:return ' matched 1 probe
    dtrace: error on enabled probe ID 1 (ID 103: syscall::exece:return): invalid kernel access in action #1 at DIF offset 0
    dtrace: error on enabled probe ID 1 (ID 103: syscall::exece:return): invalid kernel access in action #1 at DIF offset 0
    dtrace: error on enabled probe ID 1 (ID 103: syscall::exece:return): invalid kernel access in action #1 at DIF offset 0
    dtrace: error on enabled probe ID 1 (ID 103: syscall::exece:return): invalid kernel access in action #1 at DIF offset 0
    [...]
    • Missing proc provider. Breaks this common one-liner:
    # dtrace -n 'proc:::exec-success { trace(execname); }'
    dtrace: invalid probe specifier proc:::exec-success { trace(execname); }:
    probe description proc:::exec-success does not match any probes
    [...]
    
slide 30:
    DTrace and Zones, cont.
    • Missing vminfo, sysinfo, and sched providers.
    • Can’t read cpu built-in.
    • profile probes behave oddly. Eg, profile:::tick-1s only fires if
    tenant is on-CPU at the same time as the probe would fire.
    Makes any script that produces interval-output unreliable.
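    The interval-output pattern this affected looks like the following sketch
    (illustrative, not from the slides): a tick probe prints and truncates an
    aggregation every second, which misbehaves if the tick probe fails to fire:
    # dtrace -n 'syscall:::entry { @[probefunc] = count(); }
        profile:::tick-1s { printa(@); trunc(@); }'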
    
slide 31:
    DTrace and Zones, cont.
    • These and other bugs have since been fixed for SmartOS/illumos
    (thanks Bryan Cantrill!)
    • Now, from a SmartOS Zone:
    # dtrace -n 'proc:::exec-success { @[curpsinfo->pr_psargs] = count(); }
        profile:::tick-5s { exit(0); }'
    dtrace: description 'proc:::exec-success ' matched 2 probes
    CPU     ID                    FUNCTION:NAME
     13  71762                         :tick-5s
    -bash
    /bin/cat -s /etc/motd
    /bin/mail -E
    /usr/bin/hostname
    /usr/sbin/quota
    /usr/bin/locale -a
    ls -l
    sh -c /usr/bin/locale -a
    • Trivial DTrace one-liner, but represents much needed functionality.
    
slide 32:
    DTrace Wins
    • Aside from the NGZ issues, DTrace has worked well in the cloud
    and solved numerous issues. For example (these are mostly from
    operator context):
    • http://dtrace.org/blogs/brendan/2012/08/09/10-performance-wins/
    
slide 33:
    DTrace Wins, cont.
    • Not surprising given DTrace’s visibility...
    
slide 34:
    DTrace Wins, cont.
    • For example, DTrace script counts from the DTrace book:
    [figure: the OS stack from the earlier slide (Applications down to Devices),
    annotated with the number of book scripts per layer; several layers show "10+"]
    
slide 35:
    Tools
    
slide 36:
    Ad-hoc
    • Write DTrace scripts as needed
    • Execute individually on hosts, or,
    • With ad-hoc scripting, execute across all hosts (cloud); see the sketch below
    • My ad-hoc tools include:
    • DTrace Cloud Tools
    • Flame Graphs
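    A rough sketch of the "execute across all hosts" step (not from the slides;
    hosts.txt is a hypothetical list of cloud hosts, and the one-liner is only an
    example payload):
    for host in $(cat hosts.txt); do    # hosts.txt: hypothetical host list
        ssh "$host" "dtrace -qn 'syscall:::entry { @[execname] = count(); } profile:::tick-10s { exit(0); }'" > out."$host" &
    done; wait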
    
slide 37:
    Ad-hoc: DTrace Cloud Tools
    • Contains around 70 ad-hoc DTrace tools that I wrote for operators and
    cloud customers.
    ./fs/metaslab_free.d
    ./fs/spasync.d
    ./fs/zfsdist.d
    ./fs/zfsslower.d
    ./fs/zfsslowzone.d
    ./fs/zfswhozone.d
    ./fs/ziowait.d
    ./mysql/innodb_pid_iolatency.d
    ./mysql/innodb_pid_ioslow.d
    ./mysql/innodb_thread_concurrency.d
    ./mysql/libmysql_pid_connect.d
    ./mysql/libmysql_pid_qtime.d
    ./mysql/libmysql_pid_snoop.d
    ./mysql/mysqld_latency.d
    ./mysql/mysqld_pid_avg.d
    ./mysql/mysqld_pid_filesort.d
    ./mysql/mysqld_pid_fslatency.d
    [...]
    ./net/dnsconnect.d
    ./net/tcp-fbt-accept_sdc5.d
    ./net/tcp-fbt-accept_sdc6.d
    ./net/tcpconnreqmaxq-pid_sdc5.d
    ./net/tcpconnreqmaxq-pid_sdc6.d
    ./net/tcpconnreqmaxq_sdc5.d
    ./net/tcpconnreqmaxq_sdc6.d
    ./net/tcplistendrop_sdc5.d
    ./net/tcplistendrop_sdc6.d
    ./net/tcpretranshosts.d
    ./net/tcpretransport.d
    ./net/tcpretranssnoop_sdc5.d
    ./net/tcpretranssnoop_sdc6.d
    ./net/tcpretransstate.d
    ./net/tcptimewait.d
    ./net/tcptimewaited.d
    ./net/tcptimretransdropsnoop_sdc6.d
    [...]
    • Customer scripts are linked from the “smartmachine” directory
    • https://github.com/brendangregg/dtrace-cloud-tools
    
slide 38:
    Ad-hoc: DTrace Cloud Tools, cont.
    • For example, tcplistendrop.d traces each kernel-dropped SYN due
    to TCP backlog overflow (saturation):
    # ./tcplistendrop.d
    TIME                  SRC-IP  PORT          DST-IP           PORT
    2012 Jan 19 01:22:49          25691   ->    192.192.240.212
    2012 Jan 19 01:22:49          18423   ->    192.192.240.212
    2012 Jan 19 01:22:49          38883   ->    192.192.240.212
    2012 Jan 19 01:22:49          10739   ->    192.192.240.212
    2012 Jan 19 01:22:49          27988   ->    192.192.240.212
    2012 Jan 19 01:22:49          28824   ->    192.192.240.212
    2012 Jan 19 01:22:49          65070   ->    192.192.240.212
    2012 Jan 19 01:22:49          56392   ->    192.192.240.212
    2012 Jan 19 01:22:49          24628   ->    192.192.240.212
    [...]
    • Can explain multi-second client connect latency.
    
slide 39:
    Ad-hoc: DTrace Cloud Tools, cont.
    • tcplistendrop.d processes IP and TCP headers from the in-kernel
    packet buffer:
    fbt::tcp_input_listener:entry  { self->mp = args[1]; }
    fbt::tcp_input_listener:return { self->mp = 0; }

    mib:::tcpListenDrop
    /self->mp/
    {
        this->iph = (ipha_t *)self->mp->b_rptr;
        this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
        printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
            inet_ntoa(&this->iph->ipha_src),
            ntohs(*(uint16_t *)this->tcph->th_lport),
            inet_ntoa(&this->iph->ipha_dst),
            ntohs(*(uint16_t *)this->tcph->th_fport));
    }
    
slide 40:
    Ad-hoc: DTrace Cloud Tools, cont.
    • A related example: tcpconnreqmaxq-pid*.d prints a summary, showing
    backlog lengths (on SYN arrival), the current max, and drops:
    tcp_conn_req_cnt_q distributions:

      cpid:3063     max_q:8
         value ------------- Distribution ------------- count
            -1 |
             0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             1 |

      cpid:11504    max_q:128
         value ------------- Distribution ------------- count
            -1 |
             0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
             1 |@@
             2 |@
             4 |@
             8 |
            16 |
            32 |
            64 |
           128 |
           256 |

    tcpListenDrops:
      cpid:11504    max_q:128
    
slide 41:
    Ad-hoc: Flame Graphs
    • Visualizing CPU time using DTrace profiling and SVG
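    A sketch of the usual recipe (not from the slides), using the stackcollapse.pl
    and flamegraph.pl scripts from the FlameGraph project: sample kernel stacks
    with the profile provider, then fold and render them as an SVG:
    # dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks
    # ./stackcollapse.pl out.stacks > out.folded
    # ./flamegraph.pl out.folded > graph.svg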
    
slide 42:
    Product
    • Cloud observability products including DTrace:
    • Joyent’s Cloud Analytics
    
slide 43:
    Product: Cloud Analytics
    • Syscall latency across the entire cloud, as a heat map!
    
slide 44:
    Product: Cloud Analytics, cont.
    • For operators and cloud customers
    • Observes entire cloud, in real-time
    • Latency focus, including heat maps
    • Instrumentation: DTrace and kstats
    • Front-end: Browser JavaScript
    • Back-end: node.js and C
    
slide 45:
    Product: Cloud Analytics, cont.
    • Creating an instrumentation:
    
slide 46:
    Product: Cloud Analytics, cont.
    • Aggregating data across cloud:
    
slide 47:
    Product: Cloud Analytics, cont.
    • Visualizing data:
    
slide 48:
    Product: Cloud Analytics, cont.
    • By-host breakdowns are essential:
    Switch from cloud to host in one click
    
slide 49:
    Case Studies
    
slide 50:
    Case Studies
    • Slow disks
    • Scheduler
    
slide 51:
    Slow disks
    • Customer complains of poor MySQL performance.
    • Noticed disks are busy via iostat-based monitoring software,
    and have blamed noisy neighbors causing disk I/O contention.
    • Multi-tenancy and performance isolation are common cloud issues
    
slide 52:
    Slow disks, cont.
    • Unix 101
    [diagram: Process -> Syscall Interface -> VFS -> ZFS -> Block Device Interface
    -> Disks]
    
slide 53:
    Slow disks, cont.
    • Unix 101
    [diagram: the same stack; the process-to-syscall path is synchronous, while
    iostat(1) observes the block device level, which is often asynchronous due to
    write buffering and read ahead]
    
slide 54:
    Slow disks, cont.
    • By measuring FS latency in application-synchronous context we
    can either confirm or rule out FS/disk-origin latency.
    • Including expressing FS latency during MySQL query, so that
    the issue can be quantified, and speedup calculated.
    • Ideally, this would be possible from within the SmartMachine, so
    both customer and operator can run the DTrace script. This is
    possible using:
    • pid provider: trace and time MySQL FS functions
    • syscall provider: trace and time read/write syscalls for FS file
    descriptors (hence needing fds[].fi_fs; otherwise cache open()); see the sketch below
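    A sketch of the syscall-provider variant (not from the slides; it assumes
    fds[] is readable in the zone and that this hypothetical script is run with
    -p against the mysqld PID):

    syscall::read:entry,
    syscall::write:entry
    /pid == $target && fds[arg0].fi_fs == "zfs"/
    {
        self->start = timestamp;
    }

    syscall::read:return,
    syscall::write:return
    /self->start/
    {
        @[probefunc] = quantize(timestamp - self->start);
        self->start = 0;
    }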
    
slide 55:
    Slow disks, cont.
    • mysqld_pid_fslatency.d from dtrace-cloud-tools:
    # ./mysqld_pid_fslatency.d -n 'tick-10s { exit(0); }' -p 7357
    Tracing PID 7357... Hit Ctrl-C to end.
    MySQL filesystem I/O: 55824; latency (ns):

      read
         value ------------- Distribution ------------- count
          1024 |
          2048 |@@@@@@@@@@
          4096 |@@@@@@@@@@@@@@@@@
          8192 |@@@@@@@@@@@
         16384 |@@
         32768 |
         65536 |
        131072 |
        262144 |

      write
         value ------------- Distribution ------------- count
          2048 |
          4096 |
          8192 |@@@@@@
         16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@
         32768 |@@@@@
         65536 |@
        131072 |
        262144 |
        524288 |
       1048576 |
       2097152 |
       4194304 |
       8388608 |
      16777216 |
      33554432 |
    
slide 56:
    Slow disks, cont.
    • mysqld_pid_fslatency.d from dtrace-cloud-tools: same output as the previous
    slide, annotated to show that the fast latency mode is DRAM cache hits and
    the slow mode is disk I/O.
    
slide 57:
    Slow disks, cont.
    • mysqld_pid_fslatency.d is about 30 lines of DTrace:

    pid$target::os_file_read:entry,
    pid$target::os_file_write:entry,
    pid$target::my_read:entry,
    pid$target::my_write:entry
    {
        self->start = timestamp;
    }

    pid$target::os_file_read:return  { this->dir = "read"; }
    pid$target::os_file_write:return { this->dir = "write"; }
    pid$target::my_read:return       { this->dir = "read"; }
    pid$target::my_write:return      { this->dir = "write"; }

    pid$target::os_file_read:return,
    pid$target::os_file_write:return,
    pid$target::my_read:return,
    pid$target::my_write:return
    /self->start/
    {
        @time[this->dir] = quantize(timestamp - self->start);
        @num = count();
        self->start = 0;
    }

    dtrace:::END
    {
        printa("MySQL filesystem I/O: %@d; latency (ns):\n", @num);
        printa(@time);
        clear(@time); clear(@num);
    }
    
slide 58:
    Slow disks, cont.
    • mysqld_pid_fslatency.d (the same script as the previous slide), with a
    callout on the entry probes: "Thank you MySQL! If not that easy, try
    syscall with fds[]"
    
slide 59:
    Slow disks, cont.
    • Going for the slam dunk:
    # ./mysqld_pid_fslatency_slowlog.d 29952
    2011 May 16 23:34:00 filesystem I/O during query > 100 ms: query 538 ms, fs 509 ms, 83 I/O
    2011 May 16 23:34:11 filesystem I/O during query > 100 ms: query 342 ms, fs 303 ms, 75 I/O
    2011 May 16 23:34:38 filesystem I/O during query > 100 ms: query 479 ms, fs 471 ms, 44 I/O
    2011 May 16 23:34:58 filesystem I/O during query > 100 ms: query 153 ms, fs 152 ms, 1 I/O
    2011 May 16 23:35:09 filesystem I/O during query > 100 ms: query 383 ms, fs 372 ms, 72 I/O
    2011 May 16 23:36:09 filesystem I/O during query > 100 ms: query 406 ms, fs 344 ms, 109 I/O
    2011 May 16 23:36:44 filesystem I/O during query > 100 ms: query 343 ms, fs 319 ms, 75 I/O
    2011 May 16 23:36:54 filesystem I/O during query > 100 ms: query 196 ms, fs 185 ms, 59 I/O
    2011 May 16 23:37:10 filesystem I/O during query > 100 ms: query 254 ms, fs 209 ms, 83 I/O
    • Shows FS latency as a proportion of query latency
    • mysqld_pid_fslatency_slowlog*.d in dtrace-cloud-tools
    
slide 60:
    Slow disks, cont.
    • The cloud operator can trace kernel internals. Eg, the VFS->ZFS
    interface using zfsslower.d:
    # ./zfsslower.d 10
    TIME                 PROCESS  ms FILE
    2012 Sep 27 13:45:33 zlogin   11 /zones/b8b2464c/var/adm/wtmpx
    2012 Sep 27 13:45:36 bash     14 /zones/b8b2464c/opt/local/bin/zsh
    2012 Sep 27 13:45:58 mysqld   19 /zones/b8b2464c/var/mysql/ibdata1
    2012 Sep 27 13:45:58 mysqld   22 /zones/b8b2464c/var/mysql/ibdata1
    2012 Sep 27 13:46:14 master    6 /zones/b8b2464c/root/opt/local/libexec/postfix/qmgr
    2012 Sep 27 13:46:14 master      /zones/b8b2464c/root/opt/local/etc/postfix/master.cf
    [...]
    • My go-to tool (does all apps). This example showed whether there was any
    VFS-level I/O > 10 ms (arg == 10)
    • Stupidly easy to do
    
slide 61:
    Slow disks, cont.
    • zfs_read() entry -> return; same for zfs_write().
    [...]
    fbt::zfs_read:entry,
    fbt::zfs_write:entry
    {
        self->path = args[0]->v_path;
        self->kb = args[1]->uio_resid / 1024;
        self->start = timestamp;
    }

    fbt::zfs_read:return,
    fbt::zfs_write:return
    /self->start && (timestamp - self->start) >= min_ns/
    {
        this->iotime = (timestamp - self->start) / 1000000;
        this->dir = probefunc == "zfs_read" ? "R" : "W";
        printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
            execname, this->dir, self->kb, this->iotime,
            self->path != NULL ? stringof(self->path) : "<null>");
    }
    [...]
    • zfsslower.d originated from the DTrace book
    
slide 62:
    Slow disks, cont.
    • The operator can use deeper tools as needed. Anywhere in ZFS.
    # dtrace -n 'io:::start { @[stack()] = count(); }'
    dtrace: description 'io:::start ' matched 6 probes
    genunix`ldi_strategy+0x53
    zfs`vdev_disk_io_start+0xcc
    zfs`zio_vdev_io_start+0xab
    zfs`zio_execute+0x88
    zfs`zio_nowait+0x21
    zfs`vdev_mirror_io_start+0xcd
    zfs`zio_vdev_io_start+0x250
    zfs`zio_execute+0x88
    zfs`zio_nowait+0x21
    zfs`arc_read_nolock+0x4f9
    zfs`arc_read+0x96
    zfs`dsl_read+0x44
    zfs`dbuf_read_impl+0x166
    zfs`dbuf_read+0xab
    zfs`dmu_buf_hold_array_by_dnode+0x189
    zfs`dmu_buf_hold_array+0x78
    zfs`dmu_read_uio+0x5c
    zfs`zfs_read+0x1a3
    genunix`fop_read+0x8b
    genunix`read+0x2a7
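    Another example of "anywhere in ZFS" (a sketch in the spirit of the spasync.d
    tool listed earlier, not that tool itself): time each ZFS transaction group
    sync via the fbt provider:
    # dtrace -n 'fbt::spa_sync:entry { self->ts = timestamp; }
        fbt::spa_sync:return /self->ts/ { @["spa_sync (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'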
    
slide 63:
    Slow disks, cont.
    • Cloud Analytics, for either operator or customer, can be used to
    examine the full latency distribution, including outliers:
    Outliers
    This heat map shows FS latency for an entire cloud data center
    
slide 64:
    Slow disks, cont.
    • Found that the customer problem was not disks or FS (99% of the
    time), but was CPU usage during table joins.
    • On Joyent’s IaaS architecture, it’s usually not the disks or
    filesystem; useful to rule that out quickly.
    • Some of the time it is, due to:
    • Bad disks (1000+ms I/O)
    • Controller issues (PERC)
    • Big I/O (how quick is a 40 Mbyte read from cache?)
    • Other tenants (benchmarking!). Much less for us now with ZFS
    I/O throttling (thanks Bill Pijewski), used for disk performance
    isolation in the SmartOS cloud.
    
slide 65:
    Slow disks, cont.
    • Customer resolved real issue
    • Prior to DTrace analysis, they had spent months with poor performance,
    believing the disks were to blame
    
slide 66:
    Kernel scheduler
    • Customer problem: occasional latency outliers
    • Analysis: no smoking gun. No slow I/O or locks, etc. Some random
    dispatcher queue latency, but with CPU headroom.
    $ prstat -mLc 1
       PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
     17930 103       21 7.6 0.0 0.0 0.0  53  16 9.1 57K   1 73K   0 beam.smp/265
     17930 103       20 7.0 0.0 0.0 0.0  57  16 0.4 57K   2 70K   0 beam.smp/264
     17930 103       20 7.4 0.0 0.0 0.0  53  18 1.7 63K   0 78K   0 beam.smp/263
     17930 103       19 6.7 0.0 0.0 0.0  60  14 0.4 52K   0 65K   0 beam.smp/266
     17930 103      2.0 0.7 0.0 0.0 0.0  96 1.6 0.0  6K   0  8K   0 beam.smp/267
     17930 103      1.0 0.9 0.0 0.0 0.0  97 0.9 0.0        0  47   0 beam.smp/280
    [...]
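    Dispatcher queue latency (the prstat LAT column) can also be quantified
    directly with the sched provider; a sketch (not from the slides), run from
    the global zone:

    sched:::enqueue
    {
        ts[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
    }

    sched:::dequeue
    /ts[args[0]->pr_lwpid, args[1]->pr_pid]/
    {
        @["dispq latency (ns)"] = quantize(timestamp -
            ts[args[0]->pr_lwpid, args[1]->pr_pid]);
        ts[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
    }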
    
slide 67:
    Kernel scheduler, cont.
    • Unix 101
    [diagram: threads (R = ready to run, O = on-CPU) waiting on two run queues,
    each feeding a CPU; the scheduler preempts to switch threads]
    
slide 68:
    Kernel scheduler, cont.
    • Unix 102
    • TS (and FSS) check for CPU starvation
    [diagram: a long run queue of ready threads feeding one CPU; threads suffering
    starvation receive a priority promotion]
    
slide 69:
    Kernel scheduler, cont.
    • Experimentation: run 2 CPU-bound threads, 1 CPU
    • Subsecond offset heat maps:
    
slide 70:
    Kernel scheduler, cont.
    • Experimentation: run 2 CPU-bound threads, 1 CPU
    • Subsecond offset heat maps:
    ["THIS SHOULDNT HAPPEN" annotation on the heat map]
    
slide 71:
    Kernel scheduler, cont.
    • Worst case (4 threads 1 CPU), 44 sec dispq latency
    # dtrace -n 'sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; }
        sched:::on-cpu /self->s/ { @["off-cpu (ms)"] =
        lquantize((timestamp - self->s) / 1000000, 0, 100000, 1000); self->s = 0; }'

    off-cpu (ms)
        value ------------- Distribution ------------- count
    
slide 72:
    Kernel scheduler, cont.
    • FSS scheduler class bug:
    • FSS uses a more complex technique to avoid CPU starvation. A
    thread priority could stay high and on-CPU for many seconds
    before the priority is decayed to allow another thread to run.
    • Analyzed (more DTrace) and fixed (thanks Jerry Jelinek)
    • Under (too) high CPU load, your runtime can be bound by how well you get
    scheduled, not by how much work you do
    • Cloud workloads scale fast, hit (new) scheduler issues
    
slide 73:
    Kernel scheduler, cont.
    • Required the operator of the cloud to debug
    • Even if the customer doesn’t have kernel-DTrace access in the
    zone, they still benefit from the cloud provider having access
    • Ask your cloud provider to trace scheduler internals, in case
    you have something similar
    • On Hardware Virtualization, scheduler issues can be terrifying
    
slide 74:
    Kernel scheduler, cont.
    • Each kernel believes it owns the hardware.
    [diagram: three KVM cloud tenants, each with Apps, a Guest Kernel, and two
    VCPUs, over the host kernel and four physical CPUs]
    
slide 75:
    Kernel scheduler, cont.
    • One scheduler:
    [diagram: the same KVM stack as the previous slide]
    
slide 76:
    Kernel scheduler, cont.
    • Many schedulers. Kernel fight!
    [diagram: the same KVM stack as the previous slide]
    
slide 77:
    Kernel scheduler, cont.
    • Had a networking performance issue on KVM; debugged using:
    • Host: DTrace
    • Guests: Prototype DTrace for Linux, SystemTap
    • Took weeks to debug the kernel scheduler interactions and determine the fix
    for an 8x win.
    • Office wall (output from many perf tools, including Flame Graphs):
    
slide 78:
    Thank you!
    • http://dtrace.org/blogs/brendan
    • email brendan@joyent.com
    • twitter @brendangregg
    • Resources:
    • http://www.slideshare.net/bcantrill/dtrace-in-the-nonglobal-zone
    • http://dtrace.org/blogs/dap/2011/07/27/oscon-slides/
    • https://github.com/brendangregg/dtrace-cloud-tools
    • http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
    • http://dtrace.org/blogs/brendan/2012/08/09/10-performance-wins/
    • http://dtrace.org/blogs/brendan/2011/10/04/visualizing-the-cloud/
    • Thanks @dapsays and team for Cloud Analytics, Bryan Cantrill for DTrace
    fixes, @rmustacc for KVM perf war, and @DeirdreS for another great event.