SCaLE10x (2012): Performance Analysis: new tools and concepts from the cloud

Talk delivered at SCaLE10x, Los Angeles 2012, by Brendan Gregg.

Video: http://www.youtube.com/watch?list=PLB0197EF82791B1F3&v=RoTSNVuBpqg

Description: "Cloud Computing introduces new challenges for performance analysis, for both customers and operators of the cloud. Apart from monitoring a scaling environment, issues within a system can be complicated when tenants are competing for the same resources, and are invisible to each other. Other factors include rapidly changing production code and wildly unpredictable traffic surges. For performance analysis in the Joyent public cloud, we use a variety of tools including Dynamic Tracing, which allows us to create custom tools and metrics and to explore new concepts. In this presentation I'll discuss a collection of these tools and the metrics that they measure. While these are DTrace-based, the focus of the talk is on which metrics are proving useful for analyzing real cloud issues."


PDF: SCALE2012_new_tools_concepts_cloud.pdf

Keywords (from pdftotext):

slide 1:
    Performance Analysis:
    new tools and concepts
    from the cloud
    Brendan Gregg
    Lead Performance Engineer, Joyent
    brendan.gregg@joyent.com
    SCaLE10x
    Jan, 2012
    
slide 2:
    whoami
    • I do performance analysis
    • I also write performance tools out of necessity
    • Was Brendan @ Sun Microsystems, Oracle,
    now Joyent
    
slide 3:
    Joyent
    • Cloud computing provider
    • Cloud computing software
    • SmartOS
    • host OS, and guest via OS virtualization
    • Linux, Windows
    • guest via KVM
    
slide 4:
    Agenda
    • Data
    • Example problems & solutions
    • How cloud environments complicate performance
    • Theory
    • Performance analysis
    • Summarize new tools & concepts
    • This talk uses SmartOS and DTrace to illustrate
    concepts that are applicable to most OSes.
    
slide 5:
    Data
    • Example problems:
    • CPU
    • Memory
    • Disk
    • Network
    • Some have neat solutions, some messy, some none
    • This is real world
    • Some I’ve covered before, some I haven’t
    
slide 6:
    CPU
    
slide 7:
    CPU utilization: problem
    • Would like to identify:
    • single or multiple CPUs at 100% utilization
    • average, minimum and maximum CPU utilization
    • CPU utilization balance (tight or loose distribution)
    • time-based characteristics
    changing/bursting? burst interval, burst length
    • For small to large environments
    • entire datacenters or clouds
    
slide 8:
    CPU utilization
    • mpstat(1) has the data. 1 second, 1 server (16 CPUs):
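    • For reference, per-CPU data of this kind comes from output of the
      following form (a minimal sketch, not necessarily the exact invocation used):
      $ mpstat 1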
    
slide 9:
    CPU utilization
    • Scaling to 60 seconds, 1 server:
    
slide 10:
    CPU utilization
    • Scaling to entire datacenter, 60 secs, 5312 CPUs:
    
slide 11:
    CPU utilization
    • Line graphs can solve some problems:
    • x-axis: time, 60 seconds
    • y-axis: utilization
    
slide 12:
    CPU utilization
    • ... but don’t scale well to individual devices
    • 5312 CPUs, each as a line:
    
slide 13:
    CPU utilization
    • Pretty, but scale limited as well:
    
slide 14:
    CPU utilization
    • Utilization as a heat map:
    • x-axis: time, y-axis: utilization
    • z-axis (color): number of CPUs
    
slide 15:
    CPU utilization
    • Available in Cloud Analytics (Joyent)
    • Clicking highlights and shows details; eg, hostname:
    
slide 16:
    CPU utilization
    • Utilization heat map also suitable and used for:
    • disks
    • network interfaces
    • Utilization as a metric can be a bit misleading
    • really a percent busy over a time interval
    • devices may accept more work at 100% busy
    • may not directly relate to performance impact
    
slide 17:
    CPU utilization: summary
    • Data readily available
    • Using a new visualization
    
slide 18:
    CPU usage
    • Given a CPU is hot, what is it doing?
    • Beyond just vmstat’s usr/sys ratio
    • Profiling (sampling at an interval) the program
    counter or stack back trace
    • user-land stack for %usr
    • kernel stack for %sys
    • Many tools can do this to some degree
    • Developer Studios/DTrace/oprofile/...
    
slide 19:
    CPU usage: profiling
    • Frequency count on-CPU user-land stack traces:
    # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
    dtrace: description 'profile-997 ' matched 2 probes
    CPU     ID                    FUNCTION:NAME
      1  75195                        :tick-60s
    [...]
    libc.so.1`__priocntlset+0xa
    libc.so.1`getparam+0x83
    libc.so.1`pthread_getschedparam+0x3c
    libc.so.1`pthread_setschedprio+0x1f
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
    mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    
slide 20:
    CPU usage: profiling
    • Frequency count on-CPU user-land stack traces:
    # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
    dtrace: description 'profile-997 ' matched 2 probes
    CPU     ID                    FUNCTION:NAME
      1  75195                        :tick-60s
    [...]
    libc.so.1`__priocntlset+0xa
    libc.so.1`getparam+0x83
    libc.so.1`pthread_getschedparam+0x3c
    libc.so.1`pthread_setschedprio+0x1f
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    Over 500,000 lines truncated
    mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
    mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    
slide 21:
    CPU usage: profiling data
    
slide 22:
    CPU usage: visualization
    • Visualized as a “Flame Graph”:
    
slide 23:
    CPU usage: Flame Graphs
    • Just some Perl that turns DTrace output into an
    interactive SVG: mouse-over elements for details
    • It’s on github
    • http://github.com/brendangregg/FlameGraph
    • Works on kernel stacks, and both user+kernel
    • Shouldn’t be hard to have it process oprofile, etc.
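    • A typical workflow, assuming the stackcollapse.pl and flamegraph.pl
      scripts from the repository above (script names and options may have changed):
      # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
          @[ustack()] = count(); } tick-60s { exit(0); }' -o out.stacks
      # ./stackcollapse.pl out.stacks > out.folded
      # ./flamegraph.pl out.folded > out.svg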
    
slide 24:
    CPU usage: on the Cloud
    • Flame Graphs were born out of necessity on
    Cloud environments:
    • Perf issues need quick resolution
    (you just got hackernews’d)
    • Everyone is running different versions of everything
    (don’t assume you’ve seen the last of old CPU-hot
    code-path issues that have been fixed)
    
slide 25:
    CPU usage: summary
    • Data can be available
    • For cloud computing: easy for operators to fetch on
    OS virtualized environments; otherwise agent driven,
    and possibly other difficulties (access to CPU
    instrumentation counter-based interrupts)
    • Using a new visualization
    
slide 26:
    CPU latency
    • CPU dispatcher queue latency
    • thread is ready-to-run, and waiting its turn
    • Observable in coarse ways:
    • vmstat’s r
    • high load averages
    • Less coarse, with microstate accounting
    • prstat -mL’s LAT
    • How much is it affecting application performance?
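    • For reference, the LAT column above can be watched per thread with,
      for example (a minimal sketch):
      $ prstat -mLc 1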
    
slide 27:
    CPU latency: zonedispqlat.d
    • Using DTrace to trace kernel scheduler events:
    # ./zonedispqlat.d
    Tracing...
    Note: outliers (> 1 secs) may be artifacts due to the use of scalar globals
    (sorry).
    CPU disp queue latency by zone (ns):
    dbprod-045
    value ------------- Distribution ------------- count
    512 |
    1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    2048 |@@@@@@@@@@
    4096 |@
    8192 |
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    4194304 |
    8388608 |
    16777216 |
    [...]
    
slide 28:
    CPU latency: zonedispqlat.d
    • CPU dispatcher queue latency by zonename
    (zonedispqlat.d), work in progress:
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    dtrace:::BEGIN
    {
        printf("Tracing...\n");
        printf("Note: outliers (> 1 secs) may be artifacts due to the ");
        printf("use of scalar globals (sorry).\n\n");
    }
    sched:::enqueue
    {
        /* scalar global (I don't think this can be thread local) */
        start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
    }
    sched:::dequeue
    /this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
    {
        this->time = timestamp - this->start;
        /* workaround since zonename isn't a member of args[1]... */
        this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
        @[stringof(this->zone)] = quantize(this->time);
        start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
    }
    tick-1sec
    {
        printf("CPU disp queue latency by zone (ns):\n");
        printa(@);
        trunc(@);
    }
    Save timestamp on enqueue; calculate delta on dequeue
    
slide 29:
    CPU latency: zonedispqlat.d
    • Instead of zonename, this could be process name, ...
    • Tracing scheduler enqueue/dequeue events and
    saving timestamps costs CPU overhead
    • they are frequent
    • I’d prefer to only trace dequeue, and reuse the
    existing microstate accounting timestamps
    • but one problem is a clash between unscaled and
    scaled timestamps
    
slide 30:
    CPU latency: on the Cloud
    • With virtualization, you can have:
    high CPU latency with idle CPUs,
    due to an instance consuming its quota
    • OS virtualization
    • not visible in vmstat r
    • is visible as part of prstat -mL’s LAT
    • more kstats recently added to SmartOS
    including nsec_waitrq (total run queue wait by zone)
    • Hardware virtualization
    • vmstat st (stolen)
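    • A quick way to locate those new kstats (a sketch; the exact kstat
      module and statistic names may vary by SmartOS version):
      $ kstat -p | grep nsec_waitrq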
    
slide 31:
    CPU latency: caps
    • CPU cap latency from the host (zonecapslat.d):
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    sched:::cpucaps-sleep
    {
        start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
    }
    sched:::cpucaps-wakeup
    /this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
    {
        this->time = timestamp - this->start;
        /* workaround since zonename isn't a member of args[1]... */
        this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
        @[stringof(this->zone)] = quantize(this->time);
        start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
    }
    tick-1sec
    {
        printf("CPU caps latency by zone (ns):\n");
        printa(@);
        trunc(@);
    }
    
slide 32:
    CPU latency: summary
    • Partial data available
    • New tools/metrics created
    • although current DTrace solutions have overhead;
    we should be able to improve that
    • although, new kstats may be sufficient
    
slide 33:
    Memory
    
slide 34:
    Memory: problem
    • Riak database has endless memory growth.
    • expected 9GB, after two days:
    $ prstat -c 1
    Please wait...
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
     21722 103        43G   40G cpu0    99  -20  72:23:41 2.6% beam.smp/594
     15770 root     7760K  540K sleep               23:28:57 0.9% zoneadmd/5
        95 root               0K sleep                7:37:47 0.2% zpool-zones/166
     12827 root      128M   73M sleep                0:49:36 0.1% node/5
     10319 bgregg     10M 6788K sleep                0:00:00 0.0% sshd/1
     10402 root       22M  288K sleep                0:18:45 0.0% dtrace/1
    [...]
    • Eventually hits paging and terrible performance
    • needing a restart
    • Is this a memory leak? Or application growth?
    
slide 35:
    Memory: scope
    • Identify the subsystem and team responsible
      Subsystem      Team
      Application    Voxer
      Riak           Basho
      Erlang         Ericsson
      SmartOS        Joyent
    
slide 36:
    Memory: heap profiling
    • What is in the heap?
    $ pmap 14719
    14719:   beam.smp
    000000000062D000    2168K r-x--  /opt/riak/erts-5.8.5/bin/beam.smp
    000000000067F000     328K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
    00000001005C0000 4193540K rw---  [ anon ]
    00000002005BE000 4194296K rw---  [ anon ]
    00000004002E2000 4192016K rw---  [ anon ]
    00000004FFFD3000 4193664K rw---  [ anon ]
    00000005FFF91000 4191172K rw---  [ anon ]
    00000006FFF4C000 4194040K rw---  [ anon ]
    00000007FF9EF000 4194028K rw---  [ anon ]
    [...]
                     4188812K rw---  [ anon ]
                      588224K rw---  [ heap ]
    • ... and why does it keep growing?
    • Would like to answer these in production
    • Without restarting apps. Experimentation (backend=mmap,
    other allocators) wasn’t working.
    
slide 37:
    Memory: heap profiling
    • libumem was used for multi-threaded performance
    • libumem == user-land slab allocator
    • detailed observability can be enabled, allowing heap
    profiling and leak detection
    • While designed with speed and production use in
    mind, it still comes with some cost (time and space),
    and isn’t on by default.
    • UMEM_DEBUG=audit
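    • A minimal sketch of enabling that auditing and then checking for leaks
      with mdb (the launch command is hypothetical; auditing adds overhead):
      $ UMEM_DEBUG=audit riak start        # hypothetical launch command
      $ mdb -p <PID>
      > ::findleaks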
    
slide 38:
    Memory: heap profiling
    • libumem provides some default observability
    • Eg, slabs:
    > ::umem_malloc_info
    CACHE            BUFSZ MAXMAL BUFMALLC AVG_MAL MALLOCED OVERHEAD %OVER
    000000000070b028
    000000000070c028
    000000000070f028
    000000000071a028
    000000000071b028
    000000000071e028
    000000000071f028
    [...]
    (remaining numeric columns garbled by pdftotext; %OVER values ranged
    from 0.0% up to 1510.6%)
    
slide 39:
    Memory: heap profiling
    • ... and heap (captured @14GB RSS):
    > ::vmem
    ADDR             NAME                 INUSE    TOTAL  SUCCEED  FAIL
    fffffd7ffebed4a0 sbrk_top
    fffffd7ffebee0a8 sbrk_heap
    fffffd7ffebeecb0 vmem_internal
    fffffd7ffebef8b8 vmem_seg
    fffffd7ffebf04c0 vmem_hash
    fffffd7ffebf10c8 vmem_vmem
    00000000006e7000 umem_internal
    00000000006e8000 umem_cache
    00000000006e9000 umem_hash
    00000000006ea000 umem_log
    00000000006eb000 umem_firewall_va
    00000000006ec000 umem_firewall
    00000000006ed000 umem_oversize
    00000000006f0000 umem_memalign
                     umem_default
    (INUSE/TOTAL values not recovered from pdftotext; SUCCEED/FAIL figures of
    4298117 and 84403 appeared for one arena)
    • The heap is 9 GB (as expected), but sbrk_top total is
    14 GB (equal to RSS). And growing.
    • Are there Gbyte-sized malloc()/free()s?
    
slide 40:
    Memory: malloc() profiling
    # dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 3 probes
    value ------------- Distribution ------------- count
    2 |
    4 |
    8 |@
    16 |@@@@
    32 |@@@@@@@@@
    64 |@@@@@@@@@@@@@@@@@@
    128 |@@@@@@@
    256 |
    512 |
    1024 |
    2048 |
    4096 |
    8192 |
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    • No huge malloc()s, but RSS continues to climb.
    
slide 41:
    Memory: malloc() profiling
    # dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 3 probes
    value ------------- Distribution ------------- count
    2 |
    4 |
    8 |@
    16 |@@@@
    32 |@@@@@@@@@
    64 |@@@@@@@@@@@@@@@@@@
    128 |@@@@@@@
    256 |
    512 |
    1024 |
    2048 |
    4096 |
    8192 |
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    This tool (one-liner) profiles malloc() request sizes
    • No huge malloc()s, but RSS continues to climb.
    
slide 42:
    Memory: heap growth
    • Tracing why the heap grows via brk():
    # dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
    dtrace: description 'syscall::brk:entry ' matched 1 probe
    CPU
    FUNCTION:NAME
    brk:entry
    libc.so.1`_brk_unlocked+0xa
    libumem.so.1`vmem_sbrk_alloc+0x84
    libumem.so.1`vmem_xalloc+0x669
    libumem.so.1`vmem_alloc+0x14f
    libumem.so.1`vmem_xalloc+0x669
    libumem.so.1`vmem_alloc+0x14f
    libumem.so.1`umem_alloc+0x72
    libumem.so.1`malloc+0x59
    libstdc++.so.6.0.14`_Znwm+0x20
    libstdc++.so.6.0.14`_Znam+0x9
    eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
    eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
    eleveldb.so`eleveldb_get+0xd3
    beam.smp`process_main+0x6939
    beam.smp`sched_thread_func+0x1cf
    beam.smp`thr_wrapper+0xbe
    This shows the user-land stack trace for every heap growth
    
slide 43:
    Memory: heap growth
    • More DTrace showed the size of the malloc()s
    causing the brk()s:
    # dtrace -x dynvarsize=4m -n '
    pid$target::malloc:entry { self->size = arg0; }
    syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
    pid$target::malloc:return { self->size = 0; }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 7 probes
    CPU
    FUNCTION:NAME
    brk:entry 8343520 bytes
    brk:entry 8343520 bytes
    [...]
    • These 8 Mbyte malloc()s grew the heap
    • Even though the heap has Gbytes not in use
    • This is starting to look like an OS issue
    
slide 44:
    Memory: allocator internals
    • More tools were created:
    • Show memory entropy (+ malloc - free)
    along with heap growth, over time
    • Show codepath taken for allocations
    compare successful with unsuccessful (heap growth)
    • Show allocator internals: sizes, options, flags
    • And run in the production environment
    • Briefly; tracing frequent allocs does cost overhead
    • Casting light into what was a black box
    
slide 45:
    Memory: allocator internals
    -> _sbrk_grow_aligned
    -> vmem_xalloc
     | vmem_xalloc:entry      umem_oversize
    -> vmem_alloc
    -> vmem_xalloc
     | vmem_xalloc:entry      sbrk_heap
    -> vmem_sbrk_alloc
    -> vmem_alloc
    -> vmem_xalloc
     | vmem_xalloc:entry      sbrk_top
    -> vmem_reap
    
slide 46:
    Memory: solution
    • These new tools and metrics pointed to the
    allocation algorithm “instant fit”
    • Someone had suggested this earlier; the tools provided solid evidence that this
    really was the case here
    • A new version of libumem was built to force use of
    VM_BESTFIT
    • and added by Robert Mustacchi as a tunable:
    UMEM_OPTIONS=allocator=best
    • Customer restarted Riak with new libumem version
    • Problem solved
    
slide 47:
    Memory: on the Cloud
    • With OS virtualization, you can have:
    Paging without scanning
    • paging == swapping blocks with physical storage
    • swapping == swapping entire threads between main
    memory and physical storage
    • Resource control paging is unrelated to the page
    scanner, so, no vmstat scan rate (sr) despite
    anonymous paging
    • More new tools: DTrace sysinfo:::anonpgin by
    process name, zonename
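    • A sketch of that tracing (sysinfo:::anonpgin fires on anonymous page-ins;
    execname and zonename are DTrace built-ins):
    # dtrace -n 'sysinfo:::anonpgin { @[execname, zonename] = count(); }'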
    
slide 48:
    Memory: summary
    • Superficial data available, detailed info not
    • not by default
    • Many new tools were created
    • not easy, but made possible with DTrace
    
slide 49:
    Disk
    
slide 50:
    Disk: problem
    • Application performance issues
    • Disks look busy (iostat)
    • Blame the disks?
    $ iostat -xnz 1
    [...]
                        extended device statistics
        r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
      334.9    ... 15677.2 40484.9  0.0  1.0    ...    ...   1  69 c0t1d0
                        extended device statistics
        r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
      407.1    ... 14595.9 49409.1  0.0  0.8    ...    ...   1  56 c0t1d0
                        extended device statistics
        r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
      438.0    ... 10814.8 53242.1  0.0  0.8    ...    ...   1  57 c0t1d0
    • Many graphical tools are built upon iostat
    
slide 51:
    Disk: on the Cloud
    • Tenants can’t see each other
    • Maybe a neighbor is doing a backup?
    • Maybe a neighbor is running a benchmark?
    • Can’t see their processes (top/prstat)
    • Blame what you can’t see
    
slide 52:
    Disk: VFS
    • Applications usually talk to a file system
    • and are hurt by file system latency
    • Disk I/O can be:
    • unrelated to the application: asynchronous tasks
    • inflated from what the application requested
    • deflated from what the application requested
    • blind to issues caused higher up the kernel stack
    
slide 53:
    Disk: issues with iostat(1)
    • Unrelated:
    • other applications / tenants
    • file system prefetch
    • file system dirty data flushing
    • Inflated:
    • rounded up to the next file system record size
    • extra metadata for on-disk format
    • read-modify-write of RAID5
    
slide 54:
    Disk: issues with iostat(1)
    • Deflated:
    • read caching
    • write buffering
    • Blind:
    • lock contention in the file system
    • CPU usage by the file system
    • file system software bugs
    • file system queue latency
    
slide 55:
    Disk: issues with iostat(1)
    • blind (continued):
    • disk cache flush latency (if your file system does it)
    • file system I/O throttling latency
    • I/O throttling is a new ZFS feature for cloud
    environments
    • adds artificial latency to file system I/O to throttle it
    • added by Bill Pijewski and Jerry Jelinek of Joyent
    
slide 56:
    Disk: file system latency
    • Using DTrace to summarize ZFS read latency:
    $ dtrace -n 'fbt::zfs_read:entry { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ {
    @["ns"] = quantize(timestamp - self->start); self->start = 0; }'
    dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
    value ------------- Distribution ------------- count
    512 |
    1024 |@
    2048 |@@
    4096 |@@@@@@@
    8192 |@@@@@@@@@@@@@@@@@
    16384 |@@@@@@@@@@
    32768 |@
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    4194304 |@@@
    8388608 |@
    16777216 |
    
slide 57:
    Disk: file system latency
    • Using DTrace to summarize ZFS read latency:
    $ dtrace -n 'fbt::zfs_read:entry { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ {
    @["ns"] = quantize(timestamp - self->start); self->start = 0; }'
    dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
    value ------------- Distribution ------------- count
    512 |
    1024 |@
    2048 |@@
    4096 |@@@@@@@
    8192 |@@@@@@@@@@@@@@@@@
    16384 |@@@@@@@@@@
    32768 |@
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    4194304 |@@@
    8388608 |@
    16777216 |
    Cache reads
    Disk reads
    
slide 58:
    Disk: file system latency
    • Tracing zfs events using zfsslower.d:
    # ./zfsslower.d 10
    TIME                  PROCESS   ms FILE
    2011 May 17 01:23:12  mysqld    19 /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
    2011 May 17 01:23:13  mysqld    10 /z01/var/mysql/xxxxx/xxxxx.ibd
    2011 May 17 01:23:33  mysqld    11 /z01/var/mysql/xxxxx/xxxxx.ibd
    2011 May 17 01:23:33  mysqld    10 /z01/var/mysql/xxxxx/xxxxx.ibd
    2011 May 17 01:23:51  httpd     14 /z01/home/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx
    • Argument is the minimum latency in milliseconds
    
slide 59:
    Disk: file system latency
    • Can trace this from other locations too:
    • VFS layer: filter on desired file system types
    • syscall layer: filter on file descriptors for file systems
    • application layer: trace file I/O calls
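    • For the syscall layer, file descriptors can be filtered by file system
      type using the fds[] array (a sketch for illumos/SmartOS DTrace):
      # dtrace -n 'syscall::read:entry /fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }
          syscall::read:return /self->ts/ { @["ns"] = quantize(timestamp - self->ts); self->ts = 0; }'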
    
slide 60:
    Disk: file system latency
    • And using SystemTap:
    # ./vfsrlat.stp
    Tracing... Hit Ctrl-C to end
    [..]
    ext4 (ns):
    value |-------------------------------------------------- count
    256 |
    512 |
    1024 |
    2048 |
    4096 |
    8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16321
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    • Traces vfs.read to vfs.read.return, and gets the FS
    type via: $file->f_path->dentry->d_inode->i_sb->s_type->name
    • Warning: this script has crashed ubuntu/CentOS; I’m told RHEL is better
    
slide 61:
    Disk: file system visualizations
    • File system latency as a heat map (Cloud Analytics):
    • This screenshot shows severe outliers
    
slide 62:
    Disk: file system visualizations
    • Sometimes the heat map is very surprising:
    • This screenshot is from the Oracle ZFS Storage Appliance
    
slide 63:
    Disk: summary
    • Misleading data available
    • New tools/metrics created
    • Latency visualizations
    
slide 64:
    Network
    
slide 65:
    Network: problem
    • TCP SYNs queue in-kernel until they are accept()ed
    • The queue length is the TCP listen backlog
    • may be set in listen()
    • and limited by a system tunable (usually 128)
    • on SmartOS: tcp_conn_req_max_q
    • What if the queue remains full
    • eg, application is overwhelmed with other work,
    • or CPU starved
    • ... and another SYN arrives?
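    • The current system-wide limit can be checked with, for example (a sketch;
      the parameter name is taken from the bullet above):
      # ndd -get /dev/tcp tcp_conn_req_max_q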
    
slide 66:
    Network: TCP listen drops
    • Packet is dropped by the kernel
    • fortunately a counter is bumped:
    $ netstat -s | grep Drop
    tcpTimRetransDrop   =     ...     tcpTimKeepalive     =    2582
    tcpTimKeepaliveProbe=    1594     tcpTimKeepaliveDrop =     ...
    tcpListenDrop       = 3089298     tcpListenDropQ0     =     ...
    tcpHalfOpenDrop     =     ...     tcpOutSackRetrans   = 1400832
    icmpOutDrops        =     ...     icmpOutErrors       =     ...
    sctpTimRetrans      =     ...     sctpTimRetransDrop  =     ...
    sctpTimHearBeatProbe=     ...     sctpTimHearBeatDrop =     ...
    sctpListenDrop      =     ...     sctpInClosed        =     ...
    • Remote host waits, and then retransmits
    • TCP retransmit interval; usually 1 or 3 seconds
    
slide 67:
    Network: predicting drops
    • How do we know if we are close to dropping?
    • An early warning
    
slide 68:
    Network: tcpconnreqmaxq.d
    • DTrace script traces drop events, if they occur:
    # ./tcpconnreqmaxq.d
    Tracing... Hit Ctrl-C to end.
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    [...]
    • ... and when Ctrl-C is hit:
    
slide 69:
    Network: tcpconnreqmaxq.d
    tcp_conn_req_cnt_q distributions:
    cpid:3063
    max_q:8
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
    1 |
    cpid:11504
    max_q:128
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    1 |@@
    2 |@
    4 |@
    8 |
    16 |
    32 |
    64 |
    128 |
    256 |
    tcpListenDrops:
    cpid:11504
    max_q:128
    
slide 70:
    Network: tcpconnreqmaxq.d
    tcp_conn_req_cnt_q distributions:
    (length of queue measured on SYN event)
    cpid:3063
    max_q:8
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
    1 |
    cpid:11504
    max_q:128
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    1 |@@
    2 |@
    4 |@
    8 |
    16 |
    32 |
    64 |
    128 |
    256 |
    tcpListenDrops:
    cpid:11504
    max_q:128
    
slide 71:
    Network: tcplistendrop.d
    • More details can be fetched as needed:
    # ./tcplistendrop.d
    TIME                  SRC-IP             PORT        DST-IP            PORT
    2012 Jan 19 01:22:49  ...               25691   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               18423   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               38883   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               10739   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               27988   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               28824   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               65070   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               56392   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               24628   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               11686   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               34629   ->   192.192.240.212    ...
    [...]
    • Just tracing the drop code-path
    • Don’t need to pay the overhead of sniffing all packets
    
slide 72:
    Network: DTrace code
    • Key code from tcplistendrop.d:
    fbt::tcp_input_listener:entry  { self->mp = args[1]; }
    fbt::tcp_input_listener:return { self->mp = 0; }
    mib:::tcpListenDrop
    /self->mp/
    {
        this->iph = (ipha_t *)self->mp->b_rptr;
        this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
        printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
            inet_ntoa(&this->iph->ipha_src),
            ntohs(*(uint16_t *)this->tcph->th_lport),
            inet_ntoa(&this->iph->ipha_dst),
            ntohs(*(uint16_t *)this->tcph->th_fport));
    }
    • This uses the unstable interface fbt provider
    • a stable tcp provider now exists, which is better for
    more common tasks - like connections by IP
    
slide 73:
    Network: summary
    • For TCP, while many counters are available, they are
    system wide integers
    • Custom tools can show more details
    • addresses and ports
    • kernel state
    • needs kernel access and dynamic tracing
    
slide 74:
    Data Recap
    • Problem types
    • CPU utilization        scalability
    • CPU usage              scalability
    • CPU latency            observability
    • Memory                 observability
    • Disk                   observability
    • Network                observability
    
slide 75:
    Data Recap
    • Problem types, solution types
    • CPU utilization        scalability
    • CPU usage              scalability
    • CPU latency            observability
    • Memory                 observability
    • Disk                   observability
    • Network                observability
    • Solution types included visualizations and metrics
    
slide 76:
    Theory
    
slide 77:
    Performance Analysis
    • Goals
    • Capacity planning
    • Issue analysis
    
slide 78:
    Performance Issues
    • Strategy
    • Step 1: is there a problem?
    • Step 2: which subsystem/team is responsible?
    • Difficult to get past these steps without reliable
    metrics
    
slide 79:
    Problem Space
    • Myths
    • Vendors provide good metrics with good coverage
    • The problem is to line-graph them
    • Realities
    • Metrics can be wrong, incomplete and misleading,
    requiring time and expertise to interpret
    • Line graphs can hide issues
    
slide 80:
    Problem Space
    • Cloud computing confuses matters further:
    • hiding metrics from neighbors
    • throttling performance due to invisible neighbors
    
slide 81:
    Example Problems
    • Included:
    • Understanding utilization across 5,312 CPUs
    • Using disk I/O metrics to explain application
    performance
    • A lack of metrics for memory growth, packet drops, ...
    
slide 82:
    Example Solutions: tools
    • Device utilization heat maps for CPUs
    • Flame graphs for CPU profiling
    • CPU dispatcher queue latency by zone
    • CPU caps latency by zone
    • malloc() size profiling
    • Heap growth stack backtraces
    • File system latency distributions
    • File system latency tracing
    • TCP accept queue length distribution
    • TCP listen drop tracing with details
    
slide 83:
    Key Concepts
    • Visualizations
    • heat maps for device utilization and latency
    • flame graphs
    • Custom metrics often necessary
    • Latency-based for issue analysis
    • If coding isn’t practical/timely, use dynamic tracing
    • Cloud Computing
    • Provide observability (often to show what the problem isn’t)
    • Develop new metrics for resource control effects
    
slide 84:
    DTrace
    • Many problems were only solved thanks to DTrace
    • In the SmartOS cloud environment:
    • The compute node (global zone) can DTrace
    everything (except for KVM guests, for which it has a
    limited view: resource I/O + some MMU events, so far)
    • SmartMachines (zones) have the DTrace syscall,
    profile (their user-land only), pid and USDT providers
    • Joyent Cloud Analytics uses DTrace from the global
    zone to give extended details to customers
    
slide 85:
    Performance
    • The more you know, the more you don’t
    • Hopefully I’ve turned some unknown-unknowns
    into known-unknowns
    
slide 86:
    Thank you
    • Resources:
    • http://dtrace.org/blogs/brendan
    • More CPU utilization visualizations:
    http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/
    • Flame Graphs: http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
    and http://github.com/brendangregg/FlameGraph
    • More iostat(1) & file system latency discussion:
    http://dtrace.org/blogs/brendan/tag/filesystem-2/
    • Cloud Analytics:
    • OSCON slides: http://dtrace.org/blogs/dap/files/2011/07/ca-oscon-data.pdf
    • Joyent: http://joyent.com
    • brendan@joyent.com