SCaLE10x (2012): Performance Analysis: new tools and concepts from the cloud

Talk delivered at SCaLE10x, Los Angeles 2012, by Brendan Gregg.

Video: http://www.youtube.com/watch?list=PLB0197EF82791B1F3&v=RoTSNVuBpqg

Description: "Cloud Computing introduces new challenges for performance analysis, for both customers and operators of the cloud. Apart from monitoring a scaling environment, issues within a system can be complicated when tenants are competing for the same resources, and are invisible to each other. Other factors include rapidly changing production code and wildly unpredictable traffic surges. For performance analysis in the Joyent public cloud, we use a variety of tools including Dynamic Tracing, which allows us to create custom tools and metrics and to explore new concepts. In this presentation I'll discuss a collection of these tools and the metrics that they measure. While these are DTrace-based, the focus of the talk is on which metrics are proving useful for analyzing real cloud issues."


PDF: SCALE2012_new_tools_concepts_cloud.pdf

Keywords (from pdftotext):

slide 1:
    Performance Analysis:
    new tools and concepts
    from the cloud
    Brendan Gregg
    Lead Performance Engineer, Joyent
    brendan.gregg@joyent.com
    SCaLE10x
    Jan, 2012
    
slide 2:
    whoami
    • I do performance analysis
    • I also write performance tools out of necessity
    • Was Brendan @ Sun Microsystems, Oracle,
    now Joyent
    
slide 3:
    Joyent
    • Cloud computing provider
    • Cloud computing software
    • SmartOS
    • host OS, and guest via OS virtualization
    • Linux, Windows
    • guest via KVM
    
slide 4:
    Agenda
    • Data
    • Example problems & solutions
    • How cloud environments complicate performance
    • Theory
    • Performance analysis
    • Summarize new tools & concepts
    • This talk uses SmartOS and DTrace to illustrate
    concepts that are applicable to most OSes.
    
slide 5:
    Data
    • Example problems:
    • CPU
    • Memory
    • Disk
    • Network
    • Some have neat solutions, some messy, some none
    • This is real world
    • Some I’ve covered before, some I haven’t
    
slide 6:
    CPU
    
slide 7:
    CPU utilization: problem
    • Would like to identify:
    • single or multiple CPUs at 100% utilization
    • average, minimum and maximum CPU utilization
    • CPU utilization balance (tight or loose distribution)
    • time-based characteristics
    changing/bursting? burst interval, burst length
    • For small to large environments
    • entire datacenters or clouds
    
slide 8:
    CPU utilization
    • mpstat(1) has the data. 1 second, 1 server (16 CPUs):
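    • For reference, per-CPU data of this kind comes from output of the
      following form (a minimal sketch, not necessarily the exact invocation used):
      $ mpstat 1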
    
slide 9:
    CPU utilization
    • Scaling to 60 seconds, 1 server:
    
slide 10:
    CPU utilization
    • Scaling to entire datacenter, 60 secs, 5312 CPUs:
    
slide 11:
    CPU utilization
    • Line graphs can solve some problems:
    • x-axis: time, 60 seconds
    • y-axis: utilization
    
slide 12:
    CPU utilization
    • ... but don’t scale well to individual devices
    • 5312 CPUs, each as a line:
    
slide 13:
    CPU utilization
    • Pretty, but scale limited as well:
    
slide 14:
    CPU utilization
    • Utilization as a heat map:
    • x-axis: time, y-axis: utilization
    • z-axis (color): number of CPUs
    
slide 15:
    CPU utilization
    • Available in Cloud Analytics (Joyent)
    • Clicking highlights and shows details; eg, hostname:
    
slide 16:
    CPU utilization
    • Utilization heat map also suitable and used for:
    • disks
    • network interfaces
    • Utilization as a metric can be a bit misleading
    • really a percent busy over a time interval
    • devices may accept more work at 100% busy
    • may not directly relate to performance impact
    
slide 17:
    CPU utilization: summary
    • Data readily available
    • Using a new visualization
    
slide 18:
    CPU usage
    • Given a CPU is hot, what is it doing?
    • Beyond just vmstat’s usr/sys ratio
    • Profiling (sampling at an interval) the program
    counter or stack back trace
    • user-land stack for %usr
    • kernel stack for %sys
    • Many tools can do this to some degree
    • Developer Studios/DTrace/oprofile/...
    
slide 19:
    CPU usage: profiling
    • Frequency count on-CPU user-land stack traces:
    # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
    dtrace: description 'profile-997 ' matched 2 probes
    CPU     ID                    FUNCTION:NAME
      1  75195                        :tick-60s
    [...]
    libc.so.1`__priocntlset+0xa
    libc.so.1`getparam+0x83
    libc.so.1`pthread_getschedparam+0x3c
    libc.so.1`pthread_setschedprio+0x1f
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
    mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    
slide 20:
    CPU usage: profiling
    • Frequency count on-CPU user-land stack traces:
    # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
    dtrace: description 'profile-997 ' matched 2 probes
    CPU     ID                    FUNCTION:NAME
      1  75195                        :tick-60s
    [...]
    libc.so.1`__priocntlset+0xa
    libc.so.1`getparam+0x83
    libc.so.1`pthread_getschedparam+0x3c
    libc.so.1`pthread_setschedprio+0x1f
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    Over 500,000 lines truncated
    mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
    mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
    mysqld`_Z10do_commandP3THD+0x198
    mysqld`handle_one_connection+0x1a6
    libc.so.1`_thrp_setup+0x8d
    libc.so.1`_lwp_start
    
slide 21:
    CPU usage: profiling data
    
slide 22:
    CPU usage: visualization
    • Visualized as a “Flame Graph”:
    
slide 23:
    CPU usage: Flame Graphs
    • Just some Perl that turns DTrace output into an
    interactive SVG: mouse-over elements for details
    • It’s on github
    • http://github.com/brendangregg/FlameGraph
    • Works on kernel stacks, and both user+kernel
    • Shouldn’t be hard to have it process oprofile, etc.
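    • A typical workflow, assuming the stackcollapse.pl and flamegraph.pl
      scripts from the repository above (script names and options may have changed):
      # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
          @[ustack()] = count(); } tick-60s { exit(0); }' -o out.stacks
      # ./stackcollapse.pl out.stacks > out.folded
      # ./flamegraph.pl out.folded > out.svg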
    
slide 24:
    CPU usage: on the Cloud
    • Flame Graphs were born out of necessity on
    Cloud environments:
    • Perf issues need quick resolution
    (you just got hackernews’d)
    • Everyone is running different versions of everything
    (don’t assume you’ve seen the last of old CPU-hot
    code-path issues that have been fixed)
    
slide 25:
    CPU usage: summary
    • Data can be available
    • For cloud computing: easy for operators to fetch on
    OS virtualized environments; otherwise agent driven,
    and possibly other difficulties (access to CPU
    instrumentation counter-based interrupts)
    • Using a new visualization
    
slide 26:
    CPU latency
    • CPU dispatcher queue latency
    • thread is ready-to-run, and waiting its turn
    • Observable in coarse ways:
    • vmstat’s r
    • high load averages
    • Less coarse, with microstate accounting
    • prstat -mL’s LAT
    • How much is it affecting application performance?
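    • For reference, the LAT column above can be watched per thread with,
      for example (a minimal sketch):
      $ prstat -mLc 1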
    
slide 27:
    CPU latency: zonedispqlat.d
    • Using DTrace to trace kernel scheduler events:
    # ./zonedispqlat.d
    Tracing...
    Note: outliers (> 1 secs) may be artifacts due to the use of scalar globals
    (sorry).
    CPU disp queue latency by zone (ns):
    dbprod-045
    value ------------- Distribution ------------- count
    512 |
    1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    2048 |@@@@@@@@@@
    4096 |@
    8192 |
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    4194304 |
    8388608 |
    16777216 |
    [...]
    
slide 28:
    CPU latency: zonedispqlat.d
    • CPU dispatcher queue latency by zonename
    (zonedispqlat.d), work in progress:
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    dtrace:::BEGIN
    {
        printf("Tracing...\n");
        printf("Note: outliers (> 1 secs) may be artifacts due to the ");
        printf("use of scalar globals (sorry).\n\n");
    }
    sched:::enqueue
    {
        /* scalar global (I don't think this can be thread local) */
        start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
    }
    sched:::dequeue
    /this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
    {
        this->time = timestamp - this->start;
        /* workaround since zonename isn't a member of args[1]... */
        this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
        @[stringof(this->zone)] = quantize(this->time);
        start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
    }
    tick-1sec
    {
        printf("CPU disp queue latency by zone (ns):\n");
        printa(@);
        trunc(@);
    }
    Save timestamp on enqueue; calculate delta on dequeue
    
slide 29:
    CPU latency: zonedispqlat.d
    • Instead of zonename, this could be process name, ...
    • Tracing scheduler enqueue/dequeue events and
    saving timestamps costs CPU overhead
    • they are frequent
    • I’d prefer to only trace dequeue, and reuse the
    existing microstate accounting timestamps
    • but one problem is a clash between unscaled and
    scaled timestamps
    
slide 30:
    CPU latency: on the Cloud
    • With virtualization, you can have:
    high CPU latency with idle CPUs,
    due to an instance consuming its quota
    • OS virtualization
    • not visible in vmstat r
    • is visible as part of prstat -mL’s LAT
    • more kstats recently added to SmartOS
    including nsec_waitrq (total run queue wait by zone)
    • Hardware virtualization
    • vmstat st (stolen)
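    • A quick way to locate those new kstats (a sketch; the exact kstat
      module and statistic names may vary by SmartOS version):
      $ kstat -p | grep nsec_waitrq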
    
slide 31:
    CPU latency: caps
    • CPU cap latency from the host (zonecapslat.d):
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    sched:::cpucaps-sleep
    {
        start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
    }
    sched:::cpucaps-wakeup
    /this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
    {
        this->time = timestamp - this->start;
        /* workaround since zonename isn't a member of args[1]... */
        this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
        @[stringof(this->zone)] = quantize(this->time);
        start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
    }
    tick-1sec
    {
        printf("CPU caps latency by zone (ns):\n");
        printa(@);
        trunc(@);
    }
    
slide 32:
    CPU latency: summary
    • Partial data available
    • New tools/metrics created
    • although current DTrace solutions have overhead;
    we should be able to improve that
    • although, new kstats may be sufficient
    
slide 33:
    Memory
    
slide 34:
    Memory: problem
    • Riak database has endless memory growth.
    • expected 9GB, after two days:
    $ prstat -c 1
    Please wait...
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
     21722 103        43G   40G cpu0    99  -20  72:23:41 2.6% beam.smp/594
     15770 root     7760K  540K sleep               23:28:57 0.9% zoneadmd/5
        95 root               0K sleep                7:37:47 0.2% zpool-zones/166
     12827 root      128M   73M sleep                0:49:36 0.1% node/5
     10319 bgregg     10M 6788K sleep                0:00:00 0.0% sshd/1
     10402 root       22M  288K sleep                0:18:45 0.0% dtrace/1
    [...]
    • Eventually hits paging and terrible performance
    • needing a restart
    • Is this a memory leak? Or application growth?
    
slide 35:
    Memory: scope
    • Identify the subsystem and team responsible
      Subsystem      Team
      Application    Voxer
      Riak           Basho
      Erlang         Ericsson
      SmartOS        Joyent
    
slide 36:
    Memory: heap profiling
    • What is in the heap?
    $ pmap 14719
    14719:   beam.smp
    000000000062D000    2168K r-x--  /opt/riak/erts-5.8.5/bin/beam.smp
    000000000067F000     328K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
    00000001005C0000 4193540K rw---  [ anon ]
    00000002005BE000 4194296K rw---  [ anon ]
    00000004002E2000 4192016K rw---  [ anon ]
    00000004FFFD3000 4193664K rw---  [ anon ]
    00000005FFF91000 4191172K rw---  [ anon ]
    00000006FFF4C000 4194040K rw---  [ anon ]
    00000007FF9EF000 4194028K rw---  [ anon ]
    [...]
                     4188812K rw---  [ anon ]
                      588224K rw---  [ heap ]
    • ... and why does it keep growing?
    • Would like to answer these in production
    • Without restarting apps. Experimentation (backend=mmap,
    other allocators) wasn’t working.
    
slide 37:
    Memory: heap profiling
    • libumem was used for multi-threaded performance
    • libumem == user-land slab allocator
    • detailed observability can be enabled, allowing heap
    profiling and leak detection
    • While designed with speed and production use in
    mind, it still comes with some cost (time and space),
    and isn’t on by default.
    • UMEM_DEBUG=audit
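    • A minimal sketch of enabling that auditing and then checking for leaks
      with mdb (the launch command is hypothetical; auditing adds overhead):
      $ UMEM_DEBUG=audit riak start        # hypothetical launch command
      $ mdb -p <PID>
      > ::findleaks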
    
slide 38:
    Memory: heap profiling
    • libumem provides some default observability
    • Eg, slabs:
    > ::umem_malloc_info
    CACHE            BUFSZ MAXMAL BUFMALLC AVG_MAL MALLOCED OVERHEAD %OVER
    000000000070b028
    000000000070c028
    000000000070f028
    000000000071a028
    000000000071b028
    000000000071e028
    000000000071f028
    [...]
    (remaining numeric columns garbled by pdftotext; %OVER values ranged
    from 0.0% up to 1510.6%)
    
slide 39:
    Memory: heap profiling
    • ... and heap (captured @14GB RSS):
    > ::vmem
    ADDR             NAME                 INUSE    TOTAL  SUCCEED  FAIL
    fffffd7ffebed4a0 sbrk_top
    fffffd7ffebee0a8 sbrk_heap
    fffffd7ffebeecb0 vmem_internal
    fffffd7ffebef8b8 vmem_seg
    fffffd7ffebf04c0 vmem_hash
    fffffd7ffebf10c8 vmem_vmem
    00000000006e7000 umem_internal
    00000000006e8000 umem_cache
    00000000006e9000 umem_hash
    00000000006ea000 umem_log
    00000000006eb000 umem_firewall_va
    00000000006ec000 umem_firewall
    00000000006ed000 umem_oversize
    00000000006f0000 umem_memalign
                     umem_default
    (INUSE/TOTAL values not recovered from pdftotext; SUCCEED/FAIL figures of
    4298117 and 84403 appeared for one arena)
    • The heap is 9 GB (as expected), but sbrk_top total is
    14 GB (equal to RSS). And growing.
    • Are there Gbyte-sized malloc()/free()s?
    
slide 40:
    Memory: malloc() profiling
    # dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 3 probes
    value ------------- Distribution ------------- count
    2 |
    4 |
    8 |@
    16 |@@@@
    32 |@@@@@@@@@
    64 |@@@@@@@@@@@@@@@@@@
    128 |@@@@@@@
    256 |
    512 |
    1024 |
    2048 |
    4096 |
    8192 |
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    • No huge malloc()s, but RSS continues to climb.
    
slide 41:
    Memory: malloc() profiling
    # dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 3 probes
    value ------------- Distribution ------------- count
    2 |
    4 |
    8 |@
    16 |@@@@
    32 |@@@@@@@@@
    64 |@@@@@@@@@@@@@@@@@@
    128 |@@@@@@@
    256 |
    512 |
    1024 |
    2048 |
    4096 |
    8192 |
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    This tool (one-liner) profiles malloc() request sizes
    • No huge malloc()s, but RSS continues to climb.
    
slide 42:
    Memory: heap growth
    • Tracing why the heap grows via brk():
    # dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
    dtrace: description 'syscall::brk:entry ' matched 1 probe
    CPU
    FUNCTION:NAME
    brk:entry
    libc.so.1`_brk_unlocked+0xa
    libumem.so.1`vmem_sbrk_alloc+0x84
    libumem.so.1`vmem_xalloc+0x669
    libumem.so.1`vmem_alloc+0x14f
    libumem.so.1`vmem_xalloc+0x669
    libumem.so.1`vmem_alloc+0x14f
    libumem.so.1`umem_alloc+0x72
    libumem.so.1`malloc+0x59
    libstdc++.so.6.0.14`_Znwm+0x20
    libstdc++.so.6.0.14`_Znam+0x9
    eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
    eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
    eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
    eleveldb.so`eleveldb_get+0xd3
    beam.smp`process_main+0x6939
    beam.smp`sched_thread_func+0x1cf
    beam.smp`thr_wrapper+0xbe
    This shows the user-land stack trace for every heap growth
    
slide 43:
    Memory: heap growth
    • More DTrace showed the size of the malloc()s
    causing the brk()s:
    # dtrace -x dynvarsize=4m -n '
    pid$target::malloc:entry { self->size = arg0; }
    syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
    pid$target::malloc:return { self->size = 0; }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 7 probes
    CPU
    FUNCTION:NAME
    brk:entry 8343520 bytes
    brk:entry 8343520 bytes
    [...]
    • These 8 Mbyte malloc()s grew the heap
    • Even though the heap has Gbytes not in use
    • This is starting to look like an OS issue
    
slide 44:
    Memory: allocator internals
    • More tools were created:
    • Show memory entropy (+ malloc - free)
    along with heap growth, over time
    • Show codepath taken for allocations
    compare successful with unsuccessful (heap growth)
    • Show allocator internals: sizes, options, flags
    • And run in the production environment
    • Briefly; tracing frequent allocs does cost overhead
    • Casting light into what was a black box
    
slide 45:
    Memory: allocator internals
    -> _sbrk_grow_aligned
    -> vmem_xalloc
     | vmem_xalloc:entry      umem_oversize
    -> vmem_alloc
    -> vmem_xalloc
     | vmem_xalloc:entry      sbrk_heap
    -> vmem_sbrk_alloc
    -> vmem_alloc
    -> vmem_xalloc
     | vmem_xalloc:entry      sbrk_top
    -> vmem_reap
    
slide 46:
    Memory: solution
    • These new tools and metrics pointed to the
    allocation algorithm “instant fit”
    • Someone had suggested this earlier; the tools provided solid evidence that this
    really was the case here
    • A new version of libumem was built to force use of
    VM_BESTFIT
    • and added by Robert Mustacchi as a tunable:
    UMEM_OPTIONS=allocator=best
    • Customer restarted Riak with new libumem version
    • Problem solved
    
slide 47:
    Memory: on the Cloud
    • With OS virtualization, you can have:
    Paging without scanning
    • paging == swapping blocks with physical storage
    • swapping == swapping entire threads between main
    memory and physical storage
    • Resource control paging is unrelated to the page
    scanner, so, no vmstat scan rate (sr) despite
    anonymous paging
    • More new tools: DTrace sysinfo:::anonpgin by
    process name, zonename
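    • A sketch of that tracing (sysinfo:::anonpgin fires on anonymous page-ins;
    execname and zonename are DTrace built-ins):
    # dtrace -n 'sysinfo:::anonpgin { @[execname, zonename] = count(); }'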
    
slide 48:
    Memory: summary
    • Superficial data available, detailed info not
    • not by default
    • Many new tools were created
    • not easy, but made possible with DTrace
    
slide 49:
    Disk
    
slide 50:
    Disk: problem
    • Application performance issues
    • Disks look busy (iostat)
    • Blame the disks?
    $ iostat -xnz 1
    [...]
                        extended device statistics
        r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
      334.9    ... 15677.2 40484.9  0.0  1.0    ...    ...   1  69 c0t1d0
                        extended device statistics
        r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
      407.1    ... 14595.9 49409.1  0.0  0.8    ...    ...   1  56 c0t1d0
                        extended device statistics
        r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
      438.0    ... 10814.8 53242.1  0.0  0.8    ...    ...   1  57 c0t1d0
    • Many graphical tools are built upon iostat
    
slide 51:
    Disk: on the Cloud
    • Tenants can’t see each other
    • Maybe a neighbor is doing a backup?
    • Maybe a neighbor is running a benchmark?
    • Can’t see their processes (top/prstat)
    • Blame what you can’t see
    
slide 52:
    Disk: VFS
    • Applications usually talk to a file system
    • and are hurt by file system latency
    • Disk I/O can be:
    • unrelated to the application: asynchronous tasks
    • inflated from what the application requested
    • deflated from what the application requested
    • blind to issues caused higher up the kernel stack
    
slide 53:
    Disk: issues with iostat(1)
    • Unrelated:
    • other applications / tenants
    • file system prefetch
    • file system dirty data flushing
    • Inflated:
    • rounded up to the next file system record size
    • extra metadata for on-disk format
    • read-modify-write of RAID5
    
slide 54:
    Disk: issues with iostat(1)
    • Deflated:
    • read caching
    • write buffering
    • Blind:
    • lock contention in the file system
    • CPU usage by the file system
    • file system software bugs
    • file system queue latency
    
slide 55:
    Disk: issues with iostat(1)
    • blind (continued):
    • disk cache flush latency (if your file system does it)
    • file system I/O throttling latency
    • I/O throttling is a new ZFS feature for cloud
    environments
    • adds artificial latency to file system I/O to throttle it
    • added by Bill Pijewski and Jerry Jelinek of Joyent
    
slide 56:
    Disk: file system latency
    • Using DTrace to summarize ZFS read latency:
    $ dtrace -n 'fbt::zfs_read:entry { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ {
    @["ns"] = quantize(timestamp - self->start); self->start = 0; }'
    dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
    value ------------- Distribution ------------- count
    512 |
    1024 |@
    2048 |@@
    4096 |@@@@@@@
    8192 |@@@@@@@@@@@@@@@@@
    16384 |@@@@@@@@@@
    32768 |@
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    4194304 |@@@
    8388608 |@
    16777216 |
    
slide 57:
    Disk: file system latency
    • Using DTrace to summarize ZFS read latency:
    $ dtrace -n 'fbt::zfs_read:entry { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ {
    @["ns"] = quantize(timestamp - self->start); self->start = 0; }'
    dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
    value ------------- Distribution ------------- count
    512 |
    1024 |@
    2048 |@@
    4096 |@@@@@@@
    8192 |@@@@@@@@@@@@@@@@@
    16384 |@@@@@@@@@@
    32768 |@
    65536 |
    131072 |
    262144 |
    524288 |
    1048576 |
    2097152 |
    4194304 |@@@
    8388608 |@
    16777216 |
    Cache reads
    Disk reads
    
slide 58:
    Disk: file system latency
    • Tracing zfs events using zfsslower.d:
    # ./zfsslower.d 10
    TIME                  PROCESS   ms FILE
    2011 May 17 01:23:12  mysqld    19 /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
    2011 May 17 01:23:13  mysqld    10 /z01/var/mysql/xxxxx/xxxxx.ibd
    2011 May 17 01:23:33  mysqld    11 /z01/var/mysql/xxxxx/xxxxx.ibd
    2011 May 17 01:23:33  mysqld    10 /z01/var/mysql/xxxxx/xxxxx.ibd
    2011 May 17 01:23:51  httpd     14 /z01/home/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx
    • Argument is the minimum latency in milliseconds
    
slide 59:
    Disk: file system latency
    • Can trace this from other locations too:
    • VFS layer: filter on desired file system types
    • syscall layer: filter on file descriptors for file systems
    • application layer: trace file I/O calls
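    • For the syscall layer, file descriptors can be filtered by file system
      type using the fds[] array (a sketch for illumos/SmartOS DTrace):
      # dtrace -n 'syscall::read:entry /fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }
          syscall::read:return /self->ts/ { @["ns"] = quantize(timestamp - self->ts); self->ts = 0; }'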
    
slide 60:
    Disk: file system latency
    • And using SystemTap:
    # ./vfsrlat.stp
    Tracing... Hit Ctrl-C to end
    [..]
    ext4 (ns):
    value |-------------------------------------------------- count
    256 |
    512 |
    1024 |
    2048 |
    4096 |
    8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16321
    16384 |
    32768 |
    65536 |
    131072 |
    262144 |
    • Traces vfs.read to vfs.read.return, and gets the FS
    type via: $file->f_path->dentry->d_inode->i_sb->s_type->name
    • Warning: this script has crashed ubuntu/CentOS; I’m told RHEL is better
    
slide 61:
    Disk: file system visualizations
    • File system latency as a heat map (Cloud Analytics):
    • This screenshot shows severe outliers
    
slide 62:
    Disk: file system visualizations
    • Sometimes the heat map is very surprising:
    • This screenshot is from the Oracle ZFS Storage Appliance
    
slide 63:
    Disk: summary
    • Misleading data available
    • New tools/metrics created
    • Latency visualizations
    
slide 64:
    Network
    
slide 65:
    Network: problem
    • TCP SYNs queue in-kernel until they are accept()ed
    • The queue length is the TCP listen backlog
    • may be set in listen()
    • and limited by a system tunable (usually 128)
    • on SmartOS: tcp_conn_req_max_q
    • What if the queue remains full
    • eg, application is overwhelmed with other work,
    • or CPU starved
    • ... and another SYN arrives?
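    • The current system-wide limit can be checked with, for example (a sketch;
      the parameter name is taken from the bullet above):
      # ndd -get /dev/tcp tcp_conn_req_max_q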
    
slide 66:
    Network: TCP listen drops
    • Packet is dropped by the kernel
    • fortunately a counter is bumped:
    $ netstat -s | grep Drop
    tcpTimRetransDrop   =     ...     tcpTimKeepalive     =    2582
    tcpTimKeepaliveProbe=    1594     tcpTimKeepaliveDrop =     ...
    tcpListenDrop       = 3089298     tcpListenDropQ0     =     ...
    tcpHalfOpenDrop     =     ...     tcpOutSackRetrans   = 1400832
    icmpOutDrops        =     ...     icmpOutErrors       =     ...
    sctpTimRetrans      =     ...     sctpTimRetransDrop  =     ...
    sctpTimHearBeatProbe=     ...     sctpTimHearBeatDrop =     ...
    sctpListenDrop      =     ...     sctpInClosed        =     ...
    • Remote host waits, and then retransmits
    • TCP retransmit interval; usually 1 or 3 seconds
    
slide 67:
    Network: predicting drops
    • How do we know if we are close to dropping?
    • An early warning
    
slide 68:
    Network: tcpconnreqmaxq.d
    • DTrace script traces drop events, if they occur:
    # ./tcpconnreqmaxq.d
    Tracing... Hit Ctrl-C to end.
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
    [...]
    • ... and when Ctrl-C is hit:
    
slide 69:
    Network: tcpconnreqmaxq.d
    tcp_conn_req_cnt_q distributions:
    cpid:3063
    max_q:8
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
    1 |
    cpid:11504
    max_q:128
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    1 |@@
    2 |@
    4 |@
    8 |
    16 |
    32 |
    64 |
    128 |
    256 |
    tcpListenDrops:
    cpid:11504
    max_q:128
    
slide 70:
    Network: tcpconnreqmaxq.d
    tcp_conn_req_cnt_q distributions:
    (length of queue measured on SYN event)
    cpid:3063
    max_q:8
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
    1 |
    cpid:11504
    max_q:128
    value ------------- Distribution ------------- count
    -1 |
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    1 |@@
    2 |@
    4 |@
    8 |
    16 |
    32 |
    64 |
    128 |
    256 |
    tcpListenDrops:
    cpid:11504
    max_q:128
    
slide 71:
    Network: tcplistendrop.d
    • More details can be fetched as needed:
    # ./tcplistendrop.d
    TIME                  SRC-IP             PORT        DST-IP            PORT
    2012 Jan 19 01:22:49  ...               25691   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               18423   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               38883   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               10739   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               27988   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               28824   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               65070   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               56392   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               24628   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               11686   ->   192.192.240.212    ...
    2012 Jan 19 01:22:49  ...               34629   ->   192.192.240.212    ...
    [...]
    • Just tracing the drop code-path
    • Don’t need to pay the overhead of sniffing all packets
    
slide 72:
    Network: DTrace code
    • Key code from tcplistendrop.d:
    fbt::tcp_input_listener:entry  { self->mp = args[1]; }
    fbt::tcp_input_listener:return { self->mp = 0; }
    mib:::tcpListenDrop
    /self->mp/
    {
        this->iph = (ipha_t *)self->mp->b_rptr;
        this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
        printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
            inet_ntoa(&this->iph->ipha_src),
            ntohs(*(uint16_t *)this->tcph->th_lport),
            inet_ntoa(&this->iph->ipha_dst),
            ntohs(*(uint16_t *)this->tcph->th_fport));
    }
    • This uses the unstable interface fbt provider
    • a stable tcp provider now exists, which is better for
    more common tasks - like connections by IP
    
slide 73:
    Network: summary
    • For TCP, while many counters are available, they are
    system wide integers
    • Custom tools can show more details
    • addresses and ports
    • kernel state
    • needs kernel access and dynamic tracing
    
slide 74:
    Data Recap
    • Problem types
    • CPU utilization        scalability
    • CPU usage              scalability
    • CPU latency            observability
    • Memory                 observability
    • Disk                   observability
    • Network                observability
    
slide 75:
    Data Recap
    • Problem types, solution types
    • CPU utilization        scalability
    • CPU usage              scalability
    • CPU latency            observability
    • Memory                 observability
    • Disk                   observability
    • Network                observability
    • Solution types included visualizations and metrics
    
slide 76:
    Theory
    
slide 77:
    Performance Analysis
    • Goals
    • Capacity planning
    • Issue analysis
    
slide 78:
    Performance Issues
    • Strategy
    • Step 1: is there a problem?
    • Step 2: which subsystem/team is responsible?
    • Difficult to get past these steps without reliable
    metrics
    
slide 79:
    Problem Space
    • Myths
    • Vendors provide good metrics with good coverage
    • The problem is to line-graph them
    • Realities
    • Metrics can be wrong, incomplete and misleading,
    requiring time and expertise to interpret
    • Line graphs can hide issues
    
slide 80:
    Problem Space
    • Cloud computing confuses matters further:
    • hiding metrics from neighbors
    • throttling performance due to invisible neighbors
    
slide 81:
    Example Problems
    • Included:
    • Understanding utilization across 5,312 CPUs
    • Using disk I/O metrics to explain application
    performance
    • A lack of metrics for memory growth, packet drops, ...
    
slide 82:
    Example Solutions: tools
    • Device utilization heat maps for CPUs
    • Flame graphs for CPU profiling
    • CPU dispatcher queue latency by zone
    • CPU caps latency by zone
    • malloc() size profiling
    • Heap growth stack backtraces
    • File system latency distributions
    • File system latency tracing
    • TCP accept queue length distribution
    • TCP listen drop tracing with details
    
slide 83:
    Key Concepts
    • Visualizations
    • heat maps for device utilization and latency
    • flame graphs
    • Custom metrics often necessary
    • Latency-based for issue analysis
    • If coding isn’t practical/timely, use dynamic tracing
    • Cloud Computing
    • Provide observability (often to show what the problem isn’t)
    • Develop new metrics for resource control effects
    
slide 84:
    DTrace
    • Many problems were only solved thanks to DTrace
    • In the SmartOS cloud environment:
    • The compute node (global zone) can DTrace
    everything (except for KVM guests, for which it has a
    limited view: resource I/O + some MMU events, so far)
    • SmartMachines (zones) have the DTrace syscall,
    profile (their user-land only), pid and USDT providers
    • Joyent Cloud Analytics uses DTrace from the global
    zone to give extended details to customers
    
slide 85:
    Performance
    • The more you know, the more you don’t
    • Hopefully I’ve turned some unknown-unknowns
    into known-unknowns
    
slide 86:
    Thank you
    • Resources:
    • http://dtrace.org/blogs/brendan
    • More CPU utilization visualizations:
    http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/
    • Flame Graphs: http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
    and http://github.com/brendangregg/FlameGraph
    • More iostat(1) & file system latency discussion:
    http://dtrace.org/blogs/brendan/tag/filesystem-2/
    • Cloud Analytics:
    • OSCON slides: http://dtrace.org/blogs/dap/files/2011/07/ca-oscon-data.pdf
    • Joyent: http://joyent.com
    • brendan@joyent.com