FROSUG 2009: Little Shop of Performance Horrors

Video: http://www.youtube.com/watch?v=cklPFJysUYM

Meetup talk for the Front Range Open Solaris User Group (FROSUG) in Colorado 2009, by Brendan Gregg.

This talk covers the worst performance issues I had seen, and how to learn from these mistakes.


PDF: FROSUG2009_Performance_Horrors.pdf

Keywords (from pdftotext):

slide 1:
    Little Shop of
    Performance Horrors
    Brendan Gregg
    Staff Engineer
    Sun Microsystems, Fishworks
    FROSUG 2009
    
slide 2:
    Performance Horrors
    I usually give talks on:
    – how to perform perf analysis!
    – cool performance technologies!!
    – awesome benchmark results!!!
    in other words, things going right.
    ● This talk is about things going wrong:
    – performance horrors
    – learning from mistakes
    
slide 3:
    Horrific Topics
    The worst perf issues I've ever seen!
    ● Common misconfigurations
    ● The encyclopedia of poor assumptions
    ● Unbelievably bad perf analysis
    ● Death by complexity
    ● Bad benchmarking
    ● Misleading analysis tools
    ● Insane performance tuning
    ● The curse of the unexpected
    
slide 4:
    The worst perf issues
    I've ever seen!
    
slide 5:
    The worst perf issues I've ever seen!
    SMC
    – Administration GUI for Solaris
    – Could take 30 mins to load on first boot
    
slide 6:
    The worst perf issues I've ever seen!
    SMC
    – Administration GUI for Solaris
    – Could take 30 mins to load on first boot
    Problems:
    – 12 million mostly 1-byte sequential read()s of
    /var/sadm/smc/properties/registry.ser, a 72 KB file
    – 7742 processes executed
    – 9504 disk events, 2228 of them writes to the 72 KB
    registry.ser file.
    Happy ending – performance was improved in an update
    
slide 7:
    The worst perf issues I've ever seen!
    SMC (cont.)
    ● Analysis using DTrace:
    – syscall frequency counts
    – syscall args
    This is “low hanging fruit” for DTrace
    ● Lesson: examine high level events.
    Happy ending – performance was improved in an update
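As a sketch of that low-hanging-fruit approach (the smcboot process
name is an assumption; substitute whichever execname applies):

    # syscall frequency counts, system-wide
    dtrace -n 'syscall:::entry { @[execname, probefunc] = count(); }'

    # read() sizes for a suspect process: arg2 is the requested byte count
    dtrace -n 'syscall::read:entry /execname == "smcboot"/ { @ = quantize(arg2); }'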
    
slide 8:
    The worst perf issues I've ever seen!
    nxge
    – 10 GbE network driver
    – tested during product development
    
slide 9:
    The worst perf issues I've ever seen!
    nxge (cont.)
    – 10 GbE network driver
    – tested during product development
    Problems:
    – kstats were wrong (rbytes, obytes)
    this made perf tuning very difficult until I realized what
    was wrong!
    – CR: 6687884 nxge rbytes and obytes kstat are wrong
    Lessons:
    – don't trust statistics you haven't double checked
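One way to double check byte-count kstats is against a transfer of
known size; a rough sketch (interface instance and overheads are
assumptions):

    # snapshot the driver's receive-byte counter
    kstat -p nxge:0::rbytes
    # ... move a known amount of data across the interface ...
    kstat -p nxge:0::rbytes
    # the delta should match the transfer size plus protocol overheads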
    
slide 10:
    The worst perf issues I've ever seen!
    nxge (cont.)
    – 10 GbE network driver
    – tested during product development
    Problems (#2):
    – memory leak starving the ZFS ARC
    – The kernel grew to 122 Gbytes in 2 hours.
    – 6844118 memory leak in nxge with LSO enabled
    – Original CR title: “17 MB/s kernel memory leak...”
    Lessons:
    – Bad memory leaks can happen in the kernel too
    
slide 11:
    The worst perf issues I've ever seen!
    nxge (cont.)
    – 10 GbE network driver
    – tested during product development
    Problems (#3):
    – LSO (large send offload) destroyed performance:
    Priority changed from [3-Medium] to [1-Very High]
    This is a 1000x performance regression.
    brendan.gregg@sun.com 2008-05-01 23:25:58 GMT
    – 6696705 enabling soft-lso with fix for 6663925 causes
    nxge to perform very very poorly
    Lessons:
    All configurable options must be tested and retested during
    development for regressions (such as LSO)
    
slide 12:
    Common
    Misconfigurations
    
slide 13:
    Common misconfigurations
    ZFS RAID-Z2 with half a JBOD
    – half a JBOD may mean 12 disks. A RAID-Z2 stripe
    may be 12 disks in width, therefore this
    configuration acts like a single disk:
    – perf is that of the slowest disk in the stripe
    – with so few stripes (one), a multi-threaded workload is much
    less likely to scale (see the sketch after this slide)
    Max throughput config without:
    – jumbo frames
    – 10 GbE ports (they do work!)
    Sync write workloads without ZFS SLOG devices
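A sketch of the RAID-Z2 width difference (pool and device names are
hypothetical): one 12-wide vdev does random I/O like a single disk,
while two 6-wide vdevs can service two I/Os concurrently:

    # one wide stripe: random I/O performance of roughly one disk
    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
        c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0

    # two narrower stripes: random I/O spread across two vdevs
    zpool create tank \
        raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
        raidz2 c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0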
    
slide 14:
    Common misconfigurations
    Not running the latest software bits
    – perf issues are fixed often; always try to be on the
    latest software versions
    4 x 1 GbE trunks, and 
slide 15:
    The Encyclopedia of
    Poor Assumptions
    
slide 16:
    The Encyclopedia of Poor Assumptions
    More CPUs == more performance
    – not if the threads don't scale
    Faster CPUs == more performance
    – not if your workload is memory I/O bound
    More IOPS capability == more performance
    – slower IOPS?
    Imagine a server with thousands of
    slow disks
    Network throughput/IOPS measured on the
    client reflects that of the server
    – client caching?
    
slide 17:
    The Encyclopedia of Poor Assumptions
    System busses are fast
    – The AMD HyperTransport was the #1 bottleneck for
    the Sun Storage products
    10 GbE can be driven by 1 client
    – may be true in the future, but difficult to do now
    – may assume that this can be done with 1 thread!
    Performance observability tools are
    designed to be the best possible
    ● Performance observability statistics (or
    benchmark tools) are correct
    – bugs happen!
    
slide 18:
    The Encyclopedia of Poor Assumptions
    A network switch can drive all its ports to
    top speed at the same time
    – especially may not be true for 10 GbE switches
    PCI-E slots are equal
    – test, don't assume; depends on bus architecture
    Add flash memory SSDs to improve
    performance!
    – Probably, but really depends on the workload
    – This is assuming that HDDs are slow; they usually
    are, however their streaming performance can be
    competitive (~100 Mbytes/sec)
    
slide 19:
    Unbelievably Bad
    Performance Analysis
    
slide 20:
    Unbelievably bad perf analysis
    The Magic 1 GbE NIC!
    ● How fast can a 1 GbE NIC run in one
    direction?
    
slide 21:
    Unbelievably bad perf analysis
    The Magic 1 GbE NIC!
    ● How fast can a 1 GbE NIC run in one
    direction?
    ● Results sent to me include:
    – 120 Mbytes/sec
    – 200 Mbytes/sec
    – 350 Mbytes/sec
    – 800 Mbytes/sec
    – 1.15 Gbytes/sec
    Lesson: perform sanity checks
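A worked sanity check: 1 GbE is 10^9 bits/sec, so the one-direction
ceiling is about 119 Mbytes/sec:

    # line rate in Mbytes/sec (2^20 bytes):
    echo 'scale=1; 10^9 / 8 / 2^20' | bc      # => 119.2

~120 Mbytes/sec is plausible; anything much beyond it suggests
loopback, client caching, or broken statistics.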
    
slide 22:
    Death by
    Complexity!
    
slide 23:
    Death by complexity!
    Performance isn't that hard, however it
    often isn't that easy either...
    ● TCP/IP stack performance analysis
    – heavy use of function pointers
    ZFS performance analysis
    – I/O processed asynchronously by the ZIO pipeline
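When the code path hides behind function pointers or async pipelines,
function-boundary tracing at least shows which functions fire; a
sketch (the zfs module and zio_ prefix assume an OpenSolaris build):

    # count entries to ZIO pipeline functions while the workload runs
    dtrace -n 'fbt:zfs:zio_*:entry { @[probefunc] = count(); }'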
    
slide 24:
    Bad Benchmarking
    
slide 25:
    Bad benchmarking
    SPEC-SFS
    http://blogs.sun.com/bmc/entry/eulogy_for_a_benchmark
    – Copying a file from a local filesystem to an NFS
    share, to performance-test that NFS share
    Various open source benchmark tools that
    don't reflect your intended workload
    Lesson: don't run benchmark tools blindly; learn
    everything you can about what they do, and how
    closely they match your environment
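One way to learn what a benchmark tool actually does is to watch its
system calls while it runs; a sketch (the execname is hypothetical):

    # count syscalls by type for the benchmark process
    dtrace -n 'syscall:::entry /execname == "mybench"/ { @[probefunc] = count(); }'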
    
slide 26:
    Misleading
    Analysis Tools
    
slide 27:
    Misleading analysis tools
    top
    load averages: 0.03, 0.03, ...                                  17:05:29
    236 processes: 233 sleeping, 2 stopped, 1 on cpu
    CPU states: 97.7% idle, 0.8% user, 1.6% kernel, 0.0% iowait, 0.0% swap
    Memory: 8191M real, 479M free, 1232M swap in use, 10G swap free

       PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
    101092 brendan                 93M   25M sleep  187:42  0.28% realplay.bin
    100297 root      26 100  -20  182M  177M sleep   58:13  0.14% akd
    399362 brendan                 95M   28M sleep   53:56  0.12% realplay.bin
    115306 root                           0K sleep   21:30  0.06% dtrace
    100876 brendan                        0K sleep  103:52  0.05% Xorg
    – What does %CPU mean? Are they all CPU
    consumers?
    – What does RSS mean?
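For CPU consumption specifically, microstate accounting is harder to
misread than a smoothed %CPU; a sketch:

    # per-thread microstates: USR/SYS are measured CPU time,
    # LAT is time spent waiting on a CPU run queue
    prstat -mL 1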
    
slide 28:
    Misleading analysis tools
    vmstat
    # vmstat 1
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
     0 0 0 10830436 501464 54 91 2 5 18 18 1 1835 4807 2067 3 94
     0 0 0 10849048 490460 9 245 0 0 16 16 0 1824 3466 1664 4 96
     0 0 0 10849048 490488 0 0 1470 3294 1227 1 99
     0 0 0 10849048 490488 0 0 1440 3315 1226 1 99
     0 0 0 10849048 490488 0 0 1447 3278 1236 1 98
    – What does swap/free mean?
    – Why do we care about de, sr?
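The abbreviations map to kernel statistics that can be read directly;
a sketch:

    # raw page statistics behind vmstat's memory and scan-rate columns
    kstat -n system_pages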
    
slide 29:
    Insane
    Performance Tuning
    
slide 30:
    Insane performance tuning
    disabling CPUs
    – turning off half the available CPUs can improve
    performance (relieving scalability issues)
    binding network ports to fewer cores
    – improves L1/L2 CPU cache hit rate
    – reduces cache coherency traffic
    reducing CPU clock rate
    – if the workload is memory bound, this may have little
    effect, but save heat, fan, vibration issues...
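A sketch of the CPU-disabling experiment (processor IDs are examples;
list them with psrinfo first):

    psrinfo                  # show processor IDs and states
    psradm -f 16 17 18 19    # take four processors offline
    psradm -n 16 17 18 19    # bring them back online afterwards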
    
slide 31:
    Insane performance tuning
    less memory
    – on systems with 256+ Gbytes of DRAM, codepaths
    that walk all of DRAM take longer
    warming up the kmem caches
    – before benchmarking, a freshly booted server won't
    have its kmem caches populated. Warming them
    up with any data can improve performance by 15%
    or so.
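A sketch of the warm-up idea (the benchmark command and flags are
hypothetical):

    # throwaway pass to populate kernel kmem caches after boot
    ./mybench --duration 60 > /dev/null
    # measured pass
    ./mybench --duration 600 > results.txt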
    
slide 32:
    The Curse of the
    Unexpected
    
slide 33:
    The Curse of the Unexpected
    A switch has 2 x 10 GbE ports, and 40 x 1
    GbE ports. How fast can it drive Ethernet?
    – Unexpected: some cap at 11 Gbit/sec total!
    Latency
    – Heat map discoveries
    – DEMO (http://blogs.sun.com/brendan)
    
slide 34:
    Thank you!
    Brendan Gregg
    Staff Engineer
    brendan@sun.com
    http://blogs.sun.com/brendan
    “open” artwork and icons by chandan:
    http://blogs.sun.com/chandan