10 Performance Wins

I originally posted this at http://dtrace.org/blogs/brendan/2012/08/09/10-performance-wins.

I work on weird and challenging performance issues in the cloud, often deep inside the operating system kernel. These have made for some interesting blog posts in the past, but there's a lot more I don't have time to share in that much detail. Here, I've summarized ten recent issues that myself and the other engineers at Joyent have worked on. I'd love to see similar summaries from other performance experts worldwide - it's useful to see the types of issues people are working on and the tools they are using.

#	Target	Analyzed	Key Tool	Fixed	Specific	Improvement
1	Redis	System	DTrace	Application	app config	up to 1000x
2	Riak	System	mpstat	Application	app config	2x
3	MySQL	System	DTrace	System	device tuning	up to 380x
4	Various	System	DTrace	System	kernel tuning	up to 2800x
5	Network Stack	System	DTrace	System	kernel code	up to 4.5x
6	Database	System	DTrace	Application	app config	~20%
7	Database	System	DTrace	System	library code	~10%
8	Riak	System	DTrace	System	library code	up to 100x
9	Various	System	DTrace	System	kernel code	up to 2x
10	KVM	System	DTrace	System	kvm config	8x

Longer summaries for each below.

The "Analyzed" column shows whether I analyzed this from the System (beginning from devices and system tools) or the Application (beginning with application-level analyzers). At the top of the performance support chain I'm generally working from the System: most of the Application-centric analysis has been performed by developers (the customer) before the issue reaches me.

The "Improvement" column gives a rough idea of the delivered wins, which in many cases are resolving regressions. "up to" are usually outliers - where occasional transactions are slow enough to be noticed and are an issue. There is often a spectrum: you notice the occasional 1000x slower events, but haven't noticed the 10x slower events that are present and caused by the same issue.

1. Monitoring ... up to 1000x

Description: Monitoring system latency sometimes exceeds 1s, when usually 1ms. Network packet drops were suspected to be the cause.

Analysis: System tools showed various possibilities, including memory capacity and disk I/O, as found via the USE Method. To determine which of these were hurting the application, kernel tracing using DTrace was used. This showed the application was blocked while paging anonymous memory back in, and was also blocked during file system fsync() calls. The application was misconfigured and sometimes briefly exceeded available memory, getting paged out; it was also configured in a way to call regular large fsync()s of up to a Gbyte.

2. Riak ... 2x

Description: Riak database had degraded performance after it was moved to a new system.

Analysis: The USE method and CPU analysis via mpstat showed a CPU in-balance. Further analysis with prstat and DTrace showed the Erlang VM was using half the CPU count it was supposed to, leaving CPUs idle. This was fixed through Erlang VM configuration.

3. MySQL ... up to 380x

Description: MySQL/Percona database had occasional slow queries, with no obvious reason.

Analysis: DTrace of MySQL showed the slow queries were usually blocked on file system reads, which were taking up to 3.8s, instead of the expected < 10ms. Kernel analysis using DTrace for the file system then the block device interface showed the slow I/O were originating from disks, which wasn't expected given no visible error rates. The pattern indicated a correlation with writes, leading to tuning of the storage controller (PERC) cache flush interval which greatly reduced the issue.

4. Various ... up to 2800x

Description: Different customers and systems reported occasional slow application requests over 1s, instead of < 10ms. A ping test confirmed that various systems would sometimes encounter network latency of up to 2800 ms, instead of < 1ms.

Analysis: DTrace analysis of the kernel, both of network packet events and other activities, showed the slow requests were related to getvmusage() calls, a system call that could take over two seconds for large systems. It was found possible that getvmusage() could run with a slightly higher priority than a network service thread that was bound to a CPU, causing priority inversion and the network packets to be unprocessed for the duration. The kernel preempt priority (kpreemptpri) was tuned to avoid this.

5. Network Stack ... up to 4.5x

Description: After boot, network latency gradually increased causing poor connect performance. Eventually caused Zeus to failover.

Analysis: mpstat showed a single CPU was hot in system time, which was profiled using DTrace and FlameGraphs. The kernel was in dce_lookup_and_add_v4(), and DTrace tracing of the kernel function time confirmed it had become CPU expensive. The DCE cache had grown too large and was expensive to search; this was fixed by changing the kernel code that performed DCE cache cleanup.

6. Database ... ~20%

Description: A newly moved database didn't perform quite as fast as the previous system.

Analysis: General investigation found higher CPU usage in user time. DTrace profiling of the database CPU time and with FlameGraphs showed mutex contention for malloc(), from the numerous database threads. The application startup was reconfigured to use the libumem library (object-caching memory allocation library) instead of libc, which reduced the contention.

7. Database ... ~10%

Description: A database was growing in memory size endlessly, hitting the system limit every day or two and encountering poor performance (paging) requiring a restart. Memory leak suspected, but the location unknown.

Analysis: Using DTrace and mdb ::umastat, this was tracked down to the system memory allocation library, libumem. libumem was necessary to use for multi-threaded scalability, however for this particular workload it suffered memory fragmentation in the slabs and endless heap growth. The libumem code was enhanced by Robert to support a tunable for selecting different libumem policies, which avoids the endless growth for this workload. Overall perf loss had been around 10%, averaged for the day; during paging it became several times slower. The final fix was to run the new libumem version with the new tunable ("UMEM_OPTIONS=allocator=best").

8. Riak ... up to 100x

Description: Riak queries would sometimes take over a second, when they should have taken less than 10ms. There were also TCP listen drops when the queries were slow.

Analysis: Kernel analysis using DTrace showed that this coincided with getvmusage() again, but this time due to it holding the address space lock (as_lock) while bitcask blocked on mmap()/munmap(). Analysis included using subsecond offset heatmaps. Callers of getvmusage() were changed to use less expensive techniques.

9. Various ... up to 2x

Description: Various applications, including MySQL and Riak, suffered higher than expected while under CPU load.

Analysis: prstat showed high CPU latency; DTrace profiling showed busy tenants under high load would sometimes not receive the expected proportion of CPU. Further kernel analysis using DTrace of the FSS scheduler showed it was not adjusting priorities quickly enough. The kernel FSS code was fixed.

10. KVM ... 8x

Description: Network transmit performance from Linux KVM guests was as slow as 110 Mbit/sec, confirmed using iperf.

Analysis: I used many existing performance tools, and developed new ones based on dynamic tracing. DTrace was used to trace the host OS and KVM, and both SystemTap and DTrace for Linux was used to trace the guest. Targets were the guest TCP/IP stack, both guest and host kernel schedulers, guest virtio driver, host KVM implementation and drivers, host network device driver, as well as other system characteristics down to system bus activity (observed using CPU perf counters). Separate FlameGraphs were generated for every thread involved, showing both user- and kernel-time. Numerous problems were found (printed out, and stuck to the wall), however most were symptoms of the low throughput and not the cause. Coalescing was suspected to be the worst issue from the DTrace analysis, and confirmed when Robert performed an instruction-step analysis of KVM boot via mdb, and could see that it was not set correctly (tx=timer). In order to go faster, networking needed to run more slowly and coalesce more packets. Once fixed, the network throughput improved to around 900 Mbit/sec.

Another recent performance issue is Robert Mustacchi's libumem caching.

Brendan Gregg's Blog