Another 10 Performance Wins

I originally posted this at http://dtrace.org/blogs/brendan/2014/02/11/another-10-performance-wins.

Following on from my earlier 10 performance wins post, here is another group of 10 I have worked on.

#	Target	Analyzed	Key Tool	Fixed	Specific	Improvement
11	redis	System	DTrace	System	scheduler	41%
12	rsync	System	DTrace	Application	app config	5x
13	mongoperf	System	DTrace	System	ZFS tuning	up to 8x
14	backups	System	iostat	System	OS tuning	2x - 4x
15	Percona	System	taskset	System	OS tuning	16x
16	ZFS	System	DTrace	System	OS tuning	up to 100x
17	Sphinx	System	Flame Graphs	Build	compiler options	75%
18	rsync	System	methodology	Application	system tuning	2x
19	ab	System	DTrace	System	TCP issue	5x
20	Elasticsearch	System	arcstat	Benchmark	config	50%

Longer summaries for each below. See the first post for notes on the above columns.

Issues 13, 15, and 18 are for Linux instances under the KVM hypervisor (HW virtualization); the remainder are for SmartOS instances (OS virtualization).

While I began documenting these performance wins to share the tools I was using, it's also useful to consider the methodologies. I'll mention key methodologies in the descriptions that follow. For most issues, I'm using the USE method to look for resource bottlenecks, and the TSA method to direct latency investigations.

11. redis ... 41%

Description: A potential customer benchmarked redis performance and found it not quite good enough for their needs.

Analysis: Microstate accounting showed that the single-threaded redis-server was hot on-CPU, mostly in system time (TSA method). DTrace profiling of kernel activity showed that most time was spent processing network I/O, with 20% in polling. This identified one potential win: the polling time can be reduced by using event ports, support for which was integrated in a newer version of Redis (2.6). To attack the bulk of the CPU time, in network I/O, locality was investigated by using DTrace to profile CPU usage. This showed poor CPU affinity, with the hot thread walking across even-numbered CPUs instead of staying put: a known kernel scheduler bug that has been fixed in a newer kernel version. To estimate the speedup from the fix, the thread was manually bound to a CPU, which improved performance by 41%.

12. rsync ... 5x

Description: A routine task copied data between systems using rsync. The throughput was low, and this had become a blocker.

Analysis: prstat and microstate accounting showed that an sshd thread was hot on-CPU on the remote sending system, spending 95% of its time in user-time (TSA method). DTrace was used to profile on-CPU user stacks for sshd, which showed that it was mostly spending time in compression. By disabling compression (using "-o 'Compression no'"), performance improved from 7.0 Mbytes/sec to 35.1 Mbytes/sec. DTrace was then used to profile both the sender and receiver to determine the next bottleneck (why stop at 35?), which turned out to be a mix of other factors, including I/O overheads and encryption (arcfour). netstat was used to check the health of the network. (See issue 18 for more about rsync.)
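As a rough sketch of how the compression setting can be passed through rsync to ssh: the snippet below only builds the command line, it does not run it. The host name and paths are hypothetical, and it uses the "-o Compression=no" spelling, which ssh accepts as equivalent to "-o 'Compression no'" and which avoids nested quoting.

```python
import shlex

def rsync_over_ssh_argv(src, dest, compression=False):
    """Build an rsync argv that runs over ssh with compression on or off."""
    # ssh accepts -o Compression=no as an alternative to -o 'Compression no'.
    ssh_cmd = "ssh -o Compression=" + ("yes" if compression else "no")
    # rsync's -e option names the remote shell command to use.
    return ["rsync", "-a", "-e", ssh_cmd, src, dest]

# Hypothetical source path and destination host, for illustration only.
argv = rsync_over_ssh_argv("/data/", "backup-host:/data/")
print(shlex.join(argv))
```

Whether turning compression off helps depends on which resource is the bottleneck: here it was CPU in sshd, not network bandwidth.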

13. mongoperf ... up to 8x

Description: The performance of a KVM/Linux cloud system intended for MongoDB was being evaluated using mongoperf, which showed it was much slower than expected. Is something wrong, or is this system just slow?

Analysis: strace was used to show that mongoperf was executing read/write/fsync in a loop, resulting in synchronous disk I/O. This bypassed the file system caching and buffering that was expected to improve performance, and was not a realistic test of the intended MongoDB workload. To see if there were any performance issues anyway, I traced I/O latency using DTrace in both the Linux guest (dtrace4linux) and the SmartOS host, which found that it was usually fast (0.1 ms: buffered by the battery-backed storage subsystem). However, occasionally it would have high latency, over 500 ms. DTrace of I/O with ZFS spa_sync() showed some were queueing behind TXG flushes. With ZFS tuning to increase the frequency of flushes (by 5x) and therefore reduce the size, these outliers moved to around 60 ms (tail-end latency on a shorter queue).
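To make the I/O pattern concrete, here is a minimal sketch of a write-then-fsync loop that times each iteration. It is not mongoperf's code, and it leaves out the read step; the iteration count and write size are arbitrary choices for illustration.

```python
import os
import tempfile
import time

def time_sync_writes(count=20, size=4096):
    """Write `size` bytes then fsync, `count` times; return latencies in ms."""
    latencies_ms = []
    fd, path = tempfile.mkstemp()
    try:
        buf = b"\0" * size
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, buf)
            os.fsync(fd)  # block until the data has reached stable storage
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies_ms

lat = time_sync_writes()
print("median %.2f ms, max %.2f ms" % (sorted(lat)[len(lat) // 2], max(lat)))
```

Because every iteration waits on fsync, the loop measures storage commit latency and not file system buffering, which is why the occasional slow flush shows up directly as an outlier.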

14. backups ... 2x - 4x

Description: A scheduled backup process was transferring data slowly.

Analysis: iostat showed that the disks were very busy with reads (USE method). Large files were being copied, yet the iostat numbers strongly suggested a random read workload, which didn't make sense. I checked the ZFS prefetch property, and found (and then remembered) that it had been disabled in our platform software. DTrace of the io provider probes, and ZFS internals, confirmed that this was a streaming workload that should benefit from prefetch. I created a simulation of the workload in the lab to test prefetch, and then tested in production, where it showed a 2x to 4x improvement. After some more testing, we turned on ZFS prefetch by default in the cloud platform software, to benefit all customers.
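One common way to judge random versus streaming reads from iostat output, not necessarily the reasoning used here, is the average read size: throughput divided by operation rate. A small sketch, using made-up numbers in place of real iostat r/s and kr/s values:

```python
def avg_read_kbytes(reads_per_sec, kbytes_read_per_sec):
    """Average read size in Kbytes, from iostat-style r/s and kr/s values."""
    if reads_per_sec == 0:
        return 0.0
    return kbytes_read_per_sec / reads_per_sec

# Made-up sample values for illustration only, not measurements from this issue.
small = avg_read_kbytes(reads_per_sec=400, kbytes_read_per_sec=3200)
large = avg_read_kbytes(reads_per_sec=400, kbytes_read_per_sec=51200)
print(small, large)
```

A busy disk with a small average read size hints at random I/O, while large average sizes are more typical of streaming reads. It is only a hint: tracing the I/O, as was done here with DTrace, is what confirms the access pattern.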

15. Percona ... 16x

Description: A benchmark evaluation of Percona on Joyent KVM/Linux vs AWS was not favorable. Business would be lost unless performance could be competitive.

Analysis: The benchmark was of custom queries run via mysqlslap. These updated a single row of a database (the same row) and were run from hundreds of threads in parallel. While this seemed like an odd test that would result in contention, the question was: why was it slower on our instances? Analysis began on the same Ubuntu 12.04 instances that were tested, with 8 vCPUs. mpstat and pidstat -t showed that there was a scalability issue: despite a high number of threads and concurrency, the CPUs were about 88% idle, with 8% user and 4% system time.

To get a handle on what was occurring, I wanted to profile CPU usage, and also trace user- and kernel-stacks during scheduler off-CPU events. I switched to SmartOS so that I could use DTrace for this, and got the same database and load running. prstat and DTrace confirmed that mysql was experiencing heavy lock contention, and additional CPUs allowed additional threads to spin and contend, creating negative scalability. I tested this experimentally: on the Linux instance, I offlined all CPUs except for one (echo 0 to the /sys/...cpu.../online files). This would hurt most workloads, but it improved the benchmark by 16x, and made the result very competitive. A similar result was seen by binding mysqld to one CPU using taskset. I've included this as an example of negative scalability, diagnosed experimentally. The real root cause (not described here) was later found using a different tool and fixed.
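The binding experiment can also be done programmatically. The sketch below is an illustration and not what was run here: it uses Python's os.sched_setaffinity, which is only available on Linux, to restrict the calling process to a single CPU, much as taskset does for a command it launches.

```python
import os

def pin_to_one_cpu(pid=0):
    """Restrict `pid` (0 = the caller) to a single CPU; return the CPU chosen."""
    allowed = os.sched_getaffinity(pid)  # CPUs the process may currently use
    cpu = min(allowed)                   # pick one that is known to be permitted
    os.sched_setaffinity(pid, {cpu})
    return cpu

if hasattr(os, "sched_setaffinity"):     # Linux-only API
    cpu = pin_to_one_cpu()
    print("pinned to CPU", cpu, "->", os.sched_getaffinity(0))
```

The call applies to one thread ID; threads created afterwards inherit the mask. Like offlining CPUs, this is a diagnostic experiment to confirm negative scalability, not a fix.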

16. ZFS ... up to 100x

Description: Semi-regular spikes in I/O latency on an SSD postgres server.

Analysis: The customer reported multi-second I/O latency for a server with flash memory-based solid state disks (SSDs). Since this SSD type was new in production, it was feared that there may be a new drive or firmware problem causing high latency. ZFS latency counters, measured at the VFS interface, confirmed that I/O latency was dismal, sometimes reaching 10 seconds for I/O. The DTrace-based iosnoop tool (DTraceToolkit) was used to trace at the block device level, but no seriously slow I/O was observed from the SSDs. I plotted the iosnoop traces using R to look for evidence of queueing behind TXG flushes, but they didn't support that theory either.

This was difficult to investigate since the slow I/O was intermittent, sometimes only occurring once per hour. Instead of a typical interactive investigation, I developed various ways to log activity from DTrace and kstats, so that clues for the issue could be examined afterwards from the logs. This included capturing which processes were executed using execsnoop, and dumping ZFS metrics from kstat, including arcstats. This showed that various maintenance processes were executing during the hour, and that the ZFS ARC, which was around 210 Gbytes, would sometimes drop by around 6 Gbytes. Having worked performance issues with shrinking ARCs before, I developed a DTrace script to trace ARC reaping along with process execution, and found that it was a match with a cp(1) command. This was part of the maintenance task, which was copying a 30 Gbyte file, hitting the ARC limit and triggering an ARC shrink. Shrinking involves holding ARC hash locks, which can cause latency, especially when shrinking 6 Gbytes worth of buffers. The zfs:zfs_arc_shrink_shift tunable was adjusted to reduce the shrink size, which also made shrinks more frequent. The worst-case I/O latency improved from 10 s to 100 ms.
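The numbers above fit a simple model of the tunable. Assuming each shrink targets roughly the ARC size shifted right by the shrink shift, and assuming a default shift of 5 at the time (both are assumptions, not stated in the text), a 210 Gbyte ARC gives shrinks of about 6.5 Gbytes. The sketch below shows how larger shift values reduce the shrink size; the values tried are illustrative, not the setting that was used.

```python
GBYTE = 1 << 30

def arc_shrink_bytes(arc_size_bytes, shrink_shift):
    """Approximate bytes targeted per shrink: ARC size divided by 2**shift."""
    return arc_size_bytes >> shrink_shift

arc = 210 * GBYTE  # ARC size from the text
for shift in (5, 7, 9, 11):  # 5 is assumed to be the default at the time
    size = arc_shrink_bytes(arc, shift) / GBYTE
    print("shift %2d -> %.2f Gbytes per shrink" % (shift, size))
```

Each increment of the shift halves the shrink size, so the same amount of memory is reclaimed in smaller, more frequent steps, with the locks held for less time on each one.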

17. Sphinx ... 75%

Description: The Sphinx search engine was twice as fast on AWS Linux as on SmartOS, when all other factors were identical.

Analysis: The TSA method showed that over 90% of time was spent in USR mode. DTrace CPU profiling data was made into a Flame Graph, which showed Sphinx acting normally. The Flame Graph revealed the request functions, which were traced with the pid provider and both timestamp and vtimestamp deltas to confirm USR time. On AWS Linux, perf profiling data was collected and also made into a Flame Graph. Comparing the two Flame Graphs showed that the same application code was not running on both systems: a function was being elided on Linux. This prompted an investigation into compiler differences, which found that the Linux build used -O3 (optimizations), whereas the SmartOS build did not. This was added to the SmartOS build, improving performance by 75%.
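Spotting a missing function can also be automated. Flame Graphs are generated from "folded" stack files, where each line is a semicolon-separated stack followed by a sample count. The sketch below lists frames that appear in one profile but not the other. The stacks and function names are invented for illustration; this is not the analysis that was actually run.

```python
def functions_in_folded(lines):
    """Return the set of frame names seen in folded-stack lines ("a;b;c 42")."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The sample count follows the last space; everything before is the stack.
        stack, _, _count = line.rpartition(" ")
        names.update(stack.split(";"))
    return names

def only_in_first(folded_a, folded_b):
    """Frames present in the first profile but absent from the second."""
    return sorted(functions_in_folded(folded_a) - functions_in_folded(folded_b))

# Hypothetical folded stacks; the function names are invented for illustration.
smartos = ["main;search;match_doc;get_field 70", "main;search;match_doc 30"]
linux = ["main;search;match_doc 100"]
print(only_in_first(smartos, linux))
```

A frame that is missing on one side may have been inlined or optimized away, which is a cue to compare how the two binaries were built.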

18. rsync ... 2x

Description: A customer needed better rsync throughput, which was low on Joyent Linux compared to AWS.

Analysis: Joyent Linux could rsync at around 16 MB/s, whereas AWS Linux was around 50 MB/s. Some network and ssh tuning was initially suggested, but the customer wanted to know why such tuning wasn't necessary on AWS, which was already performing well. This was debugged using static performance tuning (checking the configurations and versions of everything) as well as active analysis. Several areas to improve performance were identified. The greatest was that the default version of OpenSSH on Joyent Linux was 5.9p1, whereas on AWS it was 6.2p2, which ran over two times faster. This issue was discovered by the static performance tuning methodology, and not by a tool. The other issues identified did involve many tools: sar, pidstat, dtrace4linux, perf, ktap, Flame Graphs. I wanted to use the OpenSSH version difference as the example here, as it was a methodology-based win and not a tool-based one.

19. ab ... 5x

Description: During a capacity planning exercise, a customer noticed AWS had around 5x higher node.js HTTP throughput than Joyent SmartOS.

Analysis: Apache Bench (ab) was used to drive load to a node.js process on localhost. I used netstat to investigate TCP activity, which showed the connection rate began at about 5k per second, and after about two seconds dropped down to 1k per second. When performance was poor, both node and ab spent time in the sleep state, blocked on something else. This sounded like a TCP TIME_WAIT issue I'd debugged before, but the number of open TCP sessions suggested otherwise (only 11k; I was expecting 33k+). Using DTrace to trace syscall latency showed this time was in portfs: the SmartOS event ports implementation (comparable to epoll on Linux). This means that time was spent blocked on something else, which then woke up the file descriptors. Using DTrace to trace the actual wakeup which unblocked node showed that it was TCP retransmits, despite this being a localhost benchmark. A packet capture of the retransmits, and then more DTrace of the kernel code paths, showed that these were caused by packets arriving during TIME_WAIT. This was the TIME_WAIT issue I'd debugged before, although manifesting differently. The problem is one of benchmarking from a single client IP address, with new SYNs colliding with earlier sessions still in TIME_WAIT. Running this from multiple clients (which is the real world use case) showed performance could be steady at 5k connections/sec. Longer writeup here.

20. Elasticsearch ... 50%

Description: Performance of an Elasticsearch benchmark on SmartOS was only 15% faster than AWS. Why not much more?

Analysis: Given the various performance features with SmartOS (OS virtualization, ZFS ARC, CPU bursting), the expectation was much more than 15%. Various performance analysis tools were used while the benchmark was running (the "active benchmarking" methodology), with arcstat showing a moderate rate of ARC misses, and iostat showing a random read workload. Other resources were not stressed, and this looked like a disk-bound benchmark. However, over time, the ARC became warmer, and the benchmark numbers improved. After about 20 minutes of testing, a steady state was reached, where the working set was fully cached, and disk reads reached zero. At this point, the benchmark became CPU-bound, and limited by a cloud imposed CPU resource control (CPU cap), as shown by prstat and kstat. In the warm state, SmartOS was running 50% faster than earlier. DTrace was used to investigate further improvements, such as comparing the Elasticsearch I/O size (~1 Kbyte) with the ZFS record size (128 Kbyte default).
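The I/O size versus record size comparison matters because of read inflation while the cache is cold. As a back-of-the-envelope sketch, assuming a cache miss reads whole records from disk and that the I/O is aligned within a record (simplifying assumptions, not measurements from this issue):

```python
def read_inflation(io_size_bytes, record_size_bytes):
    """Bytes read from disk per byte requested when a miss reads whole records."""
    records = -(-io_size_bytes // record_size_bytes)  # ceiling division
    return records * record_size_bytes / io_size_bytes

default = read_inflation(1024, 128 * 1024)  # ~1 Kbyte I/O, 128 Kbyte records
smaller = read_inflation(1024, 8 * 1024)    # hypothetical smaller record size
print(default, smaller)
```

Under these assumptions, each ~1 Kbyte miss costs 128 Kbytes of disk reads with the default record size. This only matters during the cold phase: once the working set is cached, disk reads reach zero whatever the record size.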

Brendan Gregg's Blog