File System Latency: part 1

I originally posted this at http://dtrace.org/blogs/brendan/2011/05/11/file-system-latency-part-1.

When considering I/O performance from the perspective of applications, file systems can prove to be a better target for analysis than disks. With modern file systems using more DRAM-based cache and performing more asynchronous disk I/O, what the application experiences can vastly differ to what the disks are doing. I'll demonstrate this by examining I/O performance of a MySQL database at both the disk and file system level.

In this post I'll discuss the commonly used approach, disk I/O analysis using iostat(1M), and in the following post I'll show file system analysis using DTrace.

Disk I/O

Historically there has been a focus on performance of the storage level devices - the disks themselves - when trying to understand how disk I/O affects application performance. This includes the use of tools such as iostat(1M), which prints various I/O statistics for disk devices. Either system administrators are running iostat(1M) directly at the command line, or are using it via another interface. Some monitoring software will use iostat(1M) to fetch the disk statistics which they then archive and plot (eg, munin).

Here is an iostat(1M) screenshot from a Solaris-based system which is running a MySQL database as well as other applications (with some extraneous output lines trimmed):

# iostat -xnz 1 10
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.1   33.8   78.8 1208.1  0.0  1.0    0.0   27.8   0   4 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  175.4    0.0 22449.9    0.0  0.0  1.1    0.0    6.1   0  82 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.1  379.2 13576.5 22036.7  0.0  2.4    0.0    4.9   0  85 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  139.9    0.0 17912.6    0.0  0.0  1.0    0.0    6.8   0  82 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  176.1    0.0 22538.0    0.0  0.0  1.0    0.0    5.7   0  85 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26619.9    0.0  0.0  1.9    0.0    9.2   0  99 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26624.4    0.0  0.0  1.7    0.0    8.2   0  95 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.0  368.9 13566.1 26881.3  0.0  2.7    0.0    5.7   0  93 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  146.0    0.0 18691.3    0.0  0.0  0.9    0.0    6.4   0  88 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   84.2    0.0 10779.2    0.0  0.0  0.5    0.0    6.1   0  42 c0t1d0

These statistics show an average I/O service time ("asvc_t") between 4.9 and 9.2 milliseconds and a high percent busy ("%b") rate, reaching 99% busy in one interval. The MySQL database on this server is suffering a degree of slow queries (queries longer than 1 second), and based on the iostat(1M) output you may be guessing why - the disks. For an application, this looks like an AWFUL system to be running on.

iostat(1M) shows disk level performance.

And I do love iostat(1M), especially the "-x" form of output. I'll save going into details for a later post. Here, I'll explain why iostat(1M) can become pretty confusing to interpret when your perspective is an application.

Other Processes

I'll explain a simple issue to start with. I said earlier that the system above was running MySQL and other applications. So, the heavy disk I/O could be caused by - and largely affecting - some other application. What if MySQL was caching very well in DRAM and hardly using the disks, while a nightly backup processes walked the entire file system rattling the disks? You might see output like the above, with the slow MySQL queries I mentioned caused by something else entirely.

iostat(1M)'s disk I/O is from all processes, which may not be the one you're looking at.

I've worked this issue before, creating psio and later iosnoop and iotop to identify disk I/O by-process (and filename). But they can't always identify the process and file, as happens with the ZFS file system. Fixing this isn't easy, and can lead to the question - is looking at the disks the right place to start with? We can get a handle on this by considering the I/O stack.

I/O Stack

Typically, applications are not performing I/O to the disks directly, rather they do so via a file system. And file systems work hard to prevent applications from suffering disk I/O latency directly, such as by using DRAM to buffer writes and to cache and prefetch reads.

File System in the I/O Stack

Here is an example I/O stack showing key components of file system, based loosely on ZFS:

                   +---------------+     +---------------+
                   |  Application  | ... |  Application  |
                   |    mysqld     |     |    (others)   |
                   +---------------+     +---------------+
                    reads ^ | writes     //
                          | |           //  
                          | |          //      user-land
       -------------system calls---------------------------
                          | |        //        kernel
          +--------------------------------+
          |  File System  | |      //      |
          |               | V     //       |
          |        [ access locks ]        |
          |               | |              |
          |      +------------------+      |
          |      |    DRAM cache    |      |
          |      +------------------+      |
          |          ^    ^ |    |         |
          |          |    | |    V         |
          | +----------+  | |  +---------+ |
          | | prefetch |  | |  | flusher | |
          | +------^---+  | |  +----|----+ |
          |      +-|----------------V-+    |
          |      |   on-disk layout   |    |
          |      |           +--------|    |
          |      |           |metadata|    |
          |      +--------------------+    |
          |        |      | |   |^  |      |
          +--------------------------------+
                   |      | |   ||  |
                   |      | V   V|  V
              +-------------------------+
              |   Disk Device Drivers   |      iostat(1M)
              +-------------------------+

This diagram shows that there are other sources of disk I/O other than what the application is directly (synchronously) requesting. For instance, on the write-side, the application may "dirty" buffers in the file system cache and consider the I/O completed, however the file system doesn't perform the disk I/O until much later - seconds - by batching together dirty data and writing them in bulk. This was evident in the previous iostat(1M) output with the bursts of writes (see the "kw/s" column) - which does not reflect how the application is actually performing writes.

I/O Inflation

Apart from other sources of disk I/O adding to the confusion, there is also what happens to the direct I/O itself - particularly with the on-disk layout layer. This is a big topic I won't go into much here, but I'll enumerate a single example of I/O inflation to consider:

An application performs a 1 byte write to an existing file.
The file system identifies the location as part of a 128 Kbyte file system "record", which is not cached (but the metadata to reference it is).
The file system requests the record be loaded from disk.
The disk device layer breaks the 128 Kbyte read into smaller reads suitable for the device.
The disks perform multiple smaller reads, totaling 128 Kbytes.
The file system now replaces the 1 byte in the record with the new byte.
Sometime later, the file system requests the 128 Kbyte "dirty" record be written back to disk.
The disks write the 128 Kbyte record (broken up if needed).
The file system writes new metadata; eg, references (for Copy-On-Write), or atime (access time).
The disks perform more writes.

So, while the applicaiton performed a single 1 byte write, the disks performed multiple reads (128 Kbytes worth) and even more writes (over 128 Kbytes worth).

Application I/O to the file system != Disk I/O

It can be even worse than that example, such as if the required metadata to reference the location had to be read in the first place, and if the file system (volume manager) employed RAID with a stripe size larger than the record size.

I/O Deflation

Having mentioned inflation, I should mention deflation possibilities in case it isn't obvious. These include caching in DRAM to satisfy reads, and cancellation of buffered writes (data rewritten before it has been flushed to disk).

I've recently been watching production ZFS file systems running with over a 99.9% cache hit rate, meaning that only a trickle of reads are reaching disk.

Considering Disk I/O for understanding Application I/O

To summarize so far: looking at how hard the disks are rattling, like we did above using iostat(1M), is far away from what the target application is actually experiencing. Application I/O can be inflated and deflated by the file system by the time it reaches disks, making it difficult to make a direct connection between disk and application I/O. Disk I/O also includes requests from other file system components, such as prefetch, the background flusher, on-disk layout metadata, as well as other users of the file system (other applications). If you were to identify an issue at the disk level, it's hard to tell how much it matters to the application of interest.

iostat(1M) includes other file system I/O, which may not directly affect the performance of the target application.

Summarizing the previous issues:

iostat(1M) shows disk level performance, not file system performance.

In the next post, I'll show how file system performance can be analyzed.

Brendan Gregg's Blog