File System Latency: part 2

13 May 2011

I originally posted this at http://dtrace.org/blogs/brendan/2011/05/13/file-system-latency-part-2.

This is part 2 of File System Latency, a series on storage I/O performance from the application perspective. In the first part I explained why disk I/O is difficult to associate with an application, and why it can be altered from what the application requested. In this part I'll focus more on the file system, and show why I/O that never reaches disk can be important to study.

File System Latency

What matters most to the application is the latency of its requests to the file system, which can be measured here (see part 1 for the full diagram):

                   +---------------+ 
                   |  Application  | 
                   |    mysqld     | 
                   +---------------+ 
           HERE --> reads ^ | writes 
                          | |        
                          | |                  user-land
       -------------system calls---------------------------
                          | |                  kernel
          +--------------------------------+
          |  File System  | |              |
          |               | V              |
          |        [ access locks ]        |
          |               | |              |
          |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|
[...]

Measuring I/O and latency at this level is much more interesting - it directly affects the application.

If we are also able to examine application context during the latency - whether it's during a sensitive code-path - then we can answer with certainty whether there is a file system issue that is affecting the application or not, and whether that's worth investigating further. Being able to answer this early in the diagnosis phase can be immensely useful, so that the right path is taken sooner.
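
For example, DTrace can measure this on Solaris-based systems. The following is a minimal sketch using the syscall provider, which for synchronous I/O closely approximates the file system interface; it assumes the application is mysqld and that its file system reads arrive via read(2) (other applications may use pread(2), readv(2), etc.):

    #!/usr/sbin/dtrace -s

    /* time read(2) calls issued by mysqld */
    syscall::read:entry
    /execname == "mysqld"/
    {
            self->ts = timestamp;
    }

    /* on return, aggregate latency by the application code path (user stack) */
    syscall::read:return
    /self->ts/
    {
            @[ustack()] = quantize(timestamp - self->ts);
            self->ts = 0;
    }

Aggregating by ustack() ties the latency to the application code path, which helps answer whether slow I/O occurred during a sensitive code path or not.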

Apart from being more relevant to the application than disk I/O, file system I/O also includes other phenomena worth examining: cache hits, lock latency, additional queueing, and disk-cache flush latency.

DRAM cache hits

Reads and writes may be served from the file system DRAM cache instead of disk (if present, enabled and eligible for the I/O type). For reads:

                   +-----------------------------+ 
                   |  Application                | 
                   |    mysqld                   | 
                   +-----------------------------+ 
                    read |||   ^^ cache    ^ cache
                requests |||   || hits     | misses
                         |||   ||          |
       ----------------------------------------------------
                         |||   ||          |               <-- file system I/O
          +-------------------------------------------+
          |  File System |||   ||          |          |
          |              VVV   ||          |          |
          |   +----------------------------------+    |
          |   |     DRAM cache                   |    |
          |   +----------------------------------+    |
          |                         |    ^            |
          +-------------------------------------------+
                               disk |    |
                               read |    |                 <-- disk I/O 
                                    V    | 
                                  .--------.
                                  `--------'
                                  |  disks |
                                  `--------'

Since these cache hits don't reach the disk, they are never observable using iostat(1M). They will be visible when tracing at the file system level, and if the file system cache is performing well you may see orders of magnitude more I/O than at the disk level.
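
As a rough illustration, file-system-level reads can be counted alongside disk-level I/O using DTrace (a sketch, assuming a Solaris-based system where the fsinfo provider is available):

    # count logical (file system) reads vs physical (disk) I/O events
    dtrace -n 'fsinfo:::read { @["fs reads"] = count(); }
        io:::start { @["disk I/O"] = count(); }'

If the cache is working well, the file system read count can dwarf the disk I/O count.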

If cache hits are "good" I/O, it may not be immediately obvious why we'd even want to see them. Here are three reasons to consider:

  • load analysis: by observing all requested I/O you know exactly how the application is using the file system - the load applied - which may lead to tuning or capacity planning decisions.
  • unnecessary work: identifying I/O that shouldn't be sent to the file system to start with, whether it's a cache hit or not.
  • latency: cache hits may not be as fast as you think.

What if the I/O was slow due to file system lock contention, even though no disk I/O was involved?

Lock latency

File systems employ locks for data integrity in a multi-threaded environment, and the latency incurred with lock contention will be included when tracing at this level. The time waiting to acquire the lock could dominate I/O latency as seen by the application:

                   +-------------------------------+ 
                   |  Application                  | 
                   |    mysqld                     | 
                   +-------------------------------+ 
                      I/O |    I/O latency     ^ I/O 
                  request |  <-------------->  | completion
                          |                    |
       ----------------------------------------------------
                          |                    |   
          +-------------------------------------------+
          |  File System  |                    |      |
          |               |   lock wait        |      |
          |               X-------------->.    |      |
          |                               |I/O |      |
          |                               |wait|      |
          |                               V    |      |
          |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/|

In that case the disks paint a rosier picture than reality, as their latency could be dwarfed by lock latency. While this is unlikely, it can happen - and if you are chasing down mysterious I/O latency you don't want to leave any stone unturned.

High lock wait (contention) could happen for a number of reasons, including extreme I/O conditions or file system bugs (remembering that the file system is software, and software can have bugs). Lock and other sources of file system latency won't be visible from iostat(1M).

If you only use iostat(1M), you may be flying blind regarding lock and other file system issues.
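
One way to look for this (a sketch, assuming a Solaris-based kernel) is the DTrace lockstat provider, which can show where threads block on contended kernel locks; the lockstat(1M) command provides similar summaries:

    # total time (ns) threads spent blocked on contended adaptive mutexes,
    # summed by kernel stack trace
    dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'

File system lock contention would show up as file system functions in the hottest stacks.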

There is one other latency source that iostat(1M) does directly show: waiting on one of the I/O queues. I'll dig into queueing a little here, and explain why we still need to return to file system latency.

Queueing

I/O can be queued in the kernel before it is issued to disk.

I've been trying to describe file systems generically to avoid getting sidetracked into implementation and internals, but will dip into ZFS a little here. On Solaris-based systems an I/O can queue in the ZFS I/O pipeline ("ZIO pipeline"), and then ZFS vdev queues, and finally a SCSI "sd" block device driver queue. iostat(1M)'s "wsvc_t" does show queue latency for the sd driver (and the "wait" and "%w" columns relate to this queue as well), but these don't reflect ZFS queueing.

                   read
         +-----------|-----------------------------------------+
         |  ZFS      V                                         |
         |        +---------------------------+                |
         |        |            ARC            |                |
         |        +---------------------------+                |
         |           |     __________________                  |
         |           +--> ()  ZIO pipeline   )---.             |
         |                 `""""""""""""""""'    |             |
         |                       vdev queue      V             |
         |                    +-+-+-+-+-+-+-+-+-+-+            |
         |                    |_|_|_|_|_|_|_|_|_|_|            |
         |                     |     <---                      |
         +---------------------|-------------------------------+
                               V  sd queue
                              +-+-+-+-+-+-+-+-+       wsvc_t 
                              |_|_|_|_|_|_|_|_|     <-------- iostat(1M)
                                     --->    |                  
                                             V                  /
                                          .------.             /
                                          `------'    asvc_t  /
                                          | disk |  <--------'
                                          `------'

So, iostat(1M) gets a brief reprieve - it doesn't just show disk I/O latency, since block device driver queue latency is visible as well.
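
For reference, these columns come from iostat(1M)'s extended statistics, for example (header only shown; layout may vary by release). wsvc_t is the average time an I/O spends waiting on the queue, and asvc_t is the average time it is actively being serviced by the device:

    $ iostat -xnz 1
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    [...]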

However, similarly to disk I/O latency, queue latency may not matter unless the application is waiting for that I/O to complete. To understand this from the application perspective, we are still best served by measuring latency at the file system level - which will include any queueing latency from any queue that the application I/O has synchronously waited for.

Cache Flush

Some file systems take care to make sure that synchronous write I/O - where the application has requested that the write not complete until it is on stable storage - really is on stable storage. (I think ZFS is currently the only one to do this.) It can work by sending SCSI cache flush commands to the disk devices, and not completing the application I/O until the cache flush command has completed. This ensures that the data really is on stable storage - and not buffered.

                   +--------------------------------------+ 
                   |  Application                         | 
                   +--------------------------------------+ 
        synchronous write |       I/O latency        ^ synchronous write
                  request |  <-------------------->  | completion
                          |                          |
       -----------------------------------------------------------
                          |                          |
          +---------------------------------------------------+
          |  File System  |                          |        |
          |               |                          |        |
          +---------------------------------------------------+
                     disk |        ^     | SCSI      |     
                    write | iostat |     | cache     | 
                          | asvc_t |     | flush     | 
                          | <----> |     | command   |     
                          |        |     |           | 
              .---------------------------------------------.
              |  Disk     |        |     |           |      |
              |  Device   V        |     V           |      |
              |   +-------------------------------------+   |
              |   | on-disk DRAM cache                  |   |
              |   +-------------------------------------+   |
              |                          |  actual   ^      |
              |                          |  disk     |      |
              |                          |  latency  |      |
              |                          | <-------> |      |
              |                          V           |      |
              |                         .-------------.     |
              |                         `-------------'     |
              |                         |    disk     |     |
              |                         `-------------'     |
              '---------------------------------------------'

The application is actually waiting for the SCSI flush command to complete, which is not (currently) included in iostat(1M). This means the application can be suffering write latency issues - that really are caused by disk latency - that are invisible via iostat(1M). I've wrestled with this issue before, and included scripts in the DTrace book to show the SCSI cache flush latency.

If latency is measured at the file system interface, it will include the time spent waiting on these cache flush commands.
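
As a sketch of measuring this from the application side, the synchronous write or flush calls can be timed at the syscall layer. Which calls to trace is an assumption here - mysqld may issue write(2) on O_DSYNC files, or flush via fsync(3C), which on Solaris enters the kernel as fdsync(2):

    /* time fdsync(2) (fsync) calls from mysqld: the application-visible cost
       of flushing to stable storage, including any SCSI cache flush wait */
    syscall::fdsync:entry
    /execname == "mysqld"/
    {
            self->ts = timestamp;
    }

    syscall::fdsync:return
    /self->ts/
    {
            @["fdsync latency (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
    }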

Issues Missing from Disk I/O

Part 1 showed how application storage I/O can be confusing to understand from the disk level. In this part I showed some scenarios where issues are simply not visible there.

This isn't really a problem with iostat(1M) - it's a great tool for system administrators to understand the usage of their resources. But the applications are far, far away from the disks - and have a complex file system in-between. For application analysis, iostat(1M) may provide clues that disks could be causing issues, but you really want to measure at the file system level to directly associate latency with the application, and to be inclusive of other file system latency issues.

In the next post I'll measure file system latency on a running application (MySQL).