Linux bcc ext4 Latency Tracing

06 Oct 2016

Storage I/O performance issues are often studied at the block device layer, but instrumenting the file system instead provides more relevant metrics for understanding how applications are affected.

My ext4dist tool does this for the ext4 file system: it traces reads, writes, opens, and fsyncs, and summarizes their latency as a power-of-2 histogram. For example:

# ext4dist
Tracing ext4 operation latency... Hit Ctrl-C to end.
^C

operation = 'read'
     usecs               : count     distribution
         0 -> 1          : 1210     |****************************************|
         2 -> 3          : 126      |****                                    |
         4 -> 7          : 376      |************                            |
         8 -> 15         : 86       |**                                      |
        16 -> 31         : 9        |                                        |
        32 -> 63         : 47       |*                                       |
        64 -> 127        : 6        |                                        |
       128 -> 255        : 24       |                                        |
       256 -> 511        : 137      |****                                    |
       512 -> 1023       : 66       |**                                      |
      1024 -> 2047       : 13       |                                        |
      2048 -> 4095       : 7        |                                        |
      4096 -> 8191       : 13       |                                        |
      8192 -> 16383      : 3        |                                        |

operation = 'write'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 75       |****************************************|
        16 -> 31         : 5        |**                                      |

operation = 'open'
     usecs               : count     distribution
         0 -> 1          : 1278     |****************************************|
         2 -> 3          : 40       |*                                       |
         4 -> 7          : 4        |                                        |
         8 -> 15         : 1        |                                        |
        16 -> 31         : 1        |                                        |

This output shows a bi-modal distribution for read latency, with a faster mode of less than 7 microseconds and a slower mode of between 256 and 1023 microseconds. The count column shows how many events fell into each latency range. It's likely that the faster mode is hits from the in-memory file system cache, and the slower mode is reads from a storage device (disk).

This "latency" is measured from when the operation was issued from the VFS interface to the file system, to when it completed. This spans everything: block device I/O (disk I/O), file system CPU cycles, file system locks, run queue latency, etc. This is a better measure of the latency suffered by applications reading from the file system than measuring this down at the block device interface. Measuring at the block device level is better suited for other uses: resource capacity planning.

Note that this tool only traces the common file system operations previously listed: other file system operations (eg, inode operations including getattr()) are not traced.

ext4dist is a bcc tool that uses kernel dynamic tracing (via kprobes) with BPF. bcc is a front-end and a collection of tools that use new Linux enhanced BPF tracing capabilities.
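
To illustrate the approach (this is a minimal sketch, not the actual ext4dist source), the following bcc program times a single operation, ext4 reads, from kernel function entry to return, and summarizes the latency as a power-of-2 histogram. It assumes reads enter ext4 via ext4_file_read_iter(); the real ext4dist traces more operations and handles differing function names across kernel versions.

#!/usr/bin/env python
# sketch: log2 histogram of ext4 read latency, in the style of ext4dist.
# Assumes ext4_file_read_iter() is the read entry point (kernel-version dependent).
from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);      // per-thread start timestamp
BPF_HISTOGRAM(dist);            // log2 latency histogram, in microseconds

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;               // missed the entry (tracing began mid-operation)
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    dist.increment(bpf_log2l(delta_us));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="ext4_file_read_iter", fn_name="trace_entry")
b.attach_kretprobe(event="ext4_file_read_iter", fn_name="trace_return")

print("Tracing ext4 read latency... Hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")

The kretprobe fires when the read returns, so the recorded delta covers everything between entry and return: block device I/O, file system CPU cycles, lock waits, and run queue latency, as described above.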

I also wrote ext4slower, to trace those ext4 operations that are slower than a custom threshold. For example, 1 millisecond:

# ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:49:17 bash           3616   R 128     0           7.75 cksum
06:49:17 cksum          3616   R 39552   0           1.34 [
06:49:17 cksum          3616   R 96      0           5.36 2to3-2.7
06:49:17 cksum          3616   R 96      0          14.94 2to3-3.4
06:49:17 cksum          3616   R 10320   0           6.82 411toppm
06:49:17 cksum          3616   R 65536   0           4.01 a2p
06:49:17 cksum          3616   R 55400   0           8.77 ab
06:49:17 cksum          3616   R 36792   0          16.34 aclocal-1.14
06:49:17 cksum          3616   R 15008   0          19.31 acpi_listen
06:49:17 cksum          3616   R 6123    0          17.23 add-apt-repository
06:49:17 cksum          3616   R 6280    0          18.40 addpart
06:49:17 cksum          3616   R 27696   0           2.16 addr2line
06:49:17 cksum          3616   R 58080   0          10.11 ag
06:49:17 cksum          3616   R 906     0           6.30 ec2-meta-data
06:49:17 cksum          3616   R 6320    0          10.00 animate.im6
[...]

This is great for proving or exonerating the storage subsystem (file systems, volume managers, and disks) as a source of high-latency events. Say you had occasional 100 ms application request outliers, and suspected they were caused by a single slow I/O (of 100 ms) or by several adding up to 100 ms. Running "ext4slower 10" would print every operation slower than 10 ms, proving or exonerating these theories. (Note that the cause could also be thousands of sub-1 ms I/O, which "ext4slower 10" would not catch.)

You can also use a threshold of "0" to dump all events, although the output will likely be far too verbose:

# ext4slower 0
Tracing ext4 operations
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:58:05 supervise      1884   O 0       0           0.00 status.new
06:58:05 supervise      1884   W 18      0           0.02 status.new
06:58:05 supervise      1884   O 0       0           0.00 status.new
06:58:05 supervise      1884   W 18      0           0.01 status.new
06:58:05 supervise      15817  O 0       0           0.00 run
06:58:05 supervise      15817  R 92      0           0.00 run
06:58:05 supervise      15817  O 0       0           0.00 bash
06:58:05 supervise      15817  R 128     0           0.00 bash
06:58:05 supervise      15817  R 504     0           0.00 bash
06:58:05 supervise      15817  R 28      0           0.00 bash
06:58:05 supervise      15817  O 0       0           0.00 ld-2.19.so
[...]

Do run ext4dist first to check the rate of file system operations. If it is millions per second, then running "ext4slower 0" will try to print millions of lines of output per second, which is probably not what you want, and it will add overhead to the system.
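
Part of what keeps a non-zero threshold cheap is that the latency comparison can happen in kernel context, so fast operations are filtered before anything is sent to user space. As a simplified sketch of that idea (not the real ext4slower, which uses a perf output buffer and prints the process, size, offset, and filename), assuming the same ext4_file_read_iter() entry point as earlier:

#!/usr/bin/env python
# sketch: print ext4 reads slower than a threshold, filtering in kernel context.
# bpf_trace_printk() is used here for brevity; the real ext4slower emits events
# via a perf output buffer with process, size, offset, and filename details.
from bcc import BPF

threshold_ms = 1                                # example threshold

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    start.delete(&tid);
    if (delta_us < THRESHOLD_US)
        return 0;               // fast I/O: filtered here, in kernel context
    bpf_trace_printk("slow ext4 read: %llu us\\n", delta_us);
    return 0;
}
"""
bpf_text = bpf_text.replace("THRESHOLD_US", str(threshold_ms * 1000))

b = BPF(text=bpf_text)
b.attach_kprobe(event="ext4_file_read_iter", fn_name="trace_entry")
b.attach_kretprobe(event="ext4_file_read_iter", fn_name="trace_return")
print("Tracing ext4 reads slower than %d ms... Hit Ctrl-C to end." % threshold_ms)
try:
    b.trace_print()
except KeyboardInterrupt:
    pass

With a threshold of zero, every operation passes the in-kernel check and is emitted, which is where the verbosity and overhead come from.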

These tools are in bcc, which has man pages (under /man/man8) and text files with more example screenshots (under /tools/*_example.txt). Each tool also has a USAGE message, eg:

# ext4slower -h
usage: ext4slower [-h] [-j] [-p PID] [min_ms]

Trace common ext4 file operations slower than a threshold

positional arguments:
  min_ms             minimum I/O duration to trace, in ms (default 10)

optional arguments:
  -h, --help         show this help message and exit
  -j, --csv          just print fields: comma-separated values
  -p PID, --pid PID  trace this PID only

examples:
    ./ext4slower             # trace operations slower than 10 ms (default)
    ./ext4slower 1           # trace operations slower than 1 ms
    ./ext4slower -j 1        # ... 1 ms, parsable output (csv)
    ./ext4slower 0           # trace all operations (warning: verbose)
    ./ext4slower -p 185      # trace PID 185 only

There are also equivalent tools in bcc for btrfs, xfs, and zfs (so far).


