Activity of the ZFS ARC

09 Jan 2012

I originally posted this at http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc

Disk I/O is still a common source of performance issues, despite modern cloud environments, modern file systems and huge amounts of main memory serving as file system cache. Understanding how well that cache is working is a key task while investigating disk I/O issues. In this post, I'll show the activity of the ZFS file system Adaptive Replacement Cache (ARC).

There are often more statistics available than you realize (or than have been documented), which is certainly true of the ARC. Apart from showing these statistics, I'll also show how to extend observability using dynamic tracing (DTrace). These tracing techniques are also applicable to any kernel subsystem. This is an advanced topic, where I'll sometimes dip into kernel code.

Architecture

For background on the ZFS ARC, see the paper ARC: A Self-Tuning, Low Overhead Replacement Cache, by Nimrod Megiddo and Dharmendra S. Modha. In a nutshell, the ARC achieves a high cache hit rate by using multiple cache algorithms at the same time: most recently used (MRU) and most frequently used (MFU). Main memory is balanced between these algorithms based on their performance, which is measured by maintaining extra metadata (in main memory) to see how each algorithm would perform if it ruled all of memory. Such extra metadata is held on "ghost lists".

The ZFS ARC has some changes beyond this design, as described in the block comment at the top of uts/common/fs/zfs/arc.c. These changes include the ability to lock pages, vary the size of the cache, and cache buffers of different sizes.

Lists

        +----------------------------------------------------------+
        | ZFS ARC                                                  |
        |   +---------------+----------------------------------+   |
        |   | MRU           |                        MRU ghost |   |
        |   +---------------+---------------+------------------+   |
        |   | MFU                           |        MFU ghost |   |
        |   +-------------------------------+------------------+   |
        |             <--- available main memory --->              |
        |                                                          |
        +----------------------------------------------------------+

The MRU + MFU lists refer to the data cached in main memory; the MRU ghost + MFU ghost lists consist of list metadata only (no cached data), and exist to track algorithm performance.

This is a simplification to convey the basic principle. The current version of the ZFS ARC splits the lists above into separate data and metadata lists, and also has a list for anonymous buffers and one for L2ARC-only buffers (which I added when I developed the L2ARC).

The actual lists are these, from arc.c:

typedef struct arc_state {
        list_t  arcs_list[ARC_BUFC_NUMTYPES];   /* list of evictable buffers */
        uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
        uint64_t arcs_size;     /* total amount of data in this state */
        kmutex_t arcs_mtx;
} arc_state_t;

/* The 6 states: */
static arc_state_t ARC_anon;
static arc_state_t ARC_mru;
static arc_state_t ARC_mru_ghost;
static arc_state_t ARC_mfu;
static arc_state_t ARC_mfu_ghost;
static arc_state_t ARC_l2c_only;

These lists exhibit MRU- and MFU-like behavior, but aren't strictly MRU/MFU. This can be understood from the lifecycle of an ARC buffer: on the first access, it is created and inserted at the head of the MRU list. On the second access, it is moved to the head of the MFU list. On the third access, it moves back to the head of the MFU list. (Other lifecycles are possible; this is just one example.) So, the most recently accessed buffer may be at the head of the MFU list, not the MRU list. And the most frequently accessed buffer may not be at the very head of the MFU list.

Locks

Data exists in the cache as buffers, where the primary structures are the arc_buf_hdr_t (header struct, defined in arc.c) and arc_buf_t (buffer struct, defined in arc.h). Access to these is protected by a hash table based on the 128-bit ZFS data virtual address (DVA). The hash table's buffer chains are protected by 256 padded locks (BUF_LOCKS, which may vary based on your ZFS version), with the padding avoiding false sharing. From arc.c:

#define HT_LOCK_PAD     64

struct ht_lock {
        kmutex_t        ht_lock;
#ifdef _KERNEL
        unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};

#define BUF_LOCKS 256
typedef struct buf_hash_table {
        uint64_t ht_mask;
        arc_buf_hdr_t **ht_table;
        struct ht_lock ht_locks[BUF_LOCKS];
} buf_hash_table_t;

These are optimized for performance since ARC buffers can be accessed, modified and moved between lists frequently.

For more details on ARC lists and locks, see the block comments in arc.c, and the overview by Joerg Moellenkamp.

Sizing

The ARC grows to fill available memory on the system, on the principle that if there is free memory, use it. It shouldn't do this at the expense of applications, ie, it shouldn't push out application memory (at least, in any large and sustained way). It keeps its size in check via:

  • allocation: once the ARC size has grown to its expected maximum, it will begin evicting buffers during new allocations. There is also some logic in arc_evict() to "recycle" a buffer of equal size, an optimization to avoid doing an evict-free-alloc path of the same size.
  • reclaim thread: this is arc_reclaim_thread(), which wakes up every second (or sooner if signaled via the arc_reclaim_thr_cv condition variable) and attempts to reduce the size of the ARC to the target size. It calls arc_kmem_reap_now() to clean up the kmem caches, and arc_adjust() to resize the ARC lists. If arc_shrink() is called by arc_kmem_reap_now(), the target ARC size is reduced by arc_shrink_shift (or needfree), which means shrinking the ARC by about 3%. If you plot the ARC size, you sometimes see these arc_shrink() steps appearing as teeth on a saw - a sharp drop followed by a gradual increase (a one-liner to watch for this follows the list).
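
As a quick way to watch for these steps, the current and target sizes can be polled from the arcstats kstats (covered in the Statistics section below):

$ kstat -p zfs:0:arcstats:size zfs:0:arcstats:c 10

A sharp drop in both, followed by "size" gradually climbing again, is the saw-tooth pattern just described.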

This is a brief summary, and includes keywords so you can find the right places in the source to start reading. I should note that the ARC did have sizing issues in the past, where it did seem to push out application memory; those have since been fixed. (One issue was that it didn't account for its own footprint accurately, missing a source of metadata in its size calculation, which meant the ARC was reaping later than it should have.)

Statistics

On Solaris-based systems, ARC statistics are available from kstat (kernel statistics), the same resource used by tools such as vmstat(1M) and iostat(1M). kstats are global (entire system, not individual zones) and accessible from non-root users. On the down side, they usually are not documented and are not considered a stable interface.

On FreeBSD, the same kstats for the ARC are available via sysctl (kstat.zfs.misc.arcstats).
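
For example (the exact statistic names under that sysctl node may vary by FreeBSD release):

$ sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses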

ARC Hit/Miss Rate

The ARC hit/miss rate can be determined from the kstats zfs::arcstats:hits and zfs::arcstats:misses. To watch the rate over time, they can be processed using a little awk (this example is for Solaris-based systems):

# cat -n archits.sh 
     1  #!/usr/bin/sh
     2  
     3  interval=${1:-5}        # 5 secs by default
     4  
     5  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '
     6          BEGIN {
     7                  printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"
     8          }
     9          /hits/ {
    10                  hits = $2 - hitslast
    11                  hitslast = $2
    12          }
    13          /misses/ {
    14                  misses = $2 - misslast
    15                  misslast = $2
    16                  rate = 0
    17                  total = hits + misses
    18                  if (total)
    19                          rate = (hits * 100) / total
    20                  printf "%12d %12d %8.2f%%\n", hits, misses, rate
    21          }
    22  '

This program could be shorter; I've spent some extra lines to write it more clearly. You could also write this in Perl (see my Sun::Solaris::Kstat examples), or C via libkstat.

$ ./archits.sh 1
        HITS       MISSES   HITRATE
651329528960    370490565    99.94%
       22600           11    99.95%
       17984            6    99.97%
        8978            8    99.91%
       87041           28    99.97%
       89861           10    99.99%
[...]

The first line is the summary since boot, then interval summaries. These counters are system wide. The hit rate on this system is impressive (99.94% since boot), although hit rates can be misleading. I'm usually studying the MISSES column, as a linear measure of pain.

arcstat.pl

Neelakanth Nadgir wrote arcstat.pl (Solaris), which prints various statistics including reads, misses and the size of the ARC. Mike Harsch developed arcstat.pl further, including L2ARC statistics.

$ ./arcstat.pl 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
04:45:47     0     0      0     0    0     0    0     0    0    14G   14G  
04:45:49   15K    10      0    10    0     0    0     1    0    14G   14G  
04:45:50   23K    81      0    81    0     0    0     1    0    14G   14G  
04:45:51   65K    25      0    25    0     0    0     4    0    14G   14G  
04:45:52   30K    11      0    11    0     0    0     3    0    14G   14G  
[...]

Instead of hit rates, this tool uses miss rates.

In Neel's version, the first line is the summary since boot; this isn't the case in Mike's current L2ARC version, where an extra snap_stats() call for an early L2ARC check means that, by the time the statistics loop is reached, the first iteration compares "now with now" instead of "now with boot".

Jason Hellenthal has created a FreeBSD version.

All statistics

All the kstats from the arcstat group (which feed the tools seen above) can be listed using:

$ kstat -pn arcstats
zfs:0:arcstats:c        15730138449
zfs:0:arcstats:c_max    50447089664
zfs:0:arcstats:c_min    6305886208
zfs:0:arcstats:class    misc
zfs:0:arcstats:crtime   95.921230719
zfs:0:arcstats:data_size        13565817856
zfs:0:arcstats:deleted  388469245
zfs:0:arcstats:demand_data_hits 611277816567
zfs:0:arcstats:demand_data_misses       258220641
zfs:0:arcstats:demand_metadata_hits     40050025212
zfs:0:arcstats:demand_metadata_misses   88523590
zfs:0:arcstats:evict_skip       5669994
zfs:0:arcstats:hash_chain_max   20
zfs:0:arcstats:hash_chains      248783
zfs:0:arcstats:hash_collisions  2106095400
zfs:0:arcstats:hash_elements    971654
zfs:0:arcstats:hash_elements_max        5677254
zfs:0:arcstats:hdr_size 188240232
zfs:0:arcstats:hits     651328694708
[...l2arc statistics truncated...]
zfs:0:arcstats:memory_throttle_count    0
zfs:0:arcstats:mfu_ghost_hits   55377634
zfs:0:arcstats:mfu_hits 649347616033
zfs:0:arcstats:misses   370489546
zfs:0:arcstats:mru_ghost_hits   127477329
zfs:0:arcstats:mru_hits 1980639328
zfs:0:arcstats:mutex_miss       11530337
zfs:0:arcstats:other_size       1967741376
zfs:0:arcstats:p        14713329404
zfs:0:arcstats:prefetch_data_hits       21342
zfs:0:arcstats:prefetch_data_misses     20782630
zfs:0:arcstats:prefetch_metadata_hits   831587
zfs:0:arcstats:prefetch_metadata_misses 2962685
zfs:0:arcstats:recycle_miss     27036925
zfs:0:arcstats:size     15721799464
zfs:0:arcstats:snaptime 29379870.8764106

More of the activity-related statistics will be discussed in the next sections.

Demand/Prefetch

Hits and misses can be broken down into four components, such that:

hits = demand_data_hits + demand_metadata_hits + prefetch_data_hits + prefetch_metadata_hits

And similarly for misses. Prefetch and demand refer to how the ARC request was initiated; data and metadata refer to the type of data requested.

Prefetch is the ZFS read-ahead feature, to predict and pre-cache blocks for streaming (sequential) workloads. All the prefetch statistics refer to ARC requests that have originated from the ZFS prefetch algorithm - which happens before the ARC - and without knowing whether the data is already cached in the ARC. So, a prefetch hit means that ZFS initiated a prefetch, which was then found in the ARC. A prefetch miss happens when that prefetch request was not in the ARC, and so initiated a disk I/O request (normal behavior). Demand is the opposite of prefetch: direct requests to the ARC, not predicted requests.

Another way to understand prefetch statistics is to follow the code. In dbuf.c, see the ARC_PREFETCH flag set in dbuf_prefetch(), which is then checked in arc.c via the ARCSTAT_CONDSTAT macro to determine which kstat to increment.

You can also add these up in other ways; eg:

streaming ratio = prefetch_* / (hits + misses)

At least, that shows the proportion of the workload that ZFS has identified as streaming. This can be turned into a kstat tool (awk/Perl/C), as with hits/misses earlier, to show both the summary since boot and interval summaries (current activity).
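
For example, a quick since-boot sketch in the style of archits.sh above (the script name is arbitrary; interval deltas could be added the same way as before):

#!/usr/bin/sh
# streamratio.sh: ratio of ARC accesses that ZFS initiated as prefetch (since boot)

kstat -p zfs:0:arcstats | awk '
        /arcstats:prefetch_/ { prefetch += $2 }
        /arcstats:hits/ || /arcstats:misses/ { total += $2 }
        END {
                if (total)
                        printf "streaming ratio: %.2f%%\n", prefetch * 100 / total
        }
'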

Data/Metadata

Metadata describes the ZFS dataset (file system or volume) and the objects within it. The data is the contents of those objects, including file, directory and volume blocks.

metadata ratio = *_metadata_* / (hits + misses)

This may be useful when considering the effect of picking a small recsize setting (which increases the proportion of metadata), or the effect of setting primarycache to metadata only.

Others

Some other activity-related kstats worth mentioning for the ARC (example one-liners for watching them follow the list):

  • mru_hits, mru_ghost_hits, mfu_hits, mfu_ghost_hits, p: Comparing the mru_hits and mfu_hits statistics with misses can determine the performance of each ARC list type (it's not comparing the performance of the MRU/MFU algorithms alone, since these aren't strictly MRU/MFU, as mentioned in Architecture). By adding _hits + _ghost_hits for each type, and then comparing the ratio of each type over time, you can also identify whether the workload changes in terms of ARC MRU/MFU. And you can see how quickly the ARC adapts to the workload, by watching the "p" statistic (ARC parameter) change.
  • hash_chain_max, hash_collisions: These show how well the DVA hash table is hashing. hash_chain_max is the longest chain length seen, when DVAs hash to the same table entry, and is usually less than 10. If it were much higher, performance may degrade as the hash locks are held longer while the chains are walked, assuming the max is representative and not an anomaly caused by some short event. This can be double-checked by studying the hash_collisions rate. If an issue is found, the number of hash locks (BUF_LOCKS) could be increased in arc.c, and ZFS recompiled (this isn't a regular tunable); although I wouldn't expect to need to tune this for a while.
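
For example, quick ways to watch these (5 second intervals; the _ghost_hits kstats can be added the same way, and deltas can be eyeballed or post-processed with a little awk, as before):

$ kstat -p zfs:0:arcstats:p zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits 5
$ kstat -p zfs:0:arcstats:hash_chain_max zfs:0:arcstats:hash_collisions 5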

Other kstats in the arcstats group describe sizes of the ARC, and the L2ARC.

arc_summary.pl

Another Perl Sun::Solaris::Kstat-based ARC tool worth mentioning is Ben Rockwood's arc_summary.pl, which prints a neat summary of the hit/miss rate and many of the other counters. Jason Hellenthal also ported the tool to FreeBSD.

$ ./arc_summary.pl
System Memory:
         Physical RAM:  49134 MB
         Free Memory :  1925 MB
         LotsFree:      767 MB

ZFS Tunables (/etc/system):
         set zfs:zil_disable=1
         set zfs:zfs_prefetch_disable=1
         set zfs:zfs_nocacheflush=1

ARC Size:
         Current Size:             15172 MB (arcsize)
         Target Size (Adaptive):   15256 MB (c)
         Min Size (Hard Limit):    6013 MB (zfs_arc_min)
         Max Size (Hard Limit):    48110 MB (zfs_arc_max)

ARC Size Breakdown:
         Most Recently Used Cache Size:          77%    11865 MB (p)
         Most Frequently Used Cache Size:        22%    3391 MB (c-p)

ARC Efficency:
         Cache Access Total:             654018720316
         Cache Hit Ratio:      99%       653646329407           [Defined State for buffer]
         Cache Miss Ratio:      0%       372390909      [Undefined State for Buffer]
         REAL Hit Ratio:       99%       653645890054           [MRU/MFU Hits Only]

         Data Demand   Efficiency:    99%
         Data Prefetch Efficiency:     0%

        CACHE HITS BY CACHE LIST:
          Anon:                       --%        Counter Rolled.
          Most Recently Used:          0%        1989696958 (mru)       [ Return Customer ]
          Most Frequently Used:       99%        651656193096 (mfu)             [ Frequent Customer ]
          Most Recently Used Ghost:    0%        128471495 (mru_ghost)  [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  0%        55618357 (mfu_ghost)   [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:                93%        613371468593 
          Prefetch Data:               0%        21342 
          Demand Metadata:             6%        40274007879 
          Prefetch Metadata:           0%        831593 
        CACHE MISSES BY DATA TYPE:
          Demand Data:                69%        259735783 
          Prefetch Data:               5%        20782630 
          Demand Metadata:            23%        88909678 
          Prefetch Metadata:           0%        2962818 
---------------------------------------------

Percentages and raw counters are provided, and the four breakdowns of hit/miss statistics (which I documented above; Ben's been bugging me to document the arcstats for a while).

Tracing

Apart from statistics, the activity of the ARC can also be observed by tracing function points and probes in the kernel. While statistics are always enabled and collected, tracing is enabled when needed, and incurs much higher overhead. This overhead is relative to the frequency of the traced events, which for the ARC can be very frequent (hundreds of thousands of events per second). I usually only trace the ARC for short periods (seconds or minutes) to gather debug data.

There isn't a stable DTrace provider for the ARC (and there probably never will be - other areas make much more sense), but there are sdt-provider probes in the ARC code:

# dtrace -ln 'sdt:zfs::arc-*'
   ID   PROVIDER            MODULE                          FUNCTION NAME
19307        sdt               zfs                   arc_read_nolock arc-miss
19310        sdt               zfs                   arc_evict_ghost arc-delete
19311        sdt               zfs                         arc_evict arc-evict
19312        sdt               zfs                   arc_read_nolock arc-hit
19313        sdt               zfs                   arc_buf_add_ref arc-hit

If these didn't exist, you could use the fbt provider. I'd begin by inspecting the functions listed in the FUNCTION column.
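
For example, the candidate fbt entry probes for ARC functions can be listed with (function names will vary with the ZFS version):

# dtrace -ln 'fbt:zfs:arc_*:entry'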

Note that neither of these providers (sdt or fbt) is available from within Solaris zones - they must be traced from the global zone. Both are also considered unstable interfaces, meaning the one-liners and scripts that follow may not work on future versions of the ARC without maintenance to match the code changes.

ARC accesses by application

Checking which applications are (directly) using the ARC:

# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count() }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C

  sendmail                                                          1
  qmgr                                                              3
[...]
  nscd                                                             81
  httpd                                                           243
  imapd                                                          1417
  python2.6                                                      2572
  awstats.pl                                                     4285
  php                                                            6934
  mysqld                                                       105901

This frequency counts the execname during ARC access. mysqld was the heaviest user, with 105,901 accesses while tracing.

The kernel will show up as "sched", for activities including ZFS transaction group flushes (TXG flush).

ARC accesses by kernel call path

For more details on why the ARC is being accessed, the kernel calling stack can be frequency counted:

# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
    @[execname, probefunc, stack()] = count(); }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C
[...]
  sched                                               arc_buf_add_ref                                   
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold+0x75
              zfs`zap_lockdir+0x67
              zfs`zap_update+0x5b
              zfs`uidacct+0xc4
              zfs`zfs_space_delta_cb+0x112
              zfs`dmu_objset_do_userquota_callbacks+0x151
              zfs`dsl_pool_sync+0xfe
              zfs`spa_sync+0x32b               ← spa sync
              zfs`txg_sync_thread+0x265
              unix`thread_start+0x8
               26
[...]
  python2.6                                           arc_buf_add_ref                                   
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold+0x75
              zfs`zap_get_leaf_byblk+0x56
              zfs`zap_deref_leaf+0x78
              zfs`fzap_cursor_retrieve+0xa7
              zfs`zap_cursor_retrieve+0x152
              zfs`zfs_readdir+0x2b8   
              genunix`fop_readdir+0xab         ← read directory
              genunix`getdents64+0xbc
              unix`_sys_sysenter_post_swapgs+0x149
             2130
[...]
  mysqld                                              arc_buf_add_ref                                   
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold_array_by_dnode+0x1a7
              zfs`dmu_buf_hold_array+0x71
              zfs`dmu_read_uio+0x4d
              zfs`zfs_read+0x19a
              genunix`fop_read+0x6b            ← read
              genunix`read+0x2b8
              genunix`read32+0x22
              unix`_sys_sysenter_post_swapgs+0x149
           101955

The output was many pages long; I've truncated to include a few different stacks, and added annotations.

ARC misses by user-land call path

Here's another view of ARC access call paths, this time for misses only, and the user-land stack trace that led to the miss. I've filtered on "mysqld" processes only:

# dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ {
    @[execname, probefunc, ustack()] = count(); }'
dtrace: description 'sdt:zfs::arc-miss ' matched 1 probe
^C
[...]
  mysqld                                              arc_read_nolock                                   
              libc.so.1`__read+0x15
              mysqld`my_read+0x43
              mysqld`_Z7openfrmP3THDPKcS2_jjjP8st_table+0x95
              mysqld`_ZL17open_unireg_entryP3THDP8st_tablePKcS4_S4_P10TABLE_LISTP1...
              mysqld`_Z10open_tableP3THDP10TABLE_LISTP11st_mem_rootPbj+0x6d7
              mysqld`_Z11open_tablesP3THDPP10TABLE_LISTPjj+0x1b0
              mysqld`_Z30open_normal_and_derived_tablesP3THDP10TABLE_LISTj+0x1b
              mysqld`_Z14get_all_tablesP3THDP10TABLE_LISTP4Item+0x73b
              mysqld`_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x18
              mysqld`_ZN4JOIN4execEv+0x59e
              mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8s...
              mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x102
              mysqld`_Z21mysql_execute_commandP3THD+0x51c6
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x1be
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x895
              mysqld`handle_one_connection+0x318
              libc.so.1`_thrp_setup+0x7e
              libc.so.1`_lwp_start
              124

The kernel stack trace could be included as well, showing the complete call path from user-land to a kernel event.
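
For example, a sketch that aggregates on both stacks at once (the output gets verbose quickly), again filtering on mysqld:

# dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ {
    @[stack(), ustack()] = count(); }'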

ARC access sizes

Digging a bit deeper; the sdt probes used previously were declared as:

        DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);

Which means arg0 is a pointer to an arc_buf_hdr_t. That struct is declared in arc.c, and contains various members including:

struct arc_buf_hdr {
[...]
        arc_buf_t               *b_buf;
        uint32_t                b_flags;
[...]
        arc_buf_contents_t      b_type;
        uint64_t                b_size;
        uint64_t                b_spa;
[...]
        clock_t                 b_arc_access;
[...]

Let's pick out the size, and trace ARC accesses by buffer size:

# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
    @["bytes"] = quantize(((arc_buf_hdr_t *)arg0)->b_size); }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C

  bytes                                             
           value  ------------- Distribution ------------- count    
               0 |                                         0        
               1 |                                         82       
               2 |                                         1        
               4 |                                         0        
               8 |                                         1        
              16 |                                         1        
              32 |                                         1        
              64 |                                         1        
             128 |                                         0        
             256 |                                         0        
             512 |@                                        1526     
            1024 |                                         605      
            2048 |                                         780      
            4096 |                                         913      
            8192 |@                                        1094     
           16384 |@@                                       4386     
           32768 |                                         618      
           65536 |@@                                       4196     
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         60811    
          262144 |                                         0   

Here I used a power-of-2 quantization, which showed that most of the buffers were in the 128 Kbyte range. (Which is also the default recsize for the datasets on this system.) Smaller buffers will exist due to cases including files and directories that are smaller than 128k.

Other members of arc_buf_hdr_t can be retrieved and inspected in similar ways.
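
For example, b_type can be used to break accesses down into data vs metadata; this sketch assumes ARC_BUFC_DATA is enum value 0 (check arc.h for your version):

# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
    @[((arc_buf_hdr_t *)arg0)->b_type == 0 ? "data" : "metadata"] = count(); }'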

ARC buffer age

Here's a neat use of the b_arc_access member, which tracks the time that the buffer was last accessed in terms of clock ticks. This time the fbt provider is used, to trace arc_access() before and after it updates b_arc_access:

# cat -n arcaccess.d 
     1  #!/usr/sbin/dtrace -s
     2  
     3  #pragma D option quiet
     4  
     5  dtrace:::BEGIN
     6  {
     7          printf("lbolt rate is %d Hertz.\n", `hz);
     8          printf("Tracing lbolts between ARC accesses...");
     9  }
    10  
    11  fbt::arc_access:entry
    12  {
    13          self->ab = args[0];
    14          self->lbolt = args[0]->b_arc_access;
    15  }
    16  
    17  fbt::arc_access:return
    18  /self->lbolt/
    19  {
    20          @ = quantize(self->ab->b_arc_access - self->lbolt);
    21          self->ab = 0;
    22          self->lbolt = 0;
    23  }

Running for 10 seconds:

# ./arcaccess.d -n 'tick-10s { exit(0); }'
lbolt rate is 100 Hertz.
Tracing lbolts between ARC accesses...

           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  729988   
               1 |                                         3805     ← 10 ms
               2 |                                         3038     
               4 |                                         2028     
               8 |                                         1428     
              16 |                                         1398     
              32 |                                         1618     
              64 |                                         2883     ← 1 second
             128 |                                         738      
             256 |                                         681      
             512 |                                         338      
            1024 |                                         569      
            2048 |                                         166      
            4096 |                                         607      ← 1 minute
            8192 |                                         632      
           16384 |                                         808      
           32768 |                                         373      
           65536 |                                         110      
          131072 |                                         142      
          262144 |                                         39       ← 1 hour
          524288 |                                         5        
         1048576 |                                         97       
         2097152 |                                         10       
         4194304 |                                         44       
         8388608 |                                         617      ← 1 day
        16777216 |                                         1        
        33554432 |                                         0 

This is interesting data. It shows that most buffers were accessed less than one clock tick apart (10 ms), with 729,988 accesses in the 0 to 1 tick range. The oldest buffer accessed was in the 16777216+ range, which (converting lbolts @100 Hertz into time) means it is at least 46 hours since last access. The above output has been annotated to show where times fall in the lbolt ranges (eg, 1 second falls in the 64 - 127 lbolt range).

This gives us an insight into the age of the oldest buffers in the ARC (at least, in terms of access rate - not birth), and of its churn rate. This particular ARC is 25 Gbytes, and has been running with a 99.94% hit rate as shown earlier - which may be less surprising now that we know it is so large that it can contain buffers accessed 40+ hours apart.

ARC hash lock

To get a handle on ARC hash lock contention (instead of using more heavyweight tools like lockstat(1M)), you can try tracing the time for arc_buf_add_ref(), since it grabs the buffer hash lock:

# dtrace -n 'arc_buf_add_ref:entry { self->s = timestamp; }
    arc_buf_add_ref:return /self->s/ {
        @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
dtrace: description 'arc_buf_add_ref:entry ' matched 2 probes
^C

  ns                                                
           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |@                                        2123     
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          44784    
            2048 |@@@@@                                    7556     
            4096 |@@                                       2267     
            8192 |                                         385      
           16384 |                                         4        
           32768 |                                         0        
           65536 |                                         1        
          131072 |                                         0 

Most of the times were in the 1 - 2 us range, with only a single occurrence passing 65 us.

ARC reap

Here's a simple script to provide insight into the ARC reclaim thread - an asynchronous task that keeps the size of the ARC in check.

# cat -n arcreap.d 
     1  #!/usr/sbin/dtrace -s
     2  
     3  fbt::arc_kmem_reap_now:entry,
     4  fbt::arc_adjust:entry
     5  {
     6          self->start[probefunc] = timestamp;
     7  }
     8  
     9  fbt::arc_shrink:entry
    10  {
    11          trace("called");
    12  }
    13  
    14  fbt::arc_kmem_reap_now:return,
    15  fbt::arc_adjust:return
    16  /self->start[probefunc]/
    17  {
    18          printf("%Y %d ms", walltimestamp,
    19              (timestamp - self->start[probefunc]) / 1000000);
    20          self->start[probefunc] = 0;
    21  }

Different functions are traced: arc_kmem_reap_now(), to see the time taken to reap the ARC kmem caches; arc_adjust(), for resizing the ARC lists; and arc_shrink(), to know when the ARC size has been stepped down (this isn't timed, since any real work will be done by arc_adjust()).

# ./arcreap.d
dtrace: script './arcreap.d' matched 5 probes
CPU     ID                    FUNCTION:NAME
  0  64929                 arc_shrink:entry   called                           
  0  62414                arc_adjust:return 2012 Jan  9 23:10:01 18 ms
  9  62420         arc_kmem_reap_now:return 2012 Jan  9 23:10:03 1511 ms
  0  62414                arc_adjust:return 2012 Jan  9 23:10:24 0 ms
  6  62414                arc_adjust:return 2012 Jan  9 23:10:49 0 ms

This isn't the only way the ARC keeps its size sane; it will also evict/recycle buffers during allocation, as mentioned in the Architecture section. This reclaim thread is the more aggressive method, so if you have occasional odd ARC behavior it may be handy to check if it is related to reclaims.

Evicts by list and type

Tracing the function that does eviction, with details:

# cat -n arcevict.d 
     1  #!/usr/sbin/dtrace -s
     2  
     3  #pragma D option quiet
     4  
     5  dtrace:::BEGIN
     6  {
     7          trace("Tracing ARC evicts...\n");
     8  }
     9  
    10  fbt::arc_evict:entry
    11  {
    12          printf("%Y %-10a %-10s %-10s %d bytes\n", walltimestamp, args[0],
    13              arg4 == 0 ? "data" : "metadata",
    14              arg3 == 0 ? "evict" : "recycle",  arg2);
    15  }

Sample output:

# ./arcevict.d 
Tracing ARC evicts...
2012 Jan  8 08:13:03 zfs`ARC_mru data       evict      812181411 bytes
2012 Jan  8 08:13:03 zfs`ARC_mfu data       evict      5961212 bytes
2012 Jan  8 08:13:03 zfs`ARC_mfu data       recycle    131072 bytes
2012 Jan  8 08:13:04 zfs`ARC_mfu data       recycle    131072 bytes
2012 Jan  8 08:13:07 zfs`ARC_mfu data       recycle    131072 bytes
2012 Jan  8 08:13:07 zfs`ARC_mfu data       recycle    131072 bytes
2012 Jan  8 08:13:08 zfs`ARC_mfu metadata   recycle    16384 bytes
2012 Jan  8 08:13:08 zfs`ARC_mfu data       recycle    131072 bytes
[...]

The output begins by catching an 800 Mbyte evict from the ARC MRU data list, followed by a 6 Mbyte evict from the MFU data list. After that, buffers were evicted due to the recycle code path, which recycles buffers when the ARC is getting full instead of allocating new ones.

To understand (and maintain) the arg mappings above, see the invocations of arc_evict() in arc.c. Eg, from arc_adjust():

        if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
                delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
                (void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
                adjustment -= delta;
        }

This is the first arc_evict() in arc_adjust(), which is why the ARC MRU data list is hit up first.

And more

The previous tracing examples show the sort of additional information that can be obtained using static tracing (the sdt provider) and dynamic tracing (the fbt provider). With dynamic tracing, a lot more can be seen as needed. Every function that makes up the ARC can be traced, along with their arguments.

One detail that is actually difficult to trace is the file names during ARC accesses, since vnode pointers are not passed down to the ARC layer. It is possible, and has been done before (I don't have an example on-hand though). You could more easily cache them from upper layers (eg, VFS; see the sample chapter from the DTrace book).
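
As a rough sketch of that approach for the read path only: cache the vnode's v_path on entry to zfs_read(), and read it if an ARC miss fires on the same thread. v_path is a best-effort cached path that may be unset or stale, and prefetch or write driven misses won't be matched, so treat this as illustrative rather than complete:

# dtrace -n '
    fbt::zfs_read:entry { self->path = args[0]->v_path; }
    sdt:zfs::arc-miss /self->path/ { @[stringof(self->path)] = count(); }
    fbt::zfs_read:return { self->path = 0; }'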

Conclusion

In this post, I examined ZFS ARC activity in detail, starting with statistics provided by kstat and then tracing provided by DTrace. Apart from calculating hit and miss rates, I discussed other statistics including prefetch and metadata ratios. I then used tracing to observe information from the ARC - including who is using the ARC and why, ARC buffer sizes, the age of the ARC buffers, lock contention timings and eviction details. More can be traced as needed: ZFS with DTrace provides great performance and observability.

I've spent much time on kernel internals, but I haven't really blogged about the deeper areas. I'm trying to change that, at least occasionally, starting with this post on ARC activity. I hope it is useful.

Thanks to the original ZFS team - especially Mark Maybee - for writing the ARC and explaining details to me, and to Bryan Cantrill for kstat-ifying the ARC statistics and creating DTrace.