
Brendan Gregg's Blog



22 Jul 2008

I originally posted this at

An exciting new ZFS feature has now become publicly known: the second level ARC, or L2ARC. I've been busy with its development for over a year; however, this is my first chance to post about it. This post will show a quick example and answer some basic questions.

Background in a nutshell

The "ARC" is the ZFS main memory cache (in DRAM), which can be accessed with sub-microsecond latency. An ARC read miss would normally read from disk, at millisecond latency (especially for random reads). The L2ARC sits in between, extending the main memory cache using fast storage devices, such as flash-memory-based SSDs (solid state disks).
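To make the tiering concrete, here is a rough back-of-envelope model of average read latency across the three tiers. It is only a sketch under stated assumptions, not how ZFS accounts for anything: the function name and the hit ratios are hypothetical, and the default latencies (0.5 microseconds for the ARC, 0.5 ms for the L2ARC, 9.5 ms for disk) are round numbers picked from the ranges mentioned in this post.

```python
def mean_read_latency_us(arc_hit, l2arc_hit,
                         arc_us=0.5, l2arc_us=500.0, disk_us=9500.0):
    """Average read latency in microseconds for an ARC -> L2ARC -> disk path.

    arc_hit   -- fraction of reads served from DRAM (hypothetical input)
    l2arc_hit -- fraction of ARC misses served from the L2ARC (hypothetical input)
    The latency defaults are assumed round numbers, not measurements.
    """
    for name, value in (("arc_hit", arc_hit), ("l2arc_hit", l2arc_hit)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    arc_miss = 1.0 - arc_hit
    return (arc_hit * arc_us                           # served from DRAM
            + arc_miss * l2arc_hit * l2arc_us          # ARC miss, L2ARC hit
            + arc_miss * (1.0 - l2arc_hit) * disk_us)  # missed both caches


if __name__ == "__main__":
    # Same DRAM hit ratio, with and without a warm second-level cache.
    print(mean_read_latency_us(arc_hit=0.2, l2arc_hit=0.0))
    print(mean_read_latency_us(arc_hit=0.2, l2arc_hit=0.95))
```

The point of the model is that once DRAM misses are common, average latency is dominated by whichever tier catches the miss, which is why putting a tier between DRAM and disk matters so much for random reads.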

[Figures: old model; new model; with ZFS]

Some example sizes to put this into perspective, from a lab machine named "walu":

For this server, the L2ARC allows around 650 Gbytes to be stored in the total ZFS cache (ARC + L2ARC), rather than only the roughly 120 Gbytes that fit in DRAM.

A previous ZFS feature (the ZIL) allowed you to add SSDs as log devices to improve write performance. This means ZFS provides two dimensions for adding flash memory to the file system stack: the L2ARC for random reads, and the ZIL for writes.

Adam has been the mastermind behind our flash memory efforts, and has written an excellent article in Communications of the ACM about flash memory based storage in ZFS; for more background, check it out.

L2ARC Example

To illustrate the L2ARC with an example, I'll use "walu", a medium-sized server in our test lab, which was briefly described above. Its ZFS pool of 44 x 7200 RPM disks is configured as a 2-way mirror, to provide both good reliability and performance. It also has 6 SSDs, which I'll add to the ZFS pool as L2ARC devices (or "cache devices").

I should note: this is an example of L2ARC operation, not a demonstration of the maximum performance that we can achieve (the SSDs I'm using here aren't the fastest I've ever used, nor the largest).

20 clients access walu over NFSv3, and execute a random read workload with an 8 Kbyte record size across 500 Gbytes of files (which is also its working set).

1) Disks only

Since the 500 Gbytes of working set is larger than walu's 128 Gbytes of DRAM, the disks must service many requests. One way to grasp how this workload is performing is to examine the IOPS that the ZFS pool delivers:

The pool is pulling about 1.89K ops/sec, which would require about 42 ops per disk of this pool. To examine how this is delivered by the disks, we can either use zpool iostat or the original iostat:
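The per-disk figure is simple division, sketched here in Python. It assumes reads are spread evenly across all 44 disks (both sides of each mirror) and uses the rounded 1.89K pool figure from the text, so the result is only approximate. The function name is made up for this example.

```python
POOL_OPS_PER_SEC = 1890  # "about 1.89K ops/sec", rounded figure from the text
POOL_DISKS = 44          # 44 x 7200 RPM disks in the pool


def ops_per_disk(pool_ops, disks):
    """Average ops/sec each disk must serve, assuming an even spread."""
    if disks <= 0:
        raise ValueError("disks must be positive")
    return pool_ops / disks


per_disk = ops_per_disk(POOL_OPS_PER_SEC, POOL_DISKS)
print(f"{per_disk:.1f} ops/sec per disk")
```

With the rounded input this lands in the low 40s, close to the per-disk figure quoted above.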

iostat is interesting as it lists the service times: wsvc_t + asvc_t. These I/Os are taking on average between 9 and 10 milliseconds to complete, which the client application will usually suffer as latency. This time will be due to the random read nature of this workload: each I/O must wait as the disk heads seek and the disk platter rotates.

Another way to understand this performance is to examine the total NFSv3 ops delivered by this system (these days I use a GUI to monitor NFSv3 ops, but for this blog post I'll hammer nfsstat into printing something concise):

That's about 2.27K ops/sec for NFSv3; I'd expect 1.89K of that to be what our pool was delivering, and the rest are cache hits out of DRAM, which is warm at this point.
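That expectation can be written out as arithmetic. This sketch assumes every NFSv3 op that misses DRAM turns into exactly one pool read, which is a simplification, and it uses the rounded figures from the text. The function name is hypothetical.

```python
NFS_OPS_PER_SEC = 2270   # "about 2.27K ops/sec" delivered over NFSv3
POOL_OPS_PER_SEC = 1890  # "about 1.89K ops/sec" served by the pool disks


def dram_hit_estimate(nfs_ops, pool_ops):
    """Return (ops/sec served from DRAM, fraction of all NFS ops).

    Assumes one pool read per NFS op that misses the DRAM cache.
    """
    if nfs_ops <= 0 or pool_ops < 0 or pool_ops > nfs_ops:
        raise ValueError("expected 0 <= pool_ops <= nfs_ops and nfs_ops > 0")
    hits = nfs_ops - pool_ops
    return hits, hits / nfs_ops


hits, hit_fraction = dram_hit_estimate(NFS_OPS_PER_SEC, POOL_OPS_PER_SEC)
print(f"{hits} ops/sec from DRAM ({hit_fraction:.1%} of NFSv3 ops)")
```

On these numbers only a small share of requests are DRAM hits, which fits a working set several times larger than main memory.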

2) L2ARC devices

Now the 6 SSDs are added as L2ARC cache devices:

And we wait until the L2ARC is warm.

Time passes ...

Several hours later the cache devices have warmed up enough to satisfy most I/Os which miss main memory. The combined 'capacity/used' column for the cache devices shows that our 500 Gbytes of working set now exists on those 6 SSDs:

The pool_0 disks are still serving some requests (in this output 30 ops/sec) but the bulk of the reads are being serviced by the L2ARC cache devices, each providing around 2.6K ops/sec. The total delivered by this ZFS pool is 15.8K ops/sec (pool disks + L2ARC devices), about 8.4x faster than with disks alone.

This is confirmed by the delivered NFSv3 ops:

walu is now delivering 18.7K ops/sec, which is 8.3x faster than without the L2ARC.

However, the real win for the client applications is that of read latency. The disk-only iostat output showed our average was between 9 and 10 milliseconds, whereas the L2ARC cache devices are delivering the following:

Our average service time is between 0.4 and 0.6 ms (wsvc_t + asvc_t columns), which is about 20x faster than what the disks were delivering.

What this means ...

An 8.3x improvement for 8 Kbyte random IOPS across a 500 Gbyte working set is impressive, as is improving storage I/O latency by 20x.
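For anyone who wants to check the ratios, here is a small Python sketch that recomputes them from the rounded figures quoted in this post. Because the inputs are rounded, the results can differ slightly from ratios computed on the raw tool output. The helper name is made up for this example.

```python
def ratio(larger, smaller):
    """How many times larger one measurement is than another."""
    if smaller <= 0:
        raise ValueError("smaller must be positive")
    return larger / smaller


# Throughput, in thousands of ops/sec (rounded figures from the text).
pool_speedup = ratio(15.8, 1.89)  # pool disks + L2ARC vs. disks only
nfs_speedup = ratio(18.7, 2.27)   # NFSv3 ops with vs. without the L2ARC

# Latency in ms: disks at 9 to 10 ms, cache devices at 0.4 to 0.6 ms.
latency_low = ratio(9.0, 0.6)     # least favourable pairing
latency_mid = ratio(9.5, 0.5)     # midpoints of both ranges
latency_high = ratio(10.0, 0.4)   # most favourable pairing

print(f"pool IOPS:  {pool_speedup:.1f}x")
print(f"NFSv3 ops:  {nfs_speedup:.1f}x")
print(f"latency:    {latency_low:.0f}x to {latency_high:.0f}x "
      f"(midpoint {latency_mid:.0f}x)")
```

With these rounded inputs the pool ratio comes out near 8.4x and the NFSv3 ratio a little over 8.2x, while the latency ratio spans roughly 15x to 25x, which brackets the 20x quoted above.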

But this isn't really about the numbers, which will become dated (these SSDs were manufactured in July 2008, by a supplier who is providing us with bigger and faster SSDs every month).

What's important is that ZFS can make intelligent use of fast storage technology, in different roles to maximize its benefit. When you hear of new SSDs with incredible ops/sec performance, picture them as your L2ARC; or, if they offer great write throughput, picture them as your ZIL.

The example above was to show that the L2ARC can deliver, over NFS, whatever these SSDs could do. And these SSDs are being used as a second level cache, in-between main memory and disk, to achieve the best price/performance.


I recently spoke to a customer about the L2ARC and they asked a few questions which may be useful to repeat here:

What is L2ARC?

Isn't flash memory unreliable? What have you done about that?

Aren't SSDs really expensive?

What about writes – isn't flash memory slow to write to?

What's bad about the L2ARC?


If anyone is interested, I wrote a summary of L2ARC internals as a block comment in usr/src/uts/common/fs/zfs/arc.c, which is also surrounded by the actual implementation code. The block comment is below (see the source for the latest version), and is an excellent reference for how it really works:

Jonathan Schwartz (our CEO) recently linked to this block comment in a blog entry about flash memory, to show that ZFS can incorporate flash into the storage hierarchy, and here is the actual implementation.