zfsday 2012: ZFS Performance Analysis and Tools
zfsday 2012 talk by Brendan Gregg. Video: http://www.youtube.com/watch?v=xkDqe6rIMa0
Description: "The performance of the file system, or disks, is often the target of blame, especially in multi-tenant cloud environments. At Joyent we deploy a public cloud on ZFS-based systems, and frequently investigate performance with a wide variety of applications in growing environments. This talk is about ZFS performance observability, showing the tools and approaches we use to quickly show what ZFS is doing. This includes observing ZFS I/O throttling, an enhancement added to illumos-ZFS to isolate performance between neighbouring tenants, and the use of DTrace and heat maps to examine latency distributions and locate outliers."
PDF: zfsperftools2012.pdf
Keywords (from pdftotext):
slide 1:
ZFS Performance Analysis and Tools
Brendan Gregg, Lead Performance Engineer
brendan@joyent.com  @brendangregg
October, 2012
slide 2:
whoami
• G'Day, I'm Brendan
• These days I do systems performance analysis of the cloud
• Regarding ZFS:
  • Perf analysis of ZFS (mostly using DTrace) for 5+ years, both enterprise and cloud usage
  • Wrote many DTrace-based ZFS perf analysis tools, including those in the DTrace book
  • Developed ZFS L2ARC while at Sun
slide 3:
Who is Joyent
• Cloud computing provider (public cloud + software)
• Use ZFS as much as possible:
  • Host storage
  • Guest storage: OS virtualization (SmartOS), and KVM guests (Linux, Windows)
• We use ZFS because:
  • Reliability: checksums, COW
  • Features: snapshots, clones, compression, ...
  • Performance: large ARCs
  • It can boil oceans
slide 4:
Joyent, cont.
• We build tools for ZFS automation and observability.
• Performance is a key company feature.
• Need to solve FS/disk issues fast.
slide 5:
Agenda
• My top 12 tools for ZFS performance analysis (unsorted):
  • iostat
  • vfsstat
  • zfsslower.d
  • iosnoop
  • iostacks
  • metaslab_free.d
  • spasync.d
  • arcstat
  • arcaccess.d
  • latency counters
  • scatter plots
  • heat maps (CA)
• For cloud computing from within a Zone, add:
  • mysqld_pid_fslatency.d
  • syscall with fi_fs == zfs
slide 6:
Functional diagram: full stack
• Unix 101
  [diagram: Process (user-land); kernel: Syscall Interface -> VFS -> ZFS -> Block Device Interface -> Disks; logical I/O enters at VFS, physical I/O reaches the disks]
slide 7:
Functional diagram: full stack, cont.
• Unix 102
  [diagram: same stack; I/O is sync. at the process/syscall interface, but often async. at the disks due to write buffering and read ahead; iostat(1) observes the disk level]
slide 8:
Functional diagram: full stack, cont.
• DTrace 301
  [diagram: the stack annotated with tools at each layer: mysql_pid_fslatency.d and syscall with fi_fs == zfs (Syscall Interface); vfsstat (VFS); zioslower.d, spasync.d, metaslab_free.d, arcstat.pl, arcaccess.d, iostacks.d (ZFS); iostat, iosnoop (Block Device Interface / Disks); kernel drivers as needed, see DTrace book chap 4]
slide 9:
ZFS Internals
• That's just my top 12
• Use more as needed
slide 10:
ZFS Internals
• That's just my top 12
• Use more as needed
• DTRACE ALL THE THINGS!
  http://hub.opensolaris.org/bin/view/Community+Group+zfs/source
  http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html
slide 11:
iostat
• Block-device level (almost disk-level) I/O statistics:

  $ iostat -xnz 1
  [...]
                      extended device statistics
      r/s  w/s  kr/s  kw/s wait actv wsvc_t asvc_t  %w  %b device
      [per-interval rows for c0t0d0 and c0t1d0; most column values lost in text extraction]

  [slide annotations: ZFS->Disk workload (rate columns); disk resulting performance (latency and busy columns)]
slide 12:
iostat, cont.
• Effective tool for a class of disk issues, especially:

  $ iostat -xnz 1
  [...]
                      extended device statistics
      r/s  w/s  kr/s  kw/s wait actv wsvc_t asvc_t  %w  %b device
      0.0  0.0   0.0   0.0  0.0  4.0    0.0    0.0   0 100 c0t0d0
  [the same row repeats every interval: no I/O completing, yet 4 outstanding and 100% busy]

• Disks "out to lunch" (PERC ECC error)
slide 13:
iostat, cont.
• Minor nits:
  • does not show read/write latency separately. ZFS TXG flushes drag up the latency, which looks alarming, but they are asynchronous. Can use DTrace for the split (see the one-liner below).
  • no higher-level context: PID, ZFS dataset, file pathname, ... (not its role)
• Major problem (although not iostat's fault): commonly confused with application-level (logical) I/O.
  • The I/O rates, sizes, and latency can dramatically differ between logical file system I/O and physical disk I/O.
  • Users commonly draw the wrong conclusions when only provided with iostat statistics to understand a system.
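A minimal sketch of that DTrace split (not shown in the deck): the io provider can quantize block-device read and write latency separately, keyed on the buf pointer:

  # dtrace -qn '
      io:::start { ts[arg0] = timestamp; }             /* stamp each buf at issue */
      io:::done /ts[arg0]/ {
          @[args[0]->b_flags & B_READ ? "disk read (ns)" : "disk write (ns)"] =
              quantize(timestamp - ts[arg0]);          /* latency distribution per direction */
          ts[arg0] = 0;
      }
      tick-10s { exit(0); }'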
slide 14:
iostat, cont.
• iostat output (or disk kstats) graphed by various monitoring software
  [screenshots of monitoring dashboards] So MANY graphs!
slide 15:
iostat, cont.
• For so LITTLE visibility
• This leaves many perf issues unsolved
  [same full-stack tool diagram as slide 8; iostat only observes the block device interface]
slide 16:
vfsstat
• VFS-level I/O statistics (VFS-iostat):

  # vfsstat -Z 1
    r/s    w/s   kr/s   kw/s ractv wactv read_t writ_t   d/s  del_t zone
    ...                                                          0.0 global (0)
    ...                                                         34.9 9cc2d0d3 (2)
    ...                                                         46.5 72188ca0 (3)
    ...                                                         16.5 4d2a62bb (4)
    ...                                                         27.6 8bbc4000 (5)
    ...                                                         11.3 d305ee44 (6)
    ...                                                        132.0 9897c8f5 (7)
    ...                                                         40.7 5f3c7d9e (9)
    ...                                                         31.9 22ef87fc (10)
  [most column values lost in text extraction]

  [slide annotations: App->ZFS workload (rate columns); ZFS resulting performance (latency columns); ZFS I/O throttling (d/s, del_t)]
slide 17:
vfsstat, cont.
• Good high-level summary of logical I/O: the application FS workload
• Summarizes by zone
• Impetus was observability for cloud "noisy neighbors"
• Shows the effect of ZFS I/O throttling (performance isolation)
• Summarizes the performance applications actually experience!
  • Usually a lot better than disk-level, due to ZFS caching (ARC, L2ARC) and buffering
• Required kernel changes, new kstats (thanks Bill Pijewski)
slide 18:
zfsslower.d
• ZFS reads/writes slower than 10 ms:

  # ./zfsslower.d 10
  TIME                   PROCESS    D   ms FILE
  2012 Sep 30 04:56:22   beam.smp   R   12 /var/db/riak/leveldb/.../205788.sst
  2012 Sep 30 04:56:23   beam.smp   R   15 /var/db/riak/leveldb/.../152831.sst
  2012 Sep 30 04:56:24   beam.smp   R   11 /var/db/riak/leveldb/.../220432.sst
  2012 Sep 30 04:56:24   beam.smp   R   12 /var/db/riak/leveldb/.../208619.sst
  2012 Sep 30 04:56:25   beam.smp   R   21 /var/db/riak/leveldb/.../210710.sst
  2012 Sep 30 04:56:25   beam.smp   R   18 /var/db/riak/leveldb/.../217757.sst
  2012 Sep 30 04:56:25   beam.smp   R   13 /var/db/riak/leveldb/.../191942.sst
  2012 Sep 30 04:56:26   cat            10 /db/run/beam.smp.pid
  2012 Sep 30 04:56:26   beam.smp   R   11 /var/db/riak/leveldb/.../220389.sst
  2012 Sep 30 04:56:27   beam.smp   R   12 /var/db/riak/leveldb/.../186749.sst
  [...]

• Traces at the VFS level to show the true application-suffered I/O time
• Allows immediate confirm/deny of an FS (incl. disk) based issue
slide 19:
zfsslower.d
• ZFS reads/writes slower than 100 ms:

  # ./zfsslower.d 100
  TIME                   PROCESS    D    ms FILE
  2012 Sep 30 05:01:17   beam.smp   R   144 /var/db/riak/leveldb/.../238108.sst
  2012 Sep 30 05:01:54   beam.smp   R   149 /var/db/riak/leveldb/.../186222.sst
  2012 Sep 30 05:02:35   beam.smp   R   188 /var/db/riak/leveldb/.../200051.sst
  2012 Sep 30 05:02:35   beam.smp   R   159 /var/db/riak/leveldb/.../209376.sst
  2012 Sep 30 05:02:35   beam.smp   R   178 /var/db/riak/leveldb/.../203436.sst
  2012 Sep 30 05:02:40   beam.smp   R   172 /var/db/riak/leveldb/.../204688.sst
  2012 Sep 30 05:03:11   beam.smp   R   200 /var/db/riak/leveldb/.../219837.sst
  2012 Sep 30 05:03:38   beam.smp   R   142 /var/db/riak/leveldb/.../222443.sst
  [...]

• less frequent
slide 20:
zfsslower.d, cont.
• All ZFS reads/writes:

  # ./zfsslower.d
  TIME                   PROCESS    D   ms FILE
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../221844.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../221155.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../215917.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../190134.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../234539.sst
  2012 Sep 30 04:46:09   nginx           0 /db/log/apps/prod17_nginx_access.log
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../205368.sst
  2012 Sep 30 04:46:09   beam.smp   R    1 /var/db/riak/leveldb/.../199665.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../177866.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../177866.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../177866.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../177866.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../218473.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../210712.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../234194.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../233302.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../233302.sst
  2012 Sep 30 04:46:09   beam.smp   R    0 /var/db/riak/leveldb/.../233302.sst
  [...]
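Rather than printing every event, the same fbt probes can feed an aggregation; a minimal sketch (not in the deck) that summarizes zfs_read()/zfs_write() latency as a distribution, printed on Ctrl-C:

  # dtrace -n '
      fbt::zfs_read:entry, fbt::zfs_write:entry { self->ts = timestamp; }
      fbt::zfs_read:return, fbt::zfs_write:return /self->ts/ {
          @[probefunc, "ns"] = quantize(timestamp - self->ts);   /* per-function latency */
          self->ts = 0;
      }'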
slide 21:
zfsslower.d, cont.
• Written in DTrace

  [...]
  fbt::zfs_read:entry, fbt::zfs_write:entry
  {
      self->path = args[0]->v_path;
      self->kb = args[1]->uio_resid / 1024;
      self->start = timestamp;
  }

  fbt::zfs_read:return, fbt::zfs_write:return
  /self->start && (timestamp - self->start) >= min_ns/
  {
      this->iotime = (timestamp - self->start) / 1000000;
      this->dir = probefunc == "zfs_read" ? "R" : "W";
      printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
          execname, this->dir, self->kb, this->iotime,
          self->path != NULL ? stringof(self->path) : "<null>");
  }
  [...]

• zfsslower.d, also on github, originated from the DTrace book
slide 22:
zfsslower.d, cont.
• Traces the VFS/ZFS interface (kernel), from usr/src/uts/common/fs/zfs/zfs_vnops.c:

  /*
   * Regular file vnode operations template
   */
  vnodeops_t *zfs_fvnodeops;
  const fs_operation_def_t zfs_fvnodeops_template[] = {
      VOPNAME_OPEN,     { .vop_open = zfs_open },
      VOPNAME_CLOSE,    { .vop_close = zfs_close },
      VOPNAME_READ,     { .vop_read = zfs_read },
      VOPNAME_WRITE,    { .vop_write = zfs_write },
      VOPNAME_IOCTL,    { .vop_ioctl = zfs_ioctl },
      VOPNAME_GETATTR,  { .vop_getattr = zfs_getattr },
  [...]
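These entry points can be confirmed as live fbt probes before tracing; listing them (rather than enabling them) is a quick sanity check, assuming the zfs module is loaded:

  # dtrace -l -n fbt::zfs_read:entry -n fbt::zfs_write:entry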
slide 23:
iosnoop
• Traces block-device-level I/O:

  # ./iosnoop
    UID   PID D      BLOCK   SIZE       COMM PATHNAME
    103  5891 R  238517878 131072   beam.smp <none>
    103  5891 R  544800120 131072   beam.smp <none>
    103  5891 R  286317766 131072   beam.smp <none>
    103  5891 R   74515841 131072   beam.smp <none>
    103  5891 R   98341339 131072   beam.smp <none>
  [...]

• Has many options:

  # ./iosnoop -Dots
  STIME(us)  TIME(us)  DELTA(us)  DTIME(us)  UID   PID D      BLOCK   SIZE       COMM PATHNAME
   ...                                            5891 R  128302372 131072   beam.smp <none>
   ...                                            5891 R  143783200 131072   beam.smp <none>
   ...                                            5891 R   84913822 131072   beam.smp <none>
   ...                                            5891 R   14964144 131072   beam.smp <none>
   ...                                            5891 R  283398320 131072   beam.smp <none>
   ...                                            5891 R  288468148 131072   beam.smp <none>
  [timestamp and UID columns lost in text extraction]
slide 24:
iosnoop, cont.
• Written in DTrace

  [...]
  io:genunix::done
  /start_time[args[0]->b_edev, args[0]->b_blkno]/
  {
  [...]
      /* fetch entry values */
      this->dev = args[0]->b_edev;
      this->blk = args[0]->b_blkno;
      this->suid = start_uid[this->dev, this->blk];
      this->spid = start_pid[this->dev, this->blk];
      this->sppid = start_ppid[this->dev, this->blk];
      self->sargs = (int)start_args[this->dev, this->blk] == 0 ?
          "" : start_args[this->dev, this->blk];
      self->scomm = start_comm[this->dev, this->blk];
  [...]
      printf("%5d %5d %1s %8d %6d ", this->suid, this->spid,
          args[0]->b_flags & B_READ ? "R" : "W",
          args[0]->b_blkno, args[0]->b_bcount);
  [...]

• From the DTraceToolkit
slide 25:
iosnoop, cont.
• 370 lines of code, mostly to process options:

  USAGE: iosnoop [-a|-A|-DeghiNostv] [-d device] [-f filename]
                 [-m mount_point] [-n name] [-p PID]
         iosnoop              # default output
                -a            # print all data (mostly)
                -A            # dump all data, space delimited
                -D            # print time delta, us (elapsed)
                -e            # print device name
                -g            # print command arguments
                -i            # print device instance
                -N            # print major and minor numbers
                -o            # print disk delta time, us
                -s            # print start time, us
                -t            # print completion time, us
                -v            # print completion time, string
                -d device     # instance name to snoop
                -f filename   # snoop this file only
                -m mount_point # this FS only
                -n name       # this process name only
                -p PID        # this PID only
    eg,
         iosnoop -v           # human readable timestamps
         iosnoop -N           # print major and minor numbers
         iosnoop -m /         # snoop events on filesystem / only
slide 26:
iosnoop, cont.
• I originally wrote it in 2004 to solve disk-I/O-by-process
• Shipped by default on Mac OS X (/usr/bin), Oracle Solaris 11
• I wrote a companion called iotop. A similar iotop tool later appeared for Linux (via the blk tracing framework)
• Doesn't do ZFS pathnames yet (known issue; hackathon?)
  • Pathnames from the block-device layer were always a party trick: they relied on a cached vnode->v_path
  • ZFS aggregates I/O. At the block-device layer, no one vnode is responsible (or vnode pointer)
slide 27:
iosnoop, cont.
• Locate latency origin:
  [diagram: Process -> Syscall Interface -> VFS -> ZFS -> Block Device Interface -> Disks; correlate zfsslower.d (at VFS) with iosnoop (at the block device interface)]
slide 28:
iostacks.d
• Just a one-liner really (could make an "iostacks.d"):

  # dtrace -n 'io:::start { @[stack()] = count(); }'
  dtrace: description 'io:::start ' matched 6 probes

                genunix`ldi_strategy+0x53
                zfs`vdev_disk_io_start+0xcc
                zfs`zio_vdev_io_start+0xab
                zfs`zio_execute+0x88
                zfs`zio_nowait+0x21
                zfs`vdev_mirror_io_start+0xcd
                zfs`zio_vdev_io_start+0x250
                zfs`zio_execute+0x88
                zfs`zio_nowait+0x21
                zfs`arc_read_nolock+0x4f9
                zfs`arc_read+0x96
                zfs`dsl_read+0x44
                zfs`dbuf_read_impl+0x166
                zfs`dbuf_read+0xab
                zfs`dmu_buf_hold_array_by_dnode+0x189
                zfs`dmu_buf_hold_array+0x78
                zfs`dmu_read_uio+0x5c
                zfs`zfs_read+0x1a3
                genunix`fop_read+0x8b
                genunix`read+0x2a7
slide 29:
iostacks.d, cont.
• Stack recognition chart:

  # dtrace -n 'io:::start { @[stack()] = count(); }'
  dtrace: description 'io:::start ' matched 6 probes

                genunix`ldi_strategy+0x53
                zfs`vdev_disk_io_start+0xcc
                zfs`zio_vdev_io_start+0xab
                zfs`zio_execute+0x88
                zfs`zio_nowait+0x21
                zfs`vdev_mirror_io_start+0xcd
                zfs`zio_vdev_io_start+0x250
                zfs`zio_execute+0x88
                zfs`zio_nowait+0x21
                zfs`arc_read_nolock+0x4f9          <- arc-miss
                zfs`arc_read+0x96
                zfs`dsl_read+0x44
                zfs`dbuf_read_impl+0x166
                zfs`dbuf_read+0xab
                zfs`dmu_buf_hold_array_by_dnode+0x189
                zfs`dmu_buf_hold_array+0x78
                zfs`dmu_read_uio+0x5c
                zfs`zfs_read+0x1a3
                genunix`fop_read+0x8b              <- syscall read()
                genunix`read+0x2a7
slide 30:
iostacks.d, cont.
• Stack recognition chart:

  # dtrace -n 'io:::start { @[stack()] = count(); }'
  dtrace: description 'io:::start ' matched 6 probes

                genunix`ldi_strategy+0x53
                zfs`vdev_disk_io_start+0xcc
                zfs`zio_vdev_io_start+0xab
                zfs`zio_execute+0x88
                zfs`vdev_queue_io_done+0x70
                zfs`zio_vdev_io_done+0x80
                zfs`zio_execute+0x88
                genunix`taskq_thread+0x2d0
                unix`thread_start+0x8              <- ZIO pipeline
slide 31:
iostacks.d, cont.
• From zio, when parent == NULL (first zio):

  # dtrace -n 'zio_create:entry /arg0 == NULL/ { @[stack()] = count(); }'
  [...]
                zfs`zio_null+0x77
                zfs`zio_root+0x2d
                zfs`dmu_buf_hold_array_by_dnode+0x113
                zfs`dmu_buf_hold_array+0x78
                zfs`dmu_write+0x80
                zfs`space_map_sync+0x288
                zfs`metaslab_sync+0x135            <- SPA sync
                zfs`vdev_sync+0x7f
                zfs`spa_sync+0x38b
                zfs`txg_sync_thread+0x204
                unix`thread_start+0x8
slide 32:
iostacks.d, cont.
• Can identify the reason for disk I/O
  • including unexplained additional I/O from ZFS
• For the times it doesn't work (ZIO pipeline task pool), try a similar one-liner on fbt::zio_create:entry, for the stack trace from the creation of all ZIOs (see below).
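A minimal form of that fallback (probe name as used on slide 31), aggregating the kernel stack for every zio created and keeping the ten most frequent on exit:

  # dtrace -n 'fbt::zio_create:entry { @[stack()] = count(); } END { trunc(@, 10); }'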
slide 33:
metaslab_free.d
• Traces ZFS metaslab details:

  # ./metaslab_free.d
  Tracing ZFS metaslab alloc.  metaslab_df_free_pct = 4 %

  2012 Sep 30 05:32:47 free %pct by allocations:
  [...]
  2012 Sep 30 05:32:49 free %pct by allocations:
  [per-percentage counts lost in text extraction]

• ZFS pools are split into metaslabs (eg, 10 GB each)
• Metaslabs have two allocation algorithms:
    metaslab %free >= metaslab_df_free_pct == first fit (fast)
    metaslab %free <  metaslab_df_free_pct == best fit (slow)

slide 34:
metaslab_free.d, cont.
• Written in DTrace

  dtrace:::BEGIN
  {
      printf("Tracing ZFS metaslab alloc.  metaslab_df_free_pct = %d %%\n",
          `metaslab_df_free_pct);
  }

  fbt::metaslab_df_alloc:entry
  {
      this->pct = args[0]->sm_space * 100 / args[0]->sm_size;
      @[this->pct] = count();
  }

  profile:::tick-1s
  {
      printf("\n%Y free %%pct by allocations:", walltimestamp);
      printa(@);
      trunc(@);
  }

• metaslab_free.d is also on github

slide 35:
metaslab_free.d, cont.
• Shows whether allocations are currently first fit (fast) or best fit (slow)
• Correlate to performance changes
• Use while tuning
• Can't just try out a new setting and watch performance: perf can improve if ZFS switched to a new, emptier metaslab at the same time. The script identifies if that occurred.
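The threshold the script reports can also be read directly from the live kernel with mdb (illumos, global zone only; the deck's system reported 4%):

  # echo metaslab_df_free_pct/D | mdb -k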
slide 36:
spasync.d
• Traces ZFS SPA syncs:

  # ./spasync.d
  Tracing ZFS spa_sync() slower than 1 ms...
  2012 Sep 30 05:56:07  zones   30 ms, 25 MB 464 I/O
  2012 Sep 30 05:56:12  zones   22 ms, 15 MB 366 I/O
  2012 Sep 30 05:56:17  zones   39 ms, 19 MB 361 I/O
  2012 Sep 30 05:56:22  zones  143 ms, 31 MB 536 I/O
  2012 Sep 30 05:56:27  zones   32 ms, 27 MB 465 I/O
  2012 Sep 30 05:56:32  zones   27 ms, 24 MB 468 I/O
  2012 Sep 30 05:56:37  zones   31 ms, 23 MB 489 I/O
  2012 Sep 30 05:56:42  zones  200 ms, 85 MB 956 I/O
  2012 Sep 30 05:56:47  zones  122 ms, 46 MB 593 I/O
  2012 Sep 30 05:56:52  zones   48 ms, 18 MB 394 I/O
  2012 Sep 30 05:56:57  zones   43 ms, 16 MB 409 I/O
  2012 Sep 30 05:57:02  zones   33 ms, 27 MB 526 I/O
  [...]

• Check for correlations with I/O latency

slide 37:
spasync.d, cont.
• Written in DTrace

  [...]
  fbt::spa_sync:entry
  /!self->start/
  {
      in_spa_sync = 1;
      self->start = timestamp;
      self->spa = args[0];
  }

  io:::start
  /in_spa_sync/
  {
      @io = count();
      @bytes = sum(args[0]->b_bcount);
  }

  fbt::spa_sync:return
  /self->start && (this->ms = (timestamp - self->start) / 1000000) > MIN_MS/
  {
      normalize(@bytes, 1048576);
      printf("%-20Y %-10s %6d ms, ", walltimestamp,
          stringof(self->spa->spa_name), this->ms);
      printa("%@d MB %@d I/O\n", @bytes, @io);
  [...]

• spasync.d is also on github

slide 38:
spasync.d, cont.
• Simple and effective: helps diagnose TXG sync issues, providing information to help guide tuning.
• Long heritage: Matt Ahrens -> Roch Bourbonnais -> Ben Rockwood -> Brendan Gregg

slide 39:
%iowait
• I/O wait percent:

  # mpstat 1
  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  [per-CPU rows; most column values lost in text extraction; the wt column is 0 on every CPU]

slide 40:
%iowait
• I/O wait percent ... still zero in illumos!

  # mpstat 1
  [same output as the previous slide: the wt column is 0 on every CPU]

slide 41:
arcstat
• ZFS ARC and L2ARC statistics:

  $ ./arcstat 1
      time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
  04:45:47   ...                                                  14G   14G
  04:45:49   15K                                                  14G   14G
  04:45:50   23K                                                  14G   14G
  04:45:51   65K                                                  14G   14G
  04:45:52   30K                                                  14G   14G
  [...]
  [miss columns lost in text extraction]

• Statistics are per-interval:
  • read, miss: total ARC accesses, misses
  • miss%, dm%, pm%, mm%: ARC miss percentage total, demand, prefetch, metadata
  • dmis, pmis, mmis: misses for demand, prefetch, metadata
  • arcsz, c: ARC size, ARC target size

slide 42:
arcstat, cont.
• Written in Perl, uses kstats (zfs::arcstats:):

  [...]
  sub snap_stats {
          my %prev = %cur;
          if ($kstat->update()) {
                  printf("<State Changed>\n");
          }
          my $hashref_cur = $kstat->{"zfs"}{0}{"arcstats"};
          %cur = %$hashref_cur;
          foreach my $key (keys %cur) {
                  next if $key =~ /class/;
                  if (defined $prev{$key}) {
                          $d{$key} = $cur{$key} - $prev{$key};
  [...]
  sub calculate {
          %v = ();
          $v{"time"} = strftime("%H:%M:%S", localtime);
          $v{"hits"} = $d{"hits"}/$int;
          $v{"miss"} = $d{"misses"}/$int;
          $v{"read"} = $v{"hits"} + $v{"miss"};
          $v{"hit%"} = 100*$v{"hits"}/$v{"read"} if $v{"read"} > 0;
          $v{"miss%"} = 100 - $v{"hit%"} if $v{"read"} > 0;
  [...]

• github.com/mharsch/arcstat

slide 43:
arcstat, cont.
• Options to configure output, including L2ARC stats
• Crucial data when analyzing cache performance (and easier to read than the raw form: kstat -pn arcstats)
• Originally by Neelakanth Nadgir, then Jason Hellenthal (FreeBSD port) and Mike Harsch

slide 44:
arcaccess.d
• ARC population age:

  # ./arcaccess.d -n 'tick-10s { exit(0); }'
  lbolt rate is 100 Hertz.
  Tracing lbolts between ARC accesses...

             value  ------------- Distribution ------------- count
                -1 |                                         0
                 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  729988
                 1 |
                 2 |
                 4 |
                 8 |
                16 |
                32 |
                64 |
               128 |
               256 |
               512 |
              1024 |
              2048 |
              4096 |
              8192 |
             16384 |
             32768 |
             65536 |
            131072 |
            262144 |
            524288 |
           1048576 |
           2097152 |
           4194304 |
           8388608 |
          16777216 |
          33554432 |
  [counts for the remaining rows lost in text extraction; value-axis annotations: Age: 10 ms, 1 second, 1 minute, 1 hour, 1 day]

slide 45:
arcaccess.d, cont.
• Written in DTrace

  #!/usr/sbin/dtrace -s

  #pragma D option quiet

  dtrace:::BEGIN
  {
      printf("lbolt rate is %d Hertz.\n", `hz);
      printf("Tracing lbolts between ARC accesses...");
  }

  fbt::arc_access:entry
  {
      self->ab = args[0];
      self->lbolt = args[0]->b_arc_access;
  }

  fbt::arc_access:return
  /self->lbolt/
  {
      @ = quantize(self->ab->b_arc_access - self->lbolt);
      self->ab = 0;
      self->lbolt = 0;
  }

• http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/

slide 46:
arcaccess.d, cont.
• Shows the population age of the ARC, based on access time
• Helps determine working set size and turnover rate
• The previous example showed the ARC was so large it was able to keep buffers that had not been accessed in over a day.
• Turnover rate == low; working set size == entirely fits, and is likely much smaller.
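The raw counters behind arcstat can also be spot-checked directly; for example, ARC size and target size, using the arcstats statistic names referenced above:

  $ kstat -p zfs:0:arcstats:size zfs:0:arcstats:c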
slide 47:
Latency counters
• kstat -p zone_vfs:::*_ops shows latency outlier counts since boot:

  $ kstat -p zone_vfs:::*_ops
  zone_vfs:0:global:100ms_ops        ...
  zone_vfs:0:global:10ms_ops         ...
  zone_vfs:0:global:10s_ops          ...
  zone_vfs:0:global:1s_ops           ...
  zone_vfs:1:5cedb79e:100ms_ops      ...
  zone_vfs:1:5cedb79e:10ms_ops       ...
  zone_vfs:1:5cedb79e:10s_ops        ...
  zone_vfs:1:5cedb79e:1s_ops         ...
  [values lost in text extraction]

• reads/writes:

  $ kstat -p zone_vfs:::reads zone_vfs:::writes
  zone_vfs:0:global:reads                           666148884
  zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:reads   1791879457
  zone_vfs:0:global:writes                          ...
  zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:writes  ...

slide 48:
Latency counters
• kstat -p zone_vfs:::*_ops shows latency outlier counts since boot
  [same output as the previous slide, annotated: yikes, 430 ops > 1s! however, this is a small fraction]
• reads/writes
  [same output, annotated: >1s = 0.00002%]

slide 49:
Latency counters, cont.
• Added by Bill Pijewski of Joyent
• Proven invaluable, usually to quickly eliminate ZFS (and disk) as a source of issues, after the fact (when counters are near-zero)
• Has identified real ZFS/disk-level issues as well

slide 50:
Scatter plots
• iosnoop -Dots > outfile, then awk/perl, then gnuplot/R
  [scatter plot: latency (up to ~60 ms) vs time (s); read = red, write = blue]

slide 51:
Scatter plots, cont.
• Has been an effective last-resort to investigate nasty PERC issues: reads queueing behind writes
• Has a big data problem: many points, x-y coordinates, takes time & space to capture and render.
• Heat maps solve this, and can be real-time

slide 52:
Heat maps
• WHAT DOES IT MEAN?
  [latency heat map]

slide 53:
Heat maps
• WHAT DOES IT MEAN?
  [same heat map, annotated: disk I/O; DRAM I/O; workload becomes cached]

slide 54:
Heat maps
• WHAT DOES IT MEAN?
  [same heat map, annotated: bi-modal distribution]

slide 55:
Heat maps
• File system latency is commonly bi-modal: cache hits vs misses
• Average latency, stddev, 99th percentile don't make a lot of sense
• Given line graphs of these, we would not be able to solve many problems
• Need to see the full distribution: heat maps
• Also good at identifying outliers

slide 56:
Heat maps, cont.
• Latency outliers, cloud wide (entire datacenter):
  [heat map]

slide 57:
Heat maps, cont.
• Latency outliers, cloud wide (entire datacenter):
  [same heat map, annotated: Inconceivable; Very Bad; Bad; Good]

slide 58:
Cloud computing: within the zone
• Suspected "noisy neighbors" hurting ZFS performance
• ZFS I/O throttling was added to address this
• Need ZFS observability to confirm/deny issues from within a zone (eg, a Joyent SmartMachine); however, kernel access (including the DTrace io and fbt providers) is not available
• vfsstat (already covered) does work, as it is zone-aware
• Cloud-safe DTrace tools include:
  • mysql_pid_latency.d: tracing FS calls from within the app
  • syscall tracing when fds[].fi_fs == "zfs" (see the one-liner below)
• Scripts are in:
  http://dtrace.org/blogs/brendan/2011/05/18/file-system-latency-part-3/
  http://dtrace.org/blogs/brendan/2011/05/24/file-system-latency-part-4/
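A minimal sketch of that zone-safe syscall tracing (patterned on the linked posts; trimmed here to the read/write syscalls only):

  # dtrace -n '
      syscall::read:entry, syscall::write:entry
      /fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }      /* only fds on ZFS */
      syscall::read:return, syscall::write:return /self->ts/ {
          @[probefunc, "ns"] = quantize(timestamp - self->ts);  /* latency distribution */
          self->ts = 0;
      }'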
slide 59:
Recap
  [same full-stack diagram as slide 8: mysql_pid_fslatency.d, syscall with fi_fs == zfs (Syscall Interface); vfsstat (VFS); zioslower.d, spasync.d, metaslab_free.d, arcstat.pl, arcaccess.d, iostacks.d (ZFS); iostat, iosnoop (Block Device Interface / Disks); kernel drivers as needed, see DTrace book chap 4]

slide 60:
There's more
• There are more tools:
  • zpool iostat
  • zpool status
    • degraded? resilvering?
  • zfs get all
    • recsize match workload? compression? dedup?
  • ziostat

slide 61:
There's more, cont.
• There are more tools
• And much, much more DTrace
• DTRACE ALL THE THINGS?

slide 62:
DTrace book
• Chapter 4 Disks: disk I/O subsystem scripts
• Chapter 5 is FS: includes ZFS scripts

slide 63:
Future
• More tools
  • as needed, working on customer issues

slide 64:
Future
• More tools
  • as needed, working on customer issues
• But not too many...
  [image: Wenger Giant knife]

slide 65:
Thank you!
• email: brendan@joyent.com
• twitter: @brendangregg
• Resources:
  • http://dtrace.org/blogs/brendan
  • http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/
  • http://dtrace.org/blogs/brendan/2011/06/03/file-system-latency-part-5/
  • https://github.com/brendangregg/dtrace-cloud-tools/tree/master/fs
  • http://www.dtracebook.com
  • More on latency heat maps: http://queue.acm.org/detail.cfm?id=1809426