OSCON 2013: Open Source Systems Performance
Talk given at OSCON 2013 by Brendan Gregg. Video: http://www.youtube.com/watch?v=VsQWSJgb1AE
PDF: OSCON2013_sysperf.pdf
Slide text (from pdftotext):
slide 1:
Open Source Systems Performance
Brendan Gregg, Lead Performance Engineer, Joyent
Jul 2013

slide 2:
A Play in Three Acts
A tale of operating systems, performance, and open source
Dramatis Personae:
- Solaris, an Operating System
- Brendan Gregg, a Performance Engineer
- Linux, a Kernel
Acts:
- 1. Before open source (traditional tools)
- 2. Open source (source code-based tracing)
- 3. Closed source

slide 3:
Setting the Scene: Why Performance?
Reduce IT spend
- price/performance
Choose performing components
- evaluation (benchmarking) of software and hardware
Develop scalable architectures
- understand system limits and develop around them
Solve issues

slide 4:
Setting the Scene: What is Systems Performance?
Analysis of:
- A) the kernel
- B) applications from system context
Wins:
- 2-20% wins: tuning TCP, NUMA, etc
- 2-200x wins: latency outliers, bugs, etc
- 2-2000x wins: eliminating unnecessary work
The basis is the system
The target is everything, down to metal
Think LAMP not AMP
[Diagram: the software stack, from Applications (DBs, all server types, ...) and System Libraries, through the System Call Interface and Kernel (VFS, Sockets, File Systems, TCP/UDP, Volume Managers, Block Device Interface, Ethernet, Scheduler, Virtual Memory, Device Drivers), down to Firmware and Metal]

slide 5:
Part 1. Before Open Source

slide 6:
Part 1. Before Open Source
The year is 2002
Enter Solaris 9, stage left
Solaris 9 is not open source

slide 7:
Solaris 9
Numerous performance observability tools:

    Scope    Type       Tools
    system   counters   vmstat(1M), iostat(1M), netstat(1M), kstat(1M), sar(1)
    system   tracing    snoop(1M), prex(1M), tnfdump(1)
    process  counters   ps(1), prstat(1M), ptime(1)
    process  tracing    truss(1), sotruss(1), apptrace(1)
    both     profiling  lockstat(1M), cpustat(1M), cputrack(1)

Performance, including resource controls and observability, was a main feature

slide 8:
Systems Performance
Typified by Unix tools like vmstat(1M) (from BSD):

    $ vmstat 1
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr cd cd s0 s5   in   sy   cs us sy id
     0 0 0 8475356 565176  2   8  0  0  0  0  1   0  0 -0 13  378  101  142  0  0 99
     1 0 0 7983772 119164  0   0  0  0  0  0  0 224  0  0  0 1175 5654 1196  1 15 84
     0 0 0 8046208 181600  0   0  0  0  0  0  0 322  0  0  0 1473 6931 1360  1  7 92
    [...]

Some drill-down was possible with options; eg, the Solaris -p:

    $ vmstat -p 1
         memory           page          executable      anonymous      filesystem
       swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
     8475336 565160  2  [...]
     7972332 107648  1  [...]
     7966188 101504  0  [...]

Despite many tools, options, and metrics, the extent of observability was limited. This can be illustrated using a functional diagram.

slide 9:
Operating System Functional Diagram
[Diagram: Applications (DBs, all server types, ...) and System Libraries above the System Call Interface; the Kernel (VFS, Sockets, File Systems, TCP/UDP, Volume Managers, Block Device Interface, Ethernet, Scheduler, Virtual Memory, Device Drivers); Hardware below: CPU and CPU Interconnect, Memory Bus and DRAM, I/O Bus, I/O Bridge, I/O Controller, Expander Interconnect, Disks, Network Controller, Interface Transports, Ports]

slide 10:
Solaris 9 Observability Coverage
[Diagram: the functional diagram annotated with tool coverage — apptrace, sotruss, truss at the library/syscall layers; netstat and snoop for networking; lockstat, mpstat, prstat, vmstat for scheduler and virtual memory; iostat for disks; prex, kstat, cpustat, cputrack at various layers; sar and kstat across the system]

slide 11:
Problems
Below the syscall interface was dark, if not pitch black
Many components either had:
- No metrics at all
- Undocumented metrics (kstat)
Certain performance issues could not be analyzed
- Time from asking Sun for a new performance metric to having it in production could be months, years, or never
- You solve what the current tools let you: the “tools method” of iterating over existing tools and metrics
Situation largely accepted, as a better way wasn’t known
Much systems performance literature was written in this era, and is still around

slide 12:
High Performance Tuning
Performance experts were skilled in the art of inference and experimentation
- Study Solaris Internals for background
- Determine kernel behavior based on indirect metrics
- Create known workloads to test undocumented metrics, and to explore system behavior
- Heavy use of the scientific method
Science is good, source is better

slide 13:
... If the Universe was Open Source
vi universe/include/electron.h:

    struct electron {
        mass_t    e_mass;    /* electron mass */
        charge_t  e_charge;  /* electron charge */
        uint64_t  e_flags;   /* 0x01 particle; 0x10 wave */
        int       e_orbit;   /* current orbit level */
        boolean_t e_matter;  /* 1 = matter; 0 = antimatter */
        [...]
    } electron_t;

vi universe/particles.c:

    photon_t *
    spontaneous_emission(electron_t *e)
    {
        photon_t *p;

        if (e->e_orbit > 1) {
            p = palloc(e);
            e->e_orbit--;
        } else {
            electron_capture(e->e_nucleusp);
            return (NULL);
        }
        return (p);
    }

slide 14:
Part 2. Open Source

slide 15:
Part 2. Open Source
The year is 2005
Solaris 10, as OpenSolaris, becomes open source
- In response to Linux, which always was

slide 16:
Open Source Metrics
Undocumented kstats could now be understood from source
- it was like being handed the source code to the Universe
- I wasn’t a Sun badged employee; I’d been working without source access
Tool metrics, and the exact behavior of the kernel, could also be better understood

    $ vmstat 1
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr cd cd s0 s5   in   sy   cs us sy id
     0 0 0 8475356 565176  2   8  0  0  0  0  1   0  0 -0 13  378  101  142  0  0 99
     1 0 0 7983772 119164  0   0  0  0  0  0  0 224  0  0  0 1175 5654 1196  1 15 84

For example, where does “r” come from?

slide 17:
Understanding “r”
Starting with vmstat(1M)’s source, and drilling down:

usr/src/cmd/stat/vmstat/vmstat.c:

    static void
    printhdr(int sig)
    {
    [...]
        if (swflag)
            (void) printf(" r b w   swap  free  so pi po fr de sr ");
        else
            (void) printf(" r b w   swap  free  mf pi po fr de sr ");
    [...]

    static void
    dovmstats(struct snapshot *old, struct snapshot *new)
    {
    [...]
        adjprintf(" %*lu", 1, DELTA(s_sys.ss_sysinfo.runque) / sys_updates);

slide 18:
Understanding “r”, cont.
Searching on ss_sysinfo:

usr/src/cmd/stat/common/statcommon.h:

    struct sys_snapshot {
        sysinfo_t ss_sysinfo;
    [...]

usr/src/uts/common/sys/sysinfo.h:

    typedef struct sysinfo {  /* (update freq) update action                */
        uint_t  updates;      /* (1 sec) ++                                 */
        uint_t  runque;       /* (1 sec) += num runnable procs              */
        uint_t  runocc;       /* (1 sec) ++ if num runnable procs > 0       */
        uint_t  swpque;       /* (1 sec) += num swapped procs               */
        uint_t  swpocc;       /* (1 sec) ++ if num swapped procs > 0        */
        uint_t  waiting;      /* (1 sec) += jobs waiting for I/O            */
    } sysinfo_t;

slide 19:
Understanding “r”, cont.
ss_sysinfo is populated from kstat:

usr/src/cmd/stat/common/acquire.c:

    int
    acquire_sys(struct snapshot *ss, kstat_ctl_t *kc)
    {
        size_t i;
        kstat_named_t *knp;
        kstat_t *ksp;

        if ((ksp = kstat_lookup(kc, "unix", 0, "sysinfo")) == NULL)
            return (errno);
        if (kstat_read(kc, ksp, &ss->s_sys.ss_sysinfo) == -1)
            return (errno);
    [...]

slide 20:
Understanding “r”, cont.
Searching on runque population, in the kernel:

usr/src/uts/common/os/clock.c:

    static void
    clock(void)
    {
        /*
         * There is additional processing which happens every time
         * the nanosecond counter rolls over which is described
         * below - see the section which begins with: if (one_sec)
         */
    [...]
        do {
            uint_t cpu_nrunnable = cp->cpu_disp->disp_nrunnable;
            nrunnable += cpu_nrunnable;
    [...]
        } while ((cp = cp->cpu_next) != cpu_list);
    [...]
        if (one_sec) {
    [...]
            if (nrunnable) {
                sysinfo.runque += nrunnable;
                sysinfo.runocc++;

Once-a-second snapshots? That’s good to know!

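Given that, DTrace offers a quick cross-check: read the kernel global directly. A minimal sketch, assuming an illumos/Solaris kernel where the sysinfo global is visible to DTrace (the leading backquote reads kernel globals; on some builds it may not resolve):

    # Print the kernel's run queue accumulators once per second.
    # Assumes `sysinfo resolves to the global sysinfo_t updated by clock().
    dtrace -qn 'tick-1sec { printf("runque: %d  runocc: %d\n",
        `sysinfo.runque, `sysinfo.runocc); }'
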
slide 21:
Statistic Spelunking
A matter of browsing and reading source code
- I use cscope, a text-based source code browser:
Doesn’t require expertise to begin with: keep reading code until it makes sense
Might take hours or days if you are new to a complex code base
You may only do this three times in your career, but each time was worth it!

    C symbol: runque
      File        Function      Line
    0 sa.h                       188 uint64_t runque;
    1 sysinfo.h                  132 uint_t   runque;
    2 sar.c       prt_q_opt      919 (float )xx->si.runque / (float )xx->si.runocc,
    3 kstat.c     save_sysinfo  1066 SAVE_UINT32(ksi, sysinfo, runque);
    4 vmstat.c    dovmstats      316 adjprintf(" %*lu", 1, DELTA(s_sys.ss_sysinfo.runque) / sys_updates);
    5 clock.c     clock          862 sysinfo.runque += nrunnable;

    Find this C symbol:
    Find this global definition:
    Find functions called by this function:
    Find functions calling this function:
    Find this text string:
    Change this text string:
    Find this egrep pattern:
    Find this file:
    Find files #including this file:

slide 22:
Open Source Dynamic Tracing
Solaris 10 also provided Dynamic Tracing (DTrace), which can observe virtually everything
Core feature of all later OpenSolaris derivatives, including SmartOS and OmniOS
Observability gaps now filled

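A taste of what “virtually everything” means: the canonical first one-liner counts system calls by process name, using the standard syscall provider:

    # Count system calls by process name; Ctrl-C prints the summary.
    dtrace -n 'syscall:::entry { @[execname] = count(); }'
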
slide 23:
Solaris 10/SmartOS/OmniOS Observability Coverage
[Diagram: the functional diagram again, now with dtrace spanning the entire user and kernel stack, alongside netstat, plockstat, lockstat, mpstat, truss, kstat, snoop, vmstat, intrstat, iostat, prstat, cpustat, cputrack, and sar]

slide 24:
Open Source Dynamic Tracing: Example
Given the kernel source code, eg, ZFS SPA sync:

usr/src/uts/common/fs/zfs/spa.c:

    /*
     * Sync the specified transaction group. New blocks may be dirtied as
     * part of the process, so we iterate until it converges.
     */
    void
    spa_sync(spa_t *spa, uint64_t txg)
    {
        dsl_pool_t *dp = spa->spa_dsl_pool;
    [...]

Trace and time it using the DTrace function boundary tracing (fbt) provider:

    # dtrace -n 'fbt::spa_sync:entry { self->ts = timestamp; }
        fbt::spa_sync:return /self->ts/ {
        printf("%Y %d ms", walltimestamp, (timestamp - self->ts) / 1000000);
        self->ts = 0; }'
    dtrace: description 'fbt::spa_sync:entry ' matched 2 probes
    CPU     ID            FUNCTION:NAME
      0  53625     spa_sync:return 2013 Jul 26 17:37:02 12 ms
      0  53625     spa_sync:return 2013 Jul 26 17:37:08 726 ms
      6  53625     spa_sync:return 2013 Jul 26 17:37:17 6913 ms
      6  53625     spa_sync:return 2013 Jul 26 17:37:17 59 ms

Awesome!

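The same entry/return probes feed naturally into an aggregation, turning a line per event into a latency distribution; a minimal sketch using standard quantize():

    # Histogram of spa_sync() latency in milliseconds (power-of-2 buckets).
    dtrace -n 'fbt::spa_sync:entry { self->ts = timestamp; }
        fbt::spa_sync:return /self->ts/ {
            @["spa_sync (ms)"] = quantize((timestamp - self->ts) / 1000000);
            self->ts = 0; }'
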
slide 25:
Dynamic Tracing Scripts
[Diagram: script names arranged over the functional diagram, grouped approximately by layer:]
- Services: cifs*.d, iscsi*.d, nfsv3*.d, nfsv4*.d, ssh*.d, httpd*.d
- Applications: hotuser, umutexmax.d, lib*.d, node*.d, erlang*.d, j*.d, js*.d
- Language Providers: php*.d, pl*.d, py*.d, rb*.d, sh*.d
- Databases: mysql*.d, postgres*.d, redis*.d, riak*.d
- System calls: opensnoop, statsnoop, errinfo, dtruss, rwtop, rwsnoop, mmap.d, kill.d, shellsnoop, zonecalls.d, weblatency.d, fddist
- VFS/file systems: fswho.d, fssnoop.d, sollife.d, solvfssnoop.d, dnlcsnoop.d, zfsslower.d, ziowait.d, ziostacks.d, spasync.d, metaslab_free.d
- Block I/O and drivers: iosnoop, iotop, disklatency.d, satacmds.d, satalatency.d, scsicmds.d, scsilatency.d, sdretry.d, sdqueue.d, ide*.d, mpt*.d
- Scheduler/virtual memory: priclass.d, pridist.d, cv_wakeup_slow.d, displat.d, capslat.d, minfbypid.d, pgpginbypid.d
- Network: macops.d, ngesnoop.d, ngelink.d, sotop.d, socketio.d, so1stbyte.d, soconnect.d, soaccept.d, ipio.d, ipproto.d, ipstat.d, ipfbtsnoop.d, icmpsnoop.d, tcp1stbyte.d, tcpaccept.d, tcpconnect.d, tcpconnlat.d, tcpio.d, tcpbytes.d, tcpsize.d, tcpnmap.d, udpio.d, udpstat.d
These are some of my scripts from the DTraceToolkit, the DTrace book, and other collections. I’d add more but I ran out of room.

slide 26:
Modern Systems Performance
Typified by an abundance of high resolution, useful metrics (latency)
No longer a problem of missing metrics, but of how to visualize many metrics, and across clouds
- eg, latency heat maps
Prior tools are useful as starting points, with tracing to dig deeper
In the following sections, I’ll describe modern Linux Systems Performance, summarizing how traditional and new tools can be used together
- I’ll group dtrace/systemtap/perf/lttng/ktap/etc as “dynamic tracing”, which is a simplification: some needs may not be met by all those tools

slide 27:
Linux CPU Analysis
Traditional tools:
- 1. system wide usage: vmstat
- 2. per-processor usage: mpstat
- 3. per-process usage: top, ps
Modern tools:
- 4. user- or kernel-stack profiling: perf record -agF
- 5. cycle analysis: perf stat
- 6. tracing scheduler latency: perf sched (see the sketch below)
- 7. tracing CPU usage of functions: dynamic/static tracing
- 8. tracing CPU consumption of spin locks: dynamic/static tracing
- 9. CPU cross call tracing: dynamic/static tracing
- 10. interrupt tracing: dynamic/static tracing

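As one concrete sketch of item 6, the DTrace sched provider can histogram run queue wait time; this follows the documented illumos/Solaris sched-provider example (Linux dynamic tracers expose similar scheduler probe points):

    /* Run queue wait time, in nanoseconds, keyed by CPU. */
    sched:::enqueue
    {
        ts[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
    }

    sched:::dequeue
    /ts[args[0]->pr_lwpid, args[1]->pr_pid]/
    {
        @[args[2]->cpu_id] = quantize(timestamp -
            ts[args[0]->pr_lwpid, args[1]->pr_pid]);
        ts[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
    }
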
slide 28:
Linux Memory Analysis
Traditional tools:
- 1. system wide usage: vmstat
- 2. per-process usage: top
- 3. kernel usage: /proc/meminfo, slabtop
- 4. swapping activity: sar
- 5. leak detection: valgrind
Modern tools:
- 6. tracing allocations: dynamic tracing
- 7. tracing page faults: dynamic/static tracing (see the sketch below)
- 8. tracing kswapd activity: dynamic/static tracing

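A sketch of item 7 using the DTrace vminfo provider as documented on illumos/Solaris; Linux dynamic tracers offer comparable probe points:

    # Count address-space (page) faults by process name.
    dtrace -n 'vminfo:::as_fault { @[execname] = count(); }'
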
slide 29:
Linux File System Analysis
Traditional tools:
- 1. cache usage: free, /proc/meminfo
- 2. syscall access: strace (expensive)
Modern tools:
- 3. tracing VFS accesses: dynamic/static tracing
- 4. tracing file system latency: latencytop, dynamic/static tracing (see the sketch below)
- 5. tracing file system internals: dynamic/static tracing

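File system latency is often first approximated at the syscall layer (item 4); a sketch for read(2) with the standard DTrace syscall provider:

    # Distribution of read(2) latency in nanoseconds, system-wide.
    dtrace -n 'syscall::read:entry { self->ts = timestamp; }
        syscall::read:return /self->ts/ {
            @["read (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0; }'
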
slide 30:
Linux Disk Analysis
Traditional tools:
- 1. per-disk statistics: iostat
- 2. per-process usage: pidstat -d, iotop
Modern tools:
- 3. disk I/O latency tracing: blktrace, static tracing (see the sketch below)
- 4. lower I/O stack tracing: dynamic/static tracing
- 5. SCSI command tracing: dynamic/static tracing
- 6. device driver tracing: dynamic/static tracing

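A sketch of item 3 using the DTrace io provider (illumos/Solaris), keyed on the buffer pointer so completions match their requests:

    # Block I/O latency distribution in nanoseconds.
    dtrace -n 'io:::start { ts[arg0] = timestamp; }
        io:::done /ts[arg0]/ {
            @["block I/O (ns)"] = quantize(timestamp - ts[arg0]);
            ts[arg0] = 0; }'
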
slide 31:
Linux Network Analysis
Traditional tools:
- 1. system wide usage: netstat -s
- 2. per-interface usage: netstat -i, sar -n DEV, ip
- 3. TCP statistics: netstat -s, sar -n TCP
- 4. packet tracing: tcpdump
- 5. socket call tracing: strace (expensive)
- 6. experimental tests: ping, traceroute
Modern tools:
- 7. tracing socket-level latency: dynamic/static tracing (see the sketch below)
- 8. TCP retransmit (only) tracing: dynamic tracing
- 9. tracing TCP kernel internals: dynamic tracing

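Item 7 can also start at the syscall layer; a sketch timing connect(2) by process name, using the standard syscall provider:

    # Distribution of connect(2) latency in nanoseconds, by process.
    dtrace -n 'syscall::connect:entry { self->ts = timestamp; }
        syscall::connect:return /self->ts/ {
            @[execname] = quantize(timestamp - self->ts);
            self->ts = 0; }'
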
slide 32:
Linux Network Analysis, Example
TCP retransmits: given tcp_retransmit_skb(), show the dest IP addr. Source code:

net/ipv4/tcp_output.c:

    int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
    {
        struct tcp_sock *tp = tcp_sk(sk);
        int err = __tcp_retransmit_skb(sk, skb);
    [...]

include/linux/tcp.h:

    struct tcp_sock {
        /* inet_connection_sock has to be the first member of tcp_sock */
        struct inet_connection_sock inet_conn;
    [...]

include/net/inet_connection_sock.h:

    struct inet_connection_sock {
        /* inet_sock has to be the first member! */
        struct inet_sock icsk_inet;
    [...]

slide 33:
Linux Network Analysis, Example ...
More spelunking, like earlier. Not trivial, but doable.

include/net/inet_sock.h:

    struct inet_sock {
        /* sk and pinet6 has to be the first two members of inet_sock */
        struct sock sk;
    #if IS_ENABLED(CONFIG_IPV6)
        struct ipv6_pinfo *pinet6;
    #endif
        /* Socket demultiplex comparisons on incoming packets. */
    #define inet_daddr sk.__sk_common.skc_daddr
    [...]

Here it is

include/net/sock.h:

    struct sock {
        /*
         * Now struct inet_timewait_sock also uses sock_common, so please just
         * don't add nothing before this first member (__sk_common) --acme
         */
        struct sock_common __sk_common;
    [...]

slide 34:
Linux Network Analysis, Example Script
TCP retransmit tracing script, using DTrace4Linux (prototype):

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    dtrace:::BEGIN { trace("Tracing TCP retransmits... Ctrl-C to end.\n"); }

    fbt::tcp_retransmit_skb:entry {
        this->so = (struct sock *)arg0;
        this->d = (unsigned char *)&this->so->__sk_common.skc_daddr;
        printf("%Y: retransmit to %d.%d.%d.%d, by:", walltimestamp,
            this->d[0], this->d[1], this->d[2], this->d[3]);
        stack(99);
    }

slide 35:
Linux Network Analysis, Example Output
TCP retransmit tracing script, using DTrace4Linux (prototype):

    # ./tcpretransmit.d
    Tracing TCP retransmits... Ctrl-C to end.
    2013 Feb 23 18:24:11: retransmit to 10.2.124.2, by:
        kernel`tcp_retransmit_timer+0x1bd
        kernel`tcp_write_timer+0x188
        kernel`run_timer_softirq+0x12b
        kernel`tcp_write_timer
        kernel`__do_softirq+0xb8
        kernel`read_tsc+0x9
        kernel`sched_clock+0x9
        kernel`sched_clock_local+0x25
        kernel`call_softirq+0x1c
        kernel`do_softirq+0x65
        kernel`irq_exit+0x9e
        kernel`smp_apic_timer_interrupt+0x6e
        kernel`apic_timer_interrupt+0x6e
    [...]

slide 36:
Linux Network Example, cont.
I created a custom performance tool on the fly, without kernel changes
Would it be possible if the kernel wasn’t open source?

slide 37:
Opportunities
Open source allows dynamic tracing: otherwise you are tracing blind
Dynamic tracing allows custom metrics and scripts (tools) to be written
- fill in all observability gaps; can solve most performance issues
Many people will use dynamic tracing: eg, DTraceToolkit, DTrace book, company tools; only some may author the tools: the OS or perf engineer on your team (which is ok)
Dynamic tracing also allows new methodologies
- prior methodologies were constrained by existing tools and metrics
- new methodologies can be explored, as any question posed can be answered
Examples of new methodologies:
- USE method
- Thread State Analysis method

slide 38:
Challenges
Systems performance literature was written for the pre-open source Unix days
- Gives the impression that performance ends with older static tools
DTrace is not on Linux yet
- Two ports are in progress:
  - DTrace4Linux: https://github.com/dtrace4linux/linux
  - Oracle Linux DTrace
Instead of waiting, you can try an illumos-kernel based distro like SmartOS
- illumos is the surviving fork of OpenSolaris
Which brings us to Act 3.

slide 39:
Act 3. Closed Source

slide 40:
Act 3. Closed Source
The year is 2010
Oracle stops releasing updates for OpenSolaris
Oracle Solaris 11 is released a year later, closed source
Provides us with a unique additional perspective for open source systems performance

slide 41:
Closed Source Metrics
This removed the only documentation for many metrics and kernel internals
- Back to inference and experimentation by the end user
- Will get harder over time as documentation ages: without a Solaris Internals 3rd Edition, kernel internals may become as opaque as they were in the 90’s

slide 42:
Closed Source Dynamic Tracing
Makes using the DTrace fbt provider much harder
- Hypothetical example to show how this could manifest
- Dynamic tracing of ZFS SPA sync during a performance investigation:

    # dtrace -n 'fbt::spa_sync:entry { printf("%Y", walltimestamp); }'
    dtrace: invalid probe specifier fbt::spa_sync:entry { printf("%Y", walltimestamp); }:
    probe description fbt::spa_sync:entry does not match any probes

Where’d spa_sync() go? Did it get renamed or removed?
- Could be worse if tracing succeeds, but produces misleading metrics due to unknown changes
Note that the capabilities are still there, and can be used by Oracle support

slide 43:
Elsewhere at Oracle
Their DTrace port for Oracle Linux won’t have this handicap
- although, the fbt provider hasn’t been included yet

slide 44:
Epilog: The New Systems Performance
An era of:
- Open source
- Dynamic tracing
- Methodologies
- Distributed systems (cloud)
Covered in my book, out this year

slide 45:
Thank you!
email: brendan@joyent.com
twitter: @brendangregg
blog: http://dtrace.org/blogs/brendan