DockerCon 2017: Container Performance Analysis
Slides for DockerCon 2017 by Brendan Gregg
Video: https://www.youtube.com/watch?v=bK9A5ODIgac
Description: "Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using resource controls including cgroups. The interaction between containers can also be examined, and noisy neighbors either identified of exonerated. Performance tooling can also need special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and name spaces. At Netflix, we're using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show how to successfully analyze performance in a Docker container environment, and navigate differences encountered."
PDF: DockerCon2017_performance_analysis.pdf
Keywords (from pdftotext):
slide 1:
Container Performance Analysis
Brendan Gregg, Sr. Performance Architect, Netflix
slide 2:
Takeaways. Identify bottlenecks:
1. In the host vs container, using system metrics
2. In application code on containers, using CPU flame graphs
3. Deeper in the kernel, using tracing tools
The focus of this talk is how containers work in Linux (will demo on 4.9). I will include some Docker specifics, and start with a Netflix summary (Titus).
slide 3:
1. Titus
Containers at Netflix
Summary slides from the Titus team
slide 4:
Titus
• Cloud runtime platform for container jobs
• Scheduling: service & batch job management; advanced resource management across an elastic shared resource pool
• Container execution: Docker and AWS EC2 integration; adds VPC, security groups, EC2 metadata, IAM roles, S3 logs, …
• Integration with Netflix infrastructure
• In depth: http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
(diagram layers: Service; Batch Job Management; Resource Management & Optimization; Container Execution; Integration)
slide 5:
Current Titus Scale
• Deployed across multiple AWS accounts & three regions
• Over 2,500 instances (mostly M4.4xls & R3.8xls)
• Launched over 1,000,000 containers over a week period
slide 6:
Titus Use Cases
• Service: stream processing (Flink), UI services (Node.js single core), internal dashboards
• Batch: algorithm training, personalization & recommendations, ad hoc reporting, continuous integration builds
• Queued worker model: media encoding
slide 7:
Container Performance @ Netflix
• Ability to scale and balance workloads with EC2 and Titus can already solve many perf issues
• Performance needs:
  • Application analysis: using CPU flame graphs with containers
  • Host tuning: file system, networking, sysctl's, …
  • Container analysis and tuning: cgroups, GPUs, …
  • Capacity planning: reduce over-provisioning
slide 8:
2. Container Background and Strategy
slide 9:
Namespaces
Restricting visibility. Namespaces:
• cgroup
• ipc
• mnt
• net
• pid
• user
• uts
(diagram: PID namespaces; host PID namespace vs PID namespace 1, where container PIDs 1 and 2 map to host PIDs 1238 and 1241; kernel below)
slide 10:
Control Groups
Restricting usage. cgroups:
• blkio
• cpu,cpuacct
• cpuset
• devices
• hugetlb
• memory
• net_cls,net_prio
• pids
• …
(diagram: CPU cgroups; containers assigned to a cpu cgroup, which maps to CPUs)
slide 11:
Linux Containers
Container = combination of namespaces & cgroups
(diagram: host running containers 1-3, each with namespaces and cgroups, on one kernel)
slide 12:
cgroup v1
cpu,cpuacct:
• cap CPU usage (hard limit), e.g. 1.5 CPUs. Docker: --cpus (1.13)
• CPU shares, e.g. 100 shares. Docker: --cpu-shares
• usage statistics (cpuacct)
memory:
• limit and kmem limit (maximum bytes). Docker: --memory --kernel-memory
• OOM control: enable/disable. Docker: --oom-kill-disable
• usage statistics
blkio (block I/O):
• weights (like shares)
• IOPS/tput caps per storage device
• statistics
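As a hedged example, the Docker flags above might be combined like this (the image, command, and values are illustrative; --cpus requires Docker 1.13+):

# cap at 1.5 CPUs, 100 shares, 1 GB memory limit, 256 MB kmem limit
docker run --cpus 1.5 --cpu-shares 100 --memory 1g --kernel-memory 256m \
    ubuntu sleep 60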
slide 13:
CPU Shares
Container's CPU limit = 100% x container's shares / total busy shares
This lets a container use other tenants' idle CPU (aka "bursting"), when available.
Container's minimum CPU limit = 100% x container's shares / total allocated shares
Can make analysis tricky. Why did perf regress? Less bursting available?
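To make the shares arithmetic concrete, a minimal shell sketch; all share counts are hypothetical:

shares=100       # this container's --cpu-shares
busy=400         # total shares of containers currently competing for CPU
allocated=1000   # total shares allocated across all containers
echo "current limit: $(( 100 * shares / busy ))% of total CPU"       # 25%
echo "minimum limit: $(( 100 * shares / allocated ))% of total CPU"  # 10%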
slide 14:
cgroup v2
• Major rewrite has been happening: cgroups v2
  • Supports nested groups, better organization and consistency
  • Some already merged, some not yet (e.g. CPU)
• See docs/talks by maintainer Tejun Heo (Facebook)
• References:
  https://www.kernel.org/doc/Documentation/cgroup-v2.txt
  https://lwn.net/Articles/679786/
slide 15:
Container OS Configuration
File systems
• Containers may be set up with aufs/overlay on top of another FS
• See "in practice" pages and their performance sections from https://docs.docker.com/engine/userguide/storagedriver/
Networking
• With Docker, can be bridge, host, or overlay networks
• Overlay networks have come with significant performance cost
slide 16:
Analysis Strategy
Performance analysis with containers:
• One kernel
• Two perspectives
• Namespaces
• cgroups
Methodologies:
• USE Method
• Workload characterization
• Checklists
• Event tracing
slide 17:
USE Method
For every resource, check:
1. Utilization
2. Saturation
3. Errors
For example, CPUs:
• Utilization: time busy
• Saturation: run queue length or latency
• Errors: ECC errors, etc.
Can be applied to hardware resources and software resources (cgroups)
(figure: resource with utilization %)
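A sketch of applying USE to one container's cgroups from the host (assumes Docker's cgroup v1 layout; $CID and the 10-second interval are illustrative):

CG=/sys/fs/cgroup/cpu,cpuacct/docker/$CID
u0=$(cat $CG/cpuacct.usage); sleep 10; u1=$(cat $CG/cpuacct.usage)
echo "utilization: $(( (u1 - u0) / 100000000 ))% of one CPU"     # ns over 10 s
awk '/^nr_throttled|^throttled_time/ {print "saturation:", $0}' $CG/cpu.stat
cat /sys/fs/cgroup/memory/docker/$CID/memory.failcnt             # memory limit failures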
slide 18:
3. Host Tools and Container Awareness
… if you have host access
slide 19:
Host Analysis Challenges
• PIDs in host don't match those seen in containers
• Symbol files aren't where tools expect them
• The kernel currently doesn't have a container ID
slide 20:
CLI Tool Disclaimer
I'll demo CLI tools: they're the lowest common denominator.
You may usually use GUIs (like we do); they source the same metrics.
slide 21:
3.1. Host Physical Resources
A refresher of basics... not container specific.
This will, however, solve many issues! Containers are often not the problem.
slide 22:
Linux Perf Tools
Where can we begin?
slide 23:
Host Perf Analysis in 60s
uptime             <- load averages
dmesg | tail       <- kernel errors
vmstat 1           <- overall stats by time
mpstat -P ALL 1    <- CPU balance
pidstat 1          <- process usage
iostat -xz 1       <- disk I/O
free -m            <- memory usage
sar -n DEV 1       <- network I/O
sar -n TCP,ETCP 1  <- TCP stats
top                <- check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
slide 24:
USE Method: Host Resources
Resource         Utilization                            Saturation                                    Errors
CPU              mpstat -P ALL 1, sum non-idle fields   vmstat 1, "r"                                 perf
Memory Capacity  free -m, "used"/"total"                vmstat 1, "si"+"so"; dmesg | grep killed      dmesg
Storage I/O      iostat -xz 1, "%util"                  iostat -xnz 1, "avgqu-sz" > 1                 /sys/…/ioerr_cnt; smartctl
Network          nicstat, "%Util"                       ifconfig, "overruns"; netstat -s "retrans…"   ifconfig, "errors"
These should be in your monitoring GUI. Can do other resources too (busses, ...)
slide 25:
Event Tracing: e.g. iosnoop
Disk I/O events with latency (from perf-tools; also in bcc/BPF as biosnoop):
# ./iosnoop
Tracing block I/O... Ctrl-C to end.
COMM       PID    TYPE  DEV    BLOCK  BYTES  LATms
supervise  …      …     202,1  …      …      …
supervise  …      …     202,1  …      …      …
tar        14794  RM    202,1  …      …      …
tar        14794  RM    202,1  …      …      …
tar        14794  RM    202,1  …      …      …
tar        14794  RM    202,1  …      …      …
tar        14794  RM    202,1  …      …      …
tar        14794  RM    202,1  …      …      …
tar        14794  RM    202,1  …      …      …
tar        14794  RM    202,1  …      …      …
slide 26:
Event Tracing: e.g. zfsslower
# /usr/share/bcc/tools/zfsslower 1
Tracing ZFS operations slower than 1 ms
TIME      COMM     PID    T  BYTES  OFF_KB  LAT(ms)  FILENAME
23:44:40  java     31386  O  0      …       8.02     solrFeatures.txt
23:44:53  java     31386  W  8190   …       36.24    solrFeatures.txt
23:44:59  java     31386  W  8192   …       20.28    solrFeatures.txt
23:44:59  java     31386  W  8191   …       28.15    solrFeatures.txt
23:45:00  java     31386  W  8192   …       32.17    solrFeatures.txt
23:45:15  java     31386  O  0      …       27.44    solrFeatures.txt
23:45:56  dockerd  …      S  0      …       1.03     .tmp-a66ce9aad…
23:46:16  java     31386  W  31     …       36.28    solrFeatures.txt
• This is from our production Titus system (Docker).
• File system latency is a better pain indicator than disk latency.
• zfsslower (and btrfs*, etc.) are in bcc/BPF. Can exonerate FS/disks.
slide 27:
Latency Histograms: e.g. btrfsdist
# ./btrfsdist
Tracing btrfs operation latency... Hit Ctrl-C to end.
operation = 'read'
    usecs         : count    distribution
    0 -> 1        : 192529  |****************************************|
    2 -> 3        : 72337   |***************                         |
    4 -> 7        : 5620    |*                                       |   <- probably
    8 -> 15       : 1026    |                                        |      cache reads
    16 -> 31      : 369     |                                        |
    32 -> 63      : 239     |                                        |
    64 -> 127     : 53      |                                        |
    128 -> 255    : 975     |                                        |   <- probably
    256 -> 511    : 524     |                                        |      cache misses
    512 -> 1023   : 128     |                                        |      (flash reads)
    1024 -> 2047  : 16      |                                        |
    2048 -> 4095  : 7       |                                        |
    4096 -> 8191  : 2       |                                        |
slide 28:
Latency Histograms: e.g. btrfsdist
[…]
operation = 'write'
    usecs         : count    distribution
    0 -> 1        : 1       |                                        |
    2 -> 3        : 276     |                                        |
    4 -> 7        : 32125   |***********                             |
    8 -> 15       : 111253  |****************************************|
    16 -> 31      : 59154   |*********************                   |
    32 -> 63      : 5463    |*                                       |
    64 -> 127     : 612     |                                        |
    128 -> 255    : 25      |                                        |
    256 -> 511    : 2       |                                        |
    512 -> 1023   : 1       |                                        |
• From a test Titus system (Docker).
• Histograms show modes, outliers. Also in bcc/BPF (with other FSes).
• Latency heat maps: http://queue.acm.org/detail.cfm?id=1809426
slide 29:
3.2. Host Containers & cgroups
Inspecting containers from the host
slide 30:
Namespaces
Worth checking namespace config before analysis:
# ./dockerpsns.sh
CONTAINER     NAME                  PID    PATH              CGROUP      IPC         MNT         NET         PID         USER        UTS
host          titusagent-mainvpc-m  1      systemd           4026531835  4026531839  4026531840  4026532533  4026531836  4026531837  4026531838
b27909cd6dd1  Titus-1435830-worker  37280  svscanboot        4026531835  4026533387  4026533385  4026532931  4026533388  4026531837  4026533386
dcf3a506de45  Titus-1392192-worker  27992  /apps/spaas/spaa  4026531835  4026533354  4026533352  4026532991  4026533355  4026531837  4026533353
370a3f041f36  Titus-1243558-worker  98602  /apps/spaas/spaa  4026531835  4026533290  4026533288  4026533223  4026533291  4026531837  4026533289
af7549c76d9a  Titus-1243553-worker  97972  /apps/spaas/spaa  4026531835  4026533216  4026533214  4026533149  4026533217  4026531837  4026533215
dc27769a9b9c  Titus-1243546-worker  97356  /apps/spaas/spaa  4026531835  4026533142  4026533140  4026533075  4026533143  4026531837  4026533141
e18bd6189dcd  Titus-1243517-worker  96733  /apps/spaas/spaa  4026531835  4026533068  4026533066  4026533001  4026533069  4026531837  4026533067
ab45227dcea9  Titus-1243516-worker  96173  /apps/spaas/spaa  4026531835  4026532920  4026532918  4026532830  4026532921  4026531837  4026532919
• A proof-of-concept "docker ps --namespaces" tool. NS shared with root shown in red.
• https://github.com/docker/docker/issues/32501
slide 31:
systemd-cgtop A "top" for cgroups: # systemd-cgtop Control Group /docker /docker/dcf3a...9d28fc4a1c72bbaff4a24834 /docker/370a3...e64ca01198f1e843ade7ce21 /system.slice /system.slice/daemontools.service /docker/dc277...42ab0603bbda2ac8af67996b /user.slice /user.slice/user-0.slice /user.slice/u....slice/session-c26.scope /docker/ab452...c946f8447f2a4184f3ccff2a /docker/e18bd...26ffdd7368b870aa3d1deb7a [...] Tasks %CPU Memory 45.9G 42.1G 24.0G 3.0G 4.1G 2.8G 2.3G 34.5M 15.7M 13.3M 6.3G 2.9G Input/s Output/sslide 32:
docker stats A "top" for containers. Resource utilization. Workload characterization. # docker stats CONTAINER CPU % 353426a09db1 526.81% 6bf166a66e08 303.82% 58dcf8aed0a7 41.01% 61061566ffe5 85.92% bdc721460293 2.69% 6c80ed61ae63 477.45% 337292fb5b64 89.05% b652ede9a605 173.50% d7cd2599291f 504.28% 05bf9f3e0d13 314.46% 09082f005755 142.04% bd45a3e1ce16 190.26% [...] MEM USAGE / LIMIT 4.061 GiB / 8.5 GiB 3.448 GiB / 8.5 GiB 1.322 GiB / 2.5 GiB 220.9 MiB / 3.023 GiB 1.204 GiB / 3.906 GiB 557.7 MiB / 8 GiB 766.2 MiB / 8 GiB 689.2 MiB / 8 GiB 673.2 MiB / 8 GiB 711.6 MiB / 8 GiB 693.9 MiB / 8 GiB 538.3 MiB / 8 GiB MEM % 47.78% 40.57% 52.89% 7.14% 30.82% 6.81% 9.35% 8.41% 8.22% 8.69% 8.47% 6.57% NET I/O 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B 0 B / 0 B BLOCK I/O 2.818 MB / 0 B 2.032 MB / 0 B 0 B / 0 B 43.4 MB / 0 B 4.35 MB / 0 B 9.257 MB / 0 B 5.493 MB / 0 B 6.48 MB / 0 B 12.58 MB / 0 B 7.942 MB / 0 B 8.081 MB / 0 B 10.6 MB / 0 B Loris Degioanni demoed a similar sysdigcloud view yesterday (needs the sysdig kernel agent) PIDSslide 33:
top
In the host, top shows all processes. Currently doesn't show a container ID.
# top - 22:46:53 up 36 days, 59 min, 1 user, load average: 5.77, 5.61, 5.63
Tasks: 1067 total, 1 running, 1046 sleeping, 0 stopped, 20 zombie
%Cpu(s): 34.8 us, 1.8 sy, 0.0 ni, 61.3 id, 0.0 wa, 0.0 hi, 1.9 si, 0.1 st
KiB Mem : 65958552 total, 12418448 free, 49247988 used, 4292116 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13101316 avail Mem
  PID USER      VIRT     RES     SHR    S  %CPU   %MEM  TIME+      COMMAND
28321 root      33.126g  0.023t  37564  S  621.1  38.2  35184:09   java
97712 root      11.445g  2.333g  37084  S    3.1   3.7  404:27.90  java
98306 root      12.149g  3.060g  36996  S    2.0   4.9  194:21.10  java
96511 root      15.567g  6.313g  37112  S    1.7  10.0  168:07.44  java
 5283 root      1643676  100092  94184  S    1.0   0.2  401:36.16  mesos-slave
 2079 root      …        …          12  S    0.7   0.0  220:07.75  rngd
 5272 titusag+  10.473g  1.611g  23488  S    0.7   2.6  1934:44    java
[…]
… remember, there is no container ID in the kernel yet.
slide 34:
htop
htop can add a CGROUP field, but can truncate important info:
CGROUP          PID USER      PRI NI VIRT  RES   SHR   S CPU% MEM% TIME+    Command
:pids:/docker/  28321 root     0     33.1G 24.0G 37564 S 524. 38.2 672h     /apps/java
:pids:/docker/  9982 root      0     33.1G 24.0G 37564 S 44.4 38.2 17h00:41 /apps/java
:pids:/docker/  9985 root      0     33.1G 24.0G 37564 R 41.9 38.2 16h44:51 /apps/java
:pids:/docker/  9979 root      0     33.1G 24.0G 37564 S 41.2 38.2 17h01:35 /apps/java
:pids:/docker/  9980 root      0     33.1G 24.0G 37564 S 39.3 38.2 16h59:17 /apps/java
:pids:/docker/  9981 root      0     33.1G 24.0G 37564 S 39.3 38.2 17h01:32 /apps/java
:pids:/docker/  9984 root      0     33.1G 24.0G 37564 S 37.3 38.2 16h49:03 /apps/java
:pids:/docker/  9983 root      0     33.1G 24.0G 37564 R 35.4 38.2 16h54:31 /apps/java
:pids:/docker/  9986 root      0     33.1G 24.0G 37564 S 35.4 38.2 17h05:30 /apps/java
:name=systemd:/user.slice/user-0.slice/session-c31.scope? 74066 root 0 27620
:pids:/docker/  9998 root      0     33.1G 24.0G 37564 R 28.3 38.2 11h38:03 /apps/java
:pids:/docker/  10001 root     0     33.1G 24.0G 37564 S 27.7 38.2 11h38:59 /apps/java
:name=systemd:/system.slice/daemontools.service? 5272 titusagen 20 0 10.5G 1650M 23
:pids:/docker/  10002 root     0     33.1G 24.0G 37564 S 25.1 38.2 11h40:37 /apps/java
Can fix, but that would be Docker + cgroup-v1 specific. Still need a kernel CID.
slide 35:
Host PID -> Container ID
… who does that (CPU busy) PID 28321 belong to?
# grep 28321 /sys/fs/cgroup/cpu,cpuacct/docker/*/tasks | cut -d/ -f7
dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
• Only works for Docker, and that cgroup v1 layout. Some Linux commands:
# ls -l /proc/27992/ns/*
lrwxrwxrwx 1 root root 0 Apr 13 20:49 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Apr 13 20:49 ipc -> ipc:[4026533354]
lrwxrwxrwx 1 root root 0 Apr 13 20:49 mnt -> mnt:[4026533352]
[…]
# cat /proc/27992/cgroup
11:freezer:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
10:blkio:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
9:perf_event:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
[…]
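A short sketch automating this lookup and mapping the ID to a container name (Docker and the cgroup v1 layout assumed; the docker inspect step is an addition, not from the slides):

PID=28321
CID=$(awk -F/ '/cpuacct/ {print $3; exit}' /proc/$PID/cgroup)
docker inspect --format '{{.Name}}' "$CID"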
slide 36:
nsenter Wrapping
… what hostname is PID 28321 running on?
# nsenter -t 28321 -u hostname
titus-1392192-worker-14-16
• Can namespace-enter: -m: mount, -u: uts, -i: ipc, -n: net, -p: pid, -U: user
• Bypasses cgroup limits, and seccomp profile (allowing syscalls)
• For Docker, you can enter the container more completely with: docker exec -it CID command
• Handy nsenter one-liners:
  nsenter -t PID -u hostname      <- container hostname
  nsenter -t PID -n netstat -i    <- container netstat
  nsenter -t PID -m -p df -h      <- container file system usage
  nsenter -t PID -p top           <- container top
slide 37:
nsenter: Host -> Container top
… given PID 28321, running top for its container by entering its namespaces:
# nsenter -t 28321 -m -p top
top - 18:16:13 up 36 days, 20:28, 0 users, load average: 5.66, 5.29, 5.28
Tasks:   6 total,   1 running,   5 sleeping,   0 stopped,   0 zombie
%Cpu(s): 30.5 us, 1.7 sy, 0.0 ni, 65.9 id, 0.0 wa, 0.0 hi, 1.8 si, 0.1 st
KiB Mem:  65958552 total, 54664124 used, 11294428 free,   164232 buffers
KiB Swap:        0 total,        0 used,        0 free.  1592372 cached Mem
  PID USER  VIRT     RES     SHR    S  %CPU   %MEM  TIME+     COMMAND
  301 root  33.127g  0.023t  37564  S  537.3  38.2  40269:41  java
    1 root  …        …        1812  S    0.0   0.0  4:15.11   bash
87888 root  …        …        1348  R    0.0   0.0  0:00.00   top
Note that it is PID 301 in the container. Can also see this using:
# grep NSpid /proc/28321/status
NSpid:  28321   301
slide 38:
perf: CPU Profiling
Can run system-wide (-a), match a pid (-p), or cgroup (-G, if it works):
# perf record -F 49 -a -g -- sleep 30
# perf script
Failed to open /lib/x86_64-linux-gnu/libc-2.19.so, continuing without symbols
Failed to open /tmp/perf-28321.map, continuing without symbols
• Current symbol translation gotchas (up to 4.10-ish):
  • perf can't find /tmp/perf-PID.map files in the host, and the PID is different
  • perf can't find container binaries under host paths (what /usr/bin/java?)
• Can copy files to the host, map PIDs, then run perf script/report:
  http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
  http://batey.info/docker-jvm-flamegraphs.html
• Can nsenter (-m -u -i -n -p) a "power" shell, and then run "perf -p PID"
• perf should be fixed to be namespace aware (like bcc was, PR#1051)
slide 39:
CPU Flame Graphs
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 49 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
• See previous slide for getting perf symbols to work
• From the host, can study all containers, as well as container overheads
(flame graph annotations: kernel TCP/IP stack; Java, missing stacks (need -XX:+PreserveFramePointer); look in areas like cgroup throttles, FS layers, and networking to find and quantify overhead; it's likely small and hard to find)
slide 40:
/sys/fs/cgroups (raw)
The best source for per-cgroup metrics. e.g. CPU:
# cd /sys/fs/cgroup/cpu,cpuacct/docker/02a7cf65f82e3f3e75283944caa4462e82f8f6ff5a7c9a...
# ls
cgroup.clone_children  cpuacct.usage_all          cpuacct.usage_sys   cpu.shares
cgroup.procs           cpuacct.usage_percpu       cpuacct.usage_user  cpu.stat
cpuacct.stat           cpuacct.usage_percpu_sys   cpu.cfs_period_us   notify_on_release
cpuacct.usage          cpuacct.usage_percpu_user  cpu.cfs_quota_us    tasks
# cat cpuacct.usage
…
# cat cpu.stat
nr_periods 507
nr_throttled 74
throttled_time 3816445175    <- total time throttled (nanoseconds); a saturation metric
average throttle time = throttled_time / nr_throttled
https://www.kernel.org/doc/Documentation/cgroup-v1/, ../scheduler/sched-bwc.txt
https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
Note: grep cgroup /proc/mounts to check where these are mounted
These metrics should be included in performance monitoring GUIs
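For example, a sketch computing that average from cpu.stat (with the values above: 3816445175 / 74 is about 52 ms; $CID is a placeholder for the full container ID):

awk '/^nr_throttled/   { n = $2 }
     /^throttled_time/ { t = $2 }
     END { if (n) printf "avg throttle: %.2f ms (%d times)\n", t/n/1e6, n }' \
    /sys/fs/cgroup/cpu,cpuacct/docker/$CID/cpu.stat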
slide 41:
Netflix Atlas
Cloud-wide monitoring of containers (and instances)
Fetches cgroup metrics via Intel snap
https://github.com/netflix/Atlas
slide 42:
Netflix Vector
Our per-instance analyzer
Has per-container metrics
https://github.com/Netflix/vector
slide 43:
Intel snap
A metric collector used by monitoring GUIs
https://github.com/intelsdi-x/snap
Has a Docker plugin to read cgroup stats
There's also a collectd plugin: https://github.com/bobrik/collectd-docker
slide 44:
3.3. Let's Play a Game
Host or container? (or neither?)
slide 45:
Game Scenario 1
Container user claims they have a CPU performance issue:
• Container has a CPU cap and CPU shares configured
• There is idle CPU on the host
• Other tenants are CPU busy
• /sys/fs/cgroup/.../cpu.stat -> throttled_time is increasing
• /proc/PID/status nonvoluntary_ctxt_switches is increasing
• Container CPU usage equals its cap (clue: this is not really a clue)
slide 46:
Game Scenario 2
Container user claims they have a CPU performance issue:
• Container has a CPU cap and CPU shares configured
• There is no idle CPU on the host
• Other tenants are CPU busy
• /sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
• /proc/PID/status nonvoluntary_ctxt_switches is increasing
slide 47:
Game Scenario 3
Container user claims they have a CPU performance issue:
• Container has CPU shares configured
• There is no idle CPU on the host
• Other tenants are CPU busy
• /sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
• /proc/PID/status nonvoluntary_ctxt_switches is not increasing much
Experiments to confirm conclusion?
slide 48:
Methodology: Reverse Diagnosis
Enumerate possible outcomes, and work backwards to the metrics needed for diagnosis.
e.g. CPU performance outcomes:
A. physical CPU throttled
B. cap throttled
C. shares throttled (assumes physical CPU limited as well)
D. not throttled
Game answers: 1. B, 2. C, 3. D
slide 49:
CPU Bottleneck Identification
(decision tree)
throttled_time increasing?
  yes -> cap throttled
  no  -> nonvol…switches increasing?
    no  -> not throttled
    yes -> host has idle CPU?
      yes -> not throttled (but dig further)
      no  -> all other tenants idle?
        yes -> physical CPU throttled
        no  -> share throttled
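A bash sketch of this decision tree for one container PID; the cgroup v1 layout, the 5-second interval, and the context-switch threshold are all assumptions:

#!/bin/bash
# usage: ./cputhrottle.sh PID  (sketch; checks cap throttling, then nonvol switches)
PID=$1
CG=$(awk -F: '/cpuacct/ {print $3; exit}' /proc/$PID/cgroup)
STAT="/sys/fs/cgroup/cpu,cpuacct${CG}/cpu.stat"
t0=$(awk '/^throttled_time/ {print $2}' "$STAT")
n0=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/$PID/status)
sleep 5
t1=$(awk '/^throttled_time/ {print $2}' "$STAT")
n1=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/$PID/status)
if   (( t1 > t0 ));       then echo "cap throttled"
elif (( n1 - n0 < 100 )); then echo "not throttled"   # threshold is arbitrary
else echo "nonvol switches rising: check host idle CPU and other tenants" \
          "(share vs physical CPU throttled)"
fi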
slide 50:
4. Guest Tools and Container Awareness
… if you only have guest access
slide 51:
Guest Analysis Challenges
• Some resource metrics are for the container, some for the host. Confusing!
• May lack system capabilities or syscalls to run profilers and tracers
slide 52:
CPU
Can see host's CPU devices, but only container (pid namespace) processes:
container# uptime
 20:17:19 up 45 days, 21:21,  0 users,  load average: 5.08, 3.69, 2.22      <- load!
container# mpstat 1
Linux 4.9.0 (02a7cf65f82e)  04/14/17  _x86_64_  (8 CPU)
20:17:26  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
20:17:27  all  …     …      …     …        …     …      …       …       …       …    <- busy CPUs
20:17:28  all  …     …      …     …        …     …      …       …       …       …
Average:  all  …     …      …     …        …     …      …       …       …       …
container# pidstat 1
Linux 4.9.0 (02a7cf65f82e)  04/14/17  _x86_64_  (8 CPU)
20:17:33  UID  PID  %usr  %system  %guest  %CPU  CPU  Command
20:17:34  UID  PID  %usr  %system  %guest  %CPU  CPU  Command
20:17:35  UID  PID  %usr  %system  %guest  %CPU  CPU  Command
[...]
… but this container is running nothing (we saw CPU usage from neighbors)
slide 53:
Memory
Can see host's memory:
container# free -m
        total  used  free  shared  buff/cache  available
Mem:    …      …     …     …       …           …        <- host memory (this container is --memory=1g)
Swap:   …      …     …
container# perl -e '$a = "A" x 1_000_000_000'           <- tries to consume ~2 Gbytes
Killed
slide 54:
Disks
Can see host's disk devices:
container# iostat -xz 1
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          …      …      …        …        …       …
Device:  rrqm/s  wrqm/s  r/s     w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
xvdap1   …
xvdb     0.00    …       200.00  0.00  3080.00  …      …         …         …      …        …        …      …    <- host disk I/O
xvdc     0.00    …       185.00  0.00  2840.00  …      …         …         …      …        …        …      …
md0      0.00    …       385.00  0.00  5920.00  …      …         …         …      …        …        …      …
container# pidstat -d 1
Linux 4.9.0 (02a7cf65f82e)  04/18/17  _x86_64_  (8 CPU)
22:41:13  UID  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  iodelay  Command
22:41:14  UID  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  iodelay  Command
22:41:15  UID  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  iodelay  Command
[...]
… but no container I/O
slide 55:
Network
Can't see host's network interfaces (network namespace):
container# sar -n DEV,TCP 1
Linux 4.9.0 (02a7cf65f82e)  04/14/17  _x86_64_  (8 CPU)
21:45:07  IFACE  rxpck/s  txpck/s  rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
21:45:08  eth0   …        …        …       …       …        …        …         …
21:45:08  active/s  passive/s  iseg/s  oseg/s
          …         …          …       …
[...]
… host has heavy network I/O; the container sees itself (idle)
slide 56:
Metrics Namespace
This confuses apps too: trying to bind on all CPUs, or using 25% of memory
• Including the JDK, which is unaware of container limits (covered yesterday by Fabiane Nardon)
We could add a "metrics" namespace so the container only sees itself
• Or enhance existing namespaces to do this
If you add a metrics namespace, please consider adding an option for:
• /proc/host/stats: maps to host's /proc/stat, for CPU stats
• /proc/host/diskstats: maps to host's /proc/diskstats, for disk stats
As those host metrics can be useful, to identify/exonerate neighbor issues
slide 57:
perf: CPU Profiling
Needs capabilities to run from a container:
container# ./perf record -F 99 -a -g -- sleep 10
perf_event_open(..., PERF_FLAG_FD_CLOEXEC) failed with unexpected error 1 (Operation not permitted)
perf_event_open(..., 0) failed unexpectedly with error 1 (Operation not permitted)
Error: You may not have permission to collect system-wide stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid, which controls use of the performance events system by unprivileged users (without CAP_SYS_ADMIN).    <- helpful message
The current value is 2:
 -1: Allow use of (almost) all events by all users
>= 0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
>= 1: Disallow CPU event access by users without CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN
Although, after setting perf_event_paranoid to -1, it prints the same error...
slide 58:
perf & Container Debugging
Debugging using strace from the host (as ptrace() is also blocked):
host# strace -fp 26450      <- bash PID, from which I then ran perf
[...]
[pid 27426] perf_event_open(0x2bfe498, -1, 0, -1, 0) = -1 EPERM (Operation not permitted)
[pid 27426] perf_event_open(0x2bfe498, -1, 0, -1, 0) = -1 EPERM (Operation not permitted)
[pid 27426] perf_event_open(0x2bfc1a8, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 EPERM (Operation not permitted)
Many different ways to debug this.
https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile
slide 59:
perf, cont.
• Can enable perf_event_open() with: docker run --cap-add sys_admin
  • Also need (for kernel symbols): echo 0 > /proc/sys/kernel/kptr_restrict
• perf then "works", and you can make flame graphs. But it sees all CPUs!?
  • perf needs to be "container aware", and only see the container's tasks.
    Patch pending: https://lkml.org/lkml/2017/1/12/308
• Currently easier to run perf from the host (or a secure "monitoring" container)
  • Via a secure monitoring agent, e.g. Netflix Vector -> CPU Flame Graph
  • See earlier slides for steps
slide 60:
5. Tracing
Advanced analysis … a few more examples (iosnoop, zfsslower, and btrfsdist shown earlier)
slide 61:
Built-in Linux Tracers
• ftrace (2008+)
• perf_events (2009+)
• eBPF (2014+)
Some front-ends:
• ftrace: https://github.com/brendangregg/perf-tools
• perf_events: used for CPU flame graphs
• eBPF (aka BPF): https://github.com/iovisor/bcc (Linux 4.4+)
slide 62:
ftrace: Overlay FS Function Calls
Using ftrace via my perf-tools to count function calls in-kernel context:
# funccount '*ovl*'
Tracing "*ovl*"... Ctrl-C to end.
FUNC                COUNT
ovl_cache_free      …
ovl_xattr_get       …
[...]
ovl_fill_merge      …
ovl_path_real       …
ovl_path_upper      …
ovl_update_time     …
ovl_permission      …
ovl_d_real          …
ovl_override_creds  …
Ending tracing...
Each can be a target for further study with kprobes
slide 63:
ftrace: Overlay FS Function Tracing
Using kprobe (perf-tools) to trace ovl_fill_merge() args and stack trace:
# kprobe -s 'p:ovl_fill_merge ctx=%di name=+0(%si):string'
Tracing kprobe ovl_fill_merge. Ctrl-C to end.
bash-16633 [000] d... 14390771.218973: ovl_fill_merge: (ovl_fill_merge+0x0/0x1f0 [overlay]) ctx=0xffffc90042477db0 name="iostat"
bash-16633 [000] d... 14390771.218981:
 => ovl_fill_merge
 => ext4_readdir
 => iterate_dir
 => ovl_dir_read_merged
 => ovl_iterate
 => iterate_dir
 => SyS_getdents
 => do_syscall_64
 => return_from_SYSCALL_64
[…]
Good for debugging, although dumping all events can cost too much overhead. ftrace has some solutions to this, BPF has more…
slide 64:
Enhanced BPF Tracing Internals
(diagram: an observability program loads BPF bytecode into the kernel via a verifier; the BPF program attaches to events: tracepoints (static tracing), kprobes/uprobes (dynamic tracing), and sampling/PMCs via perf_events; it emits per-event data and statistics to user space via output buffers, async copy, and maps)
slide 65:
BPF: Scheduler Latency 1
host# runqlat -p 20228 10 1
Tracing run queue latency... Hit Ctrl-C to end.
    usecs         : count    distribution
    0 -> 1        : 0       |                                        |
    2 -> 3        : 4       |                                        |
    4 -> 7        : 368     |****************************************|
    8 -> 15       : 151     |****************                        |
    16 -> 31      : 22      |**                                      |
    32 -> 63      : 14      |**                                      |
    64 -> 127     : 19      |**                                      |
    128 -> 255    : 0       |                                        |
    256 -> 511    : 2       |                                        |
    512 -> 1023   : 1       |                                        |
This is an app in a Docker container on a system with idle CPU.
Tracing scheduler events can be costly (high rate), but this BPF program reduces cost by using in-kernel maps to summarize data, and only emits the "count" column to user space.
slide 66:
BPF: Scheduler Latency 2
host# runqlat -p 20228 10 1
Tracing run queue latency... Hit Ctrl-C to end.
    usecs            : count    distribution
    0 -> 1           : 0       |                                        |
    2 -> 3           : 0       |                                        |
    4 -> 7           : 7       |**                                      |
    8 -> 15          : 14      |*****                                   |
    16 -> 31         : 0       |                                        |
    32 -> 63         : 0       |                                        |
    64 -> 127        : 0       |                                        |
    128 -> 255       : 0       |                                        |
    256 -> 511       : 0       |                                        |
    512 -> 1023      : 0       |                                        |
    1024 -> 2047     : 0       |                                        |
    2048 -> 4095     : 5       |**                                      |
    4096 -> 8191     : 6       |**                                      |
    8192 -> 16383    : 28      |***********                             |
    16384 -> 32767   : 59      |***********************                 |
    32768 -> 65535   : 99      |****************************************|
    65536 -> 131071  : 6       |**                                      |
    131072 -> 262143 : 2       |                                        |
    262144 -> 524287 : 1       |                                        |
Now other tenants are using more CPU, and this PID is throttled via CPU shares: 8 - 65 ms delays.
slide 67:
BPF: Scheduler Latency 3
host# runqlat --pidnss -m
Tracing run queue latency... Hit Ctrl-C to end.
pidns = 4026532870
    msecs       : count    distribution
    0 -> 1      : 264     |****************************************|
    2 -> 3      : 0       |                                        |
    4 -> 7      : 0       |                                        |
    8 -> 15     : 0       |                                        |
    16 -> 31    : 0       |                                        |
    32 -> 63    : 0       |                                        |
    64 -> 127   : 2       |                                        |
[…]
pidns = 4026532382
    msecs       : count    distribution
    0 -> 1      : 646     |****************************************|
    2 -> 3      : 18      |*                                       |
    4 -> 7      : 48      |**                                      |
    8 -> 15     : 17      |*                                       |
    16 -> 31    : 150     |*********                               |
    32 -> 63    : 134     |********                                |
Per-PID-namespace histograms (I added this yesterday).
slide 68:
BPF: Namespace-ing Tools
Walking from the task_struct to the PID namespace ID:
task_struct->nsproxy->pid_ns_for_children->ns.inum
This is unstable, and could break between kernel versions. If it becomes a problem, we'll add a bpf_get_current_pidns()
Does need a *task, or bpf_get_current_task() (added in 4.8)
Can also pull out cgroups, but that gets trickier…
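From the shell, the same namespace inum can be cross-checked without BPF (the PID is illustrative; the example value is from the previous slide):

PID=20228
readlink /proc/$PID/ns/pid    # prints e.g. pid:[4026532870]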
slide 69:
bcc (BPF) Perf Tools
slide 70:
Docker Analysis & Debugging
If needed, dockerd can also be analyzed using:
• go execution tracer
• GODEBUG with gctrace and schedtrace
• gdb and Go runtime support
• perf profiling
• bcc/BPF and uprobes
Each has pros/cons. bcc/BPF can trace user & kernel events.
slide 71:
BPF: dockerd Go Function Counting
Counting dockerd Go calls in-kernel using BPF that match "*docker*get*":
# funccount '/usr/bin/dockerd:*docker*get*'
Tracing 463 functions for "/usr/bin/dockerd:*docker*get*"... Hit Ctrl-C to end.
FUNC                                                                                        COUNT
github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage                        …
github.com/docker/docker/daemon.(*Daemon).getNetworkSandboxID                              …
github.com/docker/docker/daemon.(*Daemon).getNetworkStats                                  …
github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage.func1                  …
github.com/docker/docker/pkg/ioutils.getBuffer                                             …
github.com/docker/docker/vendor/golang.org/x/net/trace.getBucket                           …
github.com/docker/docker/vendor/golang.org/x/net/trace.getFamily                           …
github.com/docker/docker/vendor/google.golang.org/grpc.(*ClientConn).getTransport          …
github.com/docker/docker/vendor/github.com/golang/protobuf/proto.getbase                   …
github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*http2Client).getStream  …
Detaching...
# objdump -tTj .text /usr/bin/dockerd | wc -l
35,859 functions can be traced! Uses uprobes, and needs newer kernels.
Warning: will cost overhead at high function rates.
slide 72:
BPF: dockerd Go Stack Tracing
Counting stack traces that led to this ioutils.getBuffer() call:
# stackcount 'p:/usr/bin/dockerd:*/ioutils.getBuffer'
Tracing 1 functions for "p:/usr/bin/dockerd:*/ioutils.getBuffer"... Hit Ctrl-C to end.
  github.com/docker/docker/pkg/ioutils.getBuffer
  github.com/docker/docker/pkg/broadcaster.(*Unbuffered).Write
  bufio.(*Reader).writeBuf
  bufio.(*Reader).WriteTo
  io.copyBuffer
  io.Copy
  github.com/docker/docker/pkg/pools.Copy
  github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1
  runtime.goexit
    dockerd [18176]
    110            <- means this stack was seen 110 times
Detaching...
Can also trace function arguments, and latency (with some work)
http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
slide 73:
Summary
Identify bottlenecks:
1. In the host vs container, using system metrics
2. In application code on containers, using CPU flame graphs
3. Deeper in the kernel, using tracing tools
slide 74:
References
http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
http://techblog.netflix.com/2016/07/distributed-resource-scheduling-with.html
https://www.slideshare.net/aspyker/netflix-and-containers-titus
https://docs.docker.com/engine/admin/runmetrics/#tips-for-high-performance-metric-collection
https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
https://www.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon
https://www.youtube.com/watch?v=sK5i-N34im8 (cgroups, namespaces, and beyond)
https://jvns.ca/blog/2016/10/10/what-even-is-a-container/
https://blog.jessfraz.com/post/containers-zones-jails-vms/
http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
http://www.brendangregg.com/USEmethod/use-linux.html (full USE method list)
http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
http://queue.acm.org/detail.cfm?id=1809426 (latency heat maps)
https://github.com/brendangregg/perf-tools (ftrace tools)
https://github.com/iovisor/bcc (BPF tools)
slide 75:
Thank You!
http://techblog.netflix.com
http://slideshare.net/brendangregg
http://www.brendangregg.com
bgregg@netflix.com
@brendangregg
Titus team: @aspyker @anwleung @fabiokung @tomaszbak1974 @amit_joshee @sargun @corindwyer …
#dockercon