USENIX LISA 2017: Container Performance Analysis

Talk by Brendan Gregg for USENIX LISA17

Video: https://www.youtube.com/watch?v=NYLXZ58EboM

Description: "Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using cgroups. A reverse diagnosis methodology can be applied to identify whether a container is resource constrained, and by which hard or soft resource. The interaction between the host and containers can also be examined, and noisy neighbors identified or exonerated. Performance tooling can need special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and name spaces. At Netflix, we're using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show you how to identify bottlenecks in the host or container configuration, in the applications by profiling in a container environment, and how to dig deeper into kernel and container internals."


PDF: LISA2017_Container_Performance_Analysis.pdf

Keywords (from pdftotext):

slide 1:
    Container Performance Analysis
    Brendan Gregg
    bgregg@netflix.com
    October 29–November 3, 2017 | San Francisco, CA
    www.usenix.org/lisa17
    #lisa17
    
slide 2:
    Take Aways
    Identify bottlenecks:
    1. In the host vs container, using system metrics
    2. In application code on containers, using CPU flame graphs
    3. Deeper in the kernel, using tracing tools
    Focus of this talk is how containers work in Linux (will demo on Linux 4.9)
    
slide 3:
slide 4:
    Containers at Netflix: summary slides from the Titus team.
    1. TITUS
    
slide 5:
    Titus
    • Cloud runtime platform for container jobs
    • Scheduling
      – Service & batch job management
      – Advanced resource management across elastic shared resource pool
    • Container Execution
      – Docker and AWS EC2 Integration
        • Adds VPC, security groups, EC2 metadata, IAM roles, S3 logs, …
      – Integration with Netflix infrastructure
    (Architecture diagram: Job Management (Service, Batch), Resource Management & Optimization, Container Execution, Integration)
    • In depth: http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
    
slide 6:
    Current Titus Scale
    • Used for ad hoc reporting, media encoding, stream processing, …
    • Over 2,500 instances (Mostly m4.16xls & r3.8xls) across three regions
    • Over a week period launched over 1,000,000 containers
    
slide 7:
    Container Performance @Netflix
    • Ability to scale and balance workloads with EC2 and Titus
    • Performance needs:
      – Application analysis: using CPU flame graphs with containers
      – Host tuning: file system, networking, sysctl's, …
      – Container analysis and tuning: cgroups, GPUs, …
      – Capacity planning: reduce over provisioning
    
slide 8:
    2. CONTAINER BACKGROUND
    And Strategy
    
slide 9:
    Namespaces: Restricting Visibility
    Current Namespaces:
    • cgroup
    • ipc
    • mnt
    • net
    • pid
    • user
    • uts
    (Diagram: PID namespaces. The host sees all PIDs, e.g. PID 1; PID namespace 1 has its own numbering, e.g. 1 (1238) and 2 (1241), mapped by the kernel.)
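    A quick host-side check of which namespaces a process belongs to (a minimal sketch added for illustration; lsns is from util-linux, and TARGET_PID stands for whatever process you are inspecting):
    host# ls -l /proc/self/ns/                                 # namespace links for the current shell
    host# readlink /proc/1/ns/pid /proc/$TARGET_PID/ns/pid     # different inode numbers = different PID namespaces
    host# lsns -t pid                                          # one line per PID namespace, with an owning process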
    
slide 10:
    Control Groups: Restricting Usage
    Current cgroups:
    • blkio
    • cpu,cpuacct
    • cpuset
    • devices
    • hugetlb
    • memory
    • net_cls,net_prio
    • pids
    • …
    (Diagram: CPU cgroups. Each container's tasks are placed in a cpu cgroup, e.g. cpu cgroup 1, which is scheduled onto the CPUs.)
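    The controllers available on a host, and a process's cgroup membership, can be read from /proc and /sys (a minimal sketch; paths assume the cgroup v1 layout used later in this talk):
    host# cat /proc/cgroups            # controllers known to this kernel, and cgroup counts
    host# cat /proc/self/cgroup        # which cgroup (per controller) this shell belongs to
    host# ls /sys/fs/cgroup/           # where each controller hierarchy is mounted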
    
slide 11:
    Linux Containers
    Container = combinaNon of namespaces & cgroups
    Host
    Container 1
    Container 2
    Container 3
    (namespaces)
    (namespaces)
    (namespaces)
    cgroups
    cgroups
    cgroups
    Kernel
    
slide 12:
    cgroup v1
    cpu,cpuacct:
      cap CPU usage (hard limit). e.g. 1.5 CPUs.       Docker: --cpus (1.13)
      CPU shares. e.g. 100 shares.                     Docker: --cpu-shares
      usage statistics (cpuacct)
    memory:
      limit and kmem limit (maximum bytes)             Docker: --memory --kernel-memory
      OOM control: enable/disable                      Docker: --oom-kill-disable
      usage statistics
    blkio (block I/O):
      weights (like shares)
      IOPS/tput caps per storage device
      statistics
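    As an example of how these limits are set, a hedged docker run sketch (image name and values are illustrative, not from the talk; the inspect field names are an assumption about Docker's HostConfig output):
    host# docker run -d --cpus 1.5 --cpu-shares 100 --memory 1g --kernel-memory 128m ubuntu sleep 3600
    host# docker inspect --format '{{.HostConfig.NanoCpus}} {{.HostConfig.CpuShares}} {{.HostConfig.Memory}}' <containerID>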
    
slide 13:
    CPU Shares
    Container's CPU limit = 100% x (container's shares / total busy shares)
    This lets a container use other tenants' idle CPU (aka "bursting"), when available.
    Container's minimum CPU limit = 100% x (container's shares / total allocated shares)
    Can make analysis tricky. Why did perf regress? Less bursting available?
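    A worked example with assumed numbers (mine, not from the talk): a container with 200 shares, where 1000 shares are allocated host-wide but only 400 shares' worth of containers are currently busy, is guaranteed 100% x 200/1000 = 20% of the host, yet can burst to 100% x 200/400 = 50% while neighbors are idle. When those neighbors become busy it falls back toward 20%, which can look like a regression even though nothing inside the container changed.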
    
slide 14:
    cgroup v2
    • Major rewrite has been happening: cgroups v2
      – Supports nested groups, better organization and consistency
      – Some already merged, some not yet (e.g. CPU)
    • See docs/talks by maintainer Tejun Heo (Facebook)
    • References:
      – https://www.kernel.org/doc/Documentation/cgroup-v2.txt
      – https://lwn.net/Articles/679786/
    
slide 15:
    Container OS Configuration
    File systems
      Containers may be set up with aufs/overlay on top of another FS
      See "in practice" pages and their performance sections from
      https://docs.docker.com/engine/userguide/storagedriver/
    Networking
      With Docker, can be bridge, host, or overlay networks
      Overlay networks have come with significant performance cost
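    A quick way to confirm what a given host is actually using (a sketch; these are standard Docker and Linux commands, and the example output is illustrative):
    host# docker info | grep -i 'storage driver'     # e.g. "Storage Driver: overlay2"
    host# docker network ls                          # bridge / host / overlay networks defined
    host# grep overlay /proc/mounts                  # overlay mounts and their backing options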
    
slide 16:
    Analysis Strategy
    Performance analysis with containers:
    • One kernel
    • Two perspectives
    • Namespaces
    • cgroups
    Methodologies:
    • USE Method
    • Workload characterization
    • Checklists
    • Event tracing
    
slide 17:
    USE Method
    For every resource, check:
    1. Utilization
    2. Saturation
    3. Errors
    (Diagram: resource utilization (%))
    For example, CPUs:
    • Utilization: time busy
    • Saturation: run queue length or latency
    • Errors: ECC errors, etc.
    Can be applied to hardware resources and software resources (cgroups)
    
slide 18:
    3. HOST TOOLS
    And Container Awareness
    
slide 19:
    Host Analysis Challenges
    • PIDs in host don't match those seen in containers
    • Symbol files aren't where tools expect them
    • The kernel currently doesn't have a container ID
    
slide 20:
    3.1. Host Physical Resources
    A refresher of basics... Not container specific.
    This will, however, solve many issues!
    Containers are often not the problem.
    I will demo CLI tools. GUIs source the same metrics.
    
slide 21:
    (Diagram: Linux Perf Tools)
    Where can we begin?
    
slide 22:
    Host Perf Analysis in 60s
    uptime                 load averages
    dmesg | tail           kernel errors
    vmstat 1               overall stats by time
    mpstat -P ALL 1        CPU balance
    pidstat 1              process usage
    iostat -xz 1           disk I/O
    free -m                memory usage
    sar -n DEV 1           network I/O
    sar -n TCP,ETCP 1      TCP stats
    top                    check overview
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
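    The same checklist can be wrapped in a small host-side script (a sketch added for convenience, not part of the talk; intervals and counts are arbitrary):
    #!/bin/bash
    # perf60s.sh: quick ~60 second performance triage, in checklist order
    uptime
    dmesg | tail
    vmstat 1 5
    mpstat -P ALL 1 5
    pidstat 1 5
    iostat -xz 1 5
    free -m
    sar -n DEV 1 5
    sar -n TCP,ETCP 1 5
    top -b -n 1 | head -20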
    
slide 23:
    USE Method: Host Resources
    Resource           Utilization                            Saturation                                    Errors
    CPU                mpstat -P ALL 1, sum non-idle fields   vmstat 1, "r"                                 perf
    Memory Capacity    free -m, "used"/"total"                vmstat 1, "si"+"so"; dmesg | grep killed      dmesg
    Storage I/O        iostat -xz 1, "%util"                  iostat -xnz 1, "avgqu-sz" > 1                 /sys/…/ioerr_cnt; smartctl
    Network            nicstat, "%Util"                       ifconfig, "overruns"; netstat -s "retrans…"   ifconfig, "errors"
    These should be in your monitoring GUI. Can do other resources too (busses, ...)
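    For example, the CPU row of this table can be checked from a shell like so (a sketch; the "compare r against the CPU count" rule of thumb and the intervals are mine):
    host# mpstat -P ALL 1 5      # utilization: sum the non-idle fields per CPU
    host# vmstat 1 5             # saturation: "r" persistently above the CPU count means a runnable backlog
    host# nproc                  # CPU count, for comparison with "r"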
    
slide 24:
    Event Tracing: e.g. iosnoop
    Disk I/O events with latency (from perf-tools; also in bcc/BPF as biosnoop)
    # ./iosnoop
    Tracing block I/O... Ctrl-C to end.
    COMM         PID    TYPE DEV    BLOCK    BYTES    LATms
    supervise    …           202,1  …        …        …
    supervise    …           202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    tar          14794  RM   202,1  …        …        …
    
slide 25:
    Event Tracing: e.g. zfsslower
    # /usr/share/bcc/tools/zfsslower 1
    Tracing ZFS operations slower than 1 ms
    TIME     COMM     PID    T BYTES   OFF_KB   LAT(ms) FILENAME
    23:44:40 java     31386  O 0       …           8.02 solrFeatures.txt
    23:44:53 java     31386  W 8190    …          36.24 solrFeatures.txt
    23:44:59 java     31386  W 8192    …          20.28 solrFeatures.txt
    23:44:59 java     31386  W 8191    …          28.15 solrFeatures.txt
    23:45:00 java     31386  W 8192    …          32.17 solrFeatures.txt
    23:45:15 java     31386  O 0       …          27.44 solrFeatures.txt
    23:45:56 dockerd  …      S 0       …           1.03 .tmp-a66ce9aad…
    23:46:16 java     31386  W 31      …          36.28 solrFeatures.txt
    • This is from our production Titus system (Docker).
    • File system latency is a better pain indicator than disk latency.
    • zfsslower (and btrfs*, etc) are in bcc/BPF. Can exonerate FS/disks.
    
slide 26:
    Latency Histogram: e.g. btrfsdist
    # ./btrfsdist
    Tracing btrfs operation latency... Hit Ctrl-C to end.
    operation = 'read'
         usecs           : count     distribution
         0 -> 1          : 192529   |****************************************|
         2 -> 3          : 72337    |***************                         |
         4 -> 7          : 5620     |*                                       |    probably
         8 -> 15         : 1026     |                                        |    cache reads
         16 -> 31        : 369      |                                        |
         32 -> 63        : 239      |                                        |
         64 -> 127       : 53       |                                        |
         128 -> 255      : 975      |                                        |
         256 -> 511      : 524      |                                        |    probably cache misses
         512 -> 1023     : 128      |                                        |    (flash reads)
         1024 -> 2047    : 16       |                                        |
         2048 -> 4095    : 7        |                                        |
    […]
    From a test Titus system
    • Histograms show modes, outliers. Also in bcc/BPF (with other FSes).
    • Latency heat maps: http://queue.acm.org/detail.cfm?id=1809426
    
slide 27:
    3.2. Host Containers & cgroups
    Inspecting containers from the host
    
slide 28:
    Namespaces
    Worth checking namespace config before analysis:
    # ./dockerpsns.sh
    CONTAINER     NAME                    PID   PATH              CGROUP     IPC        MNT        NET        PID        USER       UTS
    host          titusagent-mainvpc-m       1  systemd           4026531835 4026531839 4026531840 4026532533 4026531836 4026531837 4026531838
    b27909cd6dd1  Titus-1435830-worker   37280  svscanboot        4026531835 4026533387 4026533385 4026532931 4026533388 4026531837 4026533386
    dcf3a506de45  Titus-1392192-worker   27992  /apps/spaas/spaa  4026531835 4026533354 4026533352 4026532991 4026533355 4026531837 4026533353
    370a3f041f36  Titus-1243558-worker   98602  /apps/spaas/spaa  4026531835 4026533290 4026533288 4026533223 4026533291 4026531837 4026533289
    af7549c76d9a  Titus-1243553-worker   97972  /apps/spaas/spaa  4026531835 4026533216 4026533214 4026533149 4026533217 4026531837 4026533215
    dc27769a9b9c  Titus-1243546-worker   97356  /apps/spaas/spaa  4026531835 4026533142 4026533140 4026533075 4026533143 4026531837 4026533141
    e18bd6189dcd  Titus-1243517-worker   96733  /apps/spaas/spaa  4026531835 4026533068 4026533066 4026533001 4026533069 4026531837 4026533067
    ab45227dcea9  Titus-1243516-worker   96173  /apps/spaas/spaa  4026531835 4026532920 4026532918 4026532830 4026532921 4026531837 4026532919
    A POC "docker ps --namespaces" tool. NS shared with root in red.
    https://github.com/docker/docker/issues/32501
    https://github.com/kubernetes-incubator/cri-o/issues/868
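    A minimal version of such a tool can be approximated with standard docker and /proc interfaces (my sketch, not the actual dockerpsns.sh):
    #!/bin/bash
    # For each running container, print its ID, name, init PID, and namespace IDs
    for c in $(docker ps -q); do
        pid=$(docker inspect --format '{{.State.Pid}}' "$c")
        name=$(docker inspect --format '{{.Name}}' "$c")
        ns=$(readlink /proc/"$pid"/ns/* 2>/dev/null | tr '\n' ' ')
        echo "$c ${name#/} $pid $ns"
    done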
    
slide 29:
    systemd-cgtop
    A "top" for cgroups:
    # systemd-cgtop
    Control Group                                  Tasks   %CPU   Memory   Input/s   Output/s
    /docker
    /docker/dcf3a...9d28fc4a1c72bbaff4a24834
    /docker/370a3...e64ca01198f1e843ade7ce21
    /system.slice
    /system.slice/daemontools.service
    /docker/dc277...42ab0603bbda2ac8af67996b
    /user.slice
    /user.slice/user-0.slice
    /user.slice/u....slice/session-c26.scope
    /docker/ab452...c946f8447f2a4184f3ccff2a
    /docker/e18bd...26ffdd7368b870aa3d1deb7a
    [...]
    
slide 30:
    docker stats
    A "top" for containers. Resource uNlizaNon. Workload characterizaNon.
    # docker stats
    CONTAINER     CPU %    MEM USAGE / LIMIT       MEM %    NET I/O     BLOCK I/O        PIDS
    353426a09db1  526.81%  4.061 GiB / 8.5 GiB     47.78%   0 B / 0 B   2.818 MB / 0 B
    6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB     40.57%   0 B / 0 B   2.032 MB / 0 B
    58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB     52.89%   0 B / 0 B   0 B / 0 B
    61061566ffe5  85.92%   220.9 MiB / 3.023 GiB   7.14%    0 B / 0 B   43.4 MB / 0 B
    bdc721460293  2.69%    1.204 GiB / 3.906 GiB   30.82%   0 B / 0 B   4.35 MB / 0 B
    6c80ed61ae63  477.45%  557.7 MiB / 8 GiB       6.81%    0 B / 0 B   9.257 MB / 0 B
    337292fb5b64  89.05%   766.2 MiB / 8 GiB       9.35%    0 B / 0 B   5.493 MB / 0 B
    b652ede9a605  173.50%  689.2 MiB / 8 GiB       8.41%    0 B / 0 B   6.48 MB / 0 B
    d7cd2599291f  504.28%  673.2 MiB / 8 GiB       8.22%    0 B / 0 B   12.58 MB / 0 B
    05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB       8.69%    0 B / 0 B   7.942 MB / 0 B
    09082f005755  142.04%  693.9 MiB / 8 GiB       8.47%    0 B / 0 B   8.081 MB / 0 B
    bd45a3e1ce16  190.26%  538.3 MiB / 8 GiB       6.57%    0 B / 0 B   10.6 MB / 0 B
    [...]
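    For scripted collection, docker stats also supports one-shot, formatted output (a sketch; the Go-template field names are my assumption of the docker stats --format placeholders):
    host# docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.PIDs}}"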
    
slide 31:
    top
    In the host, top shows all processes, but currently no container IDs.
    # top - 22:46:53 up 36 days, 59 min, 1 user, load average: 5.77, 5.61, 5.63
    Tasks: 1067 total, 1 running, 1046 sleeping, 0 stopped, 20 zombie
    %Cpu(s): 34.8 us, 1.8 sy, 0.0 ni, 61.3 id, 0.0 wa, 0.0 hi, 1.9 si, 0.1 st
    KiB Mem : 65958552 total, 12418448 free, 49247988 used, 4292116 buff/cache
    KiB Swap:        0 total,        0 free,        0 used. 13101316 avail Mem

      PID USER       VIRT     RES     SHR S  %CPU %MEM      TIME+ COMMAND
    28321 root       33.126g  0.023t 37564 S 621.1 38.2   35184:09 java
    97712 root       11.445g  2.333g 37084 S   3.1  3.7  404:27.90 java
    98306 root       12.149g  3.060g 36996 S   2.0  4.9  194:21.10 java
    96511 root       15.567g  6.313g 37112 S   1.7 10.0  168:07.44 java
     5283 root       1643676  100092 94184 S   1.0  0.2  401:36.16 mesos-slave
     2079 root                          12 S   0.7  0.0  220:07.75 rngd
     5272 titusag+   10.473g  1.611g 23488 S   0.7  2.6    1934:44 java
    […]
    Can fix, but that would be Docker + cgroup-v1 specific. Still need a kernel CID.
    
slide 32:
    htop
    htop can add a CGROUP field, but can truncate important info:
    CGROUP                                                       PID USER      PRI NI VIRT  RES   SHR   S CPU% MEM%    TIME+  Command
    :pids:/docker/                                             28321 root          0  33.1G 24.0G 37564 S 524. 38.2     672h  /apps/java
    :pids:/docker/                                              9982 root          0  33.1G 24.0G 37564 S 44.4 38.2 17h00:41  /apps/java
    :pids:/docker/                                              9985 root          0  33.1G 24.0G 37564 R 41.9 38.2 16h44:51  /apps/java
    :pids:/docker/                                              9979 root          0  33.1G 24.0G 37564 S 41.2 38.2 17h01:35  /apps/java
    :pids:/docker/                                              9980 root          0  33.1G 24.0G 37564 S 39.3 38.2 16h59:17  /apps/java
    :pids:/docker/                                              9981 root          0  33.1G 24.0G 37564 S 39.3 38.2 17h01:32  /apps/java
    :pids:/docker/                                              9984 root          0  33.1G 24.0G 37564 S 37.3 38.2 16h49:03  /apps/java
    :pids:/docker/                                              9983 root          0  33.1G 24.0G 37564 R 35.4 38.2 16h54:31  /apps/java
    :pids:/docker/                                              9986 root          0  33.1G 24.0G 37564 S 35.4 38.2 17h05:30  /apps/java
    :name=systemd:/user.slice/user-0.slice/session-c31.scope? 74066 root          0  27620
    :pids:/docker/                                              9998 root          0  33.1G 24.0G 37564 R 28.3 38.2 11h38:03  /apps/java
    :pids:/docker/                                             10001 root          0  33.1G 24.0G 37564 S 27.7 38.2 11h38:59  /apps/java
    :name=systemd:/system.slice/daemontools.service?           5272 titusagen 20  0  10.5G 1650M 23
    :pids:/docker/                                             10002 root          0  33.1G 24.0G 37564 S 25.1 38.2 11h40:37  /apps/java
    Can fix, but that would be Docker + cgroup-v1 specific. Still need a kernel CID.
    
slide 33:
    Host PID -> Container ID
    … who does that (CPU busy) PID 28321 belong to?
    # grep 28321 /sys/fs/cgroup/cpu,cpuacct/docker/*/tasks | cut -d/ -f7
    dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    • Only works for Docker, and that cgroup v1 layout. Some Linux commands:
    # ls -l /proc/27992/ns/*
    lrwxrwxrwx 1 root root 0 Apr 13 20:49 cgroup -> cgroup:[4026531835]
    lrwxrwxrwx 1 root root 0 Apr 13 20:49 ipc -> ipc:[4026533354]
    lrwxrwxrwx 1 root root 0 Apr 13 20:49 mnt -> mnt:[4026533352]
    […]
    # cat /proc/27992/cgroup
    11:freezer:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    10:blkio:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    9:perf_event:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    […]
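    The /proc/PID/cgroup approach generalizes into a tiny helper (a sketch added here; the pid2ctr name is made up, and it assumes the Docker cgroup v1 layout shown above):
    #!/bin/bash
    # pid2ctr: print the Docker container ID that a host PID belongs to (cgroup v1)
    pid=$1
    awk -F/ '/docker/ { print $3; exit }' /proc/"$pid"/cgroup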
    
slide 34:
    nsenter Wrapping
    … what hostname is PID 28321 running on?
    # nsenter -t 28321 -u hostname
    titus-1392192-worker-14-16
    Can namespace enter:
    – -m: mount   -u: uts   -i: ipc   -n: net   -p: pid   -U: user
    Bypasses cgroup limits, and seccomp profile (allowing syscalls)
    – For Docker, enter the container more completely with: docker exec -it CID command
    Handy nsenter one-liners:
    nsenter -t PID -u hostname         container hostname
    nsenter -t PID -n netstat -i       container netstat
    nsenter -t PID -m -p df -h         container file system usage
    nsenter -t PID -p top              container top
    
slide 35:
    nsenter: Host -> Container top
    … Given PID 28321, running top for its container by entering its namespaces:
    # nsenter -t 28321 -m -p top
    top - 18:16:13 up 36 days, 20:28, 0 users, load average: 5.66, 5.29, 5.28
    Tasks:   6 total,   1 running,   5 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 30.5 us, 1.7 sy, 0.0 ni, 65.9 id, 0.0 wa, 0.0 hi, 1.8 si, 0.1 st
    KiB Mem: 65958552 total, 54664124 used, 11294428 free, 164232 buffers
    KiB Swap:        0 total,        0 used,        0 free. 1592372 cached Mem

      PID USER    VIRT     RES    SHR   S  %CPU %MEM     TIME+ COMMAND
      301 root    33.127g  0.023t 37564 S 537.3 38.2  40269:41 java
        1 root                     1812 S   0.0  0.0   4:15.11 bash
    87888 root                     1348 R   0.0  0.0   0:00.00 top
    Note that it is PID 301 in the container. Can also see this using:
    # grep NSpid /proc/28321/status
    NSpid:
    
slide 36:
    perf: CPU Profiling
    Can run system-wide (-a), match a pid (-p), or cgroup (-G, if it works)
    # perf record -F 49 -a -g -- sleep 30
    # perf script
    Failed to open /lib/x86_64-linux-gnu/libc-2.19.so, continuing without symbols
    Failed to open /tmp/perf-28321.map, continuing without symbols
    • Symbol translation gotchas on Linux 4.13 and earlier
      – perf can't find /tmp/perf-PID.map files in the host, and the PID is different
      – perf can't find container binaries under host paths (what /usr/bin/java?)
      – Can copy files to the host, map PIDs, then run perf script/report:
        • http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
        • http://batey.info/docker-jvm-flamegraphs.html
      – Can nsenter (-m -u -i -n -p) a "power" shell, and then run "perf -p PID"
    • Linux 4.14 perf checks namespaces for symbol files
      Thanks Krister Johansen
    
slide 37:
    CPU Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 49 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • See previous slide for getting perf symbols to work
    • From the host, can study all containers, as well as container overheads
    (Flame graph annotations: Kernel TCP/IP stack. Look in areas like this to find and quantify overhead (cgroup throttles, FS layers, networking, etc); it's likely small and hard to find. Java shows missing stacks (need -XX:+PreserveFramePointer).)
    
slide 38:
    /sys/fs/cgroups (raw)
    The best source for per-cgroup metrics. e.g. CPU:
    # cd /sys/fs/cgroup/cpu,cpuacct/docker/02a7cf65f82e3f3e75283944caa4462e82f8f6ff5a7c9a...
    # ls
    cgroup.clone_children  cpuacct.usage_all          cpuacct.usage_sys          cpu.shares
    cgroup.procs           cpuacct.usage_percpu       cpuacct.usage_user         cpu.stat
    cpuacct.stat           cpuacct.usage_percpu_sys   cpu.cfs_period_us          notify_on_release
    cpuacct.usage          cpuacct.usage_percpu_user  cpu.cfs_quota_us           tasks
    # cat cpuacct.usage
    # cat cpu.stat
    nr_periods 507
    nr_throttled 74
    throttled_time 3816445175
    (throttled_time is the total time throttled, in nanoseconds: a saturation metric.
     average throttle time = throttled_time / nr_throttled)
    https://www.kernel.org/doc/Documentation/cgroup-v1/, ../scheduler/sched-bwc.txt
    https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
    Note: grep cgroup /proc/mounts to check where these are mounted
    These metrics should be included in performance monitoring GUIs
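    A small sketch of turning cpu.stat into those saturation metrics (my illustration; the path assumes the Docker cgroup v1 layout from this slide, with the container ID passed as an argument):
    #!/bin/bash
    # Print CPU throttling metrics for one container cgroup (cgroup v1)
    cg=/sys/fs/cgroup/cpu,cpuacct/docker/$1
    eval "$(awk '{ print $1"="$2 }' "$cg"/cpu.stat)"    # sets nr_periods, nr_throttled, throttled_time
    echo "periods=$nr_periods throttled=$nr_throttled throttled_time_ns=$throttled_time"
    if [ "$nr_throttled" -gt 0 ]; then
        echo "average throttle ms: $(( throttled_time / nr_throttled / 1000000 ))"
    fi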
    
slide 39:
    Netflix Atlas
    Cloud-wide monitoring of containers (and instances)
    Fetches cgroup metrics via Intel snap
    https://github.com/netflix/Atlas
    
slide 40:
    Netflix Vector
    Our per-instance analyzer
    Has per-container metrics
    https://github.com/Netflix/vector
    
slide 41:
    Intel snap
    A metric collector used by monitoring GUIs
    https://github.com/intelsdi-x/snap
    Has a Docker plugin to read cgroup stats
    There's also a collectd plugin:
    https://github.com/bobrik/collectd-docker
    
slide 42:
    3.3. Let's Play a Game
    Host or Container?
    (or Neither?)
    
slide 43:
    Game Scenario 1
    Container user claims they have a CPU performance issue
    Container has a CPU cap and CPU shares configured
    There is idle CPU on the host
    Other tenants are CPU busy
    /sys/fs/cgroup/.../cpu.stat -> throttled_time is increasing
    /proc/PID/status nonvoluntary_ctxt_switches is increasing
    Container CPU usage equals its cap (clue: this is not really a clue)
    
slide 44:
    Game Scenario 2
    Container user claims they have a CPU performance issue
    Container has a CPU cap and CPU shares configured
    There is no idle CPU on the host
    Other tenants are CPU busy
    /sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
    /proc/PID/status nonvoluntary_ctxt_switches is increasing
    
slide 45:
    Game Scenario 3
    Container user claims they have a CPU performance issue
    Container has CPU shares configured
    There is no idle CPU on the host
    Other tenants are CPU busy
    /sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
    /proc/PID/status nonvoluntary_ctxt_switches is not increasing much
    Experiments to confirm conclusion?
    
slide 46:
    Methodology: Reverse Diagnosis
    Enumerate possible outcomes, and work backwards to the metrics needed for diagnosis.
    For example, CPU performance outcomes:
    A. physical CPU throttled
    B. cap throttled
    C. shares throttled (assumes physical CPU limited as well)
    D. not throttled
    Game answers: 1. B, 2. C, 3. D
    
slide 47:
    CPU Bottleneck Identification
    (Differential Diagnosis flowchart, working through these checks in order:)
    • throttled_time increasing? -> cap throttled
    • nonvol…switches increasing? -> if not: not throttled (but dig further)
    • host has idle CPU? / all other tenants idle? -> distinguishes share throttled from physical CPU throttled
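    These checks can be scripted against the interfaces shown earlier (a rough sketch under the Docker cgroup v1 layout; the arguments, 10 second interval, and interpretation comments are mine, not from the talk):
    #!/bin/bash
    # Sample throttling and host idle for one container: args are <containerID> <busy host PID>
    cg=/sys/fs/cgroup/cpu,cpuacct/docker/$1
    pid=$2
    t0=$(awk '/^throttled_time/ {print $2}' "$cg"/cpu.stat)
    s0=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/"$pid"/status)
    sleep 10
    t1=$(awk '/^throttled_time/ {print $2}' "$cg"/cpu.stat)
    s1=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/"$pid"/status)
    idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
    echo "throttled_time delta (ns): $((t1 - t0))   # increasing -> cap throttled"
    echo "nonvol ctxt switch delta:  $((s1 - s0))   # flat -> likely not throttled"
    echo "host idle %:               $idle          # ~0 with busy neighbors -> share throttled"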
    
slide 48:
    4. GUEST TOOLS
    And Container Awareness
    … if you only have guest access
    
slide 49:
    Guest Analysis Challenges
    • Some resource metrics are for the container, some for the host. Confusing!
    • May lack system capabilities or syscalls to run profilers and tracers
    
slide 50:
    CPU
    Can see host's CPU devices, but only container (pid namespace) processes:
    container# uptime
     20:17:19 up 45 days, 21:21,  0 users,  load average: 5.08, 3.69, 2.22       (load!)
    container# mpstat 1
    Linux 4.9.0 (02a7cf65f82e)   04/14/17   _x86_64_   (8 CPU)
    20:17:26  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
    20:17:27  all  …                                                              (busy CPUs)
    20:17:28  all  …
    Average:  all  …
    container# pidstat 1
    Linux 4.9.0 (02a7cf65f82e)   04/14/17   _x86_64_   (8 CPU)
    20:17:33  UID  PID  %usr  %system  %guest  %CPU  CPU  Command
    20:17:34  UID  PID  %usr  %system  %guest  %CPU  CPU  Command
    20:17:35  UID  PID  %usr  %system  %guest  %CPU  CPU  Command
    [...]
    (but this container is running nothing: we saw CPU usage from neighbors)
    
slide 51:
    Memory
    Can see host's memory:
    container# free -m
                  total   used   free   shared   buff/cache   available
    Mem:          …                                              (host memory; this container is --memory=1g)
    Swap:         …
    container# perl -e '$a = "A" x 1_000_000_000'                (tries to consume ~2 Gbytes)
    Killed
    
slide 52:
    Disks
    Can see host's disk devices:
    container# iostat -xz 1
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              0.00 16.94 …
    Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    xvdap1   …
    xvdb     …
    xvdc     …
    md0      …
    [...]
    (host disk I/O)
    container# pidstat -d 1
    Linux 4.9.0 (02a7cf65f82e)   04/18/17   _x86_64_   (8 CPU)
    22:41:13  UID  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  iodelay  Command
    22:41:14  UID  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  iodelay  Command
    22:41:15  UID  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  iodelay  Command
    [...]
    (but no container I/O)
    
slide 53:
    Network
    Can't see host's network interfaces (network namespace):
    container# sar -n DEV,TCP 1
    Linux 4.9.0 (02a7cf65f82e)   04/14/17   _x86_64_   (8 CPU)
    21:45:07   IFACE   rxpck/s   txpck/s   rxkB/s   txkB/s   rxcmp/s   txcmp/s   rxmcst/s   %ifutil
    21:45:08   eth0    …
    21:45:08   active/s   passive/s   iseg/s   oseg/s
    21:45:09   …
    [...]
    (host has heavy network I/O, container sees itself (idle))
    
slide 54:
    Metrics Namespace
    This confuses apps too: trying to bind on all CPUs, or using 25% of memory
    Including the JDK, which is unaware of container limits
    We could add a "metrics" namespace so the container only sees itself
    Or enhance existing namespaces to do this
    If you add a metrics namespace, please consider adding an option for:
    • /proc/host/stats: maps to host's /proc/stats, for CPU stats
    • /proc/host/diskstats: maps to host's /proc/diskstats, for disk stats
    As those host metrics can be useful, to identify/exonerate neighbor issues
    
slide 55:
    perf: CPU Profiling
    Needs capabilities to run from a container:
    container# ./perf record -F 99 -a -g -- sleep 10
    perf_event_open(..., PERF_FLAG_FD_CLOEXEC) failed with unexpected error 1 (Operation not permitted)
    perf_event_open(..., 0) failed unexpectedly with error 1 (Operation not permitted)
    Error: You may not have permission to collect system-wide stats.
    Consider tweaking /proc/sys/kernel/perf_event_paranoid,
    [...]
    Although tweaking perf_event_paranoid (to -1) doesn't fix it. The real problem is:
    https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile:
    
slide 56:
    perf, cont.
    Can enable perf_event_open() with: docker run --cap-add sys_admin
    – Also need (for kernel symbols): echo 0 > /proc/sys/kernel/kptr_restrict
    perf then "works", and you can make flame graphs. But it sees all CPUs!?
    – perf needs to be "container aware", and only see the container's tasks.
      patch pending: https://lkml.org/lkml/2017/1/12/308
    Currently easier to run perf from the host (or secure "monitoring" container)
    – e.g. Netflix Vector -> CPU Flame Graph
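    Putting those two steps together for a test container (a sketch; sys_admin is a broad capability, so treat this as debug-only, and the image name is illustrative):
    host# echo 0 > /proc/sys/kernel/kptr_restrict           # expose kernel symbol addresses
    host# docker run -it --cap-add sys_admin ubuntu bash
    container# perf record -F 99 -a -g -- sleep 10          # now permitted, but samples all CPUs
    container# perf report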
    
slide 57:
    5. TRACING
    Advanced Analysis
    … a few more examples
    (iosnoop, zfsslower, and btrfsdist shown earlier)
    
slide 58:
    Built-in Linux Tracers
    ftrace (2008+)
    perf_events (2009+)
    eBPF (2014+)
    Some front-ends:
    • ftrace: https://github.com/brendangregg/perf-tools
    • perf_events: used for CPU flame graphs
    • eBPF (aka BPF): https://github.com/iovisor/bcc (Linux 4.4+)
    
slide 59:
    ftrace: Overlay FS Function Calls
    Using ftrace via my perf-tools to count function calls in kernel context:
    # funccount '*ovl*'
    Tracing "*ovl*"... Ctrl-C to end.
    FUNC
    COUNT
    ovl_cache_free
    ovl_xattr_get
    [...]
    ovl_fill_merge
    ovl_path_real
    ovl_path_upper
    ovl_update_time
    ovl_permission
    ovl_d_real
    ovl_override_creds
    Ending tracing...
    Each can be a target for further study with kprobes
    
slide 60:
    ftrace: Overlay FS Function Tracing
    Using kprobe (perf-tools) to trace ovl_fill_merge() args and stack trace:
    # kprobe -s 'p:ovl_fill_merge ctx=%di name=+0(%si):string'
    Tracing kprobe ovl_fill_merge. Ctrl-C to end.
    bash-16633 [000] d... 14390771.218973: ovl_fill_merge: (ovl_fill_merge+0x0/0x1f0
    [overlay]) ctx=0xffffc90042477db0 name="iostat"
    bash-16633 [000] d... 14390771.218981: <stack trace>
    => ovl_fill_merge
    => ext4_readdir
    => iterate_dir
    => ovl_dir_read_merged
    => ovl_iterate
    => iterate_dir
    => SyS_getdents
    => do_syscall_64
    => return_from_SYSCALL_64
    […]
    Good for debugging, although dumping all events can cost too much overhead. ftrace has some
    solutions to this, BPF has more…
    
slide 61:
    Enhanced BPF Tracing Internals
    (Diagram: an observability program generates BPF bytecode and event config; the kernel verifier loads the BPF program, which attaches to static tracing (tracepoints), dynamic tracing (kprobes, uprobes), and sampling/PMCs (perf_events); output returns to user space as per-event data (async copy via perf_events) and statistics (maps).)
    
slide 62:
    (Diagram: bcc/BPF performance tools)
    
slide 63:
    BPF: Scheduler Latency
    host# runqlat --pidnss -m
    Tracing run queue latency... Hit Ctrl-C to end.        (summarized in-kernel for efficiency)
    pidns = 4026532382
         msecs           : count     distribution
         0 -> 1          : 646      |****************************************|
         2 -> 3          : 18       |*                                       |
         4 -> 7          : 48       |**                                      |
         8 -> 15         : 17       |*                                       |
         16 -> 31        : 150      |*********                               |
         32 -> 63        : 134      |********                                |
    […]
    pidns = 4026532870
         msecs           : count     distribution
         0 -> 1          : 264      |****************************************|
         2 -> 3          : 0        |                                        |
    [...]
    Per-PID namespace histograms.
    Shows CPU share throttling when present (eg, 8 - 65 ms)
    Currently using task_struct->nsproxy->pid_ns_for_children->ns.inum for pidns.
    We could add a stable bpf_get_current_pidns() call to BPF.
    
slide 64:
    Docker Analysis & Debugging
    If needed, dockerd can also be analyzed using:
    go execution tracer
    GODEBUG with gctrace and schedtrace
    gdb and Go runtime support
    perf profiling
    bcc/BPF and uprobes
    Each has pros/cons. bcc/BPF can trace user & kernel events.
    
slide 65:
    BPF: dockerd Go Function Counting
    Counting dockerd Go calls in-kernel using BPF that match "*docker*get":
    # funccount '/usr/bin/dockerd:*docker*get*'
    Tracing 463 functions for "/usr/bin/dockerd:*docker*get*"... Hit Ctrl-C to end.
    FUNC
    COUNT
    github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage
    github.com/docker/docker/daemon.(*Daemon).getNetworkSandboxID
    github.com/docker/docker/daemon.(*Daemon).getNetworkStats
    github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage.func1
    github.com/docker/docker/pkg/ioutils.getBuffer
    github.com/docker/docker/vendor/golang.org/x/net/trace.getFamily
    github.com/docker/docker/vendor/google.golang.org/grpc.(*ClientConn).getTransport
    github.com/docker/docker/vendor/github.com/golang/protobuf/proto.getbase
    github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*http2Client).getStream
    Detaching...
    # objdump -tTj .text /usr/bin/dockerd | wc -l
    35,859 functions can be traced!
    Uses uprobes, and needs newer kernels. Warning: will cost overhead at high function rates.
    
slide 66:
    BPF: dockerd Go Stack Tracing
    Counting stack traces that led to this ioutils.getBuffer() call:
    # stackcount 'p:/usr/bin/dockerd:*/ioutils.getBuffer'
    Tracing 1 functions for "p:/usr/bin/dockerd:*/ioutils.getBuffer"... Hit Ctrl-C to end.
    github.com/docker/docker/pkg/ioutils.getBuffer
    github.com/docker/docker/pkg/broadcaster.(*Unbuffered).Write
    bufio.(*Reader).writeBuf
    bufio.(*Reader).WriteTo
    io.copyBuffer
    io.Copy
    github.com/docker/docker/pkg/pools.Copy
    github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1
    runtime.goexit
    dockerd [18176]
        110        (means this stack was seen 110 times)
    Detaching...
    Can also trace function arguments, and latency (with some work)
    http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
    
slide 67:
    Summary
    Identify bottlenecks:
    1. In the host vs container, using system metrics
    2. In application code on containers, using CPU flame graphs
    3. Deeper in the kernel, using tracing tools
    
slide 68:
    References
    http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
    http://techblog.netflix.com/2016/07/distributed-resource-scheduling-with.html
    https://www.slideshare.net/aspyker/netflix-and-containers-titus
    https://docs.docker.com/engine/admin/runmetrics/#tips-for-high-performance-metric-collection
    https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
    https://www.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon
    https://www.youtube.com/watch?v=sK5i-N34im8 Cgroups, namespaces, and beyond
    https://jvns.ca/blog/2016/10/10/what-even-is-a-container/
    https://blog.jessfraz.com/post/containers-zones-jails-vms/
    http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
    http://www.brendangregg.com/USEmethod/use-linux.html full USE method list
    http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    http://queue.acm.org/detail.cfm?id=1809426 latency heat maps
    https://github.com/brendangregg/perf-tools ftrace tools, https://github.com/iovisor/bcc BPF tools
    
slide 69:
    Thank You!
    http://techblog.netflix.com
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    bgregg@netflix.com
    @brendangregg
    Titus team: @aspyker @anwleung @fabiokung @tomaszbak1974 @amit_joshee @sargun @corindwyer …
    October 29–November 3, 2017 | San Francisco, CA
    www.usenix.org/lisa17
    #lisa17