LISA2019_Linux_Systems_Performance.pdf

USENIX LISA 2019: Linux Systems Performance

Video: https://www.youtube.com/watch?v=fhBHvsi0Ql0

Talk by Brendan Gregg for USENIX LISA 2019

Description: "Systems performance is an effective discipline for performance analysis and tuning, and can help you find performance wins for your applications and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas of Linux systems performance: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (Ftrace, bcc/BPF, and bpftrace/BPF), and much advice about what is and isn't important to learn. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud."

PDF: LISA2019_Linux_Systems_Performance.pdf

Keywords (from pdftotext):

slide 1:

Oct, 2019
Linux Systems
Performance
Brendan Gregg
Senior Performance Engineer
USENIX LISA 2019, Portland, Oct 28-30

slide 2:

Experience: A 3x Perf Difference

slide 3:

mpstat
load averages: serverA 90, serverB 17
serverA# mpstat 10
Linux 4.4.0-130-generic (serverA) 07/18/2019 _x86_64_ (48 CPU)
10:07:55 PM  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
10:08:05 PM  all
10:08:15 PM  all
10:08:25 PM  all
[...]
Average:     all
serverB# mpstat 10
Linux 4.19.26-nflx (serverB) 07/18/2019 _x86_64_ (64 CPU)
09:56:11 PM  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
09:56:21 PM  all
09:56:31 PM  all
09:56:41 PM  all
[...]
Average:     all

slide 4:

pmcarch
serverA# ./pmcarch -p 4093 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED    BR_MISPRED  BMR%  LLCREF       LLCMISS      LLC%
982412660  575706336  0.59 126424862460  2416880487  1.91  15724006692  10872315070  30.86
999621309  555043627  0.56 120449284756  2317302514  1.92  15378257714  11121882510  27.68
991146940  558145849  0.56 126350181501  2530383860  2.00  15965082710  11464682655  28.19
996314688  562276830  0.56 122215605985  2348638980  1.92  15558286345  10835594199  30.35
979890037  560268707  0.57 125609807909  2386085660  1.90  15828820588  11038597030  30.26
serverB# ./pmcarch -p 1928219 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED    BR_MISPRED  BMR%  LLCREF       LLCMISS      LLC%
147523816  222396364  1.51 46053921119               1.39  8880477235   968809014    89.09
156634810  229801807  1.47 48236123575               1.35  9186609260   1183858023   87.11
152783226  237001219  1.55 49344315621               1.40  9314992450   879494418    90.56
140787179  213570329  1.52 44518363978               1.42  8675999448   712318917    91.79
136822760  219706637  1.61 45129020910               1.44  8689831639   617678747    92.89
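The derived columns in the pmcarch output can be recomputed from the raw counters. The following is a minimal sketch, assuming IPC is instructions divided by cycles, BMR% is mispredicted branches over retired branches, and LLC% is the last-level cache hit ratio, which is consistent with the rows shown. The function names are illustrative and are not part of pmcarch.

```shell
# ipc CYCLES INSTRUCTIONS: instructions per cycle, two decimals.
ipc() {
    awk -v c="$1" -v i="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

# bmr_pct BR_RETIRED BR_MISPRED: branch misprediction ratio as a percentage.
bmr_pct() {
    awk -v r="$1" -v m="$2" 'BEGIN { printf "%.2f\n", 100 * m / r }'
}

# llc_hit_pct LLCREF LLCMISS: last-level cache hit ratio as a percentage.
llc_hit_pct() {
    awk -v r="$1" -v m="$2" 'BEGIN { printf "%.2f\n", 100 * (r - m) / r }'
}

# First serverA row from the slide.
ipc 982412660 575706336               # prints 0.59
bmr_pct 126424862460 2416880487       # prints 1.91
llc_hit_pct 15724006692 10872315070   # prints 30.86
```

Run against the first serverB row, the same functions give an IPC of 1.51 and an LLC hit ratio of 89.09%, which puts the difference between the two servers in the cache hit ratio and IPC rather than in CPU utilization alone.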

slide 5:

perf
serverA# perf stat -e cs -a -I 1000
time
counts unit events
2,063,105
2,065,354
1,527,297
515,509
2,447,126
[...]
serverB# perf stat -e cs -p 1928219 -I 1000
time
counts unit events
1,172
1,370
1,034
1,207
1,053
[...]
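To put the context-switch counts in proportion, a one-second count can be divided by the CPU count. This is a minimal sketch; `per_cpu_rate` is a hypothetical helper, and the 48-CPU figure is taken from the serverA mpstat banner. Note that the serverA command is system-wide (`-a`) while the serverB command is limited to one PID (`-p`), so the two listings are not a like-for-like comparison.

```shell
# per_cpu_rate COUNT NCPU: events per second per CPU for a 1-second interval.
# COUNT may contain thousands separators, as perf stat prints them.
per_cpu_rate() {
    count=$(printf '%s' "$1" | tr -d ',')
    awk -v n="$count" -v cpus="$2" 'BEGIN { printf "%.0f\n", n / cpus }'
}

per_cpu_rate 2,063,105 48    # prints 42981
```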

slide 6:

bcc/BPF
serverA# /usr/share/bcc/tools/cpudist -p 4093 10 1
Tracing on-CPU time... Hit Ctrl-C to end.
usecs        : count     distribution
    0 -> 1   : 3618650  |****************************************|
    2 -> 3   : 2704935  |*****************************
    4 -> 7   : 421179   |****
    8 -> 15  : 99416
   16 -> 31  : 16951
   32 -> 63  : 6355
[...]
serverB# /usr/share/bcc/tools/cpudist -p 1928219 10 1
Tracing on-CPU time... Hit Ctrl-C to end.
usecs              : count  distribution
  256 -> 511       : 44
  512 -> 1023      : 156
 1024 -> 2047      : 238   |**
 2048 -> 4095      : 4511  |****************************************|
 4096 -> 8191      : 277   |**
 8192 -> 16383     : 286   |**
16384 -> 32767     : 77
[...]

slide 7:

Systems Performance in 45 mins
• This is slides + discussion
• For more detail and stand-alone texts:

slide 8:

Agenda
1. Observability
2. Methodologies
3. Benchmarking
4. Profiling
5. Tracing
6. Tuning

slide 9:

slide 10:

1. Observability

slide 11:

How do you measure these?

slide 12:

Linux Observability Tools

slide 13:

Why Learn Tools?
• Most analysis at Netflix is via GUIs
• Benefits of command-line tools:
Helps you understand GUIs: they show the same metrics
Often documented, unlike GUI metrics
Often have useful options not exposed in GUIs
• Installing essential tools (something like):
$ sudo apt-get install sysstat bcc-tools bpftrace linux-tools-common \
linux-tools-$(uname -r) iproute2 msr-tools
$ git clone https://github.com/brendangregg/msr-cloud-tools
$ git clone https://github.com/brendangregg/bpf-perf-tools-book
These are crisis tools and should be installed by default
In a performance meltdown you may be unable to install them

slide 14:

uptime
• One way to print load averages:
$ uptime
07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91
• A measure of resource demand: CPUs + disks
– Includes TASK_UNINTERRUPTIBLE state to show all demand types
– You can use BPF & off-CPU flame graphs to explain this state:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
– PSI in Linux 4.20 shows CPU, I/O, and memory loads
• Exponentially-damped moving averages
– With time constants of 1, 5, and 15 minutes. See historic trend.
• Load > # of CPUs, may mean CPU saturation
Don’t spend more than 5 seconds studying these
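The "exponentially-damped moving average" can be illustrated with a short simulation. This is a floating-point sketch, not the kernel's fixed-point code, and it assumes the load is updated every 5 seconds with a decay factor of exp(-5/period); `damped_load` is a hypothetical name.

```shell
# damped_load START N SECONDS [PERIOD]
#   START   initial load average
#   N       constant number of runnable (or uninterruptible) tasks
#   SECONDS how long to simulate
#   PERIOD  time constant in seconds (60, 300, or 900); defaults to 60
damped_load() {
    awk -v load="$1" -v n="$2" -v secs="$3" -v period="${4:-60}" 'BEGIN {
        decay = exp(-5 / period)              # one 5-second update step
        for (t = 0; t < secs; t += 5)
            load = load * decay + n * (1 - decay)
        printf "%.2f\n", load
    }'
}

# Idle system, then one CPU-bound task runs for 60 seconds.
damped_load 0 1 60    # prints 0.63
```

After one minute of a single busy task, the 1-minute figure has reached only about 0.63, which is why the three numbers are best read as a trend rather than as exact averages.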

slide 15:

top
• System and per-process interval summary:
$ top
top - 18:50:26 up 7:43, 1 user, load average: 4.11, 4.91, 5.22
Tasks: 209 total, 1 running, 206 sleeping, 0 stopped, 2 zombie
Cpu(s): 47.1%us, 4.0%sy, 0.0%ni, 48.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
Mem: 70197156k total, 44831072k used, 25366084k free, 36360k buffers
Swap: 0k total, 0k used, 0k free, 11873356k cached
PID USER VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5738 apiprod 0 62.6g 29g 352m S 417 44.2 2144:15 java
1386 apiprod 0 17452 1388 964 R 0 0.0 0:00.02 top
1 root 0 24340 2272 1340 S 0 0.0 0:01.51 init
2 root 0 S 0 0.0 0:00.00 kthreadd
[…]
• %CPU is summed across all CPUs
• Can miss short-lived processes (atop won’t)

slide 16:

htop
$ htop
1 [||||||||||70.0%]
13 [||||||||||70.6%]
2 [||||||||||68.7%]
14 [||||||||||69.4%]
3 [||||||||||68.2%]
15 [||||||||||68.5%]
4 [||||||||||69.3%]
16 [||||||||||69.2%]
5 [||||||||||68.0%]
17 [||||||||||67.6%]
[…]
Mem[||||||||||||||||||||||||||||||176G/187G]
Swp[
0K/0K]
25 [||||||||||69.7%]
26 [||||||||||67.7%]
27 [||||||||||68.8%]
28 [||||||||||67.6%]
29 [||||||||||70.1%]
37 [||||||||||66.6%]
38 [||||||||||66.0%]
39 [||||||||||73.3%]
40 [||||||||||67.0%]
41 [||||||||||66.5%]
Tasks: 80, 3206 thr; 43 running
Load average: 36.95 37.19 38.29
Uptime: 01:39:36
PID USER
PRI NI VIRT
RES
SHR S CPU% MEM%
TIME+ Command
4067 www-data
0 202G 173G 55392 S 3359 93.0 48h51:30 /apps/java/bin/java -Dnop -Djdk.map
6817 www-data
0 202G 173G 55392 R 56.9 93.0 48:37.89 /apps/java/bin/java -Dnop -Djdk.map
6826 www-data
0 202G 173G 55392 R 25.7 93.0 22:26.90 /apps/java/bin/java -Dnop -Djdk.map
6721 www-data
0 202G 173G 55392 S 25.0 93.0 22:05.51 /apps/java/bin/java -Dnop -Djdk.map
6616 www-data
0 202G 173G 55392 S 13.6 93.0 11:15.51 /apps/java/bin/java -Dnop -Djdk.map
[…]
F1Help F2Setup F3SearchF4FilterF5Tree F6SortByF7Nice -F8Nice +F9Kill F10Quit
Pros: configurable. Cons: misleading colors.
dstat is similar, and now dead (May 2019); see pcp-dstat

slide 17:

vmstat
• Virtual memory statistics and more:
$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache cs us sy id wa
8 0
12 25 34 0 0
7 0
0 205 186 46 13 0 0
8 0
8 210 435 39 21 0 0
8 0
0 218 219 42 17 0 0
[…]
• USAGE: vmstat [interval [count]]
• First output line has some summary since boot values
• High level CPU summary
– “r” is runnable tasks

slide 18:

iostat
• Block I/O (disk) stats. 1st output is since boot.
$ iostat -xz 1
Linux 5.0.21 (c099.xxxx) 06/24/19 _x86_64_ (32 CPU)
[...]
Device
r/s
w/s
rkB/s
wkB/s
rrqm/s
sda
nvme3n1
20.39 293152.56 14758.05
nvme1n1
17.83 286402.15 13089.56
nvme0n1
19.70 258184.52 14218.55
wrqm/s %rrqm %wrqm \...
0.00 /...
0.00 18.81 \...
0.00 18.52 /...
0.00 19.51 \...
Workload
Very useful
set of stats
...\ r_await w_await aqu-sz rareq-sz wareq-sz
.../
...\
.../
...\
Resulting Performance
svctm
%util

slide 19:

free
• Main memory usage:
$ free -m
       total  used  free  shared  buff/cache  available
Mem:
Swap:
• Recently added “available” column
– buff/cache: block device I/O cache + virtual page cache
– available: memory likely available to apps
– free: completely unused memory

slide 20:

strace
• System call tracer:
$ strace -tttT -p 313
1408393285.779746 getgroups(0, NULL) = 1 <...>
1408393285.779873 getgroups(1, [0]) = 1 <...>
1408393285.780797 close(3) = 0 <...>
1408393285.781338 write(1, "wow much syscall\n", 17wow much syscall
) = 17 <...>
• Translates syscall arguments
• Not all kernel requests (e.g., page faults)
• Currently has massive overhead (ptrace based)
– Can slow the target by > 100x. Skews measured time (-ttt, -T).
http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
• perf trace will replace it: uses a ring buffer & BPF

slide 21:

tcpdump
• Sniff network packets for post analysis:
$ tcpdump -i eth0 -w /tmp/out.tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C7985 packets captured
8996 packets received by filter
1010 packets dropped by kernel
# tcpdump -nr /tmp/out.tcpdump | head
reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
[…]
• Study packet sequences with timestamps (us)
• CPU overhead optimized (socket ring buffers), but can still be significant. Use BPF in-kernel summaries instead.

slide 22:

nstat
• Replacement for netstat from iproute2
• Various network protocol statistics:
– -s won’t reset counters, otherwise intervals can be examined
– -d for daemon mode
• Linux keeps adding more counters
$ nstat -s
#kernel
IpInReceives
IpInDelivers
IpOutRequests
[...]
TcpActiveOpens
TcpPassiveOpens
TcpAttemptFails
TcpEstabResets
TcpInSegs
TcpOutSegs
TcpRetransSegs
TcpOutRsts
[...]

slide 23:

slabtop
• Kernel slab allocator memory usage:
$ slabtop
 Active / Total Objects (% used)    : 4692768 / 4751161 (98.8%)
 Active / Total Slabs (% used)      : 129083 / 129083 (100.0%)
 Active / Total Caches (% used)     : 71 / 109 (65.1%)
 Active / Total Size (% used)       : 729966.22K / 738277.47K (98.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K
   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
3565575 3565575 100%    0.10K  91425             365700K buffer_head
 314916  314066  99%    0.19K  14996              59984K dentry
 184192  183751  99%    0.06K                     11512K kmalloc-64
 138618  138618 100%    0.94K                    130464K xfs_inode
 138602  138602 100%    0.21K                     29968K xfs_ili
 102116   99012  96%    0.55K                     58352K radix_tree_node
  97482   49093  50%    0.09K                      9284K kmalloc-96
  22695   20777  91%    0.05K                      1068K shared_policy_node
  21312   21312 100%    0.86K                     18432K ext4_inode_cache
  16288   14601  89%    0.25K                      4072K kmalloc-256
[…]

slide 24:

pcstat
• Show page cache residency by file:
# ./pcstat data0*
|----------+----------------+------------+-----------+---------|
| Name     | Size           | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| data00   | 104857600      | 25600      | 25600     | 100.000 |
| data01   | 104857600      | 25600      | 25600     | 100.000 |
| data02   | 104857600      | 25600      | 4080      | 015.938 |
| data03   | 104857600      | 25600      | 25600     | 100.000 |
| data04   | 104857600      | 25600      | 16010     | 062.539 |
| data05   | 104857600      | 25600      | 0         | 000.000 |
|----------+----------------+------------+-----------+---------|
• Uses mincore(2) syscall. Used for database perf analysis.
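The Pages and Percent columns follow from the file size and the count of resident pages. A minimal sketch of that arithmetic, assuming a 4 KiB page size (true of the x86 systems shown, but an assumption elsewhere) and a non-empty file; the function names are illustrative and not part of pcstat.

```shell
# pc_pages SIZE_BYTES [PAGE_SIZE]: number of pages needed to hold the file.
pc_pages() {
    awk -v s="$1" -v p="${2:-4096}" 'BEGIN { printf "%d\n", int((s + p - 1) / p) }'
}

# pc_percent SIZE_BYTES CACHED_PAGES [PAGE_SIZE]: percent of pages resident.
pc_percent() {
    awk -v s="$1" -v c="$2" -v p="${3:-4096}" 'BEGIN {
        pages = int((s + p - 1) / p)
        printf "%.3f\n", 100 * c / pages
    }'
}

pc_pages 104857600            # prints 25600
pc_percent 104857600 16010    # prints 62.539 (the data04 row)
```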

slide 25:

docker stats
• Soft limits (cgroups) by container:
# docker stats
CONTAINER     CPU %    MEM USAGE / LIMIT      MEM %   NET I/O    BLOCK I/O       PIDS
353426a09db1  526.81%  4.061 GiB / 8.5 GiB    47.78%  0 B / 0 B  2.818 MB / 0 B
6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB    40.57%  0 B / 0 B  2.032 MB / 0 B
58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB    52.89%  0 B / 0 B  0 B / 0 B
61061566ffe5  85.92%   220.9 MiB / 3.023 GiB  7.14%   0 B / 0 B  43.4 MB / 0 B
bdc721460293  2.69%    1.204 GiB / 3.906 GiB  30.82%  0 B / 0 B  4.35 MB / 0 B
6c80ed61ae63  477.45%  557.7 MiB / 8 GiB      6.81%   0 B / 0 B  9.257 MB / 0 B
337292fb5b64  89.05%   766.2 MiB / 8 GiB      9.35%   0 B / 0 B  5.493 MB / 0 B
b652ede9a605  173.50%  689.2 MiB / 8 GiB      8.41%   0 B / 0 B  6.48 MB / 0 B
d7cd2599291f  504.28%  673.2 MiB / 8 GiB      8.22%   0 B / 0 B  12.58 MB / 0 B
05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB      8.69%   0 B / 0 B  7.942 MB / 0 B
09082f005755  142.04%  693.9 MiB / 8 GiB      8.47%   0 B / 0 B  8.081 MB / 0 B
[...]
• Stats are in /sys/fs/cgroup
• CPU shares and bursting break monitoring assumptions
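One cgroup statistic that a CPU % column does not show is throttling. The sketch below reads a `cpu.stat` file on standard input and reports the share of scheduler periods in which the group was throttled, assuming the file contains `nr_periods` and `nr_throttled` lines, as it does when a CPU bandwidth limit is in use. `throttled_pct` is a hypothetical helper, and the location of `cpu.stat` for a given container depends on the cgroup version and container runtime.

```shell
# throttled_pct: read cpu.stat on stdin, print percent of periods throttled.
throttled_pct() {
    awk '$1 == "nr_periods"   { p = $2 }
         $1 == "nr_throttled" { t = $2 }
         END {
             if (p > 0) printf "%.1f\n", 100 * t / p
             else       print "0.0"
         }'
}

# Example with made-up counters (not measured data):
printf 'nr_periods 200\nnr_throttled 50\nthrottled_time 123456789\n' | throttled_pct
# prints 25.0
```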

slide 26:

showboost
• Determine current CPU clock rate
# showboost
Base CPU MHz : 2500
Set CPU MHz : 2500
Turbo MHz(s) : 3100 3200 3300 3500
Turbo Ratios : 124% 128% 132% 140%
CPU 0 summary every 1 seconds...
TIME      C0_MCYC  C0_ACYC  UTIL  RATIO  MHz
23:39:07                    64%
23:39:08                    70%
23:39:09                    99%
• Uses MSRs. Can also use PMCs for this.
• Also see turbostat.
https://github.com/brendangregg/msr-cloud-tools
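The Turbo Ratios line is each turbo frequency expressed as a percentage of the base frequency. A similar ratio of actual cycles to reference cycles gives the current clock rate; the sketch below assumes C0_ACYC and C0_MCYC are deltas of those two cycle counters over the same interval. The function names and the sample cycle counts are illustrative only.

```shell
# turbo_pct BASE_MHZ TURBO_MHZ: turbo frequency as a percentage of base.
turbo_pct() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%.0f%%\n", 100 * t / b }'
}

# cur_mhz BASE_MHZ MCYC ACYC: estimated clock rate while not idle,
# assuming MCYC counts at the base rate and ACYC at the actual rate.
cur_mhz() {
    awk -v b="$1" -v m="$2" -v a="$3" 'BEGIN { printf "%.0f\n", b * a / m }'
}

turbo_pct 2500 3100       # prints 124%
turbo_pct 2500 3500       # prints 140%
cur_mhz 2500 1000 1240    # prints 3100 (made-up cycle counts)
```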

slide 27:

Also: Static Performance Tuning Tools

slide 28:

Where do you start...and stop?
Workload Observability
Static Configuration

slide 29:

2. Methodologies

slide 30:

Anti-Methodologies
• The lack of a deliberate methodology…
• Street Light Anti-Method:
– 1. Pick observability tools that are
• Familiar
• Found on the Internet
• Found at random
– 2. Run tools
– 3. Look for obvious issues
• Drunk Man Anti-Method:
– Tune things at random until the problem goes away

slide 31:

Methodologies
Linux Performance Analysis in 60 seconds
The USE method
Workload characterization
Many others:
– Resource analysis
– Workload analysis
– Drill-down analysis
– CPU profile method
– Off-CPU analysis
– Static performance tuning
– 5 whys

slide 32:

Linux Perf Analysis in 60s
uptime                 load averages
dmesg -T | tail        kernel errors
vmstat 1               overall stats by time
mpstat -P ALL 1        CPU balance
pidstat 1              process usage
iostat -xz 1           disk I/O
free -m                memory usage
sar -n DEV 1           network I/O
sar -n TCP,ETCP 1      TCP stats
top                    check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
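The checklist can be wrapped in a small script for unattended capture. This is a minimal sketch, not something from the talk: the interval tools are given a count of 5 so that they terminate, and `top` is run once in batch mode. Both choices are assumptions made here, and `checklist` and `run_checklist` are hypothetical names.

```shell
# checklist: print the ten commands, one per line, bounded so each one exits.
checklist() {
    cat <<'EOF'
uptime
dmesg -T | tail
vmstat 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -xz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1 | head -n 20
EOF
}

# run_checklist: run each command in order with a heading before its output.
run_checklist() {
    checklist | while IFS= read -r cmd; do
        printf '\n### %s\n' "$cmd"
        # stdin is redirected so a command cannot consume the command list
        sh -c "$cmd" </dev/null 2>&1 || echo "(failed or not installed: $cmd)"
    done
}
```

Calling `run_checklist > /tmp/perf60s.txt` (the output path is arbitrary) takes roughly 30 seconds with these counts, since six of the commands sample for five one-second intervals.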

slide 33:

USE Method
For every resource, check:
1. Utilization
2. Saturation
3. Errors
[Figure: a resource annotated with utilization (%), saturation, and errors]
For example, CPUs:
Utilization: time busy
Saturation: run queue length or latency
Errors: ECC errors, etc.
Start with the questions, then find the tools
Can be applied to hardware and software (cgroups)
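For the CPU example, two of the three USE questions can be answered from numbers that vmstat already prints. A minimal sketch, assuming utilization is taken as 100 minus the idle percentage and saturation is flagged when the runnable-task count ("r") exceeds the CPU count; `cpu_use` is a hypothetical helper and the thresholds are a simplification.

```shell
# cpu_use R_QUEUE IDLE_PCT NCPU
#   R_QUEUE  runnable tasks (vmstat "r" column)
#   IDLE_PCT idle percentage (vmstat "id" column)
#   NCPU     number of CPUs
cpu_use() {
    awk -v r="$1" -v idle="$2" -v n="$3" 'BEGIN {
        printf "utilization=%d%% saturation=%s\n",
               100 - idle, (r + 0 > n + 0 ? "yes" : "no")
    }'
}

cpu_use 8 0 4     # prints utilization=100% saturation=yes
cpu_use 2 48 8    # prints utilization=52% saturation=no
```

Errors are left out because they come from other sources, such as the kernel log.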

slide 34:

Workload Characterization
Analyze workload characteristics, not resulting performance
For example, CPUs:
1. Who: which PIDs, programs, users
2. Why: code paths, context
3. What: CPU instructions, cycles
4. How: changing over time
[Figure: a workload applied to a target]

slide 35:

3. Benchmarking

slide 36:

~100% of benchmarks are wrong
The energy needed to refute benchmarks
is orders of magnitude bigger than
to run them (so, no one does)

slide 37:

Benchmarking
• An experimental analysis activity
– Try observational analysis first; benchmarks can perturb
• Benchmarking is error prone:
– Testing the wrong target
• eg, FS cache I/O instead of disk I/O
– Choosing the wrong target
• eg, disk I/O instead of FS cache I/O
– Invalid results
• eg, bugs
– Misleading results:
• you benchmark A, but actually measure B, and conclude you measured C
caution: benchmarking

slide 38:

Benchmark Examples
• Micro benchmarks:
– File system maximum cached read operations/sec
– Network maximum throughput
• Macro (application) benchmarks:
– Simulated application max request rate
• Bad benchmarks:
– getpid() in a tight loop
– Context switch timing
kitchen sink benchmarks

slide 39:

If your product’s chances of
winning a benchmark are
50/50, you’ll usually lose
Benchmark paradox
caution: despair
http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html

slide 40:

Solution: Active Benchmarking
• Root cause analysis while the benchmark runs
– Use the earlier observability tools
– Identify the limiter (or suspect) and include it with the results
• For any given benchmark, ask: why not 10x?
• This takes time, but uncovers most mistakes

slide 41:

4. Profiling

slide 42:

Profiling
Can you do this?
“As an experiment to investigate the performance of the resulting TCP/IP
implementation ... the 11/750 is CPU saturated, but the 11/780 has about
30% idle time. The time spent in the system processing the data is spread
out among handling for the Ethernet (20%), IP packet processing (10%),
TCP processing (30%), checksumming (25%), and user system call
handling (15%), with no single part of the handling dominating the time in
the system.”
– Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1

slide 43:

perf: CPU profiling
• Sampling full stack traces at 99 Hertz, for 30 secs:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
1.40% java [kernel.kallsyms] [k] _raw_spin_lock
--- _raw_spin_lock
|--63.21%-- try_to_wake_up
|--63.91%-- default_wake_function
|--56.11%-- __wake_up_common
__wake_up_locked
ep_poll_callback
__wake_up_common
__wake_up_sync_key
|--59.19%-- sock_def_readable
[…78,000 lines truncated…]
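The sampling rate sets an upper bound on how many samples a system-wide profile can collect: rate × duration × CPU count. A minimal sketch of that arithmetic; `expected_samples` is a hypothetical helper, and the 48-CPU figure is borrowed from serverA earlier in the deck rather than from this slide.

```shell
# expected_samples HZ SECONDS NCPU: maximum sample count for a system-wide profile.
expected_samples() {
    echo $(( $1 * $2 * $3 ))
}

expected_samples 99 30 48    # prints 142560
```

The real count is usually lower, since CPUs that are idle contribute few or no samples. The odd rate of 99 Hertz rather than 100 is a common choice to avoid sampling in lockstep with periodic activity.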

slide 44:

Full "perf report" Output

slide 45:

… as a Flame Graph

slide 46:

Flame Graphs
• Visualizes a collection of stack traces
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random (default), or a dimension
• Perl + SVG + JavaScript
– https://github.com/brendangregg/FlameGraph
– Takes input from many different profilers
– Multiple d3 versions are being developed
• References:
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– "The Flame Graph" CACM, June 2016

slide 47:

Linux CPU Flame Graphs
Linux 2.6+, via perf:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script --header > out.perf01
./stackcollapse-perf.pl < out.perf01 | ./flamegraph.pl > perf.svg
– The out.perf01 files can be read using FlameScope
Linux 4.9+, via BPF:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
git clone --depth 1 https://github.com/iovisor/bcc
./bcc/tools/profile.py -df -F 99 30 | ./FlameGraph/flamegraph.pl > perf.svg
– Most efficient: no perf.data file, summarizes in-kernel
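Between the profiler and flamegraph.pl sits the "folded" format: one line per unique stack, frames joined by semicolons, followed by a count. The sketch below shows only the merge-and-count step, assuming the input already has one semicolon-joined stack per line. The real stackcollapse-perf.pl also parses `perf script` output, which this does not attempt, and `fold_stacks` is a hypothetical name.

```shell
# fold_stacks: read one stack per line on stdin, print "stack count" lines.
fold_stacks() {
    sort | uniq -c | awk '{
        n = $1                       # count from uniq -c
        sub(/^ *[0-9]+ /, "")        # strip the count, keep the stack text
        print $0, n
    }'
}

printf '%s\n' 'main;read;vfs_read' 'main;write' 'main;read;vfs_read' | fold_stacks
# prints:
# main;read;vfs_read 2
# main;write 1
```

Sorting first is what lets identical stacks merge, which mirrors the alphabetical x-axis sort in the rendered flame graph.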

slide 48:

FlameScope
Analyze variance, perturbations
https://github.com/Netflix/flamescope
[Figure: subsecond-offset heat map and flame graph]

slide 49:

perf: Counters
• Performance Monitoring Counters (PMCs):
$ perf list | grep -i hardware
cpu-cycles OR cycles                               [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
instructions                                       [Hardware event]
[…]
L1-dcache-loads                                    [Hardware cache event]
L1-dcache-load-misses                              [Hardware cache event]
[…]
rNNN (see 'perf list --help' on how to encode it)  [Raw hardware event …
mem:<addr>[:access]                                [Hardware breakpoint]
• Measure instructions-per-cycle (IPC) and CPU stall types
• PMCs only enabled for some cloud instance types
My front-ends, incl. pmcarch:
https://github.com/brendangregg/pmc-cloud-tools

slide 50:

5. Tracing

slide 51:

Linux Tracing Events

slide 52:

Tracing Stack
add-on tools: trace-cmd, perf-tools, bcc, bpftrace
front-end tools: perf
tracing frameworks: Ftrace, perf_events, BPF
back-end instrumentation: tracepoints, kprobes, uprobes
BPF enables a new class of custom, efficient, and production safe performance analysis tools

slide 53:

Ftrace: perf-tools funccount
• Built-in kernel tracing capabilities, added by Steven Rostedt and others since Linux 2.6.27
# ./funccount -i 1 'bio_*'
Tracing "bio_*"... Ctrl-C to end.
FUNC                              COUNT
[...]
bio_alloc_bioset
bio_endio
bio_free
bio_fs_destructor
bio_init
bio_integrity_enabled
bio_put
bio_add_page
• Also see trace-cmd

slide 54:

perf: Tracing Tracepoints
perf was introduced earlier; it is also a powerful tracer
In-kernel counts (efficient):
# perf stat -e block:block_rq_complete -a sleep 10
Performance counter stats for 'system wide':
block:block_rq_complete
Dump & post-process:
# perf record -e block:block_rq_complete -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
# perf script
run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
swapper 0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
swapper 0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
[...]
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Main_Page

slide 55:

BCC/BPF: ext4slower
• ext4 operations slower than the threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM   PID  T BYTES  OFF_KB  LAT(ms) FILENAME
06:49:17 bash        R 128               7.75 cksum
06:49:17 cksum       R 39552             1.34 [
06:49:17 cksum       R 96                5.36 2to3-2.7
06:49:17 cksum       R 96               14.94 2to3-3.4
06:49:17 cksum       R 10320             6.82 411toppm
06:49:17 cksum       R 65536             4.01 a2p
06:49:17 cksum       R 55400             8.77 ab
06:49:17 cksum       R 36792            16.34 aclocal-1.14
[…]
• Better indicator of application pain than disk I/O
• Measures & filters in-kernel for efficiency using BPF
https://github.com/iovisor/bcc

slide 56:

bpftrace: one-liners
• Block I/O (disk) events by type; by size & comm:
# bpftrace -e 't:block:block_rq_issue { @[args->rwbs] = count(); }'
Attaching 1 probe...
@[WS]: 2
@[RM]: 12
@[RA]: 1609
@[R]: 86421
# bpftrace -e 't:block:block_rq_issue { @bytes[comm] = hist(args->bytes); }'
Attaching 1 probe...
@bytes[dmcrypt_write]:
[4K, 8K)      68 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)     35 |@@@@@@@@@@@@@@@@@@@@@@@@@@
[16K, 32K)     4 |@@@
[32K, 64K)     1 |
[64K, 128K)    2 |@
[...]
https://github.com/iovisor/bpftrace
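The histogram rows are power-of-two buckets: each value is counted in the range from the largest power of two not exceeding it up to, but not including, the next one. A minimal sketch of that bucketing for a single value, assuming an input of at least 1 and printing raw byte counts rather than the K-suffixed labels; `hist_bucket` is a hypothetical name, not a bpftrace function.

```shell
# hist_bucket N: print the power-of-two bucket "[lo, hi)" that contains N (N >= 1).
hist_bucket() {
    awk -v n="$1" 'BEGIN {
        lo = 1
        while (lo * 2 <= n) lo *= 2   # largest power of two <= n
        printf "[%d, %d)\n", lo, lo * 2
    }'
}

hist_bucket 4096    # prints [4096, 8192)
hist_bucket 5000    # prints [4096, 8192)
hist_bucket 8192    # prints [8192, 16384)
```

In the listing, a 4 KiB write and a 5000-byte write would therefore both land in the [4K, 8K) row.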

slide 57:

BPF Perf
Tools
(2019)
BCC & bpftrace repos
contain many of these.
The book has them all.

slide 58:

Off-CPU Analysis
• Explain all blocking events. High-overhead: needs BPF.
[Figure: off-CPU flame graph with labeled towers: directory read from disk, file read from disk, fstat from disk, path read from disk, pipe write]

slide 59:

6. Tuning

slide 60:

Ubuntu Bionic Tuning: Late 2019 (1/2)
CPU
schedtool -B PID
disable Ubuntu apport (crash reporter)
upgrade to Bionic (scheduling improvements)
Virtual Memory
vm.swappiness = 0 # from 60
Memory
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
kernel.numa_balancing = 0
File System
vm.dirty_ratio = 80 # from 40
vm.dirty_background_ratio = 5 # from 10
vm.dirty_expire_centisecs = 12000 # from 3000
mount -o defaults,noatime,discard,nobarrier …
Storage I/O
/sys/block/*/queue/rq_affinity # or 2
/sys/block/*/queue/scheduler kyber
/sys/block/*/queue/nr_requests
/sys/block/*/queue/read_ahead_kb 128
mdadm --chunk=64 …

slide 61:

Ubuntu Bionic Tuning: Late 2019 (2/2)
Networking
net.core.default_qdisc = fq
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 16777216
net.core.somaxconn = 1024
net.core.wmem_max = 16777216
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_abort_on_overflow = 1 # maybe
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_rmem = 4096 12582912 16777216 # or 8388608 ...
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_wmem = 4096 12582912 16777216 # or 8388608 ...
Hypervisor
echo tsc > /sys/devices/…/current_clocksource
Plus use AWS Nitro
Other
net.core.bpf_jit_enable = 1
sysctl -w kernel.perf_event_max_stack=1000
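The `key = value` settings on these two slides can be turned into `sysctl -w` commands for review before anything is applied. This is a minimal sketch of a dry-run converter: `sysctl_cmds` is a hypothetical helper that strips trailing `#` comments, skips lines that are not plain sysctl keys, and only prints the commands. These values were tuned for one environment and are not general recommendations.

```shell
# sysctl_cmds: read "key = value  # comment" lines on stdin and print the
# matching "sysctl -w" commands. Nothing is executed.
sysctl_cmds() {
    awk -F'=' '
        { sub(/#.*/, "") }                        # drop trailing comments
        NF >= 2 {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)      # trim whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key ~ /^[a-z0-9_.]+$/ && val != "")
                printf "sysctl -w %s=\"%s\"\n", key, val
        }'
}

printf '%s\n' \
    'Networking' \
    'vm.swappiness = 0 # from 60' \
    'net.ipv4.tcp_rmem = 4096 12582912 16777216' | sysctl_cmds
# prints:
# sysctl -w vm.swappiness="0"
# sysctl -w net.ipv4.tcp_rmem="4096 12582912 16777216"
```

Printing rather than executing keeps the step reviewable, and section headings or shell commands mixed into the input are ignored because they do not look like a sysctl key.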

slide 62:

Takeaways
Systems Performance is:
Observability, Methodologies, Benchmarking, Profiling, Tracing, Tuning
Print out for your office wall:
uptime
dmesg -T | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

slide 63:

Links
Netflix Tech Blog on Linux:
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
Linux Performance:
http://www.brendangregg.com/linuxperf.html
Linux perf:
https://perf.wiki.kernel.org/index.php/Main_Page
http://www.brendangregg.com/perf.html
Linux ftrace:
https://www.kernel.org/doc/Documentation/trace/ftrace.txt
https://github.com/brendangregg/perf-tools
Linux BPF:
http://www.brendangregg.com/ebpf.html
http://www.brendangregg.com/bpf-performance-tools-book.html
https://github.com/iovisor/bcc
https://github.com/iovisor/bpftrace
Methodologies:
http://www.brendangregg.com/USEmethod/use-linux.html
http://www.brendangregg.com/activebenchmarking.html
Flame Graphs & FlameScope:
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
http://queue.acm.org/detail.cfm?id=2927301
https://github.com/Netflix/flamescope
MSRs and PMCs
https://github.com/brendangregg/msr-cloud-tools
https://github.com/brendangregg/pmc-cloud-tools
BPF Performance Tools

slide 64:

Thanks
Questions?
http://slideshare.net/brendangregg
http://www.brendangregg.com
bgregg@netflix.com
@brendangregg
Look out for 2nd Ed.