Kernel Recipes 2017: Using Linux perf at Netflix

Video: https://www.youtube.com/watch?v=UVM3WX8Lq2k

Talk for Kernel Recipes 2017 by Brendan Gregg.

Description: "Linux perf is a crucial performance analysis tool at Netflix, and is used by a self-service GUI for generating CPU flame graphs and other reports. This sounds like an easy task, however, getting perf to work properly in VM guests running Java, Node.js, containers, and other software, has been at times a challenge. This talk summarizes Linux perf, how we use it at Netflix, the various gotchas we have encountered, and a summary of advanced features."


PDF: KernelRecipes_Perf_Events.pdf

Keywords (from pdftotext):

slide 1:
    Using Linux perf
    at Netflix
    Brendan Gregg
    Senior Performance Architect
    Sep 2017
    
slide 2:
    Case Study: ZFS is eating my CPU
    • Easy to debug using Netflix Vector & flame graphs
    • How I expected it to look:
    
slide 3:
    Case Study: ZFS is eating my CPU (cont.)
    • How it really looked:
    Application (truncated)
    38% kernel time (why?)
    
slide 4:
    Case Study: ZFS is eating my CPU (cont.)
    Zoomed:
    • ZFS ARC (adaptive replacement cache) reclaim.
    • But… ZFS is not in use. No pools, datasets, or ARC buffers.
    • CPU time is in random entropy, picking which (empty) list to evict.
    Bug: https://github.com/zfsonlinux/zfs/issues/6531
    
slide 5:
slide 6:
    Agenda
    1. Why Netflix Needs Linux Profiling
    2. perf Basics
    3. CPU Profiling & Gotchas
    Stacks (gcc, Java)
    Symbols (Node.js, Java)
    Guest PMCs
    PEBS
    Overheads
    4. perf Advanced
    
slide 7:
    1. Why Netflix Needs Linux Profiling
    
slide 8:
    Understand CPU usage quickly and completely
    
slide 9:
    Quickly
    Eg, Netflix Vector (self-service UI):
    Flame Graphs
    Heat Maps
    
slide 10:
    Completely
    CPU Flame Graph
    Kernel
    (C)
    User
    (C)
    JVM
    (C++)
    Java
    
slide 11:
    Why Linux perf?
    • Available
    – Linux, open source
    • Low overhead
    – Tunable sampling, ring buffers
    • Accurate
    – Application-based samplers don't know what's really RUNNING; eg, Java and epoll
    • No blind spots
    – See user, library, kernel with CPU sampling
    – With some work: hardirqs & SMI as well
    • No sample skew
    – Unlike Java safety point skew
    
slide 12:
    Why is this so important
    • We typically scale microservices based on %CPU
    – Small %CPU improvements can mean big $avings
    • CPU profiling is used by many activities
    Explaining regressions in new software versions
    Incident response
    3rd party software evaluations
    Identify performance tuning targets
    Part of CPU workload characterization
    • perf does lots more, but we spend ~95% of our time looking
    at CPU profiles, and 5% on everything else
    – With new BPF capabilities (off-CPU analysis), that might go from 95 to 90%
    
slide 13:
    CPU profiling should be easy, but…
    JIT runtimes
    no frame pointers
    no debuginfo
    stale symbol maps
    container namespaces
    
slide 14:
    2. perf Basics
    
slide 15:
    perf (aka "perf_events")
    • The official Linux profiler
    – In the linux-tools-common package
    – Source code & docs in Linux: tools/perf
    • Supports many profiling/tracing features:
    CPU Performance Monitoring Counters (PMCs)
    Statically defined tracepoints
    User and kernel dynamic tracing
    Kernel line and local variable tracing
    Efficient in-kernel counts and filters
    Stack tracing, libunwind
    Code annotation
    • Some bugs in the past; has been stable for us
    perf_events
    ponycorn
    
slide 16:
    A Multitool of Subcommands
    # perf
    usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
    The most commonly used perf commands are:
    annotate
    Read perf.data (created by perf record) and display annotated code
    archive
    Create archive with object files with build-ids found in perf.data file
    bench
    General framework for benchmark suites
    buildid-cache
    Manage build-id cache.
    buildid-list
    List the buildids in a perf.data file
    c2c
    Shared Data C2C/HITM Analyzer.
    config
    Get and set variables in a configuration file.
    data
    Data file related processing
    diff
    Read perf.data files and display the differential profile
    evlist
    List the event names in a perf.data file
    ftrace
    simple wrapper for kernel's ftrace functionality
    inject
    Filter to augment the events stream with additional information
    kallsyms
    Searches running kernel for symbols
    kmem
    Tool to trace/measure kernel memory properties
    kvm
    Tool to trace/measure kvm guest os
    list
    List all symbolic event types
    lock
    Analyze lock events
    mem
    Profile memory accesses
    record
    Run a command and record its profile into perf.data
    report
    Read perf.data (created by perf record) and display the profile
    sched
    Tool to trace/measure scheduler properties (latencies)
    script
    Read perf.data (created by perf record) and display trace output
    stat
    Run a command and gather performance counter statistics
    test
    Runs sanity tests.
    timechart
    Tool to visualize total system behavior during a workload
    top
    System profiling tool.
    probe
    Define new dynamic tracepoints
    trace
    strace inspired tool
    See 'perf help COMMAND' for more information on a specific command.
    from Linux 4.13
    
slide 17:
    perf Basic Workflow
    list   -> find events
    stat   -> count them
    record -> write event data to file
    report -> browse summary
    script -> event dump for post processing
    
slide 18:
    Basic Workflow Example
    # perf list sched:*
    […]
    sched:sched_process_exec
    [Tracepoint event]
    […]
    # perf stat -e sched:sched_process_exec -a -- sleep 10
    Performance counter stats for 'system wide':
    sched:sched_process_exec
    1. found an event of interest
    2. 19 per 10 sec is a very low
    rate, so safe to record
    10.001327817 seconds time elapsed
    # perf record -e sched:sched_process_exec -a -g -- sleep 10
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.212 MB perf.data (21 samples) ]
    # perf report -n --stdio
    # Children   Self     Samples   Trace output
    # ........ ........ ............ .................................................
    4.76%   4.76%   1   filename=/bin/bash pid=7732 old_pid=7732
    ---_start
    return_from_SYSCALL_64
    do_syscall_64
    sys_execve
    do_execveat_common.isra.35
    […]
    # perf script
    sleep 7729 [003] 632804.699184: sched:sched_process_exec: filename=/bin/sleep pid=7729 old_pid=7729
    44b97e do_execveat_common.isra.35 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    44bc01 sys_execve (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    203acb do_syscall_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    acd02b return_from_SYSCALL_64 (/lib/modules/4.13.0-rc1-virtual/build/vmlinux)
    c30 _start (/lib/x86_64-linux-gnu/ld-2.23.so)
    […]
    3. 21 samples captured
    4. summary style may be
    sufficient, or,
    5. script output in time order
    
slide 19:
    perf stat/record Format
    • These have three main parts: action, event, scope.
    • e.g., profiling on-CPU stack traces:
    Action: record stack traces
    perf record -F 99 -a -g -- sleep 10
    Scope: all CPUs
    Event: 99 Hertz
    Note: sleep 10 is a dummy command to set the duration
    
slide 20:
    perf Actions
    • Count events (perf stat …)
    – Uses an efficient in-kernel counter, and prints the results
    • Sample events (perf record …)
    – Records details of every event to a dump file (perf.data)
    • Timestamp, CPU, PID, instruction pointer, …
    – This incurs higher overhead, relative to the rate of events
    – Include the call graph (stack trace) using -g
    • Other ac/ons include:
    List events (perf list)
    Report from a perf.data file (perf report)
    Dump a perf.data file as text (perf script)
    top style profiling (perf top)
    
slide 21:
    perf Events
    • Custom timers
    – e.g., 99 Hertz (samples per second)
    • Hardware events
    – CPU Performance Monitoring Counters (PMCs)
    • Tracepoints
    – Statically defined in software
    • Dynamic tracing
    – Created using uprobes (user) or kprobes (kernel)
    – Can do kernel line tracing with local variables (needs kernel debuginfo)
    
slide 22:
    perf Events: Map
    
slide 23:
    perf Events: List
    # perf list
    List of pre-defined events (to be used in -e):
    cpu-cycles OR cycles                               [Hardware event]
    instructions                                       [Hardware event]
    cache-references                                   [Hardware event]
    cache-misses                                       [Hardware event]
    branch-instructions OR branches                    [Hardware event]
    branch-misses                                      [Hardware event]
    bus-cycles                                         [Hardware event]
    stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
    stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
    […]
    cpu-clock                                          [Software event]
    task-clock                                         [Software event]
    page-faults OR faults                              [Software event]
    context-switches OR cs                             [Software event]
    cpu-migrations OR migrations                       [Software event]
    […]
    L1-dcache-loads                                    [Hardware cache event]
    L1-dcache-load-misses                              [Hardware cache event]
    L1-dcache-stores                                   [Hardware cache event]
    […]
    skb:kfree_skb                                      [Tracepoint event]
    skb:consume_skb                                    [Tracepoint event]
    skb:skb_copy_datagram_iovec                        [Tracepoint event]
    net:net_dev_xmit                                   [Tracepoint event]
    net:net_dev_queue                                  [Tracepoint event]
    net:netif_receive_skb                              [Tracepoint event]
    net:netif_rx                                       [Tracepoint event]
    […]
    
slide 24:
    perf Scope
    System-wide: all CPUs (-a)
    Target PID (-p PID)
    Target command (…)
    Specific CPUs (-c …)
    User-level only (<event>:u)
    Kernel-level only (<event>:k)
    A custom filter to match variables (--filter …)
    This cgroup (container) only (--cgroup …)
    
slide 25:
    One-Liners: Listing Events
    # Listing all currently known events:
    perf list
    # Searching for "sched" tracepoints:
    perf list | grep sched
    # Listing sched tracepoints:
    perf list 'sched:*'
    Dozens of perf one-liners:
    http://www.brendangregg.com/perf.html#OneLiners
    
slide 26:
    One-Liners: Counting Events
    # CPU counter statistics for the specified command:
    perf stat command
    # CPU counter statistics for the entire system, for 5 seconds:
    perf stat -a sleep 5
    # Detailed CPU counter statistics for the specified PID, until Ctrl-C:
    perf stat -dp PID
    # Various CPU last level cache statistics for the specified command:
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
    # Count system calls for the specified PID, until Ctrl-C:
    perf stat -e 'syscalls:sys_enter_*' -p PID
    # Count block device I/O events for the entire system, for 10 seconds:
    perf stat -e 'block:*' -a sleep 10
    # Show system calls by process, refreshing every 2 seconds:
    perf top -e raw_syscalls:sys_enter -ns comm
    
slide 27:
    One-Liners: Profiling Events
    # Sample on-CPU functions for the specified command, at 99 Hertz:
    perf record -F 99 command
    # Sample CPU stack traces for the specified PID, at 99 Hertz, for 10 seconds:
    perf record -F 99 -p PID -g -- sleep 10
    # Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds:
    perf record -F 99 -ag -- sleep 10
    # Sample CPU stacks, once every 10,000 Level 1 data cache misses, for 5 secs:
    perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
    # Sample CPU stack traces, once every 100 last level cache misses, for 5 secs:
    perf record -e LLC-load-misses -c 100 -ag -- sleep 5
    # Sample on-CPU kernel instructions, for 5 seconds:
    perf record -e cycles:k -a -- sleep 5
    # Sample on-CPU user instructions, for 5 seconds:
    perf record -e cycles:u -a -- sleep 5
    
slide 28:
    One-Liners: Repor/ng
    # Show perf.data in an ncurses browser (TUI) if possible:
    perf report
    # Show perf.data with a column for sample count:
    perf report -n
    # Show perf.data as a text report, with data coalesced and percentages:
    perf report --stdio
    # List all raw events from perf.data:
    perf script
    # List all raw events from perf.data, with customized fields:
    perf script -f comm,tid,pid,time,cpu,event,ip,sym,dso
    # Dump raw contents from perf.data as hex (for debugging):
    perf script -D
    # Disassemble and annotate instructions with percentages (needs debuginfo):
    perf annotate --stdio
    
slide 29:
    3. CPU Profiling
    
slide 30:
    CPU Profiling
    • Record stacks at a timed interval: simple and effective
    – Pros: Low (deterministic) overhead
    – Cons: Coarse accuracy, but usually sufficient
    stack
    samples:
    syscall
    on-CPU
    time
    off-CPU
    block
    interrupt
    
slide 31:
    perf Record
    # perf record -F 99 -ag -- sleep 30
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
    # perf report -n --stdio
    1.40%
    java [kernel.kallsyms]
    [k] _raw_spin_lock
    --- _raw_spin_lock
    |--63.21%-- try_to_wake_up
    |--63.91%-- default_wake_function
    |--56.11%-- __wake_up_common
    __wake_up_locked
    ep_poll_callback
    __wake_up_common
    __wake_up_sync_key
    |--59.19%-- sock_def_readable
    […78,000 lines truncated…]
    Sampling full
    stack traces
    at 99 Hertz
    
slide 32:
    perf Reporting
    • perf report summarizes by combining common paths
    • Previous output truncated 78,000 lines of summary
    • The following is what a mere 8,000 lines looks like…
    
slide 33:
    perf report
    
slide 34:
    … as a Flame Graph
    
slide 35:
    Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • Flame Graphs:
    – x-axis: alphabetical stack sort, to maximize merging
    – y-axis: stack depth
    – color: random, or hue can be a dimension
    • e.g., software type, or difference between two profiles for
    non-regression testing ("differential flame graphs")
    – interpretation: top edge is on-CPU, beneath it is ancestry
    • Just a Perl program to convert perf stacks into SVG
    – Includes JavaScript: open in a browser for interactivity
    • Easy to get working: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
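The intermediate "folded" format that stackcollapse-perf.pl emits and flamegraph.pl reads is worth knowing when debugging this pipeline. A minimal sketch (the stacks and counts below are made up for illustration):

```shell
# Hedged sketch of the folded format: one line per unique stack, frames
# joined by ";" (root first), then a space and the sample count.
cat > folded.txt <<'EOF'
java;start_thread;JavaMain;Foo.work 45
java;start_thread;JavaMain;Foo.idle 5
swapper;cpu_idle 50
EOF
# The counts should sum to the profile's total sample count:
awk '{ sum += $NF } END { print sum }' folded.txt
```

Because the format is plain text, folded files can be filtered (e.g., grep for one process name) before rendering, and flamegraph.pl will happily draw the subset.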
    
slide 36:
    flamegraph.pl Options
    $ flamegraph.pl --help
    USAGE: flamegraph.pl [options] infile > outfile.svg
    --title TEXT
    # change title text
    --subtitle TEXT # second level title (optional)
    --width NUM
    # width of image (default 1200)
    --height NUM
    # height of each frame (default 16)
    --minwidth NUM
    # omit smaller functions (default 0.1 pixels)
    --fonttype FONT # font type (default "Verdana")
    --fontsize NUM
    # font size (default 12)
    --countname TEXT # count type label (default "samples")
    --nametype TEXT # name type label (default "Function:")
    --colors PALETTE # set color palette. choices are: hot (default), mem,
    # io, wakeup, chain, java, js, perl, red, green, blue,
    # aqua, yellow, purple, orange
    --hash
    # colors are keyed by function name hash
    --cp
    # use consistent palette (palette.map)
    --reverse
    # generate stack-reversed flame graph
    --inverted
    # icicle graph
    --negate
    # switch differential hues (blue<->red)
    --notes TEXT
    # add notes comment in SVG (for debugging)
    --help
    # this message
    eg,
    flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg
    
slide 37:
    perf Flame Graph Workflow (Linux 2.6+)
    list events
    count events
    capture stacks
    perf list
    perf stat
    perf record
    Typical
    Workflow
    perf.data
    text UI
    dump profile
    perf report
    flame graph
    visualization
    perf script
    stackcollapse-perf.pl
    flamegraph.pl
    
slide 38:
    perf Flame Graph Workflow (Linux 4.5+)
    list events
    count events
    capture stacks
    perf list
    perf stat
    perf record
    Typical
    Workflow
    perf.data
    text UI
    dump
    perf report
    perf script
    flame graph
    visualization
    summary
    perf report
    -g folded
    awk
    flamegraph.pl
    
slide 39:
    Flame Graph Optimizations
    Linux 2.6
    Linux 4.5
    Linux 4.9
    capture stacks
    capture stacks
    count stacks (BPF)
    perf record
    perf record
    profile.py
    write samples
    perf.data
    reads samples
    perf script
    write text
    stackcollapse-perf.pl
    folded output
    flamegraph.pl
    write samples
    not perf
    perf.data
    reads samples
    perf report
    -g folded
    folded
    output
    folded report
    awk
    folded output
    flamegraph.pl
    flamegraph.pl
    
slide 40:
    Gotchas
    
slide 41:
    When we've tried to use perf
    Stacks don't work (missing)
    Symbols don't work (hex numbers)
    Instruction profiling looks bogus
    PMCs don't work in VM guests
    Containers break things
    Overhead is too high
    
slide 42:
    How to really get started
    1. Get "perf" to work
    2. Get stack walking to work
    3. Fix symbol translation
    4. Get IPC to work
    5. Test perf under load
    Install linux-tools-common and
    linux-tools-`uname -r` packages;
    Or compile in the Linux source:
    tools/perf
    The "gotchas"…
    
slide 43:
    Gotcha #1 Broken Stacks
    perf record -F 99 -a -g -- sleep 30
    perf report -n --stdio
    1. Take a CPU profile
    2. Run perf report
    3. If stacks are often
slide 44:
    Identifying Broken Stacks
    28.10%
    sed
    libc-2.19.so
    --- re_search_internal
    |--12.25%-- 0x3
    0x100007
    |--96.78%-- re_search_stub
    rpl_re_search
    match_regex
    do_subst
    execute_program
    process_files
    main
    __libc_start_main
    --3.22%-- rpl_re_search
    match_regex
    do_subst
    execute_program
    process_files
    main
    __libc_start_main
    [.] re_search_internal
    broken
    not broken
    
slide 45:
    Identifying Broken Stacks
    78.50%
    sed
    libc-2.19.so
    |--3.65%-- 0x7f2516d5d10d
    |--2.19%-- 0x7f2516d0332f
    |--1.22%-- 0x7f2516cffbd2
    |--1.22%-- 0x7f2516d5d5ad
    |--11.65%-- 0x40a447
    0x40659a
    0x408dd8
    0x408ed1
    0x402689
    0x7fa1cd08aec5
    |--1.33%-- 0x40a4a1
    |--60.01%-- 0x40659a
    0x408dd8
    0x408ed1
    0x402689
    0x7fa1cd08aec5
    [.] 0x00000000000dd7d4
    broken
    probably not broken
    missing symbols, but
    that's another problem
    
slide 46:
    Broken Stacks Flame Graph
    Broken Java stacks
    (missing frame pointer)
    Java == green
    system == red
    C++ == yellow
    
slide 47:
    Fixing Broken Stacks
    • Either:
    • Fix frame pointer-based stack walking (the default)
    Pros: simple, supports any system stack walker
    Cons: might cost a little extra CPU to make available
    • Use libunwind and DWARF: perf record -g dwarf
    Pros: more debug info
    Cons: not on older kernels, and inflates instance size
    … there's also ORC on the latest kernel
    • Application support
    https://github.com/jvm-profiling-tools/async-profiler
    • Our current preference is (A), but (C) is also promising
    So how do we fix the frame pointer…
    
slide 48:
    gcc -fno-omit-frame-pointer
    • Once upon a time, x86 had fewer registers, and the frame
    pointer register was reused for general purpose to improve
    performance. This breaks system stack walking.
    • gcc provides -fno-omit-frame-pointer to fix this
    – Please make this the default in gcc!
    
slide 49:
    Java -XX:+PreserveFramePointer
    • I hacked frame pointers in the JVM (JDK-8068945) and Oracle rewrote
    it as -XX:+PreserveFramePointer. Lets perf do FP stack walks of Java.
    --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
    +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
    @@ -5236,6 +5236,7 @@
    // We always push rbp, so that on return to interpreter rbp, will be
    // restored correctly and we can correct the stack.
    push(rbp);
    mov(rbp, rsp);
    // Remove word for ebp
    framesize -= wordSize;
    Involved changes like this:
    fixing x86-64 function
    prologues
    --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
    +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
    [...]
    • Costs some overhead to use. Usually 
slide 50:
    Broken Java Stacks
    # perf script
    […]
    java 4579 cpu-clock:
    ffffffff8172adff tracesys ([kernel.kallsyms])
    7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
    java
    4579 cpu-clock:
    7f417908c10b [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101c97 [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f41792fc65f [unknown] (/tmp/perf-4458.map)
    a2d53351ff7da603 [unknown] ([unknown])
    java
    4579 cpu-clock:
    7f4179349aec [unknown] (/tmp/perf-4458.map)
    java
    4579 cpu-clock:
    7f4179101d0f [unknown] (/tmp/perf-4458.map)
    […]
    Check with "perf script" to see
    stack samples
    These are 1 or 2 levels deep (junk
    values)
    
slide 51:
    Fixed Java Stacks
    # perf script
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    7fd30184def8 [unknown] (/tmp/perf-8131.map)
    7fd30174f544 [unknown] (/tmp/perf-8131.map)
    7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
    7fd30166d51c [unknown] (/tmp/perf-8131.map)
    7fd301750f34 [unknown] (/tmp/perf-8131.map)
    7fd3016c2280 [unknown] (/tmp/perf-8131.map)
    7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
    7fd3016f9888 [unknown] (/tmp/perf-8131.map)
    7fd3016ece04 [unknown] (/tmp/perf-8131.map)
    7fd30177783c [unknown] (/tmp/perf-8131.map)
    7fd301600aa8 [unknown] (/tmp/perf-8131.map)
    7fd301a4484c [unknown] (/tmp/perf-8131.map)
    7fd3010072e0 [unknown] (/tmp/perf-8131.map)
    7fd301007325 [unknown] (/tmp/perf-8131.map)
    7fd301007325 [unknown] (/tmp/perf-8131.map)
    7fd3010004e7 [unknown] (/tmp/perf-8131.map)
    7fd3171df76a JavaCalls::call_helper(JavaValue*,…
    7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
    7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
    7fd31721b6ce thread_entry(JavaThread*, Thread*)…
    7fd3175389e0 JavaThread::thread_main_inner() (/…
    7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
    7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
    7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
    With -XX:+PreserveFramePointer
    stacks are full, and go all the way to
    start_thread()
    This is what the CPUs are really
    running: inlined frames are not
    present
    
slide 52:
    Fixed Stacks Flame Graph
    Java
    (no symbols)
    
slide 53:
    Gotcha #2 Missing Symbols
    • Missing symbols should be obvious in perf report/script:
    71.79%
    12.06%
    sed
    sed
    |--11.65%-- 0x40a447
    0x40659a
    0x408dd8
    0x408ed1
    0x402689
    0x7fa1cd08aec5
    sed
    sed
    --- re_search_internal
    |--96.78%-- re_search_stub
    rpl_re_search
    match_regex
    do_subst
    execute_program
    process_files
    main
    __libc_start_main
    [.] 0x000000000001afc1
    broken
    [.] re_search_internal
    not broken
    
slide 54:
    Fixing Symbols
    • For installed packages:
    A. Add a -dbgsym package, if available
    B. Recompile from source
    • For JIT (Java, Node.js, …):
    A. Create a /tmp/perf-PID.map file. perf already looks for this
    Map format is "START SIZE symbolname"
    B. Or use a symbol logger. Eg tools/perf/jvmti/.
    # perf script
    Failed to open /tmp/perf-8131.map, continuing without symbols
    […]
    java 8131 cpu-clock:
    7fff76f2dce1 [unknown] ([vdso])
    7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
    7fd301861e46 [unknown] (/tmp/perf-8131.map)
    […]
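The map-file approach from (A) can be sketched concretely. The addresses and symbol names below are illustrative only; in real use the JIT runtime (or an attached agent) writes the file for its own PID:

```shell
# Hedged sketch of /tmp/perf-PID.map, the "START SIZE symbolname" format
# perf looks for: hex start address, hex size, then the symbol name
# (which may itself contain spaces).
pid=$$   # demo only; normally the JIT process's own PID
printf '%s\n' \
  '7f5e1c000000 40 Interpreter' \
  '7f5e1c000040 1a0 Lorg/example/Foo;::work' > "/tmp/perf-$pid.map"
cat "/tmp/perf-$pid.map"
```

perf resolves a sample's instruction pointer by finding the entry whose [START, START+SIZE) range contains it, so ranges should not overlap.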
    
slide 55:
    Java Symbols
    • perf-map-agent
    – Agent attaches and writes the map file on demand (previous versions attached on Java
    start, and wrote continually)
    – https://github.com/jvm-profiling-tools/perf-map-agent
    (was https://github.com/jrudolph/perf-map-agent)
    • Automation: jmaps
    – We use scripts to find Java processes and dump their map files, paying attention to file
    ownership etc
    – https://github.com/brendangregg/FlameGraph/blob/master/jmaps
    – Needs to run as close as possible to the profile, to minimize symbol churn
    # perf record -F 99 -a -g -- sleep 30; jmaps
    
slide 56:
    Java Flame Graph: Stacks & Symbols
    Kernel
    (C)
    flamegraph.pl --color=java
    Java
    User
    (C)
    JVM
    (C++)
    
slide 57:
    Java: Inlining
    A. Disabling inlining:
    -XX:-Inline
    Many more Java frames
    80% slower (in this case)
    May not be necessary: inlined flame
    graphs often make enough sense
    – Or tune -XX:MaxInlineSize and -XX:InlineSmallCode to reveal more
    frames, without costing much perf: can
    even go faster!
    B. Symbol agents can uninline
    – perf-map-agent unfoldall
    – We sometimes need and use this
    No inlining
    
slide 58:
    Node.js: Stacks & Symbols
    • Frame pointer stacks work
    • Symbols currently via a logger
    – --perf-basic-prof: everything. We found it can log over 1 Gbyte/day.
    – --perf-basic-prof-only-functions: tries to only log symbols we care about.
    • perf may not use the most recent symbol in the log
    – We tidy logs before using them:
    https://raw.githubusercontent.com/brendangregg/Misc/master/perf_events/perfmaptidy.pl
    • Future V8 versions may support on-demand symbol dumps
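The tidying step can be sketched as follows (entries are illustrative; the real script is the perfmaptidy.pl linked above): when the same code address is re-logged because a method was recompiled or moved, keep only the newest entry so perf resolves the current symbol.

```shell
# Build a toy symbol log where address 3f8a100 is logged twice (the method
# was recompiled from ~foo to optimized *foo); keep only the later entry.
printf '%s\n' \
  '3f8a100 40 LazyCompile:~foo script.js:1' \
  '3f8a200 60 LazyCompile:~bar script.js:9' \
  '3f8a100 40 LazyCompile:*foo script.js:1' > map.log
# Reverse the log, keep the first (i.e., most recent) entry per address,
# then reverse back to restore the original ordering:
tac map.log | awk '!seen[$1]++' | tac > map.tidy
cat map.tidy
```

This is a deduplicate-by-first-field idiom; the real tidy script also has to cope with overlapping address ranges, not just exact repeats.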
    
slide 59:
    Gotcha #3 Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
    Percent |
    Source code & Disassembly of noplooper
    -------------------------------------------------------:
    Disassembly of section .text:
    00000000004004ed <…>:
    0.00 :
    4004ed:
    push
    %rbp
    0.00 :
    4004ee:
    mov
    %rsp,%rbp
    20.86 :
    4004f1:
    nop
    0.00 :
    4004f2:
    nop
    0.00 :
    4004f3:
    nop
    0.00 :
    4004f4:
    nop
    19.84 :
    4004f5:
    nop
    0.00 :
    4004f6:
    nop
    0.00 :
    4004f7:
    nop
    0.00 :
    4004f8:
    nop
    18.73 :
    4004f9:
    nop
    0.00 :
    4004fa:
    nop
    0.00 :
    4004fb:
    nop
    0.00 :
    4004fc:
    nop
    19.08 :
    4004fd:
    nop
    0.00 :
    4004fe:
    nop
    0.00 :
    4004ff:
    nop
    0.00 :
    400500:
    nop
    21.49 :
    400501:
    jmp
    4004f1 <…>
    16 NOPs in a loop
    Let's profile instructions
    to see which are hot!
    (have I lost my mind?)
    
slide 60:
    Instruction Profiling
    • Even distribution (A)? Or something else?
    (A)
    (B)
    (C)
    (D)
    
slide 61:
    Instruction Profiling
    # perf annotate -i perf.data.noplooper --stdio
    Percent |
    Source code & Disassembly of noplooper
    -------------------------------------------------------:
    Disassembly of section .text:
    00000000004004ed <…>:
    0.00 :
    4004ed:
    push
    %rbp
    0.00 :
    4004ee:
    mov
    %rsp,%rbp
    20.86 :
    4004f1:
    nop
    0.00 :
    4004f2:
    nop
    0.00 :
    4004f3:
    nop
    0.00 :
    4004f4:
    nop
    19.84 :
    4004f5:
    nop
    0.00 :
    4004f6:
    nop
    0.00 :
    4004f7:
    nop
    0.00 :
    4004f8:
    nop
    18.73 :
    4004f9:
    nop
    0.00 :
    4004fa:
    nop
    0.00 :
    4004fb:
    nop
    0.00 :
    4004fc:
    nop
    19.08 :
    4004fd:
    nop
    0.00 :
    4004fe:
    nop
    Go home instruction pointer, you're drunk
    0.00 :
    4004ff:
    nop
    0.00 :
    400500:
    nop
    21.49 :
    400501:
    jmp
    4004f1 <…>
    
slide 62:
    PEBS
    • I believe this is due to parallel and out-of-order execution of
    micro-ops: the sampled IP is the resumption instruction, not
    what is currently executing. And skid.
    • PEBS may help: Intel's Precise Event Based Sampling
    • perf_events has support:
    – perf record -e cycles:pp
    – The 'p' can be specified multiple times:
    0 - SAMPLE_IP can have arbitrary skid
    1 - SAMPLE_IP must have constant skid
    2 - SAMPLE_IP requested to have 0 skid
    3 - SAMPLE_IP must have 0 skid
    – … from tools/perf/Documentation/perf-list.txt
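As a small illustration of the modifier syntax (pure string construction, no perf invoked; the event name and level are examples), precision level N is written as N ':p' characters appended to the event:

```shell
# Hedged sketch: build a perf event spec with the requested number of
# precision modifiers (cycles, cycles:p, cycles:pp, cycles:ppp).
event="cycles"
precision=2          # request "0 skid" sampling (level 2 above)
mods=""
i=0
while [ "$i" -lt "$precision" ]; do
    mods="${mods}p"
    i=$((i + 1))
done
spec="$event${mods:+:$mods}"
echo "$spec"
```

The resulting string is what you would pass as `perf record -e cycles:pp`, matching the slide's example.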
    
slide 63:
    Gotcha #4 VM Guests
    • Using PMCs from most VM guests:
    # perf stat -a -d sleep 5
    Performance counter stats for 'system wide':
    10003.718595 task-clock (msec)
    323 context-switches
    17 cpu-migrations
    233 page-faults
    <not supported> cycles
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    <not supported> instructions
    <not supported> branches
    <not supported> branch-misses
    <not supported> L1-dcache-loads
    <not supported> L1-dcache-load-misses
    <not supported> LLC-loads
    <not supported> LLC-load-misses
    5.001607197 seconds time elapsed
    2.000 CPUs utilized
    0.032 K/sec
    0.002 K/sec
    0.023 K/sec
    [100.00%]
    [100.00%]
    [100.00%]
    
slide 64:
    VM Guest PMCs
    • Without PMCs, %CPU is ambiguous. We need IPC.
    – Can't measure instructions per cycle (IPC), cache hits/misses, MMU/TLB events, etc.
    • Is fixable: eg, Xen can enable PMCs (vpmu boot option)
    – I added vpmu support for subsets, eg, vpmu=arch for Intel architectural set (7 PMCs only)
    – http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
    architectural
    set
    – Now available on the largest AWS EC2 instance types
    
slide 65:
    VM Guest MSRs
    • Model Specific Registers (MSRs) may be exposed when PMCs are not
    • Better than nothing. Can solve some issues.
    # ./showboost
    CPU MHz
    : 2500
    Turbo MHz
    : 2900 (10 active)
    Turbo Ratio : 116% (10 active)
    CPU 0 summary every 5 seconds...
    TIME
    17:28:03
    17:28:08
    17:28:13
    17:28:18
    17:28:23
    [...]
    C0_MCYC
    C0_ACYC
    UTIL
    33%
    35%
    36%
    37%
    38%
    RATIO
    116%
    116%
    116%
    115%
    115%
    MHz
    – showboost is from my msr-cloud-tools collection (on github)
    
slide 66:
    VM Guest PEBS
    • Not possible yet in Xen
    – please fix
    • Ditto for LBR, BTS, processor trace
    
slide 67:
    Gotcha #5 Containers
    • perf from the host can't find symbol files in different mount namespaces
    • We currently work around it
    – http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
    • Should be fixed in 4.14
    – Krister Johansen's patches
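    The workaround linked above boils down to a PID translation: host perf looks for `/tmp/perf-<hostpid>.map`, while a containerized JVM writes `/tmp/perf-<nspid>.map` inside its own mount namespace. A minimal sketch of the idea, with made-up PIDs (the actual copy step needs root and `nsenter`):

    ```shell
    # Sketch of the pre-4.14 workaround: map the host PID to the in-container
    # PID, then copy the perf map file across namespaces. PIDs are illustrative;
    # the in-container PID can be read from the NSpid line of /proc/<pid>/status.
    host_pid=12345   # PID as seen from the host
    ns_pid=7         # same process's PID inside the container
    echo "copy /tmp/perf-$ns_pid.map (container) to /tmp/perf-$host_pid.map (host)"
    # Actual copy (requires root):
    #   nsenter -t $host_pid -m cat /tmp/perf-$ns_pid.map > /tmp/perf-$host_pid.map
    ```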
    
slide 68:
    Gotcha #6 Overhead
    • Overhead is relative to the rate of events instrumented
    • perf stat does in-kernel counts: relatively low overhead
    • perf record writes perf.data, which has slightly higher
    CPU overhead, plus file system and disk I/O
    • Test before use
    – In the lab
    – Run perf stat to understand rate, before perf record
    • Also consider --filter, to filter events in-kernel
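    The "perf stat before perf record" advice can be sketched as a pre-flight rate check. The counts and the 10,000 events/sec threshold below are illustrative placeholders, not numbers from the talk; in practice the count would come from `perf stat -e <event> -a -- sleep 5`:

    ```shell
    # Minimal sketch: decide whether perf record looks safe, given an event
    # count measured over an interval. The limit is an illustrative number.
    safe_to_record() {
        count=$1; secs=$2; limit=10000
        rate=$((count / secs))
        if [ "$rate" -le "$limit" ]; then
            echo "ok: ${rate}/sec, proceed with perf record"
        else
            echo "too high: ${rate}/sec, consider --filter or lowering scope"
        fi
    }
    safe_to_record 1500 5        # e.g., a low-rate tracepoint
    safe_to_record 2000000 5     # e.g., a hot scheduler event
    ```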
    
slide 69:
    4. perf Advanced
    
slide 70:
    perf for Tracing Events
    
slide 71:
    Tracepoints
    # perf record -e block:block_rq_insert -a
    ^C[ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.172 MB perf.data (~7527 samples) ]
    # perf script
    […]
    java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
    java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
    […]
    process PID [CPU] timestamp: eventname: format string
    The format string comes from the kernel source, which may be the only docs:
    include/trace/events/block.h:
    DECLARE_EVENT_CLASS(block_rq,
    [...]
        TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
                  MAJOR(__entry->dev), MINOR(__entry->dev),
                  __entry->rwbs, __entry->bytes, __get_str(cmd),
                  (unsigned long long)__entry->sector,
                  __entry->nr_sector, __entry->comm)
    Also see: cat /sys/kernel/debug/tracing/events/block/block_rq_insert/format
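    Since perf script emits plain text, it is easy to post-process with awk. A small sketch, reusing the sample lines above, that sums the requested I/O size per process (nr_sector is the "+ N" field; sectors are 512 bytes — field positions assume the default perf script format shown above):

    ```shell
    # Sum requested I/O bytes per process from perf script output.
    # Sample lines copied from the slide above.
    cat > /tmp/perf_script_sample.txt <<'EOF'
    java 9940 [015] 1199510.044783: block_rq_insert: 202,1 R 0 () 4783360 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783448 + 88 [java]
    java 9940 [015] 1199510.044786: block_rq_insert: 202,1 R 0 () 4783536 + 24 [java]
    java 9940 [000] 1199510.065195: block_rq_insert: 202,1 R 0 () 4864088 + 88 [java]
    EOF
    # $1 is the process name; $(NF-1) is nr_sector (second-to-last field)
    awk '{ bytes[$1] += $(NF-1) * 512 }
         END { for (p in bytes) printf "%s %d bytes\n", p, bytes[p] }' \
        /tmp/perf_script_sample.txt
    ```

    For these four requests (88 + 88 + 24 + 88 sectors) that totals 147456 bytes for java.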
    
slide 72:
    One-Liners: Static Tracing
    # Trace new processes, until Ctrl-C:
    perf record -e sched:sched_process_exec -a
    # Trace all context-switches with stack traces, for 1 second:
    perf record -e context-switches -ag -- sleep 1
    # Trace CPU migrations, for 10 seconds:
    perf record -e migrations -a -- sleep 10
    # Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
    perf record -e syscalls:sys_enter_connect -ag
    # Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
    perf record -e block:block_rq_insert -ag
    # Trace all block device issues and completions (has timestamps), until Ctrl-C:
    perf record -e block:block_rq_issue -e block:block_rq_complete -a
    # Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'nr_sector > 200'
    # Trace all block completions, synchronous writes only, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
    # Trace all block completions, all types of writes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
    # Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
    perf record -e 'ext4:*' -o /tmp/perf.data -a
    
slide 73:
    One-Liners: Dynamic Tracing
    # Add a tracepoint for the kernel tcp_sendmsg() function entry (--add optional):
    perf probe --add tcp_sendmsg
    # Remove the tcp_sendmsg() tracepoint (or use --del):
    perf probe -d tcp_sendmsg
    # Add a tracepoint for the kernel tcp_sendmsg() function return:
    perf probe 'tcp_sendmsg%return'
    # Show avail vars for the tcp_sendmsg(), plus external vars (needs debuginfo):
    perf probe -V tcp_sendmsg --externs
    # Show available line probes for tcp_sendmsg() (needs debuginfo):
    perf probe -L tcp_sendmsg
    # Add a tracepoint for tcp_sendmsg() line 81 with local var seglen (debuginfo):
    perf probe 'tcp_sendmsg:81 seglen'
    # Add a tracepoint for do_sys_open() with the filename as a string (debuginfo):
    perf probe 'do_sys_open filename:string'
    # Add a tracepoint for myfunc() return, and include the retval as a string:
    perf probe 'myfunc%return +0($retval):string'
    # Add a tracepoint for the user-level malloc() function from libc:
    perf probe -x /lib64/libc.so.6 malloc
    # List currently available dynamic probes:
    perf probe -l
    
slide 74:
    One-Liners: Advanced Dynamic Tracing
    # Add a tracepoint for tcp_sendmsg(), with three entry regs (platform specific):
    perf probe 'tcp_sendmsg %ax %dx %cx'
    # Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for %cx register:
    perf probe 'tcp_sendmsg bytes=%cx'
    # Trace previously created probe when the bytes (alias) var is greater than 100:
    perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
    # Add a tracepoint for tcp_sendmsg() return, and capture the return value:
    perf probe 'tcp_sendmsg%return $retval'
    # Add a tracepoint for tcp_sendmsg(), and "size" entry argument (debuginfo):
    perf probe 'tcp_sendmsg size'
    # Add a tracepoint for tcp_sendmsg(), with size and socket state (debuginfo):
    perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
    # Trace previous probe when size > 0, and state != TCP_ESTABLISHED(1) (debuginfo):
    perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
    • Kernel debuginfo is an onerous requirement for the Netflix cloud
    • We can use registers instead (as above). But which registers?
    
slide 75:
    The Rosetta Stone of Registers
    One server instance with kernel debuginfo, and -nv (dry run, verbose):
    # perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
    […]
    Writing event: p:probe/tcp_sendmsg tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8
    Added new event:
    probe:tcp_sendmsg (on tcp_sendmsg with size skc_state=sk->__sk_common.skc_state)
    You can now use it in all perf tools, such as:
    perf record -e probe:tcp_sendmsg -aR sleep 1
    All other instances (of the same kernel version):
    Copy-n-paste!
    # perf probe 'tcp_sendmsg+0 size=%cx:u64 skc_state=+18(%si):u8'
    Failed to find path of kernel module.
    Added new event:
    probe:tcp_sendmsg
    (on tcp_sendmsg with size=%cx:u64 skc_state=+18(%si):u8)
    You can now use it in all perf tools, such as:
    perf record -e probe:tcp_sendmsg -aR sleep 1
    Masami Hiramatsu was investigating a way to better automate this
    
slide 76:
    perf Visualizations: Block I/O Latency Heat Map
    • We automated this for analyzing disk I/O latency issues
    [Heat map screenshots: SSD I/O (fast, with queueing); HDD I/O (random, modes)]
    http://www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html
    
slide 77:
    There's still a lot more to perf…
    Using PMCs
    perf scripting interface
    perf + eBPF
    perf sched
    perf timechart
    perf trace
    perf c2c (new!)
    perf ftrace (new!)
    
slide 78:
    Links & References
    perf_events
    Kernel source: tools/perf/Documentation
    https://perf.wiki.kernel.org/index.php/Main_Page
    http://www.brendangregg.com/perf.html
    http://web.eece.maine.edu/~vweaver/projects/perf_events/
    Mailing list: http://vger.kernel.org/vger-lists.html#linux-perf-users
    perf-tools: https://github.com/brendangregg/perf-tools
    PMU tools: https://github.com/andikleen/pmu-tools
    perf, ftrace, and more: http://www.brendangregg.com/linuxperf.html
    Java frame pointer patch
    http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-December/016477.html
    https://bugs.openjdk.java.net/browse/JDK-8068945
    Node.js: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
    Methodology: http://www.brendangregg.com/methodology.html
    Flame graphs: http://www.brendangregg.com/flamegraphs.html
    Heat maps: http://www.brendangregg.com/heatmaps.html
    eBPF: http://lwn.net/Articles/603983/
    
slide 79:
    Thank You
    – Questions?
    – http://www.brendangregg.com
    – http://slideshare.net/brendangregg
    – bgregg@netflix.com
    – @brendangregg