Systems Performance 2nd Ed.

BPF Performance Tools book

Recent posts:
Blog index

Linux bcc/BPF Node.js USDT Tracing

12 Oct 2016

You may know that Node.js has built-in USDT (user statically-defined tracing) probes for performance analysis and debugging, but did you know that Linux now supports using them? And now that V8 has tracing, is this too late to matter? In this post I'll explain things a little with a basic USDT example.

The Linux 4.x series has been adding enhancements to BPF (Berkeley Packet Filter) originally for software defined networks, but now can be used for programmatic tracing. Aka BPF superpowers. These are built into Linux, so sooner or later, this is coming to everyone who runs Linux.

I wrote an example of instrumenting the node http-server-request USDT probe with BPF:

# ./ 24728
TIME(s)            COMM             PID    ARGS
24653324.561322998 node             24728  path:/index.html
24653335.343401998 node             24728  path:/images/welcome.png
24653340.510164998 node             24728  path:/images/favicon.png

The source is in bcc under examples/tracing/

bcc uses C for the kernel instrumentation (which it compiles into BPF bytecode) and Python (or lua) for user-level reporting. It gets the job done, but is verbose. This example should be even more verbose: I used a debug shortcut, bpf_trace_printk(), but if this were a tool intended for concurrent use it needs to use BPF_PERF_OUTPUT() instead (I explained how in the bcc Python Tutorial), which will inflate the code further.

This code ultimately runs the do_trace() function when the http__server__request probe is hit, which reads the 6th argument, the URL. You can see some argument definitions in src/node_provider.d, eg:

probe http__server__request(node_dtrace_http_server_request_t *h,
    node_dtrace_connection_t *c, const char *a, int p, const char *m,
    const char *u, int fd) : (node_http_request_t *h, node_connection_t *c,

Let's check those other strings (char *'s). The bcc trace program can print them out, which allows some powerful ad hoc one-liners to be developed:

# trace -p `pgrep -n node` 'u:node:http__server__request "%s %s %s", arg3, arg5, arg6'
TIME     PID    COMM         FUNC             -
21:28:14 827    node         http__server__request GET /
21:28:15 827    node         http__server__request GET /
21:28:16 827    node         http__server__request GET /

You can also use bcc's tplist to list probes, eg, on a file:

# tplist -l /mnt/src/node-v6.7.0/node
/mnt/src/node-v6.7.0/node node:gc__start
/mnt/src/node-v6.7.0/node node:gc__done
/mnt/src/node-v6.7.0/node node:http__server__response
/mnt/src/node-v6.7.0/node node:net__server__connection
/mnt/src/node-v6.7.0/node node:net__stream__end
/mnt/src/node-v6.7.0/node node:http__client__response
/mnt/src/node-v6.7.0/node node:http__client__request
/mnt/src/node-v6.7.0/node node:http__server__request

... or on a running process:

# tplist -p `pgrep node`
/mnt/src/node-v6.7.0/out/Release/node node:gc__start
/mnt/src/node-v6.7.0/out/Release/node node:gc__done
/mnt/src/node-v6.7.0/out/Release/node node:http__server__response
/mnt/src/node-v6.7.0/out/Release/node node:net__server__connection
/mnt/src/node-v6.7.0/out/Release/node node:net__stream__end
/mnt/src/node-v6.7.0/out/Release/node node:http__client__response
/mnt/src/node-v6.7.0/out/Release/node node:http__client__request
/mnt/src/node-v6.7.0/out/Release/node node:http__server__request
/lib/x86_64-linux-gnu/ libc:setjmp
/lib/x86_64-linux-gnu/ libc:longjmp
/lib/x86_64-linux-gnu/ libc:longjmp_target
/lib/x86_64-linux-gnu/ libc:memory_heap_new
/lib/x86_64-linux-gnu/ libc:memory_sbrk_less
/lib/x86_64-linux-gnu/ libc:memory_arena_reuse_free_list
/lib/x86_64-linux-gnu/ libc:memory_arena_reuse_wait
/lib/x86_64-linux-gnu/ libc:memory_arena_reuse
/lib/x86_64-linux-gnu/ libc:memory_arena_new
/lib/x86_64-linux-gnu/ libc:memory_arena_retry
/lib/x86_64-linux-gnu/ libc:memory_heap_free
/lib/x86_64-linux-gnu/ libc:memory_heap_less
/lib/x86_64-linux-gnu/ libc:memory_heap_more
/lib/x86_64-linux-gnu/ libc:memory_sbrk_more
/lib/x86_64-linux-gnu/ libc:memory_malloc_retry
/lib/x86_64-linux-gnu/ libc:memory_mallopt_free_dyn_thresholds
/lib/x86_64-linux-gnu/ libc:memory_realloc_retry
/lib/x86_64-linux-gnu/ libc:memory_memalign_retry
/lib/x86_64-linux-gnu/ libc:memory_calloc_retry
/lib/x86_64-linux-gnu/ libc:memory_mallopt
/lib/x86_64-linux-gnu/ libc:memory_mallopt_mxfast
/lib/x86_64-linux-gnu/ libc:memory_mallopt_arena_max
/lib/x86_64-linux-gnu/ libc:memory_mallopt_arena_test
/lib/x86_64-linux-gnu/ libc:memory_mallopt_mmap_max
/lib/x86_64-linux-gnu/ libc:memory_mallopt_mmap_threshold
/lib/x86_64-linux-gnu/ libc:memory_mallopt_top_pad
/lib/x86_64-linux-gnu/ libc:memory_mallopt_trim_threshold
/lib/x86_64-linux-gnu/ libc:memory_mallopt_perturb
/lib/x86_64-linux-gnu/ libc:memory_mallopt_check_action
/lib/x86_64-linux-gnu/ libc:lll_lock_wait_private
/lib/x86_64-linux-gnu/ libpthread:pthread_start
/lib/x86_64-linux-gnu/ libpthread:pthread_create
/lib/x86_64-linux-gnu/ libpthread:pthread_join
/lib/x86_64-linux-gnu/ libpthread:pthread_join_ret
/lib/x86_64-linux-gnu/ libpthread:mutex_init
/lib/x86_64-linux-gnu/ libpthread:mutex_destroy
/lib/x86_64-linux-gnu/ libpthread:mutex_acquired
/lib/x86_64-linux-gnu/ libpthread:mutex_entry
/lib/x86_64-linux-gnu/ libpthread:mutex_timedlock_entry
/lib/x86_64-linux-gnu/ libpthread:mutex_timedlock_acquired
/lib/x86_64-linux-gnu/ libpthread:mutex_release
/lib/x86_64-linux-gnu/ libpthread:rwlock_destroy
/lib/x86_64-linux-gnu/ libpthread:rdlock_acquire_read
/lib/x86_64-linux-gnu/ libpthread:rdlock_entry
/lib/x86_64-linux-gnu/ libpthread:wrlock_acquire_write
/lib/x86_64-linux-gnu/ libpthread:wrlock_entry
/lib/x86_64-linux-gnu/ libpthread:rwlock_unlock
/lib/x86_64-linux-gnu/ libpthread:cond_init
/lib/x86_64-linux-gnu/ libpthread:cond_destroy
/lib/x86_64-linux-gnu/ libpthread:cond_wait
/lib/x86_64-linux-gnu/ libpthread:cond_timedwait
/lib/x86_64-linux-gnu/ libpthread:cond_signal
/lib/x86_64-linux-gnu/ libpthread:cond_broadcast
/lib/x86_64-linux-gnu/ libpthread:lll_lock_wait_private
/lib/x86_64-linux-gnu/ libpthread:lll_lock_wait
/lib/x86_64-linux-gnu/ libm:slowatan2_inexact
/lib/x86_64-linux-gnu/ libm:slowatan2
/lib/x86_64-linux-gnu/ libm:slowlog_inexact
/lib/x86_64-linux-gnu/ libm:slowlog
/lib/x86_64-linux-gnu/ libm:slowatan_inexact
/lib/x86_64-linux-gnu/ libm:slowatan
/lib/x86_64-linux-gnu/ libm:slowtan
/lib/x86_64-linux-gnu/ libm:slowasin
/lib/x86_64-linux-gnu/ libm:slowacos
/lib/x86_64-linux-gnu/ libm:slowsin
/lib/x86_64-linux-gnu/ libm:slowcos
/lib/x86_64-linux-gnu/ libm:slowexp_p6
/lib/x86_64-linux-gnu/ libm:slowexp_p32
/lib/x86_64-linux-gnu/ libm:slowpow_p10
/lib/x86_64-linux-gnu/ libm:slowpow_p32
/usr/lib/x86_64-linux-gnu/ libstdcxx:catch
/usr/lib/x86_64-linux-gnu/ libstdcxx:throw
/usr/lib/x86_64-linux-gnu/ libstdcxx:rethrow
/lib/x86_64-linux-gnu/ rtld:init_start
/lib/x86_64-linux-gnu/ rtld:init_complete
/lib/x86_64-linux-gnu/ rtld:map_failed
/lib/x86_64-linux-gnu/ rtld:map_start
/lib/x86_64-linux-gnu/ rtld:map_complete
/lib/x86_64-linux-gnu/ rtld:reloc_start
/lib/x86_64-linux-gnu/ rtld:reloc_complete
/lib/x86_64-linux-gnu/ rtld:unmap_start
/lib/x86_64-linux-gnu/ rtld:unmap_complete
/lib/x86_64-linux-gnu/ rtld:setjmp
/lib/x86_64-linux-gnu/ rtld:longjmp
/lib/x86_64-linux-gnu/ rtld:longjmp_target

This has picked up many other USDT probes (wow), from libc, libpthread, libm, libstdcxx, and rtld. Nice!

Node.js and USDT

Last time I built Node.js with these USDT probes I used these steps:

$ sudo apt-get install systemtap-sdt-dev   # adds "dtrace", used by node build
$ wget
$ tar xvf node-v6.7.0.tar.gz
$ cd node-v6.7.0
$ ./configure --with-dtrace
$ make -j8

If you don't have bcc setup, you can use readelf to check that the USDT probes are built into the binary, which show up as "SystemTap probe descriptors":

# readelf -n /mnt/src/node-v6.7.0/node
Displaying notes found at file offset 0x01814014 with length 0x000003c4:
  Owner                 Data size   Description
  stapsdt              0x0000003c   NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: gc__start
    Location: 0x00000000011552f4, Base: 0x0000000001a444e4, Semaphore: 0x0000000001e13fdc
    Arguments: 4@%esi 4@%edx 8@%rdi

bcc supports both USDT probes and IS-ENABLED USDT probes.

There is another way to create USDT probes: the dtrace-provider library, which allows your Node.js code to dynamically declare new USDT probes. Last I checked, that library did not compile on Linux, however, with the new bcc/BPF support it should be fixable.

In order to use USDT probes with bcc, you'll need a Linux kernel that's new and shiny. By Linux 4.4 (which is used by Ubuntu 16.04 LTS), there's enough BPF to do USDT event tracing, latency measurements, and histograms. Linux 4.6 adds stack trace support.

V8 Tracing & Future Use

V8 has recently added an --enable-tracing option that generates a v8_trace.json file for loading in Google Chrome's trace viewer. Once this is widely available in node, many common tracing needs may be solved.

The long term value of USDT support with bcc may be the instrumentation of other subsystems: node internals, system libraries, and the kernel, and exposing these with Node.js context fetched from the USDT probes. BPF/bcc can instrument kernel functions, kernel tracepoints, and user-level functions as well. These:

# objdump -j .text -tT node | head

node:     file format elf64-x86-64

000000000079bd00 l    d  .text  0000000000000000              .text
000000000079e070 l     F .text  0000000000000000              deregister_tm_clones
000000000079e0b0 l     F .text  0000000000000000              register_tm_clones
000000000079e0f0 l     F .text  0000000000000000              __do_global_dtors_aux
000000000079e110 l     F .text  0000000000000000              frame_dummy
000000000079e140 l     F .text  000000000000002b              ssl_callback_ctrl
# objdump -j .text -tT node | wc
  76170  492908 9373788
# wc /proc/kallsyms
108919 339047 4659123 /proc/kallsyms

That's over 75 thousand user-level probe points, plus over 108 thousand kernel probe points, plus those extra USDT probes I listed previously.

If you can solve your performance issues using v8/node-specific capabilities like V8 tracing, then great! But for times when you really need to dig into the depths of your application and its OS interactions: bcc/BPF can do it, and you can make custom tools to automate it.

Adding USDT support to bcc is just the beginning, and I've shared some details on how it works in this post. Next is to build useful tools and GUIs that make use of it.

Click here for Disqus comments (ad supported).