DTrace pid Provider Overhead

I originally posted this at http://dtrace.org/blogs/brendan/2011/02/18/dtrace-pid-provider-overhead.

I've been posting about the DTrace pid provider and its various features to share my real-world experiences in this area. Here I'll discuss an important, tricky, and generally undocumented topic: the overhead of using the pid provider. I'll provide some rough expectations for overhead based on some simple stress tests, and at the very least, show the units used to express the overhead of a fired probe (milliseconds, microseconds or nanoseconds?).

Answer: microseconds.

To cut to the chase, see the Expressing Overhead and Summary sections.

Zero Overhead

When the DTrace pid provider is not enabled and tracing, the overhead on applications is zero. You only pay a CPU cost when you actually use it.

The non-enabled overhead of the pid provider is zero.

When I first heard this, I was skeptical, wondering if it was more of a marketing or technical statement. Does zero overhead really mean zero, as in, zero extra instructions? Wouldn't a compare and jump instruction, at least, be necessary at some point, so that the operating system can check whether it should be calling into DTrace or not? If there are no such instructions, how can I walk up to already-running software and instrument it with DTrace? Wouldn't it just continue as-is, and never know that it should be branching into some DTrace code?

I only really understood this properly when I learned what happened at the instruction level, which I'll demo here.

Here is the strlen() function assembly (truncated), which I've been tracing in previous posts:

libc.so.1`strlen:               movl   0x4(%esp),%eax
libc.so.1`strlen+4:             testl  $0x3,%eax
libc.so.1`strlen+9:             jne    +0x28    <libc.so.1`strlen+0x31>
libc.so.1`strlen+0xb:           nop
libc.so.1`strlen+0xc:           movl   (%eax),%edx
libc.so.1`strlen+0xe:           movl   $0x7f7f7f7f,%ecx
libc.so.1`strlen+0x13:          andl   %edx,%ecx
[...]
libc.so.1`strlen+0x44:          ret

Now I'll use the DTrace pid provider to instrument the entry to this function:

# dtrace -n 'pid$target::strlen:entry' -p 12072

And while DTrace is still attached, I'll dump the same assembly from this process:

libc.so.1`strlen:               int    $0x3
libc.so.1`strlen+1:             incl   %esp
libc.so.1`strlen+2:             andb   $0x4,%al
libc.so.1`strlen+4:             testl  $0x3,%eax
libc.so.1`strlen+9:             jne    +0x28    <libc.so.1`strlen+0x31>
libc.so.1`strlen+0xb:           nop    
libc.so.1`strlen+0xc:           movl   (%eax),%edx
libc.so.1`strlen+0xe:           movl   $0x7f7f7f7f,%ecx
libc.so.1`strlen+0x13:          andl   %edx,%ecx
[...]
libc.so.1`strlen+0x44:          ret

It's changed - while the process was running! DTrace has has dynamically patched the live instructions in this process's address space. This is the dynamic part of Dynamic Tracing. DTrace can dynamically instrument the kernel, too, via the fbt provider.

DTrace can dynamically patch live instructions in memory.

How DTrace dynamically instruments code depends on the location (user/kernel) and architecture. This example was a user process on AMD64, and the first instruction was replaced by "int" to call a soft interrupt containing a DTrace routine. The kernel is dynamically instrumented in a different way.

This seems terrifying.

The process is still running, and could be in production. Thankfully, the instrumention is applied in a safe manner, so that the process can never execute partially-patched instructions. So, while it seems terrifying, it is actually safe - which is a primary goal of DTrace: safe for production use.

This is why there is zero overhead when dynamic probes are not enabled. The code runs as usual.

When DTrace Isn't D-Trace

Before I move on, I want to note that dynamic tracing is only used by the pid and fbt providers. Many of the other providers that DTrace uses are "statically defined tracing", such as io, syscall, tcp, perl, python, javascript, etc. So, DTrace isn't just about dynamic tracing, as in the examples above, it uses static tracing a lot too.

Statically-defined providers are more effort to create and deploy - the source code must be instrumented before compilation, and that instrumented version must be executing. The advantage of static over dynamic providers is that they have a stable interface, and DTrace scripts written for these providers should continue working across different software versions and OSes. Stability was discussed in my first post in this series.

... Perhaps DTrace should have been called "Dynamic and Static Tracing", or DasTrace for short. (and coded on your DasKeyboard.)

Overhead

Usually when I'm considering overhead, it's about the running overhead of DTrace - the performance cost to the target application while the probes are enabled and collecting data. There is also some additional overhead when you first run DTrace and initialize pid provider probes, which can be proportional to the number of probes initialized. The running overhead is proportional to the rate of probes - the more they fire per second, the higher the overhead.

I'll run some overhead tests to provide rough figures to consider. These stress test the pid provider, so that the overhead of probes becomes noticable and measurable. I'd like to show the overhead from tracing a real-world application, but the overhead is often so small that it is lost in the noise (caching effects and normal workload variability).

Stress Testing

I'll stress test the pid provider by running and DTracing "looper" - a short C program that calls a simple function 100 million times. I'll time the program running normally, then while using the pid provider to instrument that function.

# cat -n looper.c
     1  #include <strings.h>
     2  
     3  int
     4  func(char *input)
     5  {
     6          return (strlen(input) + 1);
     7  }
     8  
     9  int
    10  main()
    11  {
    12          int i, j, sum = 0;
    13  
    14          for (i = 0; i < 1000000; i++) {
    15                  for (j = 0; j < 100; j++) {
    16                          sum += func("hello DTrace");
    17                  }
    18          }
    19  
    20          return (0);
    21  }

Since func() is called 100 millions times and does very little, the execution of this program may become bounded by the overhead of executing the DTrace probes, which could take more CPU cycles than the function itself. This is a form of stress test to magnify something minute so that it becomes observable.

Control

Without any DTrace enabled, I'll use ptime to measure the run time of the program ("real" time):

# ptime ./looper

real        1.513169989

It normally takes 1.5 seconds to run.

DTrace startup

Since I'm timing the overall command, including DTrace, I need to determine the time for DTrace to initialize itself. This time includes allocating the per-CPU output buffers. I've tested this by tracing a function that wasn't called, so additional time is almost entirely the cost of startup:

# ptime dtrace -n 'pid$target:libc:strncpy:entry { @ = count(); }' -c ./looper
dtrace: description 'pid$target:libc:strncpy:entry ' matched 1 probe
dtrace: pid 5475 has exited


real        2.722916042

2.7 seconds. So, about 1.2 seconds is DTrace startup.

pid provider probe

The following traces the function that is called 100 million times. The action is simply to count ("@ = count()"), a commonly used action which also lets us confirm that we traced all 100 million invocations:

# ptime dtrace -n 'pid$target:a.out:func:entry { @ = count(); }' -c ./looper
dtrace: description 'pid$target:a.out:func:entry ' matched 1 probe
dtrace: pid 5524 has exited

        100000000

real     1:02.519046129

The time has now increased to 62.5 seconds. (And the "100000000" confirms we traced all 100 million function calls.)

probe + copyinstr()

The copyinstr() action adds additional overhead, as it copies data from the user-land address space to the kernel. Testing this:

# ptime dtrace -n 'pid$target:a.out:func:entry { @[copyinstr(arg0)] = count(); }' -c ./looper
dtrace: description 'pid$target:a.out:func:entry ' matched 1 probe
dtrace: pid 9672 has exited

  hello DTrace                                              100000000

real     2:33.046645951

The time now was 153 seconds.

Calculating Overhead

The last example of DTrace made this program run 100 times slower. Ouch! However, this is an extreme test - instrumenting ten million function calls.

The overhead of each probe is fixed (about as fixed as anything can be, considering CPU and MMU cache effects). For the first function tracing test, tracing 100 million probes added 59.8 seconds (62.5 - 2.7s). That means each pid provider probe is costing about 0.598 microseconds. For this rough test, on this hardware.

This enabled pid provider probe was costing 0.6 microseconds per call.

When I added a copyinstr(), which does more CPU work (including MMU work to bridge address spaces), the cost per probe was about 1.5 microseconds. (These calculations have been updated thanks to a blog comment!)

Sanity Testing

As with any performance test, it's worth sanity checking via other means. Here I'll estimate pid provider probe cost a different way.

I'll start by timing strlen() (using vtimestamp to reduce the inclusion of DTrace probe cost):

# dtrace -n 'pid$target::strlen:entry,pid$target::strlen:return { trace(vtimestamp); }'
-p 21607
dtrace: description 'pid$target::strlen:entry,pid$target::strlen:return ' matched 4 probes
CPU     ID                    FUNCTION:NAME
 13  75978                     strlen:entry           3479933
 13  75977                    strlen:return           3482396

DTrace measures this as 2463 nanoseconds (3482396 - 3479933).

Now I'll trace each instruction of strlen() and print a timestamp (switching to normal timestamps to include DTrace probe cost):

# dtrace -n 'pid$target::strlen: { trace(timestamp); }' -p 21607
dtrace: description 'pid$target::strlen: ' matched 80 probes
CPU     ID                    FUNCTION:NAME
  1  75978                     strlen:entry  10165584038261648
  1  75979                         strlen:0  10165584038263818
  1  75980                         strlen:4  10165584038266909
  1  75981                         strlen:6  10165584038269029
  1  75982                         strlen:9  10165584038271405
  1  75983                         strlen:b  10165584038273856
  1  75984                         strlen:e  10165584038275964
  1  75985                        strlen:10  10165584038278132
  1  75986                        strlen:11  10165584038280403
  1  75987                        strlen:14  10165584038282483
  1  75988                        strlen:16  10165584038284388
  1  75989                        strlen:19  10165584038286323
  1  76015                        strlen:51  10165584038288842
  1  76016                        strlen:55  10165584038290819
  1  75977                    strlen:return  10165584038292203
^C

The time delta for strlen() was now 30555 nanoseconds (...38292203 - ...38261648), which is 28092 ns slower than measured before. This extra time is from the 14 probes, which makes the cost about 2.0 microseconds per probe.

This enabled pid provider probe was costing 2 microseconds per call.

As a sanity test, this looks ok. The result is in the same ballpark. It is a different probe type, instruction instead of function boundary, and could have a different cost associated (I'd have to check the implementation to confirm).

Sanity Testing #2

As a different form of sanity test, I traced libmosjs calls (JS_*()) on the Firefox web browser on my laptop. I left it running in the background with various tabs open, and found its CPU usage was consistent across a 60 second average (too variable across a 1 second average to draw conclusions). I'll summarize details here.

Firefox CPU Utilization for 60 seconds: 22.7%
... while tracing all JS_* calls: 46.7%
Number of JS_* calls during those 60 seconds: 7867128
Average number of JS_* calls during 1 second: 131119
Percent CPU cost for each probe ((46.7 - 22.7) / 131119): 0.000183039%
Time for each probe (when "100%" CPU == 1000 ms per second): ~1.8 microseconds

Again, this is pretty rough, and is just intended as a sanity test. The result of 1.8 us is a little faster than the 2 us result, and may be due in part to testing this on a different platform with slightly faster CPUs.

Expressing Overhead

I'd suggest considering pid provider overhead as follows:

Don't worry too much about pid provider probe cost at < 1000 events/sec.
At > 10,000 events/sec, pid provider probe cost will be noticeable.
At > 100,000 events/sec, pid provider probe cost may be painful.

Let's discuss these using the 6 us per probe result from the previous tests. This could be faster (~2 us) or slower (> 15 us) depending on architecture and the probe action.

at 1000 events/sec, that could cost 6 milliseconds of CPU time. If the application was 100% on-CPU, it may slow by 0.6% - ie, don't worry.
At 10,000 events/sec, that could cost 60 milliseconds of CPU time. For a 100% on-CPU application, then the slow down may be 6%. If it was idle some of the time (systems often are), then the overhead may be less, as there are spare cycles to run DTrace (really depends on the location of the probe in the code-path, though).
At 100,000 events/sec, the CPU cost could be 600 milliseconds. This can significantly slow down an application. Whether that's a problem depends on the app and its function - how sensitive it is to latency.

This was described in terms of a single thread of execution. If an application is spreading these events concurrently across multiple CPUs, then the performance impact should be less.

My previous stress test was an extreme case, as I traced a function that was called 100 million times in 1.5 seconds. That's a rate of over 60 million events per second. For these suggestions, that puts this well off the pain-end of the scale. And the result was indeed painful - the application ran two orders of magnitude slower.

From Experience

When investigating real-world applications, I can often find key information by tracing events in the 10 to 10,000 events/sec range. The overhead is usually small, and hard to measure over the noise of workload variability. I'm also often using DTrace for brief periods, less than 30 seconds, to collect information that I can then analyze offline.

The actual overhead of my typical pid provider investigations is so small it's difficult to measure.

I have used DTrace plenty of times for higher event rates where it has significantly slowed down the target application, because there was no other way to get the data. If I'm using DTrace to analyze a performance meltdown, then the system is already in hell, and a little more pain so that we can fix it sooner is usually worth it.

Reducing Overhead

Here are some tips for reducing overhead when using the pid provider.

Always use entry and return

Be careful with the last field of the probename. Left blank (wildcard) it will match all the instruction probes - creating many more probes than was probably wanted. With the pid provider, if you only meant entry and return, remember to specify them explicitly.

Trace for short intervals

If you expect a DTrace script will have high overhead, you may collect sufficient data when running it for a very short interval. High overhead usually means many thousands of events per second, so a fraction of a second may capture more than enough events anyway. The shortest I've run a pid provider enabling for is 100ms, by adding the probe:

profile:::tick-100ms { exit(0); }

This quits DTrace after 100ms of tracing.

Find higher level functions

Higher level functions are called less frequently than lower level functions. If you already know your way around the source code, candidates may be obvious. If not, you can try using stack back traces to find them.

Let's say I wanted to trace what a bash shell was doing, but didn't know where to start. For this example, I'll start with a high overhead experiment: tracing all of bash's functions and frequency counting them:

# dtrace -n 'pid$target:bash::entry { @[probefunc] = count(); }' -p 11290
dtrace: description 'pid$target:bash::entry ' matched 1957 probes
^C

  _rl_replace_text                                                  1
  check_mail                                                        1
[...]
  _rl_reset_argument                                              108
  readline_internal_char                                          108
  rl_get_char                                                     108
  rl_getc                                                         108
  rl_read_key                                                     108
  rl_redisplay                                                    108
  sh_malloc                                                       108
  sh_xmalloc                                                      108
  update_line                                                     108
  sh_free                                                         110
  sh_xfree                                                        110
  _rl_block_sigint                                                112
  _rl_release_sigint                                              112
  internal_free                                                   120
  internal_malloc                                                 128

The output was many pages long, truncated here to fit.

I'll pick one of the more frequently called functions and then count its stack back trace. Trying readline_internal_char():

# dtrace -n 'pid$target::readline_internal_char:entry { @[ustack()] = count(); }' -p 11290
dtrace: description 'pid$target::readline_internal_char:entry ' matched 1 probe
^C

              bash`readline_internal_char
              bash`readline+0x49
              bash`yy_readline_get+0x52
              bash`shell_getc+0xe1
              bash`read_token+0x6f
              bash`yyparse+0x4b9
              bash`parse_command+0x67
              bash`read_command+0x52
              bash`reader_loop+0xa5
              bash`main+0xaff
              bash`_start+0x83
               48

By reading down the stack, I'd guess that functions like read_command() may be a better place to start - if I was just interested in a high level view of what bash was doing. It is indeed called less often - six times when tracing earlier (although truncated from the above output).

Future Note

For future readers, I should note that these numbers are based on current CPU hardware. The "looper" program was run on a Joyent SmartMachine on the following hardware:

# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (1 3 5 7 9 11 13 15)
[...]
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2400 MHz)
      Intel(r) Xeon(r) CPU           E5530  @ 2.40GHz

As CPUs become faster, the fixed cost that I was calculating above should reduce. Six microseconds sounds fast now (2011), but may sound slow a decade from now.

Summary

The DTrace pid provider has zero overhead when not enabled, and a low overhead per probe invocation when enabled - on the order of microseconds (between 1.8 and 15 us when tested here). When choosing pid provider probes to enable, consider the rate that they will fire - the rate of events to trace. If this is less than a thousand per second, then the overhead is unlikely to be noticable; if the rate is over 100,000, then the overhead is likely to be noticable - and could start slowing the target application noticeably. Using particular DTrace actions, such as copyin() and copyinstr(), can increase the probe cost further.

I measured the overhead by taking things to the extreme, with a stress test calling 100 million fast functions. When tracing all 100 million calls and performing copyinstr()s, this test ran two orders of magnitude slower. While this was useful to estimate individual probe cost, this may serve another role: showing that you really can shoot yourself in the foot with the pid provider. While DTrace probes are individually fast, multiply by millions and the cost adds up.

The overhead is manageable if you know what to expect. If you'd like to be really sure before using the pid provider in production, setup a test environment and run a simple benchmark with and without the pid provider probes enabled.

Brendan Gregg's Blog