DTrace pid Provider return

I originally posted this at http://dtrace.org/blogs/brendan/2011/02/14/dtrace-pid-provider-return.

In my previous posts, I introduced the DTrace pid provider, and discussed function "entry" probes, their arguments, and the stability of these. Here I'll continue discussing the pid provider, introducing the function "return" probe, return value, and the stability of these as well.

For more information on the pid provider, see the pid provider chapter of the DTrace Guide and the examples in the forthcoming DTrace Book.

Function Return Probe

The "entry" probe traces the start of a function, and the "return" probe traces the end. The name "return" comes from the "ret" assembly instruction, which is how a thread ends execution of a function and returns to the parent on the stack. Listing return probes for the bash shell on Solaris:

# dtrace -ln 'pid$target:::return' -p 809
   ID   PROVIDER            MODULE                          FUNCTION NAME
76293     pid809              bash                             __fsr return
76294     pid809              bash                        exit_shell return
76295     pid809              bash                           sh_exit return
76296     pid809              bash               shell_is_restricted return
76297     pid809              bash             maybe_make_restricted return
76298     pid809              bash                 disable_priv_mode return
76299     pid809              bash                       unbind_args return
76300     pid809              bash                  unset_bash_input return
[...6759 lines truncated...]

DTrace can trace the return of functions.

Tracing the return lets us examine the return value of functions, follow the code path, and measure the time spent for functions. For some functions, the entry arguments have a pointer which is only populated on return, which is only valid when examined in the return probe.

Return Value

# dtrace -n 'pid$target:libc:strlen:return { printf("%x %d", arg0, arg1); }' -p 809
dtrace: description 'pid$target:libc:strlen:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  69433                    strlen:return 44 8
  0  69433                    strlen:return 44 11
  0  69433                    strlen:return 44 11
  0  69433                    strlen:return 44 8
  0  69433                    strlen:return 44 2
  0  69433                    strlen:return 44 20

Here I've printed arg0 and arg1 for return. arg1 is the return value, showing the length of the string in bytes. arg0 is the program counter offset from where we left the function (assembly, "ret"). Functions can return from different points, so the address can shed light on which code path was taken (if it's relevant).

This was the dissassembly of the strlen() function, using mdb:

> strlen::dis
libc.so.1`strlen:               movl   0x4(%esp),%eax
libc.so.1`strlen+4:             testl  $0x3,%eax
libc.so.1`strlen+9:             jne    +0x28    <libc.so.1`strlen+0x31>
libc.so.1`strlen+0xb:           nop    
libc.so.1`strlen+0xc:           movl   (%eax),%edx
[...]
libc.so.1`strlen+0x36:          incl   %eax
libc.so.1`strlen+0x37:          testl  $0x3,%eax
libc.so.1`strlen+0x3c:          jne    -0xb     <libc.so.1`strlen+0x31>
libc.so.1`strlen+0x3e:          jmp    -0x32    <libc.so.1`strlen+0xc>
libc.so.1`strlen+0x40:          subl   0x4(%esp),%eax
libc.so.1`strlen+0x44:          ret

The "ret" is at address 0x44, which we found when using DTrace. Functions can have multiple return points.

I find arg0 for program offset a little confusing since I also frequently use the syscall provider, where arg0 (and arg1) both contain the return value.

The pid provider rval is available as arg1 (not arg0).
The function return offset is available as arg0.
arg1 is generally the one I'm interested in.

More Discoveries

I'm always discovering things using the pid provider, even when tracing standard system library functions such as these. Here I'm tracing the return of strcmp(), and printing the rval as a distribution plot summary:

# dtrace -n 'pid$target::strcmp:return { @ = quantize((int32_t)arg1); }' -p 210
dtrace: description 'pid$target::strcmp:return ' matched 2 probes
^C

           value  ------------- Distribution ------------- count    
              -8 |                                         0        
              -4 |@                                        36       
              -2 |                                         0        
              -1 |@                                        36       
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      1253     
               1 |@                                        24       
               2 |@                                        36       
               4 |                                         0        
               8 |                                         12       
              16 |@                                        36       
              32 |                                         0

Interesting. I would have guessed that strcmp() only returned -1, 0 or 1, but it does in fact return other values as well. This is allowed, as the man page does say "The function returns an integer greater than, equal to, or less than 0,". Wikipedia notes that most real-world implementations can return other values due to branching optimization.

Associating Entry to Return

Data in the entry probe can be associated with the return probe using the DTrace thread local variable. For example, the following script traces calls to fopen(), showing both the entry arguments (filename and mode) and the return value (success/failure).

# cat -n fopen.d
     1  #!/usr/sbin/dtrace -s
     2  
     3  #pragma D option quiet
     4  
     5  dtrace:::BEGIN { printf("Tracing fopen() on PID %d...\n", $target); }
     6  
     7  pid$target::fopen:entry
     8  {
     9          self->file = arg0;
    10          self->mode = arg1;
    11  }
    12  
    13  pid$target::fopen:return
    14  /self->file/
    15  {
    16          this->result = arg1 != NULL ? "success" : "failure";
    17  
    18          printf("mode=%s result=%s file=%s\n", copyinstr(self->mode),
    19              this->result, copyinstr(self->file));
    20  
    21          self->file = 0;
    22          self->mode = 0;
    23  }
# ./fopen.d -p 14550
Tracing fopen() on PID 14550...
mode=a result=success file=highscores_c
mode=r result=success file=highscores_c

The probes were associated using thread local variables, self->file and self->mode, which are associated with the current thread. One of these is tested on line 14, to ensure they have been set before dereferencing them on lines 18 and 19. It's possible that DTrace was executed after the function entry and before the function return, which would result in only the return firing for that invocation - hence the test. These variables are then cleared by setting them to zero on lines 21 and 22.

Access entry arguments during the return probe by first saving them as thread local variables.

This won't work for recursive functions, which can have multiple entries before a return.

Flow Indent

When tracing both entry and return probes, the DTrace flowindent option (-F) will print an output indented by stack depth - which shows the flow of code. Here is the bash shell on Solaris reading a single keystroke:

# dtrace -Fn 'pid$target:bash::entry,pid$target:bash::return' -p 809
dtrace: description 'pid$target:bash::entry,pid$target:bash::return' matched 2407 probes
CPU FUNCTION
  0  <- rl_getc
  0  <- rl_read_key
  0  -> _rl_dispatch
  0    -> _rl_dispatch_subseq
  0      -> rl_insert
  0        -> _rl_insert_char
  0          -> _rl_any_typein
  0          <- _rl_any_typein
  0          -> rl_insert_text
  0          <- rl_insert_text
  0        <- _rl_insert_char
  0      <- rl_insert
  0    <- _rl_dispatch_subseq
  0  <- _rl_dispatch
  0  -> rl_redisplay
  0    -> _rl_move_cursor_relative
  0    <- _rl_move_cursor_relative
  0    -> _rl_output_some_chars
  0    <- _rl_output_some_chars
  0  <- rl_redisplay
  0  <- readline_internal_char
  0  -> readline_internal_char
  0    -> _rl_init_argument
  0    <- _rl_init_argument
  0    -> rl_read_key
  0      -> _rl_next_macro_key
  0      <- _rl_next_macro_key
  0      -> rl_getc
^C

Back when DTrace was new, this sort of functionality had to be seen to be believed, and a demo was circulated showing a flow indent trace of the read() syscall in the kernel. The above excerpt continues from where the read syscall finished, showing the flow in user-land code in bash as it processed a keystroke, and prepared to read the next one.

While it's a cute demo of DTrace functionality, I don't use this feature that often. I've used it occasionally when the flow of a function was not obvious, for example, when function pointers are used and the source code doesn't show which function names were called. The output can be many pages long, so I'll usually write it to a text file and then read it using vi.

flow indent is cute but not hugely useful.

There are a few issues with flow indent: overhead, shuffling and entropy.

Overhead: Many functions were traced for the above example to work - 2407 probes. And this was just the functions from the bash segment ("pid$target:bash"). If I were to trace all functions in the program (including library functions), there would be thousands more. Tracing so much adds CPU overhead to the execution of the program, and can slow it down significantly (not just from the probe execution, but also from producing this much text output). If there is no other way to get this data, then the overhead may be worth the cost.

When I do use this, I'm careful to not leave it running too long to minimize the affect on performance. It can also produce too much output than is really needed, anyway. I solve these by limiting the time spent tracing, such as by using an "{ exit(0); }" action in a certain probe or in a timed probe (eg, profile:::tick-100ms).

Shuffling: The output can be shuffled on multi-CPU systems, since DTrace buffers output data per-CPU and periodically collects and prints these buffers without recombining them. (The rate can be tuned via the switchrate tunable, which defaults to once per second.) To work around this, I'll add an action that includes a timestamp (eg, "{ trace(timestamp); }"), so that the output can be saved to a file and post-sorted. Having timestamps in the output is useful for another reason - locating the source of latency for performance analysis.

Entropy: Sometimes when you use flow indent, you'll notice the flow moving to the right and not returning:

  0        -> memset                          
  0          -> memset                        
  0            -> mbrtowc                     
  0              -> pthread_getspecific       
  0              <- pthread_getspecific       
[...]
  0                    -> mbrtowc             
  0                      -> pthread_getspecific 
  0                      <- pthread_getspecific 
[...]
  0                          -> mbrtowc       
  0                            -> pthread_getspecific 
  0                            <- pthread_getspecific 
[...]
  0                                    -> mbrtowc 
  0                                      -> pthread_getspecific 
  0                                      <- pthread_getspecific

If you look closely, you may spot what is amiss - some functions are not returning (missing their "<-" returns), leaving whitespace entropy that pushes the indent further and further to the right. And in the above output, we've entered mbrtowc() four times (and it's not a recursive function).

The problem here wasn't that the return probe didn't fire, but that the return probe doesn't exist in the first place:

# dtrace -ln 'pid$target::mbrtowc:entry' -p 670
   ID   PROVIDER            MODULE                          FUNCTION NAME
148363     pid670 libSystem.B.dylib                           mbrtowc entry
# dtrace -ln 'pid$target::mbrtowc:return' -p 670
   ID   PROVIDER            MODULE                          FUNCTION NAME
dtrace: failed to match pid670::mbrtowc:return: No probe matches description

Missing Returns

The previous example showed that the pid provider has made an entry probe available for a certain function, but is missing its return probe. This can happen when DTrace struggles to instrument a complex function, such as one that contains a jump table added by the compiler as an optimization. Jump tables blur the line between data and instructions, making it difficult for DTrace to dissassemble the instructions to determine where the returns are. Another compiler optimization that can confuse DTrace is tail-call optimization, where a function can return to its grandparent directly, instead of a return to a parent that simply then returns to its parent (also can save a stack frame). Since it skips a return, the return probe does not fire.

Sometimes return probes don't exist, or don't fire.

Hopefully you won't lose too many return probes, and have enough (along with the entry probes) to fetch the data you need.

You'll notice the difference with probe counts. Here is Firefox on Mac OS X:

# dtrace -ln 'pid$target:::entry' -p 210 | wc -l
  256599
# dtrace -ln 'pid$target:::return' -p 210 | wc -l
  250972

That shows we are missing 2% of the return probes. Still, with a total probe count around 500,000, hopefully we can still get what we need.

If the missing return probes are a real problem and the software can be recompiled and redeployed, then compiling with a lower optimization level is likely to bring many of the missing probes back.

Compiler optimization can sometimes cause return probes to be missing.

The worst case I've seen was an application compiled 64-bit with Sun Studio 12.2, where 25% of the return probes were missing due to optimization. Usually it's only about 2% of probes, and I've worked around it by picking other available probe points.

Return Probe Stability

The stability for return probes follows the stability of the entry names, as discussed in my first pid provider post. In summary: since this is the pid provider and is exposing application functions, nothing is guaranteed to be stable. This means that while your script may work fine on one version of an application, it may not work at all on the next because the function names and return values from functions may have changed. And, they could change in subtle ways (silent failure).

Given that return probes may also vanish due to compiler optimizations, your scripts may stop working due to an update in the compiler, not just an update in the target application.

Summary

The pid provider provides function return probes. These allow us to examine the return value and follow the code path, as demonstrated here. There are more limitations to be aware of with the return probe than the entry probe; apart from the stability and performance effect, certain return probes may not be available due to compiler optimizations.

Brendan Gregg's Blog