DTrace in New York

I originally posted this at http://bdgregg.blogspot.com/2008/02/dtrace-in-new-york.html.

Back in February I gave several DTrace talks in New York, including one at the New York OpenSolaris User Group meeting (NYCOSUG). I used an updated slide deck and was asked to put the PDF on my blog; I think Isaac must have beaten me to it and put it here (thanks!). I did intend to blog about this in case anyone was looking. Sorry for the delay.

The NYCOSUG had a good turn out and asked some great questions, allowing me to deviate from the prepared slides and cover other things of interest (which is the value of an in-person presentation.) After the presentation I realised there was one point I could have explained better, which would make an interesting blog post.

I started with the following simple demos. The point of these is to build on something commonly understood (such as the behaviour of fork() and exec()), to introduce something new: DTrace.

Tracing exec():

# dtrace -n 'syscall::exec*: { trace(execname); }'
dtrace: description 'syscall::exec*: ' matched 4 probes
CPU     ID                    FUNCTION:NAME
  0  98087                      exece:entry   bash                             
  0  98088                     exece:return   ls                               
^C

In the above output, we traced an exece() system call, printing the current process name when we entered and returned from that system call. That process name changed from "bash" to "ls" (I executed "ls -l" in another window), which is what exec() does: replaces the current process image with another.

While unsuprising, the significance is that we are able to dymanically trace this kernel activity whenever we would like, along with thousands of other kernel events. I could, for example, trace the time taken for exec() to execute; or the exit status when exec() returned along with the error code; I can also trace the internal operation of exec() with enough detail to fill hundreds of pages (I just counted 47396 lines of output when tracing every kernel function entry and return during exec()).

Now to trace fork():

# dtrace -n 'syscall::fork*: { trace(pid); }'
dtrace: description 'syscall::fork*: ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0  98227                    forksys:entry     87417
  0  98228                   forksys:return     90769
  0  98228                   forksys:return     87417
^C

The above system call has one entry and two returns, which is what we expect from the fork() family.

Simple as this is, some interesting behaviour is already visible. Note that the parent returned before the child? On the fork() entry, the parent's PID is traced (87417), however the child PID (90769) appears first on return.

It's possible that the output could be shuffled due to how DTrace uses per-CPU buffers to minimise performance impact. To double check, add a "timestamp" column and post sort:

# dtrace -n 'syscall::fork*: { printf("%d %d", timestamp, pid); }' -o /tmp/out.dtrace
dtrace: description 'syscall::fork*: ' matched 6 probes
# sort -n +3 /tmp/out.dtrace
CPU     ID                    FUNCTION:NAME
  1  98227                    forksys:entry 268361844462135 87417
  1  98228                   forksys:return 268361844960455 90968
  0  98228                   forksys:return 268361844965924 87417

I asked the audience: why does the child return from fork() before the parent? I added that this was a very difficult question!

Someone responded to say that this was how all operating systems worked: the parent process waits for the child to complete. I said I was just tracing fork() and the parent could be scheduled first, but deliberately isn't, and explained why. My answer left them confused, and it struck me afterwards that I should have explained this better.

Consider this:

# dtrace -n 'syscall::fork*:,syscall::wait*: { trace(pid); }'
dtrace: description 'syscall::fork*:,syscall::wait*: ' matched 10 probes
CPU     ID                    FUNCTION:NAME
  0  98227                    forksys:entry    87417
  0  98228                   forksys:return    91088
  0  98228                   forksys:return    87417
  0  98163                    waitsys:entry    87417
  0  98164                   waitsys:return    87417
^C

In the above output we can see both fork() and wait(), and we can discuss behaviour such as the parent process waiting for the child to complete (since I was in a shell running foreground commands.)

But I was actually asking a much deeper question, that of thread scheduling immediately after the fork() system call, and before the parent has called wait(). Immediately after fork() you have two threads: which should go on-CPU first? The parent, so that it can get to wait() sooner and before the child may have exited? Or is there a reason to schedule the child first?

As DTrace shows, the child is getting scheduled first, and the reason is one of performance. The source code explains in uts/common/disp/ts.c :

/*
 * Child is placed at back of dispatcher queue and parent gives
 * up processor so that the child runs first after the fork.
 * This allows the child immediately execing to break the multiple
 * use of copy on write pages with no disk home. The parent will
 * get to steal them back rather than uselessly copying them.
 */
static void
ts_forkret(kthread_t *t, kthread_t *ct)

The fork() system call creates a clone of the parent, but rather than copy all memory pages to a new address space (which would add significant latency during process creation), Solaris bumps a reference counter on those memory pages to remember that two process refer to the same data. If one writes to a memory page later, this triggers a "copy on write" to create a private writable copy for that process. This means that expensive memory copies are only performed when needed, or if needed. Since a child process is likely to call exec(), it is likely to simply drop many existing memory references for the new process image, so copying those bytes would have been wasted cycles anyway.

However, if the parent is scheduled first – before the child has had a chance to exec() – then the parent may continue writing to its address space, triggering copy on writes. Then the child executes, calls exec(), and drops those newly copied pages anyway – which were copied in vain. To avoid this, the child is scheduled first, so that it calls exec() as soon as possible, as described in the comment above.

I learned about this behaviour when reading Solaris Internals 1st edition, but that was a time before DTrace and OpenSolaris. It's great that we all can now both read the code, and use DTrace to see it in operation.

Brendan Gregg's Blog

DTrace in New York