Systems Performance 2nd Ed.

BPF Performance Tools book

Recent posts:
Blog index

MySQL performance schema and DTrace

23 Jun 2011

I originally posted this at

For a series of blog posts on file system latency, I used MySQL as an example application to examine. I was doing so using DTrace, but as it turns out, MySQL 5.5 GA and later has the performance schema which can measure file system latency as well. Mark Leith recently posted how to do this. For more background on the performance schema, see the recent talk by Peter Zaitsev; (and if you want to get really deep, see the 600+ line comment at the top of storage/perfschema/

However, the road for DTrace doesn't quite end here.

If DTrace is available, you can use it in addition to the performance schema to investigate other areas, including kernel internals. But you can also use DTrace with the performance schema, by tracing the functions that were added to MySQL to implement it. This is instrumenting the instrumentation - taking the capabilities of the performance schema and building upon them. In this post I'll demonstrate this for a couple of reasons: examining the overhead of the performance schema and extending the performance schema a little.


There have been many good articles about performance schema overhead: see DimitriK and Vadim Tkachenko's posts. The overhead depends on the workload, what instrumentation is enabled, how much CPU headroom there is (idle), and the MySQL version. CPU-bound workloads with lock instrumentation enabled will cost the most to instrument, whereas I/O-bound workloads will cost the least (approaching 0%). Extra instrumentation does come at a cost, just as it does with DTrace, and cost that is relative to the frequency of events. Lock events can happen very frequently (100,000+s per sec), especially if the workload is hot on-CPU. And if the workload is bound by CPU performance, then extra instrumentation may be taking cycles away from doing queries.

An example of a CPU-bound workload is benchmarks, such as sysbench, which I've been using to test the performance schema. Unless you spend time tuning them to match your environment they will likely paint a worst-case picture for instrumentation overhead. It's easy to create a working set that caches entirely in DRAM, driven by multiple client threads that are calling queries as fast as possible, making it sensitive to extra instrumentation. The performance overhead may look bad: for both the performance schema and DTrace.

What's different here is that the performance schema adds overhead when it is enabled but not actually used (consumers OFF), which could cost 10% performance for heavy workloads. As DimitriK wrote, this should be lower, and was on prototype code which showed 2-3% overhead instead. This is still a degree of overhead, whereas DTrace has near-zero when not in use (mysql provider's additional NOPs), and literally zero when the pid provider is used (original binary runs unmodified). Vadim wrote:

As I understand dtrace / systemtap probes can be disabled / enabled at run-time, and when they disabled – it is almost 0% overhead, why Performance Schema can’t do the same ?

If the performance-schema option wasn't enabled on MySQL startup, then the the performance schema can't be used at all until mysqld is restarted with that option. Such a restart may be costly in terms of downtime, and it may also cause MySQL to behave differently for hours (cold cache) until the original performance issue reappears. Hence the desire to be able to always have this enabled, but without costing too much overhead until it's needed and active.

Examining Overhead

I'll use DTrace to take a quick peek at the overhead caused by the performance schema on MySQL 5.5.31. I started by profiling stack traces (dtrace -n 'profile-997 /pid == PID/ { @[ustack()] = count(); }'), but had too much data to dig through quickly. Instead, I've picked a simple function to examine how it runs with and without the performance schema, by tracing every sub-function that it calls. This is set_query_id() from sql/

3505 void THD::set_query_id(query_id_t new_query_id)
3506 {
3507   mysql_mutex_lock(&LOCK_thd_data);
3508   query_id= new_query_id;
3509   mysql_mutex_unlock(&LOCK_thd_data);
3510 }

The performance schema defaults include instrumentation for locks, so this should operate differently when the performance schema is enabled.

I'll use the following DTrace script to trace the execution of the function:

# cat -n set_query_id.d 
     1  #!/usr/sbin/dtrace -Fs
     3  pid$target::*set_query_id6*:entry
     4  {
     5          self->ok = 1;
     6  }
     8  pid$target:::entry,
     9  pid$target:::return
    10  /self->ok/
    11  { ; }
    13  pid$target::*set_query_id6*:return
    14  /self->ok/
    15  {
    16          exit(0);
    17  }

Please do not run this in production, as it traces over 49,000 probes! Dimitri made this point very well as a danger of using DTrace: if you don't know what you are doing, you could cause significant performance loss - much more so than with the MySQL performance schema. The script above limits its overhead by exiting on line 16 as soon as it has traced a single flow of set_query_id(). I'm also not running this in production - I wrote this just to explore the execution of MySQL on a test server.

This script uses the DTrace flowindent option (-F) to indent the output based on the function depth. The target function name was traced using "*set_query_id6*" to match the correct function; the wildcards match the extra C++ symbol data and the '6' matched this function instead of another:

# dtrace -ln 'pid$target::*set_query_id*:entry' -p 20775
   ID   PROVIDER            MODULE                          FUNCTION NAME
75845   pid20775            mysqld       __1cDTHDMset_query_id6Mx_v_ entry
76201   pid20775            mysqld __1cEItemYreset_query_id_processor6MpC_b_ entry

Now for the results.

No Performance Schema

# ./set_query_id.d -p 8767
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION                                 
 14  -> __1cDTHDMset_query_id6Mx_v_
 14    -> mutex_lock
 14      -> mutex_lock_impl
 14        -> sigon
 14        <- sigon
 14      <- mutex_lock_impl
 14    <- mutex_lock
 14    -> mutex_unlock
 14      -> mutex_unlock_queue
 14        -> clear_lockbyte
 14        <- clear_lockbyte
 14        -> sigon
 14        <- sigon
 14      <- mutex_unlock_queue
 14    <- mutex_unlock
 14   | __1cDTHDMset_query_id6Mx_v_:return
 14  <- __1cDTHDMset_query_id6Mx_v_

7 functions were called (count the entries, "->") during set_query_id().

DTrace's flowindent option has made the output easy to follow. It's neat for short uses like this where the thread stayed on one CPU the whole time. If the thread switches CPU, then the output can be shuffled when DTrace prints the per-CPU buffers, and requires including a timestamp for post-sorting.

Performance Schema Enabled, consumers=YES, instruments=YES

Now I've restarted mysqld with the performance-schema option:

# ./set_query_id.d -p 20251
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION                                 
  6  -> __1cDTHDMset_query_id6Mx_v_           
  6    -> get_thread_mutex_locker_v1          
  6      -> pthread_getspecific               
  6      <- pthread_getspecific               
  6    <- get_thread_mutex_locker_v1          
  6    -> start_mutex_wait_v1                 
  6      -> __1cPget_timer_value6FnPenum_timer_name__X_ 
  6        -> my_timer_cycles                 
  6        <- my_timer_cycles                 
  6      <- __1cPget_timer_value6FnPenum_timer_name__X_ 
  6    <- start_mutex_wait_v1                 
  6    -> mutex_lock                          
  6      -> mutex_lock_impl                   
  6        -> sigon                           
  6        <- sigon                           
  6      <- mutex_lock_impl                   
  6    <- mutex_lock                          
  6    -> end_mutex_wait_v1                   
  6      -> __1cPget_timer_value6FnPenum_timer_name__X_ 
  6        -> my_timer_cycles                 
  6        <- my_timer_cycles                 
  6      <- __1cPget_timer_value6FnPenum_timer_name__X_ 
  6      -> __1cbBinsert_events_waits_history6FpnKPFS_thread_pnQPFS_events_w...
  6      <- __1cbBinsert_events_waits_history6FpnKPFS_thread_pnQPFS_events_w...
  6      -> __1cbGinsert_events_waits_history_long6FpnQPFS_events_waits__v_ 
  6        -> atomic_add_32_nv                
  6        <- atomic_add_32_nv                
  6      <- __1cbGinsert_events_waits_history_long6FpnQPFS_events_waits__v_ 
  6      -> __1cbLfind_per_thread_mutex_class_wait_stat6FpnKPFS_thread_pnPPF...
  6      <- __1cbLfind_per_thread_mutex_class_wait_stat6FpnKPFS_thread_pnPPF...
  6    <- end_mutex_wait_v1                   
  6    -> unlock_mutex_v1                     
  6    <- unlock_mutex_v1                     
  6    -> mutex_unlock                        
  6      -> mutex_unlock_queue                
  6        -> clear_lockbyte                  
  6        <- clear_lockbyte                  
  6        -> sigon                           
  6        <- sigon                           
  6      <- mutex_unlock_queue                
  6    <- mutex_unlock                        
  6   | __1cDTHDMset_query_id6Mx_v_:return    
  6  <- __1cDTHDMset_query_id6Mx_v_   

Many more functions are executed: those ending in "_v1" are from the performance schema additions. The extra code included calling my_timer_cycles() twice, to fetch high resolution timestamps to time the wait event, and functions to insert the event into the waits history.

Performance Schema Enabled, consumers=NO, instruments=YES

I read a couple of articles that seemed to suggest that setting CONSUMERS=OFF was how to dynamically disable the performance schema, even though INSTRUMENTS are still ON. ie, this:

mysql> update setup_consumers set ENABLED='NO';
Query OK, 8 rows affected (0.00 sec)
Rows matched: 8  Changed: 8  Warnings: 0

mysql> select * from setup_consumers;
| NAME                                         | ENABLED |
| events_waits_current                         | NO      |
| events_waits_history                         | NO      |
| events_waits_history_long                    | NO      |
| events_waits_summary_by_thread_by_event_name | NO      |

Here's what happens:

# ./set_query_id.d -p 20775
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION                                 
  1  -> __1cDTHDMset_query_id6Mx_v_           
  1    -> get_thread_mutex_locker_v1          
  1    <- get_thread_mutex_locker_v1          
  1    -> mutex_lock                          
  1      -> mutex_lock_impl                   
  1        -> sigon                           
  1        <- sigon                           
  1      <- mutex_lock_impl                   
  1    <- mutex_lock                          
  1    -> unlock_mutex_v1                     
  1    <- unlock_mutex_v1                     
  1    -> mutex_unlock                        
  1      -> mutex_unlock_queue                
  1        -> clear_lockbyte                  
  1        <- clear_lockbyte                  
  1        -> sigon                           
  1        <- sigon                           
  1      <- mutex_unlock_queue                
  1    <- mutex_unlock                        
  1   | __1cDTHDMset_query_id6Mx_v_:return    
  1  <- __1cDTHDMset_query_id6Mx_v_   

This is much better, but it's still calling more functions: 9 instead of 7. The additional functions are get_thread_mutex_locker_v1() and unlock_mutex_v1(). Turning off all consumers has not stopped all the performance schema code from firing.

Performance Schema Enabled, consumers=NO, instruments=NO

Now I'll turn off instruments as well, using:

mysql> update setup_instruments set ENABLED='NO';
Query OK, 214 rows affected (0.01 sec)
Rows matched: 214  Changed: 214  Warnings: 0

mysql> select * from setup_instruments;
| NAME                                                       | ENABLED | TIMED |
| wait/synch/mutex/sql/PAGE::lock                            | NO      | YES   |
| wait/synch/mutex/sql/TC_LOG_MMAP::LOCK_sync                | NO      | YES   |
| wait/synch/mutex/sql/TC_LOG_MMAP::LOCK_active              | NO      | YES   |

Resulting in:

# ./set_query_id.d -p 20775
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION                                 
  2  -> __1cDTHDMset_query_id6Mx_v_           
  2    -> mutex_lock                          
  2      -> mutex_lock_impl                   
  2        -> sigon                           
  2        <- sigon                           
  2      <- mutex_lock_impl                   
  2    <- mutex_lock                          
  2    -> mutex_unlock                        
  2      -> mutex_unlock_queue                
  2        -> clear_lockbyte                  
  2        <- clear_lockbyte                  
  2        -> sigon                           
  2        <- sigon                           
  2      <- mutex_unlock_queue                
  2    <- mutex_unlock                        
  2   | __1cDTHDMset_query_id6Mx_v_:return    
  2  <- __1cDTHDMset_query_id6Mx_v_  

That's better - we are back to 7 function calls. This would seem like a better way to have the performance schema ready but dynamically disabled. I don't know if there is a downside to disabling the instruments as well as the consumers in this way.

Not so fast...

Unfortunately, not all functions work like set_query_id(). Many others still call into the performance schema when consumers and instruments are OFF:

# dtrace -n 'pid$target::*_v1:entry { @[probefunc] = count(); }
    tick-10s { exit(0); }' -p 20775
dtrace: description 'pid$target::*_v1:entry ' matched 57 probes
CPU     ID                    FUNCTION:NAME
 10  75844                        :tick-10s 

  get_thread_file_descriptor_locker_v1                             39
  destroy_rwlock_v1                                               577
  init_mutex_v1                                                  4782
  init_rwlock_v1                                                17476
  signal_cond_v1                                               219790
  get_thread_cond_locker_v1                                    220468
  get_thread_rwlock_locker_v1                                  500894
  unlock_rwlock_v1                                             520604
  get_thread_mutex_locker_v1                                  1269483
  unlock_mutex_v1                                             1269485

That hundreds of thousands of these functions are still being called every second, despite trying to disable them, should go far to explain the disabled overhead that was reaching 10%.

DTrace can identify the reason why some of these are still being called by examining their stack trace:

# dtrace -n 'pid$target::unlock_mutex_v1:entry { @[ustack(5)] = count(); }' -p 20775
dtrace: description 'pid$target::unlock_mutex_v1:entry ' matched 1 probe




Given that the set_query_id() function was returned to its original state, I would hope that it may be technically possible to stop calling all or most of these additional performance schema functions, allowing the disabled overhead to approach 0%. Stack back traces like those above would be essential for this effort, as it lets the engineer immediately see where in the code still needs work.

I'll stop here; I just wanted to take a peek into performance schema overhead using DTrace, and found some interesting things that could help reduce this overhead. And that's a key difference between the technologies:

The performance schema can't trace DTrace.



The performance schema adds functions (many dynamically) to gather event data. There is a list of these in the performance schema source, storage/perfschema/

2042 PSI_v1 PFS_v1=
2043 {
2044   register_mutex_v1,
2045   register_rwlock_v1,
2046   register_cond_v1,
2047   register_thread_v1,
2048   register_file_v1,
2049   init_mutex_v1,
2050   destroy_mutex_v1,
2051   init_rwlock_v1,
2052   destroy_rwlock_v1,
2053   init_cond_v1,
2054   destroy_cond_v1,
2055   get_table_share_v1,
2056   release_table_share_v1,
2057   open_table_v1,
2058   close_table_v1,
2059   create_file_v1,
2060   spawn_thread_v1,
2061   new_thread_v1,

As a DTrace-guy, this is the next-best-thing to having a USDT provider (like the mysql provider). These add traceable functions at logical points and with meaningful names. Instead of having to wade through all the source looking for things to trace, I can start with these as they've already been added at interesting places.

I'll trace one as an example; here's the prototype for start_mutex_wait_v1():

1596 static void start_mutex_wait_v1(PSI_mutex_locker* locker,
1597                                 const char *src_file, uint src_line)

These functions provide meaningful arguments as well, in this case a char * for the source file and int for the line number. This makes it very easy to write powerful DTrace scripts, or one-liners:

# dtrace -n 'pid$target::start_mutex_wait_v1:entry {
        self->loc = strjoin(strjoin(copyinstr(arg1), ":"), lltostr(arg2));
        self->s = timestamp; }
    pid$target::start_mutex_wait_v1:return /self->s/ {
        @[self->loc, "(ns)"] = quantize(timestamp - self->s);
        self->s = 0; }' -p 22147
dtrace: description 'pid$target::start_mutex_wait_v1:entry ' matched 2 probes

  /export/home2/pb2/build/sb_0-3314961-1305135650.29/mysql-5.5.13/sql/  (ns)                                              
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@                                        405      
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        9377     
            8192 |@@@@                                     1191     
           16384 |@                                        216      
           32768 |                                         114      
           65536 |                                         63       
          131072 |                                         33       
          262144 |                                         5        
          524288 |                                         2        
         1048576 |                                         0        

  /export/home2/pb2/build/sb_0-3314961-1305135650.29/mysql-5.5.13/storage/innobase/buf/buf0buf.c:2766  (ns)                                              
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             27843    
            8192 |@@@@@@@@@@                               10287    
           16384 |@                                        955      
           32768 |@                                        532      
           65536 |                                         272      
          131072 |                                         154      
          262144 |                                         49       
          524288 |                                         19       
         1048576 |                                         1        
         2097152 |                                         0   

There I'm tracing mutex wait time in nanoseconds as a distribution plot, along with the filename and line number for the mutex lock. (And some pathname stuff that looks like it's from the build server for this MySQL distribution.) Here is line 2766 from buf0buf.c:

2766         buf_pool_mutex_enter(buf_pool);

According to the distribution, most of the time threads waited on this for between 4 and 16 microseconds, with one instance reaching over 1 millisecond.

Stack Traces

One of the easiest extensions is to add stack traces to show the code path that led to the event; here I've added ustack() as a key to the aggregation in the previous one-liner, resulting in:

  /export/home2/pb2/build/sb_0-3314961-1305135650.29/mysql-5.5.13/sql/  (ns)                                              

           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3261     
            8192 |@@@@@@@@                                 838      
           16384 |@                                        82       
           32768 |@                                        70       
           65536 |                                         16       
          131072 |                                         12       
          262144 |                                         1        
          524288 |                                         5        
         1048576 |                                         0 line 3486 is in set_query():

3484 void THD::set_query(const CSET_STRING &string_arg)
3485 {
3486   mysql_mutex_lock(&LOCK_thd_data);
3487   set_query_inner(string_arg);
3488   mysql_mutex_unlock(&LOCK_thd_data);
3489 }

But the stack trace above is showing dispatch_command(), not set_query(). I couldn't find documentation on this (other than the source code), but I think the start_mutex_wait_v1() function is showing us the locker - who has held the lock, whereas the DTrace ustack() is showing who is waiting on the lock. set_query() would be holding onto LOCK_thd_data and blocking dispatch_command().


You do need the performance schema enabled and running for DTrace to trace these functions. If the performance schema is off, DTrace still works, but you'll need to trace the original MySQL functions.

So far this looks of use for providing more context for lock, file and thread events. Here are all the PFS functions that are firing during my test:

# dtrace -n 'pid$target::*_v1:entry { @[probefunc] = count(); }' -p 23285
dtrace: description 'pid$target::*_v1:entry ' matched 56 probes

  delete_current_thread_v1                                          1
  destroy_cond_v1                                                   1
  init_cond_v1                                                     17
  new_thread_v1                                                    17
  set_thread_v1                                                    17
  destroy_rwlock_v1                                               195
  init_rwlock_v1                                                  212
  end_cond_wait_v1                                                685
  start_cond_wait_v1                                              685
  get_thread_cond_locker_v1                                       686
  broadcast_cond_v1                                              2366
  destroy_mutex_v1                                               4751
  init_mutex_v1                                                  4803
  end_file_wait_v1                                              11375
  start_file_wait_v1                                            11376
  get_thread_file_descriptor_locker_v1                          11377
  signal_cond_v1                                                14182
  end_rwlock_wrwait_v1                                          85181
  start_rwlock_wrwait_v1                                        85181
  end_rwlock_rdwait_v1                                         224720
  start_rwlock_rdwait_v1                                       224721
  get_thread_rwlock_locker_v1                                  309902
  unlock_rwlock_v1                                             379472
  end_mutex_wait_v1                                           1176697
  start_mutex_wait_v1                                         1176698
  get_thread_mutex_locker_v1                                  1176700
  unlock_mutex_v1                                             1176711

Most of these are lock-related functions, which I'm glad to have: lock issues in MySQL can get hard, and any additional context to help diagnose them can be helpful.

To investigate other behaviour, such as of the query cache and indexing, the raw mysqld functions can be traced. Or better, use the mysql provider if it's available.

I'd also expect more functions to be added to the performance schema over time; it's been designed with new versions in mind (see the block comment in

The Rest of the World

The other time to reach for a tool such as DTrace is whenever you'd like to see behaviour outside MySQL and relate it to MySQL context. For example, showing what code paths are causing page faults to grow memory (not just the malloc() reservations, which we can DTrace too if we like):

# dtrace -n 'vminfo:::as_fault /pid == $target/ { @[ustack()] = count(); }'
dtrace: description 'vminfo:::as_fault ' matched 1 probe



I sometimes use this approach to track down the reasons why applications are growing or leaking memory. The vminfo:::as_fault probe is in the kernel, in the virtual memory code.

Other system areas that can be useful to trace with MySQL include the TCP/IP stack, to examine connection behaviour including dropped connections and packets, and file system internals, to understand the origin of slow I/O. The DTrace book is full of ideas.


In this post I showed how DTrace could examine the performance of the performance schema itself, and how it could trace and extend the performance schema functions a little. DTrace does a lot more, given visibity into the rest of the system, which it can express in MySQL context as needed. I included an example of tracing address space faults by MySQL stack trace, which shows the code path causing memory growth.

That's not to say that DTrace is better than the performance schema - these are different tools. Application-level tools, like the performance schema, should be able to run at a lower enabled-overhead than DTrace (avoiding context switching for instrumentation), and should be safer for beginners to use (no accidentally tracing everything on the system and slowing the target). They can also be customized to keep data in a way that best suits the application. Try to use these tools first if they are available, and then DTrace to dig deeper.

Right now, I'm still analysing a lot of systems that don't have the performance schema (older MySQL) or don't have it enabled. Since DTrace is available, I'm able to use it to investigate issues immediately. I posted examples of this for on-CPU vs off-CPU time and file system latency.