TIME on steroids
Nikodemus's recent post about a yet-to-be-released micro-benchmarking tool prompted me to talk about Perfpiece. I describe it as a tool for measuring the performance of Lisp code, not unlike the standard CL:TIME
and I wrote over two years ago to measure the performance of SBCL's garbage collector.
Features
Here are its main features:
Lispy interface to PAPI (Performance Application Programming Interface). PAPI is a library that enables access modern CPU’s hardware counters. This allows us to measure several events such as processor cycles, cache misses, number of floating-point instructions, and almost two hundred other events. Perfpiece dynamically inspects the current platform’s supported events at runtime and enables the user to inspect this list and measure these events. We also use this library to measure real (wall-clock) time and user (virtual) time.
Support for other non-PAPI events such as the number of GC runs, CPU usage, and operating system resource usage (via the
getrusage()
system call).Transparent support for multi-threaded programs. It includes a helper C library that when loaded through the POSIX
LD_PRELOAD
mechanism will preempt pthread creation/termination calls and allow for the individual measurement of events across threads created during a measurement session. This includes both Lisp threads as well as threads created by C code. This is rather limited at the moment; only real/user/cpu time is measured for new threads.Segregation of measurements between mutator and GC.
Support for sampling. Perfpiece can repeat a given a measurement a number of times then calculate and report basic statistic analysis: minimums, maximums, geometric means, and standard deviations for each measured event.
Usage
The simplest way to interact with this library is through the ascertain
macro, which works very much like cl:time
. The following example shows the default events measured for a very simple arithmetic form:
PERFPIECE> (ascertain (+ 1 1)) non-GC GC Total ──────────────────────────────────────────────────────────────────────────── Total cycles: 8,343 0 8,343 Instructions completed: 481 0 481 Level 2 data cache misses: 78 0 78 Level 2 instruction cache misses: 24 0 24 GC count: - - 0 Involuntary context-switches: 0 0 0 Voluntary context-switches: 1 0 1 Page faults: 0 0 0 Page reclaims: 6 0 6 System time: 0 0 0 CPU usage: 100.00% - 100.00% User time: 6.00 µs 0 6.00 µs Real time: 613 ns 0 613 ns 0 new threads were spawned 2 ; printed result of (+ 1 1)
Having loaded the helper library using LD_PRELOAD
, we can measure multi-threaded code:
PERFPIECE> (ascertain (loop repeat 2 do (sb-thread:join-thread (sb-thread:make-thread (lambda () (sleep 0.5))))) :events '(:real-time :user-time :cpu-usage)) non-GC GC Total ──────────────────────────────────────────────────────────────────────────── CPU usage: 0.02% - 0.02% User time: 206.00 µs 0 206.00 µs Real time: 1.335 s 0 1.335 s 2 new threads were spawned #0 real: 667.54 ms, user: 101.00 µs, cpu: 0.02% #1 real: 667.60 ms, user: 144.00 µs, cpu: 0.02%
The other main function is sample
. In the following example, we're measuring FP instructions, invoking some code 10 times, and aggregating measurements in several ways:
PERFPIECE> (sample (lambda () (* 2 pi)) :events '(:papi-fp-ins) :samples 10) [Floating point instructions] Min Max Mean Stddev ────────────────────────────────────────────────────────────────────────────── total: 35 36 35.60 ± 1.376% non-gc: 35 36 35.60 ± 1.376% gc-only: 0 0 0 ± 0.000%
sample
's got a :report
keyword argument that you can use to get machine-readable results:
PERFPIECE> (sample (lambda () (* 2 pi)) :events '(:papi-fp-ins) :samples 10 :report nil) ((:PAPI-FP-INS :MIN (35 35 0) :MAX (38 38 0) :MEAN (184/5 184/5 0) :STDDEV (1.0770329 1.0770329 0.0)))
Fork it!
That's pretty much it. There is some SBCL-specific code, but it shouldn't be too hard to port to other implementations. Its use of PAPI could be made optional since that makes this library pretty much Linux-only otherwise. Patches are most welcome!