drspin

Matt Jacobson
June 2022

I'm working on a project (the subject, I'm hoping, of a future blog entry) that runs on FreeBSD, and I needed to profile its performance.

I'm accustomed to using sampling call-graph profilers like spindump, sample, Instruments, and Shark on macOS. They're useful because their overhead tends to be small and relatively uniform, and their memory usage is small. I'm sure I didn't do an exhaustive search, but I looked around a bit, and I couldn't find anything similar on FreeBSD. As far as I could find out, my options were:

Valgrind, which (effectively being a virtual machine) imparts a significant observer effect
The dtrace profile provider, which would only capture executing code, not off-CPU threads.^[1] (Also, I'd have to write a post-processor to turn the dtrace output into something usable.)
gprof, or its predecessors, which would require me to recompile all the code with instrumentation
Some other use of profil(2), which I'm not sure actually works these days
pmcstat, which I'm pretty sure would only capture executing code and which, more importantly, I couldn't figure out how to get working at all

So, after putting it off for a while (and repeatedly resorting to manual tracing), I decided to bite the bullet and write something myself. (I've put the source on GitHub.) So far, it's working great for what I need.

Sampling

There are two interesting parts to the profiler. First, the sampling mechanism. The best thing I knew to use was ptrace(2), which is the main mechanism used by standard debuggers. ptrace allows me to (1) suspend the target process and wait for the suspension to begin and (2) read target registers and memory.

For comparison, spindump and Instruments on macOS use kernel-built-in samplers that have some definite advantages. They avoid the overhead of making tons of syscalls (drspin makes one syscall per stack frame, plus one per thread, plus three extra, each sample). They are able to take samples of the entire system without fear of deadlock. They can even walk kernel stacks.
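To get a feel for that syscall cost, the arithmetic can be written out as a tiny Python sketch. The thread and frame counts below are made-up inputs for illustration, not measurements from drspin.

```python
def syscalls_per_sample(frames_per_thread):
    """One syscall per stack frame, plus one per thread, plus three extra."""
    total_frames = sum(frames_per_thread)
    thread_count = len(frames_per_thread)
    return total_frames + thread_count + 3

# Hypothetical target: 4 threads whose stacks are 30, 12, 12, and 8 frames deep.
example = syscalls_per_sample([30, 12, 12, 8])  # 62 frames + 4 threads + 3
```

Under those assumed numbers a single sample costs 69 syscalls, and that cost repeats for every sample taken.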

Unlike some of the alternatives I listed above, drspin—being a stack-walking sampler—does require code to be compiled without -fomit-frame-pointer. The performance gain from using the frame pointer as an extra register seems to be small in most cases; in my opinion, it's outweighed by the increased ease of profiling.
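A frame-pointer walk can be sketched roughly as follows. This is an illustration in Python, not drspin's actual code. It assumes the amd64 convention in which the frame pointer points at the saved caller frame pointer, with the return address in the next word. `read_frame_record` is a hypothetical stand-in for one ptrace memory read per frame, and the addresses in the fake stack are invented.

```python
def walk_stack(pc, fp, read_frame_record, max_frames=256):
    """Return [pc, return address, ...], innermost frame first.

    read_frame_record(fp) is a hypothetical callback that returns the pair
    (saved_fp, return_address) stored at fp, or None if fp is unreadable.
    """
    frames = [pc]
    while fp and len(frames) < max_frames:
        record = read_frame_record(fp)
        if record is None:
            break  # unreadable memory: stop rather than guess
        saved_fp, return_address = record
        if not return_address:
            break
        frames.append(return_address)
        if saved_fp <= fp:
            # The stack grows down, so a caller's frame must sit at a higher
            # address. Anything else means the chain ended or is corrupt.
            break
        fp = saved_fp
    return frames

# Simulated target memory: frame pointer -> (saved fp, return address).
fake_stack = {
    0x7000: (0x7020, 0x401100),
    0x7020: (0x7040, 0x401200),
    0x7040: (0, 0x401300),
}
trace = walk_stack(0x401050, 0x7000, fake_stack.get)
```

If a function is compiled with -fomit-frame-pointer, the register no longer points at such a record, so a walker like this either stops early or follows garbage. That is why the flag matters here.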

Symbolicating

The second interesting piece is the symbolicator. Initially, I was hoping that something like macOS's atos(1) existed, but I didn't find anything like that. My plan would have been to spawn it as a subprocess and "puppet" it, feeding it addresses through a pipe and reading back symbol names.

Not wanting to write my own atos equivalent, I had the idea to use the same technique with a command-line symbolicator that I knew did exist: lldb. When attached to a process, you can type the command:

(lldb) image lookup -a <address expression>

and lldb will symbolicate the address. The tricky part was knowing when I'd read all of the output for one command and could send the next, since lldb's symbolicator sometimes outputs multiple lines of data. I worked around this with a hack: after the image lookup command, I also send a dummy command, p (void)0. Since lldb echoes each command back to me, I can look for the echo of the dummy command to know when to stop reading. It's not the prettiest solution, to say the least, but it works!
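The sentinel trick can be sketched like this in Python. It is only an illustration of the idea, not drspin's code: the function name `lookup` is hypothetical, and the transcript fed to it is an invented stand-in for lldb's output, not a captured session.

```python
import io

SENTINEL = "p (void)0"  # the dummy command described in the text

def lookup(address, to_lldb, from_lldb):
    """Send one lookup plus the sentinel; collect output up to the sentinel's echo."""
    to_lldb.write(f"image lookup -a {address:#x}\n{SENTINEL}\n")
    to_lldb.flush()
    reply = []
    for line in from_lldb:
        line = line.rstrip("\n")
        if line.endswith(SENTINEL):
            break  # echo of the dummy command: this reply is complete
        if "image lookup" in line:
            continue  # echo of our own request, not part of the answer
        reply.append(line.strip())
    return reply

# Invented transcript (assumption), shaped like a prompt echoing each command.
fake_output = io.StringIO(
    "(lldb) image lookup -a 0x401044\n"
    "      Address: prog[0x401044]\n"
    "      Summary: prog`main + 4\n"
    "(lldb) p (void)0\n"
)
sent = io.StringIO()
reply = lookup(0x401044, sent, fake_output)
```

With a real subprocess, `to_lldb` and `from_lldb` would be the child's stdin and stdout pipes, for instance from `subprocess.Popen` opened in text mode.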

There are other downsides to the lldb-based symbolicator. It's slow, since marshalling requests and decoding replies adds a bit of overhead. Also, in order for lldb to symbolicate addresses, it needs to use ptrace to attach to the target. There can only be one "debugger" attached to a process at any given time, so that means that drspin has to detach first. In the interim, the process could go away, unload libraries, or otherwise change state.

So I ended up writing a second symbolicator that does the work by itself. Since drspin is already attached to the process with ptrace, the new symbolicator can read memory directly. It starts by finding the data structures set up by the dynamic linker that describe the loaded libraries and where they are in memory. Then, for each library it finds, the symbolicator initializes a rudimentary ELF parser that knows how to read the symbol table(s). This is all it needs to be able to symbolicate any of the addresses collected during sampling.
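The final lookup step can be sketched like this. It is a simplified Python illustration, not drspin's implementation. It assumes each library's symbols have already been parsed into (value, size, name) tuples whose values are relative to the library's load address. The class name, library name, addresses, and output format are all invented for the example.

```python
import bisect

class SymbolTable:
    def __init__(self, images):
        """images: list of (image_name, load_base, [(value, size, symbol_name), ...])."""
        entries = []
        for image_name, load_base, symbols in images:
            for value, size, symbol_name in symbols:
                # Assumption: the runtime address is load base + symbol value.
                entries.append((load_base + value, size, image_name, symbol_name))
        entries.sort()
        self._entries = entries
        self._starts = [entry[0] for entry in entries]

    def symbolicate(self, address):
        """Return "image`symbol + offset", or None if no symbol covers address."""
        # Find the last symbol that starts at or before the address.
        index = bisect.bisect_right(self._starts, address) - 1
        if index < 0:
            return None
        start, size, image_name, symbol_name = self._entries[index]
        if address >= start + size:
            return None  # falls in a gap after the nearest symbol
        return f"{image_name}`{symbol_name} + {address - start}"

# Hypothetical library loaded at 0x800200000 with two functions.
table = SymbolTable([
    ("libexample.so", 0x800200000, [(0x1040, 0x20, "bar"), (0x1000, 0x40, "foo")]),
])
name = table.symbolicate(0x800201044)
```

Sorting once and using binary search keeps each lookup cheap, which matters when every sampled frame address has to be resolved.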

Anyway, I was pretty happy with how easily it came together, and I'm glad to now have a familiar-feeling sampling call-graph profiler for my FreeBSD projects.

[1] A few years back, there was a push to switch to a new tracer/profiler tool within Apple. It came with a lot of advantages over existing tools; unfortunately, one downside compared to the tools it displaced was that it didn't profile off-CPU time well. This limitation, and the subsequent stubborn refusal of the tool's maintainers to even acknowledge it was a problem, was an enormous time-sink and source of frustration to those of us who understood it.