The following is a cleaned up log I wrote for myself while debugging a bug. Writing a log while working helps me keep track of the debugging effort in case I’m interrupted (life, sleep, work, etc.). It also requires me to explain all findings to myself in fully spelled out sentences, making my thoughts considerably easier to follow.
In addition to serving as an example of a personal debug log, I hope it can be useful as an introduction to debugging segfaults or other low-level bugs.
How to reproduce
I have three computers running Ubuntu 11.04. Mopidy revision 9c23949 consistently crashes with a segfault on one of them when I do the following:
- Start Mopidy.
- Connect with an MPD client, e.g.
ncmpcpp. - Search for anything, e.g.
foo. - Wait less than 10 seconds for the segfault to happen. It always happens
directly after a log message from Mopidy’s Spotify backend stating that it
is
Updating metadata.
Rule out the obvious
I check that I’m actually using the latest versions of the important pieces of software, not just some old version I installed by hand and forgot about.
I try uninstalling pyspotify v1.1+mopidy20110405 installed from my own Debian package, and install the latest revision (c6e2a02) from Git instead. Still the same error.
Nobody has reported the same problem, even though both other developers and users should have been on the same mix of versions for the last couple of months.
Digging in
Luckily I can consistently reproduce the segfault in less than 20 seconds of work. I’m not too familiar with debugging C programs and especially the infamous segfaults, but I’m taking on this fight.
I expect the problem to be in pyspotify, as the rest of the code is either pure Python or more well-tested and broadly used libraries. Also the segfault consistently happens directly after a log message from Mopidy’s Spotify backend.
Antoine Pierlot-Garcin, the new main contributor to pyspotify, says to rebuild pyspotify with
CFLAGS="-g -O0". -g will include debug information that gdb will
understand in the resulting binary. -O0 will override the -O2 default
of the pyspotify build system, and turn off any optimizations to ease the
debugging.
Googling “debugging segfaults” yields a nice howto on debugging segfaults.
In summary: Get a core dump, inspect it with gdb.
Possible causes: “There are four common mistakes that lead to segmentation faults: dereferencing NULL, dereferencing an uninitialized pointer, dereferencing a pointer that has been freed (or deleted, in C++) or that has gone out of scope (in the case of arrays declared in functions), and writing off the end of an array.”
Rerunning Mopidy nothing more interesting happens, except the usual segfault. No core dump is produced.
Hum, how to get a core dump? I’ve done it before, but I can’t remember. Google helps.
I try ulimit -c 50000 to get a core dump of maximum 50MB, and then
reproduce the segfault.
Run python mopidy -v, connect with ncmpcpp, search for foo, wait less
than 10 seconds. Kaboom:
Loads the core dump up in gdb:
gdb complains that the core dump is truncated, and that is should be
approximately 180MB.
man bash and search for ulimit. Aha. I can specify unlimited. But
I’m not allowed to? Restart shell, run ulimit -c unlimited again. Works.
Rerun Mopidy to get an untruncated core dump.
Analyzing the core dump
Yay! gdb loads debug symbols from all the linked libraries that got them
available. Notably, the proprietary libspotify library does not include
debug symbols. So what does gdb say?
Verifying a hypothesis
According to the libspotify docs for the album subsystem
sp_album_add_ref takes an sp_album struct and increases the reference count
of the album. Looking back at the list of reasons for segfaults from the
segfault debugging howto, it may sound as we increase the reference count of
NULL, which obviously isn’t good. Let’s see if that hypothesis is correct…
Looking in the howto for ways to proceed, backtrace reveals the events
happening just before the segfault:
Moving one step up the stack, we get to the pyspotify code:
Our guess was that we’re increasing the reference count on an album that is
null. Let’s see what album actually was…
A pointer to an sp_album struct at the address 0x0, also known as NULL.
Hypothesis confirmed.
Finding a solution
Let’s take a look at the pyspotify code in question:
We create a pointer to a sp_album struct. Given a Spotify track, we request a
reference to the related album, and assign the result to our pointer. Then we
create an Album Python object, before we increase the reference count on the
sp_album and give the reference to the Python object. Finally, the Python
object is returned.
Lets try just returning NULL if we get NULL from sp_track_album:
If we rebuild pyspotify and try to reproducing the segfault, we now get a familiar Python traceback instead:
Note that we get a SystemError and not e.g. AttributeError: 'NoneType'
object has no attribute 'year' which we would expect if album() returned
None. This is because we return NULL from Track_album, which Python
considers an “error return”, without setting an exception code first. If we
want to return None on failure instead of throwing an exception, we can
change track.c as follows:
Rebuilding pyspotify again, and reproducing the error, we now get the expected Python error which we can handle nicely:
By this point, I consider the bug squashed and ready to be reported.
The only part left is to decide how to best fix it. E.g. if album() should
return None, throw an exception, or maybe return an empty album object.
That’s another story.