Monday, May 21, 2012

Worst bug ever: random and concurrent software, but the bug was somewhere else!

Today I finally pinpointed the worst bug I ever met. The bug itself manifested perhaps a week ago when I decided to plot the distribution of a given attribute in a simulator. Essentially, the distribution should have been a power-law but a different behavior occurred. Varying some parameters changed the shape of the curve, but not the fact that it was not a power law.ccdf.png

Here, in red the "expected curve", while the other three colors are the plots of the distributions I found (blue the original one, green and cyan variants with different parameters). The essential problem is that finding the exact cause of this was a hell. This kind of bugs is typically not found with unit-testing. In fact, the very problem is randomized and even mocking the random number generator does not really help. In fact, each individual event occurred apparently "right" and only the whole process showed a wrong behavior.

Since the simulation is concurrent, I spent the first days analyzing if the actual order of execution did somehow cause the problem (the red curve is obtained with a non concurrent model). Turns out that I did things right: event order did not influence the problem.

Eventually, I reviewed the drawing code (in the sense of urns, not in the sense of painting!). After a couple of days I figured out that perhaps I should have mathematically proved that the code implementing the draw had the same distribution of the "red" variant. My wrong assumption was that sampling a graph edge and taking one end-point was equivalent to an extraction from an urn where each node is placed n times, where n is its degree. Essentially the question is that even though the graph was undirected, the computer did ordered the edges end-points.

This is the distribution obtained taking the other endpoint:


ccdf.png

And here, eventually, what we got when I randomize between the two endpoints:

ccdf.png

I just have to rewrite the code implementing the drawing. And I also have to make it fast (my former procedure was quite fast, given the fact that the whole graph is distributed).

No comments: