Thursday, May 10, 2012

Euphoria after finding a hard-core bug

Just a rant about debugging sessions.
I have been writing some low-level code related to GPU driver and compiler stuff on Linux. Well, at some point, I ran some of my tests and bam, a bug.

Here starts the usual story, the one I think all programmers experience the hard way.

The bug does not appear on the HW simulator. Do not expect to use valgrind or anything like it. This is happening on the GPU. Woohoo... You are on your own: the hardware against you.

This is the story everybody who has to work on low-level stuff already knows:

Day 1: You find a bug just before leaving. Well, let's look at that tomorrow.

Day 2: OK, you think it is easy. You read the code, just trying to figure out what is happening. You think hard about every root cause you can imagine. So, you play with some of the parameters. Sometimes it crashes, sometimes it does not.
Then, you start to see that a dozen parameters each turn the crash on or off.
The horror really begins when every hypothesis you come up with is discarded by the next experiment.
Then, you know you are going to suffer. This is not an algorithm bug. This is a damn random corruption due to some buffer overrun or uninitialized value somewhere in the code.

Day 3: There is no way around it. You are not clever enough to consider all the symptoms together, find the disease and cure it. Well, you are not Dr House and your intellect is limited. So, you decide to go the brutal way: broad-spectrum antibiotics, chemotherapy, radiation therapy and amputation. You do not know what is going on, so you start butchering the code. One by one, you deactivate the subsystems.
1/ Do I overrun a buffer in the compiler? OK, compile the code offline, dump it to a file and reload it instead.
2/ Did I do something bad in the C++ pre-main code? OK, remove the cpp files you have and do without them.
3/ Is this structure buggy? OK, just do not use it and directly use the low-level pieces instead.
4/ Cool! You find an alternative way to make the code work. Now, progressively merge both versions of the source (the sick one and the healthy one).
532/ You now have a test case made of 10 C files that either works or fails.
1023/ After 10 hours of debugging, you find the bug... Right there. Plain stupid, so obvious but fucking well hidden.
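
For what it is worth, the "merge the sick one and the healthy one" step is basically a bisection, and it can be sketched in a few lines. This is only a rough illustration, not what I actually ran: the change names and the `fails` check are made up, and it assumes one single culprit and a test that fails reliably once the culprit is in. With a crash that comes and goes, `fails` would have to run the test many times before answering.

```python
from typing import Callable, Sequence


def first_bad_change(
    changes: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> str:
    """Return the change whose addition turns the healthy code into the sick one.

    Assumes fails([]) is False (healthy baseline), fails(changes) is True
    (sick version), and that the test keeps failing once the culprit is applied.
    """
    if fails([]) or not fails(changes):
        raise ValueError("need a passing baseline and a failing full set")

    # Invariant: the first `lo` changes pass, the first `hi` changes fail.
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(changes[:mid]):
            hi = mid
        else:
            lo = mid
    return changes[hi - 1]


# Hypothetical differences between the healthy and the sick source trees.
changes = ["offline_compile", "no_premain", "raw_structs", "ring_buffer_size"]


def fails(applied: Sequence[str]) -> bool:
    # Stand-in for "build with these changes and run the test on the GPU".
    return "raw_structs" in applied


culprit = first_bad_change(changes, fails)
print(culprit)
```

Each round halves the number of suspects, so even a few hundred differences between the two versions only take a handful of builds. The painful part is still getting a test that fails on demand.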

This is just so good. You do not even need to implement the fix. You just *know* it works. All symptoms make sense. One bug that just explains everything.

Finding a bug is so much like diagnosing a disease.