Debugging is an essential part of the development process. It is an inescapable reality that code will have errors, and sooner or later developers will have to investigate issues that seem unexplainable. Debugging is often referred to as an art, but I think it is much more a part of science, and a clear illustration of the Scientific Method.
I’ve sometimes seen developers who don’t know how to proceed when investigating issues, particularly if they’re young or at the beginning of their careers. At this stage, investigating a mysterious problem can be much more daunting than writing code in the first place. There are many unknowns, no signposts to follow, no algorithms to study or apply. Every bug is basically a new exploration, and we may feel like there isn’t a structured process we can follow.
In this document, I want to show that this is very much not the case. Debugging can be done entirely with a scientific mindset, the same approach that has been used throughout science for centuries to uncover the laws governing our world. Rather than an art, debugging can be a fairly accurate instance of the Scientific Method, and I would like to make that evident here.
The steps in the Scientific Method
The Debugging Process
There is a very close parallel between the Scientific Method and the steps we can rationally follow to lead a successful investigation of a software bug. This should not be surprising, as Science is, after all, a methodology to search for testable truth. I’ve listed the commonly recognized steps of the Scientific Method at the top of this section. In the rest of this article, I will show the phases of an investigation. They closely match the above steps, although in some cases I’ve merged two of them into a single phase.
Observe problems and ask WHY
The first step is noticing a problem: observing that something is not going as expected. Sometimes you don’t have an option, and a bug identified by someone else is dropped in your lap. But as developers, we have a duty to monitor our products, watch them running and note whether they’re behaving as we expect.
The most important thing at this stage is to observe things that don’t seem to conform to our expectations, and ask Why? This is the most powerful question in Science, and the essential beginning of the scientific mindset and of any investigation.
Gather Data for Problem Solving
As soon as you identify a question about something that needs fixing, your most immediate task is to gather all the information you can about it.
First, note the error message. This is the main clue. If the programmers have been competent, it will at least point you in the general direction of what is wrong. It may even point you to the right place in the code. Unfortunately, in many situations where the code has several abstraction layers, this may be just a surface warning. That is why you need to collect as much additional data as you can.
- Can you reproduce the error?
- When does the error happen?
- What are the conditions of the environment when the error shows? This includes networking conditions, disk space, configuration details, etc.
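Some of the environment conditions in the questions above can be recorded automatically each time the error shows up. The snippet below is only a sketch, assuming a Python codebase; the function name `environment_snapshot` and the fields it collects are illustrative choices, not a complete inventory of what may matter for a given bug.

```python
import json
import platform
import shutil
import sys
from datetime import datetime, timezone


def environment_snapshot(path: str = ".") -> dict:
    """Collect a few environment facts worth attaching to a bug report.

    Hypothetical helper: extend it with whatever is relevant to your system
    (configuration values, network state, dependency versions, ...).
    """
    usage = shutil.disk_usage(path)  # total, used and free bytes for `path`
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "free_disk_bytes": usage.free,
    }


if __name__ == "__main__":
    # Print the snapshot so it can be pasted next to the error message.
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving one snapshot per occurrence makes it easier to compare failing runs against working ones later on.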
If you can change and recompile the code, add printouts to critical variables. Make all the possibly relevant information visible and write it down. If not, try to change the logging level and re-run in order to get as much information as the developers have made available.
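As an illustration of both options, here is a small Python sketch using the standard `logging` module. The function `parse_port` and the logger name are hypothetical, made up for this example; the point is only that diagnostic messages stay in the code and become visible once the logging level is lowered to DEBUG.

```python
import logging

logger = logging.getLogger("debug_demo")  # hypothetical logger name


def parse_port(raw: str) -> int:
    """Hypothetical function under investigation."""
    # Log the raw input with %r so that hidden whitespace becomes visible.
    logger.debug("parse_port called with raw=%r", raw)
    port = int(raw.strip())
    logger.debug("parsed port=%d", port)
    return port


if __name__ == "__main__":
    # Lowering the level to DEBUG shows every diagnostic message.
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(levelname)s %(name)s: %(message)s",
    )
    parse_port(" 8080\n")
```

Unlike ad hoc printouts, these messages can remain in place and be switched off again by raising the level.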
Most importantly, try to make the error reproducible. I can’t stress this enough.
Printouts are generally useful, but rather old school. In complex code, with lots of loops or deep function calls, they can also be confusing. In these cases, you may be better off using breakpoints and live debugging. But even in this case, try to make a point of copying down relevant information about the state of important variables.
Formulate a Hypothesis
When you have enough data, hopefully you’ll start having some ideas as to what can be causing the problem. At this stage, you should formulate a hypothesis that might explain it. Maybe you’ll even have several.
It is possible that some of the data you’ve collected is irrelevant. Some may be unrelated and lead you up a wrong path. It takes experience and knowledge of the codebase to tell which is which. But at any rate, settle on a hypothesis that explains the error and all the relevant data you have collected.
Create a Plan for Testing the Hypothesis
Prepare to test the hypothesis. This depends on the actual problem. It may involve a code fix, or changes to the setup, or changes to the configuration. It may involve preventing some conditions in the environment.
In this stage, you should write down a list of actions that, according to your hypothesis, should prevent the error from occurring.
Testing your Hypothesis
You have your plan. Now it’s time to test it. Execute the changes you’ve set out to do and run the experiment again. This requires that, in step 2 (gathering data), you’ve actually found out how to make the issue reproducible. If you haven’t, you won’t be able to test the hypothesis.
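When the fix is a code change, one way to run the experiment is to encode the reproduction steps as an automated test. The sketch below assumes a hypothetical bug, an `average` function that used to crash on an empty list, and a hypothetical fix; neither comes from a real codebase.

```python
import unittest


def average(values):
    """Hypothetical function after the fix: an empty list yields 0.0."""
    if not values:
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_input_no_longer_crashes(self):
        # Hypothesis: the crash came from dividing by len([]) == 0.
        # Before the hypothetical fix, this call raised ZeroDivisionError.
        self.assertEqual(average([]), 0.0)

    def test_normal_input_unchanged(self):
        # The fix should not alter behaviour for ordinary input.
        self.assertEqual(average([1, 2, 3]), 2.0)


if __name__ == "__main__":
    unittest.main()
```

If the first test passes after the change and failed before it, the hypothesis has survived the experiment, and the test stays behind to guard against regressions.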
This is why “random errors” are so difficult to debug. The scientific method fails here.
Errors of this type usually include timing errors, concurrency issues and intermittent environment conditions (e.g. bad connections).
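In the limited case where the randomness comes from something the program controls, such as a pseudo-random generator, recording the seed can turn an intermittent failure into a repeatable one. The sketch below only simulates this: `flaky_operation` is a hypothetical stand-in that fails about 10% of the time, and real timing or concurrency bugs are generally not this easy to pin down.

```python
import random


def flaky_operation(rng: random.Random) -> None:
    """Hypothetical stand-in for a call that fails intermittently."""
    if rng.random() < 0.1:
        raise ConnectionError("simulated bad connection")


def failing_seeds(runs: int) -> list[int]:
    """Run the operation once per seed and return the seeds that failed."""
    failures = []
    for seed in range(runs):
        try:
            flaky_operation(random.Random(seed))
        except ConnectionError:
            failures.append(seed)
    return failures


if __name__ == "__main__":
    seeds = failing_seeds(1000)
    print(f"{len(seeds)} of 1000 runs failed")
    if seeds:
        # Re-running with a recorded seed reproduces that failure on demand.
        print("first failing seed:", seeds[0])
```

Running the operation many times also gives a failure rate, which is useful data in itself: a change that moves the rate noticeably is evidence for or against a hypothesis, even when a single run proves nothing.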
Analyse the Results
Once you’ve run your experiment again, what happened? Did the error still occur? Is the issue fixed, or improved?
If the bug is fixed, well done, you’ve successfully applied the scientific method to solve it!
If not, do not despair. The scientific method does not guarantee success and is an iterative and persistent approach.
Go back to step 2 and repeat. The new experiment and its results provide more data and will enable you to create a new hypothesis. Repeat this process until the problem goes away or is fully explained.
If you haven’t managed to solve the problem, but have managed to give a full explanation of why it happens, then you’ve made significant progress. Now, it’s time for Business and Engineering to decide how to tackle it. Perhaps there is a workaround. Perhaps the impact is low. Make a risk assessment, bring the results to the table, and let planning decide what to do with the issue. With any luck, it can even be discarded.
But as for your exploration work, that one is done.
Congratulations, you’ve successfully used the Scientific Method to explain and resolve a software bug!
This text was originally written and published for the Aventus blog.