Improve your Dev and Ops skills with Troubleshooting Theory
Years before joining DNSimple, I worked for about 3.5 years at the Genius Bar in an Apple Retail store, diagnosing and repairing everything from iPods to Xserve systems. My biggest takeaway from that time was Troubleshooting Theory, which has helped me through many complex problems as both a developer and a system administrator here at DNSimple. I'll explain the troubleshooting process step by step, then close with some real-world examples of how you can apply it to development, server administration, and hardware repair.
The steps to troubleshooting are generally the following:
- Gather information
- Isolate the problem
- Plan a solution
- Implement the solution
- Test the solution
Keep in mind there are variations to this model, but they generally follow the same overall steps. I've also simplified them for my needs and this blog post, but I'm confident they'll be helpful for you too.
First Step: Gather Information
When you are presented with a problem, try to obtain as much information about it as possible, such as:
- Symptoms (What is happening that appears abnormal?)
- Reproduction Steps (Can you re-create the problem? How?)
- Anything that happened leading up to the problem ("We were restarting the server when…")
- Areas that are affected, or most likely to be affected (Is it a file of code, maybe a set of methods in your code? The network card on a server?)
Sometimes the smallest details can be the clue to locating the problem, then implementing and testing a solution. If you're troubleshooting the issue with another person, ask probing questions to gather as much detail as possible. If you're troubleshooting your own issue, consider asking those questions out loud or doing a little rubber ducking to gain some details.
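One way to keep those details organized is to write them down in a fixed shape. The following is a minimal sketch, assuming a hypothetical `ProblemReport` class whose fields mirror the four categories listed above; the class name, field names, and sample values are illustrative and not part of any tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemReport:
    """Hypothetical container for the details gathered in the first step."""

    symptoms: List[str] = field(default_factory=list)
    reproduction_steps: List[str] = field(default_factory=list)
    preceding_events: List[str] = field(default_factory=list)
    affected_areas: List[str] = field(default_factory=list)

    def missing_details(self) -> List[str]:
        """Return the names of the categories that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only.
report = ProblemReport(
    symptoms=["Web requests time out"],
    preceding_events=["We were restarting the server"],
)
print(report.missing_details())  # ['reproduction_steps', 'affected_areas']
```

The empty categories point at the probing questions still worth asking.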
Second Step: Isolate The Problem
Now that you have gathered information, you may already know where the problem is likely to be, especially if you are familiar with the affected system. Is it a file with some new code? A new network card you just installed? A firewall that was recently added? The more familiar you are with the system you are troubleshooting, the more you can use that knowledge to quickly isolate the issue and create a solution. The opposite can also be true: unfamiliarity with a system can lead you to ask questions or try steps that someone familiar with it would not normally consider. There are two methods for isolating a problem:
- Linear (Top-down or Bottom-up)
- Half-Split (Also called a Split-Half)
Depending on constraints of time, tools, and sometimes budget, you may be forced down one or both of these paths to fully isolate the issue, and over time you'll get a feel for which method works best in a given situation. This is likely where you'll spend most of your troubleshooting time. If the fix is obvious or easy, you'll fly right past these in-depth approaches to finding and resolving the problem.
Linear Approach: Bottom-up
The bottom-up approach usually applies to physical systems such as servers or desktop computers. You suspect the problem lies in the hardware, so you start there and work your way up to the operating system and its software. Physical issues are usually easiest to troubleshoot when you start from a base system and add components back one at a time until the system is fully built. This method also applies to software if you're a developer working within a framework: the language or framework can be considered the bottom of the stack, with your code changes at the top. If package management makes it easy, changing or removing the language or framework is one way to isolate a potential issue.
It's helpful to know ways of skipping the bottom-up approach, as physical or low-level issues can be harder to diagnose. For example, you can use the ping command to test network connectivity; if it works, the problem is likely not the network card or any other physical issue. The same goes for frameworks: if you can run a hello world or use some part of the framework unrelated to your code, the framework is likely not the culprit.
Linear Approach: Top-down
The top-down approach is the opposite of bottom-up. It is simpler and generally what you might try first. It typically begins at the application or operating system layer and works down the stack towards the framework or hardware. For developers, try commenting out a line or two of code. System administrators can try things like accessing another website, setting up another virtual host, or restarting a process, and work their way down from there. The top-down approach can follow the OSI model, where you begin at layer 7 and work your way down to layer 1, especially if the issue is network related. This could lead you to test basics like DNS resolver settings, firewall configurations, and network settings to identify the issue.
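Both linear directions can be thought of as walking the same ordered list of checks from opposite ends. Here is a minimal sketch, assuming a hypothetical `first_failure` helper and stand-in checks that simply return fixed values; real checks might shell out to ping, query a DNS resolver, or request a page from the web server.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check], top_down: bool = True) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails.

    `checks` is listed from the top of the stack (application) to the
    bottom (hardware). Pass top_down=False to walk it bottom-up instead.
    Returns None when every check passes.
    """
    ordered = checks if top_down else list(reversed(checks))
    for name, check in ordered:
        if not check():
            return name
    return None


# Hypothetical stand-ins: each lambda pretends to be a real probe.
checks = [
    ("application responds", lambda: False),
    ("DNS resolves", lambda: False),
    ("firewall allows traffic", lambda: True),
    ("network link is up", lambda: True),
]

print(first_failure(checks, top_down=True))   # application responds
print(first_failure(checks, top_down=False))  # DNS resolves
```

In this made-up scenario the top-down walk stops at the visible symptom, while the bottom-up walk stops at the lowest broken layer, which is often closer to the cause.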
Half-Split (or Split-Half)
If you've got some experience under your belt with the system you are troubleshooting, whether it's code or a physical system, you can opt for the Half-Split approach. This is sometimes called "Divide-and-Conquer" because you use your experience with the system to make an educated guess as to where the problem is most likely to be, then work up or down the stack from there. This lets you set aside areas you assume are unrelated. Sometimes that assumption is wrong, but the approach can save valuable time in reaching the root cause and, ultimately, a solution.
Maybe it's not the code you just wrote but some supporting code in another file, so try commenting out code not related to your issue. For system administrators, maybe you already know the problem is with a certain set of services, so you can shut off unrelated services or configuration options to get to the problem quickly. You might confirm it's a physical problem by booting from another drive with a known-good operating system, which eliminates anything on the software side, and go from there.
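When there is no strong hunch to start from, one mechanical form of half-splitting is bisection: test the midpoint of an ordered set of changes and discard the half that cannot contain the fault. This is the same idea behind `git bisect`. The sketch below assumes a hypothetical `first_bad` helper and an `is_good_after` callback that you would supply; the change names are made up.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(changes: Sequence[T], is_good_after: Callable[[int], bool]) -> T:
    """Return the first change that breaks the system, found by bisection.

    Assumes the system works with no changes applied, is broken with all
    of them applied, and stays broken once the bad change is in.
    `is_good_after(n)` reports whether the system works with only the
    first n changes applied.
    """
    low, high = 0, len(changes)  # works after `low` changes, broken after `high`
    while high - low > 1:
        mid = (low + high) // 2
        if is_good_after(mid):
            low = mid
        else:
            high = mid
    return changes[high - 1]


# Hypothetical change list; pretend the third change introduced the fault.
changes = ["add mocking library", "refactor parser", "bump framework",
           "new feature", "tidy tests"]
bad_index = 2
print(first_bad(changes, lambda n: n <= bad_index))  # bump framework
```

Each probe halves the remaining candidates, so five changes need at most three checks instead of five.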
Third Step: Planning a solution
Now that you have isolated the problem, it's time to plan out a solution. Some solutions can have minor impact while others can have major, catastrophic impact. How many users will be affected by the change? Does the change risk data loss? Perhaps this will change the method signature and break other code? Measuring the estimated impact can help drive the solution and even help you in breaking apart the steps of implementing a solution to avoid a major impact.
Consider setting up a test environment to plan out any possible solution. For code, this can be a branch to spike a solution to the problem. For hardware, acquiring a similar system and reproducing the issue there lets you explore solutions safely as well.
Fourth Step: Implementing a solution
Now that you've planned out a solution, it's time to execute that plan. Avoid deviating from your plan, since improvised changes can have unintended consequences. If something does come up, start back at the beginning and fill in the pieces that were missing from your initial conclusion now that you have more information. When in doubt, go back a few steps and check all of your work because, as I alluded to before, the devil is in the smallest of details.
Final Step: Testing your solution
Your solution is in place, but did it work? Check to see if your original issue is now resolved. Has any new issue arisen as a result of your involvement? Did modifying that line of code break a different test just as it made the one you were focusing on pass? Did replacing a network card allow the system to talk to the internet again? Testing thoroughly and reviewing the problem that got you this far in the first place is key to making sure your solution was accurate and effective.
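For code, one way to answer both questions at once is to compare test results from before and after the change. This is a minimal sketch, assuming a hypothetical `compare_runs` helper and results already collected as dictionaries mapping test names to pass or fail; the test names are invented.

```python
from typing import Dict, List, Tuple


def compare_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Return (fixed, newly_broken) test names between two runs.

    Each dict maps a test name to True (passed) or False (failed).
    """
    fixed = sorted(t for t, ok in after.items() if ok and before.get(t) is False)
    newly_broken = sorted(t for t, ok in after.items() if not ok and before.get(t) is True)
    return fixed, newly_broken


# Hypothetical results: the fix made one test pass but broke another.
before = {"test_login": False, "test_logout": True, "test_signup": True}
after = {"test_login": True, "test_logout": False, "test_signup": True}

print(compare_runs(before, after))  # (['test_login'], ['test_logout'])
```

A non-empty second list means the solution is not finished, even though the original issue is resolved.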
Now that I've given you the steps, let me generalize some examples of how these approaches would work.
Software Debugging example
Let's say you have a line of Ruby code that is suddenly misbehaving and you're not sure why. You might have been working on a new feature with some new mocking in your tests, which depends on a new library. When you run the tests, however, you discover a line of code in the backtrace that doesn't quite make sense to you, so you investigate.
Taking the linear approach from the top down, which generally makes the most sense for code, you start by commenting out the newest lines of code. From there you can dig deeper if the problem doesn't change or resolve itself. If you're using source control such as Git, you might then stash your changes and re-run your tests to see if they pass again. Working from the top down until you find the point where the tests were passing can help greatly, but it can be a time-consuming process depending on how many changes are present.
Going the split-half route means you might already have a hunch that your new mocking library could be introducing conflicts in the code. As a starting point, you'll temporarily eliminate its presence to see if the newer backtrace leads you to the answer. Depending on the size of your change set, this could be trivial or time-saving, though sometimes more difficult to pull off when you have newer libraries being introduced into the mix.
Planning, applying, and testing the solution
Through your investigation, you realize the fault lies in the way you're using the new testing framework, which is giving you false-positive assertions. After carefully checking the documentation and source code, you have a better understanding of how to use the framework and write a new change to the code. The new tests now pass and you're able to make a fresh commit to the source. On to the next feature!
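The scenario above is about Ruby, but the same kind of false positive is easy to reproduce in other languages. As an analogous illustration in Python's standard `unittest.mock` module, the sketch below assumes a hypothetical `Mailer` class and `notify` function: a bare `Mock` accepts any method call, so a test can pass even though the real object has no such method, while `Mock(spec=...)` restricts the mock to the real interface.

```python
from unittest.mock import Mock


class Mailer:
    """Hypothetical collaborator with a single real method."""

    def deliver(self, address: str) -> bool:
        return True


def notify(mailer, address: str) -> None:
    # Bug: the real Mailer has no `send` method, only `deliver`.
    mailer.send(address)


# A bare Mock accepts any attribute, so the buggy call appears to work.
loose = Mock()
notify(loose, "user@example.com")
loose.send.assert_called_once_with("user@example.com")  # false positive

# A Mock restricted to Mailer's interface exposes the bug.
strict = Mock(spec=Mailer)
try:
    notify(strict, "user@example.com")
    caught = False
except AttributeError:
    caught = True

print(caught)  # True
```

Here the fix to the test setup, not to the assertion, is what turns a misleading pass into a useful failure.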
Hardware Debugging example
You've just bought the newest, sweetest gaming rig known to geek kind, but in the middle of a game something goes horribly wrong and the system simply crashes. Getting back into the operating system seems impossible all of a sudden. Did you get infected with a virus somehow? Maybe the video card overheated?
Linear Approach: Top-down
Not knowing much about the new system, your first instinct may be to roll back the operating system to a previous restore point. Maybe you'll start with a fresh install? Sadly, you might lose that game save if you weren't careful about backing up. After digging deeper and finding that the operating system reinstall didn't do you any favors, your next step is to start checking the hardware as a possible culprit.
Linear Approach: Bottom-up
You're certain the operating system is totally in working order and the code is perfect, so that shiny new hardware must be at fault, right? You start by disconnecting the hard drive and stripping the system down to the minimum configuration needed to pass the Power-On Self Test (POST). From here you can use focused diagnostic tools to keep the operating system out of the mix and make sure the hardware is sound.
Being a handy gamer, you happen to have a Linux live disc on a thumb drive that you have used before and know works. You can use it to boot up quickly and see whether that works, possibly getting you closer to knowing whether it's a hardware or software issue. Assuming your rig boots, you then reach for the built-in memory and math tests on the Linux live disc to run some quick RAM and CPU checks and eliminate another major source of errors. This narrows down the candidate sources of your issue dramatically in a matter of minutes.
Planning, applying, and testing the solution
After examining all the possibilities you ultimately find out the new RAM you purchased has a bad block that cannot accept data correctly. Initially you can attempt re-seating the memory to see if it's just not installed correctly, but you also plan to purchase new memory if that is not the case. After installing new memory, you re-test with a memory testing program to verify your plan worked and you can get back to gaming bliss.
I hope this post has given you some new perspective on troubleshooting.
You may have already been doing this, or some form of it, without knowing there were general rules of thumb out there. Remember that there is really no wrong approach, and take each investigation of a problem as a lesson for the future. Sometimes you can speed up the process of finding a solution, while other times it can take days, weeks, or perhaps years to find the culprit.
Take your time, ask questions, and if possible share your process with your team to help them grow too.