On Saturday, April 6th, at around 7:49 AM UTC, NS2 (name server 2) stopped responding to queries. At 10:32 AM UTC, NS3 stopped responding to queries as well. During the outage NS1 and NS4 picked up the additional traffic. This abnormal load caused a portion of ALIAS resolutions to fail, directly affecting DNSimple customers.
We were notified of resolution issues immediately after the first name server began failing. However, the person on call did not understand how severe the issue was, so it was not escalated. Because this happened in the middle of the night in North America, the next engineer to observe the problem did not come online until several hours later.
Complex systems fail in interesting ways, and failure generally takes the form of several interdependent pieces failing simultaneously. This is not unexpected. What is expected is that, as a team, we have the responsibility, knowledge, and practice to respond quickly and recover in a calm and determined fashion. Unfortunately, this incident demonstrated that our team wasn't communicating around alerts effectively. I apologize for failing to focus on this most important aspect.
Why did it happen?
- NS3 moved to our new physical infrastructure recently in response to a hosting provider's network change. These hosts have many CPU cores.
- NS2 moved to a host with more CPU cores as a side effect of a system upgrade.
- We have an open issue with our ALIAS resolution software where, at a relatively low frequency, it blocks indefinitely.
- We use a monitoring program (monit) to detect when this condition is approaching and preemptively restart that part of the software. While less than ideal, this will suffice until we permanently solve the issue.
- The monitoring software detects this condition as a function of CPU utilization.
When we ran our software on hosts with many cores, monit no longer detected the case we were targeting. I noticed this and made an incorrect assumption about how monit behaved. I changed the command logic in a way I hoped would improve the situation. Unfortunately, the changed command silently failed to take any action instead.
When NS2 blocked, it added pressure to the remaining hosts. When NS3 blocked, pressure increased again. NS1 and NS4 continued to respond normally to most queries, but the additional processing requirements of ALIAS records resulted in a portion of those queries being dropped.
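To put rough numbers on that pressure, here is a small sketch. It assumes, purely for illustration, that queries are spread evenly across the name servers that are still responding. Real resolvers pick servers based on latency and retries, so the actual distribution during the incident was likely less tidy. The function name is hypothetical.

```python
def load_multiplier(total_servers: int, failed_servers: int) -> float:
    """Relative load per healthy server compared to normal (1.0 = normal).

    Assumes traffic is spread evenly across the healthy servers. This is an
    illustrative simplification, not a measurement from the incident.
    """
    healthy = total_servers - failed_servers
    if healthy <= 0:
        raise ValueError("no healthy servers left")
    return total_servers / healthy


if __name__ == "__main__":
    # Four name servers in total; zero, one, then two of them blocked.
    for failed in (0, 1, 2):
        multiplier = load_multiplier(4, failed)
        print(f"{failed} down: {multiplier:.2f}x normal load per remaining server")
```

Under that assumption, NS1 and NS4 would each have carried roughly twice their usual share of queries once both NS2 and NS3 had stopped responding.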
How did we respond and recover?
Once Anthony came online and discovered the issue, he began looking into its source. He found processes on both NS2 and NS3 using 100% of the CPU, which was blocking the name servers from responding to further requests. He restarted those processes on NS2 and NS3, which allowed the name servers to begin resolving queries again. Once NS2 and NS3 were back online, no additional failures were reported for NS1 and NS4.
How can we prevent similar unexpected issues from occurring again?
I eventually realized monit calculates CPU as a multiple of the number of cores. We now automatically configure monit based on the number of cores for a given host. I also recalled the particulars of how monit executes a command and corrected the incomplete logic appropriately. I have a high degree of confidence in the current monit configuration, and behavior appears to have stabilized.
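As a rough illustration of the core-count problem and the fix, the sketch below assumes the monitor reports a process's CPU as a share of the whole host, so a process spinning on one core of an N-core machine reads as 100/N percent. That model of monit's behavior, the 80% starting threshold, and the function names are all assumptions made for the example, not our production configuration.

```python
import os


def reported_cpu_percent(busy_cores: float, total_cores: int) -> float:
    """CPU reading for a process as a share of the whole host (assumed model)."""
    return 100.0 * busy_cores / total_cores


def scaled_threshold(single_core_threshold: float, total_cores: int) -> float:
    """Scale a threshold tuned for one core down to a host with total_cores."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return single_core_threshold / total_cores


if __name__ == "__main__":
    fixed = 80.0  # hypothetical threshold tuned on a single-core host
    for cores in (1, 4, 16):
        reading = reported_cpu_percent(1.0, cores)  # one core fully busy
        scaled = scaled_threshold(fixed, cores)
        print(
            f"{cores:>2} cores: reading={reading:.2f}% "
            f"fixed_fires={reading > fixed} scaled_fires={reading > scaled}"
        )
    # Threshold for the machine running this script; cpu_count() may be None.
    print(scaled_threshold(fixed, os.cpu_count() or 1))
```

With a fixed threshold, the check only fires on the single-core host. Dividing the threshold by the core count lets the same check catch one saturated core on any of the hosts, which is the idea behind configuring monit per host.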
As a team we have clarified our on-call escalation policy:
- If an alert is raised - always respond until it is resolved.
- If an alert doesn't have a solution linked directly and the fix is not immediately apparent - escalate the issue to a larger team. Make sure the solution is linked after the fact.
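The two rules above can be read as a small decision procedure. The sketch below is only one way to express them; the names `Action` and `next_action` are hypothetical and do not describe any tooling we run.

```python
from enum import Enum


class Action(Enum):
    DONE = "alert resolved, nothing further to do"
    APPLY_FIX = "keep responding and apply the fix"
    ESCALATE = "escalate to a larger team"


def next_action(resolved: bool, solution_linked: bool, fix_apparent: bool) -> Action:
    """Decide the on-call responder's next step for a raised alert."""
    if resolved:
        return Action.DONE
    # A linked solution or an obvious fix means the responder can proceed alone.
    if solution_linked or fix_apparent:
        return Action.APPLY_FIX
    # No linked solution and no obvious fix: bring in more people.
    return Action.ESCALATE


if __name__ == "__main__":
    # An unresolved alert with no runbook and no obvious fix gets escalated.
    print(next_action(resolved=False, solution_linked=False, fix_apparent=False).value)
```

The sketch leaves out the follow-up step from the second rule: once an escalated alert is fixed, the solution still needs to be linked to the alert so the next responder can act on it directly.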
Finally, we are actively working on delivering re-architected DNS software that is markedly more resilient. That project is approaching production quality and we hope to write more about it in the near future.