On Saturday, April 6th, at around 7:49 AM UTC, NS2 (name server 2) stopped responding to queries. At 10:32 AM UTC, NS3 stopped responding to queries as well. During the outage, NS1 and NS4 picked up the additional traffic. This abnormal load caused a portion of ALIAS resolutions to fail, directly affecting DNSimple customers.
We received notification that there were issues resolving queries immediately after the first resolver began failing. However, the person on call did not understand how severe the issue was, so it was not escalated. As this happened in the middle of the night in North America, the next engineer to observe the problem wasn't online until several hours later.
Complex systems fail in interesting ways, and the failure generally takes the form of several interdependent pieces failing simultaneously. This is not unexpected. What is expected is that, as a team, we have the responsibility, knowledge, and practice to respond quickly and recover in a calm and determined fashion. Unfortunately, this incident demonstrated that our team wasn't communicating around alerts effectively. I apologize for failing to focus on this most important aspect.
When we ran our software on hosts with many cores, the monit check no longer detected the case we were targeting. I noticed this and made an incorrect assumption about how monit behaved. I then changed the command logic in a way I hoped would improve the situation. Unfortunately, the new logic silently failed to take any action instead.
When NS2 blocked, it added pressure to the remaining hosts. When NS3 blocked, the pressure increased again. NS1 and NS4 continued to respond normally to most queries, but the additional processing requirements of ALIAS records resulted in a portion of those queries being dropped.
Once Anthony came online and discovered the issue, he began looking into its source. He found processes on both NS2 and NS3 using 100% of the CPU, which was blocking the name servers from responding to further requests. He restarted the processes on NS2 and NS3, which allowed the name servers to begin resolving queries again. Once NS2 and NS3 were back online, no additional failures were reported for NS1 and NS4.
I eventually realized that monit calculates CPU as a multiple of the number of cores. We now automatically configure monit based on the number of cores on a given host. I also recalled the particulars of how monit executes a command and corrected the incomplete logic accordingly. I have a high degree of confidence in the current monit configuration, and behavior appears to have stabilized.
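To make the core-count scaling concrete, here is a minimal sketch of how a per-host monit check could be generated. It is not our actual configuration tooling. It assumes, following the description above, that the CPU percentage monit compares against tops out at 100% times the number of cores; monit releases may differ on this, so verify the scale your version uses before relying on it. The function names, the process name, the pidfile path, and the restart script are all hypothetical.

```python
import os


def cpu_threshold_percent(cores: int, per_core_percent: float = 90.0) -> int:
    """Scale a per-core CPU threshold to a whole-host value.

    ASSUMPTION: monit's CPU figure ranges from 0 to 100 * cores, as
    described in this post. Check the behavior of your monit version.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return round(per_core_percent * cores)


def render_monit_check(name: str, pidfile: str, restart_cmd: str,
                       cores: int, cycles: int = 5) -> str:
    """Render a monit process check whose CPU threshold tracks core count."""
    threshold = cpu_threshold_percent(cores)
    return (
        f"check process {name} with pidfile {pidfile}\n"
        f"  if cpu > {threshold}% for {cycles} cycles "
        f"then exec \"{restart_cmd}\"\n"
    )


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    print(render_monit_check(
        name="nameserver",
        pidfile="/var/run/nameserver.pid",
        restart_cmd="/usr/local/bin/restart-nameserver",
        cores=os.cpu_count() or 1,
    ))
```

Deriving the threshold from the detected core count, rather than hard-coding it, is what keeps the same check meaningful on both small and large hosts.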
As a team we have clarified our on-call escalation policy:
Finally, we are actively working on delivering re-architected DNS software that is markedly more resilient. That project is approaching production quality and we hope to write more about it in the near future.
I like shiny things (it says so on my blog). I keep the machines humming at DNSimple.