A few days ago you may have noticed that your domains would not resolve for requests coming from in or around San Jose, California or Tokyo, Japan. The outage lasted roughly 1 hour and 30 minutes. During this period some customers were unable to resolve DNS queries in those regions. This post outlines what went wrong, what we've learned, and how we'll adjust to prevent outages like this from happening again.
At 16:43 UTC on Monday, January 18th, a scheduled maintenance window with CloudFlare (our DDoS mitigation provider) ended, and the changes inadvertently prevented requests routed to San Jose from reaching our name servers. As a result, at least one name server in the DDoS-defense provider's cluster was returning SERVFAIL errors for domains under our administration. Though our monitoring tests picked up on the failed resolutions, we unfortunately did not have alerts in place and did not learn of the issue until our customers began reporting problems around 17:11 UTC.
Why did it happen?
As with any maintenance window, there is always the possibility of something going wrong once the updates are rolled out. In this case, the "why" is less about the technical details behind the outage (because we don't know for sure ourselves) and more about our awareness and communication with our third-party providers. Typically we might see an email reminding us of a scheduled maintenance, or our monitoring software would alert us as soon as a problem occurs. Neither of those things happened, and because of that our customers experienced a disruption in their service.
How did we respond and recover?
Once customers writing in on Twitter alerted us to the outage, we immediately began to investigate. Several members of the team were already on a video call, which made it easy to brainstorm and work toward a solution. CloudFlare was one of the first places we checked. A quick browse of their status page showed that they had just completed a scheduled maintenance right around the time our monitoring software began picking up failed resolutions. Assuming that this was no coincidence, at 17:24 UTC I contacted their enterprise support to inform them of the issue. As we waited for a response, we kept busy by informing and responding to our customers with as much detail as we could, along with contacting some other third-party providers just in case it turned out to be unrelated to CloudFlare.
CloudFlare responded at 17:54 UTC and informed us that they had temporarily moved our routing to Los Angeles because something had gone wrong with their routing in San Jose. At 18:00 UTC our monitoring software reported that everything was back on track and that both ALIAS and A records were resolving correctly.
How might we prevent similar issues from occurring again?
This particular outage raised a lot of questions and concerns from the team here at DNSimple. For one, we need to make sure that we're using our monitoring software to its full potential, which means having alerts in place for any problem that may impact service for our customers. Another step is making sure that we're following the status updates issued by our third-party providers. Although this is a two-way street, we will do what we can to maintain an open line of communication so that we are not taken by surprise in a situation like this again.
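To make the alerting gap concrete, here is a minimal sketch of the kind of rule that turns failed resolution checks into a notification. It is an illustration only, not a description of our actual monitoring setup: the `ProbeResult` and `ResolutionAlerter` names, the region labels, and the threshold of three consecutive failures are all assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    region: str  # hypothetical probe location label, e.g. "san-jose"
    rcode: str   # DNS response code seen by the probe, e.g. "NOERROR" or "SERVFAIL"


class ResolutionAlerter:
    """Raise one alert per region after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # Per-region sliding window of the most recent pass/fail outcomes.
        self._recent = defaultdict(lambda: deque(maxlen=threshold))
        # Regions that have already alerted and not yet recovered.
        self._alerting = set()

    def record(self, result):
        """Record one probe result; return an alert message if newly failing, else None."""
        window = self._recent[result.region]
        window.append(result.rcode != "NOERROR")
        failing = len(window) == self.threshold and all(window)

        if failing and result.region not in self._alerting:
            self._alerting.add(result.region)
            return (
                f"ALERT: {result.region} failed {self.threshold} "
                f"consecutive DNS checks (last rcode: {result.rcode})"
            )
        if result.rcode == "NOERROR":
            # A successful check clears the alert so a later outage notifies again.
            self._alerting.discard(result.region)
        return None
```

Feeding it three `SERVFAIL` results in a row for one region returns a single alert message; further failures stay quiet until a successful check resets the region. Requiring several consecutive failures is a deliberate trade-off in this sketch: it delays the alert by a few check intervals in exchange for not paging on a single dropped query.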
Downtime is something we take very seriously at DNSimple, and I sincerely apologize for any negative effect this may have had on your systems. In some cases, outages are caused by something that is out of our hands, but we should still be aware of the situation, and of the steps we need to take to resolve it, within the first few minutes. I am optimistic that we've learned from this outage, and I hope that the steps we take to improve our tooling and communication will prevent another one like it from occurring in the future.