
Incident Report: Amsterdam Data Center DNS Failure

Anthony Eden

I would like to apologize to all customers affected by the outage in our Amsterdam (AMS) data center last week. We strive to maintain a solid, fast, and reliable global DNS network, and we let you down when we did not deliver. Please know that we have worked, and will continue to work, to make our systems better and more resilient so that events like this become less and less likely in the future.

What Happened?

Last Thursday, February 20th, at approximately 14:00 UTC, we experienced an initial, short outage at our Amsterdam (AMS) data center. It lasted four minutes and affected some, but not all, of the eight name servers in that region. The outage was correlated with an inbound traffic spike.

At 15:03 we observed a similar pattern; however, this time the system did not recover from the spike. This caused DNS resolution failures throughout Europe and in parts of Asia for domains where we provide authoritative name service. Customers affected by the outage were also unable to access the DNSimple website.

In both cases we were able to correlate the initial outage with what appears to be an unusually high volume of inbound requests.
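
For readers who want to see what this kind of failure looks like from the outside, the sketch below queries a specific authoritative name server directly instead of going through a recursive resolver. It is a minimal example only, assuming the dnspython library is installed; the server address and domain are placeholders, not DNSimple infrastructure.

```python
# Minimal sketch: ask one authoritative name server directly whether it is answering.
# Assumes `pip install dnspython`; the IP and domain below are placeholders.
import dns.resolver
import dns.exception

NAME_SERVER_IP = "192.0.2.53"   # hypothetical authoritative server (TEST-NET address)
DOMAIN = "example.com"

resolver = dns.resolver.Resolver(configure=False)  # ignore the system resolver config
resolver.nameservers = [NAME_SERVER_IP]
resolver.lifetime = 5.0  # give up after 5 seconds

try:
    answer = resolver.resolve(DOMAIN, "A")
    print(f"{NAME_SERVER_IP} answered for {DOMAIN}: {[r.address for r in answer]}")
except dns.exception.Timeout:
    print(f"{NAME_SERVER_IP} did not respond within 5s -- resolution failure")
except dns.exception.DNSException as error:
    print(f"{NAME_SERVER_IP} returned an error for {DOMAIN}: {error}")
```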

Why did it happen?

Multiple factors appear to have been involved, most notably the unusually high volume of inbound requests described above and the dependency of ALIAS resolution on our own name servers described below.

Additionally, our desire to contain the incident to a single data center may have prolonged the outage.

How did we respond and recover?

Initially we attempted to stop and start one of the name servers. This failed because outbound DNS queries for our own names could not be resolved: those lookups used the same name servers that were already offline. To break the circular dependency, we changed our configuration to use an IP address instead of a host name. Our deployment process normally runs through Chef, but because these servers could not resolve domains at that point, deployments were blocked as well.
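
The sketch below illustrates the kind of change involved: instead of depending on a host name that our own, offline name servers would have had to resolve, the service endpoint is pinned to an IP address. The names and addresses are hypothetical, not our actual configuration.

```python
# Minimal sketch of pinning a service endpoint to an IP address so that reaching it
# does not depend on DNS resolution. All names and addresses are placeholders.
import socket

SERVICE_HOST = "internal-api.dnsimple.example"  # hypothetical internal host name
SERVICE_IP = "198.51.100.10"                    # hypothetical pinned address (TEST-NET)

def service_address(prefer_ip: bool = True) -> str:
    """Return the address to connect to, skipping the DNS lookup when prefer_ip is set."""
    if prefer_ip:
        return SERVICE_IP
    # Normal path: resolve the name, which fails if the authoritative servers are down.
    return socket.gethostbyname(SERVICE_HOST)

# During the incident, the configuration was switched to the IP-based form.
print(service_address(prefer_ip=True))
```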

Initially we did not withdraw the data center's routes because we were concerned the failure would cascade to other systems, widening the outage. The majority of the name servers were still operating, but they were unable to resolve ALIAS queries. This, in turn, blocked those name servers from responding to any queries at all.
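
For context, an ALIAS record is answered by resolving its target name at query time and returning the resulting addresses under the record's own name. The sketch below, which assumes the dnspython library and uses placeholder names rather than our implementation, illustrates that dependency: if the lookup of the target fails, the server has nothing to put in the answer.

```python
# Minimal sketch of ALIAS-style resolution, assuming `pip install dnspython`.
# Domain names are placeholders; this is not DNSimple's implementation.
import dns.resolver
import dns.exception

def resolve_alias_target(target: str) -> list[str]:
    """Resolve the ALIAS target to A records; raises if upstream resolution fails."""
    answer = dns.resolver.resolve(target, "A")
    return [rdata.address for rdata in answer]

def answer_alias_query(owner: str, target: str) -> list[tuple[str, str]]:
    """Build the answer for an ALIAS query: the owner name paired with the target's addresses."""
    try:
        addresses = resolve_alias_target(target)
    except dns.exception.DNSException:
        # If the target cannot be resolved -- e.g. the resolvers it depends on are
        # offline -- there is no data to answer with.
        raise RuntimeError(f"cannot answer ALIAS query for {owner}: target {target} unresolvable")
    return [(owner, address) for address in addresses]

try:
    print(answer_alias_query("example.com", "lb.example-cdn.net"))
except RuntimeError as error:
    print(error)
```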

After exhausting several possible solutions, and after determining that inbound traffic did not appear to be at levels that would cause an escalation, we removed the data center from the routing tables.

We then began restarting name servers within the affected data center. Once they started properly, we verified that they were behaving as expected and returned the AMS data center's routes to normal.
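
A sketch of that verification step, assuming dnspython and using placeholder server addresses and a placeholder test domain: probe each restarted name server with a test query and only restore routes once all of them answer.

```python
# Minimal sketch: check that every name server in the data center answers a test
# query before restoring its routes. Assumes `pip install dnspython`; addresses
# and the test domain are placeholders.
import dns.resolver
import dns.exception

AMS_NAME_SERVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]  # hypothetical addresses
TEST_DOMAIN = "example.com"

def server_is_healthy(ip: str) -> bool:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    resolver.lifetime = 3.0
    try:
        resolver.resolve(TEST_DOMAIN, "A")
        return True
    except dns.exception.DNSException:
        return False

if all(server_is_healthy(ip) for ip in AMS_NAME_SERVERS):
    print("All AMS name servers answering; safe to restore routes.")
else:
    print("At least one name server is still failing; keep routes withdrawn.")
```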

How might we prevent similar issues from occurring again?



Anthony Eden

I break things so Simone continues to have plenty to do. I occasionally have useful ideas, like building a domain and DNS provider that doesn't suck.