I would like to apologize to all customers affected by the outage in our Amsterdam (AMS) data center last week. We strive to maintain a solid, fast and reliable global DNS network, and we let you down when we did not deliver. Please know that we have been working, and will continue to work, to make our systems better and more resilient so that events like this become less and less likely in the future.
Last Thursday, February 20th, at approximately 14:00 UTC, we experienced an initial, short outage at our Amsterdam (AMS) data center. It lasted four minutes and affected some, but not all, of the eight name servers in that region. The outage was correlated with an inbound traffic spike.
At 15:03 we observed a similar pattern; this time, however, the system did not recover from the spike. This caused DNS resolution failures throughout Europe and in parts of Asia for those domains where we provide authoritative name service. Customers affected by the outage were also unable to access the DNSimple web site.
In both cases we were able to correlate the initial outage with what appears to be an unusually high volume of inbound requests.
Why did it happen?
Multiple factors appear to have been involved:
- A traffic spike of unknown origin and questionable purpose,
- a bottleneck in the DNSimple name server software,
- and what appears to be upstream resolution blocking.
Additionally, our desire to contain the incident to a single data center may have prolonged the outage.
How did we respond and recover?
Initially we attempted to stop and start one of the name servers. This failed because the server's outbound DNS queries for our own names were going to the same name servers that were already offline. We changed our configuration to use an IP address instead of a name. Normally we would deploy such a change with Chef, but at this point we were unable to resolve domains from these servers, so deployments were blocked.
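The circular dependency described above (name servers that need working DNS in order to start) can be checked for ahead of time. The following is a minimal sketch, not our actual tooling: the `depends_on_dns` helper, the `config` dictionary and its keys are all hypothetical, and the addresses come from documentation ranges. It flags configuration values that are host names rather than IP literals.

```python
import ipaddress
from urllib.parse import urlsplit


def depends_on_dns(endpoint: str) -> bool:
    """Return True if the endpoint's host is a name rather than an IP literal.

    Accepts a full URL or a bare host without a port.
    """
    host = urlsplit(endpoint).hostname if "://" in endpoint else endpoint
    try:
        ipaddress.ip_address(host)
        return False  # an IP literal needs no DNS lookup
    except ValueError:
        return True   # anything else has to be resolved first


# Hypothetical startup configuration for a name server host.
config = {
    "chef_server": "https://chef.example.com",
    "upstream_resolver": "192.0.2.53",
}

# Entries that would fail if our own name servers were offline.
risky = sorted(key for key, value in config.items() if depends_on_dns(value))
print(risky)
```

A check like this only finds the dependency; whether to replace a name with an address is still a judgment call, since hard-coded addresses carry their own maintenance cost.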
Initially we did not withdraw the data center routes because we were concerned the failure would cascade to other systems, widening the outage. The majority of the name servers were still operating but were unable to resolve ALIAS queries. This, in turn, was blocking the name servers from responding to any queries at all.
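To illustrate why a stalled ALIAS lookup can hold up everything else, and one general way to bound that damage, here is a minimal sketch. It is not the DNSimple name server code. It assumes a hypothetical `lookup` callable that resolves the ALIAS target upstream, and it swaps in a simple technique: a hard timeout on the upstream call with a fallback to the last known good answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class AliasResolver:
    """Resolve ALIAS targets without letting a slow upstream stall the caller."""

    def __init__(self, lookup, timeout=0.5, workers=4):
        self._lookup = lookup      # hypothetical: name -> list of addresses
        self._timeout = timeout    # longest a caller will ever wait, in seconds
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache = {}           # last known good answer per name

    def resolve(self, name):
        future = self._pool.submit(self._lookup, name)
        try:
            answer = future.result(timeout=self._timeout)
        except Exception:
            # Timeout or upstream error: serve the stale answer if we have
            # one, otherwise None, instead of blocking the query path.
            return self._cache.get(name)
        self._cache[name] = answer
        return answer


def healthy_upstream(name):
    return ["192.0.2.1"]


def stalled_upstream(name):
    time.sleep(0.3)  # simulates an upstream resolver that is not answering
    return ["192.0.2.1"]


resolver = AliasResolver(healthy_upstream, timeout=0.05)
fresh = resolver.resolve("alias.example.com")

resolver._lookup = stalled_upstream  # upstream starts failing
stale = resolver.resolve("alias.example.com")
```

In this sketch the caller waits at most `timeout` seconds. Stalled lookups still occupy worker threads until they return, so a production version would also need to cap or shed queued lookups.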
After exhausting several possible solutions, and after determining that inbound traffic did not appear to be at levels that would cause an escalation, we removed the data center from the routing tables.
We then began restarting the name servers within that data center. Once they were able to start properly, we verified that they were behaving as expected and returned the AMS data center routes to normal.
How might we prevent similar issues from occurring again?
- We will review our processes for deciding when to withdraw BGP routes. While we are not certain that removing the data center sooner would have avoided a more significant outage, we should consider this option as part of a clearly defined mitigation process.
- We have changed the ordering of system-level resolving name servers such that requests will be split between multiple upstream resolvers.
- We have changed our name server software to remove a bottleneck, especially in failure conditions, that was present in ALIAS lookups.
- We are preparing a new name server software release to provide an alternative to the current ALIAS caching mechanism. We are testing the effect of this change in a controlled environment before deploying it system-wide.
- We are researching alternative routing topologies with the goal of finding a better mix prioritizing both performance and reliability. This may include routing some nodes in a data center using one topology and other nodes in a different topology.
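As an illustration of splitting requests between multiple upstream resolvers, the sketch below rotates the starting resolver on every query and fails over to the next one on error. It is similar in spirit to the `rotate` option in `resolv.conf`, though it is not necessarily the mechanism we used. The `RotatingResolvers` class, the `query_with_failover` function and the `send` callable are hypothetical names, and the addresses come from documentation ranges.

```python
import itertools
import threading


class RotatingResolvers:
    """Hand out upstream resolvers in round-robin order."""

    def __init__(self, resolvers):
        if not resolvers:
            raise ValueError("at least one resolver is required")
        self._resolvers = list(resolvers)
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def order_for_next_query(self):
        # Each query starts at the next resolver, so load is spread evenly.
        with self._lock:
            start = next(self._counter) % len(self._resolvers)
        return self._resolvers[start:] + self._resolvers[:start]


def query_with_failover(rotation, send, name):
    """Try each resolver in rotated order until one of them answers."""
    last_error = None
    for resolver in rotation.order_for_next_query():
        try:
            return send(resolver, name)  # hypothetical: performs the DNS query
        except OSError as exc:
            last_error = exc             # resolver unreachable, try the next
    raise last_error


rotation = RotatingResolvers(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first_order = rotation.order_for_next_query()
second_order = rotation.order_for_next_query()
```

With a fixed ordering, every query goes to the first resolver until it times out; rotating the order means a single unhealthy upstream affects only a share of the queries.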