Incident Report: Amsterdam Data Center DNS Failure (April 14, 2020)

Anthony Eden

What happened?

On Tuesday, April 14th, 2020, we saw a significant increase in ALIAS resolution failures in our Amsterdam (AMS) data center. The incident started at 07:00 UTC with an increase in SERVFAIL responses for certain requests in the AMS region. This correlated with an increase in ingress traffic, although the volume of traffic was not directly responsible for the incident. At the same time, customers in Europe began reporting resolution failures with their ALIAS records.

Why did it happen?

Multiple contributing factors were identified:

There was a loss of IPv6 traffic into our AMS data center at the same time the incident started. It is unclear if this was a contributing factor.

How did we respond and recover?

Team members in Europe opened an incident and began investigating after customers reported ALIAS resolution failures. We ultimately determined that the issue was at least partially caused by a new version of the name server software, so we rolled back to the previous version.

We also reverted the resolver configuration changes made the previous day, removing EDNS Client Subnet (ECS) support, to mitigate the impact on a small subset of ALIAS records.

How might we prevent similar issues from occurring again?

Our goal is to provide you with solid authoritative ALIAS resolution that you can trust to never fail. While we failed to live up to that goal during this incident, we are working with the knowledge gained to improve our system and processes to avoid incidents like this in the future.

Thank you for your trust and your business – all of us at DNSimple appreciate it.

