On Tuesday, April 14th, 2020, we saw a significant increase in ALIAS resolution failures in our Amsterdam (AMS) data center. The incident started at 07:00 UTC with an increase in SERVFAIL responses for certain requests in the AMS region. This correlated with an increase in ingress traffic, although the volume of traffic was not directly responsible for the incident. At the same time, customers in Europe began reporting resolution failures with their ALIAS records.
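For readers who want to watch for a similar pattern in their own resolvers, here is a minimal sketch of how a rise in SERVFAIL responses could be flagged. It is not our monitoring code: the function names, the `factor` and `floor` thresholds, and the idea of comparing a baseline window with a current window are all assumptions made for illustration. The only fixed fact is that SERVFAIL is DNS response code 2.

```python
from collections import Counter

SERVFAIL = 2  # DNS RCODE 2, "server failure" (RFC 1035)


def servfail_rate(rcodes):
    """Return the fraction of responses whose RCODE is SERVFAIL."""
    if not rcodes:
        return 0.0
    counts = Counter(rcodes)
    return counts[SERVFAIL] / len(rcodes)


def is_spike(baseline_rcodes, current_rcodes, factor=3.0, floor=0.01):
    """Flag the current window as a spike.

    Hypothetical rule: the current SERVFAIL rate must reach `floor`
    and must exceed `factor` times the baseline rate.
    """
    baseline = servfail_rate(baseline_rcodes)
    current = servfail_rate(current_rcodes)
    return current >= floor and current > baseline * factor
```

The floor keeps a quiet window with a handful of failures from raising an alert, while the multiplier compares the current window against normal behavior rather than against a fixed number.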
Multiple contributing factors were identified:
There was a loss of IPv6 traffic into our AMS data center at the same time the incident started. It is unclear if this was a contributing factor.
Team members in Europe opened an incident and began investigating the issue after receiving reports from customers of ALIAS resolution failures. We ultimately identified that the issue was at least partially due to the new software version of the name server. We rolled back to the previous version in response.
We also reverted the resolver configuration changes (that were made the previous day) to remove ECS support to mitigate impact on a small subset of ALIAS records.
Our goal is to provide you with solid authoritative ALIAS resolution that you can trust to never fail. While we failed to live up to that goal during this incident, we are working with the knowledge gained to improve our system and processes to avoid incidents like this in the future.
Thank you for your trust and your business – all of us at DNSimple appreciate it.