
Post Incident Report for Partial DNS Outage on March 2, 2021

Anthony Eden

On March 2nd, 2021, multiple regions of the DNSimple authoritative DNS network were impacted by an incident, resulting in failures to respond properly to a portion of queries received. This incident began at approximately 14:25 UTC and lasted until approximately 20:08 UTC.

When an incident like this occurs, we have a specific response plan that we follow. This includes bringing all team members into a call so we can work together to mitigate the incident as quickly as possible.

There were several potential contributing factors to this incident. The first was an abnormal increase in traffic across several ALIAS records, many of which chain to other records, either within their own zone or in other zones within DNSimple. The second was an increase in slow upstream resolution responses, which caused resolvers to retry requests to our system at a higher rate.
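To illustrate why chained ALIAS records multiply resolution work, here is a minimal sketch of following a chain with loop detection and a hop limit. It is not DNSimple's implementation: the `resolve_alias` function, the in-memory `records` table, and the `max_hops` default are all hypothetical, and a real resolver would issue a network query at every hop.

```python
def resolve_alias(name, records, max_hops=8):
    """Follow ALIAS records until an address record is found.

    `records` is a hypothetical in-memory table mapping a name to either
    ("ALIAS", target_name) or ("A", [addresses]).
    Returns (addresses, hops), where hops is the number of ALIAS links followed.
    """
    seen = set()
    hops = 0
    while True:
        if name in seen:
            raise ValueError(f"ALIAS loop detected at {name}")
        seen.add(name)
        if name not in records:
            raise LookupError(f"no record for {name}")
        rtype, value = records[name]
        if rtype == "A":
            return value, hops
        # Each ALIAS link costs one more lookup before an answer can be sent.
        hops += 1
        if hops > max_hops:
            raise ValueError("ALIAS chain exceeds max_hops")
        name = value


records = {
    "www.example.com.": ("ALIAS", "app.example.com."),
    "app.example.com.": ("ALIAS", "lb.example.net."),
    "lb.example.net.": ("A", ["192.0.2.10"]),
}
addresses, hops = resolve_alias("www.example.com.", records)
```

In this sketch a single incoming query for `www.example.com.` needs two extra lookups before it can be answered, so a traffic spike on chained records costs proportionally more work than the same spike on plain address records.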

As a mitigating strategy, we temporarily switched resolution to external resolvers rather than our internal resolver network. This successfully reduced the backlog in the authoritative query queues, but resulted in a higher-than-normal number of SERVFAIL responses from these external resolvers, likely due to rate limiting by those providers. As a result, many ALIAS queries returned a NOERROR response along with an SOA record, which indicates that the zone is present but has no records matching the query.
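The difference between these response types can be made concrete with a small sketch. The `Response` type and `classify` function below are hypothetical and only model the fields needed here; they are not part of any DNS library. The point is that NOERROR with an empty answer section and an SOA record in the authority section is a "no data" answer, which clients treat as authoritative, unlike a SERVFAIL, which they typically retry.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    ANSWER = "answer"      # NOERROR with at least one answer record
    NODATA = "nodata"      # NOERROR, empty answer, SOA in authority section
    NXDOMAIN = "nxdomain"  # the name does not exist
    SERVFAIL = "servfail"  # the server could not complete the lookup
    OTHER = "other"        # anything this sketch does not model (e.g. referrals)


@dataclass
class Response:
    rcode: str
    answer: list = field(default_factory=list)     # answer records
    authority: list = field(default_factory=list)  # (name, rtype) tuples


def classify(resp):
    if resp.rcode == "SERVFAIL":
        return Outcome.SERVFAIL
    if resp.rcode == "NXDOMAIN":
        return Outcome.NXDOMAIN
    if resp.rcode == "NOERROR":
        if resp.answer:
            return Outcome.ANSWER
        if any(rtype == "SOA" for _, rtype in resp.authority):
            return Outcome.NODATA
    return Outcome.OTHER


empty = Response("NOERROR", authority=[("example.com.", "SOA")])
outcome = classify(empty)
```

Under these assumptions, an ALIAS lookup that fails upstream but is reported as `NODATA` looks to the client like a valid, empty answer rather than a temporary failure.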

An additional strategy we applied was to greatly reduce the maximum wall clock time before our own internal resolvers would respond with a SERVFAIL. This may have initially helped reduce the load, but in exchange it created ALIAS failures for customers whose lookups previously would not have failed. Our ALIAS handling was receiving an unexpected SERVFAIL rather than a timeout, and as a result it bypassed its internal cache of ALIAS results.
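The interaction between failure type and caching can be sketched as follows. This is an illustration under stated assumptions, not a description of DNSimple's ALIAS handling: the `AliasCache` class, both exception types, and the serve-stale policy are hypothetical. The sketch shows a handler that falls back to a previously cached result for both kinds of upstream failure, so that a change in how failures are reported does not change whether the cache is used.

```python
import time


class UpstreamTimeout(Exception):
    """Hypothetical: the upstream resolver did not answer in time."""


class UpstreamServfail(Exception):
    """Hypothetical: the upstream resolver answered with SERVFAIL."""


class AliasCache:
    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve  # callable: name -> list of addresses
        self._clock = clock
        self._entries = {}       # name -> (addresses, expires_at)

    def lookup(self, name, ttl=60.0):
        now = self._clock()
        cached = self._entries.get(name)
        if cached and cached[1] > now:
            return cached[0]     # fresh cache hit, no upstream query
        try:
            addresses = self._resolve(name)
        except (UpstreamTimeout, UpstreamServfail):
            # Treat both failure modes the same way: serve the stale
            # entry if one exists, otherwise propagate the failure.
            if cached:
                return cached[0]
            raise
        self._entries[name] = (addresses, now + ttl)
        return addresses
```

A handler that only caught `UpstreamTimeout` here would skip the stale fallback as soon as the upstream began returning SERVFAIL before the timeout fired.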

While we do not have conclusive proof at this time, we believe the primary factor in this incident was resource contention and exhaustion. In response, we are tuning request handling in our internal authoritative network to reduce wasted resource usage. We have also increased logging around errors related to ALIAS resolution failures and increased our monitoring of these logs. We are continuing to investigate ALIAS resolution failures to better understand why they occur.

At DNSimple we work hard to ensure maximum uptime of our name service, and we will continue to do so. Thank you for your support and patience throughout the incident and beyond.



Anthony Eden

I break things so Simone continues to have plenty to do. I occasionally have useful ideas, like building a domain and DNS provider that doesn't suck.