Over the last six months, the team at DNSimple has worked diligently to track down the source of recent incidents impacting our production name servers. After months of investigation, we finally identified the root cause and deployed fixes to our production systems. I'd like to share what we found and what we learned while addressing this issue.
TL;DR: Read performance was greatly reduced in large zones with regular updates. These expensive reads dramatically cut our read capacity whenever we had a high volume of requests for these large zones. With capacity reduced, queues would grow, resolvers would time out, and request volume would double or triple as those resolvers retried aggressively. This issue with large-zone read performance is now fixed.
At DNSimple, we operate an Erlang-based name server called erldns. The Erlang virtual machine has proven to be an excellent environment for running DNS name servers – thanks in part to its soft real-time capabilities and fault-tolerant design. Erldns is open source; we layer several proprietary elements on top of it for our needs.
Erldns stores all the DNS zones it serves in memory – specifically in Erlang Term Storage (ETS), an in-memory key/value database. On each request, erldns hands the query off to a worker that finds the correct record set using multiple ETS tables. These tables have evolved over the years to improve read response times, usually by removing unnecessary data from the tables and by ensuring ETS queries copy the smallest possible amount of data into process memory at request time.
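erldns itself is written in Erlang, but the lookup flow can be sketched in a few lines of Python. Everything here – the table layout, field names, and two-table split – is illustrative, not erldns's actual schema:

```python
# Sketch of a worker resolving a query against two in-memory tables,
# as an ETS-backed name server might: first find the enclosing zone,
# then the record set for the exact name and type. (Hypothetical layout.)
zones = {"example.com": {"authoritative": True}}
records = {
    ("example.com", "www.example.com", "A"): [{"ttl": 3600, "data": "192.0.2.10"}],
}

def resolve(qname, qtype):
    # Walk up the name's labels until we hit a zone we serve.
    labels = qname.split(".")
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in zones:
            # Keyed access: only the matching record set is read out.
            return records.get((zone, qname, qtype), [])
    return None  # not authoritative for this name

resolve("www.example.com", "A")
```

In ETS, every lookup copies the stored value into the calling process's heap, which is why keeping hot-path table values small matters so much for per-request cost.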
In October 2020, we deployed a new mechanism for distributing zone changes to our name servers. This mechanism greatly improved the performance of zone changes, but introduced a defect – the prime factor in the incidents over the past few months. The new implementation was unnecessarily placing a zone's entire set of records into an ETS table that is queried regularly, but only for zone metadata – the records stored there were never used. Because ETS copies stored data into process memory on every lookup, each query against that table copied the full record set. For large zones, enough data was copied per request to slow down request handling, significantly limiting the queries-per-second the name servers could process. Coupled with spikes in traffic, this reduced throughput produced name server queue backlogs that propagated all the way up to the UDP sockets, leading to dropped packets and timeouts between our name servers and our upstream DDoS defense network.
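To make the shape of the defect concrete, here is a hedged Python sketch (all names and layouts are hypothetical, not erldns's code). The `deepcopy` stands in for ETS's copy-on-read behavior, where a lookup copies the entire stored object into the calling process's heap:

```python
import copy

# Model of ETS copy-on-read: a lookup copies the whole stored value
# into the caller's memory. deepcopy stands in for that per-read cost.
def ets_lookup(table, key):
    return copy.deepcopy(table[key])

# Defective layout (illustrative): zone metadata and the zone's full
# record list share one entry, so every metadata read copies all records.
zones_buggy = {"big.example": {
    "serial": 2021070801,
    "records": [("a%d.big.example" % i, "A") for i in range(10_000)],
}}

# Fixed layout: metadata alone in the hot table; records stored elsewhere.
zone_meta = {"big.example": {"serial": 2021070801}}

info = ets_lookup(zones_buggy, "big.example")   # copies 10,000 records to read one serial
info2 = ets_lookup(zone_meta, "big.example")    # copies a tiny map instead
```

The copy cost scales with the size of the stored value, not with the size of the answer – which is why the slowdown only surfaced on large zones under heavy query load.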
Finding the problem required a number of tools we hadn't used before at DNSimple. For example, while we've always operated canary environments for testing under real operational conditions, we now have a separate environment where we can run name server instances and execute performance tests to locate bottlenecks more efficiently. In this environment, we can make changes rapidly to narrow the focus of the tests, allowing us to locate problems sooner. We can also run those same performance tests against our canary environment, which demonstrates load handling in conjunction with real traffic. Combining these two environments greatly helped us locate the defect described above. We've also started rolling out a new observability platform that allows us to identify which names may be involved in traffic spikes or slower response times. Finally, we're adding span tracing with OpenTelemetry to our toolset – another mechanism for identifying bottlenecks.
We also found a second spot where an ETS query needed optimization, to reduce the amount of table scanning required to answer specific types of queries. With the changes addressing both issues in place, throughput is now consistent on zones ranging from five records to five million.
To address the first issue, we removed the transfer of unnecessary data between processes and ETS. For the optimization, we ran experiments reducing the scope of the ETS query and determined we could achieve the same results with a tightly scoped query. We deployed the change on Thursday, July 8th to address the defect and improve the read performance.
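The query optimization amounts to letting the table's key do the narrowing instead of examining every row – roughly what tightening the scope of an ETS query achieves. A rough Python analogue (the actual query details in erldns differ):

```python
# Records indexed by (name, type). A broad query filters every row;
# a tightly scoped one touches only the rows that can possibly match.
table = {
    ("www.example.com", "A"): ["192.0.2.10"],
    ("www.example.com", "AAAA"): ["2001:db8::10"],
    ("mail.example.com", "A"): ["192.0.2.20"],
}

def broad(name, rtype):
    # Full scan: examine every entry, return the first match.
    for (n, t), v in table.items():
        if n == name and t == rtype:
            return v
    return []

def scoped(name, rtype):
    # Keyed access: the index does the narrowing for us.
    return table.get((name, rtype), [])
```

Both return the same answers, but the scoped version's cost stays flat as the table grows – which is what made throughput consistent from five-record to five-million-record zones.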
Since completing the rollout, we've seen major reductions in memory use, improvements in CPU usage, and no further incidents.
While we always work to balance deploying system improvements with ensuring operational stability, in this instance we failed to provide the safety nets the engineering team needed to ensure changes would not impact query performance. Moving forward, our operational policies for the name servers include performance testing as part of our continuous integration pipeline.
I know these incidents have been frustrating for many of our customers, and I want to thank you for your patience as we've worked to identify and address these issues. As always, the entire DNSimple team is working to build and operate systems you can trust and rely on.
If you have any questions, don't hesitate to get in touch – we're always here to help.