Building a Resilient Authoritative Nameserver
Remember the last time a critical service felt sluggish, or worse, went down under pressure? Building erldns has always been about preventing those moments. Today, we're excited to share what we've been doing to significantly improve the resilience of this critical piece of infrastructure software.
Erldns is an open-source DNS server written in Erlang that powers DNSimple's authoritative nameservers. The release of erldns v10.0.0 marks a major milestone in our journey to build a highly resilient authoritative nameserver. It introduces significant improvements to network performance, protocol support, and reliability that benefit all DNSimple customers, and it carries lessons that apply beyond DNS. If you're building any service exposed to the outside world, this post is for you.
Why v10 Matters
Modern authoritative nameservers face two competing challenges:
- They must handle massive query volumes while maintaining low latency
- They must remain stable and responsive even under DDoS attacks or extreme load
Traditional approaches often sacrifice one for the other. Large network buffers maximize throughput but create latency spikes. Small buffers keep latency low but risk packet loss during legitimate traffic bursts.
Version 10 of erldns addresses these challenges through adaptive algorithms that dynamically balance throughput and latency based on real-time network conditions.
Battling Bufferbloat with CoDel
The most significant improvement in v10 is the implementation of CoDel (Controlled Delay) queue management for UDP workers. To understand why this matters, we first need to understand bufferbloat.
What is Bufferbloat?
Bufferbloat occurs when network equipment stores too much data in buffers, creating excessive latency. While large buffers were designed to prevent packet loss, they risk impacting the user experience by introducing unpredictable delays.
For a DNS server, bufferbloat means queries sit in queues far longer than necessary. A DNS query that should complete in milliseconds might wait tens or hundreds of milliseconds simply because it's stuck behind a backlog of other queries.
See this example of traditional tail-drop queue management:
[ Incoming Packet ]
|
V
+-----------------------------+
| DROP (Queue is Full) |
+-----------------------------+
^
|
+--------------------------------------+
| Queue |
| (Size: 4) |
+--------------------------------------+
| [ Packet | 10ms ] (Oldest, Next) |
+--------------------------------------+
| [ Packet | 8ms ] |
+--------------------------------------+
| [ Packet | 6ms ] |
+--------------------------------------+
| [ Packet | 4ms ] (Newest, Last in) |
+--------------------------------------+
By the time a packet is dropped, the one at the front has already waited 10ms. Latency has built up, and we only reacted when it was too late.
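To make the failure mode concrete, here is a minimal tail-drop queue in Python. This is an illustrative sketch, not erldns code: the only congestion signal it can give is rejecting a newcomer once the buffer is already full, by which point the packet at the head has already paid the worst-case delay.

```python
from collections import deque

class TailDropQueue:
    """Illustrative tail-drop queue: it reacts only at the tail,
    when the buffer is already full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()

    def enqueue(self, packet):
        if len(self.packets) >= self.capacity:
            return False  # tail drop: the latency damage is already done
        self.packets.append(packet)
        return True

# With capacity 4 (as in the diagram), the fifth packet is dropped,
# but the oldest packet has already been waiting the longest.
queue = TailDropQueue(capacity=4)
results = [queue.enqueue(f"pkt-{i}") for i in range(5)]
print(results)  # [True, True, True, True, False]
```

Nothing in this scheme ever looks at how long the surviving packets have been waiting, which is exactly the gap CoDel fills.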
How CoDel Solves It
CoDel takes a fundamentally different approach to queue management. Instead of monitoring queue length, it tracks sojourn time (how long each packet spends in the queue).
The algorithm uses three key components:
Target Setpoint: erldns sets an optimal queue delay threshold of 10ms, at the upper end of RFC 8289's recommended range. This balances throughput against latency based on network power optimization principles.
Control Loop: When persistent queuing is detected, CoDel gradually increases packet dropping rates, giving endpoints time to react while maintaining link utilization.
Hysteresis: A multiplier of 16 prevents oscillation between drop and no-drop states, ensuring stable operation under varying load conditions.
In our implementation, CoDel operates in parallel with traditional backlog-based active queue management (AQM), providing optimal behavior across different bandwidth scenarios.
The result is a system that distinguishes between "good queue" (necessary bursts that drain quickly) and "bad queue" (persistent delays), dropping packets only when genuinely problematic.
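The sojourn-time logic fits in a few dozen lines. Below is a simplified Python rendering of the RFC 8289 dequeue algorithm, not the erldns code (which adds the multiplier-of-16 hysteresis and the parallel backlog AQM). The 10ms target matches the value above; the 100ms interval is the RFC 8289 default and is an assumption here.

```python
import math

TARGET = 0.010    # 10 ms sojourn-time setpoint, as in erldns
INTERVAL = 0.100  # 100 ms observation window (RFC 8289 default)

class CoDel:
    """Simplified CoDel dequeue logic: drop based on how long packets
    sat in the queue (sojourn time), not on queue length."""

    def __init__(self):
        self.first_above = None  # deadline armed when sojourn exceeds TARGET
        self.dropping = False
        self.count = 0           # drops in the current dropping state
        self.next_drop = 0.0

    def should_drop(self, sojourn, now):
        if sojourn < TARGET:
            # "Good queue": a burst that drains quickly. Reset everything.
            self.first_above = None
            self.dropping = False
            return False
        if self.first_above is None:
            # Above target, but maybe only briefly: arm a timer.
            self.first_above = now + INTERVAL
            return False
        if not self.dropping and now >= self.first_above:
            # "Bad queue": delay persisted a full INTERVAL. Start dropping.
            self.dropping = True
            self.count += 1
            # Control law: successive drops get closer together
            # the longer the excess delay persists.
            self.next_drop = now + INTERVAL / math.sqrt(self.count)
            return True
        if self.dropping and now >= self.next_drop:
            self.count += 1
            self.next_drop = now + INTERVAL / math.sqrt(self.count)
            return True
        return False

codel = CoDel()
assert not codel.should_drop(sojourn=0.002, now=0.00)  # good queue
assert not codel.should_drop(sojourn=0.015, now=0.00)  # arms the timer
assert codel.should_drop(sojourn=0.015, now=0.20)      # persisted: drop
```

Note how a short spike above the target never triggers a drop; only delay that survives a full interval does.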
This might leave you wondering: why does the industry still tolerate self-inflicted latency from bloated buffers?
CUBIC Congestion Control
While CoDel manages packet queuing, we need a complementary mechanism to control the rate at which we accept new work. Enter CUBIC, a congestion control algorithm originally designed for TCP.
What makes our CUBIC implementation particularly exciting is its novel adaptation from TCP congestion control to intelligently manage UDP acceptor admission rates: when CPU saturation is detected (above 90% scheduler utilization), CUBIC dynamically reduces the work acceptance rate.
The algorithm uses three phases:
Concave phase: Fast initial growth when recovering from congestion.
Plateau: Reaches previous maximum capacity.
Convex phase: Gradually probes for new capacity using polynomial growth.
%% From erldns_cubic.erl
-define(C, 0.4). %% Cubic scaling factor
-define(BETA, 0.8). %% Multiplicative decrease (gentler than TCP's 0.7)
-define(MIN_RATE, 1.0). %% Never stops completely
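The three phases all come from one curve. Here is an illustrative Python rendering of the textbook CUBIC window function (RFC 9438) using the constants above; the names `cubic_rate` and `w_max` are ours, not erldns's.

```python
C = 0.4     # cubic scaling factor, matching ?C above
BETA = 0.8  # multiplicative decrease, matching ?BETA above

def cubic_rate(t, w_max):
    """Textbook CUBIC curve (RFC 9438): acceptance rate t seconds
    after a congestion event that capped capacity at w_max."""
    # K is the time needed to climb back to w_max after the BETA cut.
    k = (w_max * (1 - BETA) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max

w_max = 1000.0
k = (w_max * (1 - BETA) / C) ** (1 / 3)
print(cubic_rate(0.0, w_max))    # ~800:  restart from BETA * w_max
print(cubic_rate(k, w_max))      # ~1000: plateau at the previous maximum
print(cubic_rate(2 * k, w_max))  # ~1200: convex probing beyond it
```

The flat region around `w_max` is the point: the system spends most of its time near the last known safe capacity, growing fastest only when it is far from it.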
The system monitors scheduler utilization using exponential moving average (EMA) smoothing with a 0.3 alpha factor. This reduces noise from transient spikes while remaining responsive to genuine load changes.
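The smoothing step is a standard exponential moving average; a minimal sketch with the stated alpha of 0.3 (variable names are illustrative):

```python
ALPHA = 0.3  # smoothing factor from the post

def ema(prev, sample, alpha=ALPHA):
    """Exponential moving average: each new sample gets weight alpha,
    the accumulated history gets weight (1 - alpha)."""
    return alpha * sample + (1 - alpha) * prev

# A single spike to 100% scheduler utilization barely moves
# the smoothed value...
util = ema(0.5, 1.0)  # 0.65
# ...but sustained high load pulls it up within a few samples.
for _ in range(5):
    util = ema(util, 0.95)  # climbs toward 0.95 (~0.90 after 5 samples)
```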
Benchmark results show CUBIC gracefully handles extreme loads:
- At 460K QPS, the system processes approximately 459K queries (virtual momentum absorbs transient spikes).
- At 1.2M QPS, it sustains 300K effective queries with 75% load shedding.
- At 2.1M QPS, it reaches its maximum safe capacity of approximately 220K effective queries.
- Beyond that, no meaningful difference was observed.
But why not push for 100% utilization? The answer lies in fundamental queueing theory, specifically Kingman's formula. This key principle shows that queueing latency grows hyperbolically with service utilization: the closer we get to 100%, the faster latency degrades. By setting our ceiling at 90% scheduler utilization, we make a deliberate trade-off to stay away from that "latency cliff." The 10% headroom lets the system absorb legitimate, transient bursts without grinding to a halt, ensuring erldns remains responsive and stable precisely when it's under the most pressure.
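Kingman's approximation makes that cliff easy to see numerically. A quick sketch, with the squared coefficients of variation set to 1.0 purely for illustration:

```python
def kingman_wait(rho, ca2=1.0, cs2=1.0, service_time=1.0):
    """Kingman's G/G/1 approximation for mean time spent waiting in
    queue, as a multiple of the mean service time.
    rho: utilization; ca2/cs2: squared coefficients of variation of
    inter-arrival and service times (1.0 ~ Poisson-like traffic)."""
    return (rho / (1 - rho)) * ((ca2 + cs2) / 2) * service_time

for rho in (0.5, 0.9, 0.99):
    print(f"utilization {rho:.0%}: wait ~ {kingman_wait(rho):.1f}x service time")
# utilization 50%: wait ~ 1.0x service time
# utilization 90%: wait ~ 9.0x service time
# utilization 99%: wait ~ 99.0x service time
```

Going from 90% to 99% utilization multiplies queueing delay elevenfold, which is why the last 10% of capacity is not worth chasing.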
The Bigger Picture
These improvements represent months of work informed by real-world operational experience and deep protocol knowledge. Earlier this month, I had the opportunity to present Anatomy of a Resilient Nameserver at FOSDEM 2026, discussing many of these concepts in detail.
Building an authoritative nameserver that delivers under heavy load, processes malformed packets safely, and resists DDoS attacks requires careful attention to concurrency models, traffic management, and packet handling. Version 9 addressed ENT compliance; version 10 brings these resilience concepts together into a production-ready system that scales gracefully from idle to extreme load conditions.
What This Means for DNSimple Customers
All these improvements translate directly to better service for DNSimple customers:
Lower latency: CoDel ensures queries receive responses quickly even during traffic spikes, rather than waiting in bloated queues.
Better reliability: CUBIC prevents system overload by gracefully shedding excess load before it can destabilize the server.
Modern protocol support: Compliance with RFC 7766 and RFC 7858 means better performance and security options for DNS-over-TCP and DNS-over-TLS users.
The erldns project demonstrates our commitment to building robust, high-performance infrastructure that benefits everyone who relies on DNSimple for their DNS needs. All code is open source and available in the erldns GitHub repository.
If you have questions about erldns or DNSimple's infrastructure, get in touch. We love talking about DNS engineering.
Not using DNSimple yet? Give us a try free for 30 days.
Nelson Vides
Nameserver Sensei
We think domain management should be easy.
That's why we continue building DNSimple.