Incident Report - Service Outage (February 10th, 2026)

Anthony Eden

On February 10th, 2026, we experienced a complete service outage that affected name resolution, website, public API, and redirector. The incident began at approximately 16:50 UTC and lasted until approximately 20:18 UTC, or about three and a half hours. This post explains what happened and how we responded.

DNSimple takes all incidents seriously. This was an all-hands event from the moment our first Nomad health alert fired until we had confirmed all customer-facing services were back online and upgraded our infrastructure to prevent recurrence.

What happened

Our production infrastructure uses HashiCorp Nomad to deploy and orchestrate our services: name servers, zone servers, the web app, redirectors, and others. A bug in Nomad's Raft consensus layer caused sudden state corruption that spread across our production Nomad cluster. All servers in the cluster began crashing during leader election, so the cluster could not form quorum. Without a healthy Nomad cluster, the Nomad clients running our services could no longer connect to the servers. Within minutes, the clients hit their retry limits and shut down all of their jobs, which led to the full outage our customers experienced.

We were alerted at 16:52 UTC and immediately brought the team together on a call to triage. We confirmed that the Nomad servers were panicking due to Raft-related errors. We attempted to restore from a Raft snapshot in the hope of regaining quorum, but Nomad's snapshot restore is designed for freshly built clusters, not for restoring into an existing cluster with connected clients. The restored state did not match client state, and we could not place new jobs. Our internal restore procedure had not documented this limitation.
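For reference, the snapshot workflow we attempted uses Nomad's operator CLI. The commands below are only an illustration; the server address and file name are placeholders, and, as noted above, the restore path assumes a freshly built cluster:

```shell
# Illustrative only; the address and file name are placeholders.
# Save a snapshot of the servers' Raft state:
nomad operator snapshot save -address=https://nomad-server-1.example.com:4646 backup.snap

# Restore it. This is the step we attempted: it assumes a freshly built
# cluster, so restoring into an existing cluster with connected clients left
# server and client state out of sync, and new jobs could not be placed.
nomad operator snapshot restore -address=https://nomad-server-1.example.com:4646 backup.snap
```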

We made the decision to wipe the Raft state and rebuild the cluster from scratch. We brought the cluster back with new ACLs, reinserted credentials, and manually bootstrapped our deployer. We then brought up canary zone servers and name servers, and rolled out the name servers by region. We had to work around a circular dependency: the name servers need to call the app's API, while the app could not resolve DNS until the name servers were back. We adjusted our deployment specs to skip post-start checks that depended on our own DNS being reachable, then brought the web app and redirectors online. By 20:18 UTC we had restored sandbox and the remaining internal services and confirmed all systems operational.
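To make the workaround concrete, here is a minimal, hypothetical sketch of the kind of Nomad job spec involved. It is not our actual deployment spec; the job, task, image, and hostname are invented. The point is the post-start check task, which only succeeds when a hostname served by our own DNS resolves, and which we temporarily skipped during recovery:

```hcl
# Hypothetical sketch, not our actual job spec: names, image, and URL are invented.
job "name-server" {
  datacenters = ["dc1"]

  group "ns" {
    task "nsd" {
      driver = "docker"
      config {
        image = "example/name-server:latest"
      }
    }

    # Post-start smoke test: calls the app API by a hostname that our own
    # name servers answer for. Removing (or stubbing) this task breaks the
    # circular dependency while the name servers are being restored.
    task "poststart-check" {
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }

      driver = "docker"
      config {
        image   = "curlimages/curl:latest"
        command = "curl"
        args    = ["--fail", "https://api.example.com/health"]
      }
    }
  }
}
```

Once the name servers were answering again, a check like this can be reinstated so future deploys keep their guard rails.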

After the emergency passed, we identified the root cause as a nil pointer panic in Nomad's Raft state parser, a known bug fixed in Nomad v1.11.0. At the time of the incident, the DNSimple cluster was up to date with the latest available patch release in the v1.10.x series, which did not include the fix. We then immediately upgraded our entire Nomad fleet to v1.11.1.

Contributing factors

When an incident occurs, we focus on identifying the contributing factors.

The primary cause was the Nomad Raft bug: a nil pointer dereference in the Raft state parser corrupted the Raft log. The corruption replicated to all three servers, caused a crash loop, and prevented leader election.

A second factor was how our Nomad clients behave when they cannot reach the server. They retry template updates and, after a configured number of failures, mark the pre-start task as failed. In our case, that setting caused the whole job to fail and terminate. This is the default behavior in Nomad, not a deliberate choice, but we acknowledge that this setup brought down all of our services once the cluster became unhealthy.
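For readers who want to review this in their own clusters, the behavior is governed by the retry settings for template data sources in the Nomad client (agent) configuration. The sketch below is a hypothetical example, not our production config; the exact blocks depend on your Nomad version and on whether templates read from Consul, Vault, or Nomad itself. If we read the documentation correctly, an attempts value of 0 retries indefinitely rather than failing the task:

```hcl
# Hypothetical sketch of a Nomad client (agent) configuration, not our
# production config. The template block controls how template data-source
# failures are retried; once attempts are exhausted, the template runner
# gives up and the associated task is marked failed.
client {
  enabled = true

  template {
    # Retries for templates backed by Consul (similar vault_retry and
    # nomad_retry blocks exist for Vault- and Nomad-backed templates).
    consul_retry {
      attempts    = 0       # 0 = retry indefinitely instead of failing the task
      backoff     = "500ms"
      max_backoff = "1m"
    }
  }
}
```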

A third factor was our restore procedure. We had not documented that Nomad's snapshot restore is intended for new clusters only, and we had not tested restore into an existing cluster with connected clients. Recovery also required manually reinserting secrets and resolving the app and name server dependency, which added time. We are improving our automation and documentation so that future recovery is faster and more repeatable.

Post-incident review

Once the emergency passed, we conducted a post-incident review to learn from the event and prevent similar issues in the future.

We have already taken, or are taking, the following immediate steps:

  • Upgraded all production and staging Nomad servers and clients to v1.11.1, which includes the Raft fix
  • Established a deployment freeze until we had confirmed the root cause and completed the upgrade
  • Documenting a proper Nomad restore procedure for production
  • Applying changes so that Nomad clients can tolerate server unavailability better and are less likely to take down all jobs when the cluster is unhealthy

Conclusion

All DNSimple customers were affected during this incident. Name resolution, as well as access to the site and public API, was unavailable for most of the incident window. We sincerely apologize for the disruption and the impact on your operations.

We work hard to ensure maximum uptime for our services. We will continue to improve our restoration procedures, monitoring, and resilience so we can respond even more effectively when issues arise.

Thank you for your support and patience throughout this incident and beyond. If you have any questions, please contact our support team, and we will be happy to help.

Anthony Eden

I break things so Simone continues to have plenty to do. I occasionally have useful ideas, like building a domain and DNS provider that doesn't suck.
