
Incident Report - DNS changes blocked (May 20th, 2013)

Darrin Eden

On May 20th, beginning at about 11:30am Pacific time (18:30 UTC), we experienced a partial service outage. A component failure blocked our customers from updating their DNS records. We apologize for any trouble this caused DNSimple customers.

Highlights

Pros: Our alerting system mostly worked.

Cons: Our redirector service was out of sync and we were not alerted directly.

Overview

Long story short: we have a single point of failure around a MySQL database. Our name servers rely on MySQL replication to receive changes. At first it looked as though side effects of maintenance were causing the MySQL database to fail slowly. In fact, the physical server hosting the writable copy of this database was failing. The failure first surfaced as high response latency and eventually as connection timeouts.

When I tried to reboot the physical machine, it did not boot normally, so I migrated the container to our Tokyo data center. I restored the container from a database backup and then synced all the unicast name servers. Unfortunately, I forgot that our redirection service also relies on working MySQL replication. It wasn't repaired until after a DNSimple customer reported the issue.

Remediation

We will add alerts around our redirector service, so that we are notified directly when it falls out of sync. We will migrate MySQL to a high availability configuration. Customers may also want to consider migrating to DNSimple's anycast network, which is currently in testing. It does not have this single point of failure.
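As an illustration only, here is a minimal Python sketch of the kind of check such an alert could run. It is not DNSimple's actual monitoring. It assumes each replica's `SHOW SLAVE STATUS` row has already been fetched into a dictionary; the function name `replication_problems`, the host names, and the 60 second lag threshold are hypothetical.

```python
from typing import List, Mapping, Optional


def replication_problems(
    status: Optional[Mapping[str, object]],
    max_lag_seconds: int = 60,  # hypothetical threshold, tune per service
) -> List[str]:
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    An empty list means the replica looks healthy. `status` is None when
    the query returned no row, i.e. replication is not configured.
    """
    if status is None:
        return ["replication is not configured"]

    problems = []
    # Both replication threads must report "Yes" on a healthy replica.
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")

    # Seconds_Behind_Master is NULL when the lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("replication lag is unknown")
    elif int(lag) > max_lag_seconds:
        problems.append(f"replica is {int(lag)}s behind (limit {max_lag_seconds}s)")
    return problems


if __name__ == "__main__":
    # Hypothetical sample rows; a real check would query each replica.
    replicas = {
        "nameserver-1": {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
                         "Seconds_Behind_Master": 0},
        "redirector-1": {"Slave_IO_Running": "No", "Slave_SQL_Running": "Yes",
                         "Seconds_Behind_Master": None},
    }
    for host, row in replicas.items():
        for problem in replication_problems(row):
            print(f"ALERT {host}: {problem}")
```

Running the same check against every service that consumes replicated data, the redirector included, is what would have caught the out-of-sync redirector before a customer did.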

Darrin Eden

I like shiny things (it says so on my blog). I keep the machines humming at DNSimple.