Incident Report - DNS changes blocked (May 20th, 2013)
On May 20th, beginning at about 11:30am Pacific time (18:30 UTC), we experienced a partial service outage. A component failure blocked our customers from updating their DNS records. We apologize for any trouble this caused DNSimple customers.
Highlights
- Detected time: 11:40 PDT
- Resolved time: 23:43 PDT
- Severity: Partial outage. All customers who wanted to change DNS records were affected.
Pros: Our alerting system mostly worked.
Cons: Our redirector service was out of sync and we were not alerted directly.
Overview
Long story short: we have a single point of failure around a MySQL database. Our name servers rely on MySQL replication to receive changes. At first it looked as though side effects of maintenance were causing the database to fail slowly. In fact, the physical server hosting the write copy of the database was failing slowly. The failure first surfaced as high response latency and eventually as connection timeouts. When I tried to reboot the physical machine, it did not boot normally, so I migrated the container to our Tokyo data center, restored it from a database backup, and then synced all the unicast name servers. Unfortunately, I forgot that our redirection service also relies on working MySQL replication. It was not repaired until a DNSimple customer reported the issue.
Remediation
We will add alerts around our redirector service and migrate MySQL to a high-availability configuration. Customers may also want to consider migrating to DNSimple's anycast network, currently in testing, which does not have this single point of failure.
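As a rough illustration of the kind of replication alert described here, the following is a minimal Python sketch, not our production check. It assumes the caller has already fetched one row of MySQL's `SHOW SLAVE STATUS` output as a dictionary. The function name `replication_alert` and the 60-second threshold are hypothetical and chosen only for this example.

```python
from typing import Mapping, Optional


def replication_alert(status: Mapping[str, object],
                      max_lag_seconds: int = 60) -> Optional[str]:
    """Return an alert message if the replica looks unhealthy, else None.

    `status` is assumed to be one row of SHOW SLAVE STATUS as a dict.
    The threshold is an illustrative default, not a recommendation.
    """
    # Both replication threads must be running for changes to arrive.
    if status.get("Slave_IO_Running") != "Yes":
        return "replication I/O thread is not running"
    if status.get("Slave_SQL_Running") != "Yes":
        return "replication SQL thread is not running"

    # MySQL reports NULL here when the lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return "replication lag is unknown"

    lag_seconds = int(lag)
    if lag_seconds > max_lag_seconds:
        return (f"replica is {lag_seconds}s behind the primary "
                f"(limit {max_lag_seconds}s)")
    return None
```

Running a check like this against every service that consumes replicated data, the redirector included, is what would have paged us before a customer had to report the problem.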
Darrin Eden
I like shiny things (it says so on my blog). I keep the machines humming at DNSimple.