On May 20th, beginning at about 11:30am Pacific time (18:30 UTC), we experienced a partial service outage. A component failure blocked our customers from updating their DNS records. We apologize for any trouble this caused DNSimple customers.
- Detected time: 11:40 PDT
- Resolved time: 23:43 PDT
- Severity: Partial outage affecting all customers who wanted to change DNS records.
Pros: Our alerting system mostly worked.
Cons: Our redirector service was out of sync and we were not alerted directly.
Long story short: we have a single point of failure around a MySQL database. Our name servers rely on MySQL replication to receive changes. At first it appeared that side effects of maintenance were causing a MySQL database to fail slowly. In fact, the physical server hosting the writable copy of this database was failing slowly. The failures first surfaced as high response latency and eventually as connection timeouts. When I attempted to reboot the physical machine, it did not boot normally, so I migrated the container to our Tokyo data center. I restored the container from a database backup and then synced all the unicast name servers. Unfortunately, I forgot that our redirection service also relies on working MySQL replication. It wasn’t repaired until after a DNSimple customer reported the issue.
We will add alerts around our redirector service and migrate MySQL to a high-availability configuration. Customers may also want to consider migrating to DNSimple’s anycast network, which is currently in testing and does not have this single point of failure.
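As a rough illustration of what a direct alert on the redirector's replication could look like, here is a minimal sketch. It is not our actual monitoring code. It assumes the replica's `SHOW SLAVE STATUS` row has already been fetched into a dictionary; the function name `replication_alerts` and the 60-second threshold are hypothetical, while `Slave_IO_Running`, `Slave_SQL_Running`, and `Seconds_Behind_Master` are columns MySQL reports in that row.

```python
from typing import List, Mapping, Optional

# Hypothetical threshold; a real value would depend on how stale
# the redirector's data is allowed to become.
MAX_LAG_SECONDS = 60


def replication_alerts(
    status: Optional[Mapping[str, object]],
    max_lag_seconds: int = MAX_LAG_SECONDS,
) -> List[str]:
    """Return a list of problems for one replica; an empty list means healthy.

    `status` is one row of SHOW SLAVE STATUS as a mapping, or None/empty
    if the server returned no row (replication is not configured).
    """
    if not status:
        return ["replication is not configured (empty SHOW SLAVE STATUS)"]

    problems: List[str] = []

    # Both replication threads must report "Yes" for changes to flow.
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")

    # Seconds_Behind_Master is NULL when the lag cannot be determined,
    # which typically happens when replication is stopped or broken.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("replication lag is unknown (NULL)")
    elif int(lag) > max_lag_seconds:
        problems.append(
            f"replica is {int(lag)}s behind (limit {max_lag_seconds}s)"
        )

    return problems
```

A monitoring job could run this against every replica, including the one behind the redirector, and page whenever the returned list is non-empty. Checking each consumer of replication separately is the point: in this incident the name servers were resynced while the redirector stayed out of date.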