On May 20th, beginning at about 11:30am Pacific time (18:30 UTC), we experienced a partial service outage. A component failure blocked our customers from updating their DNS records. We apologize for any trouble this caused DNSimple customers.
Highlights
- Detected time: 11:40 PDT
- Resolved time: 23:43 PDT
- Severity: All customers wanting to change DNS records.
Pros: Our alerting system mostly worked.
Cons: Our redirector service was out of sync and we were not alerted directly.
Overview
Long story short: we have a single point of failure around a MySQL database. Our name servers rely on MySQL replication to receive changes. At first it looked as though side effects of maintenance were causing a MySQL database to fail slowly. In fact, the physical server hosting the writable copy of this database was itself failing slowly. The failures first surfaced as high response latency and eventually as connection timeouts. When I tried to reboot the physical machine, it would not boot normally. I then migrated the container to our Tokyo data center, restored it from a backup database, and synced all the unicast name servers. Unfortunately, I forgot that our redirection service also relies on MySQL replication, so it was not repaired until after a DNSimple customer reported the issue.
Remediation
We will add alerts around our redirector service so that we are notified directly when it falls out of sync. We will migrate MySQL to a high-availability configuration. Customers may also want to consider migrating to DNSimple’s anycast network, currently in testing, which does not have this single point of failure.
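As a rough illustration of the kind of alert described above, the sketch below flags any service whose replicated data trails the most recent change by more than a threshold. It is not DNSimple's actual monitoring; the `Consumer` type, the `stale_consumers` function, the five-minute threshold, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Consumer:
    """A service that depends on replicated data (hypothetical model)."""
    name: str
    last_applied: datetime  # time of the newest change this service has applied


def stale_consumers(consumers, latest_change, max_lag=timedelta(minutes=5)):
    """Return the names of consumers trailing latest_change by more than max_lag."""
    return [c.name for c in consumers if latest_change - c.last_applied > max_lag]


# Illustrative values only: one name server is current, the redirector is not.
now = datetime(2013, 5, 21, 0, 0, tzinfo=timezone.utc)
consumers = [
    Consumer("ns1", now - timedelta(seconds=30)),
    Consumer("redirector", now - timedelta(hours=2)),
]
stale = stale_consumers(consumers, latest_change=now)
print(stale)
```

The point of listing every consumer explicitly is that a service such as the redirector cannot be forgotten during recovery: anything that depends on replication is checked by the same rule.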
We have scheduled a one hour maintenance window beginning at 6pm Pacific time on Wednesday, May 15 (Thursday, May 16, 2013 at 0100 UTC). This will be used to update network routers at each of DNSimple’s data centers to increase overall availability. We expect the DNSimple application and redirection service to each be unavailable for less…
Today we’re officially deprecating our advanced editor v1.
Today we deployed a change to how the Heroku one-click service works. Heroku recently released their European region, and with that change they deprecated the “proxy” subdomain for Heroku apps. To make DNSimple’s one-click service compatible with this change, we now require your Heroku app name, as it appears in your herokuapp.com subdomain. When you…
To update several low-level components of our software stack, we have scheduled a maintenance window beginning Sunday, April 21 at 0400 UTC (Saturday, April 20th at 9PM Pacific time). It will remain open for two hours. We expect the DNSimple application and API to be unavailable for less than twenty minutes. Name service will not…
On Saturday, April 6th, around 7:49 AM UTC, NS2 (name server 2) stopped responding to queries. At 10:32 AM UTC, NS3 stopped responding to queries as well. During the outage NS1 and NS4 picked up the additional traffic. This abnormal load caused a portion of ALIAS resolutions to fail, directly affecting DNSimple customers. We…
Modern web applications benefit from fast DNS responses. DNS response times may still add anywhere from tens to hundreds of milliseconds to your application's overall latency. DNS providers are constantly working to reduce lookup times using technology such as caching and BGP. Even web browser makers like Google have joined in…
Due to a combination of factors, including availability of new hardware and changes from one of our server providers, we will be moving ns3 from its current IP address to a new IP address. The new IP address is 50.31.225.68. We will be changing our records to reflect this new IP address on 7…
One of the services that we provide at DNSimple is URL redirection. You enter a special record in your DNSimple DNS for a domain, and when we receive a DNS query for the A record of that name, we return the IP address of our redirection service. If the next request is an HTTP request…
On Friday, March 8th, we had a partial DNS outage. Available, well-performing DNS is a critical utility for the businesses our customers are running. We take this responsibility seriously. We continue to investigate the reason behind this outage and are working hard to prevent it from recurring. I would like to apologize to any of our customers…