Today, a change to the configuration of our zone server, which distributes DNS changes, and our redirection service, which responds for URL records, caused outages for approximately an hour. The following post outlines what happened and how we're updating our procedures to prevent outages like these and to improve our response to future incidents.
At 12:57 UTC today, a change to the configuration of a load balancer was deployed during an automatic Chef run. Within a minute, this change triggered alerts that our redirector service was returning 503 status codes rather than the expected 301 redirects. We also received notifications that our zone server was refusing connections. At 13:01, we opened an incident on our status page and marked the status component for our redirector service as having a "Major Outage". We manually reconfigured and restarted the load balancer as soon as possible after receiving the notifications, and the service was back online by 13:08. The incident was marked as resolved at 13:27.
At 13:31, we began seeing elevated error rates, which were enough to trigger another alert. Around the same time we also began receiving tweets from customers who were still seeing the effects of the outage. We opened a new incident at 13:46 indicating that the redirector service was out again. At this point we began attempting to revert the changes in Chef itself, but in the end we reverted them manually again, which was done by 14:00.
By this time, we had confirmed that the misconfiguration had affected zone changes as well, and we opened an incident for that at 14:03. By 14:08 we were able to see standard updates to the zone server, and at 14:10 we were able to retry all failed updates successfully.
Why did it happen?
We had been testing some upgrades with our internal zone server deployment cookbook, which used an upgraded public cookbook dependency. To fully test these changes, we published them to our Chef server, which inadvertently began distributing the breaking changes to our other production systems because no version constraint had been applied. Once the other production systems received the updated cookbooks, they generated and used the wrong configuration file, taking both the zone updates system and the redirector offline at nearly the same time. This meant that DNS changes suddenly stopped being distributed and URL records stopped responding properly.
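To illustrate why a missing version constraint matters, here is a minimal Python sketch of dependency resolution. It is a simplified model, not Chef's actual solver, and the version numbers are made up for the example. It assumes the usual meaning of the pessimistic operator, where `~> 1.4` allows any version from 1.4 up to, but not including, 2.0.

```python
def parse(version):
    """Turn a dotted version string such as '1.4.2' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraint=None):
    """Check a version against an optional '= x.y.z' or '~> x.y' constraint."""
    if constraint is None:
        # No constraint: every published version is acceptable.
        return True
    operator, _, wanted = constraint.partition(" ")
    v, w = parse(version), parse(wanted)
    if operator == "=":
        return v == w
    if operator == "~>":
        if len(w) < 2:
            raise ValueError("'~>' needs at least two version segments")
        # Upper bound: drop the last segment and bump the one before it,
        # so '~> 1.4' means >= 1.4 and < 2, and '~> 1.4.2' means < 1.5.
        upper = w[:-2] + (w[-2] + 1,)
        return w <= v < upper
    raise ValueError(f"unsupported operator: {operator!r}")


def resolve(available, constraint=None):
    """Pick the newest available version that satisfies the constraint."""
    candidates = [v for v in available if satisfies(v, constraint)]
    if not candidates:
        raise LookupError("no version satisfies the constraint")
    return max(candidates, key=parse)


# Hypothetical versions of a shared dependency on the server.
published = ["1.4.0", "1.4.2", "2.0.0"]

unpinned = resolve(published)           # picks up the breaking 2.0.0 release
pinned = resolve(published, "~> 1.4")   # stays on the 1.x series
exact = resolve(published, "= 1.4.0")   # stays on one exact version
```

Without a constraint, the resolver simply takes the newest upload, which is how a cookbook published for testing can reach systems that were never meant to receive it. In a Chef cookbook the equivalent pin would be a constraint on the `depends` line in `metadata.rb`.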
How did we respond and recover?
As noted above, the changes made were manually reverted to a working state. Additionally, Chef was temporarily stopped from re-applying changes to the affected servers until we could recover from the downtime. Once we had the version pinning in place, we re-applied the changes via Chef and services stayed operational.
How might we prevent similar issues from occurring again?
We are taking several measures to improve our response to issues like these and to prevent them from recurring. Specifically, we'll be investigating what monitoring we can add to alert us earlier to issues of this nature. Additionally, version pinning for the cookbooks upon which the affected cookbook depends has been put in place so we can carry out further testing. We also plan on completely removing any cookbooks that depend upon the older library. Finally, we'll be reviewing our incident response procedures, and in particular how and when to give an "All Clear" signal to our customers.
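As one possible shape for an "All Clear" rule, the sketch below only reports recovery after a run of consecutive healthy checks. It is an illustration rather than our actual procedure: the class name, the threshold, and the idea of feeding it status codes from a periodic probe of the redirector are all assumptions made for the example.

```python
from collections import deque

# The redirector is expected to answer with a 301 redirect.
EXPECTED_STATUS = 301


class AllClearGate:
    """Hypothetical helper: signal recovery only after N healthy checks in a row."""

    def __init__(self, required_healthy=5):
        self.required_healthy = required_healthy
        # Keep only the most recent N results; older ones fall off the left.
        self.recent = deque(maxlen=required_healthy)

    def record(self, status_code):
        """Store whether one probe of the redirector returned the expected code."""
        self.recent.append(status_code == EXPECTED_STATUS)

    def all_clear(self):
        """True only when the window is full and every check in it was healthy."""
        return len(self.recent) == self.required_healthy and all(self.recent)


gate = AllClearGate(required_healthy=3)
for code in [301, 503, 301, 301]:
    gate.record(code)
still_recovering = gate.all_clear()  # the 503 is still inside the window

gate.record(301)
recovered = gate.all_clear()         # three healthy checks in a row
```

A single failed check resets the countdown, so an incident would stay open through a brief recovery followed by a relapse.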
We take any downtime very seriously at DNSimple, and I'm very sorry about any effect on your systems this may have had. Hopefully these changes will help prevent issues like this from happening in the future. As always, please let us know if there's anything we can help you with regarding the use of our service.