A few days ago our customers in Japan and South Korea experienced issues resolving DNS queries via our name servers in those regions. The outage lasted 4 hours and 20 minutes in Japan and 5 hours and 6 minutes in South Korea. This post outlines what went wrong, what we've learned, and how we'll adjust for the future to prevent outages like this from happening again.
At 23:40:00 UTC on Saturday, March 12th, our monitoring started reporting ALIAS record lookup failures from the monitoring system in Tokyo. A few minutes later, customers started reporting failures when resolving their domains. At 01:19 we contacted our upstream DDoS defense provider through their support system and provided information that led to the identification of the root cause and, subsequently, to the deployment of a fix that gradually restored our service in the region.
Why did it happen?
A network provider that serves our DDoS defense protection and connects to us in the Japan and South Korea regions started experiencing routing issues between its network and ours. These problems caused DNS lookup outages in these regions, making our website and many of our customers' domains unreachable as cached DNS lookups began to expire.
How did we respond and recover?
Once our monitoring made us aware of the outage, we started to investigate. A video call was set up and DNSimple team members joined to discuss the problem. First, we confirmed that our servers were functioning correctly. We then contacted our DDoS defense provider through their support system to inform them of the issue. We exchanged several emails with them, providing traceroutes and other diagnostic details supplied by our customers, until they located the problem. The total time to identify the source of the issue was about two hours. In the meantime, we answered customers on Twitter and via email support, based on the information we had available at that moment.
Once our DDoS defense provider's engineers were aware of the situation, they started looking for possible failures and found the routing issue with their network provider in Asia. They then rerouted the traffic through another network provider and name resolution started to come back, first in Tokyo and then in South Korea.
How might we prevent similar issues from occurring again?
We have started a discussion with our DDoS defense provider about how to improve issue escalation and the detection of routing issues. In addition, we are reviewing our monitoring systems and alert policies to identify ways to improve alerting in cases like this.
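To make the idea of an alert policy concrete, here is a minimal sketch of one possible rule: run a lookup check on a schedule and raise an alert once several checks in a row have failed. This is an illustration only and not a description of our actual monitoring. The function names `resolves` and `should_alert` and the threshold of three are assumptions made for the example, and the check uses the standard system resolver, so a probe like this would need to run from within each region, as our Tokyo monitor does, to notice a regional routing problem.

```python
import socket
from typing import Callable, Iterable


def resolves(hostname: str,
             lookup: Callable[[str], object] = socket.gethostbyname) -> bool:
    """Return True if the hostname resolves, False if the lookup fails.

    `lookup` is injectable so the check can be exercised without a network.
    """
    try:
        lookup(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def should_alert(results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive checks have failed.

    `results` is the history of check outcomes, oldest first.
    The threshold is a hypothetical value chosen for illustration.
    """
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False
```

Requiring a few consecutive failures avoids paging on a single dropped query, at the cost of a slightly slower alert. The right threshold depends on how often the check runs.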
Uptime is very important at DNSimple. The businesses of thousands of customers depend on our platform, and our goal is to maximize the uptime of DNS resolution. I sincerely apologize for any negative effect this issue may have had on your business. While outages are sometimes out of our control, we continue to strive to implement monitoring and alerting solutions so that we can locate issues quickly and give our upstream providers all the information they need to identify and solve the problem as quickly and effectively as possible.
If you have any additional questions about this outage, or any other questions for us, please contact firstname.lastname@example.org.