On Friday evening (Saturday morning in Europe), the San Jose data center suddenly stopped responding, taking our Redirector service down with it. Any customer using URL records for redirects, or attempting to resolve DNS in that area, would have seen an outage or very slow lookups. At about the same time, a Denial of Service attack was causing slow or unresponsive DNS queries on one of our DNS servers in the Europe region. This post outlines what happened in detail and how we are responding to help prevent outages like this in the future.
At 2:29 UTC on May 9th, our monitoring system began alerting us that DNS lookups of dnsimple.com were failing. The alert kept resolving itself and re-triggering, which pointed to a network problem because lookups were failing only intermittently. Initially, we were unaware of an attack and assumed it was a regional networking issue that needed to be escalated to our hosting provider.
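The resolve-and-re-trigger pattern described above is often called flapping. The following is only an illustrative sketch, not the logic our monitoring system uses: it assumes check results are available as a list of booleans, and the `count_transitions` and `is_flapping` helpers, the threshold, and the sample data are all hypothetical.

```python
from typing import Sequence


def count_transitions(results: Sequence[bool]) -> int:
    """Count how many times consecutive check results change state."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def is_flapping(results: Sequence[bool], min_transitions: int = 3) -> bool:
    """Return True when the check changed state at least min_transitions times.

    The default threshold is an illustrative assumption, not a value
    taken from the incident.
    """
    return count_transitions(results) >= min_transitions


# Hypothetical samples: True = lookup succeeded, False = lookup failed.
intermittent = [True, False, True, True, False, True, False, True]
hard_down = [True, True, False, False, False, False, False, False]

print(is_flapping(intermittent))  # many state changes: intermittent failures
print(is_flapping(hard_down))     # one state change: a sustained failure
```

A check that changes state many times in a short window suggests intermittent packet loss or an overloaded server, while a single change followed by continuous failure looks more like the hard outage we saw in San Jose.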
About 3 minutes later, a separate monitor for our Redirector service alerted. We were paged again, and more DNSimple employees joined chat to observe what was happening with the Redirector service. Shortly after this alert, we realized that none of the systems in the San Jose data center were responding. We alerted our hosting provider, who acknowledged the issue and was already investigating. They had identified a network switch that had suddenly stopped responding, which left all of our servers in the San Jose data center without internet connectivity. Our Redirector service was hosted there at the time, so any customers with URL records were suddenly experiencing outages. Customers querying our DNS servers in this region were automatically sent to other regions, which increased response time for lookups.
Minutes after realizing that the entire data center was down, we began to search for alternative redirectors we could use until the servers returned to service. Before this outage, we had been rebuilding the redirector and zone server services so that at least one of each would run in every data center in the event of an outage like this, but we did not have any online at the time. About an hour into the outage, we began rebuilding a redirector, since we still did not have an ETA from our hosting provider for the network switch coming back online.
At approximately 4:39 UTC, the network switch had been rebooted and was handling traffic normally. This resolved the monitoring check for the Redirector service, and we changed the status of the Redirector service to "Monitoring" on our status page. Around this time, the DNS checks in the Amsterdam region were still cycling between failing and recovering, so we contacted our DDoS protection provider, CloudFlare, about the issue. They confirmed that a Denial of Service attack was in progress in Amsterdam, which was triggering the lookup failure alerts.
Around 5:13 UTC, we stopped receiving alerts from our monitoring system, and CloudFlare told us that the issue was resolved and network access had been restored to normal service. Roughly 5 hours after this whole issue started, we updated our status page to all systems clear.
Why did it happen?
Our hosting provider later identified the cause of the network switch failure: some packets from our monitoring checks were getting stuck in the network switch, causing a slow memory leak. Eventually, the switch ran out of memory to process incoming packets, which led to the failure we experienced. Unfortunately, this was also the region where our redirector service was hosted at the time, so when the network switch stopped working, the redirector became unreachable from the internet. As for the Denial of Service attack, we have not learned why it happened, which is sadly par for the course with Denial of Service attacks.
How did we respond and recover?
When it appeared that the network issues in San Jose were not going to be resolved easily, we began rebuilding a redirector server in the Ashburn, VA data center as a backup measure. By the time it was ready, the hosting staff had resolved the network issue. As for the Denial of Service attack, the response was largely the work of CloudFlare, which is why we use them for this protection service. They responded quickly, and the attack was mitigated soon after.
How might we prevent similar issues from occurring again?
Before this outage began, we had already identified this type of situation as a potential disaster scenario. Our plan was, and still is, to set up every data center we can as a potential disaster fallback in the event this happens. This is something the operations team at DNSimple plans for as a disaster scenario: "What if this data center falls off the map?" Our hosting provider is also going to upgrade the software on our network switches to prevent the memory leak issue we mentioned earlier. This will require a little downtime during the upgrade, so please keep an eye on our status page for updates about when we will schedule these outages.
The same goes for our DDoS protection, which we started putting in place after our outage in December. In its first realistic test, it seemed to work better than expected: the network was degraded in performance, but it was not completely down as it was in December. We will be adding this protection to additional name servers worldwide in order to mitigate future attacks.
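The "What if this data center falls off the map?" question can be asked mechanically. The sketch below is a hypothetical illustration, not our actual tooling: the `single_points_of_failure` helper, the service names, and the placement data are assumptions loosely modeled on the layout at the time of this outage.

```python
from typing import Dict, List, Set


def single_points_of_failure(placement: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each data center to the services lost entirely if it goes offline.

    placement maps a service name to the set of data centers running it.
    A service is lost with a data center only when that data center is
    the sole place the service runs.
    """
    all_dcs: Set[str] = set()
    for dcs in placement.values():
        all_dcs |= dcs

    at_risk: Dict[str, List[str]] = {}
    for dc in sorted(all_dcs):
        lost = sorted(svc for svc, dcs in placement.items() if dcs == {dc})
        if lost:
            at_risk[dc] = lost
    return at_risk


# Hypothetical placement: the redirector runs in one data center only.
placement = {
    "redirector": {"san-jose"},
    "dns": {"san-jose", "amsterdam", "ashburn"},
}

print(single_points_of_failure(placement))
```

An empty result would mean that no single data center failure takes a service fully offline, which is the state we are working toward.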
Situations like this are very rare, and we did not put disaster recovery plans in place fast enough to respond to this outage. We are very sorry that the recovery was not as fast as it should have been. We are already working very hard to prevent this in the future. Please let us know if there is anything we can help you with regarding the use of our service, or if you have any concerns of your own.