On Saturday, September 19th, 2014 at approximately 11:17 UTC we experienced a preliminary DDoS attack aimed at a customer's domain name. The initial attack was primarily felt in our Tokyo data center. At 11:45 UTC we received initial reports from customers on Twitter reporting issues with their domain names. At 12:17 UTC the attack escalated significantly, impacting multiple data centers and causing a major DNS outage for our customers.
Here is the shape of the attack:
The outage was primarily due to the DDoS attack; however, its effects continued even after the attack had subsided. The volume of queries caused some of our name servers to stop responding to queries entirely, even though they continued to announce themselves to the upstream routers. Other name servers were still responding but with significantly degraded performance, so some responses were sent while others timed out.
Following the incident we spoke with the customer that was the target of the attack, however we were unable to determine the reason behind the attack.
When we began receiving reports on Twitter from our customers, our first step was to review the network traffic patterns. The traffic patterns showed a spike, but at levels that appeared to be within the bounds of what our systems can handle. At this point our monitoring systems were not reporting the impact on the Tokyo data center as an outage, although we later determined that Tokyo was in fact affected by the initial attack. When the second, larger attack occurred, we began receiving notifications from our monitoring system indicating issues in both our Tokyo and San Jose data centers.
At 12:24 UTC we posted the first notice to our status page and contacted our network provider to inform them of the attack. We also began looking for the target of the attack. After identifying the target, we contacted the customer and requested that they change the delegation for the domain to another provider. We also passed further details to our network provider in the hope that they would be able to assist.
By 13:34 UTC the delegation was changed and the attack began to subside. At this point we were still seeing flapping on our servers, and we noticed some of the physical nodes reporting memory warnings, which indicated that the servers were still in trouble after the attack. Since the outage was still affecting resolution of the host names we use to address our virtual containers, we connected to servers remotely by IP address and determined which were still responding to queries and which were not. Once we confirmed that at least one data center was fully operational, we began removing the failed name servers from our Anycast network so we could repair them. By approximately 14:00 UTC we had removed the failed name servers and customer DNS resolution had returned to normal.
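The liveness check we performed by IP can be sketched roughly as follows. This is an illustrative sketch, not the tooling we actually used: it sends a single minimal DNS A query over UDP to a given address and reports whether the server answered within a timeout (the query name and timeout are arbitrary choices for the example).

```python
import socket
import struct

def dns_server_responds(ip, name="example.com", timeout=2.0):
    """Return True if the server at `ip` answers a DNS A query for `name`."""
    # Minimal DNS header: ID=0x1234, RD flag set, one question, no other records.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # Question section: length-prefixed labels, root terminator, QTYPE=A, QCLASS=IN.
    question = b"".join(
        bytes([len(label)]) + label.encode() for label in name.split(".")
    ) + b"\x00" + struct.pack(">HH", 1, 1)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(header + question, (ip, 53))
        data, _ = sock.recvfrom(512)
        # Any reply with a matching transaction ID counts as "responding".
        return data[:2] == b"\x12\x34"
    except OSError:  # timed out or network unreachable
        return False
    finally:
        sock.close()
```

Running a probe like this against each server's IP, rather than its host name, sidesteps the circular dependency of using DNS to diagnose a DNS outage.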
We returned each failed name server to production one at a time to avoid overloading our internal zone servers. By 15:30 UTC the majority of the name servers were fully operational. A few name servers had residual issues and needed additional maintenance. By 16:10 UTC all name servers were returned to full operational capacity.
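The staggered restore described above can be sketched like this. The structure is illustrative (the pacing, names, and probe are hypothetical): each repaired server is checked before it rejoins the pool, a pause between reintroductions keeps the internal zone servers from seeding more than one node at a time, and servers that fail the check are set aside for further maintenance.

```python
import time

def restore_one_by_one(servers, probe, pause=30):
    """Reintroduce servers sequentially; defer any that fail the probe."""
    restored, deferred = [], []
    for server in servers:
        if probe(server):            # e.g. a DNS liveness check by IP
            restored.append(server)
            time.sleep(pause)        # let the zone transfer settle before the next one
        else:
            deferred.append(server)  # residual issues: needs additional maintenance
    return restored, deferred
```

The deferred list corresponds to the handful of servers that needed extra work before everything was back at full capacity.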
Denial-of-service attacks will happen. While there is no way to prevent attacks, we, as your DNS service provider, can improve our mitigation strategies. We have either already started taking, or will take, the following steps:
While attacks are inevitable, we still let our customers down by not being able to mitigate this attack faster. I am sorry that you were affected by this attack, and I assure you we will continue to do everything in our power to improve our defensive capabilities for all of our customers. Thank you to everyone who provided positive support during this event; the entire DNSimple team appreciates it.
Early this morning we discovered instances in two of our data centers (Virginia and Chicago) that appeared to be operating normally but were, in fact, not behaving as expected. We believe this was due to residual effects from Saturday's attack.
The servers which were failing have now been restarted and are responding properly.
Due to this additional information, we will now add the following steps to those outlined above: