Incident Report - DDoS Attack (December 1st, 2014)
This past Monday, we experienced a major volumetric DDoS attack that caused significant downtime for our site and the sites of our customers. The attack included sustained traffic of up to 25 Gb/s and about 50 million packets per second sent to our servers. I am very sorry that this outage happened and had such sustained and wide-reaching effects. We have assembled here what we know about the attack, what we did to mitigate it, and the steps we are taking to mitigate similar attacks in the future. The following post-mortem is a collaboration from the entire DNSimple team.
On December 1st, 19:09 UTC, the DNSimple team was alerted by Sensu that the dnsimple.com web site was not resolving. Based on those alerts, our status page was automatically updated to indicate a major outage on our web site and redirector. At 19:19 UTC, we put up a notice on our status page indicating we were investigating DNS resolution issues. At 19:30 UTC, the entire DNSimple team was assembled in a Google Hangout to investigate the outage. The entirety of the team remained in the Google Hangout for the majority of the event. The team was split up to handle customer support emails, Twitter response, hands-on technical adjustments, and communications with our upstream provider. During the course of the outage, we posted updates to our status page as well as to our main Twitter account.
Why did it happen?
A new customer signed up for our service and brought in multiple domains that were already facing a DDoS attack. The customer had already tried at least 2 other providers before DNSimple. Once the domains were delegated to us, we began receiving the traffic from the DDoS.
DNSimple was not the target of the attack, nor were any of our other customers.
The volume of the attack was approximately 25 Gb/s of sustained traffic across our networks, with around 50 million packets per second. In this case, the traffic was sufficient to overwhelm the 4 DDoS protection devices we had placed in our data centers after a previous attack (there is also a 5th device, but it was not yet online in our network).
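As a rough sanity check on those two figures, the average packet size can be estimated by dividing the bit rate by the packet rate. The sketch below assumes the reported volume is in gigabits (not gigabytes) per second and that both numbers describe the same moment of the attack; the function name is only illustrative.

```python
def average_packet_bytes(bits_per_second, packets_per_second):
    """Estimate the mean packet size in bytes from a bit rate and a packet rate."""
    return bits_per_second / 8 / packets_per_second


# Assumption: 25 gigabits per second and 50 million packets per second.
avg = average_packet_bytes(25e9, 50e6)
print(f"average packet size: {avg} bytes")
```

Under that assumption the average works out to 62.5 bytes per packet, which would be consistent with a flood of very small packets such as short DNS queries.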
How did we respond and recover?
After identifying the attack, we started a call with the entire DNSimple team, as well as engineers from our upstream network and managed server provider and the device manufacturer. We identified the traffic volume and began considering ways to mitigate the issue. Network engineers at our upstream provider looked for patterns in the traffic that would allow it to be blocked at the router. They discovered the attack targeted random subdomains under a specific target domain. It appears to have included both UDP and TCP requests.
We also attempted to contact both the domain registrar and the domain registry to request removal of the delegation. Unfortunately, neither responded until after the incident had passed.
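A random-subdomain pattern like this can be spotted in a query log by checking, per parent domain, how many of the queried subdomains are unique. The following is a minimal sketch of that idea, not the method the network engineers used; the function name, the thresholds, and the assumption that the parent domain is the last two labels of the name are all hypothetical.

```python
from collections import defaultdict


def random_subdomain_suspects(qnames, zone_labels=2, min_unique=100, min_ratio=0.9):
    """Return parent domains whose queried subdomains are almost all unique.

    qnames is an iterable of query names such as "abc123.victim.example.".
    The result maps each suspect parent domain to (total queries, unique subdomains).
    """
    totals = defaultdict(int)
    uniques = defaultdict(set)
    for qname in qnames:
        labels = qname.rstrip(".").lower().split(".")
        if len(labels) <= zone_labels:
            continue  # a query for the parent domain itself, not a subdomain
        parent = ".".join(labels[-zone_labels:])
        prefix = ".".join(labels[:-zone_labels])
        totals[parent] += 1
        uniques[parent].add(prefix)

    suspects = {}
    for parent, total in totals.items():
        unique = len(uniques[parent])
        # Many distinct subdomains that are rarely repeated suggests random labels.
        if unique >= min_unique and unique / total >= min_ratio:
            suspects[parent] = (total, unique)
    return suspects


# Example with reserved ".example" names: 200 one-off subdomains versus 200 repeats.
queries = [f"r{i}.victim.example." for i in range(200)] + ["www.normal.example."] * 200
print(random_subdomain_suspects(queries))
```

Legitimate traffic tends to repeat a small set of names, so a high ratio of unique subdomains to total queries is a useful, though not conclusive, signal.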
Once we determined that we would not be able to handle the traffic with the DDoS devices we had in place and that we would not be able to remove the delegation, we began working with our providers to find a larger device. Our upstream provider had one such device, with capacity for 20 Gb/s in and 20 Gb/s out, in their primary data center. We decided to try to put this device into production to see if it could act as a scrubber for all DNS traffic. All traffic would be sent to one data center, thus losing the Anycast benefits, but this was better than having all systems remain unresponsive.
The process of installing the device took longer than we had hoped. The device was physically some distance from our cabinets and it took time to find fiber cables of the appropriate length to connect it to our cabinet's switches. Additionally, the device needed to be configured by the manufacturer with rules that would address the traffic. This process took around 2 hours to complete.
During this process, the attacks subsided somewhat and we began seeing service restored in our US data centers, which we indicated on our status page. However, this reprieve was short-lived and we soon saw the attacks start again with renewed vigor. Once the full installation and configuration of the larger capacity mitigation device was complete, we disabled all routing except to the data center housing it.
Unfortunately, this did not resolve the issue for multiple reasons. First, there was not sufficient time to configure the device properly, which resulted in problems detailed below. Second, even once the device was in place, the amount of traffic that was actually passed through the device caused our name server software to crash shortly after receiving the volume of requests.
After the attempt to scrub traffic through one data center failed, we reverted to distributing traffic across our Anycast network. At this point we were still under attack, but no longer at the original volumes. We were able to recover four of our data centers and began handling some DNS traffic. During the rebooting process, there were also issues with some systems where the software was not able to obtain TCP connections and thus the reboot took additional time while those TCP resources were released.
By about 6 AM UTC we had restored UDP traffic to the majority of our systems. We were still experiencing resolution failures for A records in our Amsterdam and Tokyo data centers, but other record types were resolving properly. After some research (many thanks to Peter van Dijk of PowerDNS for his help here), we decided to disable the DNS defense mechanism on the DDoS protection device in the Amsterdam facility. Once we did this, all resolution returned to normal there.
We were still showing resolution issues with some public resolvers (such as Google Public DNS). We went through the remaining data centers and removed the DNS defense mechanism from the other DDoS protection devices, and eventually we were able to get successful resolution from Google Public DNS.
Once resolution became more normal, we began reviewing all of our systems and discovered that the redirector was not functioning properly. We rebooted the software and it was able to recover and begin redirecting again.
During the entire outage, team members traded off handling customer support, Twitter, and updating our status page. Keeping customers up-to-date is critical in an event like this, and everyone stepped in to make sure we did our best to keep the updates flowing.
How might we prevent similar issues from occurring again?
We do not have the skills and budget to develop a complete DDoS solution internally. It is a very expensive endeavor and requires expensive equipment, lots of bandwidth, and deep knowledge of how to mitigate attacks. We have signed a contract with a well-known third-party service that provides external DDoS protection using reverse DNS proxies. Presently, our primary objective is to get all DNS traffic routed through the vendor first so that they can cache and deflect volumetric attacks like the one we just experienced.
We are also well aware that one mitigation strategy is to allow customers to have secondary servers that can work with our primary servers, and that this would have allowed many customers to continue operating, albeit at a possibly degraded level. While there are some challenges with providing this service, we have determined that the development of this feature is essential to moving forward at this point and are starting work on the development of support for secondary name servers immediately.
We will work on providing zonefile backups and downloads from a location outside of our primary data center, as well as the ability to change authoritative name servers outside of the primary DNSimple application for domains registered with us.
Additionally, we are reviewing our internal name resolution architecture to see if we can improve the reliability of our internal naming structure and disconnect it from customer name resolution, thus allowing the DNSimple site to maintain operations even while under attack.
Finally, we are evaluating segmenting customers onto different networks, allowing us to better mitigate attacks on one customer so that only a subset of other customers are affected, and so that those customers may be moved to other segments if an attack is sustained for an extended period of time.
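To illustrate the segmentation idea, here is a minimal sketch in which each customer is mapped to one of several network segments by hashing a customer identifier, with an override table for moving specific customers. This is only one possible approach under stated assumptions, not a description of what we will build; the segment names, customer IDs, and function names are hypothetical.

```python
import hashlib


def assign_segment(customer_id, segments, overrides=None):
    """Pick a segment for a customer; an override pins a customer to a given segment."""
    if overrides and customer_id in overrides:
        return overrides[customer_id]
    # A stable hash keeps each customer on the same segment between runs.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(segments)
    return segments[index]


def affected_customers(customers, segments, attacked_customer, overrides=None):
    """List the customers that share a segment with the attacked customer."""
    target = assign_segment(attacked_customer, segments, overrides)
    return [c for c in customers if assign_segment(c, segments, overrides) == target]


segments = ["seg-a", "seg-b", "seg-c", "seg-d"]
customers = [f"customer-{i}" for i in range(1000)]

# Without overrides, only the customers on the attacked segment are exposed.
exposed = affected_customers(customers, segments, "customer-0")
print(len(exposed), "of", len(customers), "customers share the attacked segment")

# Moving the attacked customer to a dedicated segment isolates the attack.
overrides = {"customer-0": "quarantine"}
print(affected_customers(customers, segments, "customer-0", overrides))
```

The same override table could be used the other way around, moving unaffected customers off a segment that remains under sustained attack.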
These sorts of attacks come with the business of being a DNS service provider, and for our failure to mitigate this attack faster we are deeply sorry. We are doing all we can to improve our response in situations like these in the future and hope to prevent such wide-reaching effects of future attacks. Everyone involved, from the DNSimple team, to our upstream provider, to the DDoS device provider, did the best they could given the situation. Ultimately, we are responsible for our uptime and we failed to deliver.
We deeply appreciate the positive support we received from so many of our customers, and in a very special way appreciate the assistance offered by Mark Jeftovic from easyDNS, Steven Job from DNSMadeEasy, and Peter, Bert and all of the #PowerDNS IRC channel. Thank you so much.