Incident Report - DNS Outage due to DDoS Attack (June 3rd, 2013)
Several days ago we detected the start of a DNS amplification attack. This attack was nothing special until the morning of June 3rd, when it changed in a way that caused an outage of the DNSimple name servers.
In this type of attack we are one player in a larger game. The attacker wants to use our servers to bring down yet another network. Unfortunately this style of attack is becoming more common. To entirely remove the possibility of this attack is probably a larger discussion for the internet industry as a whole. We'll stay focused on how it affected DNSimple customers.
Because this is a common attack we attempt to absorb it and avoid participating in the larger attack. We employ phreld to filter IP traffic from sources behaving well outside "normal" DNS traffic.
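To illustrate the general idea of rate-based filtering, here is a minimal sketch that counts packets per source IP over a sliding window and bans sources that exceed a threshold. This is not phreld's actual algorithm, which this report does not describe; the class name, threshold, and window are hypothetical.

```python
from collections import defaultdict, deque


class RateBanFilter:
    """Hypothetical sliding-window filter; not phreld's real implementation."""

    def __init__(self, max_packets, window_seconds):
        self.max_packets = max_packets        # assumed per-window threshold
        self.window_seconds = window_seconds  # assumed observation window
        self.seen = defaultdict(deque)        # source IP -> recent timestamps
        self.banned = set()

    def allow(self, source_ip, now):
        """Return True if the packet should be passed on to the DNS daemon."""
        if source_ip in self.banned:
            return False
        stamps = self.seen[source_ip]
        stamps.append(now)
        # Forget timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_packets:
            # Source is sending faster than the threshold: ban it.
            self.banned.add(source_ip)
            del self.seen[source_ip]
            return False
        return True


# Usage: a source sending a packet every 0.1s against a limit of 3 per second.
ban_filter = RateBanFilter(max_packets=3, window_seconds=1.0)
decisions = [ban_filter.allow("192.0.2.1", i * 0.1) for i in range(5)]
print(decisions)
```

Note that packets are still admitted until the threshold is crossed, which is why detection delay matters under a high-rate attack.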
At about 5:45am Pacific time (1345 UTC) on 3 June 2013 we started receiving alerts indicating resolution was failing on multiple name servers. Within the next five minutes all name servers had stopped responding. We woke the entire team and began working on our response.
By 6:45am Pacific time (1445 UTC) we had narrowed the trouble down to our ALIAS resolution software and pulled that capability from production. The DNSimple application was available to customers at this point and anyone not relying on ALIAS or URL redirection should have experienced some improvement.
We discovered our software responsible for ALIAS resolution bogs down when sent a specially crafted ANY DNS request. We don't know if the bad guys happened on this vulnerability accidentally or purposefully.
Boundary is our go-to service when we want a picture of network behavior. We identified two distinct patterns of attack. The first style (which continues at this time) is a common amplification attack peaking near 60 Kpps (thousand packets per second). A single UDP DNS packet is a complete request (here, an ANY query from a spoofed source IP address), and the response can be many times larger than the request, hence the "amplification".
Because phreld takes a few seconds to ban a bad IP address, we still have to field many thousands of requests. Normally this isn't a problem, but with a specially crafted packet our software responded too slowly and queued requests until we stopped responding to all traffic.
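As a rough worked example of the amplification arithmetic: only the 60 Kpps peak comes from this report. The request and response sizes below are hypothetical values chosen for illustration, not measurements from the attack.

```python
def amplification_factor(request_bytes, response_bytes):
    """Ratio of response size to request size."""
    return response_bytes / request_bytes


def bandwidth_mbps(packets_per_second, bytes_per_packet):
    """Traffic volume in megabits per second."""
    return packets_per_second * bytes_per_packet * 8 / 1_000_000


PEAK_PPS = 60_000               # peak rate stated in this report
ASSUMED_REQUEST_BYTES = 64      # hypothetical size of a small ANY query
ASSUMED_RESPONSE_BYTES = 3_000  # hypothetical size of a large ANY response

factor = amplification_factor(ASSUMED_REQUEST_BYTES, ASSUMED_RESPONSE_BYTES)
inbound = bandwidth_mbps(PEAK_PPS, ASSUMED_REQUEST_BYTES)
outbound = bandwidth_mbps(PEAK_PPS, ASSUMED_RESPONSE_BYTES)

print(f"amplification factor: {factor:.1f}x")
print(f"inbound:  {inbound:.2f} Mbit/s")
print(f"outbound: {outbound:.2f} Mbit/s")
```

Under these assumed sizes the attacker spends a small amount of bandwidth on queries while the spoofed victim receives far more in responses.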
Why did it happen?
Our current DNS server implementation allows ANY queries on UDP to pass through and attempts to respond to them, albeit with the TC (truncation) bit set. The overhead created by our ALIAS resolution system was also a factor, especially for ALIAS records pointing to other records within DNSimple.
The situation was further exacerbated by the fact that NS3 was offline due to phreld not functioning correctly. Since keeping NS3 online in this condition would have aided the attackers, we felt it was better to take it offline completely. Unfortunately this also reduced our capacity at the same time.
How did we respond and recover?
First we attempted to identify the attack patterns. When the attack caused a failure of all name servers due to ALIAS resolution, we disabled that feature (along with URL forwarding and POOL records) to enable the system to recover. We also attempted to introduce a software change to minimize the size of the response sent for ALIAS queries; however, this affected our ability to deliver MX records in conjunction with ALIAS responses, so we rolled that change back.
During the peak of the attack we manually banned several IP blocks when we detected that the attacker was targeting multiple addresses within those blocks.
Once the attack was sufficiently mitigated we began looking for changes to our software that would help us deal with this type of attack better in the future. We explained the attack in the PowerDNS IRC room and received several helpful responses. There is a feature in PowerDNS 3.3 which causes all ANY UDP queries to result in an empty response with the TC bit set. This feature is specifically designed to deal with attacks like this, so we began looking into what it would take to deploy it. We deployed PowerDNS 3.3 to NS3, enabled the feature, and it worked as advertised. Huge thanks to the PowerDNS crew and developers on the IRC channel who helped out, especially Peter van Dijk (@habbie), who tipped us off to the availability of this feature in the newest version of PowerDNS.
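To make "an empty response with the TC bit set" concrete, here is a minimal sketch of what such a reply looks like on the wire, based on the standard DNS header layout. It is not PowerDNS code. The function names are hypothetical, it assumes a well-formed query with an uncompressed name, and it sets only the QR and TC flags (a real authoritative server would set others, such as AA).

```python
import struct

QTYPE_ANY = 255
FLAG_QR = 0x8000      # message is a response
FLAG_TC = 0x0200      # response was truncated; client should retry over TCP
FLAG_RD = 0x0100      # recursion desired, echoed back from the query
OPCODE_MASK = 0x7800


def question_end(packet):
    """Return the offset just past the first question (QNAME, QTYPE, QCLASS)."""
    pos = 12  # the DNS header is always 12 bytes
    while packet[pos] != 0:
        pos += 1 + packet[pos]  # skip the length byte and the label itself
    return pos + 1 + 4  # root label, then 2-byte QTYPE and 2-byte QCLASS


def truncated_reply_for_any(query):
    """Return an empty TC=1 reply if the query is of type ANY, else None."""
    ident, flags, qdcount = struct.unpack("!HHH", query[:6])
    if flags & FLAG_QR or qdcount != 1:
        return None  # not a single-question query
    end = question_end(query)
    (qtype,) = struct.unpack("!H", query[end - 4:end - 2])
    if qtype != QTYPE_ANY:
        return None
    reply_flags = FLAG_QR | FLAG_TC | (flags & (OPCODE_MASK | FLAG_RD))
    # One question echoed back, zero answer/authority/additional records.
    header = struct.pack("!HHHHHH", ident, reply_flags, 1, 0, 0, 0)
    return header + query[12:end]


def build_query(name, qtype, ident=0x1234):
    """Hypothetical helper that builds a minimal query for the demo below."""
    header = struct.pack("!HHHHHH", ident, FLAG_RD, 1, 0, 0, 0)
    labels = b"".join(bytes([len(p)]) + p.encode("ascii") for p in name.split("."))
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


query = build_query("example.com", QTYPE_ANY)
reply = truncated_reply_for_any(query)
print(len(query), len(reply))
```

Because the reply carries no records, it is the same size as the query, which removes the amplification. A legitimate client that sees the TC bit retries over TCP, where the source address cannot be spoofed as easily.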
We also reduced the time before phreld bans an IP address from 10 seconds to 2 seconds, which helped reduce the number of packets that get through to our daemons.
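The effect of the shorter ban delay can be estimated with simple arithmetic: the number of packets a source gets through before it is banned is roughly its send rate multiplied by the ban delay. The 10 second and 2 second values come from this report. The per-source rate below is a hypothetical figure for illustration.

```python
def packets_before_ban(source_pps, ban_delay_seconds):
    """Approximate packets a single source delivers before it is banned."""
    return source_pps * ban_delay_seconds


ASSUMED_SOURCE_PPS = 1_000  # hypothetical per-source rate, not from the report

before = packets_before_ban(ASSUMED_SOURCE_PPS, 10)  # old ban delay
after = packets_before_ban(ASSUMED_SOURCE_PPS, 2)    # new ban delay
reduction = 1 - after / before

print(f"old delay: {before} packets, new delay: {after} packets")
print(f"reduction: {reduction:.0%}")
```

The relative reduction does not depend on the assumed send rate, since it is determined only by the ratio of the two delays.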
How might we prevent similar issues from occurring again?
First we will deploy versions of our DNS server that support the ANY response with the truncate bit as described above. In addition, we are beginning to move customers to our Anycast network, which will increase our capacity by an order of magnitude. Finally, we will be reviewing the ALIAS records currently in our system to identify records which pose a potential risk due to resolution back into our own servers.
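The ALIAS review could be sketched as follows: flag any ALIAS record whose target falls inside a zone we host ourselves, since resolving it loops back into our own name servers. The data model here (tuples of name and target, a set of hosted zone names) is hypothetical and does not reflect DNSimple's actual schema.

```python
def in_zone(hostname, zone):
    """True if hostname is the zone apex or a name beneath it."""
    hostname = hostname.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    return hostname == zone or hostname.endswith("." + zone)


def risky_aliases(alias_records, hosted_zones):
    """Return ALIAS records whose target resolves back into a hosted zone."""
    return [
        (name, target)
        for name, target in alias_records
        if any(in_zone(target, zone) for zone in hosted_zones)
    ]


# Hypothetical sample data for illustration only.
records = [
    ("example.com", "www.example.net"),
    ("shop.example.org", "cdn.provider.example"),
    ("a.example.net", "notexample.net"),
]
hosted = {"example.net", "example.org"}

print(risky_aliases(records, hosted))
```

Matching on a leading dot avoids false positives for names that merely end with the same characters as a hosted zone.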