On Thursday, May 20th, we experienced a denial of service attack targeting all of our data centers. This post outlines what happened during the attack and how we are addressing the issues that arose as a result.
On Thursday morning, beginning at 8:19 AM UTC, we were alerted to DNS resolution failures on ns1.dnsimple.com in our Amsterdam data center. We quickly identified a significant spike of traffic that had hit all 5 of our data centers. This traffic was apparently malicious and directed specifically at one of our name server names. In addition to UDP port 53 traffic, we also identified spikes on UDP ports 80 and 5060; the attacker appeared to be using multiple attack vectors concurrently. The volume of the traffic was within our network capacity, and all other data centers, as well as the other name servers in Amsterdam, appeared to continue resolving correctly. We opened an issue on our status page notifying customers at 8:26 AM UTC.
At 8:46 AM UTC we received notification from our monitoring systems that DNS resolution in our Amsterdam data center was failing across multiple name servers, including those not under attack. Additionally, we identified an issue resolving DNS entries from our web application, which resides in our Virginia data center. This issue led to 502 errors for some customers in our web application. We reported this issue to our upstream network provider at 8:54 AM UTC.
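The kind of resolution check our monitoring performs can be approximated with a small probe. The sketch below, written in Python with only the standard library, builds a minimal DNS A-record query in RFC 1035 wire format and reports whether a given server answers within a timeout. The function names, timeout, and structure are illustrative assumptions, not our actual monitoring code.

```python
import socket
import struct

def build_query(name: str, qid: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: transaction id, flags (RD=1), QDCOUNT=1, AN/NS/ARCOUNT=0
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    # QTYPE=A (1), QCLASS=IN (1)
    return header + qname + struct.pack(">HH", 1, 1)

def resolves(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server` answers a DNS query for `name` in time."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeout, network unreachable, etc.
            return False
    # Matching transaction ID plus the QR bit set means a real answer came back
    return len(reply) >= 12 and reply[:2] == query[:2] and reply[2] & 0x80 != 0
```

A monitor running a check like this against each name server from several vantage points, and alerting when failures cluster in one location, would behave roughly like the alerts described above.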
The attack stopped at approximately 9:30 AM UTC. As of 9:35 AM UTC we were again able to make outbound DNS queries from our Virginia data center.
Why did it happen?
The root cause of the issue was the denial of service attack. Four of the five DoS defense devices in our system responded correctly to the attack and blocked it; the device in Amsterdam exhibited different behavior, blocking all inbound port 53 traffic.
In the Virginia data center, the DoS defense device introduced a rule that impacted our outbound DNS resolution, possibly due to a higher than normal volume of outbound DNS lookups.
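To illustrate how a legitimate burst of outbound lookups could trip such a rule, here is a minimal sliding-window rate counter in Python. This is a simplified stand-in for the kind of volumetric threshold a DoS defense device might apply; real devices use vendor-specific heuristics, and the limit and window values here are purely illustrative.

```python
from collections import deque

class RateThreshold:
    """Flags when more than `limit` events occur within `window` seconds.

    Illustrative sketch of a volumetric rate rule; not any vendor's
    actual detection logic.
    """

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.events: deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record one event at time `now`; return True if the rule trips."""
        self.events.append(now)
        # Drop events that have aged out of the window
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.limit
```

Under a rule like this, a spike of outbound DNS queries, even benign ones generated while responding to an attack, can exceed the threshold and cause traffic to be blocked, which is consistent with the behavior we observed.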
How did we respond and recover?
Per our internal processes, once we identified the issue we posted a notice on our status page. Several team members were present in our chat room during the issue as well, discussing the impact and gathering data to assist in identifying the root cause. We contacted our upstream network and data center provider to inform them of the issue and get their feedback as well.
When the attack subsided but the resolution issues were still present in the Amsterdam data center, we began withdrawing routes from that data center. Shortly after a portion of the routes were withdrawn, resolution returned to normal. We then restored the Amsterdam data center routes a few minutes later.
How might we prevent similar issues from occurring again?
When a single data center behaves differently than the others, we likely need to remove its routing announcements earlier than we did. Near the end of this attack we began withdrawing the Amsterdam name servers, and we did see resolution return after doing so. On the other hand, normal resolution returned shortly after the attack ended, so it is unclear whether removing the routes actually improved the situation or whether the timing was the only factor.
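A policy of withdrawing a misbehaving data center earlier could be driven by probe data. The Python sketch below expresses one possible rule: withdraw any data center whose health-check failure ratio exceeds a threshold. The data center names, the threshold, and the policy itself are hypothetical examples, not our actual operational criteria.

```python
def should_withdraw(
    dc_failures: dict[str, int],
    total_probes: int,
    fail_ratio: float = 0.5,
) -> list[str]:
    """Return the data centers whose failed-probe ratio exceeds `fail_ratio`.

    Hypothetical policy sketch: `dc_failures` maps a data center name to
    its count of failed health checks out of `total_probes` attempts.
    """
    return [
        dc
        for dc, fails in dc_failures.items()
        if total_probes and fails / total_probes > fail_ratio
    ]
```

Feeding a rule like this from per-location resolution probes would let an operator (or an automated process) flag a single anomalous data center for route withdrawal sooner, rather than relying on ad-hoc judgment mid-incident.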
During the attack we did not escalate the issue to the full team. Our internal policies are unclear about whether escalation was warranted in this case, so we will review those policies as a team and, if needed, clarify which situations require escalation to the full team.
The majority of our DoS defense systems worked correctly; however, we still consider the temporary resolution issues in Amsterdam a critical failure, and for that, I apologize. Adapting to the ever-changing attacks we see is a constant battle, but we will continue to improve our processes and systems to minimize the impact on all of our customers.