On May 30th, 2020, Sectigo's Root certificate CN = AddTrust External CA Root expired. What should have been a transparent, unnoticeable change turned into an internet-wide issue. Some old versions of OpenSSL and other crypto libraries were unable to validate the alternate certificate chain, so they treated the certificate chain as invalid.
A wide range of software and services were affected. Just to name a few: Stripe, Spreedly, and Roku all had incidents. A number of additional companies posted updates, including Red Hat, cPanel, and various SSL certificate resellers, like DNSimple.
DNSimple's systems were partially affected by this issue for a brief period of time, and a number of DNSimple customers have been experiencing various issues. I want to explain what happened, why, and how DNSimple reacted.
I will also share what we at DNSimple have learned from this incident, and how we are planning on handling similar changes in the future.
DNSimple customers affected by this issue can follow these instructions to update the certificate bundle and resolve the error.
Before we get into the main section of this post, I want to provide some necessary context so we're all on the same page.
For this article, it's important to be familiar with terms like Root certificate, intermediate certificate chain, and Certificate Authorities (CAs). You will also want to know how a certificate works, and how a client validates the certificate and its chain to determine if it's trusted. If any of this is unfamiliar, take a look at this recent article from Scott Helme. He does a good job setting the stage before looking at the same issue we're going to discuss here.
To fully understand the issue, we need to time travel a few years into the past. It's 2017 when Comodo CA, Comodo's certificate division, is acquired by Francisco Partners. Comodo continues to run its business as Comodo Cyber Security. One year later, in November 2018, Comodo CA is rebranded as Sectigo.
This rebranding has implications for our story and for the issue we're talking about.
The rebranding was carried out in multiple phases. Initially, it was purely aesthetic (logo, site, marketing material, etc.). Certificates continued to be issued by Sectigo and signed by CN = AddTrust External CA Root via CN = COMODO RSA Certification Authority. During this period, links and support documentation kept changing almost monthly in an effort to replace the Comodo brand with Sectigo. Very often, previous resource links ended up broken, until one day the entirety of the support documentation was gone from the old Comodo support site and scattered across the new Sectigo domain name.
We still see trails of these changes in our source code where we used to document all the links to stay on top of the changes.
Around January 2019, Sectigo started to issue new certificates under the new intermediate CN = Sectigo RSA Domain Validation Secure Server CA.
This happened essentially overnight, with no prior communication. I still recall when I was told the news - we were in our quarterly company team meeting in Lanzarote, and had to stop our morning activities to deal with a critical issue - our customers were served an incorrect bundle by our installer.
Dealing with the issue was very painful. After searching for the new bundle with no success, we reached out to Sectigo support. Even they were unable to provide us with the new Root chain from a publicly available source. We were left without a solution, until we were finally able to extract the bundle from one of our customers' orders, at which point we made it available to everyone via our certificate installer.
A change to the intermediate chain could (and should) have been handled significantly differently by Sectigo. Unfortunately, this is another contributing factor to the issue from the 30th of May.
The last piece of this story is about our certificate installer, also called the SSL certificate installation wizard in our original announcement post back in October 2014.
During our first few years of reselling SSL certificates, we learned from our customers that the biggest difficulty of obtaining an SSL certificate was installing it. More specifically, figuring out the correct intermediate chain, and how to package it along with the server certificate.
This should come as no surprise considering that, when searching for documentation, most CAs offer you documentation similar to this:
We were selling a few different SSL products from Comodo and other certification authorities. In some cases, certificates issued by the same company needed different intermediate chains. In the years between 2012 and 2014, almost 20% of DNSimple support requests were about SSL certificate chains. Close to 100% of SSL certificate problems were related to SSL certificate chains.
Immediately following the release of the certificate installer, the number of support requests on this topic dropped close to zero.
DNSimple is not a certificate authority. We are not involved in the issuance process or the trust chain. DNSimple is not required to supply the intermediate chain. This is entirely under the control of the certificate authority.
However, at DNSimple we constantly try to improve our user experience and do our best to provide that extra personal touch that our customers appreciate and look for. That's why we decided to take this customer pain point as our responsibility, and commit to maintaining an intermediate chain builder as accurately as possible - to simplify our customers' lives.
If someone asks me what the most successful feature I've ever built in DNSimple is, this is probably one of the top 5.
On May 30th, Sectigo's Root certificate CN = AddTrust External CA Root expired. This certificate was issued 20 years ago, and was the Root certificate originally used by Comodo. This was considered the legacy Root certificate. In 2010, the certification authority issued a new Root certificate, valid until 2038, to replace the legacy one. They then started to distribute the new Root to various certificate Root stores. As the new Root was distributed to software in various updates, they used a process called cross-signing to sign a new certificate with both Root certificates.
A certificate should be considered 'trusted' if at least one of the trust chains associated with the certificate is trusted. With cross-signing, the new Root certificate would still guarantee a trusted chain once the old Root certificate chain became invalid due to the expired Root. Comodo's (and then Sectigo's) plan was that all modern browsers would initially have both the old Root and the new Root. They would automatically switch to using the new Root certificate once the old one expired. Users should not have experienced any issues due to the expiration.
Clearly, this is not what happened. Certain users started to receive invalid certificate errors. Several online services had outages.
DNSimple had an outage as well. This is our initial incident event timeline, from our internal Post Incident Review:
As time passed, and we found more non-DNSimple related cases being reported, we started to realize it was not a single issue, but a combination of issues. It took a while to put all the pieces together. Let's take a look at why this happened.
We learned the hard way that a number of legacy network clients and libraries were not able to correctly detect, follow, or trust the alternate intermediate chain. As a result, devices using these clients and libraries failed to validate the certificate, returning an invalid certificate error.
As reported by a study from Carnegie Mellon University, there are two main categories of incompatibilities:
In almost all cases we observed directly, OpenSSL was the issue. OpenSSL versions prior to 1.1.1 appear to always validate the first (invalid) trust chain, assuming that certificates form a single linear chain. Unfortunately, OpenSSL is one of the most widely used crypto libraries, and it's embedded in a large number of programming languages. For instance, the SSL implementation of the Ruby programming language is built on top of OpenSSL. As a result, any library developed with the Ruby programming language and compiled against an OpenSSL version lower than 1.1.1 stopped working when the Root certificate expired on May 30, 2020.
Programming languages like Go or Java that implement their own crypto library were not affected. In the investigation we performed at DNSimple after the incident was addressed, we realized all our affected clients were software written in Erlang or Ruby, both of which rely on OpenSSL.
Go is the second language at DNSimple, but Go implements its own crypto library, which explains why none of our Go systems showed any issues connecting to our systems when the expired certificate was included. Furthermore, modern web browsers successfully switched to the new chain, making our investigation process even more challenging.
The issue was caused by the inability of certain legacy or broken software to use the alternate and trusted chain, once the primary certificate trust chain became invalid as the primary Root certificate expired.
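To make the difference concrete, here is a minimal sketch of the two validation strategies. It is a toy model and not how OpenSSL or any real library is implemented: the `Cert` class, the certificate names, and the dates are all illustrative assumptions, and real validation also checks signatures, key usage, and much more. It only shows why a client that follows a single linear chain fails once the cross-signed certificate expires, while a client that tries every candidate issuer still finds the trusted path through the new Root.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cert:
    """Toy certificate: only the fields needed to illustrate chain building."""
    subject: str
    issuer: str
    not_after: date

    def valid_on(self, day: date) -> bool:
        return day <= self.not_after


# Hypothetical certificates, loosely modeled on the situation described above.
OLD_ROOT = Cert("Old Root", "Old Root", date(2020, 5, 30))
NEW_ROOT = Cert("New Root", "New Root", date(2038, 1, 18))
# The new Root's identity, cross-signed by the old Root.
CROSS_SIGNED = Cert("New Root", "Old Root", date(2020, 5, 30))
INTERMEDIATE = Cert("Intermediate", "New Root", date(2030, 12, 31))
LEAF = Cert("example.com", "Intermediate", date(2021, 1, 1))

SERVER_CHAIN = [INTERMEDIATE, CROSS_SIGNED]  # what the server sends
TRUST_STORE = [OLD_ROOT, NEW_ROOT]           # what the client already trusts


def linear_validate(leaf, server_chain, trust_store, day):
    """Legacy-style behavior: follow only the first issuer found."""
    pool = server_chain + trust_store
    current = leaf
    for _ in range(len(pool) + 1):
        if not current.valid_on(day):
            return False
        if current.subject == current.issuer:  # reached a self-signed cert
            return current in trust_store
        issuer = next(
            (c for c in pool if c.subject == current.issuer and c != current),
            None,
        )
        if issuer is None:
            return False
        current = issuer
    return False


def any_path_validate(cert, server_chain, trust_store, day, seen=()):
    """Path-building behavior: trusted if at least one chain is trusted."""
    if cert in seen or not cert.valid_on(day):
        return False
    if cert.subject == cert.issuer:
        return cert in trust_store
    candidates = [
        c for c in server_chain + trust_store
        if c.subject == cert.issuer and c != cert
    ]
    return any(
        any_path_validate(c, server_chain, trust_store, day, seen + (cert,))
        for c in candidates
    )


if __name__ == "__main__":
    day_after_expiry = date(2020, 5, 31)
    print(linear_validate(LEAF, SERVER_CHAIN, TRUST_STORE, day_after_expiry))
    print(any_path_validate(LEAF, SERVER_CHAIN, TRUST_STORE, day_after_expiry))
```

In this model, removing the cross-signed certificate from the chain sent by the server makes the linear strategy reach the new Root directly, which mirrors the mitigation of removing the expired certificate from the chain.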
The DNSimple team reacted to the initial incident affecting our systems within 3 hours of the initial alert. That included identifying the issue, determining a mitigation strategy, and ultimately removing the expired certificate from the chain. The issue occurred on Saturday morning European time, so the direct impact on our customers was extremely limited. In fact, the impact was mostly on our internal tools.
As the issue evolved, and it started to become clear this was not an issue isolated to our systems, we performed a number of actions to assist our customers:
One of the most frequently asked questions we've received via support is why we did not inform our customers about this event. The reason is that we did not expect the expiration of the Root certificate to become an issue, nor did we expect any impact on our customers.
As pointed out by the Red Hat article:
Root certificate expiry is a normal, if infrequent, occurrence.
We did not expect this event to turn into an issue.
DNSimple is not directly responsible for the intermediate certificates. It's the responsibility of the CA. We trusted the CA's decision, and evaluated previous similar processes. As an example, Let's Encrypt has been cross-signing certificates since 2018, and we have never received a single complaint about validation issues.
A few customers asked, as a follow up question, why we did not consider sending notifications regarding this event. The main reason is that events like this happen every day with zero impact. If we sent out emails for each of these events, your inbox would be filled with hundreds of non-actionable emails a week.
As an example, every week a number of registries rotate their DNSSEC signing keys, with the potential risk of taking down an entire TLD space – including our customer domains. But these events are not actionable for our customers, and in almost all cases the rotation completes without impact. Likewise, if you turn on DNSSEC at DNSimple, we rotate your signing key every 90 days. We expect these events to complete seamlessly, and generally they do.
In order to reduce the noise, we send notifications only for actionable events, or critical events over which we have control. We did not consider the expiration of a Root certificate one of them, but rather an operational event that would complete as many others do every day.
You can fix the issue by re-installing the SSL certificate. From your DNSimple account, go to the certificate page, follow the instructions to download the certificate intermediate chain, and replace it on the server.
If the issue persists, send us an email, and we'll assist you.
While most customers praised our quick reaction and fast support turnaround, the most common critique is that we did not effectively communicate the issue through the expected notification channels - we relied solely on Twitter. After internal discussion, we agree that our public response was ineffective. We will make sure to properly communicate similar issues in the future via our Status site.
We are also evaluating developing an automated mechanism to monitor intermediate certificates and update our certificate installer with the most recent intermediates whenever possible.
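As a rough illustration of what such monitoring could look like, the sketch below flags certificates in a chain that are close to their expiration date. It is not our actual tooling: the function names, the 30-day threshold, and the sample chain are assumptions made for the example. The only library call it relies on is `ssl.cert_time_to_seconds` from the Python standard library, which parses the `notAfter` format used in certificates.

```python
import ssl
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and a notAfter string
    such as 'May 30 10:48:38 2020 GMT'. Negative if already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now.timestamp()) / SECONDS_PER_DAY


def expiring_soon(chain, now, threshold_days=30):
    """Return the names of certificates expiring within `threshold_days`.

    `chain` is a list of (name, not_after) pairs. The names are free-form
    labels chosen by the caller, not parsed from the certificates.
    """
    return [
        name
        for name, not_after in chain
        if days_until_expiry(not_after, now) <= threshold_days
    ]


if __name__ == "__main__":
    # Illustrative values: a legacy root close to expiry and a long-lived one.
    sample_chain = [
        ("legacy root", "May 30 10:48:38 2020 GMT"),
        ("new root", "Jan 18 23:59:59 2038 GMT"),
    ]
    now = datetime(2020, 5, 1, tzinfo=timezone.utc)
    print(expiring_soon(sample_chain, now))
```

A check like this only covers expiration. Detecting that a CA has switched to a new intermediate, as happened in January 2019, would need a different signal, such as comparing the issuer of newly issued certificates against the expected one.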
We stand by our decision to not include the Root certificate in the bundle served by our certificate installer. It turns out it was not the primary issue as we originally thought, yet there is no compelling reason to include a Root certificate in the bundle.
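For readers who assemble bundles by hand, the ordering rule can be sketched as follows: the server certificate comes first, each following certificate is the issuer of the previous one, and the self-signed Root is left out because clients already have it in their trust store. This is a simplified model and not the code behind our certificate installer. The `Entry` type and `build_bundle` function are hypothetical, and matching issuers by name stands in for the real checks a production tool would perform.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical record for one certificate and its PEM text."""
    subject: str
    issuer: str
    pem: str


def build_bundle(leaf: Entry, ca_certs: list[Entry]) -> str:
    """Return the leaf followed by its intermediates, without the Root."""
    by_subject = {cert.subject: cert for cert in ca_certs}
    ordered = [leaf]
    current = leaf
    while current.issuer in by_subject:
        issuer = by_subject[current.issuer]
        if issuer.subject == issuer.issuer:
            break  # self-signed Root: do not include it in the bundle
        if issuer in ordered:
            break  # guard against accidental loops in the input
        ordered.append(issuer)
        current = issuer
    return "\n".join(cert.pem.strip() for cert in ordered) + "\n"


def fake_pem(label: str) -> str:
    """Placeholder PEM block; a real bundle holds base64 certificate data."""
    return f"-----BEGIN CERTIFICATE-----\n{label}\n-----END CERTIFICATE-----"


if __name__ == "__main__":
    leaf = Entry("example.com", "Intermediate", fake_pem("LEAF"))
    cas = [
        Entry("Root", "Root", fake_pem("ROOT")),
        Entry("Intermediate", "Root", fake_pem("INTERMEDIATE")),
    ]
    print(build_bundle(leaf, cas))
```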
As we continue encouraging domain automation, we may consider stopping support of certificate authorities that fail to provide a sufficient level of automation to support our needs, and the needs of our customers. We will continue encouraging short-lived certificates, as multi-year certificates have proven to be the source of several security and maintainability issues. This may soon become a non-issue, as the 3-year expiration has been prohibited since 2017. Starting in September 2020, a maximum lifetime of 1 year will be enforced.
We will consider Root and intermediate transitions as potentially risky events. Whenever applicable, we will inform our customers of Root transitions or changes to the intermediate chain for certificates they ordered.
We will put our new processes into practice as part of the upcoming Let's Encrypt transition to ISRG root. This event was planned for May 2019, and postponed to July 2020. We will monitor the progress of the transition and notify our customers accordingly.
I want to thank all of our customers for their understanding and support. This issue has been my top priority since May 30th, and a top priority of several team members who helped our customers, and worked around the clock to investigate reports and update our system.
While this issue was caused by events beyond our control, I know many customers choose DNSimple because they trust that we can reduce the challenges of dealing with domain names or, in this case, SSL certificates. I will continue to make sure we fulfill this promise to the best of our capabilities.
I hope certificate authorities will learn from this incident. I hope they will better evaluate the risk associated with changes to their trust chain, and properly communicate with their customers and their resellers. I also sincerely hope that more certificate authorities will follow the lead of Let's Encrypt in considering automation a first-class citizen in their processes, so that we can finally stop relying on convoluted manual processes.
If you have any additional questions, you can contact support or reach out to me directly at simone at dnsimple dot com.
Italian software developer, a PADI scuba instructor and a former professional sommelier. I make awesome code and troll Anthony for fun and profit.