Honestly, it was a mess. Most of the time during product outages you’ll have one or two tools that aren’t working or are having bugs, and it’s resolved within minutes or hours.
This time it was just about everything: emails, form submissions, workflows, lists, sales tools, CRM, imports, analytics. It’s hard to find any pieces of the tool that WEREN’T affected during that time.
What’s even more troubling is that this outage took about 36 hours to be mostly resolved, but the processing of the backlog of data from those 36 hours is still going on at the time this article was released.
Thankfully, during all of this, HubSpot’s crisis communication was solid. Updates on status.hubspot.com were regular and timely (although are they ever frequent enough??) given the scale of the situation.
JD Sherman (HubSpot’s COO) released an article on March 29 with an apology and an outline of next steps for the team -- namely, doing an in-depth retrospective on the cause of the issue and how they’ll make sure it won’t happen again.
That retrospective was delivered on April 4. You can read the full article here. There’s a lot of detail in there about how their systems are structured and what exactly happened. If you’re not into all of the “geek-speak,” we’ve got you covered.
A Quick Review of How HubSpot’s Infrastructure Works
HubSpot uses a combination of software systems -- Kafka and ZooKeeper -- that allow all of the HubSpot tools to talk to each other and all of the data to be processed effectively.
Both these software systems have redundancies and safeguards built into them so that if some servers crash, other servers can pick up the slack, and end users don’t experience any issues.
So What Broke?
It’s a bit difficult to explain without getting super technical, but think about it like a series of unfortunate events.
High strain was put onto ZooKeeper, causing parts of it to crash. Typically, ZooKeeper recovers quickly, but in this case, it took several minutes. The delay in recovery then broke the communication between ZooKeeper and Kafka, causing Kafka to crash.
Even though the team was able to restore ZooKeeper, the damage was done in Kafka and it wasn’t able to recover. What made things worse was a second outage in ZooKeeper accompanied by trying to restart Kafka, which started to cause data corruption.
Why Did It Take So Long to Fix?
Corrupted data? That sounds bad. And well, it is. This is actually why some things took so long to come back online.
When the HubSpot team realized that the server recovery was starting to corrupt data, they had a decision to make: either focus on recovering data (and safeguarding against corrupted data) or focus on restoring the tools.
They decided to focus on recovering data to ensure that there would be no gaps in historical data for customers (which in the long run they believe to be the right decision, and I’d personally agree!). This is the reason that the affected tools took almost 36 hours to be restored.
So, in the name of protecting customer data, HubSpot manually recovered a whoooole bunch of our data, and then was able to restore the affected tools.
This is also why you’re still seeing (at the time this article was published) the “continuing to process data from March 28 & 29” status message from HubSpot.
Now that we know exactly what happened, HubSpot’s got a plan to make sure this never happens again. An interesting note in all of this is that HubSpot’s own teams use many of their tools across different parts of the business, so this not only affected their customers but their own business (even more motivation to make sure it never happens again!).
They’re making changes in a few different areas to protect against another outage: technical/infrastructure, reliability, testing, and communication.
Technical / Infrastructure
As is to be expected, HubSpot will be doing some restructuring of their server clusters to make sure it’s not even possible to have an outage this large again. By doing this, any outage that does happen should be restricted to a small piece of the platform, and the recovery time for issues should be significantly quicker.
HubSpot does have a team of people who test and upgrade their systems, but it hasn’t been as high of a priority as it should be. Now, they’ll have a dedicated team of people who will “oversee new standards, frequencies, and resources to ensure that we're consistently evaluating our key infrastructure systems for code fixes and critical patches without gaps.”
Along with investing much more heavily into the reliability of their platform, HubSpot is also increasing the level of frequency and depth to which they’re testing their systems. Again, it’s not that these processes didn’t exist before, but this outage uncovered some gaps in the frequency in which they test for massive failures, as well as how comprehensively they test these systems.
Lastly, HubSpot is committing to making their communication during any major incident more frequent and helpful, specifically in the minutes and hours immediately following an issue.
Their status updates will now include more detailed explanations of what is going on, as well as when the next update can be expected.
No one here is pretending that this outage wasn’t bad. Not even HubSpot. But one of the things I appreciate the most about HubSpot as an organization is their transparency and willingness to admit when they’ve messed up.
They know the impact this had on their customers, and on their own business, and they’re actively seeking to make sure it never happens again.
So, even if you’re a little rattled by this outage, know that improvements are being made, fixes are being implemented, and HubSpot will continue to make their product the best it can be. Okay -- HubSpot lovefest over!