Main illustration: Cory Uehara
During the GDPR surge, the EU was kind enough to stress test our email delivery pipeline on a scale that I doubt our engineering team as a whole would ever have agreed to. We passed.
You’ve all heard about GDPR, as it ironically swamped your inboxes in the process of protecting you from unwanted emails. For our engineering teams at Intercom, there was a much more short-term impact last week, which we think is interesting enough to share more widely.
Our customers were also sending these emails to their end users, using Intercom. That resulted in the email delivery pipeline operating at high output for several days leading up to and following the GDPR deadline. Everything was running at its max – the level that has historically been the fastest we can go without taking down Intercom.
Even at those levels, on the evening before the GDPR deadline, we started seeing four-hour backlogs in the email delivery fleets for messages with larger audiences. We knew this was a bad customer experience.
So what did we do?
A little context first: here at Intercom we have one of the largest databases in the AWS Cloud. Over the last while, to further improve our ability to scale, we have split aspects of this database out to function-specific secondary databases.
We’ve continuously upscaled the Elasticsearch infrastructure that enables much of our powerful, intelligent messaging capabilities, and we’ve improved our ability to handle the billions of end users that our customers communicate with through Intercom.
Before all these changes, upscaling to manage the GDPR email surge would have caused a definite outage across Intercom.
As services were already unacceptably degraded, our on-call team carefully and incrementally scaled the email delivery fleet up well past what has historically been sustainable for us. We did this knowing that we had worked hard to improve Intercom’s scalability and that the system should have gained capacity as a result, though we did not know how much.
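As a rough illustration of what "carefully and incrementally" can mean in practice, here is a minimal sketch of a guarded scale-up loop. It is not Intercom's tooling: the function names, the step size, and the CPU ceiling are all hypothetical, and the CPU readings are plain numbers standing in for whatever monitoring a team actually uses.

```python
from typing import Iterable, List


def next_fleet_size(current: int, target: int, step: int,
                    db_cpu_pct: float, cpu_ceiling_pct: float) -> int:
    """Return the fleet size to apply next (hypothetical helper).

    Grow by `step` towards `target`, but hold at the current size
    whenever the database CPU reading is at or above the ceiling.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if db_cpu_pct >= cpu_ceiling_pct:
        return current  # database looks stressed: do not add workers
    return min(current + step, target)


def scale_plan(start: int, target: int, step: int,
               cpu_readings: Iterable[float],
               cpu_ceiling_pct: float) -> List[int]:
    """Replay a series of CPU readings and record each fleet size."""
    sizes = [start]
    current = start
    for cpu in cpu_readings:
        current = next_fleet_size(current, target, step, cpu, cpu_ceiling_pct)
        sizes.append(current)
    return sizes


if __name__ == "__main__":
    # Made-up readings: one spike above the assumed 60% ceiling pauses growth.
    print(scale_plan(30, 100, 20, [10, 15, 70, 20, 25, 30], 60))
```

The point of the sketch is the shape of the decision: each step up is small, and each one is conditional on the most fragile dependency still looking healthy.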
We scaled up considerably. Worker fleets for larger message sends (targeting more than 100,000 users) were raised from 80 to 100 instances, and those for medium-sized message sends (targeting 10,000 – 100,000 users) from 30 to 100 instances.
Worker fleet size. Bigger is scarier.
Historically, this would have resulted in our largest database becoming non-responsive, and with it, the Intercom app.
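To put the fleet changes in proportion, the short calculation below works out the relative increase for each fleet from the instance counts quoted above. The helper name is an assumption made for illustration; only the before and after counts come from this post.

```python
def relative_increase_pct(before: int, after: int) -> float:
    """Percentage growth from `before` to `after` instances."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


# Instance counts as quoted in the post: (before, after).
fleets = {
    "large sends (100,000+ users)": (80, 100),
    "medium sends (10,000 to 100,000 users)": (30, 100),
}

for name, (before, after) in fleets.items():
    pct = relative_increase_pct(before, after)
    print(f"{name}: {before} -> {after} instances (+{pct:.0f}%)")
```

The large-send fleet grew by a quarter, while the medium-send fleet more than tripled.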
The rate at which we created conversations between our customers and their end users hit just short of double its peak rate for this year and stayed there for an unnervingly long period of time. The majority of these conversations resulted in an email to an end user. The remainder were instead delivered, at our customers’ choice, via our web Messenger or one of our mobile apps.
Conversation creation rates for this year. Bigger is much scarier.
Our biggest database did not care at all. It briefly hit a max of 32.5% CPU load. Previously, this quantity of emails would have caused an outage.
The CPU usage of our largest database during the GDPR surge.
The CPU usage of our secondary database during the GDPR surge.
A separate MySQL database, to which a significant portion of the load from our largest database has been moved, also didn’t care. It spiked to 28% for a total of 2 minutes during this whole thing, and sat at around 13% the rest of the time.
Our customers’ message queue times decreased quickly as the upscaling caught up with the backlog.
Email wait time in delivery queues. Lower is better. :)
We sent tens of millions of emails during a 24-hour period and, despite the large increase in emails sent, we didn’t see an increase in bounce rate. In the graph below, you can see that the accepted email rate tracks the number of emails sent very nicely.
Emails sent over the last 30 days.
Why is this exciting?
Delaying our customers’ messages by hours was not good. Even as our backlogs grew to four hours ahead of the GDPR deadline, we did not feel that we could safely scale up further in anticipation of the rush, as we were already at our safe limits.
Before the improvements made to the scalability of Intercom’s message deliverability pipeline, a message send of this size would have caused a significant app-wide outage.
There are probably further things that the backend teams have improved during our ongoing push to make things scale and be more reliable, without which Intercom would have been down for significant periods of time during the multi-day GDPR surge.
These constant efforts to scale our capacity and our resiliency are key to Intercom’s efforts to keep improving the experience for existing customers and to support larger customers. The GDPR surge is great validation of that work and strong evidence that we can significantly increase our safe limits for message delivery.
We’re pretty excited about how things went, but there’s always room for improvement.
We strongly suspected that we had capacity improvements in our email delivery pipeline, but with limited stress testing, we were loath to go beyond the point that has historically been an outage risk.
The ability to stress Intercom to the point of breaking, without customer impact, is the ideal. The reality is we can’t break things without at least some customer impact. Maximizing the stress testing we can do, while minimizing or eliminating any customer impact, is an important ongoing process for Intercom.
Securely and at scale
My engineering team at Intercom has its own mission statement: “Connecting web businesses with people, at the right time in the right place. We do this securely, and at scale.” We’re constantly working on what we can do at Intercom, but also on the scale at which we can do it. Both matter significantly to our customers and to us. Without the efforts of the backend engineering teams, this could have been a very different story.
So thanks to the EU for the stress test. We’ve already begun to move on and get ready for the next peak, whatever causes it (ePrivacy, among others). At the rate that we and our customers are growing, it’s bound to be even bigger.