Main illustration: Sonny Ross
When you launch a new feature, what has worked for your beta users won’t always work as well when your code and your databases are dealing with millions of end users every day.
This is the exact problem we ran into when we launched one of our features last year, Smart Campaigns. (In a nutshell, a campaign is a group of messages sent through Intercom to a group of users to achieve a common goal e.g. to convert free users into paid ones).
When we launched the feature initially, the backend pipeline held up pretty well. But as the feature became more and more popular with customers, we became the victims of our own success.
The amount of resources consumed was growing massively. For example, the number of worker instances required to process the Campaigns, already quite high, became extraordinarily so. In the graph below, you’ll see that the angry looking red arrows are the number of instances we’d needed, growing at a significantly higher rate than the feature usage. At this rate, our Campaigns feature was quickly become prohibitively expensive to run.
It was also putting a lot of pressure on our databases. This wasn’t just a load issue. It could have spelled disaster for the whole Intercom app. ? A new, large campaign coming online could have caused a sudden load, taking down our entire database and with it the Intercom app.
Naturally, this was a scary situation for our engineers. We were living in fear of taking Intercom down, we were getting paged in the middle of the night about dangerous levels of database load, and we were working late firefighting. We needed to make Smart Campaigns scale more effectively.
We tackled it by asking ourselves a series of questions.
- How can we build an understanding in the team of both how the system works and its shortcomings?
- What are the easiest, quickest things we can do to mitigate impact
- How do we optimise the system, to make the necessary technical improvements to make it faster and more efficient.
- How, by looking at usage of the feature, can we tweak the product to make it easier to scale?
One of the first things we did was to get everybody in the same room with a whiteboard, markers and a lot of Post-its. Together, we drew up a big flow chart of how all the various queues, worker fleets and data sources interacted.
Up until this point, the whole picture was really only in a single engineer’s head. Afterwards, all of the engineers had the same up-to-date mental model of how the different pieces fitted together. Now we had five times the number of engineers working effectively, all with a shared understanding of the system.
Now that we had a shared understanding of the system, we could really start to understand where its problems were. We added a ton of metrics to the codebase so we could figure out exactly where the performance bottlenecks really were, and tackled the most impactful first one by one.
Mitigate the impact of your shortcomings
As an example of what we found when we dug deeper, we realised that a particular sub-feature caused campaigns to process slowly. And unfortunately there was no quick engineering work we could do to fix it. What we could do was to limit blast radius of the slow code and hence the number of customers that were affected.
The diagram below shows a vastly simplified version of the pipeline. On the left, we have a group of workers that break the work up and pass it into our queuing system, seen in the middle below. We then have another group of workers on the right that process the Campaigns and send the messages out.
What’s important here is that the slow code was being run on the group of workers on the right, which was processing a mix of slow and fast jobs. The fast jobs were getting stuck behind the slow jobs. How could we make the fast jobs process quickly without being delayed by being stuck behind the slow ones?
We solved that by creating a new queue and new worker fleet for the slow jobs, segregating them from the fast jobs.
The slow jobs were no better off but they were no worse off either. However, the fast jobs were now getting processed and sending their messages quickly, and this represented the campaigns of about 99% of our customers. We might not have gotten to the root of the problem just yet, but we at least bought ourselves some time.
Optimise the system
To really fix the problem, we needed to make the slow jobs faster. Our metrics told us that pulling the data from our data stores (required to process the campaigns) was where we were spending the most time. This diagram below summarises a lot of the performance improving changes we made to address this. In short, we looked to move all of our data accesses as high up on this pyramid as we could.
We had good metrics on how long each query was taking, not just averages but medians and worst cases too. So we looked at those metrics. We ensured we were always hitting good effective indices in our queries. We refactored code so that we could query in batches of 500, 1000 or 5000, whichever give us the best performance for the particular query. We increased our usage of caching massively so that we were avoiding hitting the database servers as much possible.
As I mentioned earlier, Campaigns was a very heavy user of our databases, in particular our MongoDB cluster. Campaigns alone was responsible for about half the load on it. ?
We had discovered the average user object for our large campaigns changed rarely, often not for two or three weeks. We were pulling it from MongoDB every two hours, so we knew we could get a huge cache hit rate.
Armed with this knowledge, we introduced a Redis cache for our Mongo database, a huge performance and availability win.
We quickly saw a massive decrease in the load on the Mongo database. In the left of the below graph, you’ll see the amount of queries before adding the Redis cache. The right hand side shows after adding the Redis cache, decreasing the load to almost nothing.
If our Mongo database goes down, the entire Intercom app goes down. Introducing this meant we could prevent the worst case scenario from happening.
Another big performance improvement we made by re-evaluating how Campaigns worked when we are processing millions of users as opposed to processing a single user. For simplicity, Smart Campaigns had been coded to work much the same way during both. It loaded a user from Mongo and then figured out what messages to send. This makes sense where we are only dealing with a single user, but using Elasticsearch when processing millions of users was far more efficient.
Choosing the right tool, the right data store, for the job makes all the difference, especially when scaling up a feature.
Look at how your users are using the feature
Even after all improvements a small number of campaigns still weren’t processing quickly. What we uncovered was a problem that couldn’t be improved through engineering work. It required us to tweak the product a little in order to make it much more scalable. By working closely with product research and product management, we found a path that would lead us to faster, better product.
Our problem was that it was possible for a campaign to accumulate a large number of users. Sometimes these users would become “inactive” – they would never exit the campaign because they would never receive any messages. Processing campaigns with a large number of inactive users took a very long time and put a huge load on our databases. It was also something that had never been intended when Smart Campaigns was originally designed and that it happened was usually undesired by our customers.
Limiting how long users could be inactive in a campaign effectively limited the worst case scenarios in terms of processing and made the feature easier for our customers to use and understand.
This graph above shows the results of our work:- on the left here you can see that the number of instances required to run campaigns was growing significantly faster than the feature usage.
Since we made our changes, the cost of campaigns processing has barely nudged up even as usage of campaigns has continued to grow higher every month.
Basically, the trajectory of the load and the costs of running campaigns have gone from scary to really kind of awesome.
Plus, we no longer have worries about the usage of campaigns impacting the overall availability of the Intercom app. And I’m no longer getting woken up in the middle of the night due to pages about campaigns. I’ve been sleeping soundly for months! ?
Should you find yourself in a position where your feature takes off more quickly than expected here’s my advice.
- Build understanding: Get the team on the same page with as many metrics as you can.
- Mitigate the impact of poorly performing code.
- Optimise: Assuming data access is where your code is spending the most time, work to make it cheap. And use the right data stores for your workload.
- Finally, look at how your feature is being used, see what small product tweaks you can make to make it scale better.