Why we sacrificed product capability for availability

Product Engineer at Intercom

March 21, 2017

Main illustration: Fabiola Lara

When your company is experiencing growth, you need to concentrate on mitigating downtime – even when that means sacrificing capability.

At Intercom, our growth means that sometimes we have huge spikes in users online, which can overload our databases and make parts of the app slow or even unavailable.

If your company is growing fast, you need to ensure your infrastructure can scale. Goal number one is to minimize the risk of your users being unable to complete the job they’re hiring your product for. For us, that job is our customers messaging and responding to their users.

First and foremost, you want to have mechanisms in place to avoid an app-wide outage. Think about what the core features of your product are. A useful way to do this is to understand your customers’ purchasing decisions, i.e. what they are hiring your product for. At Intercom, one of our core features is the user list. Each app can contain millions of users, so the health of our user-related services directly impacts the rest of Intercom.

Once you’ve identified and prioritized your core features, you then have some decisions to make. What feature might you be willing to sacrifice in favor of another, if it meant keeping your product available?

We’ve made these hard decisions in our own business. For example, we think it’s a worthwhile trade-off to temporarily disable the user list to prevent the whole app from becoming unavailable, so we implemented a killswitch for the user list. Now, if a situation arises, we can avoid having any other part of the app suffer by simply disabling the user list for a brief period of time until it recovers. A broader opportunity than the user list remained, though.
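To make the idea concrete, here is a minimal sketch of a feature killswitch in Python. It is an illustration only, not Intercom's actual code: the names `KillSwitches`, `user_list_response` and `load_users_from_database` are hypothetical, and the in-memory set stands in for whatever shared store a real deployment would use so that every server sees the same flag.

```python
from dataclasses import dataclass, field


@dataclass
class KillSwitches:
    """In-memory set of disabled features (a real system would use shared storage)."""
    disabled: set = field(default_factory=set)

    def disable(self, feature: str) -> None:
        self.disabled.add(feature)

    def enable(self, feature: str) -> None:
        self.disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self.disabled


def load_users_from_database() -> list:
    # Hypothetical stand-in for the expensive query the killswitch protects.
    return ["alice", "bob"]


def user_list_response(switches: KillSwitches) -> dict:
    """Serve the user list, or a cheap placeholder while it is switched off."""
    if not switches.is_enabled("user_list"):
        # Skip the database entirely so the rest of the app stays healthy.
        return {"status": 503, "body": "The user list is temporarily unavailable."}
    return {"status": 200, "body": load_users_from_database()}
```

The important design choice is that the disabled path does no expensive work at all: it returns a fixed response without touching the overloaded service.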

Introducing the Big Red Button

Our goal for Intercom availability is that each of our API endpoints has greater than 99.9% availability, equivalent to no more than about 45 minutes of downtime a month. In Q3 last year we dipped below 99.5% uptime over the month of October, far below our “green” level. To help our teammates in infrastructure and product, the Growth team I work on offered to tackle a few tightly-scoped, high-impact projects. We decided to expand the killswitch from the user list into a global killswitch called the Big Red Button.
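To put those availability figures in perspective, the short Python sketch below converts a percentage into a monthly downtime budget. The function name `downtime_budget_minutes` is just for illustration. At 99.9% the budget works out to roughly 43 to 45 minutes depending on the length of the month, while 99.5% corresponds to more than three and a half hours.

```python
def downtime_budget_minutes(availability_percent: float, days: int) -> float:
    """Minutes of downtime allowed in a period of `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


for days in (30, 31):
    for target in (99.9, 99.5):
        budget = downtime_budget_minutes(target, days)
        print(f"{target}% over {days} days: {budget:.1f} minutes")
```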

How it works

We have customers that use our API to upload large amounts of data at once. We can’t process the data synchronously because the request will time out. So we break up the work into smaller bits that can be processed by workers in parallel using a queuing system, which massively speeds up the process.
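The pattern of splitting a large upload into batches and handing them to parallel workers can be sketched in Python as follows. This is a simplified, single-process illustration using threads and the standard library `queue` module. A production system would typically return from the API request as soon as the batches are enqueued and run the workers in separate processes. The names `chunked` and `process_upload`, and the batch size of 100, are assumptions for the example.

```python
import queue
import threading


def chunked(records: list, size: int) -> list:
    """Split one large upload into batches of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_upload(records: list, batch_size: int = 100, num_workers: int = 4) -> int:
    """Enqueue batches and let worker threads drain the queue; returns records processed."""
    jobs = queue.Queue()
    for batch in chunked(records, batch_size):
        jobs.put(batch)

    processed = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal processed
        while True:
            try:
                batch = jobs.get_nowait()
            except queue.Empty:
                return  # Nothing left to do.
            # Stand-in for the real per-batch work (e.g. writing to a database).
            with lock:
                processed += len(batch)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```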

The Big Red Button works by pausing all workers. Our utilization drops almost immediately, which relieves stress on the database. The app remains available, and the only impact on our customers is a slight delay in updates to their user data.
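One way to implement a global pause like this is a shared flag that every worker checks before taking a job. The Python sketch below uses a `threading.Event` for that flag. It is an assumption about how such a switch could work, not a description of Intercom's implementation, and `PausableWorkerPool` and its methods are hypothetical names. While paused, jobs keep accumulating in the queue and nothing is lost.

```python
import queue
import threading


class PausableWorkerPool:
    """Worker threads that stop pulling jobs while a shared switch is off."""

    def __init__(self, handler, num_workers: int = 4) -> None:
        self._handler = handler
        self._jobs = queue.Queue()
        self._enabled = threading.Event()
        self._enabled.set()  # Workers start out running.
        self._stopped = threading.Event()
        for _ in range(num_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, job) -> None:
        self._jobs.put(job)  # Accepted even while paused; it just waits.

    def pause(self) -> None:
        self._enabled.clear()

    def resume(self) -> None:
        self._enabled.set()

    def backlog(self) -> int:
        return self._jobs.qsize()

    def join(self) -> None:
        self._jobs.join()  # Blocks until every submitted job is done.

    def shutdown(self) -> None:
        self._stopped.set()

    def _run(self) -> None:
        while not self._stopped.is_set():
            if not self._enabled.wait(timeout=0.05):
                continue  # Paused: do not touch the queue.
            try:
                job = self._jobs.get(timeout=0.05)
            except queue.Empty:
                continue
            if not self._enabled.is_set():
                # The button was pressed while we were waiting: hand the job back.
                self._jobs.put(job)
                self._jobs.task_done()
                continue
            self._handler(job)
            self._jobs.task_done()
```

Calling `pool.pause()` stops new work from starting, and `pool.resume()` lets the workers drain whatever built up in the meantime. Note that `join()` will block for as long as the pool stays paused with jobs outstanding.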

The graphs below show the button’s effect on three of our key utilization metrics. Naturally, more jobs are being added to the queue for the duration of the pause, so we can expect to see a spike in activity when we re-enable the switch.

We set it up to send a notification to our engineering channels whenever it’s turned on, and have a bot that lets us know where to find it.
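A notification hook like this can be as simple as a callback invoked on every state change. In the Python sketch below, `BigRedButton`, `press`, `release` and the message wording are all hypothetical, and the `notify` callable stands in for whatever posts to your team's chat. Announcing the switch being turned off as well is an assumption added here for symmetry.

```python
from typing import Callable


class BigRedButton:
    """A switch that announces every state change through a `notify` callback."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify
        self.workers_paused = False

    def press(self, pressed_by: str) -> None:
        if self.workers_paused:
            return  # Already on; avoid duplicate alerts.
        self.workers_paused = True
        self._notify(f"Big Red Button turned ON by {pressed_by}: workers are paused.")

    def release(self, released_by: str) -> None:
        if not self.workers_paused:
            return  # Already off; nothing to announce.
        self.workers_paused = False
        self._notify(f"Big Red Button turned OFF by {released_by}: workers are resuming.")
```

For a quick local trial, `BigRedButton(notify=print)` writes the messages to the console instead of a chat channel.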

The button is a huge toggle switch that is red when the workers are off, reminiscent of the missile-launching button in Hollywood movies that begs to be pushed.

The Big Red Button and killswitch are just two specific examples, but this approach can be applied to almost any product. Sometimes it’s worth sacrificing one feature temporarily and in a measured way to protect the overall stability of your product.