Striking the right balance between speed and safety

Software companies are always faced with the question of how to reach a happy medium between speed and safety – so how does Intercom manage it?

In order to build world class software, technology companies need to ship a lot of code. At Intercom, in the past six months we’ve shipped code up to 81 times a day, rapidly iterating and getting new features out to our customers.

How do we ship code this fast while making sure the necessary safety measures are in place? And how is it different from how other companies approach the process of shipping code?

During my career as a software engineer, I’ve worked in a few companies that approached this in very different ways. The processes are shaped by the business needs of the company. What kind of product do they have? What kind of customers? How do they position themselves in the market?

This post is based on a talk I gave at our recent Building Intercom event. You can watch that talk in full below.

Focus on speed

When I first started my career I worked in a company, let’s call it “company A”. Company A didn’t have a product of its own; it built software for clients. In a sense, the software it delivered was its product.

Because we were delivering software for clients, deadlines were agreed upon in advance, set in stone, and very hard to change.

That meant that during the development phase we concentrated on writing the code. We wanted to get the product out, so we didn’t write automated tests. Maintenance usually came after the first version of the product was delivered, and clients were often charged extra for it.

Writing code took up about 80% of the allotted time for projects, which typically lasted two to three months. At the end of the coding phase, we would do manual testing. That was usually handled by a separate QA team that would go through the product to make sure it met the specification. Once they were satisfied with the final product, they gave it back to us for a quick tech review – usually by someone on the team who was not directly involved in the project. If they didn’t find any obvious technical issues, it was ready to ship. Usually, our project managers were very happy.

This process of development was super fast – I had dedicated time for just writing code, with very few interruptions in between. However, the process did not emphasize safety, and things did go wrong sometimes.

When they did, things didn’t look too different from the development process. We would try to find the root cause of the issue, apply a fix, repeat the testing process, conduct a tech review, and then pray that it worked correctly.

Once, I spent an entire Christmas in the company trying to fix a bug that I shipped two days before. While my family was at home, enjoying dinner and drinking wine, I was sitting there alone in a cold office, trying to make the thing work again.

I wouldn’t recommend it.

Focus on reliability

After company A, I moved to work for company B, which had a very different approach.

Unlike company A, company B had its own product: it developed services that other companies used as infrastructure. When you’re providing infrastructure, uptime is critical. It’s important that those services always work and that there’s very little downtime, because every outage erodes the trust of customers.

Company B had a very lengthy process in place to minimize the risk of that happening.

The process began by writing some code and submitting it for review, followed by running it through some tests. However, after that, the code would just sit there for about two weeks.

Why? Because of the nature of the business, we had to provide detailed documentation – what were the business and operational impacts of the changes we were about to deploy? The emphasis was on minimizing the risk of something going wrong, and on knowing what to do if something did go wrong.

This document then had to be reviewed by one or two engineering peers on the team and our manager. Before shipping, we had to execute another set of manual tests to make sure nothing was obviously broken, and then finally we were ready to ship.

What did we do when things went wrong? Well, first of all we would try to immediately roll back the change. But sometimes that wasn’t easy – for example, if you had just deployed a UI change that customers had already seen, rolling back would lead to a very poor customer experience. So instead we would roll forward, which means working on a bug fix as fast as you can to get it out.

Sometimes it wasn’t obvious what had broken things, because numerous commits had just been deployed. First you needed to identify the root cause of the issue. Then you got down to bug fixing and essentially repeated the process I just described: write the code, review it, test it, and deploy.

The process at company B was designed to be extra safe, and while things inevitably did go wrong sometimes, I think that in the three years I worked there, I never shipped a bug that caused a major outage. However, I also didn’t ship as many features as I would have liked – the process was slow.

Intercom’s approach

Then I joined Intercom. Company C if you like.

To show just how much Intercom emphasizes working fast, in my first week I was asked to ship a feature. To be honest, I was petrified when they told me, “Oh, you’ll need to ship something in your first week.” Having just come from company B, I thought, “There’s no way I can do that, right? Nobody can do that.” I had a lengthy discussion with myself: “Well, maybe this Intercom thing wasn’t such a great idea.” But I did it.

What process do we have in place that enabled me to ship a feature in my first week while maintaining high safety standards?

We take big projects, and we break them into small parts.

We always try to ship the smallest meaningful parts. What is the smallest thing that we can provide for our customers that will have value for them?

Our process starts by setting a feature flag, a simple mechanism that enables us to control who has access to a feature. By setting feature flags, we expose functionality only to a certain set of users, which means we have control over who sees that feature. We try never to get into a situation where we expose a feature too early or to too many people. Then we collect feedback and iterate.
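
To make that concrete, here’s a minimal sketch of what a feature flag check might look like. The names (the flag, the app IDs, the render functions) are hypothetical – this is the general technique, not Intercom’s actual implementation:

```python
# A minimal feature flag sketch: the flag store maps each feature
# to the set of customer apps that are allowed to see it.
# All names here are hypothetical, for illustration only.

ENABLED_APPS = {
    "new_inbox_search": {"app_123", "app_456"},  # e.g. internal + beta customers
}

def feature_enabled(feature: str, app_id: str) -> bool:
    """True if the feature is switched on for this customer."""
    return app_id in ENABLED_APPS.get(feature, set())

def render_inbox(app_id: str) -> str:
    # Gate the new code path behind the flag; everyone else keeps
    # the existing behaviour until we widen the rollout.
    if feature_enabled("new_inbox_search", app_id):
        return "new search UI"
    return "legacy search UI"
```

In a real system the flag store would typically live in a database or config service rather than in code, so the rollout can be widened or rolled back without a redeploy.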

After we set the feature flag, we write some code and we submit the code for review.

While waiting for a review, automated tests kick in. Once the code has been approved and the tests have successfully passed, we are ready to merge into the master branch.

Once the code has been merged, tests kick in again. While the tests are running, our internal deployment tool receives a notification that it’s ready to deploy. Once the deployment is done, the code has been shipped. This entire process can take anything from one day to one week, depending on the feature.
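
Put together, the flow looks something like the toy sketch below. The stage names are hypothetical, and the real pipeline is CI tooling rather than a single script – this only illustrates the ordering:

```python
# A toy simulation of the shipping flow described above.
# Stage names are invented for illustration; only the ordering matters.

def open_review(change: str) -> None:
    print(f"review requested for {change}")

def tests_pass(target: str) -> bool:
    print(f"automated tests green on {target}")
    return True

def merge_to_master(change: str) -> None:
    print(f"{change} merged into master")

def deploy() -> None:
    print("deploy tool notified – shipping to production")

def ship(change: str) -> None:
    open_review(change)              # submit the code for review
    assert tests_pass(change)        # tests kick in while review is pending
    merge_to_master(change)          # merge once approved and green
    assert tests_pass("master")      # tests run again after the merge
    deploy()                         # internal deploy tool picks it up

ship("feature-branch")
```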

Of course, we do face some challenges with this approach. Our deployment is only as fast as the tests allow, and as our code base continues to grow, the tests continue to grow as well. We approach that by parallelizing the tests and splitting them into smaller chunks, so that they can run faster. And as someone who works primarily on the user interface, I’ve seen CSS changes break things in ways that even the most sophisticated UI test won’t catch – so sometimes manual testing is still necessary.
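
As an illustration of the test-splitting idea, here’s one common way to shard a suite across parallel CI workers. This is an assumed approach sketched in Python (the file layout and the hypothetical `shard.py` script are mine, not Intercom’s actual tooling):

```python
# Shard test files across N parallel workers so the wall-clock time
# of the suite shrinks even as the number of tests grows.
# Usage on worker i of n:  python shard.py <n> <i>

import subprocess
import sys
from pathlib import Path

def shard(files, num_shards, index):
    """Deterministically assign every num_shards-th file to this shard."""
    return [f for i, f in enumerate(sorted(files)) if i % num_shards == index]

if __name__ == "__main__":
    num_shards, index = int(sys.argv[1]), int(sys.argv[2])
    tests = list(Path("tests").rglob("test_*.py"))
    my_tests = shard(tests, num_shards, index)
    # Each worker runs only its own slice of the suite.
    sys.exit(subprocess.call(["pytest", *map(str, my_tests)]))
```

Real-world splitters usually balance shards by recorded test timings rather than by file count, but the principle is the same.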

Question of priorities

I have a few takeaways from these experiences. The first is that engineers like to feel productive and that means shipping code. They like to feel a part of the process rather than afraid of it.

The second takeaway is that having good tests in place gives us confidence that our code is safe to ship. Writing tests can be tedious sometimes, but it’s much easier to spend two hours writing tests than to spend Christmas fixing things the tests should have caught.

But the main takeaway is that companies shouldn’t sacrifice safety for speed, or indeed speed for safety. Different software companies have different customers with varying needs, leading to different incentives and, as a result, different processes.

Each of these approaches has its own merits and might be the best fit for its particular situation – the optimal balance will change from company to company.

Here at Intercom, however, we have achieved a balance that not only serves our customers by providing a secure product with regular new features, but also keeps our engineers feeling productive. And like any balancing act, it requires constant effort and focus to maintain the equilibrium.