Product engineers and the pursuit of speed and safety

As engineers, one of the fundamental things we have to learn is the extent to which we can break things – how do we get the right balance between shipping safely and shipping fast?

Before I joined Intercom, I thought of shipping as somebody else’s concern – I built things, but other people shipped them. It was a slow process, requiring a full team merging branches and deploying to production. Above all, it seemed risky – the potential to break everything loomed over the entire process. The process was so arduous that it would even end with a celebratory cake.

Almost 3 years in as a product engineer here, I have changed my perspective. It’s almost impossible to work here and not learn how to ship safely. The core infrastructure that we run our code on was built with shipping in mind, supporting systems that need to change all the time. As a result, some level of safety is built-in, protecting us by design.

Most new hires get super excited when they figure out that merging to the master branch actually kicks off a new deployment. It’s quite smooth. Nobody has to supervise, and we ship constantly (more than 200 times a day), with zero negative impact on production. If something goes terribly wrong, the deployment is reverted automatically. No cake.

Small safe steps

Shipping with such a setup changes how you think about building software. You get in the mode of shipping ambitious changes as a series of small, safe steps. Some changes are very straightforward (simple code change), some are moderately complex (adding a nullable column), and others are a little tricky and need to be deliberately split up and handled as a multi-step process.
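
To make the "adding a nullable column" case concrete, here is a minimal Python sketch using the standard library's sqlite3 module. The `users` table, the `locale` column, the default value `'en'`, and both helper functions are hypothetical, chosen only for illustration, and this is not a description of Intercom's actual tooling. It shows why the change is a safe step: the column is added as nullable so existing code keeps working, and existing rows are then backfilled in small batches.

```python
import sqlite3


def add_nullable_column(conn: sqlite3.Connection) -> None:
    # Step 1: add the column as nullable, so inserts that do not
    # mention it keep working while old code is still deployed.
    conn.execute("ALTER TABLE users ADD COLUMN locale TEXT")


def backfill_locale(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    # Step 2: fill existing rows in small batches instead of one big update.
    # Returns the total number of rows updated.
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET locale = 'en' WHERE id IN "
            "(SELECT id FROM users WHERE locale IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Hypothetical starting point: a table that predates the new column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("a",), ("b",), ("c",)])

add_nullable_column(conn)
# An insert written before the change still succeeds after it.
conn.execute("INSERT INTO users (name) VALUES ('d')")
updated = backfill_locale(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE locale IS NULL"
).fetchone()[0]
```

A later, separate step (not shown) would tighten the constraint once every writer sets the column. Each step can be shipped and reverted on its own.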

Between all the tools available and your peers’ experience, you can safely engage in a delicate, precise series of code and infrastructure changes – it can feel somewhat like the precision of surgery, except you know that the patient is not at risk.

The first time I fully realized the effect of this was when I had to introduce a simple new piece of functionality into the system. Could I be sure I would be able to turn it off if something went wrong? No problem, the mechanism was in place: I just added a feature flag as a killswitch, wrote tests for both code branches, and I was ready to roll safely once again.
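
A killswitch of this kind can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `FeatureFlags` class, the `new_greeting` flag name, and the `greeting` function are hypothetical, and a real flag store would live in a shared service rather than in memory. The point is the shape: the new behavior sits behind a flag, the old behavior stays as the fallback, and each branch gets its own test.

```python
from typing import Dict


class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def enable(self, name: str) -> None:
        self._flags[name] = True

    def disable(self, name: str) -> None:
        self._flags[name] = False

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code stays dark until enabled.
        return self._flags.get(name, False)


def greeting(flags: FeatureFlags, user: str) -> str:
    if flags.is_enabled("new_greeting"):
        return f"Welcome back, {user}!"  # new behavior behind the flag
    return f"Hello, {user}"  # existing behavior, kept as the fallback


def test_flag_on() -> None:
    flags = FeatureFlags()
    flags.enable("new_greeting")
    assert greeting(flags, "Ada") == "Welcome back, Ada!"


def test_flag_off() -> None:
    flags = FeatureFlags()
    assert greeting(flags, "Ada") == "Hello, Ada"
```

Turning the feature off is then a call to `disable("new_greeting")` rather than a new deployment.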

Some time later, I had to change the underlying datastore of Intercom’s Bulk API. I changed it bit by bit, chopping bigger changes into smaller ones to lower the risk and making a series of surgical, safe maneuvers. This was a much bigger endeavor, but despite the scale of the changes I was implementing, the app did not go down.
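
This post does not spell out how that migration was split up, so the following is only a sketch of one common way to break a datastore change into small steps: write to both stores, backfill older records, then switch reads. The `MigratingStore` class, its method names, and the use of plain dictionaries as stand-ins for the two stores are all hypothetical.

```python
from typing import Dict, Optional


class MigratingStore:
    """Hypothetical wrapper that writes to both stores and reads from one."""

    def __init__(self, old: Dict[str, str], new: Dict[str, str]) -> None:
        self.old = old
        self.new = new
        self.read_from_new = False  # flipped in a later, separate step

    def put(self, key: str, value: str) -> None:
        # Dual write: both stores receive every new write.
        self.old[key] = value
        self.new[key] = value

    def get(self, key: str) -> Optional[str]:
        # Reads come from exactly one store, chosen by the switch above.
        source = self.new if self.read_from_new else self.old
        return source.get(key)

    def backfill(self) -> int:
        # Copy records that predate dual writes; never overwrite newer data.
        copied = 0
        for key, value in self.old.items():
            if key not in self.new:
                self.new[key] = value
                copied += 1
        return copied
```

Each step (start dual writes, backfill, flip `read_from_new`) is a separate change, and flipping the switch back returns reads to the old store.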

Shipping value faster

All these techniques help us ship all the time, in small increments, with no fear of breaking production. The process gets so ingrained in the culture that it becomes second nature.

However, it can be easy to end up conflating two things – shipping the code safely and shipping business value. I find great comfort in doing these surgical maneuvers, but I’ve recently started asking myself if there is a hidden cost to this approach.

Above all, can I trade some safety for speed? Is safety even important to what I am doing right now? These abstract questions helped me to uncover tradeoffs I had been making subconsciously:

We chop up a pull request into 10 smaller ones, which costs both me and my colleagues time in reviewing it when it could be cheaper to roll the bigger change out and revert if something goes wrong.
We add a feature flag to turn off a feature, yet sometimes these features are not being used by anybody, leading to unnecessary code and tests.
Sometimes we have to perform a series of small, safe schema changes, when we could schedule a 5-minute partial outage without overly inconveniencing our customers.

As with everything in engineering, things are not black and white. How we measure the cost and benefit here is often very loosely defined, but the truth is that businesses have to overcome the competition. The market doesn’t put entirely equal or predictable weight on both speed of development and the reliability of the product. The balance for companies is in constant flux, and by extension, it is in flux for us as engineers too.

Learning the extent to which we can break things is hard, and that’s because it isn’t just an engineering challenge. It needs some business context. It is definitely harder than just sitting in the code editor, but ultimately getting that appreciation for the business context makes you a better engineer.

If you’re interested in joining the engineering team to help build Intercom, check out our current openings here.