Having a central Ops team is unfashionable at the moment. It goes against the prevailing, fully integrated DevOps mentality. But here at Intercom, that’s exactly the approach we’ve taken and it’s working really well for us.
We enable our product engineering teams to ship code to production dozens of times a day, and our availability is good. Our Ops team works mostly from a strategic quarterly roadmap and spends around 82% of its time on planned work. Our pager interrupts are generally low, and the sources of pages are well understood. And did I mention our team is happy and has fun? :)
At Intercom, we have a single, small Ops team that is primarily focused on maintaining the reliability, performance, scalability and efficiency of our platform. This one small team is on the hook to be the first responder for all outages. They drive performance improvement initiatives across our product engineering team. They are responsible for infrastructure security. They also maintain and evolve our metrics, monitoring and logging infrastructure. The really fun thing they do is evolve and enhance our continuous deployment infrastructure so that we can safely deploy new code to production, within a small number of minutes, about 100 times a day.
I joined Intercom a little over a year ago. Looking back over that time, I believe a significant contributor to the team’s success, apart from hiring a really strong team, was the fact that we spent time creating a strong set of team core values and committing to them. These core values evolved from a team discussion at our first quarterly off-site strategy and planning meeting. They were hotly debated, but we eventually arrived at something we could all agree on and commit to being faithful to. This was important to us: we wanted values that bonded us together and influenced behaviors, decisions and actions, rather than merely aspirational words to have on posters.
Ops Core Values @ Intercom
1. Security, Availability, Performance, Scalability, Cost, Efficiency – prioritize for maximum impact.
Security and Availability are always our number one priorities. When no fires are burning, we look for projects that can improve us along multiple dimensions.
The intent here is to make clear the scope of projects that we should, and should not, be working on. It’s also a reminder that we need to be really good at prioritizing our workload. We have a really broad remit and must evaluate potential projects not just on whether they fall within our areas of responsibility, but also on whether they are the biggest needle movers.
2. Faster, Safer, Easier, Shipping.
Ciaran, our CTO, put it best when he said: “To me, Intercom is a place where: It’s as easy as possible to ship code to production; we are never afraid to deploy; we ship ambitious projects as a series of small, safe steps”. We fight to ensure this stays true.
This one has meaning along at least two dimensions. First and foremost, the Ops team is responsible for ensuring our continuous deployment systems are always performing well, and that we spend the right amount of time thinking and acting strategically on improving them. Somewhat uniquely, Intercom’s engineers ship code to production about 100 times a day. We strive to have every single commit to “master” result in an individual production deployment. This is actually an ingredient of Intercom’s “secret sauce”. It contributes to safer, faster, easier-to-debug deployments, quicker customer feedback, and improved engineer morale and motivation, and it helps attract strong engineers to work here. If you’re still in any doubt as to whether this is a good thing or not, read this post by Darragh, our VP of Engineering, Shipping Is Your Company’s Heartbeat.
Secondly, it encourages us to design systems and infrastructure that are faster, safer and easier to change and upgrade. We hate having “one” of anything, as it forces a riskier big bang approach to changes/upgrades and necessitates significantly more research and planning. Even then, there’s only so much one can gain from lab testing. It will never be as good as the things you learn from subjecting something to live traffic. Instead we favor designs that are sharded from the start or leverage technologies that we’re already using and enable us to perform phased changes and upgrades.
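To make the idea of phased changes across shards concrete, here is a minimal sketch in Python. It is an illustration only, not Intercom’s actual tooling: the function name `phased_rollout`, the `upgrade` and `is_healthy` callbacks, and the batch sizes are all hypothetical.

```python
from typing import Callable, Iterable, List


def phased_rollout(
    shards: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    phase_sizes: Iterable[int] = (1, 2, 4),  # hypothetical batch schedule
) -> List[str]:
    """Upgrade shards in growing batches, stopping at the first unhealthy shard.

    Returns the shards that were upgraded and passed their health check.
    """
    remaining = list(shards)
    sizes = list(phase_sizes)
    done: List[str] = []
    phase = 0
    while remaining:
        # Reuse the last batch size once the schedule is exhausted.
        size = sizes[min(phase, len(sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for shard in batch:
            upgrade(shard)
            if not is_healthy(shard):
                # Halt here: shards not yet reached keep running the old version.
                return done
            done.append(shard)
        phase += 1
    return done
```

With only “one” of something, the first batch is the whole system. With shards, a bad change is caught while it affects a single shard under live traffic, and every other shard is still running the old version.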
3. Zero Touch Ops.
We fight to prevent operational interrupts. No single machine dying should cause us to be paged. Our systems run well on autopilot. We can regularly sleep through the night and enjoy our weekends when on-call.
This value empowers us to fight against anything that could become technical debt or potentially cause burnout. Burnout is a very real risk in the Ops world and we don’t want anything we build to contribute to anyone experiencing it. We diligently measure our bug counts, the pages we receive and anything else that causes us an operational interrupt. We instigate projects to reduce or eliminate these things. We hold ourselves to a high bar for automation, and know that by putting the time in upfront to automate something, we’ll save ourselves buckets of time and heartache in the future. We also strongly believe that fully automated systems that page less ultimately contribute to providing a higher quality service to Intercom’s customers.
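As a rough illustration of what measuring pages can look like, the sketch below groups pages by ISO week and by source, so that the noisiest source each week stands out as a candidate for a reduction project. It is a hypothetical example rather than a description of our real tooling: the function name `pages_by_week_and_source` and the `(date, source)` record shape are assumptions.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def pages_by_week_and_source(
    pages: Iterable[Tuple[date, str]],
) -> Dict[Tuple[int, int], Counter]:
    """Count pages per (ISO year, ISO week), broken down by source."""
    summary: Dict[Tuple[int, int], Counter] = {}
    for day, source in pages:
        iso = day.isocalendar()   # (ISO year, ISO week number, ISO weekday)
        week = (iso[0], iso[1])
        summary.setdefault(week, Counter())[source] += 1
    return summary


# Hypothetical page log: (day the page fired, alert source).
example_pages = [
    (date(2024, 1, 1), "disk-full"),
    (date(2024, 1, 3), "disk-full"),
    (date(2024, 1, 3), "queue-latency"),
    (date(2024, 1, 8), "queue-latency"),
]
weekly = pages_by_week_and_source(example_pages)
```

Calling `weekly[(2024, 1)].most_common(1)` then returns the top source of interrupts for that week.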
4. Run Less Software.
We use AWS and other world class service providers to avoid the cost of running undifferentiated heavy infrastructure. We fight to build the smallest, simplest solutions possible, knowing that in the long term this will help us be true to our first three principles.
Run/write less software is a philosophy that was made popular by Amazon CEO Jeff Bezos at a conference in 2006 when he said “There’s a lot of undifferentiated heavy lifting that stands between your idea and that success”. We agree. We run our infrastructure on AWS and make heavy use of a number of other AWS cloud products, as well as a small number of non-AWS PaaS products. This core value doesn’t always lead to black and white decisions. With some PaaS solutions we still find that, beyond a certain scale, it’s considerably more cost effective to develop and operate our own solution than to buy in a third party service. Even when a high quality, low cost PaaS solution exists, we’re cautious about introducing new technology we’ve not run before. We ask ourselves if there’s a way that we could accomplish the same thing with the building blocks we already have. Check out this talk from my colleague Brian Long, in which he describes how this played out with a real engineering challenge. Finally, it’s important not to let a core value of “Run Less Software” become the thing that makes you change-averse or stops you from ever taking risks.
Will this scale?
So maybe what we’re doing does sound quite a lot like DevOps, but we’re still pretty happy with having a central Ops team, rather than having each product team manage their own outages and operational interrupts “in the moment”. This enables our product teams to move faster. It also means that outage pain stays concentrated in one team, rather than being so diluted across many teams that it goes unaddressed for too long.
We are well aware that this approach won’t scale forever. Even now, we’re actively hiring engineers, some to work on our core Ops team and some who will jump from product team to product team, improving performance and scalability in different corners of our product. We still plan on keeping the bulk of front line pager interrupts away from our product engineering team for at least another six months though. Who knows what we’ll do after that???
If any of these things sound fun to you, maybe you should take a look at our careers page :-)