Main illustration: Jeffrey Kam
It’s a familiar problem for all companies that scale fast – how do you keep your core technologies manageable for the increasing number of teams that depend on them?
As Intercom’s product matured, our product teams needed to expand the depth and breadth of their technology expertise in order to support the features we were building for our customers.
The problem surfaced as product teams spending an increasingly large share of their time on operations, or on deep dives to understand our small set of core technologies. How those technologies were used was also diverging, needlessly increasing the complexity of our systems. Our product teams were being slowed down and our operational health was taking a hit.
“This is the story of how we grew and expanded our team over the past few years as the company rapidly scaled, and how we have gone about expanding our responsibilities”
Once we reached a certain size, we could justify dedicating a small team exclusively to helping other teams use the technologies Intercom is built on more efficiently. So in late 2018, Team Datastores was created, initially with a fairly conservative area of responsibility.
It wouldn’t be an operations team, as such – the product teams would still own their infrastructure. Instead, it would be an internal core technologies team that would take long-term views on how we build and scale Intercom. This is the story of how we grew and expanded our team over the past few years as the company rapidly scaled, and how we have gone about expanding our responsibilities.
Our first challenge – Elasticsearch
As we formed and figured out where to start, we had an obvious first technology: Elasticsearch. It was originally introduced to support advanced search over custom data attributes, and it is the only datastore we fully manage ourselves.
Over the subsequent years, as Intercom became a more powerful product, Elasticsearch was used to enable more features. Product teams typically set up and owned their own Elasticsearch infrastructure with little guidance, technical depth, or use of best practices.
This resulted in each cluster being its own unique snowflake, running on different hardware, different versions of Elasticsearch, and different configurations. Product teams were spending an increasing amount of time running Elasticsearch instead of solving customer problems. Our overall availability around Elasticsearch also suffered, and the learnings from availability incidents were not being applied across all Elasticsearch clusters.
Build a map of the problem space
To improve any situation, you first need to know the exact situation you’re in.
As we formed the team, we were very conscious of only taking on the most impactful technology slice we could and building ownership, stability, and momentum around it before taking on another area. We hoped this would build confidence within the team and provide evidence to our stakeholders for further investment.
To achieve this with Elasticsearch, we worked through these steps:
- Understand and document how we are currently using Elasticsearch. Build the necessary technical depth within the team through training and experimentation.
- Using our learnings, define the gold standard for running Elasticsearch at Intercom.
- Roll that gold standard out to all Elasticsearch clusters; rationalize where possible.
- Automate the process of achieving and staying at the gold standard.
- Increase observability and empower teams to support their infrastructure through tooling and training.
- Move further up the technology stack and provide cross-org solutions for problems that multiple teams face. For example: AMI rollouts, node replacement, and security patching.
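To make the “roll out and stay at the gold standard” steps concrete, an audit along the lines sketched below could compare each cluster’s reported settings against a single gold-standard spec. This is a hypothetical sketch: the field names, version number, and thresholds are illustrative, not Intercom’s actual standard.

```ruby
# Hypothetical gold-standard spec; the values here are illustrative.
GOLD_STANDARD = { version: "7.10.2", dedicated_masters: true, min_replicas: 1 }

# Takes a hash describing one cluster and returns a list of deviations
# from the gold standard (empty when the cluster is compliant).
def audit_cluster(cluster)
  issues = []
  if cluster[:version] != GOLD_STANDARD[:version]
    issues << "running #{cluster[:version]}, expected #{GOLD_STANDARD[:version]}"
  end
  issues << "no dedicated master nodes" unless cluster[:dedicated_masters]
  if cluster.fetch(:replicas, 0) < GOLD_STANDARD[:min_replicas]
    issues << "fewer than #{GOLD_STANDARD[:min_replicas]} replica(s) per shard"
  end
  issues
end

# A compliant cluster produces no issues; a snowflake produces several.
puts audit_cluster({ version: "6.8.0", dedicated_masters: false, replicas: 0 })
```

In practice, the per-cluster input for a check like this could be gathered automatically from Elasticsearch’s cluster and node info APIs, which is what makes the later automation step possible.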
The positive impact we had on engineering was readily apparent, which led to more investment in the core technologies area. It also led to a queue of engineers wanting to join the team.
Widening the circle of ownership
Over that year, Intercom’s use of Elasticsearch matured to the point where we were ready to take on another technology. Stronger support around RDS Aurora was top of mind, as one of our MySQL tables was already far outside AWS’ best practices. This spawned our sharding program, and our team took ownership of our largest RDS Aurora databases. Again, we followed the same maturity path, but with the benefit of building on top of a managed service: this time, we got much of the “gold standard” setup for free from RDS Aurora.
“This prompted us to step back and consider the principles we’d use to determine whether to expand our footprint to include new core technologies”
With our success around Elasticsearch and RDS Aurora, we had the momentum to further grow the size of the team and take on more of Intercom’s core technologies. This prompted us to step back and consider the principles we’d use to determine whether to expand our footprint to include new core technologies or mature the technologies we already owned.
Criteria for expanding ownership
How did we decide to take on a new technology versus further mature something we already owned?
We arrived at these key criteria:
- Our on-call load and product health work are light enough that we can naturally start to move up the maturity stack.
- We have more than one engineer who is an expert in each core technology we own.
- It becomes clear that multiple teams are experiencing growing operational toil with other core technologies. This would become evident when incident review action items had no obvious owners, or when excessive operational work began to appear on product teams’ roadmaps.
- We’re confident that the medium-term health of the technologies we already own is stable: there are no large, time-critical items on our backlog, and further investment would yield diminishing returns.
- The team feels whole and healthy.
Scaling to Rails
Up until this point our team had been focused on supporting datastores. There was another core technology used across all teams in Intercom that needed a permanent and well-supported home – Rails.
Intercom already ran on modern Rails, but we had yet to take advantage of some of its recent advancements. One of the first and most impactful changes was replacing our database read/writer proxy layer with built-in Rails 6.0 functionality (plus a few extra bits we will upstream). This removed over 60% of the read load from our largest writers, which led to a 40% drop in writer utilization. It’s straightforward to scale read capacity horizontally, but scaling a writer vertically is difficult when some clusters are already on AWS’s largest instances, so this drop in utilization gave us a significant scaling runway.
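The read/write routing described above maps onto Rails 6’s built-in multiple-databases support. The fragment below shows the general shape of that functionality; the database names and model are illustrative, and this is not Intercom’s actual configuration.

```ruby
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Route writes to the primary and reads to a replica. The
  # :primary/:primary_replica names refer to entries in config/database.yml.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Rails can switch roles automatically per request, or explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  # Queries inside this block are served by the replica.
  Conversation.count
end
```

With a setup like this, the framework handles connection switching that previously required a bespoke proxy layer.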
It was now obvious that our ownership of a set of adjacent technologies was helping us to raise the collective bar faster by leveraging our knowledge and context in one area to improve another. Our databases were in significantly better shape due to our Rails ownership and expertise.
It’s worth repeating: before expanding, we made sure we were in a stable place with what we already owned. We remained conscious of maintaining the balance of the team, avoiding both operational gaps in our existing areas and the risk of eroding the slack the team needs to continually perform in a healthy, consistent, and sustainable way.
“It became obvious that Team Datastores no longer accurately described the work we did”
With that expansion of our responsibility, it became obvious that Team Datastores no longer accurately described the work we did. To more accurately convey that, we’re undergoing a name change, from Team Datastores to Team Core Technologies.
Areas of ownership for Team Core Technologies
When we initially formed this team, we didn’t have the set of guiding principles outlined above. We arrived at them through trial and error as our team grew in size, confidence, and scope. They now give us the principles we follow for further expansion, all in support of our product teams.
If managing or working on a team like this sounds fun to you, we’re hiring!