
How Intercom’s Data Infrastructure team met growing demand with solid principles

Main illustration: Kristin Raymaker

Scaling a company is never a linear process. As your startup becomes a scale-up, teams will encounter obstacles that require them to quickly adapt to new demands.

That’s where our Data Infrastructure team found itself at the end of 2020. We provide data and tools for teams across Intercom to get insights and run crucial processes. Intercom has experienced major growth over the last couple of years, and we’ve hired lots of incredibly talented people to help us on our journey. As a result, our company trajectory has changed rapidly, and by the end of last year our team was experiencing higher demand than ever before. We realized that the infrastructure, practices, and processes we’d been using were struggling to operate efficiently at our new scale.

The Data Infrastructure team had reached a tipping point

The team spent most of its days dealing with minor issues that arose within our system, constantly working reactively instead of looking at the underlying problems and proactively strengthening the infrastructure. We simply didn’t have time. As the team’s manager, I often had to jump in and help out with everyday tasks rather than focusing on the team’s direction, strategy, and professional development. We’d reached a tipping point, and it was clear something had to change.

“We established a set of principles to align the team on our goals and focus our work”

When Cormac McGuire, our Group Engineering Manager, joined the team, we took a step back and looked at what needed to be done to get us back on track. We noticed several issues we’d seen block teams in the past, such as siloing of knowledge, constant context-switching, and deprioritization of important system health issues. To fix these problems, we established a set of principles to align the team on our goals and focus our work.

Why are principles integral to the way we work at Intercom?

Over the years we’ve learned that our highest performing, happiest teams deal better with demands when they’re thoughtful and deliberate about how they work. We find principles are the best way to scale a team and keep them aligned while trusting them to do what’s right for them. Our principles grow from what we’ve learned about what works well – and what doesn’t.

Here are the most pressing problems we needed to solve, and the principles we applied to each one.

Problem 1: Prioritizing speed over problem-solving

We pleased our customers (in our case, our colleagues across Intercom) by delivering projects quickly, but we were not allowing ourselves enough time to understand the core problem to be solved. We often had to revisit completed projects when a prior assumption proved incorrect or we realized a scenario had been overlooked.

Principle 1: Do less, better

Working on fewer tasks means less context switching and allows for deeper focus to understand the problem entirely. The team has more space to iterate on the solution until it satisfies the goals we’ve set out to achieve.

Adopting the “do less, better” principle meant making difficult trade-offs to benefit the team long-term. First, we established a status service so that other teams could check the progress of their data instead of checking in with us. This freed up time we would have spent answering queries so we could use it to work on our systems, and ultimately speed up data delivery.

“We needed to focus on one thing until it was solved and we were sure we wouldn’t have to revisit it. Only then could we move on to the next thing”

Second, we chose to focus only on the reliability of our daily ELT (extract, load, transform), the process by which the latest data is pulled each night and all existing data is refreshed. We needed to focus on one thing until it was solved and we were sure we wouldn’t have to revisit it. Only then could we move on to the next thing.

Problem 2: Knowledge silos

Our Data Infrastructure team is small, so engineers would generally work on projects individually. It was difficult for other engineers on the team to review code without the necessary context, and if issues arose with existing services, only the engineer who had worked on the system had the knowledge to resolve the problem quickly.

“We had smart people doing smart things in parallel”

When that engineer was on leave, all work would stop. Our teammates soon became frustrated at being the sole person responsible for an area. In short, we had smart people doing smart things in parallel – we needed to create cohesive processes that supported our engineers better.

Principle 2: Pair up on problems

Every solution would have at least two engineers working on it. Assigning one engineer instead of two doesn’t necessarily double the efficiency or improve the quality of the outcome; it just increases the risk of creating a single point of failure. Projects always yield better results when there is more than one perspective included in the process.

Knowing there was always someone to answer questions or resolve issues within a particular area reduced pressure on individual engineers, making it easier for them to take time off or move on to new projects.

Problem 3: Under-prioritization of system health

System health issues are part and parcel of operating any service. However, without an effective system for triaging and prioritizing new issues, the on-call engineer would subjectively decide which issues to address first.

When these system health issues did arise, we were reluctant to flag them as top priority (P1) because our analytics data is not strictly customer-facing, and therefore we deemed it less critical. However, these issues had the potential to affect overall system health and negatively impact our team’s work. We realized we weren’t prioritizing them highly enough, and over time they were compounding to cause larger problems.

Principle 3: System health is always P1

Any system issue affecting our primary SLAs (service level agreements) would be first priority (P1). We needed to rethink our approach to flagging an issue as a P1: to stop thinking of P1s only as urgent, customer-blocking emergencies, and instead see them as triggers for an important process.

Since implementing this principle, we’ve dealt with issues much more effectively. System health issues are flagged as P1s, and if the on-call engineer lacks sufficient context to solve a new P1 issue independently, the team pauses proactive work and redirects its efforts until the problem is fully root-caused and resolved. The incident is automatically recorded in our Engineering team Slack channel, meaning that anyone across the org with extra context or insights on the issue can contribute to resolving the problem as quickly as possible.
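
As a rough illustration only, the triage rule described above could be sketched in Python like this. It is not Intercom’s actual tooling: the names Issue, Priority, triage and next_step are hypothetical, and so is the P2 level, which stands in for "everything that is not P1".

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    P1 = 1  # first priority, as described in the principle
    P2 = 2  # hypothetical placeholder for any lower priority


@dataclass
class Issue:
    title: str
    affects_primary_sla: bool
    # Recorded for context only; deliberately NOT used when triaging.
    customer_facing: bool = False


def triage(issue: Issue) -> Priority:
    """Anything affecting a primary SLA is P1, customer-facing or not."""
    if issue.affects_primary_sla:
        return Priority.P1
    return Priority.P2


def next_step(issue: Issue, on_call_has_context: bool) -> str:
    """Decide who picks the issue up, following the process in the text."""
    if triage(issue) is not Priority.P1:
        return "add to backlog"  # assumption: non-P1 handling is not specified
    if on_call_has_context:
        return "on-call engineer resolves"
    return "team pauses proactive work and helps resolve"
```

The point of the sketch is that customer_facing plays no part in the decision: an internal analytics issue that threatens a primary SLA is treated exactly like a customer-blocking one.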

Be realistic about what your team can handle

It’s easy for small teams to take on too much, spread their focus too thin and miss important details that will create more work in the long run.

Doing less, better and placing system health as our top priority meant that we could build more robust structures from which to improve other key elements of our process, and work proactively instead of reactively. Assigning two engineers to each project has transformed the way we work. One of Intercom’s values is “we go further together”, and this has proven true time and again since we’ve adopted this approach.

Are you interested in the way we work and approach problems? We’d love to talk to you – check out our open roles.
