We’re serious about delivering the best customer experience with our Engagement OS. We have truly global ambitions to bring our mission of making internet business personal to the biggest enterprise customers across the world.
This ambition is reflected in the way we design and build the infrastructure that supports the Intercom platform. We are building for the long term – that means ensuring reliability by default, and the ability to accommodate massive scale as we grow.
Intercom has been growing, and will only continue to do so – many of our longest-standing customers have evolved with us over the years. As these existing customers have grown, and as we’ve welcomed bigger and bigger customers, we’ve always focused on saying yes to scale.
This is the third post in a series exploring the ways Intercom has scaled key functions to support the needs of enterprise customers.
We’re growing alongside our customers
- You want to serve multiple millions of active users. Can Intercom do that?
- You need to store multiple millions of user records. Can Intercom do that?
- You have many thousands of agents you want active in your workspace. Can Intercom handle that?
Yes to all the above.
Today, our systems dynamically scale to serve about 50,000 web requests per second at peak, 26,000 background jobs per second, and 11,000 public API requests per second – demonstrating our ability to continuously scale to meet the requirements of modern enterprises.
We want to match your ambitions. That means being able to accommodate huge workloads in a reliable and performant manner, and continuously expanding them as our customers grow with us. We want to truly partner with our customers to ensure we are solving their problems, at scale, for the long term.
At Intercom, we aim to run less software
We build exclusively on a very small set of core technologies. This allows us to develop teams of deep domain experts who support and enable product engineers as they build the next generation of Intercom, and who provide world-class observability tooling, scaling, reliability, and secure-by-default build patterns.
Our tooling allows for high availability
We work exclusively with AWS as our cloud services provider and currently provide data hosting offerings in three different global regions – US, EU, and Australia – each architected across multiple availability zones for high availability.
Our teammate app is an Ember.js frontend backed by a Ruby on Rails monolith. The Rails application is the core of Intercom and what we deploy to thousands of web, API, and asynchronous workers on dedicated per-function clusters.
These clusters automatically scale to service customer requests as we go through the peaks and troughs of customer traffic each day. For example, each year on Black Friday, as many of our customers hit their busiest period, our infrastructure scales to match without human intervention.
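As a rough illustration of how a cluster can size itself without human intervention, the sketch below applies a generic target-tracking rule: scale the worker count in proportion to how far observed utilization sits from a target. The function name, the percentage inputs, and the limits are hypothetical; this is not Intercom's actual scaling policy.

```python
def desired_workers(current_workers, observed_pct, target_pct,
                    min_workers=1, max_workers=10_000):
    """Return a worker count that would bring utilization back to target.

    All parameters are illustrative assumptions:
      current_workers -- workers currently running in the cluster
      observed_pct    -- observed utilization, as an integer percentage
      target_pct      -- utilization the cluster aims to hold
    """
    if current_workers < 1:
        raise ValueError("current_workers must be at least 1")
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")

    # Ceiling division keeps the arithmetic in integers, so the result
    # always rounds up to a whole worker.
    load = current_workers * observed_pct
    raw = -(-load // target_pct)

    # Clamp to the configured floor and ceiling.
    return max(min_workers, min(max_workers, raw))


if __name__ == "__main__":
    # A traffic peak: 100 workers at 90% utilization, aiming for 60%.
    print(desired_workers(100, 90, 60))
```

Under this rule, capacity grows during a peak and shrinks in the trough that follows, subject to the floor and ceiling.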
Our observability tooling allows us to closely monitor how we’re serving traffic
Our observability tooling keeps our finger on the pulse of how efficiently and effectively we are serving traffic on a per-customer basis. We also partner closely with AWS on new technologies and approaches designed to future-proof our rapid growth.
In addition to standard metrics and logging, our Rails monolith is auto-instrumented with high-quality, attribute-rich traces. This allows engineers to observe production without the need to write any additional code.
Our observability pipeline is based on Honeycomb Refinery and dynamically samples valuable requests to retain interesting traces (e.g. a customer-facing error) by default. We also have the ability to configure custom rules for full sampling control.
Crucially, for the most valuable transactions, we support 100% retention to give engineers all the data they need as we build out new features or debug production issues. Additionally, as we tag all traces with the customer ID of the request originator, we can deep dive into how any of our customers are experiencing Intercom.
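The retention behavior described above can be sketched as a small decision function: always keep traces that match a rule, and keep a deterministic fraction of everything else. This is a simplified stand-in, not Refinery's implementation or Intercom's configuration. The attribute names (`error`, `customer_id`), the `base_rate` parameter, and the full-retention customer set are all assumptions made for illustration.

```python
import hashlib


def keep_trace(trace_id, attrs, base_rate=10,
               full_retention_customers=frozenset()):
    """Decide whether to retain a trace.

    Returns (keep, sample_rate), where sample_rate is how many original
    traces a kept trace represents. All names here are hypothetical.
    """
    if base_rate < 1:
        raise ValueError("base_rate must be at least 1")

    # Rule 1: interesting traces, such as errors, are always kept.
    if attrs.get("error"):
        return True, 1

    # Rule 2: a custom rule pinning selected customers to 100% retention.
    if attrs.get("customer_id") in full_retention_customers:
        return True, 1

    # Default: keep roughly 1 in base_rate traces. Hashing the trace ID
    # makes the decision deterministic, so every span of a given trace
    # gets the same verdict.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % base_rate
    return bucket == 0, base_rate


if __name__ == "__main__":
    print(keep_trace("trace-1", {"customer_id": "c42", "error": True}))
    print(keep_trace("trace-2", {"customer_id": "c42"},
                     full_retention_customers=frozenset({"c42"})))
```

Because the customer ID travels with every trace, the same attribute that drives a retention rule can also be used to filter traces when investigating a single customer's experience.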
Different datastores allow for optimization across various use cases
We run several different types of datastore to support our various data read and write requirements across the applications:
- AWS Aurora MySQL: Our source-of-truth datastores are largely built on top of AWS Aurora MySQL. As part of our initial scaling, we sharded the databases by function. Once individual database clusters grew to the largest instances AWS Aurora provided, we kicked off a program to build out per-customer databases spread across multiple smaller database clusters, which we can now scale both horizontally and vertically. This work was completed in early 2020, and the architecture allows us to scale our largest tables indefinitely.
- ElastiCache: In front of our databases we have a memcached caching layer built on top of ElastiCache.
- DynamoDB: We use DynamoDB sparingly, for use cases with very high read and write throughput.
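To make the per-customer sharding idea concrete, here is a minimal sketch of a directory that maps each customer to one of several database clusters and places new customers on the least-loaded one. The class, its methods, and the cluster names are hypothetical; the actual routing layer is not described in this post.

```python
class ShardDirectory:
    """Illustrative customer-to-cluster lookup (not Intercom's actual code)."""

    def __init__(self, clusters):
        clusters = list(clusters)
        if not clusters:
            raise ValueError("at least one cluster is required")
        # Number of customers currently placed on each cluster.
        self._counts = {name: 0 for name in clusters}
        # Sticky assignments: a customer stays put once placed.
        self._assignments = {}

    def add_cluster(self, name):
        """Scale horizontally by registering another cluster."""
        if name in self._counts:
            raise ValueError("cluster already registered: " + name)
        self._counts[name] = 0

    def cluster_for(self, customer_id):
        """Return the cluster holding this customer's database."""
        if customer_id not in self._assignments:
            # Place new customers on the least-loaded cluster, breaking
            # ties by name so placement is deterministic.
            target = min(self._counts, key=lambda n: (self._counts[n], n))
            self._assignments[customer_id] = target
            self._counts[target] += 1
        return self._assignments[customer_id]


if __name__ == "__main__":
    directory = ShardDirectory(["cluster-a", "cluster-b"])
    for customer in ("acme", "globex", "acme"):
        print(customer, "->", directory.cluster_for(customer))
```

An explicit lookup table, rather than a hash of the customer ID, means adding a cluster does not reshuffle existing customers; only new placements land on the new capacity.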
Search is an indispensable part of Intercom
Search is largely powered by many per-function Elasticsearch clusters. We’ve built many tools that automate Elasticsearch’s most laborious tasks including security patching, upgrading, and replacing degraded hardware. We’ve also built tooling that allows us to run migrations (in a similar way to MySQL) against indices.
As well as improving our engineers’ ability to iterate on index schemas at scale, this migration tooling allows us to break down large indices into smaller ones that are easier to manage and provide higher performance and stability. It also gives us a further dimension along which we can scale our Elasticsearch clusters. Like our MySQL sharding approach, this gives us many years of scaling runway.
Intercom’s global infrastructure is built for internet scale
Our global infrastructure is designed to serve hundreds of thousands of companies, big and small, and the rigor we apply to managing our infrastructure operations ensures things run smoothly.
Scaling is a key input that every team considers when roadmapping. Experts at multiple levels across our backend teams contribute to our regular operational reviews, where we assess infrastructure metrics and review capacity requirements.
We work hard and smart to keep it that way, and that’s why our customers trust us – from the smallest startups to the world’s biggest enterprises.