Main illustration: Chelsea Wong
At Intercom, we are highly focused on delivering products that actually solve problems for customers. That intense focus means we are careful not to spend precious time and resources over-engineering our solutions.
Sometimes this focus on solving customer problems means we make simpler engineering decisions up front, in the knowledge that those decisions won’t scale forever and that we’ll need to revisit them in the future. It’s how we weigh up our options and choose to allocate the resources available to us at any given time, and it frees us to spend time on the important things. But what happens when those choices start to catch up with us?
“We were facing a database scalability problem, and decided to confront it by transitioning to sharded databases”
This is the story of how we responded when it became clear that one of those pragmatic earlier choices needed to be confronted, and how we dedicated the time and resources to finding a long-term solution. In this case, we were facing a database scalability problem, and decided to confront it by transitioning to sharded databases – horizontally partitioning our tables into many different databases, so that we could scale horizontally by adding more partitions or shards as well as vertically, by using bigger database servers. This was a huge undertaking that required not just technical skill, but also organizational flexibility.
Long-term solutions to long-term challenges
After nine years of connecting businesses with their customers, our databases had grown to scales we never could have dreamed of when making those early choices. We were running on the very biggest servers offered by Amazon RDS and had tens of billions of rows of data with tens of thousands of reads and writes per second. Running a schema migration of our largest tables was now all but impossible.
“Individual teams were spending more and more time worrying about whether we could scale rather than concentrating on delivering the next great feature to our customers”
Decisions that were eminently sensible when Intercom was much smaller were beginning to show their age, and scaling with our customers as they grew was becoming harder and harder. On top of this, responsibility for our databases was diffused throughout the engineering organization, and individual teams were spending more and more time worrying about whether we could scale rather than concentrating on delivering the next great feature to our customers.
Too often, engineering teams were doing quick fixes and short-term mitigations, sapping momentum from actually building product.
After a particularly thorough investigation of our MySQL databases, we determined that we probably had two years before the issue would be critical. In the past, when we were much smaller, we might have celebrated and just stopped thinking about the problem for a bit less than two years, but we were maturing to the point that a longer-term solution was obviously necessary. So late in 2018, a few engineers within our infrastructure group decided to do something a little bit different.
Sharding our data
Rather than asking “What do we do to get us back on track?”, we asked “How much work would it take to never worry about this again?”
The idea of horizontally sharding our data had been floated in the past but always dismissed as too hard or too slow, but by this point we needed to devote time and resources to work on it.
“We now had the maturity as an engineering organization to be able to allow a team of engineers to focus on solving this sort of problem”
So we made a proposal to our stakeholders – what if we dedicated a small team of engineers to work almost full time on sharding our data for as long as it took to solve the problem? While a significant investment on its face, the alternative involved hundreds of engineers having to consistently mitigate data issues across various projects, with all the costs of context switching and slower product development. In that light, the proposal started to look cost-effective.
We now had the maturity as an engineering organization to be able to allow a team of engineers to focus on solving this sort of problem, as well as the confidence to plan years in advance – something that just doesn’t make sense for a scrappy new startup fighting for product market fit.
Testing our hypotheses
Of course, we still needed to convince leadership that this wasn’t just going to be a very expensive boondoggle that would fail to deliver on what was promised.
To that end we brainstormed what possible solutions might look like, and eventually whittled it down to two:
- An out-of-the-box solution for MySQL database sharding such as Vitess.
- Building something in-house on top of technologies that we were already using, such as Aurora and ProxySQL.
We then took each option out for a test drive, building out a minimal version in production to kick the tyres and compare trade offs (we call this “baking a cupcake” in Intercom parlance). This allowed us to speak in much more concrete terms about what we would build and how we would build it, and let us take a lot of the early risk out of the equation with just a few weeks’ work for two engineers. While Vitess is swiftly becoming an industry standard, the conclusion of our “bake off” was to build our own minimal solution in-house, tailored exactly to our needs, rather than using a small slice of a complex tool that we would need to learn and master to be confident in production.
“The paper trail established at this point would serve as the North Star that kept the project on track”
Beyond the testing of hypotheses, the key at this stage was to document our learnings and decision-making processes. After all, we were laying the foundation for years of work – teams would change with time, and design decisions that we thought settled might end up back on the table. The paper trail established at this point would serve as the North Star that kept the project on track.
Using this documentation, we presented our case to senior management and our stakeholders: exactly what we wanted to build, how we would build it, and what we expected to gain from the investment. They approved the project and a team was built around the project, which would also serve as the “expert team” for datastore operations.
Benchmarking our clusters
Rather than dive immediately into building the sharded database solution, we instead stepped back and, with reference to the documentation already laid down, performed more extensive validation of our proposed design, beyond what was possible in the exploration phase. We set about constructing a real database cluster which would hold tens of thousands of shards and millions of rows of data, similar to what we would eventually run in production.
Once complete, we benchmarked the database by throwing various shapes of read load at it with tools such as sysbench, and documented where we saw performance start to break down, establishing our safe limits. The results were positive and suggested we would be able to run in production, so we moved on to the next phase – actually building the thing.
Perhaps surprisingly, this phrase was one of the quickest of the entire project. All the preparatory work really paid off – our hypotheses had all been correct and our design was broadly as originally planned, so a small implementation team powered through the building phase in the space of a few weeks and with no real hiccups. We even came up with a few refinements to the original plan.
“How do we move all the data from the old, unsharded database table to the new, sharded database tables without losing data?”
The true challenge came afterwards: how do we move all the data from the old, unsharded database table to the new, sharded database tables without losing data and without subjecting our users to significant downtime?
Unfortunately, there’s no silver bullet at this point. Our solution was to do the legwork: the data was copied across, writes were replicated to both databases, and we wrote all of the logic for doing these operations directly into our application code to keep the data layer as simple as possible. Then, when reading the data, we would check that both datastores agreed with one another.
Since this is a significant undertaking, we once again stepped back and identified a smaller, more manageable table with fewer read and write paths, and tested our approach there. Once that was complete, we were confident about undertaking one of our largest, most critical database tables. This final step took many months to complete since the stakes were so high and we had multiple terabytes of data to migrate, but the process was effectively the same as before.
Mature organizations must take long-term decisions
In the end, it took almost exactly 18 months from writing the first line of code in the early tests to the “go live” date.
Along the way, we found a lot of ways to deliver incremental value. We opportunistically cleaned up a lot of technical debt that had had accrued over the years. The first table we tested our migration process on was a significant source of load for our primary datastores, so by migrating it we improved latencies and gave ourselves more headroom to scale into in the future.
“Delivering value at every step of the journey and always finding a quick way to prove that the road ahead was safe allowed us to complete a massive project like this while avoiding the risk that the effort would be wasted”
Lastly, many other teams saw the value of what we were doing and decided to build new features with our sharded database from day one. By the time we shipped there were more than 20 tables using what we had already built, including some of our biggest recent releases such as Series.
Delivering value at every step of the journey and always finding a quick way to prove that the road ahead was safe allowed us to complete a massive project like this while avoiding the risk that the effort would be wasted.
Being able to take a step back and think in strategic terms, over the course of years, allowed us to deliver more value than we possibly could have if we had to just worry about getting to the next week or the next month.
At Intercom, one of our R&D principles is “Think big, start small”. This entire project was a perfect encapsulation of that, but the more mature you get, the bigger you’re able to think. More than anything, a project on this scale demonstrates our ever-increasing organizational maturity – which affords us the opportunity to make big plays with big rewards.