We’ve told you all about our products and features and the launches we’re excited about. Now, we take you behind the scenes and introduce you to the work of the people who make it happen.
Over the years, we’ve covered a lot of ground on our podcasts. Every week, we give you an insight into the world of product management, design, support and marketing with Inside Intercom; explore how industry leaders are using CX to grow their businesses with Scale; and, on Intercom on Product, join Intercom co-founder Des Traynor and Intercom SVP of Product Paul Adams as they share their latest thoughts on how to build great products.
And now for something completely different. For the very first time, we’re releasing Engineer Chats, an internal podcast here at Intercom about all things engineering. Previously hosted by Jamie Osler, a Senior Product Engineer at Intercom for over seven years, it’s now up to Principal Systems Engineer Brian Scanlan to pick up the baton and keep the chats going.
This week, besides Jamie and Brian, you’ll also hear from:
- Mike Stewart, former Senior Principal Engineer at Intercom
- Patrick O’Doherty, former Senior Security Engineer at Intercom and now an engineer at Oso
- Intercom co-founder Ciaran Lee
- Meena Polich, Intercom’s Senior Counsel supporting R&D
From the process of disambiguation and the worst outage we ever had to our obsession with speed and how legal and engineering teams can work better together, Engineer Chats will give you a peek behind the engineering process at Intercom.
If you’re short on time, here are a few quick takeaways:
- Disambiguation, or the process of narrowing down a wide solution space for a given problem, is not just for ambiguous projects. It can be applied across the entire building process, in engineering and even in product management.
- Data models are at the core of algorithms and systems. When tackling a technical design for a system, make sure you always understand the data models first.
- Automation in infrastructure can lead to pretty serious blunders. And while these issues aren’t fun for anyone, you can use them to look for other blindspots and build a more robust system.
- Your default operating cadence should be to run – it’s important startups don’t compromise on speed. If you can do something this week instead of next quarter, jump on it.
- The legal team isn’t there to slow R&D down. Their priority is making sure that, as the company grows and increases in complexity, it continues to do so within the confines of the law.
If you enjoy our discussion, check out more episodes of our podcast. You can follow us on iTunes or Spotify, or grab the RSS feed in your player of choice. What follows is a lightly edited transcript of the episode.
Liam Geraghty: Hi there, and welcome to Inside Intercom. I’m Liam Geraghty. If you are a regular listener, you’ll know that we interview makers and doers from the worlds of product management, design, startups, and marketing. We also have two other podcasts – Intercom on Product, where Intercom co-founder Des Traynor and Intercom SVP of Product Paul Adams discuss their latest thoughts on how to build successful products at scale, and Scale by Intercom, where we explore how businesses are driving growth through customer relationships.
One podcast you definitely did not know we made is Engineer Chats – and that’s because it’s an internal podcast at Intercom. It was hosted by Jamie Osler, a former Senior Product Engineer here. In each episode, Jamie sat down with a variety of different folks to talk about a variety of different topics related to engineering.
Today, we bring you a sonic window into all things engineering at Intercom. We’ve taken the best bits from the show, from the story of the worst outage we have ever had to how legal and engineering teams can work better together. First up, disambiguating: the act or process of distinguishing between similar things and meanings to make the meaning or interpretation clearer or more certain. Mike Stewart, former Senior Principal Engineer at Intercom, sat down with Jamie in October 2020 to talk about that word and why he uses it so much at work. Here’s Jamie.
Disambiguation all the way down
Jamie Osler: Something I’ve seen you do with great results when approaching a project that’s a little wooly and not super well defined in terms of what success means and how best to approach it is what you sometimes refer to as disambiguating. Could you tell us what you mean when you say that?
Mike Stewart: Yeah. Disambiguating. That’s a word I never used much before I came to Intercom, and I have used it so much since I got here. I probably should have used it in previous places before, but it is such a good word. It is not even just for wooly projects or ambiguous projects. I almost think this is a very general verb as part of our entire building process that goes way past engineering and into a lot of stuff that PMs do of figuring things out.
“You have a wide solution space… it’s the process of winding that down based on evidence and decisions and calls”
If you go right back to the pre-project state, obviously we have teams, they have areas of ownership, and they figure out roadmaps around them, right? The team is responsible for our entire marketing/engage/outbound surface area, and they own being successful within that. That is an ambiguous problem. The process of figuring out where we sit within that – of all of the things we could do and the ways we could do them – and narrowing in, maybe not a hundred percent figuring out, but closing down that solution space to get a tighter and tighter view of, within all the things you could do in the engage space, which ones we think are the most important, the ones customers are looking for the most, the highest return on investment – that is a process of disambiguation. You have a wide solution space, ambiguity about where you should go within the many different places you could go in that solution space, and it’s the process of winding that down based on evidence and decisions and calls.
When I apply that to an engineering project, there’s the same sort of thing a couple of stages down the pipeline. Once we’ve decided to build a new messenger with a public platform where you can build apps and embed them in a messenger, there is the entire solution space of what that means, all the different shapes that could take, how it could manifest, and how you could build it. It’s disambiguation all the way down, until the ambiguity you are thinking about is, “We know we want to embed an iFrame that has a certain interface, the developers move back and forth, and then, how do we actually implement that, tech design it, and write the code to do it?” Those are the even more zoomed-in levels. You are still working through ambiguity there. So, I think disambiguation runs through the entire product development process.
“I almost think of this as one of those videos of the universe that goes from zooming all the way to the earth as a dot in a galaxy and all the way through the human scale and the micro scale”
Jamie Osler: You have really narrowed that down as well. Maybe you could disambiguate that a little bit.
Liam Geraghty: Mike has a great way of visualizing the process of disambiguating.
Mike Stewart: Yeah. I almost think of this as one of those videos of the universe at different orders of magnitude that goes from zooming all the way to the earth as a dot in a galaxy and all the way through the human scale and the micro scale. There is an interesting structure at each of those levels, and in the same way, I think there is interesting ambiguity at each of the zoom levels as things get more and more defined.
The techniques are different when you’re, say, writing code and figuring out, “Hey, what are my concepts in code, and how should I structure this code?” versus when you’re figuring out, “Hey, how should I tech design this, and what are the data models and the moving parts?” versus what’s the solution, versus the roadmap. I am abstracting it very far in saying it’s all disambiguation. Being very deliberate about what it is you’re attacking, and at what zoom level, is the most important principle in my head. And it’s where engineers can very naturally get sucked into the deeper zoom levels of disambiguation – figuring out how to build something – because that is often more comfortable or an easier nut to crack.
Being one with the data models
Liam Geraghty: To connect all of this with a concrete example, Jamie presents this one.
Jamie Osler: When we were looking at how the billing system sent data to Zuora and how it tried to ensure that state was synchronized between the two, we had to understand how the current system did it, so we could disambiguate the current system, break it down to its core ideas and principles, and see which of those were relevant going forward. As part of that, you wrote up a document that explored how Zuora’s modeling of rate plan data over time worked. And I think that is something a lot of people wouldn’t have dug into at that level. What triggered you to think that would be a useful thing to do? And how do you know when to do that investigation, and when not to?
Mike Stewart: Yeah, for sure. In terms of the zoom levels we were talking about before, this, for me, is right in that high-level tech design zoom level. To recap, in billing, we were at the point of, “Hey, we pretty firmly understand that we have got these two systems. We have our own Rails app, and we have this external Zuora system. We know that, at least for a decent chunk of the future, we are going to have these two systems. We are not going to change that constraint. We have a lot built on the two of them, so it’s not feasible to move off. We need to have the two systems in sync, we need to have them agree with each other, and all of these problems that stem from us being unable to have them agree with each other need to go away.” We kind of understood that was the mission.
“You cannot devise an algorithm independent of a data model. And I think the same is true when you start talking about systems and product features”
And then, it was a case of figuring out which technical solution was the way to accomplish that. In terms of techniques, when I am thinking about tech design at this high-level tech design or system design phase, what I almost always do is go to the models. There is a lot of surface area you can try to understand. There are a lot of things that are important, like how your code is structured, what’s moving around, what workers you have, and what’s trying to do what. But the fundamental concepts, the core concepts in the system, are almost always the same as the data models in the actual database – the schema in your database, or the public schema in your third party, or whatever they are. The core concepts are the data models that are involved. And some famous computer scientist – I have no idea which – has definitely expressed the sentiment that the core of algorithms is data models. You cannot devise an algorithm independent of a data model. And I think the same is true when you start talking about systems and product features. The data models are the foundation of any design.
So, in this situation, the first thing we did when we landed in billing was to understand our own data models. Because for you and me, Jamie, landing in there was like the wild west. Like most of Intercom, we had never seen the inside of this, it was a brave new frontier. So, first of all, we had to understand, “Hey, what the hell are all these tables involved in our own system?” We got to that understanding relatively quickly with the aid of the previous team in San Francisco and built out that mental model.
“I am never comfortable moving forward with a technical design unless I fully understand the data models”
Then, the next major piece that was missing – which I think we almost came to attack too late – was, “Let’s really understand the data model of Zuora, the system that we’re digging into.” The effort you were talking about, I think, was only maybe a week or so of time where I was basically firing up a console, manually poking the data models in Zuora, changing something, running some commands to see what happened, and exploring in a sort of black-box style to understand the data model. And at the end of understanding that, we could say, “Hey, there is this big stack of models. The really important ones are down here, right at the leaf. They are the rate plans and charge segments, or something like that, that store the guts of the data.” And once you properly understand the core concepts and data models, then you can start building, you can start designing a system. And that is particularly true when we talk about replication systems like this was, whose fundamental job is reliably shuffling one set of data models and translating it into the semantically equivalent thing in another set of data models.
Your original question, not to lose sight of it, is how do you know when you should do that? For me, that’s a really simple one. I am never comfortable moving forward with a technical design unless I fully understand the data models. And I will tell you one place where I was burned by not deeply following that principle: later, coming to Salesforce, I had some understanding of the core concepts and data models, but Salesforce was a big, big world, and there was a lot of time pressure. I did not go to the same depth of understanding of the data models as I did for Zuora. And I think the same was true for everyone on the team – we didn’t get to the same depth on the data models. And we sort of felt the results of that: we built something good, but a year later, after more context with these data models, we realized, “Hey, we did not understand them correctly the first time. We didn’t correctly map the translation between Salesforce and our own system, and we have more work to do to repair that lack of knowledge.”
Jamie Osler: That is super useful. That was a great chat about the way you disambiguate projects.
Mike Stewart: I hope it was a great chat, Jamie, and I hope we got some useful content here.
Jamie Osler: Hashtag content.
The bright side of a gloriously bad outage
Liam Geraghty: If you are a user of Facebook, WhatsApp, or Instagram, you will no doubt remember the outage in October of this year – the longest global outage in Facebook’s history. It all came down to a faulty configuration change on their end. Outages are not fun for anyone. Someone who particularly dislikes them is Intercom Principal Systems Engineer Brian Scanlan.
Brian Scanlan: I hate outages, which is why I have dedicated my career to fighting them.
Liam Geraghty: Brian sat down to chat with Jamie about them in November 2020.
Brian Scanlan: Part of the reason I am drawn towards outages, or why I spend my time on them, is that doing so has been pretty good for my career so far. And that’s because I decided to take an interest in them – get involved in running them, thinking about them, being part of them, and following up on them.
Liam Geraghty: Brian recalled some notable outages at Intercom.
“I remember wanting to be sick in a bin when I realized that Elasticsearch was empty. I was like, ‘Oh, this is so bad’”
Brian Scanlan: One of the most traumatic outages I was involved with, even though I was not actually there during the outage, was the great Elasticsearch outage of January 2019.
Liam Geraghty: Someone who was there was Patrick O’Doherty, a senior security engineer here at the time.
Patrick O’Doherty: I remember wanting to be sick in a bin when I realized that Elasticsearch was empty. I was like, “Oh, this is so bad.”
Brian Scanlan: This was a particularly spectacular one. The reason I wasn’t there is that I was at my 40th birthday drinks with some friends. It was a Friday evening, and because we are not scared of shipping code to production on a Friday, I approved a pull request adding a subnet to our AWS VPC that evening.
Jamie Osler: In between drinks?
Brian Scanlan: No, it was actually on the way – I was sober at the time. When that subnet was attempted to be added to our network inside of Amazon, the automation that we wrote kicked in. We use a tool called Terraform to manage our low-level infrastructure inside of AWS, and we had a bunch of Terraform modules – think of them as reusable code that we wrote to try and simplify a bunch of infrastructure inside of AWS, with all of the settings and stuff that we want applied.
“At that point, when the configuration was applied, it had completely destroyed or taken our network offline”
And so, this automation very diligently took the description of the subnet that we wanted to be added. But at the moment of application, AWS’s APIs rejected it because there was an overlapping IP subnet, or rather the subnet that was being configured overlapped with an already existing one. And so, at that point, the Terraform application process just kind of gave up. It stopped. Which, in a bunch of cases, is a completely reasonable thing to do. But unfortunately, the way we had implemented our Terraform module meant that it was removing all of the information about the routing tables that existed on a subnet and adding them back in while it was configuring all of these subnets. So, in effect, it had removed all of the routes, which are how a network knows how to get to the internet and other networks, which is pretty important. So, at that point, when the configuration was applied, it had completely destroyed or taken our network offline. That’s just the start.
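The failure mode Brian describes – automation that tears state down and re-adds it, rather than updating it in place – can be sketched in a few lines. This is a hypothetical Python illustration of the general hazard, not Intercom’s actual Terraform code; the `Network` class and the overlap check are stand-ins:

```python
# Hypothetical sketch: automation that removes state and rebuilds it
# is not safe to interrupt partway through.

class Network:
    def __init__(self):
        # Routes are how the network reaches the internet.
        self.routes = {"0.0.0.0/0": "internet-gateway"}
        self.subnets = {"10.0.1.0/24"}

def overlaps(existing, new):
    # Crude stand-in for the cloud API's real overlap check.
    return new in existing

def apply_remove_then_recreate(net, new_subnet):
    saved_routes = dict(net.routes)
    net.routes.clear()                      # step 1: tear down routing
    if overlaps(net.subnets, new_subnet):   # step 2: the API rejects the change...
        raise RuntimeError("overlapping subnet")
    net.subnets.add(new_subnet)
    net.routes.update(saved_routes)         # step 3: ...so we never get here

net = Network()
try:
    apply_remove_then_recreate(net, "10.0.1.0/24")  # overlapping subnet
except RuntimeError:
    pass
print(net.routes)  # {} - the routes are gone, the network is offline
```

An idempotent version would compute a diff and apply only the changes needed, leaving the existing routes untouched when the subnet addition fails.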
Jamie Osler: I mean, that is bad, right? That’s not good.
Brian Scanlan: Yeah. So, that took Intercom entirely offline. And then, it took a while to get to the point where we could roll back. By we, I mean, not me – I was enjoying my drinks at this point. And so, the team figured out a way of getting into our Terraform provisioning infrastructure and rolling back the configuration change.
“Figuring out what on earth happened and where that data went to also took a long, long time. We are talking about an eight-hour outage here”
But unfortunately, in the meantime, other automation kicked in – this time, some automation owned by AWS. We use a technology called OpsWorks, which is a managed version of Chef, to manage our Elasticsearch hosts. It had auto-healing functionality built in, which we had enabled by default in our host-level configuration. If a host was uncontactable by the OpsWorks backend, OpsWorks’ workflow system would attempt to auto-heal the host in question, because it figured something had gone wrong there – the OS is down, maybe it ran out of memory, something bad has happened, so let’s try and fix it. So, this OpsWorks control plane decided to fix our entire infrastructure by replacing the hosts.
Unfortunately, we had been running Elasticsearch – and still do – with what is known as ephemeral storage. That’s host-based storage – we are not using a magical cloud-based system that stores your data in some third-party system or off the host. It’s just on a physical host, and if the physical host gets destroyed, the data is gone. And so, that’s what happened to every single Elasticsearch host. Every single Elasticsearch cluster lost every single piece of data, which is pretty bad because huge amounts of Intercom are built on top of Elasticsearch. It’s not the primary data store – we tend to write data to one data store, like, say, DynamoDB for our users, and then copy that data over to Elasticsearch for searching. And we can restore it, but the process of getting all that data back via backups and redriving all of the changes since our previous backups took a long, long, long time. Figuring out what on earth happened and where that data went also took a long, long time. We are talking about an eight-hour outage here.
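The pattern Brian describes – a primary store as the source of truth, with a copy redriven into Elasticsearch for searching – can be sketched roughly like this. This is a hypothetical Python illustration, with plain dictionaries standing in for DynamoDB and Elasticsearch; in a real system the replication step would be asynchronous:

```python
# Hypothetical sketch of the write-to-primary, replicate-to-search pattern.
# The search index is a derived copy: losing it is painful but recoverable,
# because everything can be redriven from the primary store.

primary = {}       # stands in for the primary store (e.g. DynamoDB)
search_index = {}  # stands in for Elasticsearch

def write_user(user_id, doc):
    primary[user_id] = doc        # the source of truth
    search_index[user_id] = doc   # best-effort copy for searching

def rebuild_index():
    # After losing the search cluster, redrive everything from the
    # primary store - slow, but no data is permanently lost.
    search_index.clear()
    search_index.update(primary)

write_user("u1", {"name": "Ada"})
search_index.clear()   # simulate losing every Elasticsearch host
rebuild_index()
print(search_index["u1"]["name"])  # Ada
```

The design choice matters here: because Elasticsearch is derived rather than primary, the outage cost eight hours of redriving instead of permanent data loss.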
“We didn’t just go, ‘Well, now we know about these two problems, let’s fix those.’ We went off and looked for other kinds of areas of automation that could bite us in bizarre situations”
This was a big deal because it happened late on a Friday, and it took a huge number of people to get things back to stable. We kind of knew about some of these problems, like having to redrive or refill our Elasticsearch clusters from scratch. What we didn’t know about were some of the dangers latent in our own automation and in some of the automation at AWS.
That was interesting because, following up on this, we didn’t just go, “Well, now we know about these two problems, let’s fix those.” We went off and looked for other areas of automation that could bite us in bizarre situations. So, we ended up doing a lot of things to get really good at restoring Elasticsearch clusters from different states and at redriving data into our Elasticsearch clusters at different times, should we ever fall behind or have similar disaster-type situations. And, you know, overall, we learned a lot from this gloriously bad outage, and the process afterward – what we learned and how we disseminated that information – was actually pretty good.
Patrick O’Doherty: I can’t remember who it was, but about an hour later, somebody thanked me for causing this incident because they were like, “Wow, you really shook a lot of stuff out of the tree here. This is going to be a really fun incident response.” That was basically the gist of it. It was like, “Oh, wow. We are digging up stuff here.” And it was. It shaped our use of Terraform and our general maturity around how we use tools while staying conscious that tools can hurt us as well. Respect power tools – like power tools, infrastructure is dangerous. It can move quickly and catch you by surprise, and I think we learned our lesson that day.
Brian Scanlan: I also got like an Inside Intercom talk out of this. Also, I wasn’t at the incident because I was at the pub for my birthday. It was great. It was the perfect incident.
At the speed of light
Liam Geraghty: In December 2020, a Christmas I’m sure we’ll never forget, Intercom co-founder Ciaran Lee joined Jamie to talk about speed and why Ciaran cares about moving fast.
Ciaran Lee: I am an extremely impatient person. That’s one thing. If I can do something quickly or do it slowly, I personally would just rather do it quickly. Intercom might seem like an old company coming up on 10 years, but I honestly do believe that we’re just getting started. We have so much to do. We are so ambitious. We can kind of see a picture of what we would like to be, this all-in-one tool that everyone with an internet business can use to talk to their customers. And we are only scratching the surface there.
One thing I really like came from Stripe, a cool company I look up to – a great blog post by Patrick McKenzie where he described how they set their default operating cadence to run. They default to moving uncomfortably fast, always asking if they can do the thing this week instead of six months from now. And I just really like that personally. That sort of attitude has served us really well, and I think it’s a fun thing to always think about. Can you go faster?
“It’s cool if we hit a hundred percent availability in a quarter, but maybe we should ask ourselves, ‘Hey, are we not being risky enough?'”
Jamie Osler: If you make going fast critical to your company, and it’s something you do all the time, you tend to break less.
Ciaran Lee: Yeah. Move fast and break things within acceptable parameters. It’s okay to have outages. It’s okay to have bugs – obviously, there are certain categories of bugs you want to have less of than others, but we have availability budgets. It’s cool if we hit a hundred percent availability in a quarter, but maybe we should ask ourselves, “Hey, are we not being risky enough? Could we take a little more risk to move quicker?” You should be at a deliberate point on the spectrum. And for sure, we have a big responsibility. We have lots of customers – hundreds of thousands of people logging in whose job it is to use our Inbox to talk to their customers each day. We can’t be breaking their stuff by moving too fast or changing it so quickly that they don’t even know how to use it anymore. That would be wrong. We have constraints, but within those constraints, we should absolutely move as fast as we can.
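As a rough illustration of the availability budget Ciaran mentions, you can translate an availability target into the downtime it allows over a quarter. This is a hypothetical back-of-the-envelope calculation with made-up target numbers, not Intercom’s actual budgets:

```python
# Hypothetical back-of-the-envelope: how much downtime an availability
# target "budgets" for in a 90-day quarter.

MINUTES_PER_QUARTER = 90 * 24 * 60  # 129,600 minutes

def downtime_budget_minutes(availability):
    # The fraction of the quarter the target permits you to be down.
    return MINUTES_PER_QUARTER * (1 - availability)

for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} availability -> "
          f"{downtime_budget_minutes(target):.0f} minutes/quarter")
```

If a quarter comes in well under budget, that is the signal Ciaran describes: there may be room to take on a little more risk and move faster.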
Where legal comes in
Liam Geraghty: And we are moving as fast as we can through this episode. Next up is Intercom Senior Counsel Meena Polich. Meena is on our legal team, with a focus on product and engineering. In January 2021, Meena and Jamie discussed how legal and engineering teams can work together.
“We are here sort of marching lockstep with the company and all of our clients to get where we need to go responsibly without slowing anybody down”
Meena Polich: It’s really important for us to understand the product. How can we possibly counsel the company on what regulations are going to impact us or what laws we have to follow if we don’t actually understand what we’re selling? At a very basic level, from a strategic standpoint, we need to understand not only what we sell now but what we want to sell and how we want to position ourselves and grow. In that way, we can start building projections of the things we are going to need to keep an eye on from a legal perspective, and make sure we are here sort of marching lockstep with the company and all of our clients to get where we need to go responsibly without slowing anybody down. From a more tactical standpoint, understanding the company values and product is extremely helpful for negotiating with customers and even vendors. It gives me much better leverage when I understand what we are trying to do, because then I can explain to our vendors, “Because we are trying to do this, we need you to be able to do this.”
Conversely, when I am negotiating with customers, a lot of times the people on the other side of the table are other lawyers or procurement agents, who are about as technical as I am, if not less so. And so, it helps to be able to explain what the product does as the lawyer who’s saying, “Look, I know what your concerns are from a legal risk-management perspective, but here’s how the platform actually works. Here’s how the product actually works in practice, and that is why it’s not going to trigger the risk that you’re concerned about.”
“My first priority is helping R&D understand that I am not here to derail the amazing progress we’re making”
Jamie Osler: I guess this works both ways, right? If R&D has a better understanding of the kind of high-level legal overview of where the areas of concern might be, it helps them avoid unintentionally doing things or building products that would be risky or in violation of those laws.
Meena Polich: Yes, absolutely. And that is the most important thing to take away from or try to focus on while building the legal relationship with R&D. My first priority is helping R&D understand that I am not here to derail the amazing progress we’re making, and my team is not here to stop us from continuing to go to market with excellent products. Our team is here to make sure that, as we grow and it becomes harder to keep tabs on everything every individual in the company is doing, we continue to do so ethically and we continue to do so within the confines of the law. And when we can, we try to manage that risk.
That is one of the reasons why compliance by design is so important. If we keep compliance requirements and expectations in mind and design towards them, a lot of the time, the design changes we make will actually benefit our bottom line. There may be an initial cost in terms of resource allocation, but in the long run – and not even the super long run; in a lot of cases, within six months to a year of pushing a feature out – we will see an incredible benefit in terms of revenue growth, the types of leads we generate, and the customers we attract, because they will trust us.
Liam Geraghty: My thanks to Jamie Osler, who created Engineer Chats, to its new host Brian Scanlan, and to all of today’s guests, who kindly let us take their internal chats external. If you enjoyed today’s show, why not leave us a review or give us a shout-out on social media? We love to see and hear what you think. That is all for today. We will be back next week with another episode of Inside Intercom.