Off Script: Reinventing customer service with AI

The latest breakthroughs in AI are already profoundly changing customer service. But what does it take to build systems that can truly leverage it?

This is what we’re exploring in Off Script, our new series of candid conversations with Intercom leaders about the extraordinary technological shift being driven by artificial intelligence.

People have been trying to develop computer systems capable of understanding natural language for decades. But from manually crafting linguistic rules to creating probabilistic models and neural networks trained on vast amounts of data, these techniques have consistently struggled with the complexity of language. We’ve known this for a while – natural language isn’t an easy nut to crack.

Maybe that’s why ChatGPT caught so many people off guard. Here it was, a system that seemed to actually be capable of understanding what you were saying. And it wasn’t just translation – it also seemed capable of summarizing text and understanding instructions. It didn’t take long to recognize this technology for what it is – a revolution unfolding before our eyes. Over a year and a half on, we believe it’s likely to be the biggest economic and societal shift since the Industrial Revolution.

“There’s an inherent structure and repetition to customer queries that make it highly suitable for AI to enhance efficiency and customer satisfaction”

The domain we operate in, customer service, happens to be a prime candidate for the application of this AI technology. There’s an inherent structure and repetition to customer queries that make it highly suitable for AI to enhance efficiency and improve customer satisfaction. We’ve seen it firsthand – when we shipped the first alpha versions of our AI agent, Fin, over a year ago, it resolved about 28% to 30% of customer support questions. Now, that number is closer to 45%.

This doesn’t mean you can just slap AI into a product and call it a day. It takes time and thoughtful work to develop products that can overcome limitations and significantly improve the customer experience. What should be prioritized when building these AI products? How can you develop robust chatbots that can handle customer queries? And how do you elevate your prototype to an industry-ready product?

In this episode of Off Script, our VP of AI, Fergal Reid, talks about the evolution of machine learning, the challenges of applying it to customer service, and what it takes to build exceptional products.

Here are some key takeaways from the episode:

Many of ChatGPT’s limitations can be worked around. Hallucinations, for example, can be reduced with techniques that supply relevant context, like retrieval-augmented generation (RAG).
Building a robust, industrial-strength customer service chatbot takes time – you must ensure it can handle unexpected queries and real-world scenarios.
To develop great AI products, focus on excelling at a few specific tasks rather than attempting many things your product can’t deliver.
Ship, measure, and iterate on AI prototypes rigorously – value often increases in ways that are invisible from a UI perspective, so prioritizing improvements in key metrics is crucial.
At Intercom, we believe AI-powered chatbots will become highly autonomous, expert systems capable of handling complex tasks across various channels.

We publish new Off Script episodes on the second Thursday of every month – you can find them right here or on YouTube.

What follows is a lightly edited transcript of the episode.

Off Script: Episode 3
Fergal Reid on Reinventing Customer Service with AI

Eoghan McCabe: Modern AI and the momentum behind it are now highly likely to represent the biggest economic and societal shift since the Industrial Revolution. It will, in a short number of years, probably do just about everything we call work. Modern LLMs and their progeny will directly do knowledge work, and they’ll start with the jobs that require the least sophistication. Arguably, that’s most customer service work. And so Intercom is very much in the middle of the action.

Our AI team is led by a man named Fergal Reid. He’s been building customer service AI solutions with us for at least six years, long before it was hot and cool in the way it is now. In this episode of Off Script, Fergal’s going to take a step back and talk about the recent history of machine learning and why the direction it’s headed in is so relevant to customer service, and get into the details of how these AI customer service systems will need to be built. He’ll compare the thin wrappers that are now proliferating with the deep AI-first system that Intercom is uniquely offering.

Fergal is a character, and probably smarter than the rest of us, too. He’s the guy to learn from in this space at this moment. And I hope you enjoy this one as much as I know we will.

A leap in natural language processing

Fergal Reid: As a machine learning expert, you get used to being bitten by systems again and again. I think a lot of people who are experts in machine learning have underestimated the power of large language models because they are so used to getting taken in by a system and thinking it’s amazing and then, “Oh, now I see what the trick is. This wasn’t that good at all.”

“People have been trying to build computer systems to try and understand natural language for decades”

The day ChatGPT came out, we weren’t expecting it. I started to become pretty shocked by its ability. It did an astonishingly good job at what seemed like very simple synthesis. So I tried asking it questions about a lot of different topics, trying to get it to do something that clearly wasn’t in its training data set. Really oddball queries. And it did a really good job on those. I was like, “Whoa, hang on. That’s new. This is much better than we thought these large language models were going to get this fast.” You don’t get to see the start of a technology revolution that big many times.

These breakthroughs belong to the subfield of machine learning or artificial intelligence called natural language processing. And people have been trying to build computer systems to try and understand natural language for decades. They’ve been trying to build machine translation systems that will consume text in English and translate it to French for literally decades. For a long time, people tried to handcraft rules, like writing “if” statements. But it’s really hard to write rules to translate reliably from one language to another.

Ten or 20 years ago, people started to get interested in statistical machine learning techniques to approach this translation problem instead. So, don’t handwrite all the rules – instead, get a whole lot of data, documents from the UN, something like that, long documents that have been translated into many different languages, feed them all into a machine learning system, and have it try and learn how to translate from one language to the other. There was a really impactful paper called “Attention Is All You Need,” which proposed a new way of doing this sequence-to-sequence translation.

“This whole ChatGPT revolution is really the promise of machine learning and artificial intelligence starting to come true”

The idea was that if I’ve got to translate from one language to another, I sort of need to know the words but also the context of the words. It’s kind of like, if there’s a sentence about “bank,” and another word in that sentence, “river,” well, that tells me something about the meaning of the word “bank.” To get good at this, you’ve got to pay attention to the word “river” when you’re trying to find out the meaning of the word “bank.”
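To make the “bank”/“river” intuition concrete, here is a minimal sketch of scaled dot-product attention, the operation at the core of the transformer, in plain Python. It is a toy, not how production models are built: the three-dimensional vectors below are hand-picked assumptions (loosely, dimension 0 means “water/geography,” 1 means “finance,” and 2 means “function word”), whereas a real model learns separate query, key, and value projections from data.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Scaled dot-product attention: softmax(q . k / sqrt(d)) over all keys.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    # Output is the attention-weighted average of the value vectors.
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical toy embeddings for the phrase "the river bank".
tokens = ["the", "river", "bank"]
embeddings = [
    [0.0, 0.0, 1.0],  # "the": function word
    [2.0, 0.0, 0.0],  # "river": strongly water/geography
    [1.0, 1.0, 0.0],  # "bank": ambiguous between water and finance
]
# Hypothetical query vector for "bank": it looks for water-related context.
bank_query = [2.0, 0.5, 0.0]

weights = attention_weights(bank_query, embeddings)
bank_in_context = attend(bank_query, embeddings, embeddings)

for token, w in zip(tokens, weights):
    print(f"{token:>5}: {w:.2f}")
print("contextualized 'bank':", [round(x, 2) for x in bank_in_context])
```

With these made-up numbers, “bank” puts most of its attention on “river,” so its contextualized vector leans toward the water dimension rather than the finance one.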

People came up with these attention-based algorithms, and they were all wrapped up in a machine-learning technique called a transformer. What OpenAI did was they took these transformers and really scaled them up. They trained them on more data than anyone had done before. And something really surprising happened. You take this machine learning technique designed for translation and for learning relations and attention between words, you push enough data through these transformers, you make them bigger and bigger, and they seem to start to kind of understand things. They seem to be able to do not just one task – not just translation. They actually seem to be able to do stuff like understand instructions and things like summarization, taking a long document and making it small.

That’s remarkable. That’s a breakthrough in terms of algorithms, but more so in terms of quantity, in terms of this core idea of machine learning, which is that you don’t need to tell the computer exactly what to do. You instead teach it how to learn and then put tons of training data through it, and it’ll start to do remarkable things. And there is a sense in which this whole ChatGPT revolution, or whatever we want to call it, is really the promise of machine learning and artificial intelligence starting to come true. Systems are doing truly remarkable things when we just push enough data through them.

How far can we push?

People sometimes ask me what’s possible with this new technology, and the honest answer is that we don’t know yet. We certainly know many things that are possible now that weren’t before, but it’s not easy to set limits. That’s because there’s what this technology does out of the box in ChatGPT, as OpenAI has deployed it, and then there’s what it fundamentally unlocks. The second part is hard to judge because any new piece of technology comes with an engineering challenge. There’s the new capability it gives you, but then it’s like, “Well, how far can we push that capability once we learn how to use it as a building block?”

“It was a very short hop from having a steam engine that actually works and does something useful to, ‘Your entire transport infrastructure now runs on steam’”

The history of new technology has always been like this. Take the early days of the Industrial Revolution: steam engines, right? The first application of steam engines was pumping water out of coal mines. And that’s cool. When that comes out, people are like, wow, there’s something new here. But if you run a shipping company and you’re doing barges, it doesn’t seem very applicable to you. Those steam engines are big and heavy. Someone would need to build an entirely new rail infrastructure before this would affect you, and you might think that’s going to take 100 years. Well, no. If it’s valuable enough, people will build that infrastructure.

I think we’re seeing that with AI at the moment. We’ve had this huge capability unlocked, and now we’re in the middle of an infrastructure rollout. Everyone is figuring out, “This core new thing, how do I turn that into a product for my application?” If we need to scale computation, we’re going to do that because the core value is there. Humanity has done this before. Countries put substantial portions of their GDP towards building rail once they had steam engines. It was a very short hop from having a steam engine that actually works and does something useful to, like, “Your entire transport infrastructure now runs on steam.” Things can happen fast when the value is there. I think we’re going to see that with AI.

Looking beyond limitations

There’s a very interesting thing about this technology, which is that sometimes I feel experts almost get tripped up looking at it. Sometimes, a random person off the street looks at ChatGPT and they’re like, “Wow, this is pretty good. This is able to do something that the computer wasn’t able to do before. That’s remarkable.” And sometimes you have an expert in machine learning who knows a little bit more about how this technology is supposed to work underneath the hood, and that trips them up because they’re like, “Well, it’s just doing token prediction. That’s all it’s doing.”

But you should pay attention to the fact that, yeah, okay, it’s only trained to do token prediction, to predict the next word in a sentence. But it’s learning to do remarkable things from this. Pay attention to the remarkable things it’s learning to do. Sometimes, when you understand something, you can fail to pay attention to what it’s actually doing.

“You can look at this technology and be like, ‘Well, it seems useful, but it’s got some limitation.’ But maybe you can work around that limitation”

People in this space talk about token prediction a lot. And I’d say there are two things you need to know about tokens. The first thing is that you can mostly just think of them as words. Tokens are simply a more efficient unit for training the model than either letters or whole words. A token is roughly like a syllable of a word, and it’s efficient to have these models learn things not at the level of a word or a letter, but somewhere in between. That way, the model can learn that there’s a relationship between the singular and the plural of a word without having to reconstruct everything from letters.
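The idea of splitting words into syllable-like pieces can be illustrated with a toy greedy longest-match tokenizer. This is a simplification: production tokenizers (byte-pair encoding, for example) learn their vocabularies from data and differ in detail, and the tiny vocabulary below is purely hypothetical. What it does show is why subword tokens help: “question” and “questions” share the token “question,” so what the model learns about one can carry over to the other.

```python
def tokenize(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate piece first, shrinking until one matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matches here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical vocabulary; real vocabularies hold tens of thousands of pieces.
VOCAB = {"question", "s", "refund", "ed", "un", "help", "ful"}

for w in ["question", "questions", "refunded", "unhelpful", "xyz"]:
    print(w, "->", tokenize(w, VOCAB))
```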

The second thing you need to know about tokens is that when somebody says, “Oh, these models are just doing token prediction,” don’t pay too much attention to that. That is how the models are trained, but it’s not a good guide to understanding what they can do. There are many products where you can look at this technology and be like, “Well, it seems useful, but it’s got some limitation.” But maybe you can work around that limitation.

“Fin is able to give people answers to customer support questions without making things up nearly as much as ChatGPT will”

We saw this with our bot, Fin. When we set out to build Fin, we were trying to use GPT-4, and we were really worried about hallucinations. OK, what’s a hallucination? The AI model, when asked to make a prediction, will do its best job of making a prediction. If it doesn’t know the answer or doesn’t remember the answer, it will still just do its best job. And sometimes, that best job will be wrong. We refer to that as a hallucination. But we discovered that the hallucination problem wasn’t nearly as bad as we first thought. We just had to use the model in a certain way: retrieval-augmented generation, which is the idea of not just asking the AI system what it thinks the answer to a question is. Instead, you give it a document or some context and then say, “Hey, what’s the answer to the question in this context? Don’t use anything else.” And that’s essentially RAG. You go and retrieve a bunch of context, and you use that context to augment the generation that the language model does.
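The retrieve-then-generate pattern described above can be sketched in a few lines. This is an assumption-laden illustration, not Fin’s implementation: retrieval here is simple word overlap (real systems usually rank passages by embedding similarity), the help-center articles are made up, and the final call to a language model is left out, so the sketch stops at building the grounded prompt.

```python
import re

def words(text):
    # Lowercased word set; a crude stand-in for a real text representation.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def retrieve(question, documents, k=2):
    # Rank documents by how many words they share with the question,
    # keep the top k, and drop any with no overlap at all.
    q = words(question)
    ranked = sorted(documents, key=lambda doc: len(q & words(doc)), reverse=True)
    return [doc for doc in ranked[:k] if q & words(doc)]

def build_prompt(question, context_docs):
    # Constrain the model to the retrieved context only.
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical help-center articles.
ARTICLES = [
    "Refunds are issued within 5 business days of approval.",
    "You can reset your password from the login page.",
    "Our office is closed on public holidays.",
]

question = "How long do refunds take?"
context = retrieve(question, ARTICLES)
print(build_prompt(question, context))
```

Sending the resulting prompt to a model is the “augmented generation” step; the instruction to use only the supplied context is what keeps answers tied to sources you trust.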

It took a while for us to crack that, but once we did, we were like, “Wow, this is really working.” Fin is able to give people answers to customer support questions without making things up nearly as much as ChatGPT will. And so, I think it’s very difficult to set limits on what this technology will or won’t be able to do.

Taking a bet on AI-first customer service

So, we have really taken a bet on AI at Intercom. And we’ve done this because we work in customer service. Customer service has always felt to us like the prime domain for the application of this AI technology. Why is that? Why is customer service so much in the sweet spot of this AI? Well, the first thing to realize is there’s a lot of inherent structure in customer service. Most customer service conversations kind of start the same, and maybe end the same. The middle can be complicated and different. But even in that middle, there’s a lot of repetition. There are a lot of customer support representatives answering the same questions day in and day out. And once we have machine learning systems that are good enough at understanding language and context to actually begin to address these conversations, well, that repetition just makes it a prime candidate for AI. AI is very good at learning how to do something again and again and again. And so, that’s really what the company is betting extremely big on. I think it’s the right bet. I think you can’t be in customer service and ignore AI. It’s just too big.

If you have tens or hundreds of thousands of customer support conversations happening each month, no manager can read them all. No manager can look at them all. But AI can. And that’s new; that’s transformative. You can detect trends here that no human would be able to. You can detect increases in customer satisfaction and decreases in customer satisfaction. You can detect when an answer that used to be right is no longer helping people. And we think that there’s massive potential here for next-generation tools. You can’t ignore it.

How hard is it to build an industrial-strength bot?

People often wonder how hard it is to build a bot like Fin. And the hard piece is trying to make it industrial-strength, something that you can use in a business setting. If you want to just take ChatGPT, take all the trade-offs that were made with ChatGPT, and turn that into a customer support chatbot, that’s kind of easy. You can probably do that in a matter of weeks. So, why have we spent the last year with this really big team working on Fin? Well, it’s because the trade-offs of ChatGPT have never been completely right for customer service. ChatGPT will talk to you about any topic you want. In customer service, you want to be able to constrain what your agent does and does not talk to users about. Also, you want it to use information that’s in sources that you trust. You don’t want it to use all the information on the internet.

“If you want to build a self-driving car that goes through a city environment, it’s surprisingly complicated. Building a support chatbot is a bit like this”

We often use the analogy of a self-driving car. If you want to build a self-driving car that just goes around a closed circular loop, well, it’s not easy, but it’s doable. It’s been doable for decades. If you want to build a self-driving car that goes through a city environment, it’s surprisingly complicated. Building a support chatbot is a little bit like this. If you want to build something that answers a question and it’s the exact question the bot was trained on, that’s pretty easy. But if you want to build something that will perform gracefully and robustly when people ask it weird questions, or a question that’s similar to what’s in your knowledge base but not the same, that’s suddenly a lot of work.

“When we shipped the first alpha versions of Fin over a year ago, it resolved about 28% to 30% of questions. Now, that number is closer to 45%”

A real problem with AI systems is that it’s very easy to build something that seems like it’s working when everything’s going well but breaks really badly when something unexpected happens. So, when you’re evaluating an AI system, you can’t just stay on the happy path. You’ve got to treat it the way it’ll be treated in the real world. Ask it a question that it shouldn’t answer. Ask it a question about politics. Ask it a question about your competitors. Once you treat it the way your users are going to treat it when it’s live, that’s when you’ll see if the thing is actually industrial-strength.

When we shipped the first alpha versions of Fin over a year ago, it resolved about 28% to 30% of customer support questions. Now, that number is closer to 45%. Behind each percentage point, there is typically an A/B test and a machine learning research and development process. So, overall, it takes a lot of energy. It takes a lot of time to turn a toy or a prototype into something industrial-strength that can be trusted in difficult application areas.
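To see why each percentage point takes real work to verify, consider how an A/B test on resolution rate might be read. The sketch below uses a standard two-proportion z-test; the conversation counts are hypothetical, and the episode doesn’t describe Intercom’s actual testing methodology.

```python
import math

def two_proportion_z_test(resolved_a, total_a, resolved_b, total_b):
    """Return (z, two-sided p-value) for the difference in resolution rates."""
    p_a = resolved_a / total_a
    p_b = resolved_b / total_b
    # Pooled rate under the null hypothesis that both arms resolve equally often.
    pooled = (resolved_a + resolved_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical: control resolves 44%, variant resolves 45%.
z_small, p_small = two_proportion_z_test(4400, 10000, 4500, 10000)
z_large, p_large = two_proportion_z_test(22000, 50000, 22500, 50000)
print(f"10k per arm: z={z_small:.2f}, p={p_small:.4f}")
print(f"50k per arm: z={z_large:.2f}, p={p_large:.4f}")
```

With 10,000 conversations per arm, a lift from 44% to 45% gives z of about 1.42 and p of about 0.15, which is not distinguishable from noise at the usual 0.05 threshold. With 50,000 per arm, the same one-point lift gives p of about 0.0015. Small gains need a lot of traffic to confirm.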

One thing that’s really challenging at the moment when building AI tools is that it’s very easy to have them address a very large number of jobs badly. Because it’s so easy, you can go to market saying, “Hey, it’s going to help you do x and y and z,” when really it does all of them badly. And people will look at that, and they’ll be like, “Wow, I want that. It does all these things for me.” They’ll try and use it, and your initial stats will be great, but they won’t retain. I think that to build AI products at the moment, you want to exercise discipline. You want to pick a small set of tasks that you can really over-deliver on so that people can use those as part of a workflow or productivity task. And it’s actually an anti-pattern at the moment to try and build AI products that are too broad, that do a lot of things poorly. Start out by picking a job, doing something great, and communicating to users: “This is a product that does great at this job.” And resist the urge to pull your product into territory where it cannot deliver.

Thick versus thin wrapper

Fergal Reid: In AI product development, there’s a lot of discussion around “thin wrapper.” Is your product a thin wrapper of ChatGPT? And absolutely, it’s pretty easy to build a product that’s a so-called thin wrapper that is essentially just ChatGPT but applied to a specific area. Anywhere you’ve got a text box, you can stick ChatGPT into it. But the problem with that is that you have to accept all the same trade-offs of ChatGPT. And it’s quite easy to get to market with a bad thin product in that way. And I don’t want to knock it. Sometimes, there are certain limited application areas where that’s a good thing to do. But very frequently, the exact trade-offs that ChatGPT was built with are not what you want for your product.

“What are the five or 10 things that I need AI to do?”

If you look at Fin, for example, Fin is built using the OpenAI models as components. Fin is really five or 10 different prompts under the hood, each one doing a specific task well. One task is to search a knowledge base to answer a question. Another task is to disambiguate the user’s query. What we have done is take OpenAI’s models and use them as building blocks from which to build this overall system. And I would say that’s the right way to do it. For any sort of non-trivial software development application, you want to say, “OK, from an engineering perspective, how do I need to build my product here? What are the five or 10 things that I need AI to do?” Then use good engineering practices to insulate and isolate each one of those things, and use OpenAI’s models as tools to deliver each of the building blocks you need. Once you’ve done that, each block can be tested, refined, and A/B tested independently of the others. And that’s how you can make a great product experience. And I guess you could call that a thick wrapper in contrast to these thin wrappers.
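A thick wrapper in this sense is ordinary software architecture: separate, well-defined stages composed into a pipeline. The sketch below assumes three hypothetical stages (disambiguation, knowledge-base search, and answer generation); the names and stub logic are illustrative only and are not Fin’s actual prompts. In a real system each stage would wrap its own prompt and model call, but because each is just a callable, it can be swapped, tested, or A/B tested on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class SupportPipeline:
    # Each stage is an independent, swappable component.
    disambiguate: Callable[[str], Optional[str]]
    search: Callable[[str], List[str]]
    answer: Callable[[str, List[str]], str]

    def run(self, query: str) -> str:
        clarified = self.disambiguate(query)
        if clarified is None:
            return "Could you tell me a bit more about what you need?"
        passages = self.search(clarified)
        if not passages:
            # Nothing trusted to answer from: hand off instead of guessing.
            return "I couldn't find that in our help content, so I'll get a teammate."
        return self.answer(clarified, passages)

# Hypothetical stub stages standing in for prompt + model calls.
KNOWLEDGE_BASE = {
    "refund": "Refunds are issued within 5 business days.",
    "password": "You can reset your password from the login page.",
}

def stub_disambiguate(query: str) -> Optional[str]:
    # Treat one-word queries as too ambiguous to act on.
    return None if len(query.split()) < 2 else query.strip()

def stub_search(query: str) -> List[str]:
    # Return every article whose keyword appears in the query.
    return [text for key, text in KNOWLEDGE_BASE.items() if key in query.lower()]

def stub_answer(query: str, passages: List[str]) -> str:
    # Placeholder for a grounded generation step: echo the best passage.
    return passages[0]

bot = SupportPipeline(stub_disambiguate, stub_search, stub_answer)
print(bot.run("refund?"))
print(bot.run("How long does a refund take?"))
print(bot.run("Who will win the election?"))
```

The election question also touches the off-topic case raised earlier: with no trusted passage to draw on, this pipeline declines and hands off rather than improvising.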

Moving the needle

So, what is the right structure for fast innovation in AI? I think the industry as a whole is still figuring this out. At Intercom, we have a centralized AI team whose skill set marries the product development tactics that are needed with the technical knowledge. What do we mean? What product development tactics do you need? Well, you need to be very scientific. You need to have a very quantitative mindset. You need to build prototypes fast, get them in front of your customers as soon as you possibly can, and then measure them very dispassionately and scientifically.

Why do I emphasize this measurement? Well, if you’re building a standard SaaS feature that’s extremely UI-heavy, an expert designer is going to be able to look at it and say, “Yes, I am confident that UI is going to solve the customer problem.” If you build an AI feature, it’s going to perform differently for every individual customer depending on their data. And so, it’s not enough to just eyeball it, because you might end up with a simple text box, and all the complexity is hidden in what happens behind that text box.

“You need to re-orient yourself from celebrating the shipping of new, visible UI to instead celebrating when metrics slowly climb up”

Often in AI products, value increases in a way that’s invisible from a UI perspective. An example of this is our bot, Fin. Fin’s resolution rate over the last year has steadily climbed. And the visual experience of Fin hasn’t changed that much. All that work is underneath the hood. It’s the iceberg underneath the ocean. I’m always trying to go to our high-level stakeholders and say, “Look, the UI hasn’t changed, but the product is way better.”

At Intercom, we celebrate shipping. And if you celebrate shipping, it’s very easy to celebrate shipping UI because UI is clearly new, visible stuff you’ve shipped. You need to re-orient yourself from celebrating the shipping of new, visible UI to instead celebrating when metrics slowly climb up, fighting for one percentage point of improvement after the next. Because each percentage point of improvement can mean tens or hundreds of thousands of real users who asked a question and got their answer. A beautiful UI of the bot saying, “Sorry, I can’t help you with that,” doesn’t move the needle.

AI as the ultimate support agent

One thing we think a lot about at Intercom is what the future of customer service looks like in terms of AI breakthroughs. Well, the models are getting smarter all the time. The leap from GPT-3.5 to GPT-4 was astounding in terms of the ability to understand and reason about things. We think that as these models get more and more powerful, they’ll be able to unlock the solution to not just informational queries, not just simple actions, but quite complex debugging and chains of reasoning. And we think that might come pretty soon. We’re very excited about that. We see a future world here where the best customer support agent in the world is not a human – it’s a bot. It’s a bot that has learned everything needed to be an expert at customer support.

“In a future world where we want AI systems to be able to take more and more actions on our behalf, the stakes do get higher”

And then, over time, it’s going to get increasingly agentic. It’s not just going to be sitting inert in the inbox, answering customer support questions – it’s going to be a core part of your business. Maybe it’s going to show up on Slack. When it doesn’t know the answer to something, it’s going to go and ping someone on Slack. It’s going to handle the routing to an expert human when it’s unsure or when it needs escalation. All of that stuff is buildable, probably even with the technology we have now, and certainly with the technology that’s coming soon.

In a future world where we want AI systems to be able to take more and more actions on our behalf, the stakes do get higher. Some of those actions are going to be destructive. If you issue a refund to this one customer, it’s fine, but if you issue a refund to customers who don’t ask for it, or customers who shouldn’t get it, it’s a major problem. The stakes always rise. And yeah, at the moment, we’re automating informational question answering. And I think we’re heading for a future where AI systems are going to be trusted to take those actions in a very autonomous way, but the quality is going to have to increase. You’re going to need some control. We’re spending a lot of design time trying to figure out what that looks like. How do you set a policy that helps it do the right thing, that helps it make the right decision? We think there’s a substantial design challenge there, and that’s the kind of thing we’re thinking about a lot at the moment at Intercom.
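One simple form of the control described here is a deterministic policy check that sits between the AI’s proposed action and its execution. The sketch below is hypothetical: the fields, the auto-approval limit, and the three outcomes are assumptions chosen to mirror the refund example (refunds nobody asked for, or refunds to the wrong customer, are blocked, and large ones go to a human).

```python
from dataclasses import dataclass

@dataclass
class RefundRequest:
    requester_id: str      # customer the AI is talking to
    order_owner_id: str    # customer who actually placed the order
    amount: float
    customer_asked: bool   # did the customer explicitly request a refund?

@dataclass
class RefundPolicy:
    max_auto_amount: float = 100.0  # hypothetical limit for unattended refunds

    def decide(self, req: RefundRequest) -> str:
        # Never act on refunds the customer didn't ask for.
        if not req.customer_asked:
            return "deny"
        # Never refund an order to someone who doesn't own it.
        if req.requester_id != req.order_owner_id:
            return "deny"
        # Large amounts are allowed, but only with a human in the loop.
        if req.amount > self.max_auto_amount:
            return "escalate"
        return "approve"

policy = RefundPolicy()
print(policy.decide(RefundRequest("c1", "c1", 40.0, True)))
print(policy.decide(RefundRequest("c1", "c1", 250.0, True)))
print(policy.decide(RefundRequest("c1", "c2", 40.0, True)))
```

Because this check lives outside the model, it behaves the same regardless of what the model generates, which is one way to keep destructive actions bounded as the system becomes more autonomous.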

“The advice I would give is that you can’t ignore AI”

From a unit economics point of view, AI-powered customer support is amazing. From a speed and latency of user experience point of view, it’s amazing. And the quality bar is rising all the time. And so, eventually, the end user expectation is going to change. We think there’s an amazing AI agent customer support product that can be built, something that’s going to be able to handle almost all of your informational queries regardless of what source of data you give it. It’s going to be brilliant at email. It’s going to be brilliant at messenger. And maybe, someday, it’s going to be brilliant at voice.

The advice I would give is that you can’t ignore AI. Is it ready for your specific business yet? Maybe, maybe not. But if you are a B2B SaaS app or an e-commerce platform or anything where you can tolerate very occasional errors, this thing is going to change how customer support works, and it’s past time to be at least aware of that. It’s going to change for all of us. How soon that is and the exact shape, we’ll have to see.
