Our grand experiment with GPT and generative AI

ChatGPT has taken the world by storm, and we could not be more excited. Today, we’re revealing the customer service features we have built using this revolutionary AI.

In December, our Director of Machine Learning, Fergal Reid, and I sat down to chat about the launch of ChatGPT: the good, the bad, the promise, the hype. The possibilities to automate and streamline processes for support reps seem endless, but the success of generative AI in this space will ultimately depend on its ability to deliver real value for customer service teams and customers alike. If not, well, it’s just a toy – a fun one, but a toy nonetheless.

To test this, we quickly got down to work. We sketched out a few AI-powered features we thought could be useful, went into production, and put a beta version in front of 160 customers.

In today’s episode, Fergal and I share what we’ve learned in the past few weeks, where we’re going next, and how it has changed our perception of what’s possible in this space.

Here are some of the key takeaways:

  • The ability of ChatGPT models to process natural language in multi-sentence conversations continues to improve and unlock new possibilities.
  • In product development, customers are always the ultimate arbiter – you may build amazing technology, but if it doesn’t solve a problem for them, it’s not worth it.
  • The ability of GPT-3.5 to edit and change text makes it very valuable for customer service, and it can already handle tasks such as summarizing text and adjusting tone.
  • With advances in ChatGPT, more features may be added to maximize efficiency and free frontline agents to focus on the more complex issues that drive customer satisfaction.
  • While we’re starting to investigate potentially game-changing uses such as smart replies, the model still lacks an understanding of the business context for it to work.

Make sure you don’t miss any highlights by following Inside Intercom on Apple Podcasts, Spotify, YouTube, or grabbing the RSS feed in your player of choice. What follows is a lightly edited transcript of the episode.

A breakthrough in language understanding

Des Traynor: Hello once again, Fergal. How are you?

Fergal Reid: Good, thanks, Des. Been a busy last sort of six or seven weeks here at Intercom, so I’m very excited to talk about that today.

Des: Yeah, just six or seven weeks ago we sat down to chat. You’ve had, I guess, six or seven weeks of actual engineering time building against the AI revolution that launched in late November. What have you learned? Has it changed your perception of what’s going to be possible in the world of customer service?

Fergal: Yeah, I think it has. When we talked last, we talked a lot about ChatGPT and that was maybe a week after it was launched. You can split hairs about whether the big difference here is ChatGPT or the family of models OpenAI has built – we’ve been working mostly with GPT-3.5 or with Text-Davinci-003, to be really specific.

Des: They’re names of this specific model.

Fergal: Yeah, they’re the names of this specific model. And actually, there’s a lot of confusion over these names and what the different things are. But basically, we feel that the GPT-3.5 series of models, Davinci-002, Davinci-003, this sort of thing, which have come out in the last year, and then Davinci-003, which dropped at the exact same time as ChatGPT, were breakthroughs and really have enabled us to start trying to build different, qualitatively better functionality.

“That’s a big unlock because there are so many tasks that we want to do that are best described in natural language”

Des: What is your belief about what’s possible now? Where are we headed in the world of customer service?

Fergal: I think in customer service, and even beyond, these models enable us to deal with natural language in a better way than we could before. I guess I could give a little history of natural language processing. It was simple things like regular expressions and so on for a long time. Then we had tech that got really good at looking at keywords that were in data a lot. And then maybe three, four years ago, neural networks started to get really good at understanding, “Hey, what’s the meaning of this sentence?” But now, I would say they’re starting to get really good at “Hey, what’s the meaning of that sentence in a deeper…” heading much closer towards how humans can do it, and understanding what’s going on in a conversation of multiple sentences. What’s the person talking about? Stitching together the contents of sentence one with the contents of sentence three to figure out that someone just said, “Oh, I have a question about Salesforce.” And the teammate says, “Well, what’s your question? How can I help you?” And then the user says, “Yeah, I really need help with my integration.” And the systems are getting much better at understanding that that integration is about the Salesforce integration, and have some idea of where the conversation should go next.

“Suddenly, machines are able to look at those and make much more sense of them”

Our Resolution Bot and the machine learning tech that’s already deployed are pretty good at that stuff within a given sentence. But now, the tech is getting better to the point where it’s good across multiple sentences and much better at context. As humans who communicate and want to talk to each other in natural languages, it’s just so natural to us. That’s a big unlock because there are so many tasks that we want to do that are best described in natural language. There are so many documents and instructions and articles about how to do something that we write and communicate with each other in natural language. And now, suddenly, machines are able to look at those and make much more sense of them. And each time that capability gets better, a whole lot of products unlock a whole lot of stuff that wasn’t possible before. And we really feel that what’s happened is a big thing. That’s our opinion until we build stuff and put it in front of our customers and see what our customers think of it.

Des: And so that’s what we want.

Fergal: I mean, that’s what we’ve been trying to do.

Des: What is software but codified opinions, right?

Fergal: Right.

It’s up to the customers

Des: So, what have we built? What have you been working on? Let’s talk through it.

Fergal: So, in product development, you always want to check your opinion. Your customers are always the ultimate arbiter of whether something is good or not. You might think you’ve got the most amazing technology and the most amazing product experience, but if it doesn’t solve a problem and if it isn’t used, you’re wrong. And so, we really wanted to cut through the hype here and convince ourselves, “Okay, what can we build for customers quickly, what can we get in front of them, to work with them to see whatever the value is?” And so, we went and sketched out features that we could build and get into production quickly that would use some of this new tech and help us to figure out if it was valuable or if it was a toy.

“You could just press a button or use a keyboard shortcut to basically say, ‘Hey, I want a summary of this feature, put it on my composer so I can lightly add to it’”

The first thing we decided to do was build a feature that essentially did summarization. And there’s a reason why we decided to do this. My team, the machine learning team here at Intercom, the Inbox team, discovered that there were two common customer jobs that we’re just seeing a lot. In a lot of businesses, before a support rep hands the conversation over, they have to write a summary of that conversation. If they don’t do that, the end-user has to repeat themselves or the receiving rep has to go and scroll up and read a lot of stuff. And so, the support rep handing over has to write a summary and that’s a real job.

About a year and a half or two years ago, my team tried to look at the best neural networks at that time, T5 and all these big networks, and figure out if we could use them to build an adequate summarization feature. And unfortunately, we concluded there was just no way. Conversations are just too gnarly. The flow of a conversation treads around between these different parts in a way that was really good for humans – humans can easily look it up and it’s fast and they can scan it – but even the pretty big neural networks we have in the Resolution Bot struggled on that sort of task. And one of the first things we saw when we were playing with the recent Davinci-003 model, GPT-3.5, was that suddenly, it seems to be great at summarization. And we’re like, “Wow, that looks amazing.”

“We’re going to try and be really real with people. We’re going to help our customers figure out which bits are toys”

And so, we built a feature and did a couple of rounds of iteration with a summarization feature in the inbox. You could just press a button or use a keyboard shortcut to basically say, “Hey, I want a summary of this feature, put it on my composer so I can lightly add to it.” It’s not perfect. You might need to add a little bit to it, but it’s a huge time saver. And we’ve had over 160 customers in our beta using these features and they’ve hailed summarization as a real winner. It doesn’t completely change the game for a support rep yet; it picks off one core job, but it delivers on that one core job.

Des: And reduces it. What would you say the reduction is? If it would normally take three minutes to write – was it down to 10 seconds to add the summary or something?

Fergal: Yeah.

Des: It’s like 90% of the work removed.

Fergal: Exactly. And we’ve had some customers be extremely excited about it because they may have a really long email thread or a really, really long conversation history, and it just saves a bunch of time. It’s a bit like if you’re reading an academic paper or something. Sometimes just getting a gist helps you find the exact details you want. I do think we’ve hit something really good there, and that’s one of the features that we’ve worked on.

“It’s easy to come out with the hype machine; it’s easy to come out with press releases: ‘We’ve changed the world.’ And in practice, the person who gets to decide that are our customers”

We’re going to try and be really real with people. We’re going to help our customers figure out which bits are toys. Not everything we’ve built and we’ve put in beta is game-changing, but summarization is one of the ones that we feel strongest about. It’s really ready. This tech does something transformative – it’s new, it’s exciting, and it delivers real customer value.

Des: One thing we’ve always tried to be, as it relates to AI, is sober because we’re trying to do our customers a favor. It’s easy to come out with the hype machine; it’s easy to come out with press releases: “We’ve changed the world.” And in practice, the person who gets to decide that are our customers. So, when we release summarization, we’re taking their word for it that it’s really valuable. That’s the thing that matters, right?

Fergal: Exactly. And look, that’s something we agonize over. Sometimes you lose out to folks who are willing to just hype up. We were trying very hard not to do that because once you start doing that, you end up believing your own hype.

Des: The narrative gets ahead of the software. That’s a real risk.

Fergal: And you try to avoid that. We’ve been really conscious about that with this type of tech, which is that it almost works a lot, and it comes really close to doing something magical and transformative, but it sometimes fails. And so, we’re trying to keep ourselves honest here about, “Okay, is this really good enough yet?” We know it’s not perfect, but is it good enough yet? And what’s it good enough for? And summarization is something we feel good about. That’s a feature we feel delivers real value.

You can lose by pitching something that looks good but doesn’t actually work in production, and you can also lose by being too conservative. And in the past, with Resolution Bot, we’ve had times when we were too conservative. We were like, “oh, we really don’t want this to backfire unless we’re pretty sure it’s got the answer.” And then some customers have come to us and told us, “Oh, the user’s not getting any help for a while, give them something even if you’re wrong.” And we A/B test and tune those thresholds and so on. There have been times of being too conservative. And so, we’re taking an approach here of rapidly getting new beta features out to our customers. Our customers are extremely excited about this tech.

Effortless text editing

Des: How many features are in the beta? Five, six?

Fergal: So, the first thing we did was summarization. We did that because it was just a straightforward, easy-to-integrate, well-understood job. After that, we went to look at the composer. Because we’ve got telemetry and metrics, we know that about half of the time that an agent spends in Intercom, they spend in the composer writing text or editing it. They’re organizing their thoughts, too, but they’re spending a lot of time writing and reshaping text. And when we looked at that, we were like, “Okay, this is very good at editing and changing text.” We started off with some small features there, some like MVP features to get them live and see how that goes. And so, we started with text editing and text reshaping features. Maybe the simplest one to explain is simple editing. Saying, “Hey, make this text that I’ve just written more friendly,” or “more formal,” because this tech is now good at adjusting tone. Previously, there wasn’t really anything that you could use to reliably adjust the tone. We did a lot of iteration on the UX and we’ve come up with UX where there’s a toolbar and you can just select text. In our first version, you couldn’t select the text and we iterated it. The customer told us it was useless – they didn’t want to change the tone of everything in the composer. Now, you can select a little bit.

“It strikes me to know we might be looking at a world where the new context menu is like ‘expand upon,’ ‘summarize,’ ‘make it happier,’ and ‘make it more formal’”

It’s almost like editing an image. And we started to think there’s an emerging paradigm here. I remember that once upon a time, a word processor where you could bold and italicize text was like “wow”. And we wonder if, in the future, people are going to think about that in terms of tone. It’s like, “Oh, of course, I want to go and quickly edit the tone.” If you’re tired at the end of the day, you’ve written a draft and you’re going, “I wasn’t friendly enough, it’s going to affect my CSAT,” you just go and click a button and edit the tone and it gets more friendly. And it’s easier to press that button once or twice than it is to go and-

Des: Go and rewrite it.

Fergal: Rewriting text is work.

Des: It strikes me to know we might be looking at a world where the new context menu is like “expand upon,” “summarize,” “make it happier,” and “make it more formal.” Those will be the transformations you’re trying to do. It’s not so much that you’re focused on the optics of the text as much as you are on the tone.

Fergal: Totally. Look, when we go back and forward in this, we’re like, “is this a toy? Have we built a cool toy or is it something amazing?” And I think it’s going to vary by customer, but the bold case for that particular feature is, “Hey, I’m tired at the end of the day and I care a lot about tone because my CSAT is a big metric for me and this is a way of doing it.” It’s a way of giving a more delightful customer experience.

Des: Take “sorry, here’s your refund.” You’d say, “please make that sound more empathetic,” or whatever.

Fergal: We’ve experimented around empathy. What we’ve actually shifted to is “make things more formal, make things more friendly.” That’s the kind of spectrum that seems to work really well, so we’ve gone with that. And I guess it suits Intercom. A lot of people are trying to give a very personal, very friendly support experience.

“A lot of the time, when you write something down, it comes out wrong. So you can just say, ‘Hey, rephrase that’”

To be completely transparent, we’re still not sure exactly where it sits in the spectrum of toy to valuable. Some customers say it’s very valuable and so we’re continuing to evaluate that. But we have that in beta. We want to tell our customers that these are the sort of things we’re building and investigating.

That’s one feature. The next thing we started looking at is a rephrasing feature. And again, these language models are very good at taking a constrained piece of text and editing or changing it. You start to see that with summarization. A lot of the time, when you write something down, it comes out wrong. So you can just say, “Hey, rephrase that.” And again, it’s that sort of fast UX where you just highlight it and click it. It’s just a little bit easier than rewriting it yourself. There’s a little bit of latency when you do that. So, we’re still evaluating. But some customers, again, really like it, it really works for them in their business and we expect that latency will go down over time as these models get better and better. So that’s text rephrasing. Those are sort of the first features that we’ve gone after in the composer.

Now, the bigger ticket stuff comes next, and we’re starting to investigate the things that are potentially more game-changing. One thing we’re trying to do with that is what we call the expand feature. We were inspired by things like Copilot for programmers. In Copilot, you can write a comment and it fills out the full function and just saves you a bunch of time. And we were like, “Oh, can we build something that’s a little bit like that for customer support?” And the idea is that maybe you write a short summary of what you want and then highlight that, say expand, and your composer fills that. We’ve done this, we’ve shipped it, and customers clearly see that this is valuable and not a toy – if it works. But it works much better in some domains than others. If you’re answering questions that generic information from the internet would do a good job of-

Des: Like if you had to reset your phone or whatever.

Fergal: Yeah, exactly. It works very well for that. However, if you’re trying to do something where you’re writing a shorthand and actually there’s a lot of context specific to your business about how you answer this type of question, then it can hallucinate and it will say something you need to edit out. Still, some customers really like it and it works really well for them. But we really think that that’s sort of a version one. If you’re using this, you need to check it and see how well it works for you and your business. However, we have a project that’s constantly evaluating new things for that, where it’s like, “Hey, can we take the previous replies that you’ve given on the same topics?” So you give us three-word summaries of what you want to do, like “refund customer thanks,” and we’ll go and find the last five things you’ve said about refunds. We’ll also go and see if maybe you’ve got a macro about refunds. We’ll also look at the context of the conversation beforehand.

“What we’re experimenting with is: can we get over the hump? Can we start to make something that’s really transformative by getting that context in?”

Des: If there’s anything in the help center, all that sort of stuff.

Fergal: We haven’t gone quite as far as pulling in articles and stuff from the help center. We just looked at what you and the user said two turns ago, put them all into a prompt that will then go and say, “Okay, with all of this information, please go and take that three-word shorthand, and anti-summarize it, turn it into a big thing.”

Des: Yeah, totally. So it’ll take, “Sorry, here’s a refund, thanks,” into, “We really apologize for the inconvenience. We’ve issued a refund and you should see it in three to four days. And we regret…”

Fergal: In the style that you usually use – you personally, the individual agent – and taking in any relevant macros that you have as well. That’s really where we’re at. And that last piece is not in production yet. The V1 is in production. The V1 has been used by hundreds of beta customers. But what we’re experimenting with is: can we get over the hump? Can we start to make something that’s really transformative by getting that context in? And that’s still ongoing. I would say we’re optimistic but not certain yet. This is changing week by week for us, so we’re very excited about that. And that’s the expand feature version one at the moment. But we can see version two and version three coming down the line.
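To make the context-gathering idea above a little more concrete, here is a minimal sketch of how a prompt for an expand-style feature might be assembled. It is an illustration only, not Intercom's implementation: the names `Turn`, `pick_past_replies`, and `build_expand_prompt` are hypothetical, the keyword-overlap ranking is a stand-in for whatever retrieval is really used, and taking the last two turns is an assumption based on the description of looking at what "you and the user said two turns ago." The sketch stops at building the prompt string and makes no model call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    speaker: str  # hypothetical labels: "user" or "agent"
    text: str


def pick_past_replies(shorthand: str, past_replies: List[str], limit: int = 5) -> List[str]:
    """Rank earlier replies by how many shorthand words they share (assumed heuristic)."""
    wanted = {word.lower() for word in shorthand.split()}
    scored = []
    for reply in past_replies:
        reply_words = {word.strip(".,!?").lower() for word in reply.split()}
        overlap = len(wanted & reply_words)
        if overlap > 0:
            scored.append((overlap, reply))
    # Highest overlap first; the stable sort keeps the original order for ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [reply for _, reply in scored[:limit]]


def build_expand_prompt(
    shorthand: str,
    recent_turns: List[Turn],
    past_replies: List[str],
    macro: Optional[str] = None,
) -> str:
    """Combine conversation context, past replies, and an optional macro into one prompt."""
    sections = []
    if recent_turns:
        # Assumption: only the two most recent turns are included.
        convo = "\n".join(f"{t.speaker}: {t.text}" for t in recent_turns[-2:])
        sections.append("Recent conversation:\n" + convo)
    examples = pick_past_replies(shorthand, past_replies)
    if examples:
        listed = "\n".join(f"- {reply}" for reply in examples)
        sections.append("Earlier replies by this agent on similar topics:\n" + listed)
    if macro:
        sections.append("Relevant macro:\n" + macro)
    sections.append(
        "Expand this shorthand into a full reply in the agent's usual style:\n" + shorthand
    )
    return "\n\n".join(sections)
```

Calling `build_expand_prompt("refund customer thanks", turns, past_replies, macro=...)` returns a single string that could then be sent to a language model. The output would still need checking by the agent, as Fergal notes for the version in production.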

Driving support efficiency

Fergal: The last feature that we experimented with in our beta was giving our customers direct access to GPT. So, no prompt, not telling the model anything, just saying, “Hey, put whatever you want in there.” And we really did that as a fast-moving beta experiment. We didn’t give our customers in beta much guidance about how to use that. We confused some of them and it didn’t go so well, but some customers found novel use cases, including translation, where it was delivering real value to them. Now, these models are not the best at translation, but maybe that’s an interesting AI product development tactic there, which is like, “Hey, if you’ve got beta customers, maybe give them a little bit more power than you might expect and they’ll tell you what they need.”

Des: See what’s emerging. See what’s expected even.

Fergal: Exactly. And expectations, I think, are going to change fast in this. Maybe that tells us we need translation because there are very well-understood translation models out there.

“Maybe it’s got a source beside it, and suddenly, that five minutes of hunting for the answer turns instantaneous. And that’s where it starts to get really game-changing”

Des: So it seems like all these features are efficiency maximizers for support teams. They reduce a lot of the undifferentiated work, whether it’s the intros and outros or whether it’s just rewriting something they might not have the energy to do to make it happier or more formal. They’re all different ways to save frontline support agents a lot of time, ultimately giving them more time to focus on the harder bits of the conversation, which are the technical lookups or the deep dives. Is that where this will be best deployed? Is that our best thinking so far? When you think about where else we can roll this GPT-style technology across the support experience, what else are you thinking about?

Fergal: Our larger customers have a lot of support reps who spend day in, day out in the composer. And so, if we can make them faster and more efficient – a 10% or 20% efficiency gain is absolutely huge.

Des: Of course, yes. We have customers with thousands of seats, so it’s genuinely transformational.

Fergal: Exactly. Game-changing. And that’s an area that we’ve been very attracted to. And this tech is just getting better and better. It’s not the only place, but we’re really bullish about that. Some of our customers will very nicely share videos with us of their actual day-to-day. And you see this workflow where it’s like, “Hey, I’m trying to answer a question, and I don’t know the answer. I need to go and look up an internal help desk article or find a similar conversation, and I’m navigating around.” If we can short-circuit that to the point where it’s like, “Hey, here’s the AI. Maybe you give it a few words…” Or maybe we get beyond that. We have other prototypes I’ll talk about in a few minutes where maybe the answer is just there waiting for you. Maybe it’s got a source beside it, and suddenly, that five minutes of hunting for the answer turns instantaneous. And that’s where it starts to get really game-changing. I think that’s where we’re going pretty soon.

Des: Yeah, that makes sense. Small gains on large teams are still massive, and then obviously, large gains in any particular workflow, the summarization thing, are also massive. I think some people have this weird binary world where until we’ve automated all support, we haven’t done anything. My personal take is I don’t think we’ll ever actually automate all support. What I think we will do is literally gut the undifferentiated part of support, the “pointy clicky,” “intro-y outro-y” stuff where you’re doing the same thing every day.

Fergal: And hopefully, you’ll get rid of the frustrating parts. You’re navigating around, you’re trying to search, and you know the answer’s in here somewhere. You know you’ve answered this question five times last month, but you can’t find it.

“Honestly, these features are crossing the threshold of usefulness much faster than I would’ve anticipated”

The last feature that was live in beta is an articles-based expander. This is something we’re seeing almost becoming a standard feature very rapidly. Anywhere you’re writing a text article, it’s going to become standard that you want the ability to call out to a large language model and say, “Hey, help me complete this, expand this. Here are my bullet points.” And so, we shipped that in our beta for the Intercom articles product. Again, it’s still early. All this stuff is early – it’s been six weeks to eight weeks, but sometimes it’s magical. Sometimes you can go and write four or five bullet points to describe the content of an article, and then, in the prompt, we give it the standard format of an Intercom article, so it knows how to go and put those in the headings and so on. It’s magical when it works, and it’s striking how often and how well it works for people. You still need to check the content. It can put stuff in there, but we think there are ways to get that down. And honestly, these features are crossing the threshold of usefulness much faster than I would’ve anticipated. So yeah, we’re experimenting with that.
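As a rough illustration of the bullet-points-plus-format approach described above, the sketch below builds a prompt from a handful of bullet points and a fixed article layout. Everything named here is hypothetical: `ARTICLE_FORMAT` is an invented placeholder layout, not the real Intercom article format, and `build_article_prompt` is not an Intercom function. It only constructs the prompt text; no model is called.

```python
from typing import List

# Hypothetical placeholder layout; the actual article format is not given in the episode.
ARTICLE_FORMAT = (
    "Title: <article title>\n"
    "Summary: <one-sentence overview>\n"
    "Heading: <section heading>\n"
    "Body: <section text>"
)


def build_article_prompt(title: str, bullet_points: List[str]) -> str:
    """Turn a title and a few bullet points into a prompt that asks for a full article."""
    cleaned = [point.strip() for point in bullet_points if point.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty bullet point is required")
    bullets = "\n".join(f"- {point}" for point in cleaned)
    return (
        "Write a help center article using exactly this format:\n"
        f"{ARTICLE_FORMAT}\n\n"
        f"Article title: {title}\n"
        "Cover these points, one section per point where it makes sense:\n"
        f"{bullets}"
    )
```

Putting the layout in the prompt is what lets the model place each point under a heading. The generated draft would still need a human check before publishing.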

The last frontier

Des: So then, further afield, what’s your take on the trajectory of all of this? Where to from here?

Fergal: Those are the things that we’ve had in beta. We’ve had hundreds of customers using them, and we’ve got a real signal on customer value. I’ll tell you exactly where we are now in production with machine learning. In the last day or two, we have had a feature that our own internal CS team is using: in the past, we’ve had a smart replies feature where it will mine your common greetings. These are the things that don’t have information, that are not answering the user’s question – they’re just oiling the wheels, making it fast and snappy and easy to say, “Oh, thanks. You’re welcome. Is there anything else I can do?” And this tech is wonderful for that sort of thing. Linguists call them phatic expressions.

In the last few days, we’ve shipped a version of that to our Intercom CS team where they see this grayed-out text prefilled in the composer, but it’s relevant to the specific conversation. So, if they previously said, “Hi, can I help you,” and the user said, “Oh yes, I want some help with the articles product,” it would then suggest, “Oh yes, let me look up the articles product for you.” It won’t look it up for you yet, but we’ll do that. Three or four days ago, we were like, “Okay, I’m going to ship this internally. We’re not sure if it’s going to get annoying and if people will get blind to it because they see it often and it’s only helping for a subset,” and we’re always really cautious about that. But so far, the internal response from our CS team has been great. And so, we intend to keep working on that. Maybe we need to put another system that limits how often it shows. That’s one thing we’re working on.

I mentioned the expand piece earlier, and now we’re working on, “Hey, can we do that even without the shorthand?” Can we figure out what you’re about to type next based on what the user just said? And we’ll go and look in your knowledge base, try to find a relevant context, and give that to the model. The model itself isn’t good enough to do this. It doesn’t know your business, but maybe we can augment it. Maybe we can use a combination of more traditional machine learning tech with the model and get something good. We have prototypes and we’re working on this, but we haven’t shipped them to our customers yet, even in any beta form, because we’re still evaluating whether that’s good enough to be transformative or whether it gets boring and annoying. Where that threshold is is not clear. We’re a little bit more bullish about the expand-style thing where you have to prompt it because the user can learn when to do it. They can learn how to query it. We all had to learn how to use Google, and we expect users will get much better at dealing with these systems too.

That’s roughly where we are. We’re moving fast and we’re shipping things quickly to customers to really check and get real value here. We’re trying to be careful to avoid falling into the hype trap. We believe there’s huge potential here, but it’s too easy to stick a landing page up and say, “Get it here. It’ll answer everything.” And that’s not good. People will just go blind and turn off.

“Everyone has seen this and is like, ‘ChatGPT is really good. If I could get tech like that to help with my customer support, that’s huge.’ But it isn’t going to do it off the shelf. It doesn’t know your business”

Des: I think you damage your reputation if you say, “this thing does something,” and it clearly doesn’t, but you did it for clicks or whatever. It feels like the real product everyone’s waiting for in this new space is the end user-facing bot that answers most questions correctly all the time. Thoughts on that? Weeks, months, days?

Fergal: Obviously, that’s a huge area for everyone. I wouldn’t underestimate the composer too – a portion of questions will always flow over to the composer. And if we can reduce that time for those, that’s huge. But absolutely, one of the huge prizes in this area is whether we can take the conversational understanding experience that we have seen with ChatGPT and make it work for your individual business while avoiding hallucinations. There are a lot of people investigating that. We’re investigating that too. We have prototypes that are interesting and promising, but we don’t yet know for certain whether we’ve crossed that threshold where hallucinations are rare enough that this is worth doing and it’s valuable. We are starting to see some opinions crystallize on that internally, but we’re not ready to share where we’re at with that quite yet.

Des: Totally fair. Well, I guess we’ll check in again in six more weeks or so.

Fergal: It’s been a very fast-moving time. Look, this is a very exciting field to work in. Customer expectations are very high. Everyone has seen this and is like, “ChatGPT is really good. If I could get tech like that to help with my customer support, that’s huge.” But it isn’t going to do it off the shelf. It doesn’t know your business. You can’t really fine-tune it today. Even if you could fine-tune it on your specific business, it probably wouldn’t do it. We need to find clever techniques, and I think companies like Intercom are well-positioned to try and do that. And yeah, there are a lot of interesting tech and language models out there. I’m really excited to see all the innovation in this space.

Des: Cool. Thanks very much.

Fergal: Thank you. Thank you.