Intercom on Product: How ChatGPT changed everything

In a recent episode, our Director of Machine Learning, Fergal Reid, shed some light on the latest breakthroughs in neural network technology. We chatted about DALL-E, GPT-3, and whether the hype surrounding AI is just that or whether there is something to it. He told us things were starting to scale. And just like that, we’re at it again.

ChatGPT, OpenAI’s prototype artificial intelligence chatbot, launched last week and it has been making the rounds in the halls of the internet, inspiring amazed reactions from diehard techno-positivists to perpetual tech-skeptics. The bot is powered by GPT-3.5, a text-generating AI, and according to OpenAI, it can generate text in a dialog format, which “makes it possible to answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.”

While it’s still early to see it applied for real-world uses, it’s undoubtedly very promising. In fact, to Fergal Reid, the change in capability that we’ve seen over the last year suggests this space could be “as big as the internet.” And this is why we decided to bring you a special episode about these latest developments in the world of AI, what they mean, and whether it’s time to apply it in real-life scenarios such as customer support.

Here are some of our favorite takeaways from the conversation:

By pushing the scale and training these models with more and more data, these bots started exhibiting qualitative changes like learning abstract concepts without supervised learning.
Right now, ChatGPT performs best on problems where it’s easy to validate the answer or creative contexts where there’s no such thing as a correct answer.
While we’re seeing dramatically better reasoning capabilities from these models, they still have issues with hallucinations – if they don’t know something, they make it up.
If you prompt these models with the phrase “let’s think step by step,” accuracy rates go up and you get better outputs than just having it instantly give the answer.
Our technology interfaces are gradually becoming more conversational, and we’re just starting to see the quality of natural language understanding get good enough to unlock them.
There are many exciting applications of this tech in support such as agent augmentation, but there’s work to be done before it can be deployed.

If you enjoy our discussion, check out more episodes of our podcast. You can follow on iTunes, Spotify, YouTube or grab the RSS feed in your player of choice. What follows is a lightly edited transcript of the episode.

ChatGPT’s big debut

Des Traynor: Hey, Fergal.

Fergal Reid: Hi, guys. How’s it going? Thanks for having me back.

Des Traynor: Good. It’s good to have you back. We had you only like five weeks ago on the podcast to talk about stuff that was happening with AI. And you’re back again because more stuff happened.

Fergal Reid: It’s been a busy five weeks.

Des Traynor: It’s been a busy five weeks and a busy seven days. Seven days ago was Wednesday, the 30th of November, and I got an email with an invite to an open beta for a thing called ChatGPT. What happened?

“It went viral, it went wild, and everyone got really excited”

Fergal Reid: What happened? So, it’s an interesting question. OpenAI released their most recent machine learning system, AI system, and they released it very publicly, and it was ChatGPT. And it’s pretty similar to their current offering, GPT-3, GPT-3.5, but it was packaged differently, you didn’t need to put a credit card into it, and I think everyone just saw that “Wow, there’s been a huge change in capability here recently.” And it went viral, it went wild, and everyone got really excited. And around the same time, they released their most recent GPT-3.5 model, like davinci-003, which does a lot of the same things, and it’s maybe slightly less good at saying, “Hey, I’m a large language model and can’t do that for you.” But it’s similar in terms of capability.

Des Traynor: Let’s do some quick definitions to ground everyone. OpenAI is obviously the institution doing a lot of work on AI and ML. You said GPT: what’s that stand for?

Fergal Reid: I actually don’t remember. General purpose transformer or something like that [Generative Pre-Trained Transformer].

Des Traynor: But does that name mean anything?

Fergal Reid: Yeah, I think the key piece is the transformer. For a long time, people were trying to figure out, “Hey, what’s the best way to train neural networks that deal with text and natural language processing tasks?” And for a long time, there were these LSTMs [long short-term memory] that kind of combined the short-term structure of your text with the long-term structure of your sentence, and sequence models, and everyone was working on those.

“As you push more and more training data, they seem to exhibit qualitative changes in terms of what they can do. So, it’s like, ‘Hey, this seems to kind of understand it'”

And then, Google published a pretty revolutionary paper, “Attention Is All You Need”, with a pretty big thesis: “Hey, instead of these traditional sequence models, here’s a new way of doing it, a new model,” which they call the transformer model or the transformer architecture. When you’re looking at a specific word, the model will learn other parts of the sentence that you should also look at in conjunction with that word. You can learn things a little bit more efficiently than with sequence models, and you can train it faster, more efficiently, and scale it further.
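As a rough illustration of the attention idea Fergal describes here, the following is a minimal sketch of scaled dot-product attention in plain Python. The two-dimensional word vectors and the function names are invented for the example. Real transformers learn separate query, key, and value projections and run many attention heads in parallel, none of which is shown.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical 2-d vectors for three words; the numbers are made up.
words = ["tractor", "road", "field"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Which words get the most weight when the model is looking at "tractor"?
weights, mixed = attend(vectors[0], vectors, vectors)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

Words whose vectors point in a similar direction to the query receive more weight, which is the sense in which the model "looks at" other parts of the sentence.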

So, everyone started using transformers for all sorts of sequence data. And then, one thing OpenAI really contributed to was this idea that you can take these transformer architectures and really push up the scale. You can add way more training data, and way more compute to them. And perhaps very surprisingly, and I really think this is the key thing, as you push more and more training data, they seem to exhibit qualitative changes in terms of what they can do. So, it’s like, “Hey, this seems to kind of understand it.” Or I can say “make this happier,” or “make this sadder,” which is a very abstract concept. Where did it learn that? We didn’t give it this supervised learning where you code in a definition of sadness or happiness. It just started to learn these abstract concepts and these abstractions from masses of training data.

Basically, OpenAI and some others have just been pushing that scaling piece more and more and more. There are other things as well. With GPT-3.5, they train it a little bit differently to try and align it more. But basically, the big thing here is lots of scale, lots of training data, and actually, kind of simple models. You can do remarkable things that 20 years ago, people would’ve said, “Well, a computer will never do this; it’ll never be able to write me a song,” and now it’s like, “What sort of song would you like?” “Make it sound happier.” So, yeah, it’s a remarkable time because a lot of things we thought were the domain only of human intelligence just need tons of training data and a big model.

Can ChatGPT be creative?

Des: And then, what happened since last Wednesday was that Twitter – and then seven days later, the general internet or the media – caught onto this. I’ve seen all sorts of frankly outstanding uses, things I just could not imagine were possible. I saw “write me instructions for copying a DVD in the style of a Taylor Swift song where she’s angry because she broke up with her boyfriend” or something like that. But it actually has a go at it. And then, I’ve seen others like, “how do you install Intercom on iOS” and it gets that relatively correct, too. And everything in between. And the crazy thing I’ve seen is, for any of these things, you can double back and say, “Now, give me that in the style of a 1940s gangster and say it in German,” and “Now translate German into Spanish, but also add in more anger,” or whatever. And it does all these things immediately, with pretty much zero-second delay, and in all cases, you can see what it’s going for.

One personal example I use is when you’re trying to tell your child a story before bedtime, you can run out of angles. There are only so many different ways that, for example, three dragons could go into a forest and get lost. However, GPT-3 is actually great for giving me 10 more stories. What I’ve noticed is, for the longest time, with the story of AI, even as recently as years ago, people would say, “It’s great for specific stuff, but there’s no way it can tackle creativity.” Is it fair to say it feels like we’re actually in the inverse world here?

Fergal: Yeah. I mean, when people are talking about AI, it’s always, “well, the first things that it’s going to do is those rote, manual tasks.” And then humans are going to have all this time to go and do these highly creative things-

Des: Go into a forest and-

Fergal: Make art all the time, beautiful poetry. And then, it’s like, “Oh, wow. Those manual tasks require really hard vision and processing things to solve. But creativity, where there’s no wrong answer, and there’s no penalty for getting it wrong… Yeah, the poem isn’t quite perfect, but it’s okay, and the rendered DALL·E 2 image might not be exactly what you had in mind, but it’s still a beautiful image and you can choose 1 from 10, that stuff works.”

“This thing seems like it’s very good at that sort of intuitive piece, and it’s very good at fooling our intuitive piece. So when you look at it at a glance, it looks correct”

Des: And you can see what it’s going for as well. I think one thing people don’t realize is it’s giving you back what was probably in your head because you’re going to see it anyway. When I say, “Give me instructions to open a bank account in the style of a Rage Against the Machine Song,” I see, “Yeah, we’re going to fight to open the account, and we’re going to rage all night.” And I can see what it’s doing. I’m not even applying an accuracy scale there, I’m just like, “Ah, you had a go,” and you’re giving it credit for that.

Fergal: Yeah, I think that’s probably true. To what extent are we good at judging near misses in terms of nonfactual information? Maybe we’re just not that good at it. Maybe we don’t care deeply about it. And I mean, we’re going to have to get into this issue of factualness, but even when you ask it a factual question… Let’s say you ask it a customer support question. I asked one recently about two-factor authentication, “How do you reset your Intercom two-factor authentication?” And the answer I got was like, “Wow, that’s a great answer.” And I look at it and “hang on, that’s not how you reset your 2FA.” And it’s a beautiful URL, it’s got the reference to our help center article, and that’s been made up too.

“I think that most people, ourselves included, who are having their minds blown, are having them blown by the idea of plausible at first glance”

People talk about humans and human brains, and we have this intuitive part that’s really good at recognizing patterns, and then we have the logical, analytical, reasoning part that’s slower and more precise. This thing seems like it’s very good at that sort of intuitive piece, and it’s very good at fooling our intuitive piece. So when you look at it at a glance, it looks correct, and until you really apply your slower systemic reasoning, it can be hard to see that. And I think that intuitive piece, that speculating, is probably what we rely on more to judge creative endeavors, art, pictures, and sonnets. At least initially. And so, it’s very good at generating things that are plausible at first glance, but then maybe, when you actually take time to think about it, you-

Des: See the problems. And being plausible at first glance is really important because I think that most people, ourselves included, who are having their minds blown, are having them blown by the idea of plausible at first glance. You’re giving it a lot of credit for that despite the fact that it might not have a lot of real-world applicability. You’re never going to hang that painting in a museum, and you’re never going to actually read that whatever sonnet, and you’re never going to win an award for that novel.

I see a lot of folks like content marketers saying things like, “This is going to change my job forever.” And I’m like, “Yes, but maybe not in the way that you think. If you think your job is going to be simply typing in prompts and hitting tab, it’s possible that your job might not exist.” Similarly, I see managers on Twitter saying, “Oh, that’ll make performance review season so much easier.” In all these cases, I’m like-

Fergal: There’s something wrong with that.

“It is possible that the really big contribution this tech makes to humanity is an honest conversation about the amount of work work we can eliminate”

Des: Exactly. You’re all saying the quiet bit out loud here, if your job actually involves you writing spurious BS that could be-

Fergal: Why are you doing it in the first place?

Des: What are you doing? Exactly. I get that in the case of say, content marketing, there might be reasons why you just need to rank for certain words, but don’t mistake that for the craft of actually writing.

Fergal: I mean, it’s possible this is a good thing. It’s possible that with bullshit jobs, things the person feels have no value, like these performance reviews, you can just hand them off to GPT. And then, after a while, everyone kind of realizes that’s what’s happening, and the person on the other side goes, “Well, I’m going to hand it off to the GPT to analyze it.” And maybe then we can have an honest conversation about what’s the kernel that’s actually really valuable and how to eliminate the work work.

Des: Why are we doing all this performative bullshit?

Fergal: Yeah, it is possible that the really big contribution this tech makes to humanity is an honest conversation about the amount of work work we can eliminate. And that could be great. That could be massively transforming.

The problem with chatbot hallucinations

Des: Talking about actual applications, something that’s on my mind, at least my experience of it directly, and even what you said about the 2FA use case, is you can’t deploy it directly today in a lot of areas where there’s a definitive right answer, especially if the risk of giving the wrong answer is pretty high. So you don’t want this thing consuming medical records and spitting out diagnoses because I can guarantee you the diagnosis will be really well written, really believable-sounding to a layperson, and would possibly have a low probability of accuracy. We don’t know the probability of accuracy, but it’ll vary based on the inputs.

Fergal: It would certainly scare me a lot if someone came to me and said, “Hey, Fergal, we want your team to start using this for medical diagnosis. It would be great.” That would be extremely scary.

“One thing is that this tech absolutely has problems with what a lot of folks call hallucinations, where if it doesn’t know something, it just makes it up”

Des: But there are other maybe less grave, but equally inaccurate use cases, where you could use it to diagnose a conclusion in a legal case. Again, I’m sure it would sound good, and it would wrap it in all the right boilerplate language, but it would still ultimately not really know what it’s saying. I’ve asked it to give me ideas on how to build a modern email client to compete and win in the productivity space. And it reads really fine, but it’s only when you scratch it that you realize there’s actually nothing there. It’s just nice-sounding word after nice-sounding word without particularly sharp opinions. That, to me, makes me wonder about the ways we could make this more applicable.

Fergal: Before we get into that, there are two things that I think are helpful to tease out here. One thing is that this tech absolutely has problems with what a lot of folks call hallucinations, where if it doesn’t know something, it just makes it up. That’s pernicious, and there are a lot of domains where a 1% probability of hallucination is a deal-breaker. And we would all love if that probability was zero. But at the same time, the accuracy has gone up versus where state-of-the-art was a year ago, versus where it was three years ago. It’s absolutely better at giving you the right answer a lot of the time, too. It’s dramatically better at “understanding.” I struggle to say, “Oh, it’s just doing pattern recognition, it doesn’t understand anything,” or at least, I struggle to say that without, “What do you mean by understanding?”

We’re definitely on a trajectory where, while it will still make things up, and that’s a big problem, it’s getting better and better at giving you the right answer when it has the right answer. And so, what does that curve look like? It’s difficult to unpack at the moment, but we’re getting dramatically better models that are much better at doing the right thing while still sometimes doing the catastrophically wrong thing. We should pay attention to both of those things. Yeah, this is very difficult to deploy in a lot of production settings at the moment, at least without some clouding or some affordances around it, but it’s also getting much better. If you ask it something that’s really well covered on Wikipedia, it’s getting better.

An ultimate example of this is computer programming. You can ask it for a programming challenge it hasn’t seen, and if you ask it to generate a whole module or system, it kind of struggles, you sort of have a breaking point. But if you ask it to write a function, even a new, made-up, out-of-sample one, it might give you the wrong answer, but the chances of it giving you something useful have gone way up.

Des: You were saying before, it basically passes the first stage in our programming interview, some sort of array-based question. It just nails it.

“Everyone starts talking about how the dog’s grammar isn’t very good, and that’s very important, but don’t lose sight of the fact the dog is talking”

Fergal: Yeah. Exactly. We have a problem-solving programming challenge for engineers coming to Intercom. I had to sit them myself a few years ago, and we try very hard to make sure that’s not available on the internet. And if it is, we try and iterate and change it. And we’re not way up to speed, so I can’t guarantee it isn’t out there. But this thing generated a solution that just nailed it, and that is a “senior engineer at the whiteboard for half an hour” sort of problem. And it just gets it in one shot, one go.

Des: Zero seconds.

Fergal: Zero seconds. And that’s very impressive. And like half the rest of the world, I’ve also been playing with ChatGPT or GPT-3.5, and I’ve given it lots of other programming competition questions or programming questions, which I’m pretty sure are out-of-sample, and it does a very good job. And that’s a qualitative change in accuracy. You’ve got to check your code and make sure it’s not wrong, but that’s very interesting and exciting.

Very exciting as well is the idea that it’s got at least rudimentary introspection capabilities. If it writes a bug, you can be like, “Hey, there’s a bug. Can you fix it?” And sometimes, it gives you a beautiful explanation of it. And all these models are trained to do is token prediction; predict the next few words. At least traditionally, because I guess it’s changed a little bit in the last year, but the bulk of the training is just to predict the next token, predict the next word. And there is something amazing happening here – by just doing that at scale, you get to some level of understanding.
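To make the "predict the next token" objective concrete, here is a toy sketch that predicts the next word by counting which word most often follows the current one. This is a simple counting model over a made-up sentence, not a neural network and not how GPT models are implemented. It only illustrates the shape of the task: given what came before, guess what comes next.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    tokens = text.lower().split()
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Invented training text for illustration.
corpus = "the dog is talking and the dog is happy and the cat is asleep"
model = train_bigram(corpus)
print(predict_next(model, "the"))
print(predict_next(model, "dog"))
```

A large language model replaces the counting table with a transformer and the single sentence with a very large body of text, but the training signal is still the next token.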

I don’t want that to get lost in the wider discussion about hallucination, which is real, and people maybe didn’t pay enough attention to it last week. But there’s this metaphor, and I don’t remember who came up with it, of a talking dog, and someone tells you they want you to go meet their new talking dog, and you’re like, “Dogs can’t talk.” But you get to the dog and the dog has a conversation with you. Everyone starts talking about how the dog’s grammar isn’t very good, and that’s very important, but don’t lose sight of the fact the dog is talking. The hallucinations thing for me is that. This feels like a big change – maybe not one we can put in production, but who knows where it’ll be in a year, two years, or three years.

“This is like the self-driving car thing, right? You’ve got to be ready to take over at any point”

Des: Yeah, the hallucination thing, for me, doesn’t render it useless at all. And let’s be pessimistic and say that given a five-paragraph description of a patient, it can give you a 70% accurate diagnosis immediately. And in most of those diagnosis questions, there’s some quick test that can verify whether or not that’s true, as in, “Sounds like you have X, here’s the quick test for X,” and it turns out whether it was right or wrong – that’s still a massive productivity change. If we assume the thing is still flawed but try to take the benefit of the 70% accuracy, there are possibly still things it can do that’ll be massively valuable.

Fergal: I have two thoughts on that. The first thought is someone would need to study that because it’s possible that this thing is net negative, that the new system with the human in the loop, the doctor and the AI, has a higher probability of a catastrophic error because the tired, overworked doctor sometimes doesn’t do their diligence, but there’s an appealing yet incorrect system in front of them. This is like the self-driving car thing, right? You’ve got to be ready to take over at any point. There may be areas in that regime where the system as a whole with the human is actually worse than just the-

Des: People can actually overtrust.

Fergal: People can overtrust. What do they call it? Normalization of deviance. People study this in the context of nuclear reactor disasters and stuff. What went wrong? “Oh, we got used to this shortcut, and the shortcut wasn’t always valid,” et cetera. That’s one thing I would say. But then, the counterpoint, when we’re thinking about medical things, is that some portion of the world doesn’t have access to a doctor. So I don’t know where to draw that boundary. It’s a hard boundary to draw. Eventually, on the trajectory, this stuff will probably get better and better, and good enough that, eventually, as a whole, the system does outperform whatever people currently have.

Training chatbots step by step

Des: You were saying that when it generates code, you can say, “Hey, that’s buggy.” Another example I saw that was popular on Twitter for a while was “Talk me through your thinking line by line,” or whatever. It’s almost like you’re telling it how to think about things, or you’re giving it new information and then forcing it to reconsider its opinion. What’s happening there?

Fergal: I think there’s something fascinating happening there, and we’ve got to talk right at the cutting edge here. This is speculating and I’m a spectator – I’m not doing this work. I think Google published a paper pretty recently about how large language models can self-improve, so I think there’s something fascinating there that’s worth unpacking.

The first thing is that maybe about a year ago, people discovered that while these models would get things wrong a lot, you could prompt them with the classic “let’s think step by step.” You would have a model and could ask it a simple maths question like “Alice and Bob have got three chocolate bars and they give three to Eve,” or something like that. “How many do they have left?” These things struggle with basic maths, so it would often get things like that wrong. But you could say something like, “Let’s think step by step,” and that forced it to output its reasoning step by step along the way. And accuracy rates went up when you did that, which kind of makes sense. It’s trained to complete text. And so, step by step, each step is designed …

Des: It’s almost like you’re not multiplying out the probability of failure. Because then, if you’re running each step with a probability of it being 90% correct, at five steps, all of a sudden, the probability is only about 60% correct.
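The compounding Des describes can be checked with a few lines of arithmetic. This sketch assumes every step succeeds independently with the same probability, which is a simplification and not a claim about how the model behaves internally. Under that assumption, five steps at 90% each come out at roughly 59%.

```python
def chain_success(per_step_accuracy, steps):
    """Chance that every step is right, assuming steps fail independently."""
    return per_step_accuracy ** steps

for steps in (1, 3, 5, 7):
    print(f"{steps} steps at 90% each: {chain_success(0.9, steps):.0%}")
```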

Fergal: Perhaps. I mean, it’s difficult to speculate on what exactly is going on internally, but possibly something like that. But there was a very interesting paper recently where it was like, “Hey, we know we can improve the accuracy by saying, ‘let’s think step by step.'” And we can use that to get better outputs than just having it intuitively, instantly give the answer. You can use that to build a new training data set and retrain the model to improve its accuracy. That, for me, is fascinating because these things can self-improve, at least to some degree.
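A rough sketch of that loop, under stated assumptions: the prompt format, the helper names, and the use of a majority vote to decide which reasoning paths to keep are all illustrative choices, and the sampled answers are hard-coded stand-ins for what a model might return. The transcript does not say how the paper filters its examples.

```python
from collections import Counter

def step_by_step_prompt(question):
    """Append the trigger phrase so the model writes out its reasoning."""
    return f"Q: {question}\nA: Let's think step by step."

def build_training_examples(question, sampled_answers):
    """Keep only the reasoning paths that agree with the majority answer.

    `sampled_answers` is a list of (reasoning, final_answer) pairs that a
    model would have produced for the prompt; here they are hard-coded.
    """
    votes = Counter(answer for _, answer in sampled_answers)
    majority, _ = votes.most_common(1)[0]
    prompt = step_by_step_prompt(question)
    return [(prompt, reasoning, answer)
            for reasoning, answer in sampled_answers if answer == majority]

question = ("Alice and Bob have got three chocolate bars and they give "
            "three to Eve. How many do they have left?")
samples = [  # invented model outputs
    ("They start with 3 and give away 3, so 3 - 3 = 0.", "0"),
    ("3 bars minus 3 bars leaves 0.", "0"),
    ("They have 3 and Eve has 3, so 6.", "6"),
]
examples = build_training_examples(question, samples)
print(len(examples), "examples kept for retraining")
```

The kept examples pair the step-by-step prompt with reasoning the model itself produced, which is what would then be used as new training data.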

“There’s a very interesting world here where language models and NLP are starting to look a bit more like the AlphaGo world. I think it’s a very exciting time and it’s very hard to say what the limits are here”

I saw a demo recently at a Microsoft event where they showed Copilot or one of those models, maybe davinci, they didn’t specify, doing something with a Python prompt where they gave it a natural language problem, a bit like our Intercom programming problem, and then asked the system to synthesize code and put the code into a Python prompt, and when it got it wrong, the system tried to execute the code and saw it was wrong, so then it took another go and another until it got it right. There’s a very interesting world here where language models and NLP are starting to look a bit more like the AlphaGo world. I think it’s a very exciting time and it’s very hard to say what the limits are here.
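The generate, execute, and retry loop from that demo can be sketched as follows. The language model is replaced by a canned list of attempts, and `solve_with_retries`, `fake_model`, and `check` are hypothetical names for this example only. Nothing here reflects how the demoed system was actually built.

```python
def solve_with_retries(propose_code, check, max_attempts=3):
    """Ask for code, run it, and feed any failure back into the next attempt."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = propose_code(feedback)
        namespace = {}
        try:
            exec(source, namespace)  # only acceptable for trusted or sandboxed code
            check(namespace)
            return attempt, source
        except Exception as error:
            feedback = f"{type(error).__name__}: {error}"
    return None, None

# Stand-in for a language model: canned attempts, the first one buggy.
canned_attempts = iter([
    "def total(xs):\n    return sum(xs) + 1\n",
    "def total(xs):\n    return sum(xs)\n",
])

def fake_model(feedback):
    """A real system would include `feedback` in its next prompt."""
    return next(canned_attempts)

def check(namespace):
    """Run the generated function against a known input."""
    assert namespace["total"]([1, 2, 3]) == 6, "total([1, 2, 3]) should be 6"

attempt, source = solve_with_retries(fake_model, check)
print(f"passed on attempt {attempt}")
```

The important design point is that the check is automatic: code is one of the cases where an answer is cheap to validate, so the system can keep trying without a human in the loop.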

I think there are a lot of things that, for a long time, people in linguistics or something would’ve said, “In AI, we’ll never be able to answer these Winograd schemas,” or something like that. Like “The tractor went down the road and turned into a field. Please explain what happened in that joke.” Computers were bad at that historically. “The magic tractor went down the road and turned into a field.” A slight modifier like that changes the meaning. And it’s getting really good at that in some domains. You can ask it basic semantic questions or ask it to speculate. Up until about two or three years ago, whenever I saw a new machine learning system, it always looked magical and amazing at the start, and whenever you got into it and underneath the hood, you were like, “Oh, it’s just logistic regression.” Once I understood that, it was much less impressive. And I’m struggling to do that here. Maybe that’s because it’s so hard to understand the complexity of the model. But these things feel like qualitatively different capabilities than we’ve had.

AI bots versus Google

Des: Before we get into support, which we’ll deep dive on, I’ve seen comments saying this is as big a moment for the internet as Google. I’ve also seen the, I would say, cold water take, which is, “don’t be fooled, generating random song lyrics is a gimmick at best.” And there’s obviously a spectrum of appetite depending on whether or not you’re a techno-positivist or whatever. What’s your take on the Google thing? Is this potentially as big as Google? Is this a threat to Google? Thoughts on how Google might react?

Fergal: So, I’ll be super speculative here, entering into total futurism and stuff. I’m very bullish on AI and machine learning. I feel that the change in capability that we’ve seen over the last year, and certainly if you extrapolate forward another year or two, is as big as the internet. The potential. And we’re going to have to figure out how to productize these things. A ton of work will have to be done on how you constrain them to answer from a knowledge base and so on. But the sum total of new capabilities that we’ve gotten and are likely to get feels, to me, as big as the internet. I might be wrong, but that’s where I would-

Des: That’s the order of magnitude. So, bigger than Google.

“I think it’s a Sputnik moment – people will look at this and go, Wow, something’s arriving here”

Fergal: Yeah, I think so. Not just ChatGPT, which just came out last week. But the total progress feels like we’re seeing dramatically better capabilities at reasoning, elementary reasoning and reasoning that can be wrong, but sometimes quite compelling. I would not have believed it if you had told me of its success in programming challenges five years ago. So I think there’s something big here. There’s a lot of productivity that can be unlocked, and it’s very hard to say where that’s going to stop. And also, I think there are feedback loops here. I feel this is a Sputnik moment. With ChatGPT, you can say, “Hey, the tech isn’t that much better,” or “it’s getting overblown,” but don’t underestimate the ability of low friction being able to go in and play with something. Everyone can do that. And I think it’s a Sputnik moment – people will look at this and go, “Wow, something’s arriving here.”

Des: Sputnik reference here, sorry.

Fergal: This was, my God, back in the fifties. Russians put this satellite in space that orbited the earth and broadcasted radio signals. And people all around the world could suddenly tune in their radio and get this signal coming from Sputnik. And this is the narrative that’s generally told in the west. People suddenly woke up and were like, “Wow, there’s a capability change here that we were not aware of.” And then, supposedly, this caused the space race and the Apollo and all that stuff. So I kind of feel that maybe the reaction is still playing out, but I see so many people who were not really paying attention to this who are suddenly excited about it. Maybe the hype will die down. We’re in the middle of it, so it’s difficult to predict. But if this isn’t it, something else will be soon.

Can ChatGPT power customer support?

Des: What about customer support? Intercom is a customer support platform, and the potential that ChatGPT, GPT-3.5, or any of these technologies can make support better, faster, cheaper, more successful, or more end-to-end is something we’re always all over. I know you’ve been thinking about this from a support point of view. Earlier, we talked about how there are environments where an incorrect answer is very, very bad, and there are environments where it’s actually quite tolerable. We have 25,000 customers. Some are banks, which probably can’t afford one. Other people would happily afford one because it means they can support all their customers faster. What do you think about this technology as it applies to support?

“We made a conscious design decision very early on that it would never say anything that hadn’t been explicitly curated by the team”

Fergal: Yeah. We try and pay a lot of attention to changes and developments in this space. We were looking at GPT-3 pretty early, and our initial thoughts were that the accuracy was not quite there yet. The hallucination problem is too big a problem to just nakedly say, “Hey, it has consumed the Intercom help center. Let’s ask questions about resetting my two-factor authentication.” It just failed. We’ve been looking at the GPT-3.5 family and some other models recently. We have Resolution Bot in production. It’s not using language models that are as large – they’re maybe medium language models, embeddings, and so on. And it gets very good accuracy at the sort of thing it does. We made a conscious design decision very early on that it would never say anything that hadn’t been explicitly curated by the team. I think that worked well for a lot of businesses because it might deliver the wrong answer sometimes – we try carefully to control that – but it’s always going to deliver you a relevant answer or an answer that’s not going to mislead you.

Des: Yeah, and specifically, the way in which it gets it wrong is it might give you a wrong correct answer. The thing it gives you will be something that somebody in your company has said: “This is a correct, cohesive piece of text.” It just might not be the right one for the question.

Fergal: And we encourage our customers to always write the answer in such a way that, “Oh, to reset your account, do the following thing.” So if it is delivered wrongly, at least the end user is not disoriented.

Des: Yes, they don’t go and do it for no reason.

Fergal: They can go like, “Oh, this is a stupid bot. It gave me the wrong answer,” as opposed to, “I am misled, and I’m now going to waste a bunch of time…” So initially, with GPT-3, we were like, “Oh, it’s really cool but difficult to see the end-to-end usage of this.” It’s been a couple of years, and I’m not aware of anyone who has deployed GPT-3 in a total end-to-end way to answer the customer’s questions.

Des: End-to-end meaning no agent in the mix. Because the risk there is that there’ll be an unknown unknown. If someone goes to your business and asks a question that you didn’t see because GPT dealt with it, gave it the wrong answer, and the customer goes off and does the wrong thing, no one actually knows what’s happened except for the bot. And the bot doesn’t even know it’s wrong because it doesn’t know if it’s spoofing or not. So you end up in a potentially dangerous world.

Fergal: Exactly, and we’ve quite carefully designed the resolution bot to avoid getting into those situations. We calibrate it, we check that, when it says something helped the customer, it did help the customer, and we have ways of checking that between explicit and implicit customer feedback. But it’s conservatively designed.

“The probability of giving the wrong answer and totally making stuff up is too high, at least to use it for end users in a naked way”

At some point, these open domain question-answering things or something you could build on top of GPT-3.5 will get good enough that, for a certain portion of our customers, that equation changes where it’s like, “Hey, I’m not answering medically critical things,” and the inaccuracy rate has fallen. It was 90% accurate; now it’s 99% accurate; now it’s 99.9%. How commonly it gives you the wrong answer will eventually fall below the critical threshold where it’s like, “Hey, just being able to take this out of the box is worth it. I don’t have to go and curate these answers.” So that will probably come. When will that come? Is it here today, or has it come in the last few weeks with davinci-003 and ChatGPT? That’s obviously something we’ve been assessing.
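Editor's note: the arithmetic behind that "critical threshold" can be sketched roughly as follows. The accuracy figures are the ones Fergal mentions; the volume of 10,000 conversations and the tolerated error rate of 0.1% are hypothetical numbers chosen only for illustration, and the function names are made up for this sketch.

```python
from fractions import Fraction


def wrong_answers(accuracy: str, conversations: int) -> Fraction:
    """Expected number of wrong answers at a given accuracy.

    Accuracy is passed as a decimal string (e.g. "0.99") so the
    arithmetic is exact and free of floating-point rounding.
    """
    acc = Fraction(accuracy)
    if not 0 <= acc <= 1:
        raise ValueError("accuracy must be between 0 and 1")
    return (1 - acc) * conversations


def clears_threshold(accuracy: str, tolerated_error_rate: str) -> bool:
    """True once the error rate is at or below what a business tolerates."""
    return 1 - Fraction(accuracy) <= Fraction(tolerated_error_rate)


# Hypothetical volume: 10,000 conversations.
for accuracy in ("0.9", "0.99", "0.999"):
    print(accuracy, wrong_answers(accuracy, 10_000))
```

Each extra "nine" cuts the expected number of wrong answers by a factor of ten, which is why the threshold is crossed at different times for a bank and for a business where a wrong answer is tolerable.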

And it’s certainly a work in progress because you always have to go and play with the prompts. When you interface with ChatGPT or GPT-3, we could take an end user’s question and wrap it in something that says, “Hey, you’re a very conservative customer support agent. If you don’t know something or you’re not completely sure, you always say, ‘I don’t know,’” and you reason with it step by step, and you’re super conservative, and maybe we can wrap it to get the benefit of the deeper natural language understanding, which these models have, and the deeper ability to synthesize and rewrite text, which can be beautiful. It can be really nice. Maybe we can get those benefits and constrain the hallucinations and the errors enough.
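Editor's note: a minimal sketch of the kind of prompt wrapping described here. It only builds the text that would be sent to a model and calls no API. The preamble paraphrases Fergal's wording; the function name, the "Customer question" and "Agent answer" labels, and the overall layout are assumptions for illustration, not Intercom's actual prompt.

```python
# Hypothetical preamble, paraphrased from the conversation above.
CONSERVATIVE_PREAMBLE = (
    "You are a very conservative customer support agent. "
    "If you don't know something or you're not completely sure, "
    'you always say, "I don\'t know." '
    "Let's think step by step."
)


def wrap_question(question: str, preamble: str = CONSERVATIVE_PREAMBLE) -> str:
    """Wrap a raw end-user question in a conservative instruction prompt."""
    cleaned = question.strip()
    if not cleaned:
        raise ValueError("question must not be empty")
    # The model is expected to continue the text after "Agent answer:".
    return f"{preamble}\n\nCustomer question: {cleaned}\nAgent answer:"


print(wrap_question("How do I reset my two-factor authentication?"))
```

As Fergal notes a little later, wrapping of this sort made the model more conservative in their experiments, but not conservative enough to put in front of end users unassisted.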

Des: Is that another version of walking through this line by line?

Fergal: Yeah.

Des: Is that whole field what people call prompt engineering?

Fergal: Prompt engineering. We’re joking that the machine learning team at Intercom is going to be a prompt engineering team, and we’re joking about that as we play with it. But there are people who really sweat the prompts and have gotten really good at prompt engineering. It’s a real thing, and it makes it difficult to say, “Oh, this new tech is definitely not good enough,” because what will the best prompts be in six months? That said, we don’t think it’s here yet. All the prompt engineering we’ve done on davinci in the last week can get it to be more conservative, but not enough. The probability of giving the wrong answer and totally making stuff up is too high, at least to use it for end users in a naked way.

Support agent augmentation

Des: We talked earlier about the doctor augmentation question. Is there a version of it where you can do it from the agent augmentation question?

Fergal: Well, at Intercom, we’ve been thinking about this area very deeply for an extended period, and in the last few months, we have had internal discussions about the future of the customer support inbox and generative models – models that generate stuff as opposed to just classify things – and we believe that their time is coming for support augmentation, and I think that seeing ChatGPT explode recently and all the excitement about it is evidence of that. It’s evidence that these things are getting good. And there are a lot of things you can do in the inbox or in a context like the inbox to constrain and sand off the rougher edges of these things.

An example might be to curate the responses it’s allowed to give and use the generative model to predict what should happen, but only actually allow it to present a suggestion to the teammate, like a macro or a conversation response, and hopefully provide a beautiful interface to make it easy for them. Alternatively, to have it go and search a knowledge base, and there are techniques you can use to try and constrain it to that. And then, maybe show, “This is the answer that our bot wrote from your knowledge base,” and side by side with that, “Here is the original source article,” so that the customer support rep can look at them side by side-

Des: And see if it adds up.

Fergal: Yeah, and see if it adds up.

“They have to go and find the article themselves, then they have to read it and check the answer, and then they have to copy paste it and reformat it. So maybe there’s a productivity boost”

Des: So there’s an angle where the AI explains its epistemological basis for how it concludes this. And in that world, if you’re a support rep, you don’t even need to know if it’s actually right – you just need to know if the logic stacks up. Obviously, it’d be better if you knew if it was right, as well. But if it says, “Hey, I read how to reset a 2FA article linked here. I suggest that this is how you reset 2FA,” you’re probably, “That’s the right article to read.”

Fergal: The problem is that when they get it wrong, they’re so good at seeming right that they’ll-

Des: Invent the idea of the article.

Fergal: Yeah, yeah, totally. And so, you might need to go beyond that. You might need to have the untrusted part of the interface, which is maybe the composer, and it pre-fills something, and there’s also a trusted part of the interface beside that, maybe just above it, that shows the original source article, the relevant paragraph. And so, you can look at both.
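Editor's note: one rough way to picture the "untrusted composer plus trusted source" layout. The sketch assumes the draft has already been written by some generative model, then pairs it with a knowledge base paragraph copied verbatim so the agent can compare the two. The simple word-overlap matching, the 0.3 cutoff, and all names here are hypothetical stand-ins; a real system would more likely use embeddings or a search index.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Article:
    title: str
    paragraphs: tuple


@dataclass(frozen=True)
class Suggestion:
    draft: str             # untrusted: model-written text for the composer
    source_title: str      # trusted: copied verbatim from the knowledge base
    source_paragraph: str  # trusted: copied verbatim from the knowledge base


def _words(text: str) -> set:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def attach_source(draft: str, articles: Sequence[Article],
                  min_overlap: float = 0.3) -> Optional[Suggestion]:
    """Pair a draft with its best-matching source paragraph.

    Returns None when no paragraph shares enough words with the draft,
    so an unsupported (possibly hallucinated) draft is never shown.
    """
    draft_words = _words(draft)
    if not draft_words:
        return None
    best, best_score = None, 0.0
    for article in articles:
        for paragraph in article.paragraphs:
            # Fraction of the draft's words that also appear in the paragraph.
            score = len(draft_words & _words(paragraph)) / len(draft_words)
            if score > best_score:
                best, best_score = (article.title, paragraph), score
    if best is None or best_score < min_overlap:
        return None
    return Suggestion(draft, best[0], best[1])
```

The design choice worth noticing is that the source text is never rewritten by the model: whatever appears in the trusted pane is exactly what someone on the team wrote.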

Obviously, we study customer support flow very carefully and closely, and we absolutely have some support agents where it’s like, “Okay, I got the question,” and they have to go and find an article themselves. Some expert ones know it, they’re instantly there, and they know exactly where to go. Maybe they’ve got a macro that does it, but then maybe someone who’s newer in the company and they’re still being trained in, or maybe it’s only part of their job, they have to go and find the article themselves, then they have to read it and check the answer, and then they have to copy paste it and reformat it. So maybe there’s a productivity boost. Maybe you can make someone twice as efficient or something.

Des: All that agent behavior will also inform the system. If you put it live and agents are forever going “Wrong, right, wrong, right,” all that feeds back in, and then it gets better. Or, if they’re rewriting the answer to be more accurate, I assume we can learn from that. And then, very quickly, the system converges on all the right answers.

“There are a lot of trade-offs. It’s very easy to say we want a system that will learn in production. But then it’s like okay, who has to maintain that? Who has to debug that?”

Fergal: We could certainly build a system that does all of those things. GPT-3.5 won’t nakedly do it. If you decide to build on it as a building block, not even an assessment, is that the right system to build on? Its capability is very good, but it’s not the only generative model in town. But whatever we build on, and we’re getting really into the roadmap, we would potentially build a learning loop. With most of our tech at the moment where we do that, we absolutely gather feedback. There are some parts of the resolution bot like predictive answers, where it predicts things to end users, where it actually does use what the users say, like, “that helps” as a training signal, and potentially we can end up building that.

There are a lot of trade-offs. It’s very easy to say, “We want a system that will learn in production.” But then it’s like, “Okay, who has to maintain that? Who has to debug that?” Sometimes it’s easier to get it to a stable stage and then lock it. So, it depends. We do metrics and analytics whenever we upgrade. We’re getting into the details of our models and how we check the accuracy and calibrate them, and stuff.

Des: I know our inbox has this feature where, based on what you’ve said before, if I jump in the inbox, before I’ve said anything to try and start a conversation, it’ll say, “Hey, I’m Des, co-founder of Intercom, thrilled to be chatting with you.” Whatever my most common thing is, that’s automatically pre-written for me.

Fergal: Yep. Smart replies.

Des: Am I right in saying that it’s just the mini version in some sense of what we’re describing here? Because we were really just going for salutations and maybe ends and maybe handoffs, and the common boilerplate of a support conversation should be there for you. And that, alone, is a productivity boost. But the idea that we could get one degree sharper, and somewhere in the middle of all that boilerplate is, “Here’s the meat of the answer,” is where you’re talking about going, right?

“We believe its time is coming, and we’re trying to figure out the best ways to make people more efficient and to leverage it in a production setting that actually works for people”

Fergal: Yeah, totally. And again, to separate things out – there’s just the change in the world, an increased capability, GPT-3.5, and then there’s the stuff that we’re working on as we grind away on this problem and try to deliver things that will make it better for our customers. I think the capabilities have really improved, but we’re still figuring out if we can use this. Is there a shortcut to where we want to go? Maybe we can use these capabilities as building blocks, there are loads of ways to potentially use them as building blocks. But in terms of the direction we were going on already anyway, there are a lot of things agents do such as greetings where it’s very obvious. We don’t ever want to annoy people. We don’t ever want to have an agent read through a bunch of text and then be like, “Oh, that’s useless. Why did you do that?” It reduces their trust in the system. It slows them down. We want to help them out.

So, for smart replies, we started out with greetings. It was just an obvious thing to do. We can very easily tell when you’re probably going to want a greeting – you come into a new conversation and no one’s said anything to the end user before. It’s very obvious. That was a low-hanging piece of fruit. People really liked the user interface. It’s easy, and it’s low friction. Now, we only get to make a single suggestion there, and there are some times when it’s just hard for the system to tell. At the moment, we have this macro flow, and people use macros a lot. They have to choose which of the macros. Should we be suggesting those macros to people proactively? Maybe we don’t want to be pre-filling the composer, maybe we want to just show some macro suggestions that are contextual. There are a lot of flows that are repetitive. We’ve been working on things like flow-finding, trying to understand the common steps people go through.

I guess the big message is we do believe that this sort of generative tech needs to be shaped and made good so that it’s not annoying, so that it’s not giving you wrong things and misleading you, and certainly not pushing more work or stress on you than you would have without it. We believe its time is coming, and we’re trying to figure out the best ways to make people more efficient and to leverage it in a production setting that actually works for people.

AI-ML beyond support

Des: We’re talking about support. What other industries do you think will see the value of this in the early days? It feels like support is a target-rich environment for this type of tech, but are there others?

Fergal: Obviously, we’re bullish on support. There are so many things that are written. It’s like, “Oh, the agent pretty early on recognizes that this is a problem of the following sort,” like resetting my account or something like that. There’s so much structure in that area. There’s a combination of real customer problem structure meets technology that’s very good at dealing with natural language and reshaping it. We can see a button you can press to make what’s in the composer more formal, or a button to make it more apologetic, right? We think it’s a very, very exciting area at the moment. I don’t want to go into everything totally speculatively. But even before this, the machine learning team was all in on this area. We’re big believers in support.

Outside support, anything where there’s a structure in the task and a human approver who’s able to discern when an answer is right or wrong. This is going to seem like weird intuition, but in computer science or cryptography, we pay attention to certain types of problems where it’s easy to verify an answer is correct, but hard to go and find that answer. Complexity classes, all that sort of stuff. But yeah, people are interested in problems like that. I can’t help but think there’s similar intuition here. You have a challenge where it’s pretty easy for a human to verify whether an answer is correct or not but it’s laborious for them to go and look that up and fish that out. Or maybe the team doesn’t care whether the answer is correct enough because there is no such thing as correct, like, “Write me a poem about X, Y.”

Des: That class of problem where either validating the answer is very cheap but creating it is very expensive, or there is no valid answer.

Fergal: And also, the answer might be different in six months or a year. It could be that in a year, the answer could be something more like, “Anytime where a computer can check whether the answer is correct or not.” Or it could be that anytime the domain is sufficiently simple, the machine learning system will definitely give you or very likely give you the right answer. It’s an evolving thing. I think it’s hard to set limits at the moment.

“What are we shipping in January?”

Other domains like computer programming, for example. The person sitting there at their terminal has got to review the code anyway, and they’re able to do that, and there can be a subtle bug somewhere in your code. Sometimes it’s easier to write the code yourself than identify a subtle bug. But a lot of the time, if you look at the workflow of a computer programmer, it’s like, “Oh, I know how to do this, but I don’t remember exactly how to use this library. I’m going to Google for it. I’m going to go to Stack Overflow.” And the idea is that when you see answer number three on Stack Overflow, you’ll be like, “Oh yeah, that’s right. That’s what I want.” There’s a whole workflow like that which occupies a lot of the programmer’s time, and then Copilot comes along and there’s an end run around that. And then reformat the code to fit in. That’s extremely powerful.

We started talking about, “What is Copilot for customer support?” We have prototypes and there’s a lot you can play with. Maybe you don’t answer the full question, you just give it the two or three-word answer, it writes it out, and then you modify it, and you’re like, “Make that more formal, make that longer, make that shorter.” It feels like there’s a lot we can do there.

Des: And what are we shipping in January?

Fergal: Going to have to censor this part of the conversation. We’ll ship something.

Des: Yeah, I bet. Okay. This has been great. We’ll check in, I guess, in two more weeks when all the world’s changed again. But if not, it could be a few months. Thanks very much.

Fergal: By the time this is up on the web, I’m sure it’ll be out of date and look foolish. But that’s the nature of this business.

Des: Absolutely. That’s why you’re working on it.

Fergal: That’s why we’re working. It’s exciting.