What really matters when evaluating AI Agents for customer service?

When considering AI Agent options, some teams don’t focus on the right evaluation criteria. Not because they aren’t rigorous – most are – but because they are only rigorous about a subset of the criteria.

Over the past few years, I’ve watched many of our customers and prospects run a “proof of concept” (POC) to evaluate an AI Agent for customer service. It’s understandable that many spend the majority of their evaluation time on performance, paying close attention to accuracy scores, resolution rates, and benchmark tests on curated datasets. But performance indicators alone aren’t enough to guarantee success outside of the evaluation.

If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else you should look for to ensure you’re making the best long-term decision.

How does it handle your real-world setup?

Performance is important. No one is disputing that. But performance in a POC should reflect the messiness typical of most support environments.

Getting an answer right is crucial, but the best-performing Agents will also demonstrate sophisticated behavior, proving they’re capable of interacting with real human customers, not just curated datasets. Pay attention to how an Agent behaves when it doesn’t know an answer – does it recover or does it spiral? Does it stay on track when navigating a complicated request with multiple steps? How does it manage the handoff to your human agents?

When building test scenarios, be thorough. Include a variety of query types to really put the Agent through its paces:

Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.
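
One way to keep yourself honest here is to tag every test scenario with the query type it covers and check which types are still thin before the POC starts. The sketch below is a minimal illustration of that idea, not a prescribed tool: the category identifiers, the Scenario class, the coverage_gaps helper, and the sample prompts are all hypothetical names and data invented for this example.

```python
from collections import Counter
from dataclasses import dataclass

# Query types from the checklist above. The identifiers are hypothetical.
CATEGORIES = [
    "multi_turn",
    "vague_input",
    "edge_case",
    "rephrasing",
    "multi_source",
    "multilingual",  # drop from `required` if your customer base doesn't need it
]


@dataclass
class Scenario:
    category: str
    prompt: str  # opening customer message for the test conversation


def coverage_gaps(scenarios, min_per_category=3, required=CATEGORIES):
    """Return {category: number of scenarios still needed} for thin categories."""
    counts = Counter(s.category for s in scenarios)
    return {
        c: min_per_category - counts[c]
        for c in required
        if counts[c] < min_per_category
    }


# Illustrative prompts only, not real customer data.
suite = [
    Scenario("multi_turn", "I changed my plan last week. Can I undo that?"),
    Scenario("vague_input", "cant login pls help"),
    Scenario("rephrasing", "How do I reset my password?"),
    Scenario("rephrasing", "forgot my password, what now"),
]

gaps = coverage_gaps(suite, min_per_category=2)
print(gaps)
```

With this sample suite, only the rephrasing category meets the minimum of two scenarios, so every other category shows up in the result. The threshold is an arbitrary choice for the example; pick one that matches the volume and risk of each query type in your own support queue.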

It’s worth spending time on this preparation. Any Agent can look impressive in a demo environment. But what really counts is how it holds up as part of your team, serving your customers.

What does it feel like to interact with the Agent?

Two AI Agents can achieve the same quantitative scores (resolution rate, containment rate, etc.) and deliver completely different experiences. Yes, resolution rate tells you how often the Agent resolves a conversation, which is the end goal, but it tells you nothing about how the customer felt during it.

Look for indicators that the AI Agent is enjoyable to interact with:

Is the tone natural and on-brand, or does it feel robotic and generic?
Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
When it doesn’t know the answer, does it handle that gracefully?
When it hands off to a human, is that transition seamless, or does the customer feel abandoned?

As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”

That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.

Assess the experience dimension explicitly. Have people on your team (and ideally real customers if you have a group you can test with) interact with the Agent under conditions that reflect your actual support environment. Ask them how it felt, not just whether it worked.

Can you keep improving it after launch?

This is the dimension most teams don’t evaluate at all, and it’s possibly the most important one.

Choosing an Agent that works today, and that you can keep improving for your customers over the long term, requires looking beyond what’s right in front of you. Yes, you need to know that the technology works, but you also need to be confident that you’re buying a system that gets better over time.

That means evaluating three things before you commit:

The feedback loop

Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time.
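
If you want to put a number on that loop during a POC, one rough approach is to log each gap your reviewers find, note when it was flagged and when it was fixed, and then look at what is still open and how long fixes typically take. The sketch below assumes a hypothetical Finding record and summarize helper; the gap types mirror the examples above, and the dates are made up purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class Finding:
    gap_type: str  # e.g. "missing_knowledge", "incorrect_tone", "poor_handoff"
    flagged_on: date
    fixed_on: Optional[date] = None  # None while the gap is still open


def summarize(findings):
    """Count open gaps by type and compute the median days from flag to fix."""
    open_by_type = Counter(f.gap_type for f in findings if f.fixed_on is None)
    days_to_fix = [
        (f.fixed_on - f.flagged_on).days
        for f in findings
        if f.fixed_on is not None
    ]
    return {
        "open_by_type": dict(open_by_type),
        "median_days_to_fix": median(days_to_fix) if days_to_fix else None,
    }


# Made-up review log for illustration.
findings = [
    Finding("missing_knowledge", date(2024, 1, 8), date(2024, 1, 9)),
    Finding("incorrect_tone", date(2024, 1, 8), date(2024, 1, 12)),
    Finding("poor_handoff", date(2024, 1, 10)),
    Finding("missing_knowledge", date(2024, 1, 11)),
]

summary = summarize(findings)
print(summary)
```

A falling median and a short open list suggest the loop is tight. If the same gap type keeps reappearing in the open list, that is a sign the tooling or the process around it is slowing you down.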

The speed of iteration

When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just for the first few weeks after launching their AI Agent.

The vendor partnership

The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:

How does customer feedback influence the product roadmap, and can they show you examples?
If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
What kind of support will you get post-launch?
Are they shaping where AI customer experience is going, or reacting to what others are building?

How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.

What a good POC proves

If your POC only proves “the AI works,” you haven’t done enough.

A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. It shows you that you’ve chosen something that will set you up for long-term operational success.