
Run One QA System Across AI and Human Support Conversations

Insights by Intercom

Most support teams run QA the same way: manually grading a handful of human agent conversations each week, while AI conversations go ungraded entirely. The result is two separate quality standards, one loosely enforced for humans and one entirely absent for AI. As AI agents handle a growing share of conversations, this gap becomes a business risk. Customers don't distinguish between a bad answer from a human and a bad answer from an AI. They just remember the experience.

This guide covers how to build a single QA system that holds every conversation to the same bar, regardless of who (or what) handled it.

The Case for AI-Human Quality Alignment

Pure automation without quality oversight creates a false sense of efficiency. An AI agent that resolves 60% of conversations but gives incorrect answers to 5% of them generates a steady stream of customer frustration that never shows up in a resolution rate dashboard. Teams only find out when CSAT drops or when a customer escalates to social media.

Pure human QA doesn't scale either. When teams manually review 3-5% of conversations, the coverage is too thin to catch systemic issues. And as AI handles more volume, the human sample becomes an even smaller fraction of total interactions.

The answer isn't choosing between AI QA and human QA. It's building one system that covers both. When every conversation is evaluated against the same criteria, you can compare quality across channels, identify where AI outperforms humans (and vice versa), and allocate coaching and content resources where they'll have the most impact.

How Unified Quality Monitoring Works

A unified QA system applies the same evaluation framework to all conversations. The structure has three layers:

Scorecards define what quality means. Each scorecard contains criteria like answer accuracy, policy compliance, tone, and resolution quality. These criteria apply equally to AI and human conversations, though some dimensions (like empathy and rapport) may be weighted differently for human interactions.

Monitors define what gets reviewed. You set rules for which conversations enter the review queue: random samples for baseline measurement, targeted filters for high-risk segments, or full coverage for critical topics. Monitors run continuously, evaluating conversations as they close.

Review workflows define what happens next. Flagged conversations route to team leads or QA specialists who classify failures, apply fixes, and track resolution. For AI failures, fixes typically mean updating knowledge content or adjusting guidance rules. For human failures, fixes involve coaching, process updates, or documentation improvements.
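
To make the three layers concrete, the sketch below models them as plain data structures. This is purely illustrative; the class and field names are hypothetical, not any particular product's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scorecard:
    """Layer 1: defines what quality means, for AI and human conversations alike."""
    name: str
    criteria: list[str]  # e.g. ["answer_accuracy", "policy_compliance"]

@dataclass
class Monitor:
    """Layer 2: defines which closed conversations enter the review queue."""
    name: str
    scorecard: Scorecard
    sample_rate: float  # 1.0 = full coverage, 0.05 = 5% random sample
    filters: dict = field(default_factory=dict)  # e.g. {"topic": "refunds"}

@dataclass
class ReviewWorkflow:
    """Layer 3: defines what happens when a monitor flags a conversation."""
    monitor: Monitor
    owner: str       # the team lead or QA specialist who classifies failures
    fix_paths: dict  # e.g. {"ai": "update_knowledge", "human": "coach_agent"}
```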

Building Effective Cross-Channel QA

1. Start with shared criteria

Define quality dimensions that apply to both AI and human conversations. A strong starting set includes:

  • Answer accuracy: Did the response contain correct information?
  • Policy compliance: Did the agent follow your rules for refunds, data handling, and escalation?
  • Resolution quality: Was the customer's problem actually solved?
  • Communication clarity: Was the response easy to understand and appropriately toned?

For human-only criteria, add dimensions like customer rapport and de-escalation skill. For AI-only criteria, add knowledge source accuracy and escalation trigger correctness.
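
One way to keep that split explicit is to define the shared core once and compose agent-specific criteria on top. A minimal sketch using the criteria named above; the structure itself is an assumption, not a required format.

```python
# Shared criteria apply to every conversation, whoever handled it.
SHARED_CRITERIA = [
    "answer_accuracy",        # response contained correct information
    "policy_compliance",      # refund, data-handling, and escalation rules followed
    "resolution_quality",     # the customer's problem was actually solved
    "communication_clarity",  # easy to understand and appropriately toned
]

# Agent-specific criteria are additions, never replacements.
HUMAN_ONLY = ["customer_rapport", "deescalation_skill"]
AI_ONLY = ["knowledge_source_accuracy", "escalation_trigger_correctness"]

def criteria_for(agent_type: str) -> list[str]:
    """Return the full criteria set for a conversation's agent type."""
    extras = {"human": HUMAN_ONLY, "ai": AI_ONLY}[agent_type]
    return SHARED_CRITERIA + extras
```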

2. Set up parallel monitors

Create monitors that cover both conversation types:

  • AI quality baseline: Random sample of AI-handled conversations, scored against the shared scorecard.
  • AI risk monitor: Targeted at AI conversations with low CX scores, sensitive topics, or high-value customers.
  • Human quality baseline: Random sample of human conversations, scored against the same shared scorecard.
  • Human coaching monitor: Targeted at new hires, agents returning from leave, or agents handling unfamiliar topic areas.

Using the same scorecard across all monitors makes quality scores directly comparable. You can answer questions like "Is our AI agent more accurate than our human team on billing questions?" with data instead of assumptions.
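
Expressed as configuration, the four monitors might look like the sketch below. The filter fields, sample rates, and scorecard identifier are illustrative assumptions, not a specific product's schema.

```python
SHARED_SCORECARD = "unified_quality_v1"  # one scorecard across all monitors

MONITORS = [
    # Baselines: random samples scored against the same shared scorecard,
    # which keeps AI and human scores directly comparable.
    {"name": "ai_quality_baseline", "agent_type": "ai",
     "sample_rate": 1.00, "scorecard": SHARED_SCORECARD},
    {"name": "human_quality_baseline", "agent_type": "human",
     "sample_rate": 0.20, "scorecard": SHARED_SCORECARD},

    # Targeted monitors: narrower filters, full coverage of risky segments.
    {"name": "ai_risk_monitor", "agent_type": "ai",
     "sample_rate": 1.00, "scorecard": SHARED_SCORECARD,
     "filters": {"cx_score_max": 2, "topics": ["billing", "privacy"],
                 "customer_tier": "high_value"}},
    {"name": "human_coaching_monitor", "agent_type": "human",
     "sample_rate": 1.00, "scorecard": SHARED_SCORECARD,
     "filters": {"agent_status": ["new_hire", "returning_from_leave"]}},
]
```

The 0.20 sample rate on the human baseline mirrors the 20%+ human coverage target discussed later in this guide.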

3. Separate evaluation from reporting

Evaluation should be automated for coverage. AI scoring handles the first pass for every conversation. But reporting should blend automated scores with human review outcomes.

Build reports that show:

  • Overall quality by agent type (AI vs human) on the same dimensions
  • Failure rate trends over time, segmented by agent type
  • Top failure criteria by agent type (AI may fail on accuracy while humans fail on process compliance)
  • Cross-channel quality for customers who interacted with both AI and a human in the same conversation
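
A report along these lines reduces to a group-by over evaluation records. The sketch below assumes each record carries an agent type and per-criterion pass/fail results; the field names are hypothetical.

```python
from collections import defaultdict

def failure_rates(evaluations: list[dict]) -> dict:
    """Per-criterion failure rates, segmented by agent type.

    Each evaluation record looks like:
      {"agent_type": "ai", "results": {"answer_accuracy": True, ...}}
    where True means the criterion passed.
    """
    fails = defaultdict(int)   # (agent_type, criterion) -> failures
    totals = defaultdict(int)  # (agent_type, criterion) -> evaluations
    for record in evaluations:
        for criterion, passed in record["results"].items():
            key = (record["agent_type"], criterion)
            totals[key] += 1
            if not passed:
                fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}
```

Running AI and human evaluations through the same aggregation is what turns "AI fails on accuracy while humans fail on process compliance" from an impression into a queryable fact.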

4. Close the loop differently for AI and humans

When a monitor flags a failed AI conversation, the fix path is usually systemic: update a knowledge article, adjust a guidance rule, or refine a procedure. One fix potentially improves every future conversation on that topic.

When a monitor flags a failed human conversation, the fix path is individual: coach the agent, update training materials, or clarify the process documentation. The fix improves that agent's future conversations.

This difference is why unified QA is so valuable. AI fixes have higher leverage (one fix, many conversations), but human fixes address judgment calls that AI can't replicate. Tracking both in the same system lets you allocate effort where it has the most impact.
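
Tracking both in one system can start with a single triage step that branches on agent type. A sketch, reusing the hypothetical field names from earlier:

```python
def route_fix(flagged: dict) -> dict:
    """Assign a fix path to a flagged conversation based on who handled it."""
    if flagged["agent_type"] == "ai":
        # Systemic path: one content or guidance fix potentially improves
        # every future conversation on this topic.
        return {"action": "update_content",
                "target": flagged["knowledge_source"],
                "scope": "all_future_conversations_on_topic"}
    # Individual path: coaching improves this agent's future conversations.
    return {"action": "coach_agent",
            "target": flagged["agent_id"],
            "scope": "single_agent"}
```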

What Your Human Agents Need to Know

When you introduce AI-evaluated QA alongside human QA, your team needs clear expectations:

  • Transparency: Share the scorecard criteria and explain how AI scoring works. Agents should know what's being measured and why.
  • Calibration: Run sessions where your team reviews AI-scored conversations and compares their judgment to the automated scores. This builds trust and surfaces criteria that may need refinement.
  • Context: When an agent receives a handoff from an AI agent, they should see the full conversation history, including what the AI attempted and why it escalated. This context shapes how the agent completes the interaction and how that interaction gets evaluated.
  • Fairness: Don't hold human agents to AI-speed standards or AI agents to human-empathy standards. Use shared criteria for shared dimensions and agent-specific criteria for unique strengths.

Measuring Collaboration Quality

Each metric below pairs what it tells you with how to interpret it:

  • Cross-agent quality parity: Gap between AI and human scores on shared criteria. Converging scores indicate effective calibration.
  • Handoff quality: CX score for conversations that transferred from AI to human. Low scores suggest context loss during handoff.
  • Fix leverage ratio: Number of conversations improved per fix, by agent type. AI fixes should show higher leverage than human fixes.
  • Coverage rate: Percentage of total conversations evaluated by your QA system. Target 100% for AI and 20%+ for human.
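
The fix leverage ratio is simple to compute once each fix is logged with an estimate of the conversations it affects. A sketch with made-up numbers, purely for illustration:

```python
def fix_leverage(fixes: list[dict]) -> float:
    """Average number of conversations improved per fix."""
    if not fixes:
        return 0.0
    return sum(f["conversations_improved"] for f in fixes) / len(fixes)

# Illustrative numbers only: a knowledge-article fix lifts every future
# conversation on its topic, while coaching reaches one agent's queue.
ai_fixes = [{"conversations_improved": 400}, {"conversations_improved": 250}]
human_fixes = [{"conversations_improved": 30}, {"conversations_improved": 45}]

print(fix_leverage(ai_fixes))     # 325.0 conversations per AI fix
print(fix_leverage(human_fixes))  # 37.5 conversations per human fix
```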

Common Failure Modes

  • Running separate scorecards for AI and human conversations. This makes quality scores incomparable. Start with shared criteria and add agent-specific dimensions on top, not as replacements.
  • Monitoring AI conversations but not acting on results. Dashboards without fix workflows produce data, not improvement. Every flagged conversation should have an owner and a resolution path.
  • Over-weighting AI failures relative to human failures. AI conversations get 100% coverage, so failures are more visible. Human conversations sampled at 5% will appear to have fewer issues simply because fewer are reviewed. Normalize by coverage rate before comparing (see the sketch after this list).
  • Ignoring the handoff moment. The transition from AI to human is where quality most commonly drops. Customers repeat themselves, context gets lost, and resolution times spike. Monitor handoff conversations separately and optimize for context continuity.
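
The coverage normalization from the third point works like this: divide the failures you found by the number of conversations you actually reviewed, not by the total handled. A sketch with illustrative numbers:

```python
def estimated_failure_rate(failures_found: int, total_conversations: int,
                           coverage_rate: float) -> float:
    """Estimate the true failure rate from a partial review sample."""
    reviewed = total_conversations * coverage_rate
    return failures_found / reviewed

# AI at 100% coverage: 50 failures across 1,000 conversations -> 5%.
print(estimated_failure_rate(50, 1_000, 1.00))  # 0.05

# Humans at 5% coverage: only 4 failures are visible, but just 50 of
# 1,000 conversations were reviewed -> an estimated 8% failure rate.
print(estimated_failure_rate(4, 1_000, 0.05))   # 0.08
```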

Key Takeaways

  • Build one QA system with shared criteria that covers AI and human conversations equally.
  • Use automated scoring for coverage and human review for depth.
  • Fix AI failures systemically (update content) and human failures individually (coach agents).
  • Monitor the AI-to-human handoff as its own quality dimension.
  • Compare quality across agent types using the same scorecard to make resource allocation decisions with data.