Getting started with human teammate QA in Monitors

Learn how to set up Monitors for human teammate QA, including creating custom scorecards, configuring monitors, and submitting reviews.

Written by Alissa Tyrangiel

Monitors for QAing human teammates let you evaluate and improve the quality of your teammates' conversations at scale. You define which conversations get reviewed, attach a custom scorecard to score them, and assign reviewers, all automatically.

A Monitor selects which conversations to review. A Scorecard defines how each one is evaluated — the specific criteria you care about, such as accuracy, tone, or policy adherence. Together, they give you a consistent, scalable way to measure and improve conversation quality across your team.

Setting up human agent QA takes two steps:

  1. Create a Scorecard that defines your quality criteria.

  2. Set up a Monitor to select which conversations get reviewed, who reviews them, and how reviews are routed to your team.

Note: This article covers Monitors for human agent QA. For Monitors evaluating Fin AI Agent conversations, see Monitors and Custom Scorecards. Monitors for human agent QA are available as part of the Pro add-on.


How to create a Scorecard

Go to Fin AI Agent > Analyze > Monitors and click Scorecards.

Create your own by clicking + New scorecard:

Start by selecting who's evaluated. Click Human teammates assigned and choose either teammates in specific teams or individual teammates.

Next, add your scorecard criteria. First, click + Criteria > Create new.

When creating a new criteria, work through the following steps:

1. Name the criteria

Give the criteria a short, clear name (for example, Sentiment or Answer accuracy). This name appears in reports and will be used as a reference.

2. Describe what is being evaluated

Add a clear description explaining what the criteria checks and how it should be scored. The description is the prompt the AI uses to score this criteria, and the more precise it is, the more accurately AI will evaluate conversations. It also helps human reviewers apply the same criteria consistently.

Tip: For help writing effective descriptions, see how to write effective Monitor and Scorecard Criteria.

3. Choose how the criteria is scored

Decide whether the criteria should be automatically scored with AI, or manually scored by human reviewers. You can mix AI-scored and human-scored criteria within the same scorecard.

Note: Scorecard criteria titles and descriptions are reusable. Once you have created a criteria, you can add it to multiple scorecards. Previous rating scores cannot be reused and will need to be set from scratch in each scorecard.

4. Define rating options

Add the possible rating values a reviewer or AI can select (for example: Good, Okay, Poor). Each criteria must have at least two rating options. For each rating option, you will:

  • Name the rating (short and clear)

  • Describe when it should be selected

  • Assign a score (for example, 100%, 50%, 0%) or mark it as Not scored

The score you assign determines how that rating contributes to the overall review score.
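
The rating options described above can be sketched as a small data model. This is a minimal illustration, not Intercom's implementation: the `RatingOption` and `Criteria` classes, their field names, and the 0 to 100 score range are assumptions made for the example. Only the rule that each criteria needs at least two rating options comes from this article.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RatingOption:
    name: str               # short label, for example "Good"
    description: str        # when a reviewer or AI should select it
    score: Optional[float]  # percentage, or None to represent "Not scored"


@dataclass
class Criteria:
    name: str
    description: str
    rating_options: list[RatingOption] = field(default_factory=list)

    def validate(self) -> None:
        # The article states each criteria needs at least two rating options.
        if len(self.rating_options) < 2:
            raise ValueError(f"'{self.name}' needs at least two rating options")
        # Assumed range check: scores are percentages between 0 and 100.
        for option in self.rating_options:
            if option.score is not None and not 0 <= option.score <= 100:
                raise ValueError(f"score for '{option.name}' must be 0-100 or None")


accuracy = Criteria(
    name="Answer accuracy",
    description="Checks whether the teammate's answer is factually correct.",
    rating_options=[
        RatingOption("Good", "Answer is fully correct.", 100),
        RatingOption("Okay", "Answer is partly correct.", 50),
        RatingOption("Poor", "Answer is incorrect.", 0),
    ],
)
accuracy.validate()  # raises nothing: three options, all scores in range
```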

4b. Define rating reasons (optional)

For each rating option, you can define a list of rating reasons: predefined labels that explain why a particular score was given. Rating reasons help reviewers and AI categorize scores consistently, making it easier to identify patterns across conversations.

When AI scores a criteria, it automatically selects the most relevant predefined reason where one applies. If no predefined reason fits, the AI generates a clear explanation so that every score has meaningful context.

5. Enable Auto-review (optional)

You can automate the entire QA process for a scorecard by toggling on Auto-review scorecard.

When enabled:

  • If AI scores all criteria in the scorecard, the manual review step is skipped entirely.

  • Teammates can still manually override any AI score if they spot a discrepancy.

Tip: Auto-review works best on scorecards where all criteria are AI-scored. If any criteria requires a human, those conversations will still appear in the Unreviewed queue.


Configure your scorecard

After adding scorecard criteria, configure how they affect the overall review result.

Marking a scorecard criteria as critical

You can mark criteria as Critical. If a critical criteria receives a failing rating, the entire review is marked as failed — regardless of how the other criteria scored:

  • The review shows as Fail in scorecard views, even if the weighted score would otherwise have met the pass threshold

  • This overrides the pass threshold and all weights

  • Not scored ratings exclude the criteria from the overall score and do not trigger failure

Critical criteria are useful for non-negotiable standards such as compliance requirements, safety or policy adherence, and escalation handling.

Scorecard criteria weighting

Each criteria can be assigned a weight to define its relative importance.

  • Weight must be an integer between 0 and 100

  • Higher weights increase the impact of that criteria on the overall review score

Weights only apply to criteria included in the review score. Use weights to reflect what matters most — for example, a higher weight on Accuracy than Efficiency if correctness is more important than speed.

Note: Weights are relative to each other, not fixed to a scale of 100. The total can add up to any number; what matters is the proportion each criteria contributes. A criteria weighted 25 out of a total of 50 contributes the same as one weighted 50 out of 100.
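
To see why only the proportions matter, here is a minimal sketch that converts weights into shares of the total. The function name `weight_shares` and the criteria names are illustrative assumptions, not part of the product.

```python
def weight_shares(weights: dict[str, int]) -> dict[str, float]:
    """Return each criteria's share of the total weight."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be greater than 0")
    return {name: weight / total for name, weight in weights.items()}


# A weight of 25 out of 50 gives the same share as 50 out of 100.
small = weight_shares({"Accuracy": 25, "Tone": 25})
large = weight_shares({"Accuracy": 50, "Tone": 50})
print(small["Accuracy"], large["Accuracy"])  # 0.5 0.5
```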

Adding a pass threshold

You can define a pass threshold — the minimum overall score required for a review to be considered passing. For example, if the pass threshold is 80%, any review scoring below 80% is marked as failed.

This is evaluated after weighted scoring, provided no critical criteria has already failed the review.
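
The interaction between critical criteria and the pass threshold can be sketched as a small decision function. This is an illustration under stated assumptions: `review_outcome` and its parameters are hypothetical names, and a score equal to the threshold is treated as passing because the article says only scores below the threshold fail.

```python
def review_outcome(weighted_score: float, pass_threshold: float, critical_failed: bool) -> str:
    """Return "Pass" or "Fail" for a review."""
    # A failed critical criteria overrides the threshold and all weights.
    if critical_failed:
        return "Fail"
    # Otherwise the weighted score is compared against the pass threshold.
    return "Pass" if weighted_score >= pass_threshold else "Fail"


print(review_outcome(85, 80, critical_failed=False))  # Pass
print(review_outcome(79, 80, critical_failed=False))  # Fail
print(review_outcome(95, 80, critical_failed=True))   # Fail
```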


How the overall review score works

  1. Each criterion is rated using its defined rating options.

  2. Ratings contribute their assigned score (or are excluded if marked Not scored).

  3. Included criteria are combined using their assigned weights.

  4. If any critical criteria receives a failing rating, the review is marked as Fail regardless of the weighted score. The weighted score itself is still calculated as normal.

  5. Otherwise, the final score is compared against the pass threshold to determine whether the review passes or fails.

Here's an example of how three criteria combine into a final score:

| Criteria | Rating selected | Rating score | Weight |
| --- | --- | --- | --- |
| Accuracy | Good | 100% | 60 |
| Tone | Okay | 50% | 30 |
| Efficiency | Good | 100% | 10 |

Overall score = (100 × 60 + 50 × 30 + 100 × 10) / (60 + 30 + 10) = 85%
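
The same calculation can be reproduced in a few lines of code. This is a minimal sketch, not Intercom's implementation: the function name `overall_score` and the use of `None` to represent a Not scored rating are assumptions made for the example.

```python
from typing import Optional


def overall_score(ratings: list[tuple[Optional[float], int]]) -> Optional[float]:
    """Weighted average of (rating score, weight) pairs.

    A rating score of None stands for "Not scored" and is excluded.
    Returns None when no criteria is included in the score.
    """
    included = [(score, weight) for score, weight in ratings if score is not None]
    total_weight = sum(weight for _, weight in included)
    if total_weight == 0:
        return None
    return sum(score * weight for score, weight in included) / total_weight


# Accuracy (100%, weight 60), Tone (50%, weight 30), Efficiency (100%, weight 10)
example = [(100, 60), (50, 30), (100, 10)]
print(overall_score(example))  # 85.0
```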


How to create a Monitor

Monitors define which conversations get reviewed. You set the criteria, choose the reviewer, and attach a scorecard to evaluate quality. Once live, Monitors run automatically and surface matching conversations for your team to action.

Create at least one scorecard first. Without an attached scorecard, a Monitor that evaluates human conversations flags matching conversations but does not score them.

To access Monitors, go to Fin AI Agent > Analyze > Monitors. Click + Monitor to get started. You can also choose a template for Fin monitors, Teammate monitors, or General monitors.

Step 1: Choose how conversations are evaluated

Give your Monitor a name, then choose how conversations are evaluated. This is where you can link the scorecard you created to evaluate human agents:

Associate a scorecard with the Monitor to automatically evaluate every matched conversation against defined criteria. Once selected, the scorecard runs as soon as a conversation is added to the Monitor, and results appear in the Monitor for reporting and review.

Tip: Attaching a scorecard is what makes a Monitor truly useful — without one, conversations are flagged but not scored.

This is also where you can select your reviewers. All conversations that match the Monitor are automatically assigned to the selected reviewers, so reviews are routed consistently without manual coordination.

Note: If the attached scorecard has Auto-review enabled, the reviewer status will show as Auto-reviewed. These conversations will bypass the manual Unreviewed queue unless the AI detects a failure or cannot confidently score criteria.

Step 2: Choose conversations

Your Monitor can target:

  • A random sample — for example, a weekly sample of customer service conversations for baseline QA

  • A targeted set based on specific signals or risk — for example, all conversations where a customer shows signs of financial vulnerability

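You can narrow down conversations using natural-language flag criteria (a plain-English description of which conversations to flag) and precise filters such as Resolution State, Topic, or CX Score.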

Note: A single conversation can appear in multiple Monitors. Each Monitor runs independently, so if a conversation matches more than one Monitor's criteria, it will be flagged in each. Clicking through to a conversation shows exactly why it was flagged by that Monitor.

Step 3: Choose a Monitoring mode

Select how the Monitor runs:

  • Continuous: runs ongoing, matching new conversations as they close and adding them automatically

  • One-time: backfill only, matching conversations from historical data. New conversations that close after setup are not included

  • Scheduled: runs on a recurring daily or weekly cadence, letting teammates review conversations on a regular schedule

Step 4: Select the start date

Choose when the Monitor should begin evaluating conversations. This lets you run QA on historical conversations from a specific point in time, while continuously surfacing new matching conversations from that date forward.

Note: When first creating a Monitor, you can backfill up to 90 days of historical conversations. From that point, the Monitor continues capturing new matching conversations automatically.

Step 5: Choose when conversations are added

A conversation must be closed before it can be evaluated by a human agent QA monitor.


Submitting reviews

Conversations can be reviewed and submitted from various views.

In all views:

  • The conversation list shows the overall review score (percentage or Fail) alongside the individual criteria ratings as columns. This makes it easy to scan performance across conversations and spot failures or low scores.

  • When you open a conversation and go to the Score tab, you can see the assigned scorecard, review status, overall score, and the selected rating for each criteria. This view shows exactly how the final score was determined. When a criteria is scored using AI, you can hover over the rating in the Score tab to see a tooltip showing the rating selected, the criteria description, and the AI's reasoning for that score — all in one place.

There are several ways to access and submit reviews:

  • Click a Monitor to view all reviews associated with it.

  • On the Inbox page, click Assigned to me to view all the reviews you're responsible for.

  • Click Reviews received to view all the reviews that have been submitted for you as the teammate being reviewed.

To complete a review:

  1. Open a conversation from the Assigned to me view.

  2. Go to the Score tab and fill in each scorecard criteria.

  3. AI-generated scores can be overridden by clicking the rating.

  4. Once all criteria are scored, submit the review, or leave it for further action if needed. You can also add notes to the review to give context on why it received that score.

Note: If you previously used additional review statuses such as Fix needed or Won't fix, you can still filter by these in existing monitors. New monitors only support Unreviewed and Submitted.


Reporting

Monitor reports help you track and measure conversation quality. You can use these metrics to build reports that highlight quality trends and identify areas for improvement.

All Monitor metrics are available in the custom report builder, so you can combine them with other Intercom data to create tailored views of conversation quality.


To build a custom report using Monitor metrics, go to Reports > + New report > Create your own and select the metrics you need from the Monitors category. You can filter by scorecard, time period, or any other attribute to focus on the segments most relevant to your team.

Scorecard evaluation

| Metric name | Description |
| --- | --- |
| Evaluated scorecards | Number of scorecard evaluations. |
| Scorecard fail rate | Percentage of scorecard evaluations that failed. |
| Scorecard fails | Number of scorecard evaluations that failed. |
| Scorecard pass rate | Percentage of scorecard evaluations that passed. |
| Scorecard passes | Number of scorecard evaluations that passed. |
| Scorecard score | The review score assigned to scorecard evaluations. |

Scorecard criteria evaluation

Scorecard criteria evaluations are qualitative data points used to categorize or filter your metrics.

| Metric name | Description |
| --- | --- |
| Evaluated scorecard criteria | Number of scorecard criteria evaluations. |
| Scorecard criteria fails | Number of scorecard criteria evaluations that failed. |
| Scorecard criteria passes | Number of scorecard criteria evaluations that passed. |
| Scorecard criteria score | The review score assigned to scorecard criteria. |

Reporting attributes

| Attribute name | Description |
| --- | --- |
| Monitor | The QA monitor. |
| Review status | The current status of the review. For human QA monitors the values can be Unreviewed or Submitted. |
| Reviewed by | The reviewer who completed or is responsible for the review. |
| Reviewee | The teammate being evaluated in the review. |
| Scorecard | The evaluation template applied during the review. |
| Scorecard outcome | The final outcome of the scorecard evaluation. Example values include: Pass, Fail, N/A, Not complete, and Not scored. |
| Scorecard score | The quantitative score produced by the scorecard evaluation. |


Permissions

To edit scorecards and monitors and score conversations, teammates need both of the following permissions:

  • Can access Fin AI Agent and Automation settings

  • Can create, edit and internally share Reports

Teammates who don't have both permissions can't see human QA Monitors and can only see reviews of their own work via the Reviews received view. They can't override AI-scored criteria on their own reviews.

Note: Teammates need both permissions because human QA combines two product areas — scorecards live in Fin AI Agent, and review data feeds into Reports. Granting only one permission will leave the teammate unable to access the feature.
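
The "both permissions" rule amounts to a subset check. The sketch below is illustrative only: the permission names are taken from this article, but the `can_manage_human_qa` function and the idea of representing permissions as a set of strings are assumptions, not an Intercom API.

```python
REQUIRED_PERMISSIONS = frozenset({
    "Can access Fin AI Agent and Automation settings",
    "Can create, edit and internally share Reports",
})


def can_manage_human_qa(granted: set[str]) -> bool:
    """True only if the teammate holds every required permission."""
    return REQUIRED_PERMISSIONS <= granted


reports_only = {"Can create, edit and internally share Reports"}
print(can_manage_human_qa(reports_only))               # False
print(can_manage_human_qa(set(REQUIRED_PERMISSIONS)))  # True
```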


FAQs

How are multi-teammate conversations evaluated?

Only the teammate assigned to the conversation is evaluated. If multiple teammates participated, only the assigned teammate's replies are scored; the rest of the conversation is used as context only. The full conversation thread is sent to the LLM with each part annotated by author, and a targeted prompt instruction tells it to grade only that specific teammate's replies and treat everything else as context.

Which plan do I need to use Monitors for human agent QA?

Monitors for human agent QA are available as part of the Pro add-on. They're not included with the standard Essential, Advanced, or Expert plans, so you'll need the Pro add-on attached to your subscription to access scorecards and human QA Monitors. Pro is priced on conversation volume rather than seats, starting at $99/month for up to 1,000 Pro conversations, with tiered pricing for additional volume.

Are there limits on how many Monitors or scorecard criteria I can create?

Yes, each workspace has the following limits:

  • 20 live Monitors that use natural-language flag criteria (the field where you describe in plain English what conversations to flag). Monitors that only use precise filters (Resolution State, Topic, CX Score) don't count toward this limit.

  • 20 AI-scored criteria across all your scorecards. Human-scored criteria don't count toward this limit.

Does AI scoring cost extra per conversation reviewed?

There is no additional per-conversation charge for AI scoring; it is included in the Pro add-on. Each conversation is counted once toward your Pro volume, regardless of how many AI-scored criteria evaluate it or how many Monitors flag it.

Are there limits on how many conversations I can have reviewed per month?

No, Monitors don't have a separate monthly review limit — every conversation that matches a live Monitor will be evaluated. What you're billed for is your Pro conversation volume, not the number of Monitor reviews. If you want to limit the volume of conversations that go to human review, configure your Monitor's sampling settings — you can cap reviews to a random sample (for example, 10 conversations per day) rather than reviewing every match. You can also set a hard cap on your overall Pro conversation volume to keep billing predictable. Once that cap is reached, Pro conversations stop being metered for the rest of the billing cycle.

Do I need to pay for each teammate I review?

No, Pro is priced on conversation volume, not seats. Once your workspace has the Pro add-on, you can review any number of teammates' conversations — what you're billed for is the volume of conversations your workspace handles, not the number of teammates being reviewed or doing the reviewing.

What permissions do I need to set up and use human agent QA?

To create scorecards, edit Monitors, and score conversations, you need both:

  • Can access Fin AI Agent and Automation settings, and

  • Can create, edit and internally share Reports

If you only have one of these permissions, you'll be able to see reviews of your own work via the Reviews received view, but you won't be able to create or edit anything.

Does a failing critical criteria zero out my review score?

No, the weighted score is still calculated as normal — but the review is marked as Fail regardless of what the weighted score would have been. The critical override applies to the pass/fail outcome, not the numeric score.

What does "Not scored" mean, and how is it different from a 0% rating?

A "Not scored" rating tells us to skip the criteria entirely: it won't contribute to the overall review score and won't trigger a critical failure, even if the criteria is marked critical. A 0% rating still counts: it contributes weight × 0 to the overall score, and if the criteria is critical, it'll fail the review. Use Not scored when a criteria doesn't apply to the specific conversation (for example, a tone criteria on a conversation that ended in one reply).
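
The difference shows up clearly in the arithmetic. The sketch below is a hedged illustration using made-up weights (60, 30, 10): `weighted_score` is a hypothetical helper, and `None` is an assumed stand-in for a Not scored rating.

```python
def weighted_score(ratings: dict) -> float:
    """ratings maps criteria name -> (rating score or None for Not scored, weight)."""
    included = [(score, weight) for score, weight in ratings.values() if score is not None]
    total_weight = sum(weight for _, weight in included)
    return sum(score * weight for score, weight in included) / total_weight


tone_not_scored = {"Accuracy": (100, 60), "Tone": (None, 30), "Efficiency": (100, 10)}
tone_zero = {"Accuracy": (100, 60), "Tone": (0, 30), "Efficiency": (100, 10)}

print(weighted_score(tone_not_scored))  # 100.0, Tone is left out of the calculation
print(weighted_score(tone_zero))        # 70.0, Tone counts with a score of 0
```

With Tone marked Not scored, its weight drops out of the denominator, so the remaining criteria decide the score. With Tone rated 0%, its weight stays in and pulls the score down.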

Why don't my criteria weights need to add up to 100?

Weights are proportional, not absolute. Two criteria weighted 25 and 75 produce the same scoring outcome as two criteria weighted 1 and 3 — what matters is the ratio between them, not the total. This means you can adjust one criteria's weight without having to manually rebalance the others.

When does auto-review skip the Unreviewed queue entirely?

Auto-review skips the Unreviewed queue only when all of the following are true:

  • The scorecard has Auto-review enabled

  • Every criteria in the scorecard is AI-scored (no human-scored criteria)

  • AI was able to confidently score every criteria

If even one criteria is human-scored — or if AI couldn't score a criteria confidently — the conversation goes to the Unreviewed queue for manual review.
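
The three conditions above can be expressed as a single check. This is a minimal sketch under stated assumptions: `CriteriaResult`, its `ai_confident` flag, and `skips_unreviewed_queue` are hypothetical names, and how AI confidence is actually determined is not described in this article.

```python
from dataclasses import dataclass


@dataclass
class CriteriaResult:
    ai_scored: bool     # False means the criteria is scored by a human reviewer
    ai_confident: bool  # hypothetical flag: True if AI scored it confidently


def skips_unreviewed_queue(auto_review_enabled: bool, results: list[CriteriaResult]) -> bool:
    """Return True only when all three conditions hold."""
    if not auto_review_enabled or not results:
        return False
    return all(r.ai_scored and r.ai_confident for r in results)


all_ai = [CriteriaResult(True, True), CriteriaResult(True, True)]
one_human = [CriteriaResult(True, True), CriteriaResult(False, False)]

print(skips_unreviewed_queue(True, all_ai))     # True
print(skips_unreviewed_queue(True, one_human))  # False
print(skips_unreviewed_queue(False, all_ai))    # False
```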

Can I change a Monitor's type after creating it?

No, once a Monitor is created as Continuous, One-time, or Scheduled, the type can't be changed. If you need a different type, archive the existing Monitor and create a new one.

What happens to existing reviews if I edit a scorecard?

Existing reviews stay scored against the version of the scorecard that was active when they were created. They aren't re-scored against the new version.

New conversations matched after the edit are scored against the updated scorecard. This is why you'll occasionally see older reviews referencing criteria that no longer exist on the current scorecard.

Why did my reviewer change to someone else after I edited the criteria?

When any teammate updates a criteria on a review (either AI-scored or manually scored), the reviewer for that review is automatically set to whoever made the most recent edit. This applies to all scorecards, including auto-reviewed ones — editing an auto-reviewed conversation will replace Auto-reviewed with your name.

The review status isn't changed automatically.

Can the same conversation appear in multiple Monitors?

Yes, a conversation can match the criteria of more than one Monitor — each Monitor runs and evaluates independently, so the conversation can have multiple sets of scorecard ratings from different reviewers. When you open the conversation, you'll see which Monitor flagged it for each set of ratings.

Can I reuse criteria across multiple scorecards?

Yes — once you've created a criteria (name + description), you can attach it to other scorecards from the + Criteria menu. However, rating options and scores don't carry over — you'll need to set the ratings, scores, and weights from scratch in each scorecard you add the criteria to.


Coming soon

  • Teammate coaching tips: AI-driven coaching tips for teammates who are being reviewed, as well as for managers reviewing those teammates.

  • Calibration workflows: Calibration helps reviewers align on evaluation standards by assessing shared examples and comparing outcomes, improving consistency and fairness in feedback and quality measurement.

  • Dispute workflow: Teammates will be able to dispute their reviews.

  • Evaluation against the knowledge base: Score conversations against your support content and policies, helping ensure teammates follow internal processes.

  • Sorting and rearranging columns in human QA monitors.

