
Fin AI Agent Monitors [beta]

Quality assurance for Fin conversations—at scale.

Written by Alissa Tyrangiel
Updated today

Note: Monitors is available as part of the Pro add-on. This feature is currently in closed beta; if you'd like access, you can request it via this form. We’re gradually expanding availability and will follow up if a spot opens up.

What are Monitors?

Monitors help you continuously evaluate and improve Fin’s conversation quality at scale.

They give you a structured way to define which conversations should be reviewed—whether that’s a random sample for baseline quality, or a targeted set based on higher-risk or higher-impact signals. This replaces ad-hoc sampling and spreadsheet-driven QA with a repeatable system that scales as volume grows.

Monitors work with Custom Scorecards:

  • Monitors define what gets reviewed

  • Scorecards define how each conversation is evaluated

Scorecards can include criteria that are:

  • Reviewed manually

  • Evaluated using AI

  • Or a combination of both

This ensures quality is assessed consistently, while still allowing flexibility in how reviews are performed.


How teams use Monitors

Teams use Monitors to maintain ongoing visibility into quality and focus attention where it matters most. Common use cases include:

  • Reviewing a random sample to understand overall quality trends

  • Focusing on higher-risk or higher-impact conversations, such as:

    • Low CX scores

    • Policy breaches

    • Legal threats

    • Other business-specific indicators

  • Tracking conversations tied to a specific initiative, like a feature launch, pricing change, or product update

Monitors make it easier to detect patterns, surface issues earlier, and generate insights that can be shared with product, support, or leadership teams.


How to create a Monitor

To access Monitors, go to Fin AI Agent > Analyze > Monitors.

To create a new Monitor, click + Monitor. Pick one of the templates or Start from scratch.

Choose conversations

Give your Monitor a name and choose which conversations the Monitor should review.

This can be:

  • A random sample (for example, a weekly sample of Fin conversations for baseline QA)

  • A targeted set based on specific signals or risk (for example, all conversations where a customer shows signs of financial vulnerability)

You can narrow down the conversations of interest using filters.

Select the start date

Choose when the monitor should begin evaluating conversations. This allows you to run QA on historical conversations from a specific point in time, as well as continuously surface new matching conversations.

Choose when conversations are added to the monitor

You can control when a conversation is matched to a monitor. This determines when the monitor evaluates the conversation — and, if a scorecard is attached, when that scorecard runs. Select one of the following options:

  • Fin is done – Conversations are added once Fin has fully completed handling them (resolved, escalated, or followed up with no customer reply).

  • Conversation is closed – Conversations are added only once they're closed, either by a teammate or by Fin.

Use this setting to align evaluation timing with your workflow — whether you want to assess Fin immediately after it finishes, or only once the conversation is officially closed.

Choose the reviewer

Select a teammate as the reviewer. All conversations that match the Monitor are automatically assigned to that reviewer, so reviews are routed consistently without manual coordination.

In this example, we're creating a Monitor that flags conversations with "Vulnerable customers", starts finding matches from Now (today), and assigns them to Alissa for review.

Attach a scorecard (optional)

You can associate a scorecard with a monitor to automatically evaluate every matched conversation against defined criteria. Once selected, the scorecard runs as soon as the conversation is added to the monitor, and results appear in the monitor for reporting and review.

Test your monitor before turning it on

For monitors that use natural language flag criteria, use the Test monitor tool to validate your criteria against real conversations before you create or update the monitor. It shows which conversations would be flagged and highlights mismatches so you can refine the wording and reduce false positives or misses.

Tip: We strongly recommend testing every monitor with flag criteria before turning it on.

In the Flag criteria section, click Run test, or click the Test button in the top right.

Review sample conversations

For existing monitors, this list is automatically populated with recent conversations that were flagged and not flagged by the monitor. You can also paste additional conversation URLs or IDs to test specific edge cases.

Check the results

For each conversation, review the Monitor result (Flagged / Not flagged) and mark whether it’s Correct. The evaluation summary shows your overall pass rate and highlights mismatches.
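As a mental model, the evaluation summary boils down to comparing the monitor's result against your judgment for each sample. Here's a minimal Python sketch of that logic — an illustration only, not Intercom's implementation, and all names and data shapes are hypothetical:

```python
# Illustrative sketch of the test-evaluation summary described above.
# The data shape is hypothetical; this is not Intercom's implementation.

def summarize_test(results):
    """results: list of dicts with 'id', 'flagged' (the monitor's result),
    and 'expected' (the reviewer's judgment of the correct result)."""
    mismatches = [r["id"] for r in results if r["flagged"] != r["expected"]]
    pass_rate = 1 - len(mismatches) / len(results)
    return pass_rate, mismatches

results = [
    {"id": "conv_1", "flagged": True,  "expected": True},
    {"id": "conv_2", "flagged": True,  "expected": False},  # false positive
    {"id": "conv_3", "flagged": False, "expected": False},
    {"id": "conv_4", "flagged": False, "expected": True},   # missed flag
]
rate, wrong = summarize_test(results)
print(rate, wrong)  # 0.5 ['conv_2', 'conv_4']
```

Both kinds of mismatch matter: false positives waste reviewer time, while misses let risky conversations slip past the monitor.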

Refine and retest

Update the Flag criteria description and rerun the test until the results accurately reflect what you want the monitor to capture.

Once the Monitor has been created, it will start finding matches and appear on your Monitors page. You can always edit your Monitor configuration later, if needed.


Creating and configuring Scorecards

A custom scorecard defines what “good” looks like for your team by explicitly setting the criteria you care about—such as accuracy, tone, or policy adherence.

You can have multiple scorecards for different Monitors. Simply choose which scorecard you want to associate with a Monitor from the Monitor setup screen:

To create a scorecard

Go to Fin AI Agent > Analyze > Monitors and click Scorecards. You can use the out-of-the-box Fin Quality Scorecard or create your own by clicking +New scorecard:

Create a new scorecard attribute

Start by adding scorecard attributes. Click +New attribute.

When creating a new attribute, you’ll:

1. Name the attribute. Give the attribute a short, clear name (for example, Sentiment or Answer accuracy). This name appears in reports and will be used as a reference.

2. Describe what’s being evaluated. Add a clear description explaining what the attribute measures and what good and poor examples look like.

3. Choose how the attribute is scored. Decide whether the attribute should be:

  • Automatically scored with AI, or

  • Manually scored by human reviewers

You can mix AI-scored and human-scored attributes within the same scorecard.

4. Define rating options. Add the possible rating values a reviewer or AI can select (for example: Good, Okay, Poor). Each attribute must have at least two rating options. For each rating option, you’ll:

  • Name the rating (short and clear)

  • Describe when it should be selected

  • Assign a score (for example, 100%, 50%, 0%) or mark it as Not scored

The score you assign determines how that rating contributes to the overall review score.

5. Choose whether to include it in the review score

You can toggle Include in review score on or off.

  • When enabled, this attribute contributes to the overall review score.

  • When disabled, the attribute is recorded for analysis and reporting, but does not affect the overall score

In this example, we've created a scorecard attribute that evaluates escalation ease:

Configure your scorecard

After adding scorecard attributes, you can configure how they affect the overall review result.

Marking a scorecard attribute as critical

You can mark an attribute as Critical.

If a critical attribute receives a failing rating, the entire review fails:

  • The overall review score becomes 0%

  • This overrides all weights

  • “Not scored” ratings exclude the attribute from the overall score and do not trigger failure

Critical attributes are useful for non-negotiable standards such as:

  • Compliance requirements

  • Safety or policy adherence

  • Escalation handling

Scorecard attribute weighting

Each attribute can be assigned a weight to define its relative importance.

  • Weight must be an integer between 0 and 100

  • Higher weights increase the impact of that attribute on the overall review score

Weights only apply to attributes that are included in the review score. Use weights to reflect what matters most. For example, you might assign a higher weight to Accuracy than to Efficiency if correctness is more important than speed.

Adding a pass threshold

You can define a pass threshold for the scorecard. The pass threshold determines the minimum overall score required for a review to be considered passing. For example:

  • If the pass threshold is 80%, any review scoring below 80% will be marked as failed.

This is evaluated after weighted scoring, provided no critical attribute has already failed the review.

How the overall review score works

  1. Each attribute is rated using its defined rating options.

  2. Ratings contribute their assigned score (or are excluded if marked Not scored).

  3. Included attributes are combined using their assigned weights.

  4. If any critical attribute receives a failing rating, the overall review score becomes 0%.

  5. The final score is compared against the pass threshold to determine whether the review passes or fails.
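The steps above can be sketched in code. This is an illustrative model only — not Intercom's implementation — and the function, field names, and the choice to treat a 0% rating as "failing" for critical attributes are all assumptions for the sake of the example:

```python
# Illustrative sketch of the review-score logic described above.
# Not Intercom's implementation; names and the "failing = 0% rating"
# convention for critical attributes are hypothetical.

def review_score(attributes, pass_threshold=0.8):
    """Each attribute is a dict with:
    'score'    - rating score as a fraction (1.0, 0.5, 0.0) or None for Not scored
    'weight'   - integer 0-100
    'critical' - bool; a failing rating fails the whole review
    'included' - bool; the Include in review score toggle
    Returns (overall_score, passed)."""
    # A failing rating on a critical attribute zeroes the review,
    # overriding all weights. A Not scored rating (None) does not trigger this.
    for a in attributes:
        if a["critical"] and a["score"] == 0.0:
            return 0.0, False

    # Combine included, scored attributes using their weights.
    scored = [a for a in attributes if a["included"] and a["score"] is not None]
    total_weight = sum(a["weight"] for a in scored)
    if total_weight == 0:
        return None, False  # nothing contributed to the score
    overall = sum(a["score"] * a["weight"] for a in scored) / total_weight

    # Finally, compare against the pass threshold.
    return overall, overall >= pass_threshold

attrs = [
    {"score": 1.0, "weight": 60, "critical": False, "included": True},  # Accuracy: Good
    {"score": 0.5, "weight": 20, "critical": False, "included": True},  # Tone: Okay
    {"score": None, "weight": 20, "critical": True, "included": True},  # Compliance: Not scored
]
score, passed = review_score(attrs)
print(round(score, 3), passed)  # 0.875 True
```

Note how the Not scored critical attribute is simply excluded: it neither fails the review nor dilutes the weighted average, matching the behavior described above.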

Where to view scores

Once reviews are completed, scores are visible in both the conversation list and within each conversation.

In a Monitor, the conversation list shows the overall review score (percentage or Fail) alongside the individual attribute ratings as columns. This makes it easy to quickly scan performance across conversations and spot failures or low scores.

When you open a conversation and go to the Score tab, you can see the assigned scorecard, review status, overall score, and the selected rating for each attribute. This view shows exactly how the final score was determined.


Managing reviews

Each monitor gives you a clear view of the conversations it has matched and the scorecard scores. This makes it easy to move from detection to review to action, without leaving Intercom. Click on a Monitor and you can see:

  • All matched conversations in one place

  • Review status (not reviewed, reviewed, fix complete, needs action)

  • Any AI-applied scores and where manual scoring is still needed

  • Who is assigned as the reviewer

Manual reviews can be completed directly from the Monitor conversation view by clicking on the conversation and filling in the scorecard. AI‑generated scores can also be overridden by human reviewers if needed.


Best practices for writing Scorecard Attribute descriptions

Start with the core principle: Attributes compete. The AI looks at the full list and selects the single best match for each attribute. Your job is to make that choice obvious.

1. Use clear, concise names

  • Keep names short and specific. Someone reading the list should immediately understand the purpose without opening the description.

  • Bad: Customer Communication Issues

  • Better: Tone – Rude or Dismissive

2. Write comprehensive descriptions

Descriptions carry most of the classification signal.

  • Explicitly describe all conversation types that belong.

  • Include keywords, common phrasings, and examples.

  • Think through edge cases and include them.

  • Clarify what “good” and “bad” instances look like.

The description should make it easy for the AI to recognize real-world phrasing, not just abstract definitions.

3. Make attributes clearly distinct

Attributes within the same scorecard should not compete conceptually.

  • Avoid semantic overlap.

  • Ensure each attribute has a clear boundary.

  • If two attributes could reasonably apply for the same reason, refine one of them.

It’s fine if a single conversation fits multiple attributes across the scorecard. What matters is that within each attribute set, the values are clearly separable.

4. Evaluate quality systematically

When reviewing your taxonomy, assess each attribute on:

  • Clarity / Conciseness

  • Description Comprehensiveness

  • Attribute Distinction

  • Overlapping Attributes (if any)

  • Final Score + Commentary

This structured review forces you to tighten definitions and reduce ambiguity — which directly improves classification performance.


Best practices for writing Monitor (Flag) criteria

Monitors do not compete. Each monitor runs independently as a yes/no check. Multiple monitors can flag the same conversation — and that’s fine.

Because of this, precision matters more than distinction.

1. Describe observable behavior, not inferred intent

  • Focus on what appears in the conversation.

  • Avoid: “Customer is frustrated”

  • Prefer: “Customer uses phrases such as ‘This is unacceptable,’ ‘I’m extremely disappointed,’ or ‘This is ridiculous.’”

The AI performs better when evaluating explicit signals rather than emotional interpretations.

2. Include concrete examples

  • Examples dramatically reduce ambiguity.

  • Use explicit phrasing patterns: “e.g., ‘cancel my subscription,’ ‘close my account,’ ‘delete my data’”

  • Examples anchor the model to real-world language.

3. Add explicit exclusions

Reducing false positives is critical for monitors.

Example: “Customer uses profanity. EXCLUDE: mild language such as ‘damn’ or ‘crap.’” If something should not trigger the monitor, say so clearly.

4. Use quantifiable thresholds

  • Avoid vague wording.

  • Bad: “Fin gives a short response.”

  • Better: “Fin response is fewer than 50 words.”

  • Specific thresholds improve consistency.

5. Break multi-step logic into numbered criteria

If your monitor depends on sequence or pattern, structure it clearly:

  1. Customer expresses frustration.

  2. Fin responds without acknowledging emotion.

  3. Customer repeats complaint.

This makes the logic deterministic and easier to evaluate.

6. Keep simple monitors simple

  • If the rule is straightforward, don’t overcomplicate it.

  • Example: “Fin suggests next steps (e.g., ‘Please try clearing your cache,’ ‘Log out and back in,’ ‘Click this link’).”

  • Clarity beats complexity.


Coming soon

We’re expanding Monitors with more powerful ways to detect issues, measure quality, and take action. Upcoming improvements include:

  • Structured sampling: Define a fixed sample size per Monitor, including recurring weekly or monthly samples, to support consistent QA scoring.

  • Advanced reporting: Filter reports by Monitor, scorecard, attributes, and scores directly within the reporting platform.

  • Balanced reviewer workload: Assign Monitors to multiple reviewers or teams to distribute manual review work evenly.

  • Scorecard rating transparency: See why an AI rating was applied to a scorecard attribute, or require reviewers to provide a reason for manual scores.

  • Out-of-the-box procedure monitors: Automatically track whether procedures are triggered and completed successfully, flagging connector failures, execution errors, user frustration signals, and escalation handling quality.

  • Real-time alerts: Get notified when conversations in a Monitor cross defined thresholds or fail a scorecard.

  • Pre-deployment scoring: Test changes in preview by evaluating conversations against scorecards before going live.

  • Human agent QA: Apply Monitors and scorecards to teammate conversations—not just Fin.


FAQs

If a conversation is added to a monitor and evaluated, what happens if it reopens later? Will it be evaluated again?

No, a conversation is evaluated only once per monitor. Conversations are added to a monitor based on the setting you’ve selected in the monitor configuration (for example, “Fin is done” or “Conversation is closed”). When the conversation reaches that state, it’s matched into the monitor and evaluated. If the conversation later reopens because the customer sends a new message, it won’t be re-matched or re-evaluated under the same monitor version. The original evaluation is the only one recorded.

Do monitors work for Fin Voice?

No, Fin Voice is not supported at the moment.

Are tickets evaluated by monitors?

No, neither customer tickets nor tracker tickets get matched into monitors.



