How to write effective Monitor & Scorecard criteria

Best practices for defining clear, actionable criteria to evaluate teammate performance using monitors and scorecards.

Written by Dawn
Updated today

Writing effective criteria is what separates a Monitor that surfaces real issues from one that floods your queue with noise. This guide covers best practices for both Monitor flag criteria and Scorecard attribute descriptions. Monitors currently evaluate Fin AI Agent conversations only.

Note: Monitors is available as part of the Pro add-on.


Monitor flag criteria vs. scorecard attribute descriptions

These two types of criteria work differently, so they need to be written differently.

Monitor flag criteria

  • Purpose: decides which conversations get reviewed.

  • Logic: yes/no - each Monitor runs independently.

  • Key challenge: reduce false positives and false negatives.

Scorecard criteria descriptions

  • Purpose: defines how each conversation is evaluated.

  • Logic: competitive - the AI selects the single best match.

  • Key challenge: eliminate overlap between criteria values.


Best practices for writing Monitor flag criteria

Monitors run as independent yes/no checks. Multiple Monitors can flag the same conversation - and that is fine. Because of this, precision matters more than distinction.
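The independence described above can be sketched in code. This is a minimal illustration, not Intercom's implementation: the phrase lists and monitor names are placeholders, and simple substring checks stand in for the AI's judgment. The point is that each check runs on its own, so one conversation can trip several Monitors.

```python
# Illustrative placeholders - a real Monitor is described in plain language.
CANCELLATION_PHRASES = ("cancel my subscription", "close my account")
FRUSTRATION_PHRASES = ("this is unacceptable", "this is ridiculous")

def flags_cancellation(conversation: str) -> bool:
    text = conversation.lower()
    return any(p in text for p in CANCELLATION_PHRASES)

def flags_frustration(conversation: str) -> bool:
    text = conversation.lower()
    return any(p in text for p in FRUSTRATION_PHRASES)

MONITORS = {
    "Cancellation risk": flags_cancellation,
    "Customer frustration": flags_frustration,
}

def run_monitors(conversation: str) -> list[str]:
    """Every monitor runs independently; a conversation may match several."""
    return [name for name, check in MONITORS.items() if check(conversation)]
```

A message like "This is unacceptable - cancel my subscription now" would match both Monitors, and that is expected behavior.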

1. Describe observable behavior, not inferred intent

  • Focus on what appears in the conversation.

  • Avoid: Customer is frustrated

  • Prefer: Customer uses phrases such as This is unacceptable, I am extremely disappointed, or This is ridiculous.

The AI performs better when evaluating explicit signals rather than emotional interpretations.

2. Include concrete examples

  • Examples dramatically reduce ambiguity.

  • Use explicit phrasing patterns: e.g., cancel my subscription, close my account, delete my data

  • Examples anchor the model to real-world language.

3. Add explicit exclusions

Reducing false positives is critical for Monitors.

Example: Customer uses profanity. EXCLUDE: mild language such as damn or crap. If something should not trigger the monitor, say so clearly.
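The EXCLUDE pattern amounts to subtracting an exclusion list from a trigger list. In this hypothetical sketch the mild terms come from the example above, while the trigger list is a placeholder; a real Monitor describes both in prose rather than code.

```python
import re

TRIGGER_TERMS = {"damn", "crap", "strongword"}  # placeholder trigger list
MILD_TERMS = {"damn", "crap"}                   # explicitly excluded

def should_flag(message: str) -> bool:
    """Flag only if a trigger word remains after exclusions are removed."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    return bool((words & TRIGGER_TERMS) - MILD_TERMS)
```

"Well, damn" alone would not flag, because every matched term is on the exclusion list.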

4. Use quantifiable thresholds

  • Avoid vague wording.

  • Bad: Fin gives a short response.

  • Better: Fin response is fewer than 50 words.

  • Specific thresholds improve consistency.
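A quantifiable threshold is something that can be checked mechanically, which is why it evaluates consistently. A minimal sketch of the example criterion above:

```python
SHORT_RESPONSE_THRESHOLD = 50  # words, matching the example criterion

def is_short_response(response: str) -> bool:
    """"Fewer than 50 words" is checkable; "a short response" is not."""
    return len(response.split()) < SHORT_RESPONSE_THRESHOLD
```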

5. Break multi-step logic into numbered criteria

If your Monitor depends on sequence or pattern, structure it clearly:

  1. Customer expresses frustration.

  2. Fin responds without acknowledging emotion.

  3. Customer repeats complaint.

This makes the logic deterministic and easier to evaluate.
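The numbered pattern above can be read as an ordered check: each step must occur, and in sequence. In this sketch the keyword tests are illustrative placeholders standing in for the AI's judgment.

```python
def matches_pattern(messages: list[tuple[str, str]]) -> bool:
    """messages is a list of (speaker, text) pairs in conversation order."""
    steps = [
        # 1. Customer expresses frustration.
        lambda who, text: who == "customer" and "frustrat" in text,
        # 2. Fin responds without acknowledging emotion.
        lambda who, text: who == "fin" and "sorry" not in text,
        # 3. Customer repeats the complaint.
        lambda who, text: who == "customer" and "still" in text,
    ]
    step = 0
    for who, text in messages:
        if step < len(steps) and steps[step](who, text.lower()):
            step += 1
    return step == len(steps)
```

A conversation where Fin does acknowledge the emotion fails step 2, so the Monitor would not flag it even though the customer was frustrated twice.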

6. Keep it simple

  • If the rule is straightforward, do not overcomplicate it.

  • Example: Fin suggests next steps (e.g., Please try clearing your cache, Log out and back in, Click this link).

  • Clarity beats complexity.

7. Use 'explicitly' to require direct customer language

If your Monitor should only trigger when a customer directly states something — not just implies it — include the word "explicitly" in your criteria. Without it, the AI may infer intent from context and match conversations where the behavior was only suggested, not stated.

  • Without "explicitly": Customer requests a call back — could match "Can you connect me to the security team?" since the AI may infer this implies a request for direct contact.

  • With "explicitly": Customer explicitly requests a call back — only matches if the customer directly asks, e.g., "Can I get a call?" or "Please call me."

Tip: Use the Test Monitor tool to validate your criteria against real conversations before turning it on. Update the flag criteria and rerun the test until results accurately reflect what you want the Monitor to capture.


Best practices for writing scorecard criteria descriptions

Start with the core principle: criteria compete. The AI looks at the full list and selects the single best match for each criterion. Your job is to make that choice obvious.
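Competitive selection is the key contrast with Monitors: only the single highest-scoring criterion wins. In this sketch the numeric scores are stand-ins for the AI's judgment, not a real Intercom API; the second helper illustrates why overlap is the enemy - when two criteria score nearly the same, the "best match" becomes arbitrary.

```python
def pick_best_criterion(scores: dict[str, float]) -> str:
    """Return the one criterion with the highest match score."""
    return max(scores, key=scores.get)

def overlap_warning(scores: dict[str, float], margin: float = 0.1) -> bool:
    """If the top two scores are this close, the criteria likely overlap."""
    ranked = sorted(scores.values(), reverse=True)
    return len(ranked) > 1 and ranked[0] - ranked[1] < margin
```

If "Tone - Rude or Dismissive" scores 0.81 and "Accuracy - Wrong Answer" scores 0.35, the choice is obvious; two criteria at 0.62 and 0.58 signal definitions that need to be pulled apart.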

1. Use clear, concise names

  • Keep names short and specific. Someone reading the list should immediately understand the purpose without opening the description.

  • Bad: Customer Communication Issues

  • Better: Tone - Rude or Dismissive

2. Write comprehensive descriptions

Descriptions carry most of the classification signal.

  • Explicitly describe all conversation types that belong.

  • Include keywords, common phrasings, and examples.

  • Think through edge cases and include them.

  • Clarify what good and bad instances look like.

The description should make it easy for the AI to recognise real-world phrasing, not just abstract definitions.

3. Make criteria clearly distinct

Criteria within the same scorecard should not compete conceptually.

  • Avoid semantic overlap.

  • Ensure each attribute has a clear boundary.

  • If two attributes could reasonably apply for the same reason, refine one of them.

It is fine if a single conversation fits multiple criteria across the scorecard. What matters is that within each set of criteria, the values are clearly separable.

4. Evaluate quality systematically

When reviewing your taxonomy, assess each criterion on:

  • Clarity / conciseness

  • Description comprehensiveness

  • Criteria distinction

  • Overlapping criteria (if any)

  • Final score + commentary

This structured review forces you to tighten definitions and reduce ambiguity - which directly improves classification performance.


FAQs

How long should my flag criteria be?

There is no fixed length - the right length is whatever it takes to describe the behavior precisely. A simple Monitor might only need two or three sentences. A complex one (like detecting multi-step failure patterns) may need a structured, numbered description. Err on the side of more detail rather than less.

Can I use the same scorecard criteria across multiple scorecards?

Yes - criteria titles and descriptions are reusable. Once you have created a criterion, you can add it to multiple scorecards. Note that previous rating scores cannot be reused and will need to be set from scratch in each scorecard.

What is the difference between monitor flag criteria and a scorecard criteria description?

Monitor flag criteria determine whether a conversation gets pulled into a Monitor at all - they are a yes/no filter. Scorecard criteria descriptions define how each conversation is scored once it is in the Monitor. Think of the Monitor as the net and the scorecard as the ruler.

