How we measure SLA adherence at Intercom

This guide outlines Intercom's comprehensive approach to measuring Service Level Agreement (SLA) adherence.

Written by Dawn
Updated over 4 months ago

At Intercom, we track SLA adherence using real customer activity - not just server health.

We use “heartbeat metrics” to monitor the core functionality of our platform. These metrics reflect whether customers can use key features of our products - they are a measure of real-world success rates within our product.

If a heartbeat metric drops, and the issue affects customer ability to use our core product, we count it as impacting our SLAs.

What is a ‘heartbeat metric’?

A heartbeat metric is a high-volume, real-time indicator of whether a core feature is working.

Examples include:

  • Messages created in Web or Mobile Messenger.

  • Teammate replies in the Inbox.

  • Fin text responses to customers.

  • Teammate interactions with the Inbox.

We monitor these constantly. When one drops below expected levels, we investigate immediately - even if users haven't reported a problem yet.
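As an illustration only (this is not Intercom's actual implementation), a heartbeat metric can be thought of as a success rate over real customer events - for example, the fraction of attempted messages that were actually created:

```python
# Illustrative sketch of a "heartbeat" as a real-world success rate.
# The function name and event counts are hypothetical, not Intercom's code.

def heartbeat_success_rate(attempted: int, succeeded: int) -> float:
    """Fraction of customer actions (e.g. messages created) that succeeded."""
    if attempted == 0:
        return 1.0  # assumption: no traffic is treated as healthy here
    return succeeded / attempted

# e.g. 9,940 messages created out of 10,000 attempted
rate = heartbeat_success_rate(10_000, 9_940)
print(f"{rate:.2%}")  # prints "99.40%"
```

Because the metric is driven by actual customer activity rather than server health checks, a drop in this rate surfaces problems customers are really experiencing.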


SLAs we track

We maintain two SLAs, and adherence to both is informed by our heartbeat metrics:

  • Fin SLA: Covers Fin’s ability to generate text responses and reply to customers on your behalf.

  • Core Platform SLA: Covers the usability of the Inbox for your team and the Messenger for your customers.

When do we count impact against these SLAs?

We count impact against our specific SLAs when:

  • There is a complete outage of our Fin or Core Platform product functionalities.

  • There is significant degradation in the experience of using our core features such as Fin, Inbox, and Messenger.

Due to our architecture and detection methods, our calculated SLA impact always reflects the worst-case scenario, even if not all customers are equally affected by a given issue.
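To make the worst-case accounting concrete, here is a hedged sketch of the underlying arithmetic. The 99.8% target below is hypothetical and is not taken from Intercom's SLA terms:

```python
# Illustrative arithmetic only; the SLA percentage is a made-up example,
# not Intercom's actual commitment.

def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period under a given SLA target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

# Worst-case accounting counts the full incident duration as impact,
# regardless of how many customers were actually affected.
print(round(allowed_downtime_minutes(99.8), 1))  # → 86.4
```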

For the full terms surrounding our SLAs, please reference our terms of service.

How we detect impact

With our heartbeat metrics, we use anomaly detection, not binary thresholds. This allows us to catch subtle degradations in the customer experience, not just complete outages, giving us a clearer picture of whether we’re meeting our commitments to you.
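For intuition, a minimal form of anomaly detection flags a metric that deviates from its own recent history, rather than comparing it to a fixed threshold. This sketch uses a simple z-score; real monitoring systems (Intercom's included, presumably) account for seasonality and traffic patterns:

```python
import statistics

def is_anomalous(history: list[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag a metric value that deviates sharply from recent history.

    A minimal z-score sketch, not Intercom's actual detection logic.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold

# Hypothetical messages-per-minute readings for the last few minutes:
history = [1000, 1020, 990, 1010, 995, 1005]
print(is_anomalous(history, 880))   # → True: degraded, though not an outage
print(is_anomalous(history, 1015))  # → False: within normal variation
```

The key property is that a partial degradation - traffic down 12% rather than 100% - still trips the alert, which is what distinguishes this approach from a binary up/down check.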

As part of this detection, we have a robust incident response program in place. If a heartbeat metric triggers an alert:

  • An incident is opened and engineers are paged. Our on-call process is available 24/7 to ensure we have full coverage.

  • We may roll back recent code changes automatically to get back to a stable state.

  • First responders are provided with suggested root causes via automation.

Recovery objectives

  • RPO (Recovery Point Objective): 0 – our infrastructure is architected with sufficient redundancy to ensure that no customer data is lost, even when incidents occur.

  • RTO (Recovery Time Objective): 8 hours – in the event of a major outage affecting availability, we aim to fully restore service within 8 hours across impacted regions or products.


Planned maintenance windows

Our recent system rearchitecture enables zero-downtime maintenance, allowing continuous customer operation even during system improvements. If this is not possible, we will notify customers a minimum of 24 hours in advance of any maintenance, as per our SLAs.

You can read more about the recent changes to our system architecture and the benefits they bring on our blog: Evolving Intercom’s Database Infrastructure: Lessons and Progress.
