Introduction: Error Rates and Context

Imagine receiving an alert: "10 errors detected in the payments service." You need to investigate. But when you open your dashboard, you see the service is handling 10,000 requests per minute—those 10 errors represent a 0.1% error rate. Meanwhile, your auth service also has 10 errors, but it's only handling 50 requests per minute—a 20% error rate where one in five login attempts is failing.

The same error count has completely different meanings depending on traffic volume. This is where error rates come in. By calculating the percentage of failed requests, you add context that helps you understand problem severity. A service experiencing 100 errors might be failing catastrophically if it receives only 200 total requests (a 50% error rate), or performing within acceptable bounds if it's processing 50,000 requests (0.2%).

This lesson teaches you to calculate error rate percentages from success and failure counters, create alerts with different severity thresholds, and understand how multi-threshold alerting enables intelligent notification routing. You'll build two alert rules monitoring the same metric: one that fires at 5% errors, and another that fires at 15% errors. This foundation prepares you for advanced routing patterns where different severity levels trigger different notification channels.

Understanding the requests_slo Table

Your PostgreSQL database contains a table called requests_slo that tracks service reliability by recording successes and failures in pre-aggregated time windows:
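
A minimal sketch of the schema is shown below; the column names come from the sample data, while the exact types are assumptions:

    -- Assumed DDL for the requests_slo table (types are illustrative)
    CREATE TABLE requests_slo (
        ts        timestamptz NOT NULL,  -- end of the one-minute collection window
        service   text        NOT NULL,  -- service name, e.g. auth, payments, search
        ok_count  integer     NOT NULL,  -- successful requests in this window
        err_count integer     NOT NULL   -- failed requests in this window
    );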

Each row represents a time window during which your monitoring system counted successful requests and failed requests separately. This pre-aggregation pattern is common in observability systems, particularly for high-traffic services where storing every individual request would create enormous data volumes.

Your monitoring system collects these counts every minute, so each service generates one row per minute with that minute's success and failure totals. Sample data:

    ts                  | service  | ok_count | err_count
    --------------------+----------+----------+----------
    2025-12-10 14:46:37 | auth     |      498 |         2
    2025-12-10 14:46:37 | payments |      407 |         1
    2025-12-10 14:46:37 | search   |      407 |         0
    2025-12-10 14:45:37 | auth     |      155 |         2
    2025-12-10 14:45:37 | payments |      289 |         0
    2025-12-10 14:45:37 | search   |      161 |         3
    2025-12-10 14:44:37 | auth     |      262 |         0
    2025-12-10 14:44:37 | payments |      476 |         2
    2025-12-10 14:44:37 | search   |      499 |         3

During the minute ending at 14:46, the auth service handled 498 successful requests and 2 failures. Payments had one failure among 408 requests, while search had zero failures among 407 requests. During the previous minute at 14:45, search experienced 3 failures among 164 requests, while auth had 2 failures among 157 requests. These raw counts tell part of the story, but calculating the percentage gives you the context needed for threshold-based alerting.
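
The error rate formula is err_count / (ok_count + err_count) * 100. As a quick sanity check for the 14:45 auth row, you could run a throwaway query like this (not part of the dashboard query you'll build next):

    -- 2 failures out of 157 total requests for auth at 14:45
    SELECT round(100.0 * 2 / (155 + 2), 2) AS auth_error_pct;  -- 1.27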

Building the Error Rate Query

To calculate error rates over time, you need to aggregate these pre-counted values into time buckets. You'll use the $__timeGroupAlias() macro to create dynamic time buckets that match your dashboard's zoom level—when viewing the last hour, it creates one-minute buckets; when viewing 24 hours, it might create five-minute or fifteen-minute buckets.

Start with the SELECT statement:
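A sketch of the opening clause, assuming the dashboard's interval variable as the bucket size:

    SELECT
      $__timeGroupAlias(ts, $__interval),  -- the macro aliases the bucket column as "time"
      service,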

This groups your data into time buckets and identifies which service each row represents. The calculation itself requires attention to an important edge case: what happens when there are no requests at all?

If both ok_count and err_count are zero, your denominator becomes zero and division by zero causes SQL errors. Handle this explicitly:
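
One way to write the guarded calculation; the error_rate alias is an assumption, so use whatever name fits your dashboard:

    CASE
      WHEN SUM(ok_count + err_count) = 0 THEN 0    -- no traffic: report 0% rather than divide by zero
      ELSE 100.0 * SUM(err_count)::double precision
           / SUM(ok_count + err_count)             -- percentage of failed requests
    END AS error_rate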

This CASE statement checks the total request count first. When the total is zero, it returns 0 as the error rate, since no requests means no errors. Otherwise, it calculates the percentage. Notice the 100.0 literal and the ::double precision cast—these force decimal division rather than integer division, preserving values like 2.5% instead of truncating to 2%.

Complete the query:
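
Assembled from the pieces above, the full query might look like this:

    SELECT
      $__timeGroupAlias(ts, $__interval),
      service,
      CASE
        WHEN SUM(ok_count + err_count) = 0 THEN 0
        ELSE 100.0 * SUM(err_count)::double precision / SUM(ok_count + err_count)
      END AS error_rate
    FROM requests_slo
    WHERE $__timeFilter(ts)
    GROUP BY 1, 2
    ORDER BY 1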

The WHERE $__timeFilter(ts) clause limits results to your evaluation window. The GROUP BY 1, 2 groups rows by time bucket and service name, ensuring the SUM calculations operate on the right subsets. The ORDER BY 1 returns the rows in chronological order.

How Multi-Threshold Alerting Works in Practice

You now have two alert rules monitoring the same error rate metric but with different thresholds and labels. Here's how they behave as conditions change:

When error rates are low (0-2%), both alerts remain in Normal state. When error rates climb to 7%, your warning alert transitions to Pending after the first check, then to Alerting after five minutes of sustained elevation. The notification arrives with label severity=warning. The critical alert remains Normal because 7% hasn't crossed the 15% threshold yet.

If error rates continue climbing to 18%, your critical alert now triggers. It enters Pending, then fires after its five-minute window. The notification includes label severity=critical. You now have two active alerts with different severities providing layered visibility into the problem.

When the problem is resolved and error rates drop back to 3%, both alerts return to Normal on their next evaluation. If the rate declines gradually, the critical alert clears first (as soon as the rate falls below 15%), followed by the warning alert (once it falls below 5%). Your notification channels receive resolution messages for both.

This multi-threshold approach provides graduated response to problems. The 5% threshold catches issues early while they're still manageable. The 15% threshold indicates severe degradation that may require different response patterns. The same metric powers both detection levels through different threshold configurations.

Labels and Future Routing Possibilities

The labels you've added—severity=warning and severity=critical—serve as metadata attached to your alerts. Right now, both alerts send notifications to the same contact point. But in more advanced Grafana configurations, notification policies can use these labels to route alerts differently.

A notification policy might route alerts with severity=warning to a Slack channel where your team monitors during business hours, while alerts with severity=critical get sent to PagerDuty to page the on-call engineer regardless of time. You could add a third alert at 25% with label severity=emergency that triggers multiple notification channels simultaneously.

This separation of concerns—alert rules detect and evaluate conditions, labels classify and describe alerts, policies route notifications—makes alert management scalable. You define your detection logic once in each alert rule. You define your routing logic once in notification policies. Changes to where notifications go don't require editing dozens of alert rules; you adjust the routing policy instead.

The multi-threshold pattern you've built here extends naturally to more complex scenarios. You might create service-specific thresholds—perhaps auth services tolerate only 2% errors while batch processing services accept 10%. You might add additional severity levels. The pattern remains consistent: same metric, different thresholds, different labels, enabling flexible notification routing.

Summary

You've learned to calculate error rate percentages from success and failure counters, handling division-by-zero edge cases and ensuring proper decimal precision. The query aggregates pre-counted successes and errors into time buckets, calculates percentages with appropriate type casting, and returns values that Grafana can evaluate against thresholds.

The multi-threshold alert strategy demonstrates how multiple alert rules can monitor the same metric at different severity levels. By creating two alerts with different thresholds (5% and 15%) and different labels (severity=warning and severity=critical), you've built a graduated detection system. The labels provide metadata that can be used for routing decisions in more advanced configurations.

This completes your journey through Grafana alerting fundamentals. You started with basic threshold alerts, progressed through event-based filtering, learned advanced evaluation with reduce functions and annotations, and finished with multi-threshold strategies for percentage-based monitoring. You now have the skills to build alerts that detect problems at appropriate severity levels and provide the context needed for effective incident response.
