Your first CPU alert used the Last reduce function—it checked the most recent measurement at evaluation time. If CPU was at 85% when the alert checked, the condition triggered. If it had dropped to 75% by the next check, the condition cleared. This works well for relatively stable metrics that change gradually.
But here's the problem. Consider your edge server's latency over a five-minute period:
45ms → 48ms → 380ms → 52ms → 47ms
If your alert checks at the end of that window, it sees 47ms and concludes everything is fine. The 380ms spike that frustrated users never registers because it happened between evaluations. Your alert remained Normal while customers experienced slow page loads.
This lesson teaches you to choose the right reduce function for different problem patterns. You'll learn when Max catches critical spikes that Last misses, how Mean identifies trending problems while filtering noise, when Min helps detect unexpected drops, and how alert annotations transform generic notifications into actionable messages with context, runbook links, and precise measurements.
Every Grafana alert reduces time-series data into a single value for threshold comparison. The reduce function you choose determines which aspect of your data gets evaluated. Think of it as asking different questions about the same measurements.
Last - "What's happening right now?"
When you use Last, you're checking current state. Is the disk 90% full at this exact moment? The catch: It misses historical context within your evaluation window. A brief CPU spike to 98% that resolved before the next check goes undetected because Last only sees the current 45%.
Max - "What's the worst it got during this period?"
For latency monitoring, this is often what you actually care about. Even a single request taking 2 seconds represents a terrible user experience for whoever made that request. Using Max ensures your alert catches every spike, no matter how brief. If any measurement in your evaluation window exceeded 300ms, Max surfaces that value for threshold comparison.
Mean - "What's the typical level?"
This smooths out single-point anomalies while catching sustained elevation. Consider latency measurements of 180ms, 185ms, 190ms, 195ms, 188ms—nothing individually crosses your 200ms threshold, but a mean of roughly 188ms shows the entire window is elevated and trending toward a problem. Mean helps you catch gradual degradation before it becomes critical.
Min - "What's the best it achieved?"
You'll build an alert that catches any latency spike above 300ms, even if it only lasted a few seconds.
Navigate to Alerting > Alert rules > New alert rule and enter the name "Latency Spike Detection". Select Postgres Local as your data source.
Your query follows the same pattern you've used before:
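If you've been following along with the same Postgres setup as earlier units, it looks something like the sketch below. The table and column names (latency_measurements, time, metric, value) are assumptions; adjust them to your schema. What matters is that the query returns a timestamp, the host name in the metric column, and the latency value.

SELECT
  "time",
  metric,   -- host name, e.g. edge-eu-west
  value     -- latency in milliseconds
FROM latency_measurements
WHERE $__timeFilter("time")
ORDER BY "time";

The $__timeFilter macro is Grafana's standard way of restricting a Postgres query to the evaluation window.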
The critical difference comes in the alert condition configuration. In expression B (Reduce), select Max as your function instead of Last. Keep Input as A and Mode as Strict. In expression C (Threshold), set Input to B and condition to IS ABOVE 300.
Every minute, Grafana executes your query and receives all latency measurements from the last minute. The Max function scans through those measurements—perhaps twenty or thirty individual pings—and extracts the highest value. That maximum value then gets compared against your 300ms threshold. If even one measurement hit 380ms while the others were all below 100ms, the alert condition triggers because Max returned 380ms.
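If it helps to see the arithmetic, this is roughly what that reduce step computes, written as a Postgres query against the assumed latency_measurements table. Grafana performs the reduction itself after query A returns; the sketch is only for intuition.

-- Roughly what "Reduce A with Max" computes over the last minute, per host
SELECT metric, max(value) AS worst_latency_ms
FROM latency_measurements
WHERE "time" > now() - interval '1 minute'
GROUP BY metric;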
Create a new evaluation group named "1-minute-spike-check" with a 1-minute evaluation interval. Set the pending period to 2 minutes.
Just like your CPU alert from Unit 1, this alert will transition through Normal, Pending, and Alerting states as conditions change. The key difference is that Max ensures you catch every spike rather than just the most recent measurement. If you need a refresher on how alert states work, review the "How Your Alert Rule Works" section in Unit 1.
Select any folder for organization and choose grafana-default-email as your contact point, but don't save yet—you'll enrich this alert with annotations in the next section.
The 3 AM problem: Your alert fires. The on-call engineer receives a notification that says "Latency Spike Detection is FIRING"—technically accurate but frustratingly vague. Which host is affected? How bad is the latency? What should I check first? The engineer starts their investigation by opening multiple dashboards, searching documentation, and trying to reconstruct context from fragmented information sources.
Annotations solve this problem by embedding contextual information directly into the alert notification. Think of them as metadata that travels with the alert payload, enriching the notification message with everything an engineer needs to respond effectively. Annotations are simple key-value pairs, but their impact on incident response is significant.
Scroll down to the Add annotations section. You'll create five annotations that transform your generic alert into a self-contained incident report.
Start by clicking Add annotation. Set the key to summary and the value to Edge server latency spike detected. This provides a human-readable one-line description that appears prominently in notifications. Recipients immediately understand what failed without needing to decode technical jargon.
Add a second annotation with key description and value Latency exceeded 300ms threshold. Check network conditions and server load. This gives the on-call engineer immediate context about what the alert means and suggests initial troubleshooting directions.
Create a third annotation with key runbook_url and value https://wiki.company.com/runbooks/latency-troubleshooting. Replace this example URL with your actual internal documentation link. Engineers can click directly to detailed investigation procedures rather than searching wikis. This single link saves critical minutes during incident response.
Now you'll use annotation templates to include dynamic data. Add an annotation with key affected_host and value {{ $labels.metric }}. The double curly braces indicate a template variable that Grafana replaces with actual data when the alert fires. The $labels.metric variable contains the value from your query's metric column—the host name. If edge-eu-west triggered the alert, the notification shows affected_host: edge-eu-west. One alert rule monitors all hosts, but each notification identifies exactly which host has the problem.
For the fifth annotation, embed the measured value itself with {{ $values.B }}, which holds the result of your Reduce expression B. Any descriptive key works; this lesson uses current_latency as an illustrative name, with value {{ $values.B }}ms. When the alert fires, the notification reports the actual maximum latency that crossed the threshold, so the engineer immediately sees how severe the spike was.
When your alert fires, Grafana sends notifications to all configured contact points. The format varies by channel—email, Slack, PagerDuty, or others—but annotations travel with every notification.
Your email might display:
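(Exact layout depends on your contact point and notification template; this sketch assumes the annotations above, including the illustrative current_latency key.)

[FIRING:1] Latency Spike Detection

summary: Edge server latency spike detected
description: Latency exceeded 300ms threshold. Check network conditions and server load.
runbook_url: https://wiki.company.com/runbooks/latency-troubleshooting
affected_host: edge-eu-west
current_latency: 380ms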
Notice how the template variables got replaced with actual values. The {{ $labels.metric }} became edge-eu-west, and {{ $values.B }} became 380. The engineer receiving this notification immediately knows which host failed, how severe the problem is, and where to find troubleshooting procedures. They can start investigating without opening dashboards or reconstructing context from multiple systems.
Let's examine how different reduce functions behave with identical data to understand when to use each.
Sample data: Latency measurements over five minutes: 45ms, 48ms, 380ms, 52ms, 47ms
Scenario 1: Using Last
If you configured your alert with Last and it checks at the end of this period, it evaluates 47ms—the most recent measurement. Since 47ms is below your 300ms threshold, the condition is false and the alert state remains Normal. The problem is clear: you completely missed the 380ms spike that occurred mid-window. Users who made requests during that spike experienced terrible performance, but your monitoring reported everything was fine.
Scenario 2: Using Max
If you configured the same alert with Max, it evaluates 380ms—the highest measurement in the window. Since 380ms exceeds your 300ms threshold, the condition is true and the alert enters Pending state. If the next evaluation also detects spikes, the alert fires. You successfully caught the brief performance degradation that impacted users.
Scenario 3: Using Mean
If you configured it with Mean, it evaluates 114.4ms—the mean of all measurements. Since 114.4ms is well below 300ms, the condition is false. This might seem like a missed detection, but Mean serves a different purpose: it filters out single-point anomalies to catch trending problems. If all your measurements were elevated—say 180ms, 185ms, 190ms, 195ms, 188ms—Mean would report roughly 188ms, sustained elevation that a lower, average-oriented threshold could flag even though no single measurement looks dramatic. This helps you catch gradual degradation before it becomes critical.
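You can reproduce this arithmetic directly in Postgres to see what each reduce function would hand to the threshold check for the sample window. Grafana does the reduction itself; this is just a worked example.

-- What each reduce function reports for 45, 48, 380, 52, 47
SELECT
  max(value)           AS max_ms,   -- 380: what Scenario 2 evaluates
  round(avg(value), 1) AS mean_ms,  -- 114.4: what Scenario 3 evaluates
  min(value)           AS min_ms    -- 45: what Min would report
FROM (VALUES (45), (48), (380), (52), (47)) AS samples(value);

Last is not an aggregate here; it is simply the final measurement, 47, which is what Scenario 1 evaluates.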
You've learned two powerful enhancements beyond basic threshold checking.
Reduce Functions Match Alert Intent. Choosing the right reduce function determines what patterns your alert catches—Max for worst-case spike detection, Mean for identifying trending problems, Last for current state monitoring. Each function asks a different question about your data, and matching that question to your alerting needs prevents both false negatives (missed problems) and false positives (alert fatigue from normal variations).
Annotations Create Actionable Incident Reports. By embedding context, runbook links, and measured values directly in the alert, you eliminate the need for engineers to reconstruct information during incident response. Template variables like {{ $labels.metric }} and {{ $values.B }} make annotations dynamic, providing specific details about what triggered each alert while maintaining a single rule that monitors multiple entities.
In the practice exercises ahead, you'll configure alerts with different reduce functions and observe how each behaves under various data patterns. You'll also create rich annotations using template variables to make your alerts self-documenting for on-call teams.
