In the previous lesson, you monitored CPU usage—a number that goes up and down. You asked: "Is CPU above 80%?" This works great for metrics.
But what about application logs? Your services generate thousands of log entries every minute: "User logged in successfully," "Payment processed," "Database connection failed," "API request completed." You don't want to measure these—you want to filter them. You need alerts that say "notify me when ERROR logs appear" rather than "notify me when a number is too high."
This lesson shows you how Grafana handles event-based alerts using multi-condition filters.
Before building any alert, understand this rule: alert on symptoms, not root causes.
A symptom is what users experience. Users can't log in. That's a symptom. The root cause might be database connection pool exhaustion, but that's a technical detail one layer removed from user impact.
Why does this matter? If you alert on every technical anomaly—CPU spikes, memory increases, cache misses—you'll get hundreds of alerts daily. Engineers start ignoring them. Critical alerts get lost in noise. But if you alert when users can't log in, that's always worth investigating.
Grafana excels at symptom-based monitoring because you can combine conditions. Instead of alerting on a single database error (which might be a transient blip), you can alert when ERROR logs appear in your auth service AND those errors persist for more than 2 minutes AND the error rate exceeds 5% of total requests. This multi-condition approach filters out noise.
Your PostgreSQL database has a log_events table that captures logs from your microservices:
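The exact columns depend on your setup; a minimal sketch of the schema, assuming a timestamp column `ts`, a `level`, a `service` name, and a message (the `id` and `message` columns are illustrative assumptions), might look like this:

```sql
-- Hypothetical log_events schema; only ts, level, and service are referenced
-- later in this lesson, the rest is illustrative.
CREATE TABLE log_events (
    id      BIGSERIAL PRIMARY KEY,
    ts      TIMESTAMPTZ NOT NULL DEFAULT now(),  -- when the event occurred
    level   TEXT NOT NULL,                       -- 'INFO', 'WARN', 'ERROR', ...
    service TEXT NOT NULL,                       -- e.g. 'auth', 'payments'
    message TEXT NOT NULL                        -- the raw log line
);
```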
If you query recent data, you might see:
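For example, with a query like the one below (the sample rows in the comment are illustrative, not real output):

```sql
SELECT ts, level, service, message
FROM log_events
ORDER BY ts DESC
LIMIT 5;

-- Illustrative output:
-- ts                  | level | service  | message
-- 2024-05-14 14:22:58 | INFO  | auth     | User logged in successfully
-- 2024-05-14 14:22:57 | INFO  | payments | Payment processed
-- 2024-05-14 14:22:55 | ERROR | auth     | Database connection failed
-- 2024-05-14 14:22:54 | INFO  | api      | API request completed
```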
Notice how INFO logs vastly outnumber ERROR logs—maybe a 1000:1 ratio in production. Most operations succeed, so you need filtering to focus only on the errors that matter.
Your alert query needs to return a numeric value that Grafana can evaluate. While you can view individual log rows in Explore or table panels, alert rules need aggregated data they can compare against thresholds.
For log-based alerts, the most common approach is to count matching rows. Start by counting all ERROR logs:
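A sketch of that starting point, assuming the log_events table described above:

```sql
-- Count every ERROR log ever recorded (no time bound yet).
SELECT COUNT(*)
FROM log_events
WHERE level = 'ERROR';
```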
This counts how many ERROR logs exist, but it searches all historical data. You need to focus on recent logs using Grafana's time filter macro:
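Adding the macro as a second condition might look like this:

```sql
-- $__timeFilter(ts) expands to a BETWEEN clause covering the dashboard or
-- alert evaluation window, so only recent rows are counted.
SELECT COUNT(*)
FROM log_events
WHERE level = 'ERROR'
  AND $__timeFilter(ts);
```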
The $__timeFilter(ts) macro limits results to your alert's evaluation window, like "last 5 minutes." Both conditions must be true—that's what the AND operator means. You're building a multi-condition filter.
Here's your complete basic query:
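One way to write it, aliasing the count as `value` so there is a single numeric column for the alert rule to evaluate (the alias name is a convention, not a requirement):

```sql
-- Count recent ERROR logs and expose the result as "value" for the alert rule.
SELECT COUNT(*) AS value
FROM log_events
WHERE level = 'ERROR'
  AND $__timeFilter(ts);
```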
When this runs, you might get a result like value = 3, meaning three ERROR logs occurred in your evaluation window. Grafana can evaluate this number against your alert threshold. If you set a threshold of 0 (fire when any errors exist), the alert triggers when the count is greater than 0. If you set a threshold of 5 (fire only on persistent errors), the alert triggers when the count exceeds 5.
Want to focus on specific services? Add more conditions. For payment service errors only:
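For example (assuming the service column stores the name 'payments'):

```sql
-- Only count errors emitted by the payments service.
SELECT COUNT(*) AS value
FROM log_events
WHERE level = 'ERROR'
  AND service = 'payments'
  AND $__timeFilter(ts);
```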
Or monitor multiple critical services:
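For example, using an IN list (the service names are placeholders for your own):

```sql
-- Count errors across a set of critical services.
SELECT COUNT(*) AS value
FROM log_events
WHERE level = 'ERROR'
  AND service IN ('auth', 'payments', 'checkout')
  AND $__timeFilter(ts);
```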
Each additional condition refines your focus, helping ensure the alert only fires for genuinely important scenarios. The query returns a single number—the count of matching errors—which the alert engine uses to determine whether your threshold is exceeded.
Before understanding how Grafana evaluates your queries, you need to understand the three states an alert can be in: Normal, Pending, and Alerting.
Normal means everything is okay. Your query returns a value that doesn't exceed the threshold. For example, if your query counts ERROR logs and returns 0, and your threshold is "greater than 0," the alert stays in Normal state. No notifications are sent. This is where your alert should spend most of its time.
Pending means the threshold has been exceeded, but not long enough to confirm a real problem. If your query suddenly returns 5 errors and your threshold is 0, the alert moves to Pending. But it won't send notifications yet—it's waiting to see if this is a brief spike or a sustained issue. You configure the pending duration in your alert rule, typically 1-5 minutes. If the condition remains true for the entire pending period, the alert transitions to Alerting. If the condition resolves before the pending period expires, the alert returns to Normal without ever sending a notification.
Alerting means the threshold has been exceeded for longer than the pending period. Notifications are sent to your configured contact points. The alert stays in Alerting state as long as the condition remains true. When your query no longer exceeds the threshold (like when errors stop and the count drops back to 0), the alert immediately returns to Normal state and can optionally send a "resolved" notification.
Here's how state transitions work with a concrete example:
- Normal → Pending: At 14:23:00, your query starts returning count = 3 (errors exist). Your threshold is 0 and your pending period is 2 minutes. The alert enters Pending state.
- Pending → Alerting: At 14:25:00 (2 minutes later), if errors still exist (count still > 0), the alert transitions to Alerting. Notifications are sent.
- Alerting → Normal: At 14:27:00, your query returns count = 0 (errors stopped). The alert immediately returns to Normal. If you've configured "send resolved notifications," Grafana sends an "all clear" message.
The pending period prevents alert noise. Without it, a single transient error (lasting 5 seconds) would trigger a notification. With a 2-minute pending period, only errors that persist meaningfully long enough to indicate a real problem cause notifications.
Your CPU alert from Unit 1 evaluated a numeric condition: "Is 85 greater than 80?" Log alerts work exactly the same way, but instead of comparing measurements, you're comparing counts.
When your COUNT(*) query returns 0, there are no matching errors within your evaluation window. The alert stays in Normal state because the condition "count > 0" is false. When the query returns 1 or higher, errors exist and the condition becomes true. If your alert threshold is set to 0, the alert moves to Pending state when any error appears.
This counting approach gives you more flexibility than simple boolean logic. You can distinguish between a single transient error (count = 1) and a serious problem (count > 50). Set different thresholds for different severity levels: a count above 5 might send a Slack notification, while a count above 50 might page your on-call engineer. The numeric count becomes the foundation for intelligent routing.
The pending period acts as your noise filter. If you set a 2-minute pending period, the error count must exceed your threshold continuously for 2 minutes before the alert transitions from Pending to Alerting. A brief burst of 3 errors that resolves in 30 seconds won't trigger notifications because it doesn't persist through the pending period. But if errors continue occurring for the full 2 minutes, the alert fires and sends notifications.
This persistence requirement is especially important for log alerts. Services often experience transient hiccups—a single failed request due to network jitter, a momentary database connection timeout that immediately retries successfully. You don't want alerts for these blips. The pending period filters them out while still catching genuine problems that persist over time.
Always test alerts before going live. After writing your query, click the Preview button at the bottom of the alert rule form in Grafana.
Preview shows you whether your query syntax is correct, what the current alert state would be, and what numeric value the alert sees. If preview shows "Alert would be FIRING" with a count of 42, but you know there are no real problems, something is wrong with your filtering logic. Maybe you forgot the level = 'ERROR' condition and you're counting all log levels. Catching this during testing saves you from alert fatigue later.
The testing workflow is straightforward: write your query, click Preview, review the alert state and count value, adjust conditions if needed, preview again, and only save when preview confirms correct behavior. Two minutes of testing saves hours of troubleshooting.
When creating your alert, you'll add labels to control routing. Labels are key-value pairs like severity=high or service=payments. They don't affect when alerts fire—they affect where notifications go.
You'll combine labels with notification policies in later units. For example, alerts with severity=critical might page the on-call engineer via PagerDuty, while severity=warning alerts go to Slack without paging anyone. Alerts labeled service=payments can route to the payments team's channel regardless of severity.
For your ERROR log alerts, consider labels like severity=high (since errors are confirmed failures), service=<name> (to identify which service is affected), and symptom=yes (to mark this as a user-facing problem). This ensures the right urgency reaches the right people.
You'll also select a contact point—the actual notification channel where alerts are sent. This might be email, Slack, PagerDuty, or other integrations. We'll cover contact points in detail in later units.
You've learned how to create multi-condition filtered alerts for events instead of metrics. The key shift is focusing on symptoms (user problems) rather than causes (technical details). You combine WHERE conditions using AND/OR logic with $__timeFilter() to select only the logs you care about from recent activity. Unlike the raw queries you write in Explore, alert queries count matching events and return a numeric value that Grafana can evaluate against thresholds.
You now understand how alerts transition between Normal, Pending, and Alerting states based on evaluation intervals and pending periods. This state machine prevents alert fatigue by filtering out transient problems while catching persistent issues that need attention.
Always test with Preview before deploying, and use labels to control alert routing.
Now it's time to build these alerts yourself. The practices ahead will have you counting ERROR logs, refining conditions to reduce noise, and testing your alerts against real data. You'll see firsthand how multi-condition filters transform noisy log streams into actionable alerts. Let's get started.
