# What is Datadog Anomaly Detection?

## Overview
Datadog Anomaly Detection is an AI-powered monitoring feature that automatically identifies when metrics behave differently than expected based on historical patterns. Instead of setting static thresholds, Datadog learns normal behavior patterns (including trends, seasonality, and time-of-day variations) and alerts you when metrics deviate significantly.
## Why Use Anomaly Detection?

### Traditional Threshold Alerts vs Anomaly Detection
| Traditional Thresholds | Anomaly Detection |
|---|---|
| Static: "Alert if CPU > 80%" | Dynamic: "Alert if CPU deviates from normal" |
| Requires manual tuning | Learns patterns automatically |
| Misses seasonal patterns | Accounts for time-of-day, day-of-week |
| High false positives | Reduces alert fatigue |
| One-size-fits-all | Adapts to each metric |
### Example Scenario
Problem: Your API traffic spikes to 10,000 req/s every Monday at 9 AM (normal), but also spikes to 5,000 req/s on Sunday at 3 AM (suspicious).
- Static Threshold (10,000): Misses the Sunday anomaly ✗
- Anomaly Detection: Flags Sunday spike as unusual ✓
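To make the comparison concrete, here is a toy sketch (not Datadog's actual algorithm) of why a per-time-slot baseline catches the Sunday spike that a static threshold misses. The baseline values are hypothetical, taken from the scenario above:

```python
# Toy illustration: static threshold vs. a time-aware baseline.
# Hypothetical typical traffic (req/s) keyed by (day, hour).
baseline = {
    ("Mon", 9): 10_000,  # normal Monday-morning peak
    ("Sun", 3): 200,     # quiet Sunday night
}

STATIC_THRESHOLD = 10_000

def static_alert(value):
    """Classic threshold: fire only above a fixed level."""
    return value > STATIC_THRESHOLD

def anomaly_alert(day, hour, value, tolerance=3.0):
    """Flag values far above the expected level for that time slot."""
    expected = baseline[(day, hour)]
    return value > tolerance * expected

# Sunday 3 AM at 5,000 req/s:
print(static_alert(5_000))             # False -> static threshold misses it
print(anomaly_alert("Sun", 3, 5_000))  # True  -> baseline flags it
```

The same 5,000 req/s value is "fine" to the static rule but wildly abnormal relative to what Sunday at 3 AM usually looks like, which is exactly the gap anomaly detection closes.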
## How It Works

### Algorithm Options
Datadog offers three anomaly detection algorithms:
| Algorithm | Best For | How It Works |
|---|---|---|
| Basic | Quick changes, no seasonality | Simple lagging rolling quantile - fast adaptation |
| Agile | Seasonal metrics with level shifts | SARIMA-based - balances seasonality & responsiveness |
| Robust | Stable seasonal patterns | Seasonal-trend decomposition - resistant to outliers |
#### 1. Basic Algorithm

```text
Use Case: Metrics that change quickly, with minimal patterns
Example: New feature adoption rate
Pros:
  ✓ Adapts quickly to changes
  ✓ Simple, fast computation
Cons:
  ✗ No seasonality awareness
  ✗ Can be too sensitive
```
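The lagging-rolling-quantile idea behind Basic can be sketched in a few lines. This is a minimal toy version for intuition, not Datadog's implementation: recent values define a "normal" band, and a new point outside that band is anomalous.

```python
from collections import deque

# Toy lagging rolling-quantile detector in the spirit of "Basic".
class RollingQuantileDetector:
    def __init__(self, window=20, lower_q=0.05, upper_q=0.95):
        self.history = deque(maxlen=window)  # only recent points matter
        self.lower_q, self.upper_q = lower_q, upper_q

    def is_anomalous(self, value):
        anomalous = False
        if len(self.history) >= 5:  # need some history first
            data = sorted(self.history)
            lo = data[int(self.lower_q * (len(data) - 1))]
            hi = data[int(self.upper_q * (len(data) - 1))]
            anomalous = not (lo <= value <= hi)
        self.history.append(value)  # lagging: update after checking
        return anomalous

detector = RollingQuantileDetector()
for v in [10, 11, 10, 12, 11, 10, 11, 12]:
    detector.is_anomalous(v)      # build up "normal" history
print(detector.is_anomalous(50))  # True: far outside the recent range
```

Because the window is short and there is no seasonal model, this adapts fast but will also flag legitimate daily peaks, matching the pros and cons listed above.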
#### 2. Agile Algorithm (Most Common)

```text
Use Case: Metrics with daily/weekly patterns
Example: Website traffic, API requests
Pros:
  ✓ Detects seasonality (time-of-day, day-of-week)
  ✓ Adjusts to gradual changes
  ✓ Balances responsiveness and stability
Cons:
  ✗ May adapt to a long-lasting anomaly as if it were the new normal
```
#### 3. Robust Algorithm

```text
Use Case: Very stable seasonal metrics
Example: Batch job execution times
Pros:
  ✓ Ignores transient spikes
  ✓ Very stable predictions
  ✓ Good for well-established patterns
Cons:
  ✗ Slow to adapt to genuine changes
  ✗ May miss gradual drift
```
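The seasonality-aware algorithms can be illustrated with a toy per-slot baseline. This is not Datadog's SARIMA or decomposition machinery, just a sketch of the core idea: learn what each (day, hour) slot normally looks like, using a median so outliers don't pollute the baseline (the "robust" trait), and flag large relative deviations.

```python
import statistics
from collections import defaultdict

# Toy seasonality-aware detector: per-(day, hour) median baseline.
class SeasonalBaselineDetector:
    def __init__(self, tolerance=0.5):
        self.samples = defaultdict(list)  # (day, hour) -> observed values
        self.tolerance = tolerance        # allowed fractional deviation

    def observe(self, day, hour, value):
        self.samples[(day, hour)].append(value)

    def is_anomalous(self, day, hour, value):
        history = self.samples[(day, hour)]
        if len(history) < 3:
            return False                       # not enough history yet
        expected = statistics.median(history)  # median resists outliers
        return abs(value - expected) > self.tolerance * expected

det = SeasonalBaselineDetector()
for week in range(4):                      # four weeks of Monday-9AM data
    det.observe("Mon", 9, 10_000 + week * 50)
print(det.is_anomalous("Mon", 9, 10_100))  # False: within normal range
print(det.is_anomalous("Mon", 9, 20_000))  # True: double the usual load
```

The trade-offs listed above fall out of this structure: a median baseline shrugs off transient spikes but is slow to accept a genuine level shift, since half the history has to move before the median does.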
## Setting Up Anomaly Detection in Datadog

### Creating an Anomaly Monitor

#### Step 1: Choose a Metric

```text
Metric: avg:system.cpu.user{*}
```

#### Step 2: Select an Algorithm

```text
Algorithm: Agile (recommended for most use cases)
Seasonality: Auto-detect
```

#### Step 3: Configure Alert Conditions

```text
Alert threshold: 3 (trigger if the metric is 3 standard deviations away)
Warning threshold: 2
Evaluation window: Last 5 minutes
```

#### Step 4: Set Alert Preferences

```text
Notify: #ops-alerts Slack channel
Message: "CPU usage anomaly detected on {{host.name}}"
```
### Monitor Configuration Example

```json
{
  "name": "Anomalous API Response Time",
  "type": "metric alert",
  "query": "avg(last_15m):anomalies(avg:api.response_time{env:prod}, 'agile', 2) >= 1",
  "message": "API response time is behaving abnormally.\n\nCurrent: {{value}}\nExpected range: {{threshold}} - {{max_threshold}}\n\n@slack-ops-alerts",
  "tags": ["env:prod", "team:backend"],
  "options": {
    "thresholds": {
      "critical": 1,
      "warning": 0.5
    },
    "notify_no_data": false,
    "evaluation_delay": 60
  }
}
```
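A payload like this can also be built programmatically, which keeps the `anomalies(metric, algorithm, deviations)` query string consistent across monitors. The sketch below only constructs the request body; actually creating the monitor means POSTing it to Datadog's Monitors API (`/api/v1/monitor`) with valid `DD-API-KEY` and `DD-APPLICATION-KEY` headers, which is omitted here:

```python
# Sketch: assemble an anomaly-monitor payload matching the JSON example above.
def anomaly_monitor(name, metric_query, algorithm="agile", deviations=2,
                    window="last_15m", message="", tags=None):
    # anomalies(<metric>, '<algorithm>', <deviations>) wraps the metric query;
    # ">= 1" fires when the anomalous fraction of the window hits the threshold.
    query = (f"avg({window}):anomalies({metric_query}, "
             f"'{algorithm}', {deviations}) >= 1")
    return {
        "name": name,
        "type": "metric alert",
        "query": query,
        "message": message,
        "tags": tags or [],
        "options": {
            "thresholds": {"critical": 1, "warning": 0.5},
            "notify_no_data": False,
            "evaluation_delay": 60,  # wait for late-arriving points
        },
    }

monitor = anomaly_monitor(
    name="Anomalous API Response Time",
    metric_query="avg:api.response_time{env:prod}",
    message="API response time is behaving abnormally. @slack-ops-alerts",
    tags=["env:prod", "team:backend"],
)
print(monitor["query"])
# avg(last_15m):anomalies(avg:api.response_time{env:prod}, 'agile', 2) >= 1
```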
## Watchdog: AI-Powered Anomaly Detection

### What is Watchdog?
Watchdog is Datadog's AI engine that automatically detects anomalies across your entire infrastructure without requiring manual monitor configuration.
### Watchdog Capabilities

**Automatic detection:**
- Scans all metrics, traces, and logs
- Identifies anomalies using ML algorithms
- Requires no configuration

**Root cause analysis:**
- Correlates anomalies across services
- Identifies deployment impacts
- Highlights affected components

**Prioritization:**
- Ranks anomalies by severity
- Filters noise
- Focuses on actionable alerts
### Watchdog Alert Example

```text
🔴 Critical Anomaly Detected
Service: payment-api
Metric: Error rate increased by 450%
Time: 2026-02-27 14:32 UTC
Impact:
  - 1,250 failed transactions
  - Affecting 340 users
Correlation:
  - Deployment "v2.3.1" rolled out 5 minutes prior
  - Database connection pool saturation detected
Recommendation: Roll back the deployment
```
## Practical Use Cases

### 1. Application Performance Monitoring (APM)

**Metric:** API latency (p99)

**Configuration:**

```text
Algorithm: Agile
Deviations: 3 (alert if p99 is more than 3 standard deviations from baseline)
Seasonality: Auto (accounts for business hours)
```

**Why:** Detects performance degradation before users complain.

### 2. Infrastructure Monitoring

**Metric:** Memory usage

**Configuration:**

```text
Algorithm: Robust
Deviations: 2
Seasonality: None
```

**Why:** Catches memory leaks early (a gradual climb above the baseline).

### 3. Business Metrics

**Metric:** Order conversion rate

**Configuration:**

```text
Algorithm: Agile
Deviations: 2
Seasonality: Weekly (accounts for weekday/weekend patterns)
```

**Why:** Alerts when conversion suddenly drops, whether from a bug or an external factor.

### 4. Security Monitoring

**Metric:** Failed login attempts

**Configuration:**

```text
Algorithm: Basic
Deviations: 4
Seasonality: None
```

**Why:** Flags brute-force attacks quickly.
## Best Practices

### Choosing the Right Algorithm

**Use Basic when:**
- The metric is new (less than about a week of data)
- There are no clear patterns
- You need an immediate response to changes

**Use Agile when:**
- The metric has daily or weekly seasonality
- Traffic patterns vary by time of day
- In doubt: it is the most common choice

**Use Robust when:**
- Patterns are very predictable
- Spikes are infrequent but expected
- You need to ignore transient anomalies
### Setting Deviation Thresholds
| Threshold | Sensitivity | Use Case |
|---|---|---|
| 1-2 | Very sensitive | Critical systems (payments, security) |
| 3 | Balanced | General application monitoring |
| 4-5 | Conservative | Noisy metrics, reduce false positives |
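The table's sensitivity trade-off can be made concrete with a quick simulation: for roughly Gaussian data, about 5% of perfectly normal points fall outside 2 standard deviations, but only about 0.3% fall outside 3, so each step up in the threshold sharply cuts background alert noise:

```python
import random
import statistics

# Simulate a "normal" metric and count how often ordinary noise would
# breach each deviation threshold (i.e., the baseline false-positive rate).
random.seed(42)
samples = [random.gauss(100, 10) for _ in range(10_000)]
mean = statistics.fmean(samples)
std = statistics.stdev(samples)

for k in (1, 2, 3, 4):
    outside = sum(1 for s in samples if abs(s - mean) > k * std)
    print(f"outside {k} std dev: {outside / len(samples):.1%}")
```

Note this models only the noise floor; real metrics have heavier tails and seasonality, which is why the table still recommends tuning per use case.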
### Reducing False Positives

1. **Increase the deviation threshold:**

   ```text
   2 → 3 (fewer alerts, only significant anomalies)
   ```

2. **Extend the evaluation window:**

   ```text
   last_5m → last_15m (requires the anomaly to persist)
   ```

3. **Switch algorithms:**

   ```text
   Agile → Robust (less reactive to short spikes)
   ```

4. **Add tag filters:**

   ```text
   avg:api.latency{env:prod,!service:experimental}
   ```
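The evaluation-window tactic works because the monitor query compares the *fraction* of anomalous points in the window against the threshold (that is what `anomalies(...) >= 1` with `critical: 1` means: the whole window must be anomalous). A longer window therefore demands persistence. A minimal sketch of that logic:

```python
# Sketch of window-based persistence: alert only when a large enough
# fraction of recent points sit outside the expected ("gray") band.
def should_alert(flags, required_fraction=1.0):
    """flags: per-point booleans (True = point was outside the band)."""
    return sum(flags) / len(flags) >= required_fraction

brief_spike = [False, False, True, False, False]  # one bad point in 5
sustained = [True] * 5                            # anomalous whole window

print(should_alert(brief_spike))  # False: transient, no alert
print(should_alert(sustained))    # True: persistent anomaly
```

Lowering `required_fraction` (like the `warning: 0.5` threshold in the configuration example earlier) trades persistence for faster notification.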
## Integration with Flutter/Mobile Apps

### Monitoring Mobile App Metrics

```dart
import 'package:datadog_flutter_plugin/datadog_flutter_plugin.dart';

class AppMonitoring {
  static void initialize() {
    // Note: configuration class and parameter names vary between plugin
    // versions; check the datadog_flutter_plugin docs for your version.
    final configuration = DdSdkConfiguration(
      clientToken: 'YOUR_CLIENT_TOKEN',
      env: 'prod',
      site: DatadogSite.us1,
      trackingConsent: TrackingConsent.granted,
    );
    DatadogSdk.instance.initialize(configuration);
  }

  // Report a value as a RUM timing and a log attribute; generate metrics
  // from these in Datadog to feed anomaly monitors.
  static void trackMetric(String name, double value) {
    DatadogSdk.instance.rum?.addTiming(name);
    DatadogSdk.instance.logs?.addAttribute('custom.metrics.$name', value);
  }
}

// Usage: measure and report app launch time.
void main() {
  AppMonitoring.initialize();

  final launchStart = DateTime.now().millisecondsSinceEpoch;
  // ... app initialization ...
  final duration = DateTime.now().millisecondsSinceEpoch - launchStart;
  AppMonitoring.trackMetric('app.launch.duration', duration.toDouble());
}
```
### Example Anomaly Monitors for Mobile Apps

**1. Crash rate anomaly:**

```text
Metric: mobile.crash.rate
Algorithm: Agile
Threshold: 3
Alert: "Crash rate spiked! Investigate the latest release."
```

**2. API error rate:**

```text
Metric: mobile.api.error_rate
Algorithm: Agile
Threshold: 2
Alert: "API errors increased. Check backend health."
```

**3. Screen load time:**

```text
Metric: mobile.screen.load_time
Algorithm: Robust
Threshold: 3
Alert: "Screens loading slower than normal."
```
## Comparison with Other Tools
| Feature | Datadog | New Relic | Grafana |
|---|---|---|---|
| Anomaly Algorithms | 3 (Basic, Agile, Robust) | 2 (Static, Dynamic) | 1 (EWMA-based) |
| Auto-detection (Watchdog) | ✅ Yes | ✅ Yes | ❌ No |
| Seasonality Support | ✅ Yes | ✅ Yes | ⚠️ Limited |
| Root Cause Analysis | ✅ Yes (Watchdog) | ✅ Yes | ❌ Manual |
| Mobile SDK | ✅ Yes | ✅ Yes | ⚠️ Limited |
## Learning Resources
- Datadog Anomaly Monitor Documentation
- Anomaly Monitor Setup Guide
- Watchdog Alerts
- AI-Powered Metrics Monitoring
**Pro Tip:** Start with the Agile algorithm and a deviation threshold of 3 for most metrics. Monitor for a week, then adjust based on false positive/negative rates. Use Watchdog to discover anomalies you didn't know to monitor, then create dedicated monitors for critical patterns.