# Monitoring

Set up health checks, alerting, and dashboards for your Drasi deployment.
Effective monitoring helps you identify issues before they impact your users. This guide covers how to set up health checks, configure alerting, and create useful dashboards.
## Prerequisites
Before setting up monitoring, ensure you have observability configured in your Drasi deployment.
## Health Checks

### Component Health Endpoints
Drasi components expose health endpoints that you can use to monitor their status:
```bash
# Check source connector health
kubectl exec -n drasi-system deploy/drasi-source-<name> -- curl -s http://localhost:8080/health

# Check query container health
kubectl exec -n drasi-system deploy/drasi-query-<name> -- curl -s http://localhost:8080/health

# Check reaction health
kubectl exec -n drasi-system deploy/drasi-reaction-<name> -- curl -s http://localhost:8080/health
```
### Kubernetes Readiness/Liveness Probes
Drasi components include Kubernetes probes. Monitor probe failures to detect unhealthy components:
```bash
kubectl get events -n drasi-system --field-selector reason=Unhealthy
```
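
If you need to tune probe behavior for a component, the probes follow the standard Kubernetes pattern. A minimal sketch, assuming the `/health` endpoint on port 8080 used in the checks above; the exact paths, ports, and timings in your deployment may differ:

```yaml
# Illustrative probe configuration for a Drasi component container.
# Path and port are assumed from the health checks above; timings are
# placeholders to adjust for your environment.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```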
## Key Metrics to Monitor

### Source Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `drasi_source_changes_received_total` | Total changes received from the source | N/A (track rate) |
| `drasi_source_changes_per_second` | Change rate from the source | Varies by workload |
| `drasi_source_connection_status` | Source connection health | 0 = disconnected |
| `drasi_source_replication_lag_ms` | Lag behind the source | > 10000 ms |
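
Because `drasi_source_changes_received_total` is a counter, the useful signal is its rate of change. A sketch of a Prometheus recording rule that precomputes the per-source change rate (the rule and group names are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: drasi-recording-rules
  namespace: drasi-system
spec:
  groups:
    - name: drasi.records
      rules:
        # Per-source change rate, averaged over the last 5 minutes.
        - record: drasi:source_changes:rate5m
          expr: rate(drasi_source_changes_received_total[5m])
```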
### Query Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `drasi_query_evaluation_duration_ms` | Time to evaluate queries | p99 > 1000 ms |
| `drasi_query_results_total` | Query result count | N/A (track changes) |
| `drasi_query_errors_total` | Query evaluation errors | Any increase |
| `drasi_query_memory_usage_bytes` | Query container memory | > 80% of limit |
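
Alerting on the p99 threshold above requires a quantile over the latency distribution. Assuming `drasi_query_evaluation_duration_ms` is exported as a Prometheus histogram (the `_bucket` series used in the dashboard JSON below suggests it is), a recording-rule sketch:

```yaml
# Add under spec.groups[].rules of a PrometheusRule like the one above.
# p99 query evaluation latency per query over the last 5 minutes.
- record: drasi:query_eval_duration_ms:p99
  expr: histogram_quantile(0.99, sum by (query, le) (rate(drasi_query_evaluation_duration_ms_bucket[5m])))
```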
### Reaction Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `drasi_reaction_executions_total` | Total reaction executions | N/A (track rate) |
| `drasi_reaction_duration_ms` | Reaction execution time | p99 > 5000 ms |
| `drasi_reaction_errors_total` | Reaction errors | Any increase |
| `drasi_reaction_queue_depth` | Pending reactions | > 1000 |
## Setting Up Alerts

### Prometheus Alerting Rules
Create alerting rules for critical conditions:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: drasi-alerts
  namespace: drasi-system
spec:
  groups:
    - name: drasi.rules
      rules:
        - alert: DrasiSourceDisconnected
          expr: drasi_source_connection_status == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Drasi source {{ $labels.source }} is disconnected"
        - alert: DrasiHighReplicationLag
          expr: drasi_source_replication_lag_ms > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Drasi source {{ $labels.source }} has high replication lag"
        - alert: DrasiQueryErrors
          expr: increase(drasi_query_errors_total[5m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "Query {{ $labels.query }} has errors"
        - alert: DrasiReactionErrors
          expr: increase(drasi_reaction_errors_total[5m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "Reaction {{ $labels.reaction }} has errors"
```
### Alert Destinations
Configure alert routing to appropriate destinations (a routing sketch follows the list):
- PagerDuty for critical production alerts
- Slack for warning-level notifications
- Email for daily summaries
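
A minimal Alertmanager routing sketch for this scheme. The receiver names, keys, and channels are placeholders, and the SMTP settings needed for email delivery are omitted:

```yaml
route:
  receiver: email-summaries        # default: everything else goes to email
  group_by: ["alertname", "severity"]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-prod     # page on-call for critical alerts
    - matchers:
        - severity="warning"
      receiver: slack-warnings     # warnings go to chat
receivers:
  - name: pagerduty-prod
    pagerduty_configs:
      - routing_key: <your-pagerduty-routing-key>
  - name: slack-warnings
    slack_configs:
      - api_url: <your-slack-webhook-url>
        channel: "#drasi-alerts"
  - name: email-summaries
    email_configs:
      - to: team@example.com
```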
## Creating Dashboards

### Grafana Dashboard
Create a comprehensive Drasi dashboard in Grafana:
1. Access Grafana (see Observability)
2. Create a new dashboard
3. Add panels for:
   - Source change rate (time series)
   - Query evaluation latency (histogram)
   - Reaction success/failure rate (gauge)
   - Resource utilization (time series)
### Example Dashboard JSON
```json
{
  "title": "Drasi Overview",
  "panels": [
    {
      "title": "Changes per Second",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(drasi_source_changes_received_total[5m])",
          "legendFormat": "{{ source }}"
        }
      ]
    },
    {
      "title": "Query Latency (p99)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(drasi_query_evaluation_duration_ms_bucket[5m]))",
          "legendFormat": "{{ query }}"
        }
      ]
    }
  ]
}
```
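
Rather than building these panels by hand, you can paste JSON like this into Grafana's dashboard import dialog and adjust it from there. Note that the p99 expression applies `histogram_quantile` to a `rate()` of the buckets; applying it to the raw cumulative counters would compute a quantile over all-time data rather than recent latency.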
## Monitoring Best Practices

### Establish Baselines
Before setting alert thresholds:
- Run Drasi under normal load for several days
- Collect baseline metrics
- Set thresholds based on observed patterns (one way to encode a baseline is sketched below)
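
One way to turn observed patterns into thresholds is to record a long-window baseline and compare live values against it. A sketch that builds on the `drasi:source_changes:rate5m` recording rule from earlier; the rule names, windows, and the 10% factor are illustrative:

```yaml
# Add under spec.groups[].rules of a PrometheusRule.
# Baseline: 7-day average of the 5-minute change rate.
- record: drasi:source_changes:rate5m:avg7d
  expr: avg_over_time(drasi:source_changes:rate5m[7d])
# Fire when the live rate drops to under 10% of baseline (possible stall).
- alert: DrasiSourceChangeRateLow
  expr: drasi:source_changes:rate5m < 0.1 * drasi:source_changes:rate5m:avg7d
  for: 30m
  labels:
    severity: warning
```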
### Use Multi-Signal Alerting

Combine multiple signals to reduce false positives; a sketch of a compound rule follows the list:
- High latency AND error rate increase
- Memory usage high AND approaching OOM
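
A sketch of the first combination as a single PromQL expression, assuming both the latency histogram and the error counter carry a matching `query` label:

```yaml
# Add under spec.groups[].rules of a PrometheusRule.
- alert: DrasiQueryDegraded
  # Fires only when p99 latency is high AND errors are increasing,
  # filtering out transient spikes in either signal alone.
  expr: |
    histogram_quantile(0.99, sum by (query, le) (rate(drasi_query_evaluation_duration_ms_bucket[5m]))) > 1000
    and on (query)
    increase(drasi_query_errors_total[5m]) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Query {{ $labels.query }} is slow and erroring"
```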
### Document Runbooks

For each alert, document the following (and link the runbook from the alert itself, as sketched after this list):
- What the alert means
- Potential causes
- Investigation steps
- Remediation actions
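
A common convention is a `runbook_url` annotation on the alert rule, so the runbook is one click away when a page fires. A sketch extending the first rule from earlier; the URL is a placeholder:

```yaml
- alert: DrasiSourceDisconnected
  expr: drasi_source_connection_status == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Drasi source {{ $labels.source }} is disconnected"
    # Placeholder URL: point this at your team's runbook.
    runbook_url: https://example.com/runbooks/drasi-source-disconnected
```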
## Next Steps
- Configure Scaling for high workloads
- Set up Troubleshooting procedures
- Review Performance patterns