Q9: Explain PromQL and provide examples of common queries.
PromQL (Prometheus Query Language) is Prometheus's powerful functional language for querying time series data. Think of it as SQL for metrics - it helps you extract meaningful insights from your monitoring data.
Core Data Types:
- Instant Vector: a set of time series, each with a single sample at one timestamp
- Range Vector: a set of time series, each with a range of samples over a time window
- Scalar: a single numeric value
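The same metric can be selected as each type. A short sketch, assuming an `http_requests_total` counter exists:

```promql
http_requests_total                # Instant vector: one current sample per series
http_requests_total[5m]            # Range vector: 5 minutes of samples per series
scalar(sum(http_requests_total))   # Scalar: a single number
```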
Essential Query Examples:
Basic Metric Selection:
```promql
# Get all HTTP requests
http_requests_total

# Filter by specific conditions
http_requests_total{method="GET", status="200"}

# Use regex for flexible matching
http_requests_total{status=~"2.."}   # All 2xx status codes
```
Rate Calculations:
```promql
# Requests per second, averaged over 5 minutes
rate(http_requests_total[5m])

# Total increase over 1 hour
increase(http_requests_total[1h])
```
Aggregations:
```promql
# Sum all requests across instances
sum(http_requests_total)

# Average CPU usage by job
avg by (job) (cpu_usage_percent)

# Top 5 highest CPU instances
topk(5, cpu_usage_percent)
```
Real-World Example - Error Rate:
```promql
# Calculate error percentage
rate(http_requests_total{status=~"5.."}[5m])
  / rate(http_requests_total[5m]) * 100
```
Q10: What are PromQL functions and operators? Provide examples.
PromQL provides a rich set of functions and operators for data manipulation and analysis.
Mathematical Operators:
```promql
# Basic arithmetic
cpu_usage_percent / 100
memory_used_bytes + memory_cached_bytes

# Comparison operators
cpu_usage_percent > 80
response_time_seconds >= 0.5
```
Key Functions by Category:
Rate Functions:
```promql
rate(counter[5m])       # Per-second average rate over the window
irate(counter[5m])      # Instantaneous rate, based on the last two samples
increase(counter[1h])   # Total increase over 1 hour
```
Aggregation Functions:
```promql
sum(http_requests_total)     # Total requests
avg(response_time_seconds)   # Average response time
max(cpu_usage_percent)       # Peak CPU usage
count(up == 1)               # Number of healthy instances
```
Selection Functions:
```promql
topk(5, cpu_usage_percent)   # Top 5 CPU consumers
bottomk(3, response_time)    # 3 fastest responses
```
Mathematical Functions:
```promql
abs(temperature_celsius)   # Absolute value
sqrt(variance_metric)      # Square root
round(cpu_percent, 0.1)    # Round to nearest 0.1
```
Advanced Example - SLA Monitoring:
```promql
# Check whether the 99.9% availability SLA is met
(
  sum(rate(http_requests_total{status!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
) * 100 > 99.9
```
Q11: How does Prometheus alerting work with Alertmanager?
Prometheus alerting follows a simple but powerful workflow: Alert Rules → Evaluation → Alertmanager → Notifications.
Step 1: Define Alert Rules
```yaml
# /etc/prometheus/rules/alerts.yml
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5+ minutes"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
```
Step 2: Configure Prometheus
```yaml
# prometheus.yml
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```
Step 3: Set Up Alertmanager
```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  group_wait: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'

receivers:
  - name: 'default'   # fallback receiver referenced by the root route; must be defined
    email_configs:
      - to: '[email protected]'
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
        subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
    slack_configs:
      - channel: '#alerts'
        title: 'Critical Alert'
```
Alert States:
- Inactive: Condition not met
- Pending: Condition met, waiting for duration
- Firing: Alert actively sending notifications
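These states are also observable from inside Prometheus: pending and firing alerts are exposed through the built-in `ALERTS` metric, so they can be inspected with ordinary PromQL (the `HighCPUUsage` rule name here matches the example rule above):

```promql
# All alerts currently firing
ALERTS{alertstate="firing"}

# Pending instances of a specific rule
ALERTS{alertname="HighCPUUsage", alertstate="pending"}
```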
Q12: Explain Alertmanager's routing, grouping, and silencing features.
Alertmanager provides sophisticated alert management through three key features:
1. Routing
Routes determine which alerts go to which teams based on labels:
```yaml
route:
  receiver: 'default'
  routes:
    # Critical alerts to SRE team
    - match:
        severity: critical
      receiver: 'sre-team'
    # Database alerts to DBA team
    - match_re:
        service: '^(mysql|postgres|redis)$'
      receiver: 'dba-team'
    # Infrastructure alerts during business hours
    - match:
        team: infrastructure
      receiver: 'infra-business-hours'
      active_time_intervals:
        - business_hours
```
2. Grouping
Combines related alerts to reduce notification spam:
```yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s       # Wait before sending the first notification for a group
  group_interval: 5m    # Wait before sending updates for a group
  repeat_interval: 4h   # Resend interval for unresolved alerts
```
Example: Three HighCPUUsage alerts from different servers get grouped into one notification: "3 alerts for HighCPUUsage in prod cluster"
3. Silencing
Temporarily suppress alerts during maintenance:
```bash
# Create a silence via the Alertmanager v2 API
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighCPUUsage", "isRegex": false},
      {"name": "instance", "value": "web1", "isRegex": false}
    ],
    "startsAt": "2025-06-08T10:00:00Z",
    "endsAt": "2025-06-08T18:00:00Z",
    "createdBy": "john.doe",
    "comment": "Planned maintenance on web1"
  }'
```
4. Inhibition
Suppress lower-priority alerts when critical ones are firing:
```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
```
Q13: What is service discovery in Prometheus and why is it important?
Service Discovery automatically finds and configures monitoring targets without manual intervention - essential for dynamic cloud environments.
Why It's Critical:
- Dynamic Scaling: Automatically monitors new instances as they spin up
- Reduced Manual Work: No config updates for each new service
- Consistency: Ensures comprehensive monitoring coverage
- Reliability: Handles instance failures gracefully
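For contrast, the manual alternative is a static target list, which has to be edited and reloaded every time an instance appears or disappears (hostnames here are illustrative):

```yaml
scrape_configs:
  - job_name: 'web-servers'
    static_configs:
      - targets: ['web1:9100', 'web2:9100']  # every change requires a config edit
```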
Types of Service Discovery:
1. File-based Discovery
```yaml
scrape_configs:
  - job_name: 'file-discovery'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
        refresh_interval: 5m
```
Target file example:
```json
[
  {
    "targets": ["web1:9100", "web2:9100"],
    "labels": {
      "job": "web-servers",
      "env": "production"
    }
  }
]
```
2. Kubernetes Discovery
```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['default', 'monitoring']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
3. Cloud Provider Discovery
```yaml
scrape_configs:
  # AWS EC2 discovery
  - job_name: 'aws-ec2'
    ec2_sd_configs:
      - region: us-west-2
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]

  # Google Cloud discovery
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: my-project
        zone: us-central1-a
        port: 9100
```
Benefits:
- Zero-touch monitoring for new services
- Automatic cleanup when services are removed
- Label-based organization for better metric management
- Scalable architecture that grows with your infrastructure
Service discovery transforms Prometheus from a static monitoring tool into a dynamic, self-configuring system that adapts to your changing infrastructure automatically.