
Sunday, June 8, 2025

Prometheus DevOps Interview Questions Part-3

 

Q9: Explain PromQL and provide examples of common queries.

PromQL (Prometheus Query Language) is Prometheus's powerful functional language for querying time series data. Think of it as SQL for metrics - it helps you extract meaningful insights from your monitoring data.

Core Data Types:

  • Instant Vector: Current value of metrics at a specific time
  • Range Vector: Values over a time period
  • Scalar: Simple numeric values
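
A quick look at how each type appears in a query (using the same illustrative metric as the examples below):

promql
# Instant vector: one sample per series at the evaluation time
http_requests_total

# Range vector: samples collected over a trailing window (note the [5m])
http_requests_total[5m]

# Scalar: a plain numeric value
42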

Essential Query Examples:

Basic Metric Selection:

promql
# Get all HTTP requests
http_requests_total

# Filter by specific conditions
http_requests_total{method="GET", status="200"}

# Use regex for flexible matching
http_requests_total{status=~"2.."}  # All 2xx status codes

Rate Calculations:

promql
# Requests per second over 5 minutes
rate(http_requests_total[5m])

# Total increase over 1 hour
increase(http_requests_total[1h])

Aggregations:

promql
# Sum all requests across instances
sum(http_requests_total)

# Average CPU usage by job
avg by (job) (cpu_usage_percent)

# Top 5 highest CPU instances
topk(5, cpu_usage_percent)

Real-World Example - Error Rate:

promql
# Calculate error percentage (sum drops the status label so the two sides can be divided)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Q10: What are PromQL functions and operators? Provide examples.

PromQL provides a rich set of functions and operators for data manipulation and analysis.

Mathematical Operators:

promql
# Basic arithmetic
cpu_usage_percent / 100
memory_used_bytes + memory_cached_bytes

# Comparison operators
cpu_usage_percent > 80
response_time_seconds >= 0.5
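
Note that comparison operators filter series by default; adding the bool modifier returns 1 or 0 instead, which is useful for recording rules and SLO math:

promql
# 1 for instances above 80% CPU, 0 otherwise
cpu_usage_percent > bool 80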

Key Functions by Category:

Rate Functions:

promql
rate(counter[5m])     # Per-second average rate
irate(counter[5m])    # Instantaneous rate based on the last two samples
increase(counter[1h]) # Total increase

Aggregation Functions:

promql
sum(http_requests_total)      # Total requests
avg(response_time_seconds)    # Average response time
max(cpu_usage_percent)        # Peak CPU usage
count(up == 1)                # Number of healthy instances

Selection Functions:

promql
topk(5, cpu_usage_percent)    # Top 5 CPU consumers
bottomk(3, response_time)     # 3 fastest responses

Mathematical Functions:

promql
abs(temperature_celsius)      # Absolute value
sqrt(variance_metric)         # Square root
round(cpu_percent, 0.1)       # Round to nearest 0.1

Advanced Example - SLA Monitoring:

promql
# Check if 99.9% availability SLA is met
(sum(rate(http_requests_total{status!~"5.."}[30d])) / 
 sum(rate(http_requests_total[30d]))) * 100 > 99.9

Q11: How does Prometheus alerting work with Alertmanager?

Prometheus alerting follows a simple but powerful workflow: Alert Rules → Evaluation → Alertmanager → Notifications.

Step 1: Define Alert Rules

yaml
# /etc/prometheus/rules/alerts.yml
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5+ minutes"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"

Step 2: Configure Prometheus

yaml
# prometheus.yml
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Step 3: Set Up Alertmanager

yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  group_wait: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'

receivers:
  - name: 'default'               # catch-all referenced by the route above; no notifier means matching alerts are dropped
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
        subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
    slack_configs:
      - channel: '#alerts'
        title: 'Critical Alert'   # also requires api_url here or slack_api_url under global
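
The Alertmanager configuration can be validated the same way with the bundled amtool (file name assumed):

bash
amtool check-config alertmanager.yml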

Alert States:

  • Inactive: Condition not met
  • Pending: Condition met, but not yet for the full "for" duration
  • Firing: Condition met for the required duration; the alert is sent to Alertmanager for notification
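
Pending and firing alerts can also be inspected directly in PromQL via the built-in ALERTS series:

promql
# All alerts currently in the firing state
ALERTS{alertstate="firing"}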

Q12: Explain Alertmanager's routing, grouping, and silencing features.

Alertmanager provides sophisticated alert management through three key features:

1. Routing

Routes determine which alerts go to which teams based on labels:

yaml
route:
  receiver: 'default'
  routes:
    # Critical alerts to SRE team
    - match:
        severity: critical
      receiver: 'sre-team'
    
    # Database alerts to DBA team
    - match_re:
        service: '^(mysql|postgres|redis)$'
      receiver: 'dba-team'
    
    # Infrastructure alerts during business hours
    - match:
        team: infrastructure
      receiver: 'infra-business-hours'
      active_time_intervals:
        - business_hours
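
Note that the business_hours interval referenced above must be defined in a top-level time_intervals block. To check which receiver a given label set would hit, amtool includes a route tester (config path is an assumption):

bash
amtool config routes test --config.file=alertmanager.yml severity=critical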

2. Grouping

Combines related alerts to reduce notification spam:

yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s       # Wait before sending first notification
  group_interval: 5m    # Wait before sending updates
  repeat_interval: 4h   # Resend interval for unresolved alerts

Example: Three HighCPUUsage alerts from different servers get grouped into one notification: "3 alerts for HighCPUUsage in prod cluster"

3. Silencing

Temporarily suppress alerts during maintenance:

bash
# Create silence via the v2 API (the v1 API has been removed in recent Alertmanager releases)
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighCPUUsage", "isRegex": false},
      {"name": "instance", "value": "web1", "isRegex": false}
    ],
    "startsAt": "2025-06-08T10:00:00Z",
    "endsAt": "2025-06-08T18:00:00Z",
    "createdBy": "john.doe",
    "comment": "Planned maintenance on web1"
  }'
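
The same silence can be created with amtool (the Alertmanager URL and author are placeholders):

bash
amtool silence add alertname=HighCPUUsage instance=web1 \
  --comment="Planned maintenance on web1" \
  --duration=8h \
  --author="john.doe" \
  --alertmanager.url=http://alertmanager:9093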

4. Inhibition

Suppress lower-priority alerts when critical ones are firing:

yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Q13: What is service discovery in Prometheus and why is it important?

Service Discovery automatically finds and configures monitoring targets without manual intervention - essential for dynamic cloud environments.

Why It's Critical:

  • Dynamic Scaling: Automatically monitors new instances as they spin up
  • Reduced Manual Work: No config updates for each new service
  • Consistency: Ensures comprehensive monitoring coverage
  • Reliability: Handles instance failures gracefully

Types of Service Discovery:

1. File-based Discovery

yaml
scrape_configs:
  - job_name: 'file-discovery'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
        refresh_interval: 5m

Target file example:

json
[
  {
    "targets": ["web1:9100", "web2:9100"],
    "labels": {
      "job": "web-servers",
      "env": "production"
    }
  }
]

2. Kubernetes Discovery

yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['default', 'monitoring']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
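
For the keep rule above to match, pods must carry the corresponding annotation. This is a widely used convention rather than a Prometheus default; a pod spec fragment might look like:

yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"   # surfaced as __meta_kubernetes_pod_annotation_prometheus_io_scrape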

3. Cloud Provider Discovery

yaml
# AWS EC2 Discovery
scrape_configs:
  - job_name: 'aws-ec2'
    ec2_sd_configs:
      - region: us-west-2
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]

# Google Cloud Discovery
scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: my-project
        zone: us-central1-a
        port: 9100

Benefits:

  • Zero-touch monitoring for new services
  • Automatic cleanup when services are removed
  • Label-based organization for better metric management
  • Scalable architecture that grows with your infrastructure

Service discovery transforms Prometheus from a static monitoring tool into a dynamic, self-configuring system that adapts to your changing infrastructure automatically.
