
Sunday, June 8, 2025

Prometheus Devops Interview Questions Part-2

Q6: Explain Prometheus data model and metric types.

Answer:

Understanding Prometheus's data model is fundamental to effectively using the monitoring system. The data model defines how metrics are structured, stored, and identified, forming the foundation for all monitoring and alerting capabilities.

Prometheus Data Model - The Foundation

Core Concept: Prometheus stores all data as time series - sequences of timestamped values belonging to the same metric and the same set of labeled dimensions.

Time Series Identity: Each time series is uniquely identified by its metric name and its set of labels; every sample in that series then carries a timestamp and a value:

  1. Metric Name: What you're measuring (e.g., http_requests_total, cpu_usage_percent)
  2. Labels: Key-value pairs that add dimensions (e.g., method="GET", status="200")
  3. Timestamp: When the measurement was taken (Unix timestamp); part of each sample, not of the series identity
  4. Value: The actual measurement (64-bit floating-point number); also part of each sample, not of the identity

Data Storage Format

Standard Format:

metric_name{label1="value1", label2="value2"} value timestamp

Real Example:

http_requests_total{method="GET", status="200", endpoint="/api/users"} 1027.0 1641024000

Data Model Visualization:

Time Series Database Structure:
┌─────────────────────────────────────────────────────────────────┐
│                     PROMETHEUS TSDB                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Metric: http_requests_total                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Labels: {method="GET", status="200", endpoint="/api"}   │   │
│  │ ┌─────────────┬─────────────┬─────────────┬───────────┐ │   │
│  │ │ Timestamp   │ Value       │ Timestamp   │ Value     │ │   │
│  │ │ 1641024000  │ 1027.0      │ 1641024015  │ 1031.0    │ │   │
│  │ │ 1641024030  │ 1035.0      │ 1641024045  │ 1040.0    │ │   │
│  │ └─────────────┴─────────────┴─────────────┴───────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Labels: {method="POST", status="200", endpoint="/api"}  │   │
│  │ ┌─────────────┬─────────────┬─────────────┬───────────┐ │   │
│  │ │ Timestamp   │ Value       │ Timestamp   │ Value     │ │   │
│  │ │ 1641024000  │ 45.0        │ 1641024015  │ 47.0      │ │   │
│  │ └─────────────┴─────────────┴─────────────┴───────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Label Theory and Best Practices

Label Cardinality: The combination of all possible label values creates the cardinality of a metric. High cardinality can impact performance.

Good Label Example:

http_requests_total{method="GET", status="200", service="api"}
# Low cardinality: method (5 values), status (10 values), service (20 values)
# Total combinations: 5 × 10 × 20 = 1,000 time series

Bad Label Example (High Cardinality):

http_requests_total{method="GET", status="200", user_id="12345"}
# High cardinality: user_id could have millions of values
# Avoid using user IDs, request IDs, or other unbounded values as labels

Four Metric Types in Detail

1. Counter - The Accumulator

Theory: Counters represent cumulative metrics that only increase (or reset to zero on restart). They're perfect for counting events like requests, errors, or bytes transferred.

Key Characteristics:

  • Monotonically increasing
  • Resets only when process restarts
  • Rate of change is more meaningful than absolute value
  • Always starts from 0

Mathematical Representation:

Counter(t) ≥ Counter(t-1) for all t (except resets)

Practical Examples:

prometheus
# HTTP requests counter
http_requests_total{method="GET", status="200"} 1027

# Bytes sent counter
bytes_sent_total{interface="eth0"} 482847392

# Error counter
errors_total{type="timeout", service="payment"} 15

Common PromQL Operations with Counters:

promql
# Rate of HTTP requests per second over 5 minutes
rate(http_requests_total[5m])

# Total increase in requests over 1 hour
increase(http_requests_total[1h])

# Requests per minute
rate(http_requests_total[5m]) * 60

Counter Implementation Example (Go):

go
package main

import (
    "net/http"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "status", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequests)
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Increment counter for each request
    httpRequests.WithLabelValues(r.Method, "200", r.URL.Path).Inc()
    w.Write([]byte("Hello World"))
}

func main() {
    // Expose the application handler and the Prometheus /metrics endpoint
    // so the registered counter is actually scrapeable
    http.HandleFunc("/", handleRequest)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Counter Reset Detection:

promql
# Detect counter resets (useful for calculating rates across restarts)
resets(http_requests_total[1h])
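
To make the reset behaviour concrete, here is a rough Python sketch (a simplified model, not Prometheus's actual implementation) of how an increase can be computed from counter samples that may reset between scrapes; the sample data is made up:

python
# Rough sketch: estimate the per-second rate of a counter from scraped samples,
# treating any decrease as a counter reset (conceptually what rate()/increase() do).
def counter_increase(samples):
    """samples: list of (timestamp, value) tuples, oldest first."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev   # normal monotonic growth
        else:
            total += curr          # reset detected: counter restarted from 0
    return total

samples = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]  # reset between t=15 and t=30
window = samples[-1][0] - samples[0][0]
print(counter_increase(samples) / window)  # approximate per-second rate over the window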

2. Gauge - The Thermometer

Theory: Gauges represent point-in-time values that can go up and down. They're like a thermometer or speedometer - showing current state rather than cumulative values.

Key Characteristics:

  • Values can increase or decrease
  • Represents current state/level
  • Instant value is meaningful
  • No reset behavior

Mathematical Representation:

Gauge(t) can be any value relative to Gauge(t-1)

Practical Examples:

prometheus
# CPU usage percentage
cpu_usage_percent{cpu="0", mode="user"} 45.2

# Memory usage in bytes
memory_usage_bytes{type="heap"} 1073741824

# Current temperature
temperature_celsius{sensor="cpu", location="server_room"} 68.5

# Active connections
active_connections{service="database", pool="primary"} 42

Common PromQL Operations with Gauges:

promql
# Current CPU usage
cpu_usage_percent

# Average memory usage across instances
avg(memory_usage_bytes) by (instance)

# Maximum temperature in last hour
max_over_time(temperature_celsius[1h])

# Memory usage trend (derivative)
deriv(memory_usage_bytes[10m])

Gauge Implementation Example (Python):

python
from prometheus_client import Gauge, start_http_server
import psutil
import time

# Create gauge metrics
cpu_usage = Gauge('cpu_usage_percent', 'CPU usage percentage', ['cpu'])
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')
disk_usage = Gauge('disk_usage_percent', 'Disk usage percentage', ['device'])

def collect_system_metrics():
    while True:
        # Update CPU usage for each core
        cpu_percentages = psutil.cpu_percent(percpu=True)
        for i, usage in enumerate(cpu_percentages):
            cpu_usage.labels(cpu=str(i)).set(usage)
        
        # Update memory usage
        memory = psutil.virtual_memory()
        memory_usage.set(memory.used)
        
        # Update disk usage
        for partition in psutil.disk_partitions():
            try:
                usage = psutil.disk_usage(partition.mountpoint)
                disk_usage.labels(device=partition.device).set(
                    (usage.used / usage.total) * 100
                )
            except PermissionError:
                continue
        
        time.sleep(15)

if __name__ == '__main__':
    start_http_server(8000)
    collect_system_metrics()

3. Histogram - The Distribution Analyzer

Theory: Histograms sample observations and count them in configurable buckets. They're designed to measure distributions of values like request latencies, response sizes, or any measurement where you need to understand the distribution pattern.

Key Characteristics:

  • Samples observations into buckets
  • Provides count, sum, and bucket counts
  • Buckets are cumulative (le = "less than or equal")
  • Enables percentile calculations

Histogram Components:

prometheus
# Bucket counters (cumulative)
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.25"} 26335
http_request_duration_seconds_bucket{le="0.5"} 27534
http_request_duration_seconds_bucket{le="1.0"} 28126
http_request_duration_seconds_bucket{le="2.5"} 28312
http_request_duration_seconds_bucket{le="5.0"} 28358
http_request_duration_seconds_bucket{le="10.0"} 28367
http_request_duration_seconds_bucket{le="+Inf"} 28367

# Total count of all observations
http_request_duration_seconds_count 28367

# Sum of all observed values
http_request_duration_seconds_sum 1896.04

Histogram Bucket Visualization:

Distribution of Response Times:
┌─────────────────────────────────────────────────────────────────┐
│ Bucket Analysis for http_request_duration_seconds               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ le="0.1"    │████████████████████████████████████████│ 24,054  │
│ le="0.25"   │██│ 2,281 (26,335 - 24,054)                      │
│ le="0.5"    │█│ 1,199 (27,534 - 26,335)                       │
│ le="1.0"    │█│ 592 (28,126 - 27,534)                         │
│ le="2.5"    ││ 186 (28,312 - 28,126)                          │
│ le="5.0"    ││ 46 (28,358 - 28,312)                           │
│ le="10.0"   ││ 9 (28,367 - 28,358)                            │
│ le="+Inf"   ││ 0 (28,367 - 28,367)                            │
│                                                                 │
│ Total Observations: 28,367                                      │
│ Sum of All Values: 1,896.04 seconds                           │
│ Average Response Time: 0.067 seconds                          │
└─────────────────────────────────────────────────────────────────┘

Percentile Calculations with Histograms:

promql
# 50th percentile (median) response time
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile response time
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
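
To see why histogram percentiles are approximations, here is a rough Python sketch of the bucket interpolation idea behind histogram_quantile (simplified; the real function works on per-second bucket rates and handles more edge cases). The bucket bounds and counts are taken from the example above:

python
# Rough sketch of quantile estimation from cumulative histogram buckets,
# using linear interpolation inside the bucket that contains the target rank.
def estimate_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), ascending by bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

buckets = [(0.1, 24054), (0.25, 26335), (0.5, 27534), (1.0, 28126),
           (2.5, 28312), (5.0, 28358), (10.0, 28367)]
print(estimate_quantile(0.95, buckets))  # estimated 95th percentile in seconds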

Histogram Implementation Example (Go):

go
package main

import (
    "math/rand"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request latency distributions",
            Buckets: []float64{0.1, 0.25, 0.5, 1, 2.5, 5, 10}, // Custom buckets
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(requestDuration)
}

func simulateRequest(method, endpoint string) {
    start := time.Now()
    
    // Simulate request processing
    processingTime := rand.Float64() * 2 // 0-2 seconds
    time.Sleep(time.Duration(processingTime * float64(time.Second)))
    
    // Record the duration
    duration := time.Since(start).Seconds()
    requestDuration.WithLabelValues(method, endpoint).Observe(duration)
}

func main() {
    // Expose /metrics and generate sample observations in the background
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        for {
            simulateRequest("GET", "/api/users")
        }
    }()
    http.ListenAndServe(":8080", nil)
}

4. Summary - The Quantile Calculator

Theory: Summaries are similar to histograms but calculate quantiles directly on the client side. They provide count, sum, and configurable quantiles, offering an alternative approach to understanding distributions.

Key Characteristics:

  • Calculates quantiles client-side
  • Provides count, sum, and quantiles
  • Lower server-side computational cost
  • Less flexible for aggregation across instances

Summary Components:

prometheus
# Configured quantiles
http_request_duration_seconds{quantile="0.5"} 0.052
http_request_duration_seconds{quantile="0.9"} 0.564
http_request_duration_seconds{quantile="0.99"} 1.245

# Total count of observations
http_request_duration_seconds_count 144320

# Sum of all observed values
http_request_duration_seconds_sum 1896.04

Summary vs Histogram Comparison:

Aspect               | Histogram                        | Summary
---------------------|----------------------------------|----------------------------------
Quantile Calculation | Server-side with PromQL          | Client-side
Accuracy             | Approximated from buckets        | Exact for configured quantiles
Aggregation          | Can aggregate across instances   | Cannot aggregate quantiles
Storage              | Stores bucket counts             | Stores quantile values
Flexibility          | Any quantile can be calculated   | Only pre-configured quantiles
Performance          | Higher server load for queries   | Higher client memory usage
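
The aggregation row deserves a concrete illustration: averaging per-instance quantiles is not the quantile of the combined traffic, which is why summary quantiles cannot be meaningfully aggregated. A small Python sketch with made-up latencies:

python
import statistics

# Two instances with very different latency profiles (made-up data, in seconds)
instance_a = [0.05] * 90 + [0.2] * 10    # mostly fast
instance_b = [0.5] * 90 + [2.0] * 10     # mostly slow

def p90(values):
    return statistics.quantiles(values, n=10)[-1]   # 90th percentile

# What a Summary on each instance would report
print(p90(instance_a), p90(instance_b))             # per-instance p90s

# Averaging those quantiles is NOT the p90 of all traffic combined
print((p90(instance_a) + p90(instance_b)) / 2)      # misleading "average of p90s"
print(p90(instance_a + instance_b))                 # true p90 of the merged data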

Summary Implementation Example (Python):

python
from prometheus_client import Summary, start_http_server
import time
import random

# Create a summary (the Python client exposes only count and sum; quantiles are not configurable here)
request_latency = Summary(
    'request_processing_seconds',
    'Time spent processing requests',
    ['method', 'endpoint']
)

# Decorator for automatic timing
@request_latency.labels(method='GET', endpoint='/api/users').time()
def process_get_users():
    # Simulate processing time
    time.sleep(random.uniform(0.1, 0.5))
    return "users data"

# Manual timing
def process_post_users():
    with request_latency.labels(method='POST', endpoint='/api/users').time():
        # Simulate processing
        time.sleep(random.uniform(0.2, 1.0))
        return "user created"

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_get_users()
        process_post_users()
        time.sleep(1)

Choosing the Right Metric Type

Decision Matrix:

Use Case                                | Metric Type | Reasoning
----------------------------------------|-------------|----------------------------------
Count events (requests, errors)         | Counter     | Events only increase over time
Current state (CPU, memory, queue size) | Gauge       | Values can go up and down
Measure distributions (latency, size)   | Histogram   | Need percentiles and buckets
Simple quantiles (response time)        | Summary     | Pre-defined quantiles sufficient

Best Practices:

  1. Use descriptive names: http_requests_total vs requests
  2. Include units: duration_seconds, size_bytes
  3. Use consistent labels: Same labels across related metrics
  4. Avoid high cardinality: Don't use user IDs or request IDs as labels
  5. Document metrics: Include help text describing what each metric measures
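
As a quick illustration of these practices, here is a hedged Python sketch defining a well-named metric with a unit suffix, help text, and only low-cardinality labels (the metric and label names are illustrative, not from any real system):

python
from prometheus_client import Histogram

# Descriptive name with unit suffix, help text, and bounded label values only
checkout_duration_seconds = Histogram(
    'checkout_duration_seconds',
    'Time taken to complete a checkout request',
    ['method', 'status'],                  # low cardinality; no user or request IDs
    buckets=[0.1, 0.25, 0.5, 1, 2.5, 5]
)

checkout_duration_seconds.labels(method='POST', status='200').observe(0.42)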

Q7: How do you configure Prometheus to scrape targets?

Answer:

Prometheus configuration is the cornerstone of effective monitoring. The configuration file defines what to monitor, how often to collect data, and how to process that data. Understanding configuration is essential for building robust monitoring systems.

Configuration File Structure

Prometheus uses a YAML configuration file (typically prometheus.yml) with several main sections:

Complete Configuration Template:

yaml
# Global configuration
global:
  scrape_interval: 15s        # How often to scrape targets by default
  evaluation_interval: 15s    # How often to evaluate rules
  external_labels:            # Labels attached to any time series
    monitor: 'production'
    datacenter: 'us-east-1'

# Rule files
rule_files:
  - "rules/*.yml"
  - "alerts/*.yml"

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093
      timeout: 10s
      api_version: v2

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100', 'server2:9100']
    scrape_interval: 5s
    metrics_path: /metrics
    scheme: https
    
# Remote write configuration (optional)
remote_write:
  - url: "https://remote-storage-endpoint/api/v1/write"
    
# Remote read configuration (optional)
remote_read:
  - url: "https://remote-storage-endpoint/api/v1/read"

Global Configuration Section

Purpose: Sets default values that apply to all jobs unless overridden.

Key Parameters:

yaml
global:
  scrape_interval: 15s          # Default scrape frequency
  scrape_timeout: 10s           # Default scrape timeout
  evaluation_interval: 15s      # How often to evaluate recording and alerting rules
  external_labels:              # Labels added to all time series
    region: 'us-west-2'
    environment: 'production'

Theory Behind Intervals:

  • Scrape Interval: Balance between data resolution and system load
  • Evaluation Interval: How quickly alerts are triggered
  • Timeout: Must not exceed the scrape interval (Prometheus rejects configurations where scrape_timeout is greater than scrape_interval)
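
For a back-of-the-envelope feel for this balance, the sketch below (Python, with made-up target and series counts) estimates how many samples per second Prometheus must ingest for a given scrape interval:

python
# Rough capacity estimate: samples ingested per second for a given scrape interval
targets = 200                 # number of scraped endpoints (assumed)
series_per_target = 1500      # average time series exposed per endpoint (assumed)
scrape_interval_s = 15

samples_per_second = targets * series_per_target / scrape_interval_s
print(f"{samples_per_second:,.0f} samples/s")   # halving the interval doubles this load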

Scrape Configuration Deep Dive

Basic Static Configuration:

yaml
scrape_configs:
  - job_name: 'web-servers'           # Logical grouping name
    static_configs:
      - targets: 
          - 'web1.example.com:8080'
          - 'web2.example.com:8080'
          - 'web3.example.com:8080'
    scrape_interval: 30s              # Override global interval
    scrape_timeout: 10s               # Maximum time to wait for response
    metrics_path: '/metrics'          # Path to metrics endpoint
    scheme: 'http'                    # http or https
    params:                           # URL parameters
      format: ['prometheus']
    basic_auth:                       # HTTP basic authentication
      username: 'prometheus'
      password: 'secret'
    # bearer_token: 'abc123'          # Bearer token auth (alternative to basic_auth; configure only one auth method per job)
    tls_config:                       # TLS configuration
      ca_file: '/path/to/ca.pem'
      cert_file: '/path/to/cert.pem'
      key_file: '/path/to/key.pem'
      insecure_skip_verify: false

Service Discovery Mechanisms

Theory: Modern infrastructure is dynamic. Containers start and stop, auto-scaling changes instance counts, and services move between hosts. Static configuration doesn't scale. Service discovery automatically finds targets to monitor.

Kubernetes Service Discovery:

yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod                     # Discover pods
        namespaces:
          names:
            - default
            - monitoring
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      
      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

Service Discovery Flow Diagram:

┌─────────────────────────────────────────────────────────────────┐
│                    SERVICE DISCOVERY FLOW                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    API Query    ┌─────────────┐                │
│  │ Prometheus  │ ────────────── │ Kubernetes  │                │
│  │   Server    │                │ API Server  │                │
│  └─────────────┘                └─────────────┘                │
│         │                               │                      │
│         ▼                               ▼                      │
│  ┌─────────────┐                ┌─────────────┐                │
│  │  Target     │                │    Pod      │                │
│  │ Discovery   │ ◀──────────── │ Metadata    │                │
│  └─────────────┘   Pod Info     └─────────────┘                │
│         │                                                      │
│         ▼                                                      │
│  ┌─────────────┐                                               │
│  │ Relabeling  │                                               │
│  │   Rules     │                                               │
│  └─────────────┘                                               │
│         │                                                      │
│         ▼                                                      │
│  ┌─────────────┐    HTTP GET     ┌─────────────┐               │
│  │   Scrape    │ ────────────── │   Target    │               │
│  │  Targets    │   /metrics      │ Endpoints   │               │
│  └─────────────┘                └─────────────┘               │
└─────────────────────────────────────────────────────────────────┘

AWS EC2 Service Discovery:

yaml
scrape_configs:
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: us-west-2
        port: 9100
        filters:
          - name: 'tag:Environment'
            values: ['production', 'staging']
          - name: 'instance-state-name'
            values: ['running']
    relabel_configs:
      # Use instance ID as instance label
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance
      
      # Add environment from EC2 tag
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: environment
      
      # Add instance type
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type

Consul Service Discovery:

yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        services: ['web', 'api', 'database']
    relabel_configs:
      # Use service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
      
      # Add datacenter information
      - source_labels: [__meta_consul_datacenter]
        target_label: datacenter
      
      # Filter healthy services only
      - source_labels: [__meta_consul_health]
        regex: passing
        action: keep

Relabeling - The Power Tool

Theory: Relabeling is Prometheus's Swiss Army knife. It allows you to modify, add, or remove labels before storing metrics. This is crucial for:

  • Filtering unwanted targets
  • Adding contextual information
  • Standardizing label names
  • Reducing cardinality

Relabeling Actions:

  1. replace: Replace label value with new value
  2. keep: Keep only targets where label matches regex
  3. drop: Drop targets where label matches regex
  4. labelmap: Map label names using regex
  5. labeldrop: Drop labels matching regex
  6. labelkeep: Keep only labels matching regex

Advanced Relabeling Examples:

yaml
relabel_configs:
  # Keep only targets with specific annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: 'true'
  
  # Extract service name from pod name
  - source_labels: [__meta_kubernetes_pod_name]
    regex: '([^-]+)-.*'
    replacement: '${1}'
    target_label: service
  
  # Add custom labels based on namespace
  - source_labels: [__meta_kubernetes_namespace]
    regex: 'production'
    replacement: 'prod'
    target_label: environment
  
  # Drop internal labels (starting with __)
  - regex: '__meta_.*'
    action: labeldrop
  
  # Map Kubernetes labels to Prometheus labels
  - regex: '__meta_kubernetes_pod_label_(.+)'
    action: labelmap
    replacement: 'k8s_${1}'

Authentication and Security

Basic Authentication:

yaml
scrape_configs:
  - job_name: 'secured-app'
    static_configs:
      - targets: ['app.example.com:8080']
    basic_auth:
      username: 'monitoring'
      password_file: '/etc/prometheus/passwords/app_password'

Bearer Token Authentication:

yaml
scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api.example.com:8080']
    bearer_token_file: '/etc/prometheus/tokens/api_token'

TLS Configuration:

yaml
scrape_configs:
  - job_name: 'https-service'
    static_configs:
      - targets: ['secure.example.com:8443']
    scheme: https
    tls_config:
      ca_file: '/etc/prometheus/ca.pem'
      cert_file: '/etc/prometheus/client.pem'
      key_file: '/etc/prometheus/client-key.pem'
      server_name: 'secure.example.com'
      insecure_skip_verify: false

Configuration Validation and Best Practices

Validation Commands:

bash
# Check configuration syntax
promtool check config prometheus.yml

# Check rules syntax
promtool check rules /etc/prometheus/rules/*.yml

# Unit-test recording and alerting rules
promtool test rules /etc/prometheus/tests/*.yml

Best Practices:

  1. Start Simple: Begin with static configs, add service discovery later
  2. Use Meaningful Job Names: Make them descriptive and consistent
  3. Group Related Targets: Use job names to group similar services
  4. Implement Proper Labeling: Use consistent label naming conventions
  5. Monitor Configuration Changes: Version control your config files
  6. Test Before Deploy: Always validate configuration changes
  7. Use Appropriate Intervals: Balance data resolution with system load
  8. Secure Credentials: Use file-based authentication, never hardcode passwords

Example Production Configuration:

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

rule_files:
  - '/etc/prometheus/rules/recording.yml'
  - '/etc/prometheus/rules/alerting.yml'

alerting:
  alertmanagers:
    - kubernetes_sd_configs:
        - role: pod
          namespaces:
            names: ['monitoring']
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: alertmanager

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Node exporters via service discovery
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: node-exporter
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        replacement: '${1}:9100'
        target_label: __address__
  
  # Application pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Q8: What are Prometheus exporters and how do they work?

Answer:

Prometheus exporters are specialized programs that bridge the gap between Prometheus and systems that don't natively expose metrics in Prometheus format. They're essential components of the Prometheus ecosystem, enabling monitoring of virtually any system or service.

Exporter Architecture and Theory

Core Concept: An exporter acts as a translator, collecting metrics from a target system (database, operating system, application) and exposing them in Prometheus format via an HTTP endpoint.

Exporter Workflow:

┌─────────────────────────────────────────────────────────────────┐
│                    EXPORTER ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    Native API    ┌─────────────┐              │
│  │   Target    │ ────────────── │  Exporter   │              │
│  │   System    │   (MySQL API,   │             │              │
│  │ (Database,  │    /proc files,  │ - Collects  │              │
│  │  App, etc.) │    REST API)     │ - Converts  │              │
│  └─────────────┘                 │ - Exposes   │              │
│                                   └─────────────┘              │
│                                           │                    │
│                                           ▼                    │
│                                   ┌─────────────┐              │
│                                   │ HTTP Server │              │
│                                   │ /metrics    │              │
│                                   │ endpoint    │              │
│                                   └─────────────┘              │
│                                           │                    │
│                                           ▼                    │
│  ┌─────────────┐    HTTP GET     ┌─────────────┐              │
│  │ Prometheus  │ ────────────── │ Prometheus  │              │
│  │   Server    │   /metrics      │   Metrics   │              │
│  │             │                 │   Format    │              │
│  └─────────────┘                 └─────────────┘              │
└─────────────────────────────────────────────────────────────────┘

How Exporters Work - Step by Step

Step 1: Data Collection. The exporter connects to the target system using its native API, protocols, or interfaces:

  • Databases: SQL queries, admin commands
  • Operating Systems: /proc filesystem, system calls
  • APIs: REST endpoints, GraphQL queries
  • Files: Log parsing, configuration files

Step 2: Data Transformation. Raw data is converted into Prometheus metric format:

  • Apply naming conventions
  • Add appropriate labels
  • Choose correct metric types
  • Handle missing or invalid data

Step 3: HTTP Exposition. Metrics are exposed via an HTTP endpoint (typically /metrics):

  • Standard Prometheus text format
  • Real-time generation (not cached)
  • Consistent response format

Step 4: Prometheus Scraping. Prometheus periodically scrapes the exporter endpoint according to its configuration.
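
Putting the four steps together, here is a minimal, hedged Python sketch of an exporter that reads the 1-minute load average from /proc/loadavg and exposes it in Prometheus format (a Linux-only toy example, not a substitute for Node Exporter):

python
from prometheus_client import Gauge, start_http_server
import time

# Steps 2 and 3: define a gauge that the built-in HTTP server exposes on /metrics
load1 = Gauge('demo_load1', '1-minute load average read from /proc/loadavg')

def collect():
    # Step 1: collect raw data from the target system's native interface
    with open('/proc/loadavg') as f:
        load1.set(float(f.read().split()[0]))

if __name__ == '__main__':
    start_http_server(9101)   # Step 3: expose /metrics; Step 4: Prometheus scrapes it
    while True:
        collect()
        time.sleep(15)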


Official Prometheus Exporters

1. Node Exporter - System Metrics Champion

Purpose: Exposes hardware and OS metrics for Unix systems (Linux, FreeBSD, macOS).

Installation and Setup:

bash
# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

# Run with default configuration
./node_exporter

# Run with specific collectors enabled
./node_exporter --collector.systemd --collector.processes --no-collector.hwmon

# Run as systemd service
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo mv node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

Systemd Service Configuration:

ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes

[Install]
WantedBy=multi-user.target

Key Metrics Provided:

prometheus
# CPU Metrics
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 9876.54
node_cpu_seconds_total{cpu="0",mode="system"} 5432.10

# Memory Metrics
node_memory_MemTotal_bytes 8589934592
node_memory_MemFree_bytes 2147483648
node_memory_MemAvailable_bytes 4294967296
node_memory_Buffers_bytes 536870912
node_memory_Cached_bytes 1073741824

# Disk Metrics
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 107374182400
node_filesystem_free_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 85899345920
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 80530636800

# Network Metrics
node_network_receive_bytes_total{device="eth0"} 12345678901
node_network_transmit_bytes_total{device="eth0"} 9876543210
node_network_receive_packets_total{device="eth0"} 87654321
node_network_transmit_packets_total{device="eth0"} 65432109

# Load Average
node_load1 0.85
node_load5 0.92
node_load15 1.05

# Boot Time
node_boot_time_seconds 1641024000

# Time
node_time_seconds 1641110400

Useful PromQL Queries for Node Exporter:

promql
# CPU usage percentage (average across all cores)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network throughput (bytes per second)
irate(node_network_receive_bytes_total[5m]) + irate(node_network_transmit_bytes_total[5m])

# Disk I/O operations per second
irate(node_disk_reads_completed_total[5m]) + irate(node_disk_writes_completed_total[5m])

2. MySQL Exporter - Database Monitoring

Purpose: Exposes MySQL server metrics for performance monitoring and capacity planning.

Installation and Configuration:

bash
# Download MySQL exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.0/mysqld_exporter-0.15.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.15.0.linux-amd64.tar.gz

# Create MySQL user for monitoring
mysql -u root -p << EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF

Configuration File (.my.cnf):

ini
[client]
user=exporter
password=secretpassword
host=localhost
port=3306

Running MySQL Exporter:

bash
# Using the DATA_SOURCE_NAME environment variable
export DATA_SOURCE_NAME="exporter:password@(localhost:3306)/"
./mysqld_exporter

# Using the .my.cnf configuration file, with an extra collector enabled
./mysqld_exporter --config.my-cnf=.my.cnf --collect.info_schema.tables

Key MySQL Metrics:

prometheus
# Connection Metrics
mysql_global_status_connections 123456
mysql_global_status_threads_connected 45
mysql_global_status_threads_running 8
mysql_global_variables_max_connections 151

# Query Performance
mysql_global_status_queries 9876543
mysql_global_status_slow_queries 123
mysql_global_status_com_select 654321
mysql_global_status_com_insert 98765
mysql_global_status_com_update 54321
mysql_global_status_com_delete 12345

# InnoDB Metrics
mysql_global_status_innodb_buffer_pool_read_requests 8765432
mysql_global_status_innodb_buffer_pool_reads 87654
mysql_global_status_innodb_buffer_pool_pages_data 45678
mysql_global_status_innodb_buffer_pool_pages_free 12345

# Replication Metrics (for slaves)
mysql_slave_lag_seconds 0.5
mysql_slave_sql_running 1
mysql_slave_io_running 1

MySQL Monitoring Queries:

promql
# Connection usage percentage
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100

# Query rate per second
rate(mysql_global_status_queries[5m])

# Slow query rate
rate(mysql_global_status_slow_queries[5m])

# Buffer pool hit ratio (should be > 99%)
(mysql_global_status_innodb_buffer_pool_read_requests - mysql_global_status_innodb_buffer_pool_reads) / mysql_global_status_innodb_buffer_pool_read_requests * 100

# Queries per new connection (rough efficiency ratio, not a latency measure)
rate(mysql_global_status_queries[5m]) / rate(mysql_global_status_connections[5m])

3. Blackbox Exporter - External Monitoring

Purpose: Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP to monitor external services and network connectivity.

Configuration (blackbox.yml):

yaml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      headers:
        User-Agent: "Prometheus Blackbox Exporter"
      follow_redirects: true
      preferred_ip_protocol: "ip4"

  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": "data"}'

  tcp_connect:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: "ip4"

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

  dns:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      valid_rcodes:
        - NOERROR

Prometheus Configuration for Blackbox:

yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Default module
    static_configs:
      - targets:
        - http://prometheus.io
        - https://prometheus.io
        - http://example.com:8080
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: 'blackbox-tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - example.com:22
        - example.com:80
        - example.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Building Custom Exporters

Theory Behind Custom Exporters

Sometimes you need to monitor systems that don't have existing exporters. Building custom exporters allows you to:

  • Monitor proprietary applications
  • Expose business metrics
  • Integrate with internal APIs
  • Create specialized monitoring solutions

Custom Exporter Best Practices

  1. Use Official Client Libraries: Leverage Prometheus client libraries for your language
  2. Follow Naming Conventions: Use descriptive metric names with units
  3. Handle Errors Gracefully: Don't crash on collection failures
  4. Keep It Simple: Focus on essential metrics
  5. Document Your Metrics: Include help text and examples
  6. Test Thoroughly: Verify metric accuracy and performance

Simple Custom Exporter (Python)

python
#!/usr/bin/env python3
"""
Custom application exporter example
Monitors a web application's performance and business metrics
"""

import time
import requests
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Info
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ApplicationExporter:
    """Custom exporter for application metrics"""
    
    def __init__(self, app_url, app_token):
        self.app_url = app_url
        self.app_token = app_token
        
        # Define metrics
        self.app_info = Info('app_info', 'Application information')
        self.app_up = Gauge('app_up', 'Application availability', ['service'])
        self.app_response_time = Histogram(
            'app_response_time_seconds',
            'Application response time',
            ['endpoint', 'method'],
            buckets=[0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        )
        self.app_requests_total = Counter(
            'app_requests_total',
            'Total application requests',
            ['endpoint', 'method', 'status']
        )
        self.app_active_users = Gauge(
            'app_active_users',
            'Number of active users',
            ['type']
        )
        self.app_database_connections = Gauge(
            'app_database_connections',
            'Active database connections',
            ['pool']
        )
        self.app_queue_size = Gauge(
            'app_queue_size',
            'Queue size',
            ['queue_name']
        )
        self.app_revenue_total = Counter(
            'app_revenue_total_cents',
            'Total revenue in cents',
            ['product_type']
        )

        # Track last-seen cumulative totals from the API so they can be converted
        # into counter increments (avoids writing to client-library internals)
        self._last_revenue = {}
        self._last_requests = {}
        
        # Application info (static)
        self.app_info.info({
            'version': self.get_app_version(),
            'environment': 'production',
            'build_date': '2024-01-15'
        })
    
    def get_app_version(self):
        """Get application version from API"""
        try:
            response = requests.get(f"{self.app_url}/api/version", 
                                  headers={'Authorization': f'Bearer {self.app_token}'},
                                  timeout=5)
            return response.json().get('version', 'unknown')
        except Exception as e:
            logger.error(f"Failed to get app version: {e}")
            return 'unknown'
    
    def collect_health_metrics(self):
        """Collect application health metrics"""
        try:
            # Check main application
            start_time = time.time()
            response = requests.get(f"{self.app_url}/health", 
                                  headers={'Authorization': f'Bearer {self.app_token}'},
                                  timeout=10)
            response_time = time.time() - start_time
            
            if response.status_code == 200:
                self.app_up.labels(service='main').set(1)
                self.app_response_time.labels(endpoint='/health', method='GET').observe(response_time)
                
                # Parse health data
                health_data = response.json()
                
                # Database connections
                for pool_name, count in health_data.get('database_pools', {}).items():
                    self.app_database_connections.labels(pool=pool_name).set(count)
                
                # Queue sizes
                for queue_name, size in health_data.get('queues', {}).items():
                    self.app_queue_size.labels(queue_name=queue_name).set(size)
                
            else:
                self.app_up.labels(service='main').set(0)
                
        except Exception as e:
            logger.error(f"Failed to collect health metrics: {e}")
            self.app_up.labels(service='main').set(0)
    
    def collect_business_metrics(self):
        """Collect business metrics"""
        try:
            response = requests.get(f"{self.app_url}/api/metrics", 
                                  headers={'Authorization': f'Bearer {self.app_token}'},
                                  timeout=10)
            
            if response.status_code == 200:
                metrics_data = response.json()
                
                # Active users
                for user_type, count in metrics_data.get('active_users', {}).items():
                    self.app_active_users.labels(type=user_type).set(count)
                
                # Revenue (business metric): the API reports cumulative totals,
                # so increment the counter by the delta since the last collection
                for product_type, revenue in metrics_data.get('revenue_today', {}).items():
                    # Convert to cents to avoid floating point issues
                    revenue_cents = int(revenue * 100)
                    last_cents = self._last_revenue.get(product_type, 0)
                    if revenue_cents > last_cents:
                        self.app_revenue_total.labels(product_type=product_type).inc(revenue_cents - last_cents)
                    self._last_revenue[product_type] = revenue_cents
                
                # Request statistics: likewise convert cumulative totals into increments
                for endpoint_data in metrics_data.get('endpoints', []):
                    endpoint = endpoint_data['path']
                    method = endpoint_data['method']
                    
                    for status_code, count in endpoint_data.get('status_codes', {}).items():
                        key = (endpoint, method, status_code)
                        last_count = self._last_requests.get(key, 0)
                        if count > last_count:
                            self.app_requests_total.labels(
                                endpoint=endpoint,
                                method=method,
                                status=status_code
                            ).inc(count - last_count)
                        self._last_requests[key] = count
                
        except Exception as e:
            logger.error(f"Failed to collect business metrics: {e}")
    
    def collect_all_metrics(self):
        """Collect all metrics"""
        logger.info("Collecting application metrics...")
        self.collect_health_metrics()
        self.collect_business_metrics()
        logger.info("Metrics collection completed")

def run_exporter():
    """Run the custom exporter"""
    # Configuration
    APP_URL = "https://api.myapp.com"
    APP_TOKEN = "your-api-token-here"
    EXPORTER_PORT = 8000
    COLLECTION_INTERVAL = 30  # seconds
    
    # Create exporter instance
    exporter = ApplicationExporter(APP_URL, APP_TOKEN)
    
    # Start HTTP server
    start_http_server(EXPORTER_PORT)
    logger.info(f"Custom exporter started on port {EXPORTER_PORT}")
    
    # Collection loop
    while True:
        try:
            exporter.collect_all_metrics()
            time.sleep(COLLECTION_INTERVAL)
        except KeyboardInterrupt:
            logger.info("Exporter stopped by user")
            break
        except Exception as e:
            logger.error(f"Error in collection loop: {e}")
            time.sleep(10)  # Wait before retrying

if __name__ == '__main__':
    run_exporter()

Advanced Custom Exporter (Go)

go
package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
    
    _ "github.com/lib/pq"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

type DatabaseExporter struct {
    db *sql.DB
    
    // Metrics
    dbUp          prometheus.Gauge
    dbConnections *prometheus.GaugeVec
    queryDuration *prometheus.HistogramVec
    slowQueries   prometheus.Counter
    tableRows     *prometheus.GaugeVec
    tableSize     *prometheus.GaugeVec
}

func NewDatabaseExporter(dsn string) (*DatabaseExporter, error) {
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        return nil, err
    }
    
    return &DatabaseExporter{
        db: db,
        dbUp: prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "database_up",
            Help: "Database availability",
        }),
        dbConnections: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "database_connections",
                Help: "Database connections by state",
            },
            []string{"state"},
        ),
        queryDuration: prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name: "database_query_duration_seconds",
                Help: "Database query duration",
                Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
            },
            []string{"query_type"},
        ),
        slowQueries: prometheus.NewCounter(prometheus.CounterOpts{
            Name: "database_slow_queries_total",
            Help: "Total number of slow queries",
        }),
        tableRows: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "database_table_rows",
                Help: "Number of rows in each table",
            },
            []string{"table", "schema"},
        ),
        tableSize: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "database_table_size_bytes",
                Help: "Size of each table in bytes",
            },
            []string{"table", "schema"},
        ),
    }, nil
}

func (e *DatabaseExporter) collectConnectionMetrics() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    
    start := time.Now()
    
    // Test database connectivity
    if err := e.db.PingContext(ctx); err != nil {
        e.dbUp.Set(0)
        log.Printf("Database ping failed: %v", err)
        return
    }
    e.dbUp.Set(1)
    
    // Collect connection statistics
    query := `
        SELECT state, count(*) 
        FROM pg_stat_activity 
        WHERE datname = current_database() 
        GROUP BY state
    `
    
    rows, err := e.db.QueryContext(ctx, query)
    if err != nil {
        log.Printf("Failed to query connection stats: %v", err)
        return
    }
    defer rows.Close()
    
    for rows.Next() {
        var state string
        var count float64
        if err := rows.Scan(&state, &count); err != nil {
            log.Printf("Failed to scan connection stats: %v", err)
            continue
        }
        e.dbConnections.WithLabelValues(state).Set(count)
    }
    
    e.queryDuration.WithLabelValues("connection_stats").Observe(time.Since(start).Seconds())
}

func (e *DatabaseExporter) collectTableMetrics() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    
    start := time.Now()
    
    query := `
        SELECT 
            schemaname,
            tablename,
            n_live_tup as total_rows,
            pg_total_relation_size(schemaname||'.'||tablename) as table_size
        FROM pg_stat_user_tables
    `
    
    rows, err := e.db.QueryContext(ctx, query)
    if err != nil {
        log.Printf("Failed to query table stats: %v", err)
        return
    }
    defer rows.Close()
    
    for rows.Next() {
        var schema, table string
        var rowCount, tableSize float64
        
        if err := rows.Scan(&schema, &table, &rowCount, &tableSize); err != nil {
            log.Printf("Failed to scan table stats: %v", err)
            continue
        }
        
        e.tableRows.WithLabelValues(table, schema).Set(rowCount)
        e.tableSize.WithLabelValues(table, schema).Set(tableSize)
    }
    
    e.queryDuration.WithLabelValues("table_stats").Observe(time.Since(start).Seconds())
}

func (e *DatabaseExporter) collectSlowQueries() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    
    start := time.Now()
    
    query := `
        SELECT count(*) 
        FROM pg_stat_statements 
        WHERE mean_time > 1000
    `
    
    var slowCount float64
    if err := e.db.QueryRowContext(ctx, query).Scan(&slowCount); err != nil {
        log.Printf("Failed to query slow queries: %v", err)
        return
    }
    
    // This is a simplified example - in practice you'd track the delta
    e.slowQueries.Add(slowCount)
    e.queryDuration.WithLabelValues("slow_queries").Observe(time.Since(start).Seconds())
}

func (e *DatabaseExporter) Collect(ch chan<- prometheus.Metric) {
    e.collectConnectionMetrics()
    e.collectTableMetrics()
    e.collectSlowQueries()
    
    e.dbUp.Collect(ch)
    e.dbConnections.Collect(ch)
    e.queryDuration.Collect(ch)
    e.slowQueries.Collect(ch)
    e.tableRows.Collect(ch)
    e.tableSize.Collect(ch)
}

func (e *DatabaseExporter) Describe(ch chan<- *prometheus.Desc) {
    e.dbUp.Describe(ch)
    e.dbConnections.Describe(ch)
    e.queryDuration.Describe(ch)
    e.slowQueries.Describe(ch)
    e.tableRows.Describe(ch)
    e.tableSize.Describe(ch)
}

func main() {
    dsn := "postgresql://user:password@localhost/dbname?sslmode=disable"
    
    exporter, err := NewDatabaseExporter(dsn)
    if err != nil {
        log.Fatalf("Failed to create exporter: %v", err)
    }
    
    prometheus.MustRegister(exporter)
    
    http.Handle("/metrics", promhttp.Handler())
    log.Println("Database exporter started on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Exporter Deployment Patterns

Sidecar Pattern (Kubernetes):

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-exporter
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: application
        image: myapp:latest
        ports:
        - containerPort: 3000
      - name: exporter
        image: custom-exporter:latest
        ports:
        - containerPort: 8080
        env:
        - name: APP_URL
          value: "http://localhost:3000"
        - name: SCRAPE_INTERVAL
          value: "30s"

Standalone Exporter:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database-exporter
  template:
    metadata:
      labels:
        app: database-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9104"
    spec:
      containers:
      - name: exporter
        image: prom/mysqld-exporter:latest
        ports:
        - containerPort: 9104
        env:
        - name: DATA_SOURCE_NAME
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: dsn


