Q6: Explain Prometheus data model and metric types.
Answer:
Understanding Prometheus's data model is fundamental to effectively using the monitoring system. The data model defines how metrics are structured, stored, and identified, forming the foundation for all monitoring and alerting capabilities.
Prometheus Data Model - The Foundation
Core Concept: Prometheus stores all data as time series - sequences of timestamped values belonging to the same metric and the same set of labeled dimensions.
Time Series Identity: Each time series is uniquely identified by:
- Metric Name: What you're measuring (e.g., http_requests_total, cpu_usage_percent)
- Labels: Key-value pairs that add dimensions (e.g., method="GET", status="200")
- Timestamp: When the measurement was taken (Unix timestamp)
- Value: The actual measurement (64-bit floating-point number)
Data Storage Format
Standard Format:
metric_name{label1="value1", label2="value2"} value timestamp
Real Example:
http_requests_total{method="GET", status="200", endpoint="/api/users"} 1027.0 1641024000
Data Model Visualization:
Time Series Database Structure (simplified view of the Prometheus TSDB):

```
Metric: http_requests_total

  Series A: {method="GET", status="200", endpoint="/api"}
    (1641024000, 1027.0) (1641024015, 1031.0) (1641024030, 1035.0) (1641024045, 1040.0)

  Series B: {method="POST", status="200", endpoint="/api"}
    (1641024000, 45.0) (1641024015, 47.0)
```

Each unique combination of metric name and labels produces its own time series of (timestamp, value) samples.
Label Theory and Best Practices
Label Cardinality: The combination of all possible label values creates the cardinality of a metric. High cardinality can impact performance.
Good Label Example:
http_requests_total{method="GET", status="200", service="api"} # Low cardinality: method (5 values), status (10 values), service (20 values) # Total combinations: 5 × 10 × 20 = 1,000 time series
Bad Label Example (High Cardinality):
http_requests_total{method="GET", status="200", user_id="12345"} # High cardinality: user_id could have millions of values # Avoid using user IDs, request IDs, or other unbounded values as labels
Four Metric Types in Detail
1. Counter - The Accumulator
Theory: Counters represent cumulative metrics that only increase (or reset to zero on restart). They're perfect for counting events like requests, errors, or bytes transferred.
Key Characteristics:
- Monotonically increasing
- Resets only when process restarts
- Rate of change is more meaningful than absolute value
- Always starts from 0
Mathematical Representation:
Counter(t) ≥ Counter(t-1) for all t (except resets)
Practical Examples:
```prometheus
# HTTP requests counter
http_requests_total{method="GET", status="200"} 1027

# Bytes sent counter
bytes_sent_total{interface="eth0"} 482847392

# Error counter
errors_total{type="timeout", service="payment"} 15
```
Common PromQL Operations with Counters:
```promql
# Rate of HTTP requests per second over 5 minutes
rate(http_requests_total[5m])

# Total increase in requests over 1 hour
increase(http_requests_total[1h])

# Requests per minute
rate(http_requests_total[5m]) * 60
```
Counter Implementation Example (Go):
```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "status", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequests)
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	// Increment counter for each request
	httpRequests.WithLabelValues(r.Method, "200", r.URL.Path).Inc()
	w.Write([]byte("Hello World"))
}

func main() {
	// Serve the application and expose the metrics endpoint
	http.HandleFunc("/", handleRequest)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```
Counter Reset Detection:
```promql
# Detect counter resets (useful for calculating rates across restarts)
resets(http_requests_total[1h])
```
2. Gauge - The Thermometer
Theory: Gauges represent point-in-time values that can go up and down. They're like a thermometer or speedometer - showing current state rather than cumulative values.
Key Characteristics:
- Values can increase or decrease
- Represents current state/level
- Instant value is meaningful
- No reset behavior
Mathematical Representation:
Gauge(t) can be any value relative to Gauge(t-1)
Practical Examples:
```prometheus
# CPU usage percentage
cpu_usage_percent{cpu="0", mode="user"} 45.2

# Memory usage in bytes
memory_usage_bytes{type="heap"} 1073741824

# Current temperature
temperature_celsius{sensor="cpu", location="server_room"} 68.5

# Active connections
active_connections{service="database", pool="primary"} 42
```
Common PromQL Operations with Gauges:
```promql
# Current CPU usage
cpu_usage_percent

# Average memory usage across instances
avg(memory_usage_bytes) by (instance)

# Maximum temperature in last hour
max_over_time(temperature_celsius[1h])

# Memory usage trend (derivative)
deriv(memory_usage_bytes[10m])
```
Gauge Implementation Example (Python):
```python
from prometheus_client import Gauge, start_http_server
import psutil
import time

# Create gauge metrics
cpu_usage = Gauge('cpu_usage_percent', 'CPU usage percentage', ['cpu'])
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')
disk_usage = Gauge('disk_usage_percent', 'Disk usage percentage', ['device'])

def collect_system_metrics():
    while True:
        # Update CPU usage for each core
        cpu_percentages = psutil.cpu_percent(percpu=True)
        for i, usage in enumerate(cpu_percentages):
            cpu_usage.labels(cpu=str(i)).set(usage)

        # Update memory usage
        memory = psutil.virtual_memory()
        memory_usage.set(memory.used)

        # Update disk usage
        for partition in psutil.disk_partitions():
            try:
                usage = psutil.disk_usage(partition.mountpoint)
                disk_usage.labels(device=partition.device).set(
                    (usage.used / usage.total) * 100
                )
            except PermissionError:
                continue

        time.sleep(15)

if __name__ == '__main__':
    start_http_server(8000)
    collect_system_metrics()
```
3. Histogram - The Distribution Analyzer
Theory: Histograms sample observations and count them in configurable buckets. They're designed to measure distributions of values like request latencies, response sizes, or any measurement where you need to understand the distribution pattern.
Key Characteristics:
- Samples observations into buckets
- Provides count, sum, and bucket counts
- Buckets are cumulative (le = "less than or equal")
- Enables percentile calculations
Histogram Components:
```prometheus
# Bucket counters (cumulative)
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.25"} 26335
http_request_duration_seconds_bucket{le="0.5"} 27534
http_request_duration_seconds_bucket{le="1.0"} 28126
http_request_duration_seconds_bucket{le="2.5"} 28312
http_request_duration_seconds_bucket{le="5.0"} 28358
http_request_duration_seconds_bucket{le="10.0"} 28367
http_request_duration_seconds_bucket{le="+Inf"} 28367

# Total count of all observations
http_request_duration_seconds_count 28367

# Sum of all observed values
http_request_duration_seconds_sum 1896.04
```
Histogram Bucket Visualization:
Distribution of response times (per-bucket counts derived from the cumulative values above):

```
le="0.1"    24,054
le="0.25"    2,281  (26,335 - 24,054)
le="0.5"     1,199  (27,534 - 26,335)
le="1.0"       592  (28,126 - 27,534)
le="2.5"       186  (28,312 - 28,126)
le="5.0"        46  (28,358 - 28,312)
le="10.0"        9  (28,367 - 28,358)
le="+Inf"        0  (28,367 - 28,367)

Total observations:    28,367
Sum of all values:     1,896.04 seconds
Average response time: 1,896.04 / 28,367 ≈ 0.067 seconds
```
Percentile Calculations with Histograms:
```promql
# 50th percentile (median) response time
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile response time
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
```
Histogram Implementation Example (Go):
```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency distributions",
			Buckets: []float64{0.1, 0.25, 0.5, 1, 2.5, 5, 10}, // Custom buckets
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(requestDuration)
}

func simulateRequest(method, endpoint string) {
	start := time.Now()

	// Simulate request processing
	processingTime := rand.Float64() * 2 // 0-2 seconds
	time.Sleep(time.Duration(processingTime * float64(time.Second)))

	// Record the duration
	duration := time.Since(start).Seconds()
	requestDuration.WithLabelValues(method, endpoint).Observe(duration)
}

func main() {
	// Expose metrics and keep generating sample observations
	http.Handle("/metrics", promhttp.Handler())
	go func() {
		for {
			simulateRequest("GET", "/api/users")
		}
	}()
	http.ListenAndServe(":8080", nil)
}
```
4. Summary - The Quantile Calculator
Theory: Summaries are similar to histograms but calculate quantiles directly on the client side. They provide count, sum, and configurable quantiles, offering an alternative approach to understanding distributions.
Key Characteristics:
- Calculates quantiles client-side
- Provides count, sum, and quantiles
- Lower server-side computational cost
- Less flexible for aggregation across instances
Summary Components:
```prometheus
# Configured quantiles
http_request_duration_seconds{quantile="0.5"} 0.052
http_request_duration_seconds{quantile="0.9"} 0.564
http_request_duration_seconds{quantile="0.99"} 1.245

# Total count of observations
http_request_duration_seconds_count 144320

# Sum of all observed values
http_request_duration_seconds_sum 1896.04
```
Summary vs Histogram Comparison:
| Aspect | Histogram | Summary |
|---|---|---|
| Quantile Calculation | Server-side with PromQL | Client-side |
| Accuracy | Approximated from buckets | Exact for configured quantiles |
| Aggregation | Can aggregate across instances | Cannot aggregate quantiles |
| Storage | Stores bucket counts | Stores quantile values |
| Flexibility | Any quantile can be calculated | Only pre-configured quantiles |
| Performance | Higher server load for queries | Higher client memory usage |
Summary Implementation Example (Python):
```python
from prometheus_client import Summary, start_http_server
import time
import random

# Create summary with labels (the Python client exposes count and sum)
request_latency = Summary(
    'request_processing_seconds',
    'Time spent processing requests',
    ['method', 'endpoint']
)

# Decorator for automatic timing
@request_latency.labels(method='GET', endpoint='/api/users').time()
def process_get_users():
    # Simulate processing time
    time.sleep(random.uniform(0.1, 0.5))
    return "users data"

# Manual timing
def process_post_users():
    with request_latency.labels(method='POST', endpoint='/api/users').time():
        # Simulate processing
        time.sleep(random.uniform(0.2, 1.0))
        return "user created"

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_get_users()
        process_post_users()
        time.sleep(1)
```
Choosing the Right Metric Type
Decision Matrix:
| Use Case | Metric Type | Reasoning |
|---|---|---|
| Count events (requests, errors) | Counter | Events only increase over time |
| Current state (CPU, memory, queue size) | Gauge | Values can go up and down |
| Measure distributions (latency, size) | Histogram | Need percentiles and buckets |
| Simple quantiles (response time) | Summary | Pre-defined quantiles sufficient |
Best Practices:
- Use descriptive names: http_requests_total instead of requests
- Include units: duration_seconds, size_bytes
- Use consistent labels: Same labels across related metrics
- Avoid high cardinality: Don't use user IDs or request IDs as labels
- Document metrics: Include help text describing what each metric measures
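To tie the decision matrix and naming practices together, here is a minimal sketch using the Python prometheus_client library; the metric names, labels, and values are illustrative only, not taken from any real service:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random
import time

# Counter: event counts, name ends in _total
orders_total = Counter('shop_orders_total', 'Total orders placed', ['status'])

# Gauge: current state, can rise and fall
queue_depth = Gauge('shop_order_queue_depth', 'Orders currently waiting in the queue')

# Histogram: distribution with an explicit unit suffix (_seconds) and buckets
checkout_duration = Histogram(
    'shop_checkout_duration_seconds',
    'Time spent processing a checkout',
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5],
)

# Summary: client-side count and sum of observed values
payload_size = Summary('shop_request_payload_bytes', 'Request payload size in bytes')

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        orders_total.labels(status='created').inc()
        queue_depth.set(random.randint(0, 50))
        with checkout_duration.time():
            time.sleep(random.uniform(0.05, 0.3))
        payload_size.observe(random.randint(200, 5000))
        time.sleep(1)
```

Each metric carries its unit in the name (_total, _seconds, _bytes) and descriptive help text, which is what the best practices above call for.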
Q7: How do you configure Prometheus to scrape targets?
Answer:
Prometheus configuration is the cornerstone of effective monitoring. The configuration file defines what to monitor, how often to collect data, and how to process that data. Understanding configuration is essential for building robust monitoring systems.
Configuration File Structure
Prometheus uses a YAML configuration file (typically prometheus.yml) with several main sections:
Complete Configuration Template:
```yaml
# Global configuration
global:
  scrape_interval: 15s       # How often to scrape targets by default
  evaluation_interval: 15s   # How often to evaluate rules
  external_labels:           # Labels attached to any time series
    monitor: 'production'
    datacenter: 'us-east-1'

# Rule files
rule_files:
  - "rules/*.yml"
  - "alerts/*.yml"

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100', 'server2:9100']
    scrape_interval: 5s
    metrics_path: /metrics
    scheme: https

# Remote write configuration (optional)
remote_write:
  - url: "https://remote-storage-endpoint/api/v1/write"

# Remote read configuration (optional)
remote_read:
  - url: "https://remote-storage-endpoint/api/v1/read"
```
Global Configuration Section
Purpose: Sets default values that apply to all jobs unless overridden.
Key Parameters:
```yaml
global:
  scrape_interval: 15s       # Default scrape frequency
  scrape_timeout: 10s        # Default scrape timeout
  evaluation_interval: 15s   # How often to evaluate recording and alerting rules
  external_labels:           # Labels added to all time series
    region: 'us-west-2'
    environment: 'production'
```
Theory Behind Intervals:
- Scrape Interval: Balance between data resolution and system load
- Evaluation Interval: How quickly alerts are triggered
- Timeout: Must be less than scrape interval
Scrape Configuration Deep Dive
Basic Static Configuration:
```yaml
scrape_configs:
  - job_name: 'web-servers'          # Logical grouping name
    static_configs:
      - targets:
          - 'web1.example.com:8080'
          - 'web2.example.com:8080'
          - 'web3.example.com:8080'
    scrape_interval: 30s             # Override global interval
    scrape_timeout: 10s              # Maximum time to wait for response
    metrics_path: '/metrics'         # Path to metrics endpoint
    scheme: 'http'                   # http or https
    params:                          # URL parameters
      format: ['prometheus']
    basic_auth:                      # HTTP basic authentication
      username: 'prometheus'
      password: 'secret'
    # bearer_token: 'abc123'         # Bearer token auth (use instead of basic_auth, not together)
    tls_config:                      # TLS configuration
      ca_file: '/path/to/ca.pem'
      cert_file: '/path/to/cert.pem'
      key_file: '/path/to/key.pem'
      insecure_skip_verify: false
```
Service Discovery Mechanisms
Theory: Modern infrastructure is dynamic. Containers start and stop, auto-scaling changes instance counts, and services move between hosts. Static configuration doesn't scale. Service discovery automatically finds targets to monitor.
Kubernetes Service Discovery:
```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod                    # Discover pods
        namespaces:
          names:
            - default
            - monitoring
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
```
Service Discovery Flow Diagram:
```
Service Discovery Flow:

  Prometheus Server --(API query)--> Kubernetes API Server
  Kubernetes API Server --(pod metadata)--> Target Discovery
  Target Discovery --> Relabeling Rules
  Relabeling Rules --> Scrape Targets --(HTTP GET /metrics)--> Target Endpoints
```
AWS EC2 Service Discovery:
```yaml
scrape_configs:
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: us-west-2
        port: 9100
        filters:
          - name: 'tag:Environment'
            values: ['production', 'staging']
          - name: 'instance-state-name'
            values: ['running']
    relabel_configs:
      # Use instance ID as instance label
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance
      # Add environment from EC2 tag
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: environment
      # Add instance type
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
```
Consul Service Discovery:
```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        services: ['web', 'api', 'database']
    relabel_configs:
      # Use service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
      # Add datacenter information
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
      # Keep only services whose health check is passing
      - source_labels: [__meta_consul_health]
        regex: passing
        action: keep
```
Relabeling - The Power Tool
Theory: Relabeling is Prometheus's Swiss Army knife. It allows you to modify, add, or remove labels before storing metrics. This is crucial for:
- Filtering unwanted targets
- Adding contextual information
- Standardizing label names
- Reducing cardinality
Relabeling Actions:
- replace: Replace label value with new value
- keep: Keep only targets where label matches regex
- drop: Drop targets where label matches regex
- labelmap: Map label names using regex
- labeldrop: Drop labels matching regex
- labelkeep: Keep only labels matching regex
Advanced Relabeling Examples:
```yaml
relabel_configs:
  # Keep only targets with specific annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: 'true'
  # Extract service name from pod name
  - source_labels: [__meta_kubernetes_pod_name]
    regex: '([^-]+)-.*'
    replacement: '${1}'
    target_label: service
  # Add custom labels based on namespace
  - source_labels: [__meta_kubernetes_namespace]
    regex: 'production'
    replacement: 'prod'
    target_label: environment
  # Drop internal labels (starting with __meta_)
  - regex: '__meta_.*'
    action: labeldrop
  # Map Kubernetes labels to Prometheus labels
  - regex: '__meta_kubernetes_pod_label_(.+)'
    action: labelmap
    replacement: 'k8s_${1}'
```
Authentication and Security
Basic Authentication:
```yaml
scrape_configs:
  - job_name: 'secured-app'
    static_configs:
      - targets: ['app.example.com:8080']
    basic_auth:
      username: 'monitoring'
      password_file: '/etc/prometheus/passwords/app_password'
```
Bearer Token Authentication:
```yaml
scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api.example.com:8080']
    bearer_token_file: '/etc/prometheus/tokens/api_token'
```
TLS Configuration:
```yaml
scrape_configs:
  - job_name: 'https-service'
    static_configs:
      - targets: ['secure.example.com:8443']
    scheme: https
    tls_config:
      ca_file: '/etc/prometheus/ca.pem'
      cert_file: '/etc/prometheus/client.pem'
      key_file: '/etc/prometheus/client-key.pem'
      server_name: 'secure.example.com'
      insecure_skip_verify: false
```
Configuration Validation and Best Practices
Validation Commands:
```bash
# Check configuration syntax (validates the file without starting the server)
promtool check config prometheus.yml

# Check rules syntax
promtool check rules /etc/prometheus/rules/*.yml
```
Best Practices:
- Start Simple: Begin with static configs, add service discovery later
- Use Meaningful Job Names: Make them descriptive and consistent
- Group Related Targets: Use job names to group similar services
- Implement Proper Labeling: Use consistent label naming conventions
- Monitor Configuration Changes: Version control your config files
- Test Before Deploy: Always validate configuration changes
- Use Appropriate Intervals: Balance data resolution with system load
- Secure Credentials: Use file-based authentication, never hardcode passwords
Example Production Configuration:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

rule_files:
  - '/etc/prometheus/rules/recording.yml'
  - '/etc/prometheus/rules/alerting.yml'

alerting:
  alertmanagers:
    - kubernetes_sd_configs:
        - role: pod
          namespaces:
            names: ['monitoring']
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: alertmanager

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters via service discovery
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: node-exporter
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        replacement: '${1}:9100'
        target_label: __address__

  # Application pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```
Q8: What are Prometheus exporters and how do they work?
Answer:
Prometheus exporters are specialized programs that bridge the gap between Prometheus and systems that don't natively expose metrics in Prometheus format. They're essential components of the Prometheus ecosystem, enabling monitoring of virtually any system or service.
Exporter Architecture and Theory
Core Concept: An exporter acts as a translator, collecting metrics from a target system (database, operating system, application) and exposing them in Prometheus format via an HTTP endpoint.
Exporter Workflow:
```
Exporter Architecture:

  Target System (database, application, OS)
        | native interface (MySQL API, /proc files, REST API)
        v
  Exporter: collects -> converts -> exposes
        |
        v
  HTTP server with /metrics endpoint
        |
        v
  Prometheus server scrapes /metrics (Prometheus text format)
```
How Exporters Work - Step by Step
Step 1: Data Collection The exporter connects to the target system using its native API, protocols, or interfaces:
- Databases: SQL queries, admin commands
- Operating Systems: /proc filesystem, system calls
- APIs: REST endpoints, GraphQL queries
- Files: Log parsing, configuration files
Step 2: Data Transformation Raw data is converted into Prometheus metric format:
- Apply naming conventions
- Add appropriate labels
- Choose correct metric types
- Handle missing or invalid data
Step 3: HTTP Exposition Metrics are exposed via HTTP endpoint (typically /metrics):
- Standard Prometheus text format
- Real-time generation (not cached)
- Consistent response format
Step 4: Prometheus Scraping Prometheus periodically scrapes the exporter endpoint according to configuration.
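As a concrete illustration of these four steps, the sketch below (a minimal example using the Python prometheus_client library; the metric names and port 9101 are arbitrary choices) collects two raw values from the operating system, converts them into gauges, and exposes them on /metrics for Prometheus to scrape:

```python
from prometheus_client import Gauge, start_http_server
import os
import shutil
import time

# Step 2: define Prometheus metrics with conventional names and units
load1 = Gauge('demo_system_load1', '1-minute load average')
root_fs_used_ratio = Gauge('demo_root_filesystem_used_ratio',
                           'Used fraction of the root filesystem')

def collect():
    # Step 1: pull raw data from the target system's native interface (here: the OS)
    one_minute_load, _, _ = os.getloadavg()
    usage = shutil.disk_usage('/')

    # Step 2 (continued): convert the raw values into the defined metrics
    load1.set(one_minute_load)
    root_fs_used_ratio.set(usage.used / usage.total)

if __name__ == '__main__':
    # Step 3: expose everything on an HTTP /metrics endpoint
    start_http_server(9101)
    while True:
        collect()
        # Step 4: Prometheus scrapes http://<host>:9101/metrics on its own schedule
        time.sleep(15)
```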
Official Prometheus Exporters
1. Node Exporter - System Metrics Champion
Purpose: Exposes hardware and OS metrics for Unix systems (Linux, FreeBSD, macOS).
Installation and Setup:
```bash
# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

# Run with default configuration
./node_exporter

# Run with specific collectors enabled
./node_exporter --collector.systemd --collector.processes --no-collector.hwmon

# Run as systemd service
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo mv node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
```
Systemd Service Configuration:
```ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes

[Install]
WantedBy=multi-user.target
```
Key Metrics Provided:
```prometheus
# CPU Metrics
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 9876.54
node_cpu_seconds_total{cpu="0",mode="system"} 5432.10

# Memory Metrics
node_memory_MemTotal_bytes 8589934592
node_memory_MemFree_bytes 2147483648
node_memory_MemAvailable_bytes 4294967296
node_memory_Buffers_bytes 536870912
node_memory_Cached_bytes 1073741824

# Disk Metrics
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 107374182400
node_filesystem_free_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 85899345920
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 80530636800

# Network Metrics
node_network_receive_bytes_total{device="eth0"} 12345678901
node_network_transmit_bytes_total{device="eth0"} 9876543210
node_network_receive_packets_total{device="eth0"} 87654321
node_network_transmit_packets_total{device="eth0"} 65432109

# Load Average
node_load1 0.85
node_load5 0.92
node_load15 1.05

# Boot Time
node_boot_time_seconds 1641024000

# Time
node_time_seconds 1641110400
```
Useful PromQL Queries for Node Exporter:
```promql
# CPU usage percentage (average across all cores)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network throughput (bytes per second)
irate(node_network_receive_bytes_total[5m]) + irate(node_network_transmit_bytes_total[5m])

# Disk I/O operations per second
irate(node_disk_reads_completed_total[5m]) + irate(node_disk_writes_completed_total[5m])
```
2. MySQL Exporter - Database Monitoring
Purpose: Exposes MySQL server metrics for performance monitoring and capacity planning.
Installation and Configuration:
```bash
# Download MySQL exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.0/mysqld_exporter-0.15.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.15.0.linux-amd64.tar.gz

# Create MySQL user for monitoring
mysql -u root -p << EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF
```
Configuration File (.my.cnf):
```ini
[client]
user=exporter
password=secretpassword
host=localhost
port=3306
```
Running MySQL Exporter:
```bash
# Using an environment variable (DSN)
export DATA_SOURCE_NAME="exporter:password@(localhost:3306)/"
./mysqld_exporter

# Using the configuration file, with an extra collector enabled
./mysqld_exporter --config.my-cnf=.my.cnf --collect.info_schema.tables
```
Key MySQL Metrics:
```prometheus
# Connection Metrics
mysql_global_status_connections 123456
mysql_global_status_threads_connected 45
mysql_global_status_threads_running 8
mysql_global_variables_max_connections 151

# Query Performance
mysql_global_status_queries 9876543
mysql_global_status_slow_queries 123
mysql_global_status_com_select 654321
mysql_global_status_com_insert 98765
mysql_global_status_com_update 54321
mysql_global_status_com_delete 12345

# InnoDB Metrics
mysql_global_status_innodb_buffer_pool_read_requests 8765432
mysql_global_status_innodb_buffer_pool_reads 87654
mysql_global_status_innodb_buffer_pool_pages_data 45678
mysql_global_status_innodb_buffer_pool_pages_free 12345

# Replication Metrics (for replicas)
mysql_slave_lag_seconds 0.5
mysql_slave_sql_running 1
mysql_slave_io_running 1
```
MySQL Monitoring Queries:
```promql
# Connection usage percentage
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100

# Query rate per second
rate(mysql_global_status_queries[5m])

# Slow query rate
rate(mysql_global_status_slow_queries[5m])

# Buffer pool hit ratio (should be > 99%)
(mysql_global_status_innodb_buffer_pool_read_requests - mysql_global_status_innodb_buffer_pool_reads)
  / mysql_global_status_innodb_buffer_pool_read_requests * 100

# Queries per new connection (rough load indicator, not a true response-time measure)
rate(mysql_global_status_queries[5m]) / rate(mysql_global_status_connections[5m])
```
3. Blackbox Exporter - External Monitoring
Purpose: Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP to monitor external services and network connectivity.
Configuration (blackbox.yml):
```yaml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      headers:
        User-Agent: "Prometheus Blackbox Exporter"
      follow_redirects: true
      preferred_ip_protocol: "ip4"

  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": "data"}'

  tcp_connect:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: "ip4"

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

  dns:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      valid_rcodes:
        - NOERROR
```
Prometheus Configuration for Blackbox:
```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]   # Default module
    static_configs:
      - targets:
          - http://prometheus.io
          - https://prometheus.io
          - http://example.com:8080
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: 'blackbox-tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - example.com:22
          - example.com:80
          - example.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```
Building Custom Exporters
Theory Behind Custom Exporters
Sometimes you need to monitor systems that don't have existing exporters. Building custom exporters allows you to:
- Monitor proprietary applications
- Expose business metrics
- Integrate with internal APIs
- Create specialized monitoring solutions
Custom Exporter Best Practices
- Use Official Client Libraries: Leverage Prometheus client libraries for your language
- Follow Naming Conventions: Use descriptive metric names with units
- Handle Errors Gracefully: Don't crash on collection failures
- Keep It Simple: Focus on essential metrics
- Document Your Metrics: Include help text and examples
- Test Thoroughly: Verify metric accuracy and performance
Simple Custom Exporter (Python)
```python
#!/usr/bin/env python3
"""
Custom application exporter example
Monitors a web application's performance and business metrics
"""

import time
import requests
import logging
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Info

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ApplicationExporter:
    """Custom exporter for application metrics"""

    def __init__(self, app_url, app_token):
        self.app_url = app_url
        self.app_token = app_token

        # Track last seen totals so counters can be incremented by deltas
        self.last_request_counts = {}
        self.last_revenue = {}

        # Define metrics
        self.app_info = Info('app_info', 'Application information')
        self.app_up = Gauge('app_up', 'Application availability', ['service'])
        self.app_response_time = Histogram(
            'app_response_time_seconds',
            'Application response time',
            ['endpoint', 'method'],
            buckets=[0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        )
        self.app_requests_total = Counter(
            'app_requests_total',
            'Total application requests',
            ['endpoint', 'method', 'status']
        )
        self.app_active_users = Gauge(
            'app_active_users',
            'Number of active users',
            ['type']
        )
        self.app_database_connections = Gauge(
            'app_database_connections',
            'Active database connections',
            ['pool']
        )
        self.app_queue_size = Gauge(
            'app_queue_size',
            'Queue size',
            ['queue_name']
        )
        self.app_revenue_total = Counter(
            'app_revenue_total_cents',
            'Total revenue in cents',
            ['product_type']
        )

        # Application info (static)
        self.app_info.info({
            'version': self.get_app_version(),
            'environment': 'production',
            'build_date': '2024-01-15'
        })

    def get_app_version(self):
        """Get application version from API"""
        try:
            response = requests.get(
                f"{self.app_url}/api/version",
                headers={'Authorization': f'Bearer {self.app_token}'},
                timeout=5
            )
            return response.json().get('version', 'unknown')
        except Exception as e:
            logger.error(f"Failed to get app version: {e}")
            return 'unknown'

    def collect_health_metrics(self):
        """Collect application health metrics"""
        try:
            # Check main application
            start_time = time.time()
            response = requests.get(
                f"{self.app_url}/health",
                headers={'Authorization': f'Bearer {self.app_token}'},
                timeout=10
            )
            response_time = time.time() - start_time

            if response.status_code == 200:
                self.app_up.labels(service='main').set(1)
                self.app_response_time.labels(endpoint='/health', method='GET').observe(response_time)

                # Parse health data
                health_data = response.json()

                # Database connections
                for pool_name, count in health_data.get('database_pools', {}).items():
                    self.app_database_connections.labels(pool=pool_name).set(count)

                # Queue sizes
                for queue_name, size in health_data.get('queues', {}).items():
                    self.app_queue_size.labels(queue_name=queue_name).set(size)
            else:
                self.app_up.labels(service='main').set(0)

        except Exception as e:
            logger.error(f"Failed to collect health metrics: {e}")
            self.app_up.labels(service='main').set(0)

    def collect_business_metrics(self):
        """Collect business metrics"""
        try:
            response = requests.get(
                f"{self.app_url}/api/metrics",
                headers={'Authorization': f'Bearer {self.app_token}'},
                timeout=10
            )

            if response.status_code == 200:
                metrics_data = response.json()

                # Active users
                for user_type, count in metrics_data.get('active_users', {}).items():
                    self.app_active_users.labels(type=user_type).set(count)

                # Revenue (business metric): increment the counter by the delta since last scrape
                for product_type, revenue in metrics_data.get('revenue_today', {}).items():
                    # Convert to cents to avoid floating point issues
                    revenue_cents = int(revenue * 100)
                    delta = revenue_cents - self.last_revenue.get(product_type, 0)
                    if delta > 0:
                        self.app_revenue_total.labels(product_type=product_type).inc(delta)
                    self.last_revenue[product_type] = revenue_cents

                # Request statistics: also tracked as deltas per (endpoint, method, status)
                for endpoint_data in metrics_data.get('endpoints', []):
                    endpoint = endpoint_data['path']
                    method = endpoint_data['method']

                    for status_code, count in endpoint_data.get('status_codes', {}).items():
                        key = (endpoint, method, status_code)
                        delta = count - self.last_request_counts.get(key, 0)
                        if delta > 0:
                            self.app_requests_total.labels(
                                endpoint=endpoint,
                                method=method,
                                status=status_code
                            ).inc(delta)
                        self.last_request_counts[key] = count

        except Exception as e:
            logger.error(f"Failed to collect business metrics: {e}")

    def collect_all_metrics(self):
        """Collect all metrics"""
        logger.info("Collecting application metrics...")
        self.collect_health_metrics()
        self.collect_business_metrics()
        logger.info("Metrics collection completed")


def run_exporter():
    """Run the custom exporter"""
    # Configuration
    APP_URL = "https://api.myapp.com"
    APP_TOKEN = "your-api-token-here"
    EXPORTER_PORT = 8000
    COLLECTION_INTERVAL = 30  # seconds

    # Create exporter instance
    exporter = ApplicationExporter(APP_URL, APP_TOKEN)

    # Start HTTP server
    start_http_server(EXPORTER_PORT)
    logger.info(f"Custom exporter started on port {EXPORTER_PORT}")

    # Collection loop
    while True:
        try:
            exporter.collect_all_metrics()
            time.sleep(COLLECTION_INTERVAL)
        except KeyboardInterrupt:
            logger.info("Exporter stopped by user")
            break
        except Exception as e:
            logger.error(f"Error in collection loop: {e}")
            time.sleep(10)  # Wait before retrying


if __name__ == '__main__':
    run_exporter()
```
Advanced Custom Exporter (Go)
```go
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"

	_ "github.com/lib/pq"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type DatabaseExporter struct {
	db *sql.DB

	// Metrics
	dbUp          prometheus.Gauge
	dbConnections *prometheus.GaugeVec
	queryDuration *prometheus.HistogramVec
	slowQueries   prometheus.Counter
	tableRows     *prometheus.GaugeVec
	tableSize     *prometheus.GaugeVec
}

func NewDatabaseExporter(dsn string) (*DatabaseExporter, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}

	return &DatabaseExporter{
		db: db,
		dbUp: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "database_up",
			Help: "Database availability",
		}),
		dbConnections: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "database_connections",
				Help: "Database connections by state",
			},
			[]string{"state"},
		),
		queryDuration: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name:    "database_query_duration_seconds",
				Help:    "Database query duration",
				Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
			},
			[]string{"query_type"},
		),
		slowQueries: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "database_slow_queries_total",
			Help: "Total number of slow queries",
		}),
		tableRows: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "database_table_rows",
				Help: "Number of rows in each table",
			},
			[]string{"table", "schema"},
		),
		tableSize: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "database_table_size_bytes",
				Help: "Size of each table in bytes",
			},
			[]string{"table", "schema"},
		),
	}, nil
}

func (e *DatabaseExporter) collectConnectionMetrics() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	start := time.Now()

	// Test database connectivity
	if err := e.db.PingContext(ctx); err != nil {
		e.dbUp.Set(0)
		log.Printf("Database ping failed: %v", err)
		return
	}
	e.dbUp.Set(1)

	// Collect connection statistics
	query := `
		SELECT state, count(*)
		FROM pg_stat_activity
		WHERE datname = current_database()
		GROUP BY state
	`

	rows, err := e.db.QueryContext(ctx, query)
	if err != nil {
		log.Printf("Failed to query connection stats: %v", err)
		return
	}
	defer rows.Close()

	for rows.Next() {
		var state string
		var count float64
		if err := rows.Scan(&state, &count); err != nil {
			log.Printf("Failed to scan connection stats: %v", err)
			continue
		}
		e.dbConnections.WithLabelValues(state).Set(count)
	}

	e.queryDuration.WithLabelValues("connection_stats").Observe(time.Since(start).Seconds())
}

func (e *DatabaseExporter) collectTableMetrics() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	start := time.Now()

	query := `
		SELECT
			schemaname,
			tablename,
			n_tup_ins + n_tup_upd + n_tup_del as total_rows,
			pg_total_relation_size(schemaname||'.'||tablename) as table_size
		FROM pg_stat_user_tables
	`

	rows, err := e.db.QueryContext(ctx, query)
	if err != nil {
		log.Printf("Failed to query table stats: %v", err)
		return
	}
	defer rows.Close()

	for rows.Next() {
		var schema, table string
		var rowCount, tableSize float64
		if err := rows.Scan(&schema, &table, &rowCount, &tableSize); err != nil {
			log.Printf("Failed to scan table stats: %v", err)
			continue
		}
		e.tableRows.WithLabelValues(table, schema).Set(rowCount)
		e.tableSize.WithLabelValues(table, schema).Set(tableSize)
	}

	e.queryDuration.WithLabelValues("table_stats").Observe(time.Since(start).Seconds())
}

func (e *DatabaseExporter) collectSlowQueries() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	start := time.Now()

	query := `
		SELECT count(*)
		FROM pg_stat_statements
		WHERE mean_time > 1000
	`

	var slowCount float64
	if err := e.db.QueryRowContext(ctx, query).Scan(&slowCount); err != nil {
		log.Printf("Failed to query slow queries: %v", err)
		return
	}

	// This is a simplified example - in practice you'd track the delta
	e.slowQueries.Add(slowCount)

	e.queryDuration.WithLabelValues("slow_queries").Observe(time.Since(start).Seconds())
}

func (e *DatabaseExporter) Collect(ch chan<- prometheus.Metric) {
	e.collectConnectionMetrics()
	e.collectTableMetrics()
	e.collectSlowQueries()

	e.dbUp.Collect(ch)
	e.dbConnections.Collect(ch)
	e.queryDuration.Collect(ch)
	e.slowQueries.Collect(ch)
	e.tableRows.Collect(ch)
	e.tableSize.Collect(ch)
}

func (e *DatabaseExporter) Describe(ch chan<- *prometheus.Desc) {
	e.dbUp.Describe(ch)
	e.dbConnections.Describe(ch)
	e.queryDuration.Describe(ch)
	e.slowQueries.Describe(ch)
	e.tableRows.Describe(ch)
	e.tableSize.Describe(ch)
}

func main() {
	dsn := "postgresql://user:password@localhost/dbname?sslmode=disable"

	exporter, err := NewDatabaseExporter(dsn)
	if err != nil {
		log.Fatalf("Failed to create exporter: %v", err)
	}

	prometheus.MustRegister(exporter)

	http.Handle("/metrics", promhttp.Handler())
	log.Println("Database exporter started on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
Exporter Deployment Patterns
Sidecar Pattern (Kubernetes):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-exporter
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: application
          image: myapp:latest
          ports:
            - containerPort: 3000
        - name: exporter
          image: custom-exporter:latest
          ports:
            - containerPort: 8080
          env:
            - name: APP_URL
              value: "http://localhost:3000"
            - name: SCRAPE_INTERVAL
              value: "30s"
```
Standalone Exporter:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database-exporter
  template:
    metadata:
      labels:
        app: database-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9104"
    spec:
      containers:
        - name: exporter
          image: prom/mysqld-exporter:latest
          ports:
            - containerPort: 9104
          env:
            - name: DATA_SOURCE_NAME
              valueFrom:
                secretKeyRef:
                  name: mysql-secret
                  key: dsn
```