Advanced Monitoring & Metrics

🎯 Learning Objectives

Master Prometheus and Grafana
Understand custom metrics and exporters
Learn advanced alerting patterns
Troubleshoot monitoring issues
Optimize monitoring performance

Comprehensive monitoring is essential for production clusters. Understanding Prometheus, custom metrics, and alerting enables proactive issue detection and resolution.

Observability

Monitoring provides visibility into cluster health. Combine metrics, logs, and traces for full observability.

Monitoring Overhead

Monitoring adds overhead: every scraped target and every stored time series costs CPU, memory, and disk. Balance coverage against that cost.

Prometheus Architecture

Prometheus Components

Prometheus Server:
- Scrapes metrics from targets
- Stores time-series data
- Evaluates alerting rules

Exporters:
- Expose metrics in Prometheus format
- Node Exporter, cAdvisor, kube-state-metrics

Alertmanager:
- Handles alert routing and grouping
- Sends notifications
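To make "Prometheus format" concrete, here is a minimal sketch of the text exposition format that exporters serve on their metrics endpoint. The function name `render_metric` is hypothetical and exists only for illustration. The sketch does not escape special characters in label values, which a real exporter library handles for you.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs inside braces.
        label_text = ",".join(
            f'{key}="{val}"' for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total",
        "Total HTTP requests",
        "counter",
        [({"method": "GET", "code": "200"}, 1027),
         ({"method": "POST", "code": "500"}, 3)],
    ))
```

Each distinct combination of metric name and label values is one time series, which is why label cardinality drives storage cost.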

Prometheus Operator

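Prometheus Operator simplifies Prometheus deployment and management in Kubernetes by representing Prometheus instances, scrape targets, and alert rules as custom resources such as Prometheus, ServiceMonitor, and PrometheusRule.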

Prometheus Configuration

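apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    # Keep only the API server endpoints; without the relabel rule this job would scrape every endpoint
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      authorization:
        credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      authorization:
        credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token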

Custom Metrics

Exposing Custom Metrics

# Python application exposing metrics
import time

from prometheus_client import Counter, Gauge, start_http_server

requests_total = Counter('http_requests_total', 'Total HTTP requests')
active_connections = Gauge('active_connections', 'Active connections')

if __name__ == '__main__':
    # Start the metrics server first; it serves /metrics from a background thread
    start_http_server(8000)

    # Increment counter
    requests_total.inc()

    # Set gauge
    active_connections.set(10)

    # Keep the process alive so Prometheus can scrape it
    while True:
        time.sleep(1)

Custom Metrics

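Custom metrics enable application-specific monitoring. They can also drive Horizontal Pod Autoscaler (HPA) scaling once an adapter such as Prometheus Adapter exposes them through the Kubernetes custom metrics API.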

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  # Matches labels on the Service object, not on the pods
  selector:
    matchLabels:
      app: api
  endpoints:
  # Name of the Service port to scrape
  - port: metrics
    interval: 30s
    path: /metrics
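The `matchLabels` selector above is an equality match: a Service is selected only when it carries every listed label with the same value. A minimal sketch of that rule follows; the function names and the sample Services are hypothetical.

```python
def matches_selector(match_labels, service_labels):
    """Return True when every selector label is present with the same value."""
    return all(
        service_labels.get(key) == value
        for key, value in match_labels.items()
    )


def select_services(match_labels, services):
    """Return the names of Services whose labels satisfy the selector."""
    return [
        name for name, labels in services.items()
        if matches_selector(match_labels, labels)
    ]


if __name__ == "__main__":
    # Hypothetical Services and their labels
    services = {
        "api": {"app": "api", "tier": "backend"},
        "web": {"app": "web"},
        "unlabeled": {},
    }
    print(select_services({"app": "api"}, services))
```

A common reason for a missing target is that the labels sit on the Deployment or pods but not on the Service itself.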

Alerting

Alert Rules

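groups:
- name: kubernetes
  rules:
  - alert: HighCPUUsage
    # rate() of CPU seconds is measured in cores, so 0.8 means 0.8 cores, not 80% of a limit.
    # container!="" drops the aggregate cgroup series to avoid double counting.
    expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has used more than 0.8 CPU cores for 5 minutes"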
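The `for: 5m` clause means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is pending. The sketch below models that behavior in a simplified form. The function name `alert_states` is hypothetical, and evaluations are assumed to run at a fixed interval.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True = expr matched).
    eval_interval_s:   seconds between evaluations.
    for_s:             the rule's `for` duration in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for index, condition_true in enumerate(condition_results):
        now = index * eval_interval_s
        if not condition_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # A brief dip at the third evaluation restarts the wait.
    print(alert_states([True, True, False, True, True, True], 60, 120))
```

A longer `for` duration filters out short spikes at the cost of slower detection.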

Alert Fatigue

Too many alerts train responders to ignore them. Tune alert thresholds and `for` durations, and use alert grouping in Alertmanager so related alerts arrive as one notification.
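Grouping can be pictured as bucketing alerts by a chosen set of labels, similar in spirit to Alertmanager's `group_by` setting. This is a simplified sketch, not Alertmanager's implementation: the function name and sample alerts are hypothetical, and a missing label is treated as an empty string.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # Assumption: a missing label counts as an empty value.
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical alerts from two namespaces
    alerts = [
        {"labels": {"alertname": "HighCPUUsage", "namespace": "shop", "pod": "api-1"}},
        {"labels": {"alertname": "HighCPUUsage", "namespace": "shop", "pod": "api-2"}},
        {"labels": {"alertname": "HighCPUUsage", "namespace": "billing", "pod": "job-1"}},
    ]
    for key, members in group_alerts(alerts, ["alertname", "namespace"]).items():
        print(key, len(members))
```

Three alerts collapse into two notifications here; grouping by fewer labels produces fewer, larger notifications.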

Troubleshooting

Monitoring Issues

Metrics Not Scraped

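# Check that the ServiceMonitor exists (searches all namespaces)
kubectl get servicemonitor --all-namespaces

# Check Prometheus targets
# Access Prometheus UI and check /targets

# Check pod annotations (relevant only for annotation-based scrape configs)
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'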
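The same information shown on the /targets page is available from the Prometheus HTTP API at `/api/v1/targets`. The sketch below lists targets that are not healthy, along with the last scrape error. The helper names are hypothetical, and the base URL `http://localhost:9090` assumes you have port-forwarded to Prometheus locally.

```python
import json
import urllib.request


def unhealthy_targets(targets_payload):
    """Extract targets whose health is not "up" from an /api/v1/targets response."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job", ""),
            "scrape_url": target.get("scrapeUrl", ""),
            "health": target.get("health", "unknown"),
            "last_error": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url):
    """Fetch the targets payload; base_url is an assumption about your setup."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumes Prometheus is reachable locally, e.g. through a port-forward.
    for target in unhealthy_targets(fetch_targets("http://localhost:9090")):
        print(target["job"], target["scrape_url"], target["last_error"])
```

If the expected job is absent from the response entirely, the problem is in discovery (selector, namespace, port name) and not in the scrape itself.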

High Prometheus Resource Usage

# Check Prometheus pod CPU and memory usage
kubectl top pod -n monitoring prometheus-0

# Review retention settings
kubectl get prometheus -n monitoring -o yaml | grep -i retention

# Check scrape interval
kubectl get prometheus -n monitoring -o yaml | grep scrapeInterval
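Scrape interval and retention matter because they multiply directly into ingestion rate and disk usage. The sketch below is a rough back-of-the-envelope estimate, not a sizing tool. The function names are hypothetical, and the default of 2 bytes per sample is an assumed average that varies with the data.

```python
def samples_per_second(targets, series_per_target, scrape_interval_s):
    """Estimate ingestion rate: each series yields one sample per scrape."""
    return targets * series_per_target / scrape_interval_s


def estimated_disk_bytes(retention_s, ingest_rate, bytes_per_sample=2):
    """Rough disk estimate: retention * ingestion rate * bytes per sample.

    bytes_per_sample is an assumption; measure your own instance for accuracy.
    """
    return retention_s * ingest_rate * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster: 100 targets exposing 1000 series each
    for interval in (15, 30):
        rate = samples_per_second(100, 1000, interval)
        disk_gb = estimated_disk_bytes(15 * 24 * 3600, rate) / 1e9
        print(f"interval={interval}s rate={rate:.0f}/s disk={disk_gb:.1f} GB")
```

Doubling the scrape interval halves both numbers, while memory usage is driven mainly by the number of active series and is reduced by dropping unneeded metrics or labels.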

Best Practices

Production Recommendations

Monitor all critical components
Set up comprehensive alerting
Use ServiceMonitors for service discovery
Implement custom metrics for applications
Review alert effectiveness regularly
Document monitoring architecture
