Advanced Monitoring & Metrics
🎯 Learning Objectives
- Master Prometheus and Grafana
- Understand custom metrics and exporters
- Learn advanced alerting patterns
- Troubleshoot monitoring issues
- Optimize monitoring performance
Comprehensive monitoring is essential for production clusters. Understanding Prometheus, custom metrics, and alerting enables proactive issue detection and resolution.
Observability
Monitoring provides visibility into cluster health. Combine metrics, logs, and traces for full observability.
Monitoring Overhead
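Monitoring adds overhead: every scrape target, every extra label combination, and every shorter scrape interval increases CPU, memory, and storage use. Balance comprehensiveness with performance impact.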
Prometheus Architecture
Prometheus Components
Prometheus Server:
- Scrapes metrics from targets
- Stores time-series data
- Evaluates alerting rules
Exporters:
- Expose metrics in Prometheus format
- Common examples: Node Exporter, cAdvisor, kube-state-metrics
Alertmanager:
- Handles alert routing and grouping
- Sends notifications
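To make "expose metrics in Prometheus format" concrete, the sketch below builds the plain-text exposition format that a scrape of a metrics endpoint returns. The `render_metric` helper and the metric name are hypothetical and exist only to show the layout; real exporters rely on a client library instead of hand-built strings.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside curly braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric, used only to show the output layout.
    print(render_metric(
        "demo_requests_total",
        "counter",
        "Total requests",
        [({"method": "GET"}, 3), ({"method": "POST"}, 1)],
    ))
```

Each sample line is one time series: the metric name plus its label set identifies the series, and the trailing number is the value Prometheus stores at scrape time.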
Prometheus Operator
Prometheus Operator simplifies Prometheus deployment and management in Kubernetes.
Prometheus Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
Custom Metrics
Exposing Custom Metrics
# Python application exposing metrics
import time

from prometheus_client import Counter, Gauge, start_http_server

requests_total = Counter('http_requests_total', 'Total HTTP requests')
active_connections = Gauge('active_connections', 'Active connections')

if __name__ == '__main__':
    # Start metrics server first; it serves /metrics on port 8000 from a background thread
    start_http_server(8000)
    while True:
        # Increment counter
        requests_total.inc()
        # Set gauge
        active_connections.set(10)
        # Keep the process alive so Prometheus can keep scraping it
        time.sleep(1)
Custom Metrics
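Custom metrics enable application-specific monitoring. They can also drive HPA scaling once an adapter such as Prometheus Adapter exposes them through the Kubernetes custom metrics API.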
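As a rough illustration of how a custom metric turns into a scaling decision, the sketch below applies the replica formula from the Kubernetes HPA documentation: desired replicas equal the current replica count multiplied by the ratio of the current metric value to the target value, rounded up. The function name and the sample numbers are hypothetical, and the sketch ignores min/max replica bounds, stabilization windows, and pod readiness.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Simplified HPA calculation for a single metric.

    No scaling happens while the ratio stays within the tolerance band
    around 1.0 (0.1 is the documented default tolerance).
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Hypothetical case: 4 pods averaging 150 requests/s each, target 100.
    print(desired_replicas(4, 150, 100))
```

With these numbers the ratio is 1.5, so the sketch asks for 6 replicas; a reading of 105 against the same target falls inside the tolerance band and leaves the count at 4.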
ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
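The selection logic in the ServiceMonitor above can be sketched in a few lines: a Service is picked up only if it carries every label in `matchLabels` and also has a port with the referenced name. The helper names and the sample Services below are hypothetical; the real matching is done by the Prometheus Operator, which also takes namespaces into account.

```python
def selector_matches(match_labels, object_labels):
    """Return True when every selector key/value pair is present on the object."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


def selected_services(match_labels, port_name, services):
    """Names of Services that match the labels and expose the named port."""
    return [
        svc["name"]
        for svc in services
        if selector_matches(match_labels, svc["labels"]) and port_name in svc["ports"]
    ]


if __name__ == "__main__":
    # Hypothetical Services, for illustration only.
    services = [
        {"name": "api", "labels": {"app": "api", "tier": "backend"}, "ports": ["http", "metrics"]},
        {"name": "api-legacy", "labels": {"app": "api"}, "ports": ["http"]},
        {"name": "web", "labels": {"app": "web"}, "ports": ["metrics"]},
    ]
    print(selected_services({"app": "api"}, "metrics", services))
```

A Service with the right labels but no port named `metrics` is silently skipped, which is a common reason for a target never appearing in Prometheus.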
Alerting
Alert Rules
groups:
  - name: kubernetes
    rules:
      - alert: HighCPUUsage
        # rate() of CPU seconds is measured in cores, so 0.8 means 0.8 of one core
        expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "Container CPU usage has stayed above 0.8 cores for 5 minutes"
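The two moving parts of the rule above, `rate(...[5m])` and `for: 5m`, can be approximated with the sketch below. Both functions are hypothetical simplifications: Prometheus' actual `rate()` also handles counter resets and extrapolates to the window edges, and the sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter across a window.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value).
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def firing_at(evaluations, for_seconds):
    """Timestamp at which an alert would start firing, or None.

    `evaluations` is a time-ordered list of (timestamp_seconds, condition_true).
    The condition must hold on every evaluation for `for_seconds` before firing.
    """
    pending_since = None
    for ts, is_true in evaluations:
        if not is_true:
            pending_since = None  # condition cleared, the pending timer resets
            continue
        if pending_since is None:
            pending_since = ts
        if ts - pending_since >= for_seconds:
            return ts
    return None


if __name__ == "__main__":
    # Hypothetical counter: 270 CPU-seconds consumed over a 300 s window.
    print(simple_rate([(0, 100.0), (300, 370.0)]))
    # The condition drops once at t=150, so the 5-minute timer restarts at t=165.
    print(firing_at([(0, True), (150, False), (165, True), (465, True)], 300))
```

A counter that grows by 270 CPU-seconds in 300 seconds yields 0.9, meaning the container used 0.9 of one core on average, which is why the threshold of 0.8 is a number of cores and not a percentage of a limit.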
Alert Fatigue
Too many alerts reduce effectiveness. Tune alert thresholds and use alert grouping.
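Alert grouping can be pictured as bucketing alerts by a chosen set of labels, so that one notification covers every alert in a bucket. The sketch below imitates the idea behind Alertmanager's `group_by` route setting; the `group_alerts` helper and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts missing a label are grouped under an empty value for it.
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical alerts: two pods in the same namespace trip the same rule.
    alerts = [
        {"labels": {"alertname": "HighCPUUsage", "namespace": "shop", "pod": "api-1"}},
        {"labels": {"alertname": "HighCPUUsage", "namespace": "shop", "pod": "api-2"}},
        {"labels": {"alertname": "HighCPUUsage", "namespace": "batch", "pod": "job-7"}},
    ]
    for key, members in group_alerts(alerts, ["alertname", "namespace"]).items():
        print(key, len(members))
```

Grouping by `alertname` and `namespace` turns three alerts into two notifications here; adding `pod` to the grouping labels would bring back one notification per pod.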
Troubleshooting
Monitoring Issues
Metrics Not Scraped
# Check ServiceMonitor
kubectl get servicemonitor
# Check Prometheus targets
# Access Prometheus UI and check /targets
# Check pod annotations (grep alone would print only the "annotations:" key)
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'
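The target check can also be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports the health and last scrape error of every active target. The sketch below is a minimal example under assumptions: the function names are hypothetical, the base URL depends on how Prometheus is reached in your cluster, and the sample payload is invented to mirror the documented response shape.

```python
import json
import urllib.request


def fetch_targets(base_url):
    """GET /api/v1/targets from a reachable Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every active target that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append((
                target.get("labels", {}).get("job", ""),
                target.get("scrapeUrl", ""),
                target.get("lastError", ""),
            ))
    return problems


if __name__ == "__main__":
    # Invented payload for illustration; a live call would instead be
    # unhealthy_targets(fetch_targets("http://localhost:9090")), an assumed URL.
    sample = {
        "status": "success",
        "data": {"activeTargets": [
            {"labels": {"job": "api"}, "scrapeUrl": "http://10.0.0.5:8000/metrics",
             "health": "down", "lastError": "connection refused"},
            {"labels": {"job": "node"}, "scrapeUrl": "http://10.0.0.9:9100/metrics",
             "health": "up", "lastError": ""},
        ]},
    }
    for job, url, error in unhealthy_targets(sample):
        print(job, url, error)
```

If a Service is missing from the response entirely, the problem lies in discovery (labels, port names, namespaces), not in the scrape itself.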
High Prometheus Resource Usage
# Check CPU and memory usage of the Prometheus pod
kubectl top pod -n monitoring prometheus-0
# Review retention settings
kubectl get prometheus -n monitoring -o yaml
# Check scrape interval
kubectl get prometheus -n monitoring -o yaml | grep scrapeInterval
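To see why retention and scrape interval matter, the sketch below applies the capacity rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (about 1 to 2 bytes on average). The function name and the series count are hypothetical; actual usage depends on churn and compression.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB size estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster with 150,000 active series and 15 days of retention.
    for interval in (15, 30):
        size_gb = estimate_disk_bytes(150_000, interval, 15) / 1e9
        print(f"scrape_interval={interval}s -> about {size_gb:.2f} GB")
```

Doubling the scrape interval halves the estimate, and so does halving the number of active series, which is why trimming high-cardinality labels is usually the first optimization to try.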
Best Practices
Production Recommendations
- Monitor all critical components
- Set up comprehensive alerting
- Use ServiceMonitors for service discovery
- Implement custom metrics for applications
- Review alert effectiveness regularly
- Document the monitoring architecture