Monitoring & Logging¶
🎯 Learning Objectives
- Master monitoring concepts and metrics
- Understand logging best practices
- Learn distributed tracing
- Know observability tools and platforms
- Design effective alerting strategies
Monitoring and logging are essential for production systems. This comprehensive chapter covers observability, metrics, logs, traces, and best practices.
Interview Focus
Understand the three pillars of observability, know when to use which tool, and design effective monitoring strategies.
Observability Fundamentals¶
The Three Pillars of Observability¶
1. Metrics: - Numerical measurements over time - Aggregated data - Low overhead - Use for: Trends, alerts, dashboards
2. Logs: - Discrete events with timestamps - Detailed information - Higher overhead - Use for: Debugging, audit trails
3. Traces: - Request flow through system - Distributed system visibility - Moderate overhead - Use for: Performance analysis, debugging
Observability vs Monitoring
- Monitoring: Known unknowns (what you expect to monitor)
- Observability: Unknown unknowns (ability to understand system behavior)
Metrics¶
Types of Metrics¶
Counter: - Always increasing - Example: Total requests, errors - Use: Track cumulative values
Gauge: - Can go up or down - Example: CPU usage, memory, active connections - Use: Current state values
Histogram: - Distribution of values - Example: Request latency distribution - Use: Understand value distribution
Summary: - Similar to histogram with quantiles - Pre-calculated percentiles - Use: Latency percentiles (p50, p95, p99)
Key Metrics to Monitor¶
Infrastructure Metrics: - CPU utilization - Memory usage - Disk I/O - Network bandwidth - Disk space
Application Metrics: - Request rate (RPS) - Error rate - Latency (p50, p95, p99, p99.9) - Throughput - Queue depth
Business Metrics: - User signups - Revenue - Conversion rate - Active users
Prometheus¶
Prometheus Architecture: - Pull-based metrics collection - Time-series database - PromQL query language - Service discovery
Prometheus Configuration:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'web-app'
static_configs:
- targets: ['web:8080']
metrics_path: '/metrics'
scrape_interval: 10s
PromQL Examples:
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Logging¶
Log Levels¶
DEBUG: - Detailed information for debugging - Usually disabled in production - Example: Variable values, function entry/exit
INFO: - General informational messages - Normal application flow - Example: User login, request processed
WARN: - Warning messages - Something unexpected but handled - Example: Retry attempt, deprecated API usage
ERROR: - Error messages - Something failed but application continues - Example: Database connection failed, retrying
FATAL/CRITICAL: - Critical errors - Application may terminate - Example: Cannot connect to database, out of memory
Structured Logging¶
Benefits: - Machine-readable - Easy to query - Better parsing - Consistent format
Example:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "ERROR",
"service": "user-service",
"trace_id": "abc123",
"user_id": "user-456",
"message": "Failed to process payment",
"error": "Insufficient funds",
"amount": 100.50,
"currency": "USD"
}
Log Aggregation¶
ELK Stack (Elasticsearch, Logstash, Kibana):
Logstash Configuration:
input {
file {
path => "/var/log/app.log"
start_position => "beginning"
}
}
filter {
grok {
match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
}
date {
match => [ "timestamp", "ISO8601" ]
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
}
}
Fluentd:
<source>
@type tail
path /var/log/app.log
pos_file /var/log/fluentd-app.log.pos
tag app.log
format json
</source>
<match app.log>
@type elasticsearch
host elasticsearch
port 9200
index_name app-logs
type_name _doc
</match>
Distributed Tracing¶
OpenTelemetry¶
Instrumentation:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
def process_order(order_id):
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order.id", order_id)
# Process order
span.add_event("Order processed successfully")
Jaeger: - Distributed tracing system - OpenTracing compatible - UI for visualization - Supports multiple backends
Alerting Strategies¶
Alert Design Principles¶
Alert Fatigue Prevention: - Only alert on actionable items - Use appropriate severity levels - Group related alerts - Implement alert suppression
Alert Severity Levels: - Critical: Immediate action required, system down - Warning: Attention needed, may become critical - Info: Informational, no action needed
Alert Rules:
# Prometheus Alert Rules
groups:
- name: application
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }} errors/sec"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "High latency detected"
On-Call Best Practices¶
Runbooks: - Document common issues - Step-by-step resolution - Escalation procedures - Contact information
Alert Routing: - Route to appropriate team - Escalate if not acknowledged - Use PagerDuty, Opsgenie, etc.
Comprehensive Interview Questions¶
Q1: Design a monitoring strategy for a microservices application¶
Answer:
Metrics: - Service-level: Request rate, error rate, latency per service - Infrastructure: CPU, memory, disk per service - Business: User actions, conversions
Logging: - Structured logs with trace IDs - Centralized log aggregation (ELK) - Log retention policies
Tracing: - Distributed tracing (Jaeger/Zipkin) - Trace all requests across services - Identify bottlenecks
Alerting: - Service-level alerts (error rate, latency) - Infrastructure alerts (resource exhaustion) - Business alerts (unusual patterns)
Q2: Explain the difference between metrics and logs¶
Answer:
Metrics: - Aggregated numerical data - Low overhead - Time-series data - Use for: Trends, dashboards, alerts
Logs: - Individual events with details - Higher overhead - Text or structured data - Use for: Debugging, audit trails
When to use: - Metrics: Monitoring, alerting, capacity planning - Logs: Debugging, compliance, detailed analysis
Q3: How do you reduce log volume while maintaining observability?¶
Answer:
- Structured Logging: Easier to filter and query
- Log Levels: Use appropriate levels, disable DEBUG in production
- Sampling: Sample high-volume logs
- Retention Policies: Delete old logs
- Aggregation: Aggregate similar logs
- Filtering: Don't log unnecessary information
Recommended Resources¶
Books¶
- "Observability Engineering" by Charity Majors - Comprehensive observability guide
- "Site Reliability Engineering" by Google - SRE practices
Articles¶
Previous: Cloud Architecture | Next: System Design