Day-2 Operations¶

🎯 Learning Objectives

Master cluster upgrades and maintenance
Understand operational best practices
Learn capacity planning and scaling
Troubleshoot operational issues
Implement operational excellence

Day-2 operations are the ongoing tasks of running a cluster after it has been deployed: upgrades, maintenance, capacity planning, and troubleshooting. Handling them well keeps the cluster reliable and performant.

Operational Excellence

Day-2 operations determine long-term cluster success. Invest in automation and best practices.

Operational Debt

Neglecting day-2 operations accumulates technical debt. Regular maintenance is essential.

Cluster Upgrades¶

Upgrade Strategy¶

Upgrade Approaches:

- In-Place: Upgrade existing cluster
- Blue-Green: New cluster, migrate workloads
- Canary: Gradual rollout

Upgrade Planning

Plan upgrades carefully: test in a non-production environment, have a rollback plan, and schedule a maintenance window.
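One concrete piece of a rollback plan on a kubeadm cluster is an etcd snapshot taken just before the upgrade. The following is a minimal sketch, assuming a local (stacked) etcd member on 127.0.0.1:2379, kubeadm's default etcd certificate paths, and `etcdctl` installed on the control plane node. The function name `backup_etcd`, the `ETCD_PKI_DIR` variable, and the default output directory are hypothetical.

```shell
# Hypothetical helper: save an etcd snapshot before an upgrade.
# Assumes kubeadm's default etcd certificate locations and a local etcd member.
backup_etcd() (
  out_dir="${1:-/var/backups/etcd}"                 # assumed backup location
  pki="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}"   # kubeadm default
  snapshot="$out_dir/etcd-$(date +%Y%m%d-%H%M%S).db"

  mkdir -p "$out_dir" || exit 1
  export ETCDCTL_API=3
  etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$pki/ca.crt" \
    --cert="$pki/server.crt" \
    --key="$pki/server.key" \
    snapshot save "$snapshot" || exit 1

  # Print the snapshot path so the caller can record it
  echo "$snapshot"
)
```

Calling `backup_etcd /mnt/backups` on a control plane node would write a timestamped snapshot there and print its path. Copy the file off the node as well, since a snapshot stored only on the node being upgraded is of limited use.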

kubeadm Upgrade¶

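# On the first control plane node: upgrade kubeadm, then the control plane
# (run apt-mark unhold first if the packages are held)
apt-get update
apt-get install kubeadm='1.28.0-*'
kubeadm upgrade plan
kubeadm upgrade apply v1.28.0

# Upgrade kubelet and kubectl (drain the node first, uncordon it afterwards)
apt-get install kubelet='1.28.0-*' kubectl='1.28.0-*'
systemctl daemon-reload
systemctl restart kubelet

# On additional control plane nodes and on each worker node:
# upgrade kubeadm, run the node upgrade, then upgrade kubelet and kubectl as above
apt-get update
apt-get install kubeadm='1.28.0-*'
kubeadm upgrade node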

Upgrade Order

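Upgrade the control plane first, then the worker nodes. Upgrade one minor version at a time (for example, 1.27 to 1.28); kubeadm does not support skipping minor versions. Follow the Kubernetes upgrade documentation for the target release.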
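Because kubeadm upgrades move at most one minor version at a time, it can help to check the version gap before starting. The helpers below are a minimal sketch of such a check; `major_of`, `minor_of`, and `can_upgrade` are hypothetical names, and the check compares only major and minor numbers, not patch levels.

```shell
# Hypothetical helper: print the major number of a version such as "v1.28.0".
major_of() (
  v="${1#v}"          # strip the leading "v"
  echo "${v%%.*}"     # keep MAJOR
)

# Hypothetical helper: print the minor number of a version such as "v1.28.0".
minor_of() (
  v="${1#v}"          # strip the leading "v"
  rest="${v#*.}"      # drop "MAJOR."
  echo "${rest%%.*}"  # keep MINOR
)

# Succeeds only when the target is the same minor release or exactly one
# minor release ahead of the current version (patch levels are not compared).
can_upgrade() (
  current="$1"; target="$2"
  [ "$(major_of "$current")" -eq "$(major_of "$target")" ] || exit 1
  skew=$(( $(minor_of "$target") - $(minor_of "$current") ))
  [ "$skew" -ge 0 ] && [ "$skew" -le 1 ]
)
```

For example, `can_upgrade v1.27.3 v1.28.0` succeeds, while `can_upgrade v1.26.5 v1.28.0` fails because it would skip 1.27. The current version could be taken from `kubeadm version -o short` on the node.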

Maintenance Operations¶

Node Maintenance¶

# Drain node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Perform maintenance
# ...

# Uncordon node
kubectl uncordon <node-name>

Node Draining

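Draining cordons the node and then evicts its pods so that their controllers can recreate them on other nodes. Evictions respect PodDisruptionBudgets and each pod's termination grace period. Pods that are not managed by a controller are not recreated, and kubectl drain refuses to evict them unless --force is given. Always drain before maintenance.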
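The drain, maintain, uncordon sequence can be wrapped so that the uncordon step is not forgotten. This is a minimal sketch, not a hardened tool: `maintain_node` is a hypothetical name, the 300-second drain timeout is an arbitrary choice, and the decision to leave the node cordoned when a step fails is an assumption about what operators would want.

```shell
# Hypothetical wrapper: drain a node, run a maintenance command, then uncordon.
# Usage: maintain_node <node-name> <command> [args...]
maintain_node() (
  node="$1"; shift

  # If the drain fails (for example, a PodDisruptionBudget blocks eviction),
  # stop here and leave the node cordoned for the operator to inspect.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --timeout=300s || exit 1

  # Run the maintenance command; on failure keep the node out of service.
  "$@" || { echo "maintenance failed; $node left cordoned" >&2; exit 1; }

  # Maintenance succeeded: allow scheduling on the node again.
  kubectl uncordon "$node"
)
```

A call such as `maintain_node worker-1 ssh worker-1 sudo reboot` would be one way to use it, where `worker-1` is a placeholder node name.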

Certificate Rotation¶

# Check certificate expiration
kubeadm certs check-expiration

# Renew certificates
kubeadm certs renew all

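# Restart control plane components. They run as static pods, so restarting
# the kubelet does not restart them: move the manifests out of the manifest
# directory, wait for the kubelet to stop the pods, then move them back.
# /tmp/manifests is an arbitrary staging directory. On HA clusters, do this
# one control plane node at a time.
mkdir -p /tmp/manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/manifests/
sleep 20
mv /tmp/manifests/*.yaml /etc/kubernetes/manifests/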

Certificate Expiration

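Expired certificates cause cluster failures because components can no longer authenticate to each other. Certificates generated by kubeadm (other than the CAs) expire after one year by default, and kubeadm renews them during a control plane upgrade. Monitor expiration dates and renew before they pass.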
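Expiration monitoring can be scripted so that it runs from cron or a monitoring agent. The sketch below relies on `openssl x509 -checkend`, which exits non-zero when a certificate expires within the given number of seconds. The function name `certs_expiring_within` is hypothetical, and it only inspects `*.crt` files directly inside the given directory, so subdirectories such as `etcd/` need a separate call.

```shell
# Hypothetical helper: list certificates in a directory that expire within N days.
# Usage: certs_expiring_within <days> [directory]
certs_expiring_within() (
  days="$1"
  dir="${2:-/etc/kubernetes/pki}"   # kubeadm's default certificate directory
  seconds=$(( days * 86400 ))

  for crt in "$dir"/*.crt; do
    [ -f "$crt" ] || continue
    # -checkend exits non-zero if the certificate expires within $seconds
    if ! openssl x509 -checkend "$seconds" -noout -in "$crt" >/dev/null 2>&1; then
      echo "$crt"
    fi
  done
)
```

For instance, `certs_expiring_within 30` prints nothing while every certificate has more than 30 days left, which makes non-empty output a simple alert condition.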

Capacity Planning¶

Resource Planning¶

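# Check cluster capacity and resource requests
# (see "Capacity", "Allocatable", and "Allocated resources" for each node)
kubectl describe nodes

# Check actual resource usage (requires metrics-server)
kubectl top nodes
kubectl top pods --all-namespaces

# Plan scaling
# Based on growth projections and current usage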

Capacity Planning

Regular capacity planning prevents resource exhaustion. Plan for growth and peak loads.
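A simple starting point for capacity reviews is to flag nodes whose current usage is already near their limits. The filter below is a minimal sketch: `nodes_over_threshold` is a hypothetical name, and it assumes the column layout that `kubectl top nodes --no-headers` prints (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).

```shell
# Hypothetical filter: read `kubectl top nodes --no-headers` output on stdin and
# print nodes whose CPU% or MEMORY% is at or above the given threshold.
# Expected columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
nodes_over_threshold() {
  awk -v limit="$1" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent signs
      if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
        printf "%s cpu=%s%% mem=%s%%\n", $1, cpu, mem
    }'
}
```

Usage would look like `kubectl top nodes --no-headers | nodes_over_threshold 80`. This reflects a point-in-time reading only; trends over weeks, taken from a metrics system, are a better basis for growth projections.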

Operational Best Practices¶

Monitoring and Alerting¶

Monitor all critical components
Set up comprehensive alerting
Review alerts regularly
Tune alert thresholds

Documentation¶

Document cluster architecture
Maintain runbooks
Document procedures
Keep diagrams updated

Automation¶

Automate repetitive tasks
Use infrastructure as code
Implement GitOps
Automate testing

Operational Excellence

Comprehensive monitoring
Documented procedures
Automation where possible
Regular reviews
Continuous improvement
Team knowledge sharing

Troubleshooting¶

Operational Issues¶

Upgrade Failures¶

# Check upgrade status
kubectl get nodes

# Review upgrade logs
journalctl -u kubelet -f

# Check component versions
kubectl version

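# Roll back if necessary
# kubeadm upgrade apply rolls back automatically on failure; if it was
# interrupted, re-run it (the command is idempotent). Backups of etcd data
# and static pod manifests are written under /etc/kubernetes/tmp.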
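After a partial or failed upgrade, it is useful to see which nodes are still on the old kubelet version. The filter below is a small sketch with a hypothetical name, `nodes_not_at_version`; it expects "NAME VERSION" pairs on stdin, which `kubectl get nodes` can produce with custom columns.

```shell
# Hypothetical filter: read "NAME VERSION" pairs on stdin and print the nodes
# whose kubelet version differs from the expected one.
nodes_not_at_version() {
  awk -v want="$1" '$2 != want { print $1, $2 }'
}

# Example pipeline (commented out so the block has no side effects):
# kubectl get nodes --no-headers \
#   -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion \
#   | nodes_not_at_version v1.28.0
```

Empty output means every node reports the expected kubelet version.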

Maintenance Issues¶

# Check node status
kubectl get nodes

# Check pod status
kubectl get pods --all-namespaces

# Review events (oldest first)
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
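On a busy cluster the full pod list is long, so a filter that shows only problem pods can speed up triage after maintenance. This is a minimal sketch with a hypothetical name, `unhealthy_pods`; it assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE) and treats a pod as unhealthy when it is not Running or not all of its containers are ready.

```shell
# Hypothetical filter: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print pods that look unhealthy.
# Expected columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '
    $4 == "Completed" { next }       # finished Job pods are not a problem
    {
      split($3, ready, "/")          # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $4, $3
    }'
}
```

Usage would look like `kubectl get pods --all-namespaces --no-headers | unhealthy_pods`, printing one line per affected pod with its status and ready count.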

Best Practices¶

Production Recommendations

Regular cluster upgrades
Scheduled maintenance windows
Comprehensive monitoring
Documented procedures
Automation and GitOps
Regular capacity planning
Team training and knowledge sharing
Continuous improvement

Course Complete! 🎉

You've completed the Advanced Kubernetes Troubleshooting & Expert Course. Continue practicing and applying these concepts in your production environments.