Day-2 Operations¶
🎯 Learning Objectives
- Master cluster upgrades and maintenance
- Understand operational best practices
- Learn capacity planning and scaling
- Troubleshoot operational issues
- Implement operational excellence
Day-2 operations are the ongoing tasks of running a cluster after it has been deployed: upgrades, maintenance, capacity planning, and troubleshooting. Handling them well keeps the cluster reliable and performant.
Operational Excellence
Day-2 operations determine long-term cluster success. Invest in automation and best practices.
Operational Debt
Neglecting day-2 operations accumulates technical debt. Regular maintenance is essential.
Cluster Upgrades¶
Upgrade Strategy¶
Upgrade Approaches:
- In-Place: Upgrade the existing cluster
- Blue-Green: Build a new cluster, then migrate workloads
- Canary: Roll out gradually
Upgrade Planning
Plan upgrades carefully: test in a non-production environment, have a rollback plan, and schedule a maintenance window.
kubeadm Upgrade¶
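# On the first control plane node: upgrade kubeadm itself first
# (if the packages are held, run apt-mark unhold first and apt-mark hold afterwards)
apt-get update
apt-get install -y kubeadm='1.28.0-*'   # package revision suffix depends on the repository
# Review the plan, then upgrade the control plane
kubeadm upgrade plan
kubeadm upgrade apply v1.28.0
# Upgrade kubelet and kubectl (drain the node first, uncordon it afterwards)
apt-get install -y kubelet='1.28.0-*' kubectl='1.28.0-*'
systemctl daemon-reload
systemctl restart kubelet
# On other control plane nodes and on each worker node: upgrade kubeadm as above, then
kubeadm upgrade node
# Then upgrade kubelet and kubectl and restart the kubelet as above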
Upgrade Order
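Upgrade the control plane first, then the worker nodes, one node at a time. Upgrade one minor version at a time; skipping minor versions is not supported. Follow the Kubernetes upgrade documentation for your target version.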
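Because kubeadm does not support skipping minor versions, it can help to check the upgrade path before starting. The following is a minimal sketch, not part of kubeadm: `check_upgrade_path` is a hypothetical helper name, and it assumes versions written as `v1.28.0` or `1.28.0`. It checks only the minor-version gap, not patch ordering.

```bash
# Hypothetical helper (not a kubeadm command): succeeds only when the target
# is the same minor release or exactly one minor release ahead.
check_upgrade_path() {
  local current target cur_major cur_minor tgt_major tgt_minor skew
  current="${1#v}"; target="${2#v}"          # strip an optional leading "v"
  cur_major="${current%%.*}"; tgt_major="${target%%.*}"
  cur_minor="${current#*.}"; cur_minor="${cur_minor%%.*}"
  tgt_minor="${target#*.}"; tgt_minor="${tgt_minor%%.*}"

  if [ "$cur_major" != "$tgt_major" ]; then
    echo "major version differs: $current -> $target" >&2
    return 1
  fi

  skew=$(( tgt_minor - cur_minor ))
  if [ "$skew" -lt 0 ] || [ "$skew" -gt 1 ]; then
    echo "unsupported upgrade path: $current -> $target (upgrade one minor version at a time)" >&2
    return 1
  fi

  echo "ok: $current -> $target"
}
```

One possible use is as a guard in front of the planning step, for example `check_upgrade_path v1.27.3 v1.28.0 && kubeadm upgrade plan`.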
Maintenance Operations¶
Node Maintenance¶
# Drain node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Perform maintenance
# ...
# Uncordon node
kubectl uncordon <node-name>
Node Draining
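Draining cordons the node and evicts its pods gracefully, honoring PodDisruptionBudgets, so that their controllers can reschedule them on other nodes. Pods not managed by a controller are not rescheduled, and drain refuses to evict them unless --force is given. Always drain before maintenance.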
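The drain, maintain, uncordon sequence can be wrapped so that the uncordon step is not forgotten. This is a sketch under stated assumptions: `with_node_drained` is a hypothetical function name, the 300-second drain timeout is an arbitrary choice, and leaving the node cordoned after a failed maintenance command is a design decision you may want to reverse.

```bash
# Hypothetical wrapper: drain a node, run a maintenance command, and uncordon
# only if the maintenance command succeeded.
with_node_drained() {
  local node status
  node="$1"; shift

  # The 300s timeout is an assumed value; tune it to your workloads.
  if ! kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s; then
    echo "drain of $node failed; returning it to service" >&2
    kubectl uncordon "$node"
    return 1
  fi

  # Run the maintenance command passed as the remaining arguments.
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "maintenance failed on $node (exit $status); node left cordoned" >&2
    return "$status"
  fi

  kubectl uncordon "$node"
}
```

A call might look like `with_node_drained <node-name> ssh <node-name> sudo reboot`, where the command after the node name is whatever maintenance step applies. Keeping a failed node cordoned avoids scheduling workloads onto a host in an unknown state.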
Certificate Rotation¶
# Check certificate expiration
kubeadm certs check-expiration
# Renew certificates
kubeadm certs renew all
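# Restart the control plane static pods so they load the renewed certificates.
# Restarting the kubelet alone does not restart them: move each manifest out of
# /etc/kubernetes/manifests/, wait about 20 seconds, then move it back.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# Repeat for kube-controller-manager, kube-scheduler, and etcd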
Certificate Expiration
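Expired certificates cause cluster failures because components can no longer authenticate to each other. kubeadm-issued client and serving certificates are valid for one year by default, and kubeadm renews them automatically during a control plane upgrade. Monitor expiration dates and renew before expiration.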
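Expiration monitoring can be approximated with `openssl x509 -checkend`, which exits non-zero when a certificate expires within a given number of seconds. The sketch below is an illustration, not a kubeadm feature: `certs_expiring_within` is a hypothetical name, and it only inspects top-level `.crt` files in the given directory (defaulting to kubeadm's `/etc/kubernetes/pki`), so it misses the `etcd` subdirectory and certificates embedded in kubeconfig files.

```bash
# Hypothetical helper: print certificates in a directory that expire within
# the given number of days. Returns 1 if any were found, 0 otherwise.
certs_expiring_within() {
  local days dir seconds found cert
  days="$1"
  dir="${2:-/etc/kubernetes/pki}"
  seconds=$(( days * 86400 ))
  found=0

  for cert in "$dir"/*.crt; do
    [ -e "$cert" ] || continue   # skip the unexpanded glob when no files match
    # -checkend exits non-zero if the certificate expires within N seconds.
    # Unreadable files also fail here and are reported, which errs on the safe side.
    if ! openssl x509 -checkend "$seconds" -noout -in "$cert" >/dev/null 2>&1; then
      echo "$cert"
      found=1
    fi
  done

  return "$found"
}
```

Run periodically, something like `certs_expiring_within 30 || echo "certificates need renewal"` could feed an alert. `kubeadm certs check-expiration` remains the complete view.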
Capacity Planning¶
Resource Planning¶
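# Check cluster capacity and resource requests
# (see "Allocatable" and "Allocated resources" in the output)
kubectl describe nodes
# Check actual resource usage (requires metrics-server)
kubectl top nodes
kubectl top pods --all-namespaces
# Plan scaling
# Based on growth projections and current usage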
Capacity Planning
Regular capacity planning prevents resource exhaustion. Plan for growth and peak loads.
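A simple way to turn these numbers into a decision is to compare total requests with allocatable capacity. The sketch below uses hypothetical function names and assumes an 80% threshold as the point where scaling should be planned; both inputs are integers in the same unit (for example CPU millicores) that you read from `kubectl describe nodes`.

```bash
# Hypothetical helpers for a rough headroom check.

# requested_pct <requested> <allocatable>
# Prints the integer percentage of allocatable capacity already requested.
requested_pct() {
  if [ "$2" -le 0 ]; then
    echo "allocatable must be positive" >&2
    return 1
  fi
  echo $(( $1 * 100 / $2 ))
}

# needs_scaling <requested> <allocatable> [threshold_pct]
# Succeeds when requests have reached the threshold (assumed default: 80).
needs_scaling() {
  local pct
  pct=$(requested_pct "$1" "$2") || return 2
  [ "$pct" -ge "${3:-80}" ]
}
```

For instance, with 7000m requested out of 8000m allocatable, `needs_scaling 7000 8000` succeeds, signalling that capacity should be added before the next growth step. Integer division rounds down, so leave some margin in the threshold.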
Operational Best Practices¶
Monitoring and Alerting¶
- Monitor all critical components
- Set up comprehensive alerting
- Review alerts regularly
- Tune alert thresholds
Documentation¶
- Document cluster architecture
- Maintain runbooks
- Document procedures
- Keep diagrams updated
Automation¶
- Automate repetitive tasks
- Use infrastructure as code
- Implement GitOps
- Automate testing
Operational Excellence
- Comprehensive monitoring
- Documented procedures
- Automation where possible
- Regular reviews
- Continuous improvement
- Team knowledge sharing
Troubleshooting¶
Operational Issues¶
Upgrade Failures¶
# Check upgrade status
kubectl get nodes
# Review upgrade logs
journalctl -u kubelet -f
# Check component versions
kubectl version
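# Recover if necessary: kubeadm rolls back automatically on most failures.
# If the upgrade was interrupted, re-run the same "kubeadm upgrade apply" (it is idempotent).
# etcd and manifest backups are written under /etc/kubernetes/tmp/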
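After a partial upgrade, a quick way to see which nodes were missed is to compare each node's kubelet version with the target. This is a small sketch; `nodes_not_at_version` is a hypothetical name, and the target must be written exactly as the kubelet reports it (for example `v1.28.0`).

```bash
# Hypothetical helper: list nodes whose kubelet version differs from the target.
nodes_not_at_version() {
  local target
  target="$1"
  # Print "name<TAB>kubeletVersion" per node, then keep the mismatches.
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
    awk -F'\t' -v target="$target" '$2 != target { print $1 " " $2 }'
}
```

Calling `nodes_not_at_version v1.28.0` prints nothing once every node runs the target kubelet version.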
Maintenance Issues¶
# Check node status
kubectl get nodes
# Check pod status
kubectl get pods --all-namespaces
# Review events
kubectl get events --all-namespaces
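On a busy cluster the commands above return a lot of healthy output. The following sketch narrows them to likely problems; `cluster_triage` is a hypothetical function name, and the filters (nodes whose status is not exactly `Ready`, pods not in the Running or Succeeded phase, events of type Warning) are one reasonable choice among several.

```bash
# Hypothetical helper: print only the nodes, pods, and events that look unhealthy.
cluster_triage() {
  echo "== Nodes not Ready =="
  # Also lists cordoned nodes, whose status reads "Ready,SchedulingDisabled".
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'

  echo "== Pods not Running or Succeeded =="
  kubectl get pods --all-namespaces \
    --field-selector 'status.phase!=Running,status.phase!=Succeeded'

  echo "== Warning events =="
  kubectl get events --all-namespaces \
    --field-selector type=Warning --sort-by=.lastTimestamp
}
```

Listing cordoned nodes is deliberate: a node left in `SchedulingDisabled` after maintenance is a common cause of unexpected capacity loss.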
Best Practices¶
Production Recommendations
- Regular cluster upgrades
- Scheduled maintenance windows
- Comprehensive monitoring
- Documented procedures
- Automation and GitOps
- Regular capacity planning
- Team training and knowledge sharing
- Continuous improvement
Course Complete! 🎉
You've completed the Advanced Kubernetes Troubleshooting & Expert Course. Continue practicing and applying these concepts in your production environments.