etcd Operations & Troubleshooting¶
🎯 Learning Objectives
- Master etcd backup and restore procedures
- Understand etcd performance tuning
- Learn etcd maintenance operations
- Troubleshoot etcd issues effectively
- Implement etcd disaster recovery
etcd is the brain of your Kubernetes cluster. Understanding etcd operations is critical for cluster reliability and disaster recovery.
Critical Component
etcd failure can cause complete cluster unavailability. Always have backup and recovery procedures in place.
Data Loss Risk
Incorrect etcd operations can cause permanent data loss. Always backup before making changes.
etcd Architecture¶
Data Storage¶
etcd stores all Kubernetes cluster state:
- Pod definitions
- Service endpoints
- ConfigMaps and Secrets
- RBAC policies
- Node information
- Namespace metadata
Single Source of Truth
etcd is the only place where cluster state is persisted. All other components derive state from etcd.
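To make the storage layout concrete, the sketch below builds the etcd key for a namespaced object. It assumes the API server uses its default key prefix, `/registry`; the `ETCD_PREFIX` variable and the `registry_key` helper are hypothetical names used only for illustration, and the layout shown applies to resources such as Pods, ConfigMaps, and Secrets.

```shell
# Assumption: the API server uses its default etcd prefix, /registry.
ETCD_PREFIX="${ETCD_PREFIX:-/registry}"

# Build the etcd key for a namespaced object such as a Pod, ConfigMap, or Secret.
# Usage: registry_key <resource-plural> <namespace> <name>
registry_key() {
  printf '%s/%s/%s/%s\n' "$ETCD_PREFIX" "$1" "$2" "$3"
}

registry_key pods default nginx
registry_key secrets kube-system my-secret

# Read-only inspection on a control plane node (TLS flags omitted for brevity):
#   ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only | head -20
```

Values are stored in a binary encoding, so listing keys only is usually the most readable way to look around.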
etcd Cluster¶
# Typical etcd cluster (3 nodes for HA)
etcd-0: https://etcd-0.example.com:2379
etcd-1: https://etcd-1.example.com:2379
etcd-2: https://etcd-2.example.com:2379
High Availability
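Run etcd with an odd number of members (3, 5, or 7). Quorum is a majority of members, so 3 nodes can tolerate 1 failure and 5 nodes can tolerate 2. An even-sized cluster adds no extra fault tolerance.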
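The quorum arithmetic behind that recommendation can be sketched in a few lines of shell. The helper names `quorum_size` and `fault_tolerance` are illustrative, not part of any etcd tooling.

```shell
# Quorum is a strict majority of members: floor(n/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of member failures the cluster survives while keeping quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5 6 7; do
  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n")"
done
```

The table this prints shows why odd sizes are preferred: going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure.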
Backup Operations¶
Manual Backup¶
# Backup etcd
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
# Verify backup
ETCDCTL_API=3 etcdctl \
--write-out=table \
snapshot status /backup/etcd-snapshot-20241201-120000.db
Backup Best Practices
- Schedule regular automated backups
- Store backups in multiple locations
- Test restore procedures regularly
- Keep backups for at least 30 days
- Encrypt backups containing sensitive data
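When snapshots are copied to several locations, it helps to be able to confirm that a copy is intact before relying on it. One possible approach, assuming GNU coreutils `sha256sum` is available, is to store a checksum file next to each snapshot. The function names below are hypothetical.

```shell
# Record a SHA-256 checksum next to a snapshot so copies can be verified later.
write_checksum() {
  snapshot="$1"
  ( cd "$(dirname "$snapshot")" &&
    sha256sum "$(basename "$snapshot")" > "$(basename "$snapshot").sha256" )
}

# Verify a snapshot against its recorded checksum; a non-zero exit means mismatch.
verify_checksum() {
  snapshot="$1"
  ( cd "$(dirname "$snapshot")" &&
    sha256sum --check --quiet "$(basename "$snapshot").sha256" )
}

# Example (commented out because the file may not exist on this machine):
#   write_checksum  /backup/etcd-snapshot-20241201-120000.db
#   verify_checksum /backup/etcd-snapshot-20241201-120000.db
```

A checksum only detects corruption; it does not replace a periodic test restore.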
Automated Backup Script¶
#!/bin/bash
# etcd-backup.sh
# Exit on the first error so a failed snapshot never reaches the cleanup step
set -euo pipefail
BACKUP_DIR="/backup/etcd"
ETCD_ENDPOINTS="https://127.0.0.1:2379"
CA_CERT="/etc/kubernetes/pki/etcd/ca.crt"
CERT="/etc/kubernetes/pki/etcd/server.crt"
KEY="/etc/kubernetes/pki/etcd/server.key"
RETENTION_DAYS=30
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Create snapshot
SNAPSHOT_FILE="$BACKUP_DIR/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
ETCDCTL_API=3 etcdctl \
--endpoints="$ETCD_ENDPOINTS" \
--cacert="$CA_CERT" \
--cert="$CERT" \
--key="$KEY" \
snapshot save "$SNAPSHOT_FILE"
# Verify snapshot
ETCDCTL_API=3 etcdctl \
--write-out=table \
snapshot status "$SNAPSHOT_FILE"
# Cleanup old backups (only reached if the snapshot and verification succeeded)
find "$BACKUP_DIR" -name "etcd-snapshot-*.db" -mtime +"$RETENTION_DAYS" -delete
echo "Backup completed: $SNAPSHOT_FILE"
Automation
Use cron or Kubernetes CronJob to automate backups. Test the script regularly.
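An automated backup is only useful if someone notices when it stops running. The sketch below checks that a recent snapshot exists. The function name `backup_is_fresh`, the 25-hour threshold, and the script path in the cron comment are assumptions for illustration.

```shell
# Succeeds if the directory holds a snapshot modified within the last N minutes.
# Usage: backup_is_fresh <backup-dir> <max-age-minutes>
backup_is_fresh() {
  recent=$(find "$1" -name 'etcd-snapshot-*.db' -mmin "-$2" 2>/dev/null | head -n 1)
  [ -n "$recent" ]
}

# Example: warn if no snapshot was written in the last 25 hours (1500 minutes).
if backup_is_fresh /backup/etcd 1500; then
  echo "etcd backup is fresh"
else
  echo "WARNING: no recent etcd snapshot in /backup/etcd" >&2
fi

# A daily cron entry for the backup script could look like this
# (hypothetical install path):
#   0 2 * * * /usr/local/bin/etcd-backup.sh
```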
Restore Operations¶
Restore from Snapshot¶
Destructive Operation
Restore operations will overwrite existing etcd data. Only restore when necessary.
# 1. Stop all etcd instances
systemctl stop etcd
# 2. Backup current data (if possible)
mv /var/lib/etcd /var/lib/etcd.backup
# 3. Restore from snapshot
ETCDCTL_API=3 etcdctl \
snapshot restore /backup/etcd-snapshot-20241201-120000.db \
--data-dir=/var/lib/etcd-new \
--name=etcd-0 \
--initial-cluster=etcd-0=https://etcd-0.example.com:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://etcd-0.example.com:2380
# 4. Update etcd configuration
# Update etcd.service to use new data directory
# 5. Start etcd
systemctl start etcd
# 6. Verify cluster health
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
Cluster Coordination
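In multi-node etcd clusters, stop every member and restore each one from the same snapshot file. Each member uses its own --name and --initial-advertise-peer-urls, while --initial-cluster (listing all members) and --initial-cluster-token must be identical everywhere. Restore procedures vary by deployment method; on kubeadm clusters, for example, etcd runs as a static Pod rather than a systemd unit.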
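For a three-member cluster, the per-member restore commands differ only in the member name and peer URL. The sketch below prints (it does not run) those commands, assuming the member names and domain from the cluster example earlier and peer traffic on port 2380. `MEMBERS`, `DOMAIN`, and the helper functions are illustrative names.

```shell
# Assumption: three members named as in the cluster example, peers on port 2380.
MEMBERS="etcd-0 etcd-1 etcd-2"
DOMAIN="example.com"
SNAPSHOT="/backup/etcd-snapshot-20241201-120000.db"

# Build the --initial-cluster value: name=peer-url pairs joined by commas.
initial_cluster() {
  out=""
  for m in $MEMBERS; do
    out="${out:+$out,}$m=https://$m.$DOMAIN:2380"
  done
  echo "$out"
}

# Print the restore command for one member; run it on that member's host.
restore_command() {
  echo "ETCDCTL_API=3 etcdctl snapshot restore $SNAPSHOT" \
       "--data-dir=/var/lib/etcd-new" \
       "--name=$1" \
       "--initial-cluster=$(initial_cluster)" \
       "--initial-cluster-token=etcd-cluster-1" \
       "--initial-advertise-peer-urls=https://$1.$DOMAIN:2380"
}

for m in $MEMBERS; do
  restore_command "$m"
done
```

Printing the commands first makes it easy to review that every member receives the same `--initial-cluster` string before anything is overwritten.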
Performance Tuning¶
Key Performance Metrics¶
# Check etcd performance
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status -w table
Important Metrics:
- dbSize: Database size (watch for quota limits)
- leader: Current leader
- raftIndex: Raft log index
- raftTerm: Current term
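For scripting, these fields are easier to pick out of the JSON output (`endpoint status -w json`) than out of the table. The sketch below extracts `dbSize` and expresses it as a share of the backend quota. The sample JSON is hypothetical and trimmed to the fields used here, and the helper names are illustrative.

```shell
# Extract a numeric field (for example dbSize) from `endpoint status -w json` on stdin.
status_field() {
  grep -o "\"$1\":[0-9]*" | head -n 1 | cut -d: -f2
}

# Integer percentage of the backend quota currently in use.
quota_used_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Hypothetical sample output, trimmed to the fields used in this sketch.
sample='[{"Endpoint":"https://127.0.0.1:2379","Status":{"header":{"revision":184223,"raft_term":4},"dbSize":1610612736,"raftIndex":201355,"raftTerm":4}}]'

db_size=$(printf '%s' "$sample" | status_field dbSize)
quota=2147483648   # 2 GiB, etcd's default when --quota-backend-bytes is not set
echo "dbSize=$db_size bytes, $(quota_used_percent "$db_size" "$quota")% of quota"
```

On a live cluster, the `sample` string would be replaced by the output of `etcdctl endpoint status -w json` with the usual TLS flags.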
Performance Optimization¶
# etcd performance flags
--quota-backend-bytes=8589934592 # 8 GiB (default is 2 GiB; 8 GiB is the suggested maximum)
--max-request-bytes=1572864 # 1.5MB max request size
--heartbeat-interval=100 # 100ms heartbeat
--election-timeout=1000 # 1s election timeout
--max-txn-ops=128 # Max operations per transaction
--snapshot-count=100000 # Snapshots after N writes
Performance Tuning
- Use SSD storage for etcd data directory
- Keep etcd on dedicated disks (separate from OS)
- Monitor disk write latency (99th percentile WAL fsync should stay below 10ms)
- Increase quota-backend-bytes before reaching limit
- Tune heartbeat/election timeout based on network latency
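As a rough rule of thumb for the last point, the heartbeat interval is usually set near the round-trip time between members and the election timeout about ten times larger. The sketch below turns a measured round-trip time into candidate flag values. Keeping the heartbeat at or above the 100 ms default is an assumption of this sketch, and `suggest_timeouts` is a hypothetical helper.

```shell
# Suggest --heartbeat-interval and --election-timeout (both in ms) from the
# largest round-trip time measured between members.
suggest_timeouts() {
  rtt_ms="$1"
  heartbeat="$rtt_ms"
  if [ "$heartbeat" -lt 100 ]; then
    heartbeat=100        # assumption: stay at or above the 100 ms default
  fi
  election=$(( heartbeat * 10 ))
  if [ "$election" -gt 50000 ]; then
    election=50000       # etcd's documented upper limit for the election timeout
  fi
  echo "--heartbeat-interval=$heartbeat --election-timeout=$election"
}

suggest_timeouts 5     # members in the same datacenter
suggest_timeouts 250   # members spread across regions
```

Treat the output as a starting point and apply the same values to every member.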
Database Compaction¶
# Check current revision
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out=json | grep -o '"revision":[0-9]*'
# Compact to specific revision
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
compact <revision>
# Defragment after compaction
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
defrag
Compaction Impact
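Compaction and defragmentation can impact cluster performance. Defragmentation blocks the member it runs on and only affects the endpoints passed in --endpoints, so defragment one member at a time during a maintenance window.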
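The full sequence (read the current revision, compact once, then defragment each member in turn) can be sketched as a small plan generator. It prints the commands instead of running them. The sample JSON is hypothetical, the endpoint names reuse the cluster example, and TLS flags are left out for brevity.

```shell
# Pull the current revision out of `endpoint status -w json` output on stdin.
current_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Print the maintenance commands: compact once, then defragment one endpoint at a time.
# Usage: maintenance_plan <revision> <endpoint>...
maintenance_plan() {
  rev="$1"; shift
  echo "etcdctl compact $rev"
  for ep in "$@"; do
    echo "etcdctl --endpoints=$ep defrag"
  done
}

# Hypothetical sample, trimmed to the fields used here.
sample='[{"Endpoint":"https://127.0.0.1:2379","Status":{"header":{"revision":184223,"raft_term":4},"dbSize":1610612736}}]'
rev=$(printf '%s' "$sample" | current_revision)

maintenance_plan "$rev" \
  https://etcd-0.example.com:2379 \
  https://etcd-1.example.com:2379 \
  https://etcd-2.example.com:2379
```

Compaction is applied cluster-wide, which is why it appears once, while defragmentation is repeated per member.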
Maintenance Operations¶
Health Checks¶
# Check endpoint health
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
# Check cluster status
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status -w table
# Check member list
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member list -w table
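The health checks above repeat the same connection and TLS flags. etcdctl also reads its flags from `ETCDCTL_`-prefixed environment variables, so one option is to export them once per shell session. The paths below are the ones used throughout this chapter and may differ on other installations.

```shell
# Export the connection settings once; etcdctl picks them up automatically.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/etc/kubernetes/pki/etcd/ca.crt"
export ETCDCTL_CERT="/etc/kubernetes/pki/etcd/server.crt"
export ETCDCTL_KEY="/etc/kubernetes/pki/etcd/server.key"

# The health checks then shrink to the following
# (commented out so the block is safe to source anywhere):
#   etcdctl endpoint health
#   etcdctl endpoint status -w table
#   etcdctl member list -w table
```

With these variables set, avoid passing the same settings again as command-line flags in that session.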
Member Management¶
# Add new member
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member add etcd-3 --peer-urls=https://etcd-3.example.com:2380
# Remove member
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member remove <member-id>
Member Operations
Member add/remove operations require coordination. Follow etcd documentation carefully.
Troubleshooting¶
Common Issues¶
Issue: etcd Quota Exceeded¶
Symptoms:
- etcdserver: mvcc: database space exceeded
- API server errors
- Cluster instability
Diagnosis:
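# Check database size (DB SIZE column)
ETCDCTL_API=3 etcdctl endpoint status -w table
# Check active alarms (NOSPACE means the quota was hit)
ETCDCTL_API=3 etcdctl alarm list
# Check the configured quota (no output means the 2 GiB default applies)
ps -eo args | grep -Eo 'quota-backend-bytes=[0-9]+'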
Solutions:
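# 1. Compact database
ETCDCTL_API=3 etcdctl compact <revision>
# 2. Defragment
ETCDCTL_API=3 etcdctl defrag
# 3. Clear the NOSPACE alarm so writes are accepted again
ETCDCTL_API=3 etcdctl alarm disarm
# 4. Increase quota if needed (requires restart)
# Edit etcd.service and add:
--quota-backend-bytes=8589934592 # 8 GiB, the suggested maximum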
Quota Exceeded
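When the quota is exceeded, etcd raises a cluster-wide NOSPACE alarm and accepts only reads and deletes until space is freed and the alarm is disarmed. Immediate action required.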
Issue: etcd Leader Election Failures¶
Symptoms:
- Frequent leader changes
- High election timeout
- Cluster instability
Diagnosis:
# Check leader (IS LEADER column, one row per member)
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table
# Check network latency
ping <etcd-endpoints>
# Check etcd logs
journalctl -u etcd -f
Solutions:
- Increase election timeout if network latency is high
- Check network connectivity between etcd nodes
- Verify etcd node resources (CPU, memory)
- Review etcd logs for errors
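To quantify how often the leader changes, etcd's Prometheus metrics include a counter named `etcd_server_leader_changes_seen_total`. The sketch below pulls a single metric out of the text exposition format. The sample text is hypothetical, and where the metrics endpoint is served (client URL or a separate metrics listener) depends on the deployment.

```shell
# Read Prometheus text format on stdin and print the value of one metric.
# Usage: metric_value <metric-name>
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Hypothetical sample of the /metrics output, trimmed to two series.
sample='# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 7
etcd_server_has_leader 1'

printf '%s\n' "$sample" | metric_value etcd_server_leader_changes_seen_total
printf '%s\n' "$sample" | metric_value etcd_server_has_leader
```

Sampling the counter twice, some minutes apart, shows whether elections are still happening or only happened in the past.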
Issue: Slow etcd Operations¶
Symptoms:
- High API server latency
- Slow kubectl responses
- Timeout errors
Diagnosis:
# Check etcd metrics
ETCDCTL_API=3 etcdctl endpoint status
# Check disk I/O
iostat -x 1
# Check etcd process
top -p "$(pgrep -d, -x etcd)"
Solutions:
- Use SSD storage for etcd
- Separate etcd disk from OS disk
- Increase etcd resources
- Optimize network between API server and etcd
- Review and optimize etcd flags
Diagnostic Checklist¶
# 1. Check etcd health
ETCDCTL_API=3 etcdctl endpoint health
# 2. Check cluster status
ETCDCTL_API=3 etcdctl endpoint status
# 3. Check member list
ETCDCTL_API=3 etcdctl member list
# 4. Check database size
ETCDCTL_API=3 etcdctl endpoint status -w json | grep -o '"dbSize":[0-9]*'
# 5. Check etcd logs
journalctl -u etcd --since "1 hour ago"
# 6. Check disk space
df -h /var/lib/etcd
# 7. Check disk I/O
iostat -x 1
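The checklist can also be wrapped in a small runner that keeps going after a failed step and reports a summary. This is a minimal sketch: `run_check` is a hypothetical helper, only a few of the checks are wired in, and it assumes the etcdctl connection settings come from `ETCDCTL_*` environment variables.

```shell
failures=0

# Run one named check; report the result and keep going on failure.
# Usage: run_check <name> <command> [args...]
run_check() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK    $name"
  else
    echo "FAIL  $name"
    failures=$(( failures + 1 ))
  fi
}

# Assumption: TLS settings are provided through ETCDCTL_* environment variables.
run_check "endpoint health" etcdctl endpoint health
run_check "member list"     etcdctl member list
run_check "data dir exists" test -d /var/lib/etcd

echo "$failures check(s) failed"
```

Output is suppressed for passing and failing checks alike, so rerun a failing command by hand to see its error message.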
Best Practices¶
Production Recommendations
- Regular Backups: Automated daily backups with retention policy
- Test Restores: Regularly test restore procedures
- Monitor Metrics: Track dbSize, latency, and health
- Performance Tuning: Use SSD, separate disks, optimize flags
- Maintenance Windows: Schedule compaction during low traffic
- High Availability: Run 3+ etcd nodes in separate failure domains
- Security: Use TLS for all etcd communication
- Documentation: Document backup/restore procedures
Key Takeaways¶
- ✅ etcd is critical - always have backup and recovery procedures
- ✅ Regular backups are essential for disaster recovery
- ✅ Monitor database size and quota limits
- ✅ Performance depends on storage (SSD recommended)
- ✅ Compaction and defragmentation require maintenance windows
- ✅ High availability requires 3+ nodes
- ✅ Troubleshoot systematically: health → status → logs → metrics
Next Chapter: Scheduler & Controller Manager