Cloud Architecture
🎯 Learning Objectives
- Design scalable and resilient cloud architectures
- Understand high availability and fault tolerance patterns
- Master disaster recovery strategies
- Learn cost optimization techniques
- Know security and compliance in cloud architecture
Cloud architecture design is a core DevOps skill. This chapter covers designing scalable, resilient, secure, and cost-effective cloud architectures.
Interview Focus
Be ready to design complete architectures, explain trade-offs, discuss scalability, and optimize for cost and performance.
Architecture Design Principles
The Five Pillars of the Well-Architected Framework
1. Operational Excellence:
   - Automate changes and responses
   - Test all procedures
   - Learn from failures
   - Keep procedures current
2. Security:
   - Implement strong identity foundation
   - Enable traceability
   - Apply security at all layers
   - Automate security best practices
3. Reliability:
   - Test recovery procedures
   - Automatically recover from failures
   - Scale horizontally
   - Stop guessing capacity
4. Performance Efficiency:
   - Democratize advanced technologies
   - Go global in minutes
   - Use serverless architectures
   - Experiment more often
5. Cost Optimization:
   - Adopt consumption model
   - Measure overall efficiency
   - Stop spending money on undifferentiated heavy lifting
   - Analyze and attribute expenditure
Scalability Patterns
Horizontal vs Vertical Scaling
Horizontal Scaling (Scale Out):
- Add more instances/nodes
- Well suited to the cloud (capacity is bounded by quotas and budget, not by one machine)
- Improves availability
- Requires load balancing
- Example: Add more web servers
Vertical Scaling (Scale Up):
- Increase instance size
- Limited by the largest available instance type
- Remains a single point of failure
- No code changes needed
- Example: Upgrade from t3.medium to t3.large
When to Use:
- Horizontal: Stateless applications, high availability needed
- Vertical: Stateful applications, single instance, quick fix
Auto Scaling Strategies
Reactive Scaling:
- Scale based on current metrics
- CPU, memory, request count
- Responds to actual load
Predictive Scaling:
- Use machine learning to predict demand
- Scale before load increases
- Better for predictable patterns
Scheduled Scaling:
- Scale based on time schedules
- Known traffic patterns
- Business hours, events
Example Auto Scaling Configuration:
# Kubernetes HPA: scales the "web" Deployment between 2 and 20 replicas
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # Built-in resource metric: target 70% average CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Custom per-pod metric: must be exposed through the custom metrics API
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
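To make the reactive behaviour of this HPA concrete, here is a minimal Python sketch of the replica calculation described in the Kubernetes documentation: each metric proposes ceil(currentReplicas × currentValue / targetValue), and the largest proposal wins. The function names and the sample readings are illustrative assumptions. The real controller also accounts for pod readiness, missing metrics, and stabilization windows, which are left out here.

```python
import math

MIN_REPLICAS = 2   # minReplicas from the manifest
MAX_REPLICAS = 20  # maxReplicas from the manifest
TOLERANCE = 0.1    # Kubernetes' default tolerance, assumed unchanged


def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    """Replica count a single metric asks for."""
    ratio = current / target
    if abs(ratio - 1.0) <= TOLERANCE:
        return current_replicas  # close enough to the target: no change
    return math.ceil(current_replicas * ratio)


def desired_replicas(current_replicas: int,
                     readings: dict[str, tuple[float, float]]) -> int:
    """Take the largest per-metric proposal, then clamp to the min/max bounds.

    `readings` maps a metric name to (current value, target value).
    """
    proposals = [
        desired_for_metric(current_replicas, current, target)
        for current, target in readings.values()
    ]
    return max(MIN_REPLICAS, min(MAX_REPLICAS, max(proposals)))


# Hypothetical readings: 90% CPU against a 70% target,
# 150 requests/s per pod against a target of 100.
print(desired_replicas(4, {"cpu": (90, 70), "requests_per_second": (150, 100)}))
```

Because the largest proposal wins, a single hot metric is enough to trigger a scale-out, while scale-in only happens when every metric agrees.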
High Availability Architecture
Multi-AZ Deployment
Architecture:
             Internet
                 ↓
             Route 53
                 ↓
     Application Load Balancer
     (Multi-AZ, Health Checks)
                 ↓
    ┌─────────────────────────┐
    │                         │
┌───┴───┐                 ┌───┴───┐
│ AZ-1a │                 │ AZ-1b │
│       │                 │       │
│  Web  │                 │  Web  │
│  App  │                 │  App  │
│       │                 │       │
└───┬───┘                 └───┬───┘
    │                         │
    └────────────┬────────────┘
                 ↓
      ┌──────────────────────┐
      │     RDS Multi-AZ     │
      │   Primary (AZ-1a)    │
      │   Standby (AZ-1b)    │
      │  Automatic Failover  │
      └──────────────────────┘
Components:
- Load Balancer: Distributes traffic across AZs
- Health Checks: Remove unhealthy instances
- Database Replication: Automatic failover
- Data Synchronization: Real-time replication
Active-Active vs Active-Passive
Active-Active:
- All instances handle traffic
- Better resource utilization
- More complex to implement
- Use for: Stateless applications
Active-Passive:
- Primary handles traffic, standby ready
- Simpler to implement
- Lower resource utilization
- Use for: Stateful applications, databases
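The difference between the two modes comes down to which nodes are eligible for traffic. The following Python sketch illustrates that routing decision only. The `Node` type and both function names are hypothetical, and the sketch ignores data replication and failback.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def active_active_targets(nodes: list[Node]) -> list[Node]:
    """Active-active: every healthy node receives a share of the traffic."""
    return [node for node in nodes if node.healthy]


def active_passive_target(primary: Node, standby: Node) -> Node:
    """Active-passive: the standby serves traffic only if the primary is down."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy node available")


az_a, az_b = Node("az-1a"), Node("az-1b")
print([n.name for n in active_active_targets([az_a, az_b])])
print(active_passive_target(az_a, az_b).name)
```

In active-active, losing a node reduces capacity but needs no switchover. In active-passive, the standby's capacity sits idle until a failover promotes it.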
Disaster Recovery Strategies
RTO and RPO
RTO (Recovery Time Objective):
- Maximum acceptable downtime
- How quickly the system must recover
- Example: 4 hours RTO
RPO (Recovery Point Objective):
- Maximum acceptable data loss
- How much data can be lost
- Example: 1 hour RPO (can lose 1 hour of data)
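A quick way to reason about these two objectives is to compare them with the worst case a backup plan can produce. The Python sketch below assumes a simple periodic-backup setup: the worst-case data loss is one full backup interval, and the worst-case downtime is detection time plus restore time. The function names and the sample figures are illustrative assumptions.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of data."""
    return backup_interval


def meets_objectives(backup_interval: timedelta,
                     detection_time: timedelta,
                     restore_time: timedelta,
                     rto: timedelta,
                     rpo: timedelta) -> dict[str, bool]:
    """Check a periodic-backup plan against the RTO and RPO targets."""
    return {
        "rpo": worst_case_data_loss(backup_interval) <= rpo,
        "rto": detection_time + restore_time <= rto,
    }


# Targets from the examples: 4 hours RTO, 1 hour RPO.
# Hypothetical plan: backups every 6 hours, 15 minutes to detect, 3 hours to restore.
print(meets_objectives(
    backup_interval=timedelta(hours=6),
    detection_time=timedelta(minutes=15),
    restore_time=timedelta(hours=3),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
))
```

With these sample numbers the plan recovers in time but loses too much data, so the backup interval, not the restore speed, is what has to change.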
Disaster Recovery Strategies
1. Backup and Restore (Highest RTO/RPO):
   - Regular backups to S3
   - Restore when disaster occurs
   - Cost: Low
   - RTO: Hours to days
   - RPO: Hours to days
2. Pilot Light:
   - Minimal version running in DR region
   - Core services ready
   - Scale up when needed
   - Cost: Low to medium
   - RTO: Minutes to hours
   - RPO: Minutes
3. Warm Standby:
   - Scaled-down version always running
   - Can handle minimal load
   - Scale up quickly
   - Cost: Medium
   - RTO: Minutes
   - RPO: Minutes
4. Multi-Site Active-Active (Lowest RTO/RPO):
   - Full production in multiple sites
   - Load balanced across sites
   - Cost: High
   - RTO: Near zero
   - RPO: Near zero
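In an interview, the usual follow-up is to pick the cheapest strategy that still meets the stated RTO and RPO. The Python sketch below encodes the four strategies as data and does exactly that. The concrete time bounds are assumptions chosen to fit the ranges listed here ("hours to days", "minutes", "near zero"), not published figures, and the `Strategy` type and function name are hypothetical.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    cost_rank: int            # 1 = cheapest
    assumed_rto: timedelta    # assumed worst-case recovery time
    assumed_rpo: timedelta    # assumed worst-case data loss


# Assumed upper bounds for illustration only.
STRATEGIES = [
    Strategy("Backup and Restore", 1, timedelta(days=1), timedelta(days=1)),
    Strategy("Pilot Light", 2, timedelta(hours=4), timedelta(minutes=30)),
    Strategy("Warm Standby", 3, timedelta(minutes=30), timedelta(minutes=10)),
    Strategy("Multi-Site Active-Active", 4, timedelta(minutes=1), timedelta(minutes=1)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[str]:
    """Return the lowest-cost strategy whose assumed bounds fit both targets."""
    for strategy in sorted(STRATEGIES, key=lambda s: s.cost_rank):
        if strategy.assumed_rto <= rto and strategy.assumed_rpo <= rpo:
            return strategy.name
    return None  # targets are tighter than any listed strategy


print(cheapest_strategy(rto=timedelta(hours=4), rpo=timedelta(hours=1)))
```

Walking the list from cheapest to most expensive mirrors the trade-off discussion: each step down in RTO/RPO is paid for with more always-on infrastructure.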
Cost Optimization
Right Sizing
Process:
1. Monitor resource utilization
2. Identify over-provisioned resources
3. Identify under-provisioned resources
4. Adjust instance types
5. Use CloudWatch metrics
Tools:
- AWS Cost Explorer
- AWS Trusted Advisor
- CloudWatch metrics
- Third-party tools (CloudHealth, Spot.io)
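Steps 2 and 3 of the process can be sketched as a simple classification over utilization samples. The Python example below is a minimal illustration: the thresholds (20%, 50%, 80%) are assumptions, not AWS recommendations, and the samples are hard-coded where a real setup would read a metric such as CloudWatch's CPUUtilization.

```python
from statistics import mean

# Assumed thresholds, in percent CPU. Tune them for the workload.
LOW_AVERAGE = 20.0
LOW_PEAK = 50.0
HIGH_AVERAGE = 80.0


def classify(cpu_samples: list[float]) -> str:
    """Label an instance from its CPU utilization samples."""
    average, peak = mean(cpu_samples), max(cpu_samples)
    if average < LOW_AVERAGE and peak < LOW_PEAK:
        return "over-provisioned"   # candidate for a smaller instance type
    if average > HIGH_AVERAGE:
        return "under-provisioned"  # candidate for a larger type or more instances
    return "right-sized"


# Hypothetical fleet with a few utilization samples per instance.
fleet = {
    "web-1": [5, 8, 12, 10],
    "batch-1": [85, 90, 95],
    "api-1": [40, 55, 60],
}
for name, samples in fleet.items():
    print(name, classify(samples))
```

Checking the peak as well as the average avoids downsizing an instance that is idle most of the day but busy during short bursts.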
Reserved Instances
Types:
- Standard RIs: Up to 72% savings, cannot be exchanged for a different configuration
- Convertible RIs: Up to 54% savings, can be exchanged for a different instance family, OS, or tenancy
- Scheduled RIs: For predictable recurring time windows (AWS no longer offers these for new purchases)
Payment Options:
- All Upfront: Highest discount
- Partial Upfront: Medium discount
- No Upfront: Lowest discount, monthly payments
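Whether a reservation pays off depends on how many hours the instance actually runs, because a Reserved Instance is billed for the whole term whether it is used or not. The Python sketch below computes the break-even point for a one-year term. All prices are hypothetical placeholders, not real AWS rates, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def ri_annual_cost(upfront: float, ri_hourly: float) -> float:
    """Total cost of a 1-year RI: it is billed for every hour of the term."""
    return upfront + ri_hourly * HOURS_PER_YEAR


def break_even_utilization(on_demand_hourly: float, upfront: float,
                           ri_hourly: float) -> float:
    """Fraction of the year the instance must run for the RI to cost less."""
    return ri_annual_cost(upfront, ri_hourly) / (on_demand_hourly * HOURS_PER_YEAR)


def annual_savings(on_demand_hourly: float, upfront: float, ri_hourly: float,
                   utilization: float) -> float:
    """On-Demand cost at the given utilization minus the RI cost (negative = loss)."""
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * utilization
    return on_demand_cost - ri_annual_cost(upfront, ri_hourly)


# Hypothetical prices: $0.10/hour On-Demand vs. $500 All Upfront for one year.
print(f"break-even: {break_even_utilization(0.10, 500.0, 0.0):.0%}")
print(f"savings at 100% utilization: ${annual_savings(0.10, 500.0, 0.0, 1.0):.2f}")
print(f"savings at 50% utilization: ${annual_savings(0.10, 500.0, 0.0, 0.5):.2f}")
```

A workload that runs below the break-even utilization is usually better served by On-Demand or Spot capacity.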
Spot Instances
Use Cases:
- Fault-tolerant workloads
- Batch processing
- CI/CD pipelines
- Development/testing
Best Practices:
- Use Spot Fleet for availability
- Implement checkpointing
- Handle interruptions gracefully
- Diversify across instance types
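Checkpointing and graceful interruption handling can be sketched in a few lines. The Python example below saves progress after every item so that a replacement instance can resume where the interrupted one stopped. The `interrupted` callable is a stand-in assumption: on AWS it would typically poll the instance metadata service for the two-minute Spot interruption notice, which is not done here. The job (summing a list) and the local checkpoint file are also illustrative; a real job would write its checkpoint to durable storage such as S3.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_batch(items: list[int], checkpoint: Path,
              interrupted: Callable[[], bool]) -> Optional[int]:
    """Sum `items`, persisting progress after each one.

    Returns the total when finished, or None if interrupted part-way.
    """
    state = {"next_index": 0, "total": 0}
    if checkpoint.exists():
        # Resume from the progress saved by a previous (interrupted) run.
        state = json.loads(checkpoint.read_text())

    for index in range(state["next_index"], len(items)):
        if interrupted():
            return None  # progress is already on disk, so exit cleanly
        state["total"] += items[index]
        state["next_index"] = index + 1
        checkpoint.write_text(json.dumps(state))

    return state["total"]
```

Calling `run_batch` again with the same checkpoint path after an interruption continues from the saved index rather than starting over, which is what makes the workload tolerant of Spot reclaims.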
Security Architecture
Defense in Depth
Layers:
1. Network: VPC, subnets, security groups, NACLs
2. Identity: IAM, MFA, least privilege
3. Data: Encryption at rest and in transit
4. Application: WAF, input validation
5. Monitoring: CloudTrail, CloudWatch, GuardDuty
Network Security
VPC Design:
Internet
  ↓
Internet Gateway
  ↓
Public Subnet (DMZ)
├── NAT Gateway
└── Bastion Host
  ↓
Private Subnet (Application)
└── Application Servers
  ↓
Private Subnet (Data)
└── Database (No Internet)
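Security Groups:
- Stateful firewall (return traffic is allowed automatically)
- Instance-level security
- Allow rules only
- Inbound traffic denied by default; outbound allowed by default
NACLs:
- Stateless firewall (return traffic needs its own rule)
- Subnet-level security
- Explicit allow/deny
- Rules evaluated in ascending rule-number order; first match wins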
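The two rule models evaluate traffic differently, which a small simulation makes visible. The Python sketch below is a simplified model, not the AWS implementation: it matches only on source CIDR and a single port, ignores protocols, port ranges, and return traffic, and all type and function names are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower numbers are evaluated first
    cidr: str
    port: int
    allow: bool


def nacl_allows(rules: list[NaclRule], source_ip: str, port: int) -> bool:
    """NACL model: the first matching rule in rule-number order decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if ip_address(source_ip) in ip_network(rule.cidr) and rule.port == port:
            return rule.allow
    return False  # no rule matched: implicit deny


def security_group_allows(allow_rules: list[tuple[str, int]],
                          source_ip: str, port: int) -> bool:
    """Security group model: allow rules only, so any match permits the traffic."""
    return any(
        ip_address(source_ip) in ip_network(cidr) and rule_port == port
        for cidr, rule_port in allow_rules
    )


# Hypothetical rules: block one range on 443, then allow everyone else.
nacl = [
    NaclRule(100, "203.0.113.0/24", 443, allow=False),
    NaclRule(200, "0.0.0.0/0", 443, allow=True),
]
print(nacl_allows(nacl, "203.0.113.7", 443))
print(nacl_allows(nacl, "198.51.100.1", 443))
print(security_group_allows([("0.0.0.0/0", 443)], "203.0.113.7", 443))
```

Swapping the two rule numbers in `nacl` would let the blocked range through, which is why rule order matters for NACLs but not for security groups.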
Comprehensive Interview Questions
Q1: Design a scalable web application architecture
Answer:
Architecture Components:
Users
  ↓
CloudFront CDN (Static Content)
  ↓
Route 53 (DNS)
  ↓
Application Load Balancer (Multi-AZ)
  ↓
Auto Scaling Group
├── Web Servers (Stateless)
└── Application Servers
  ↓
ElastiCache (Redis) - Session Store
  ↓
RDS Multi-AZ (Database)
  ↓
S3 (File Storage)
Key Design Decisions:
- CDN: Reduce latency, offload origin
- Load Balancer: Distribute traffic, health checks
- Auto Scaling: Handle traffic spikes
- Stateless Servers: Enable horizontal scaling
- Caching: Reduce database load
- Multi-AZ Database: High availability
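The caching decision is often implemented with the cache-aside pattern: read from the cache first and go to the database only on a miss. The Python sketch below shows the idea with an in-memory dictionary standing in for Redis. The `CacheAside` class, the `load_from_db` callback, and the 60-second TTL are assumptions made for illustration.

```python
import time
from typing import Callable


class CacheAside:
    """Read-through helper: serve from the cache, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float = 60.0):
        self._load = load_from_db
        self._ttl = ttl_seconds
        # key -> (expiry timestamp, cached value)
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> str:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]  # cache hit: the database is not touched
        value = self._load(key)  # miss or expired: query the database
        self._entries[key] = (time.monotonic() + self._ttl, value)
        return value


db_queries = []


def fake_db_lookup(key: str) -> str:
    """Hypothetical stand-in for a database query."""
    db_queries.append(key)
    return f"profile-for-{key}"


cache = CacheAside(fake_db_lookup)
cache.get("user-1")
cache.get("user-1")
print(len(db_queries))  # number of database queries for two reads
```

The TTL bounds how stale a cached value can get. Shorter TTLs mean fresher data at the cost of more database reads.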
Q2: How do you ensure 99.99% availability?
Answer:
Requirements:
- 99.99% = 52.56 minutes downtime per year
- Requires redundancy at every level
Implementation:
1. Multi-AZ Deployment: All components in multiple AZs
2. Health Checks: Automatic failure detection
3. Auto Scaling: Replace failed instances
4. Database Replication: Automatic failover
5. Monitoring: Real-time alerts
6. Disaster Recovery: Multi-region backup
Example:
- Load Balancer: Multi-AZ (99.99%)
- Application: Multi-AZ with auto-scaling (99.95%)
- Database: Multi-AZ with automatic failover (99.95%)
- Overall (components in series): 0.9999 × 0.9995 × 0.9995 ≈ 99.89% (each tier needs more redundancy to reach 99.99%)
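The arithmetic behind this answer is worth being able to reproduce on a whiteboard. The Python sketch below covers the three calculations involved: the yearly downtime budget, the availability of components chained in series, and the availability of redundant copies in parallel. The function names are illustrative, and the parallel formula assumes failures are independent, which real deployments only approximate.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """Components in series: all must be up, so availabilities multiply."""
    return math.prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """Redundant copies: down only if every copy is down (independence assumed)."""
    return 1 - (1 - availability) ** copies


print(f"{downtime_minutes_per_year(0.9999):.2f} minutes/year")
print(f"series:   {serial(0.9999, 0.9995, 0.9995):.4%}")
print(f"parallel: {parallel(0.9995, 2):.6%}")
```

Series composition always lowers availability below the weakest tier, while parallel redundancy raises it, which is why every tier has to be duplicated to reach a 99.99% target.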
Q3: Explain the difference between availability and reliability
Answer:
Availability:
- System is operational when needed
- Measured as uptime percentage
- Example: 99.9% availability
Reliability:
- System performs correctly over time
- Measured as mean time between failures (MTBF)
- Includes correctness of operations
Relationship:
- High availability doesn't guarantee reliability
- System can be available but unreliable (buggy)
- Both are important for production systems
Recommended Resources
Books
- "Designing Data-Intensive Applications" by Martin Kleppmann - System design principles
- "Site Reliability Engineering" by Google - Production system design