Cloud Architecture

🎯 Learning Objectives

  • Design scalable and resilient cloud architectures
  • Understand high availability and fault tolerance patterns
  • Master disaster recovery strategies
  • Learn cost optimization techniques
  • Know security and compliance in cloud architecture

Cloud architecture design is a core DevOps skill. This chapter covers how to design cloud architectures that are scalable, resilient, secure, and cost-effective.

Interview Focus

Be ready to design complete architectures, explain trade-offs, discuss scalability, and optimize for cost and performance.

Architecture Design Principles

The Five Pillars of the Well-Architected Framework

1. Operational Excellence:

  • Automate changes and responses
  • Test all procedures
  • Learn from failures
  • Keep procedures current

2. Security:

  • Implement a strong identity foundation
  • Enable traceability
  • Apply security at all layers
  • Automate security best practices

3. Reliability:

  • Test recovery procedures
  • Automatically recover from failures
  • Scale horizontally
  • Stop guessing capacity

4. Performance Efficiency:

  • Democratize advanced technologies
  • Go global in minutes
  • Use serverless architectures
  • Experiment more often

5. Cost Optimization:

  • Adopt a consumption model
  • Measure overall efficiency
  • Stop spending money on undifferentiated heavy lifting
  • Analyze and attribute expenditure

Scalability Patterns

Horizontal vs Vertical Scaling

Horizontal Scaling (Scale Out):

  • Add more instances/nodes
  • Well suited to the cloud, where capacity can be added on demand (subject to account quotas and regional capacity)
  • Improves availability
  • Requires load balancing
  • Example: Add more web servers

Vertical Scaling (Scale Up):

  • Increase instance size
  • Limited by the largest available instance type
  • Remains a single point of failure
  • No code changes needed, but resizing usually requires a restart
  • Example: Upgrade from t3.medium to t3.large

When to Use:

  • Horizontal: Stateless applications, high availability needed
  • Vertical: Stateful applications, single instance, quick fix

Auto Scaling Strategies

Reactive Scaling:

  • Scale based on current metrics
  • CPU, memory, request count
  • Responds to actual load

Predictive Scaling:

  • Use machine learning to predict demand
  • Scale before load increases
  • Better for predictable patterns

Scheduled Scaling:

  • Scale based on time schedules
  • Known traffic patterns
  • Business hours, events

Example Auto Scaling Configuration:

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
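  # Custom per-pod metric; it must be exposed through the custom metrics API
  # (for example, by a Prometheus adapter) before the HPA can read it
  - type: Pods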
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
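The replica count the HPA arrives at can be approximated by hand. The Kubernetes documentation gives the core calculation as `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`. When several metrics are configured, the largest proposal wins. The sketch below applies that rule with the `minReplicas`/`maxReplicas` bounds from the manifest above. It deliberately ignores the controller's tolerance band, stabilization windows, and pod readiness handling, and the function names are illustrative rather than part of any Kubernetes API.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Replica count proposed by a single metric: ceil of the scaled ratio."""
    return math.ceil(current_replicas * current_value / target_value)


def hpa_decision(current_replicas, metrics, min_replicas=2, max_replicas=20):
    """metrics is a list of (current_value, target_value) pairs, one per metric.

    The largest per-metric proposal wins, then the result is clamped
    to the configured replica bounds.
    """
    proposals = [desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


# 4 pods averaging 90% CPU (target 70) and 150 requests/s per pod (target 100)
print(hpa_decision(4, [(90, 70), (150, 100)]))
```

Both metrics propose six replicas in this example, so the Deployment would grow from four pods to six.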

High Availability Architecture

Multi-AZ Deployment

Architecture:

                    Internet
                       │
                    Route 53
                       │
            Application Load Balancer
            (Multi-AZ, Health Checks)
                       │
        ┌──────────────┴───────────────┐
        │                              │
    ┌───┴───┐                      ┌───┴───┐
    │ AZ-1a │                      │ AZ-1b │
    │       │                      │       │
    │  Web  │                      │  Web  │
    │  App  │                      │  App  │
    │       │                      │       │
    └───┬───┘                      └───┬───┘
        │                              │
        └──────────┬───────────────────┘
        ┌──────────┴───────────┐
        │   RDS Multi-AZ       │
        │  Primary (AZ-1a)     │
        │  Standby (AZ-1b)     │
        │  Automatic Failover  │
        └──────────────────────┘

Components:

  • Load Balancer: Distributes traffic across AZs
  • Health Checks: Remove unhealthy instances from rotation
  • Database Replication: Automatic failover to the standby
  • Data Synchronization: Synchronous replication to the standby
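How health checks and traffic distribution interact can be sketched in a few lines. The example below is a simplified model, not the behavior of any specific load balancer: a target is marked unhealthy after a number of consecutive failed checks (the threshold of 2 is an assumed value), a single passing check restores it, and requests are spread round-robin over the healthy targets only. The target names are hypothetical.

```python
from dataclasses import dataclass

UNHEALTHY_THRESHOLD = 2  # assumed: consecutive failures before removal


@dataclass
class Target:
    name: str
    az: str
    consecutive_failures: int = 0
    healthy: bool = True


def record_check(target: Target, passed: bool) -> None:
    """Update a target's health state from one health-check result."""
    if passed:
        target.consecutive_failures = 0
        target.healthy = True
    else:
        target.consecutive_failures += 1
        if target.consecutive_failures >= UNHEALTHY_THRESHOLD:
            target.healthy = False


def route(targets, n_requests):
    """Distribute requests round-robin across healthy targets only."""
    healthy = [t for t in targets if t.healthy]
    if not healthy:
        raise RuntimeError("no healthy targets available")
    return [healthy[i % len(healthy)].name for i in range(n_requests)]


targets = [Target("web-1a", "AZ-1a"), Target("web-1b", "AZ-1b")]
record_check(targets[0], passed=False)
record_check(targets[0], passed=False)  # second failure removes web-1a
print(route(targets, 3))
```

After the second failed check all traffic goes to the instance in the other AZ, which is the failure mode a Multi-AZ deployment is meant to absorb.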

Active-Active vs Active-Passive

Active-Active:

  • All instances handle traffic
  • Better resource utilization
  • More complex to implement
  • Use for: Stateless applications

Active-Passive:

  • Primary handles traffic, standby ready
  • Simpler to implement
  • Lower resource utilization
  • Use for: Stateful applications, databases

Disaster Recovery Strategies

RTO and RPO

RTO (Recovery Time Objective):

  • Maximum acceptable downtime
  • How quickly the system must recover
  • Example: 4 hours RTO

RPO (Recovery Point Objective):

  • Maximum acceptable data loss, measured in time
  • How much recent data can be lost
  • Example: 1 hour RPO (up to 1 hour of data can be lost)
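A quick way to sanity-check a backup plan against these two objectives is shown below. It is a minimal sketch under two assumptions: the worst-case data loss equals one full backup interval (a failure just before the next backup), and `restore_time` already includes detection and failover. The function names are illustrative.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses a full interval of writes."""
    return backup_interval


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Compare a backup plan against the stated RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": restore_time <= rto,
    }


# Backups every 6 hours, 3-hour restore, against a 1-hour RPO and 4-hour RTO
result = meets_objectives(
    backup_interval=timedelta(hours=6),
    restore_time=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)
```

Here the restore fits inside the 4-hour RTO, but six-hourly backups cannot satisfy a 1-hour RPO. Backups would need to run at least hourly, or be replaced by continuous replication.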

Disaster Recovery Strategies

1. Backup and Restore (Highest RTO/RPO):

  • Regular backups to S3
  • Restore when disaster occurs
  • Cost: Low
  • RTO: Hours to days
  • RPO: Hours to days

2. Pilot Light:

  • Minimal version running in DR region
  • Core services ready
  • Scale up when needed
  • Cost: Low to medium
  • RTO: Minutes to hours
  • RPO: Minutes

3. Warm Standby:

  • Scaled-down version always running
  • Can handle minimal load
  • Scale up quickly
  • Cost: Medium
  • RTO: Minutes
  • RPO: Minutes

4. Multi-Site Active-Active (Lowest RTO/RPO):

  • Full production in multiple sites
  • Load balanced across sites
  • Cost: High
  • RTO: Near zero
  • RPO: Near zero
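Choosing among the four strategies usually comes down to picking the cheapest one that still meets the required RTO and RPO. The sketch below encodes that rule. The numeric RTO/RPO figures are assumptions chosen only to make the qualitative ranges above ("hours to days", "minutes", "near zero") comparable, and should be replaced with measured values from real DR tests.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    cost_rank: int    # 1 = cheapest, following the ordering in the text
    rto: timedelta    # assumed typical recovery time
    rpo: timedelta    # assumed typical data-loss window


# Illustrative figures only; they are not taken from any provider's documentation.
STRATEGIES = [
    Strategy("Backup and Restore", 1, timedelta(days=1), timedelta(days=1)),
    Strategy("Pilot Light", 2, timedelta(hours=1), timedelta(minutes=15)),
    Strategy("Warm Standby", 3, timedelta(minutes=15), timedelta(minutes=5)),
    Strategy("Multi-Site Active-Active", 4, timedelta(minutes=1), timedelta(seconds=5)),
]


def cheapest_strategy(required_rto: timedelta, required_rpo: timedelta) -> Optional[str]:
    """Return the lowest-cost strategy that satisfies both objectives, if any."""
    candidates = [
        s for s in STRATEGIES
        if s.rto <= required_rto and s.rpo <= required_rpo
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_rank).name


print(cheapest_strategy(timedelta(hours=4), timedelta(hours=1)))
```

With the assumed figures, a 4-hour RTO and 1-hour RPO rule out Backup and Restore and make Pilot Light the cheapest fit.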

Cost Optimization

Right Sizing

Process:

1. Monitor resource utilization (for example, with CloudWatch metrics)
2. Identify over-provisioned resources
3. Identify under-provisioned resources
4. Adjust instance types to match observed demand

Tools:

  • AWS Cost Explorer
  • AWS Trusted Advisor
  • CloudWatch metrics
  • Third-party tools (CloudHealth, Spot.io)
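The identification steps can be sketched as a simple classification over utilization samples. The thresholds below (peak CPU under 20% means over-provisioned, average CPU above 80% means under-provisioned) are assumed values for illustration, and the instance names and samples are made up. In practice the samples would come from a monitoring system, and memory and network would be considered alongside CPU.

```python
from statistics import mean

LOW_PEAK_CPU = 20.0   # assumed: even the peak never exceeds this
HIGH_AVG_CPU = 80.0   # assumed: sustained average above this


def classify(cpu_percent_samples):
    """Label an instance from its CPU utilization samples (percent)."""
    if max(cpu_percent_samples) < LOW_PEAK_CPU:
        return "over-provisioned"
    if mean(cpu_percent_samples) > HIGH_AVG_CPU:
        return "under-provisioned"
    return "right-sized"


# Hypothetical fleet with a few utilization samples per instance
fleet = {
    "web-1": [5, 8, 12, 9],
    "batch-1": [85, 92, 88, 95],
    "api-1": [35, 60, 45, 50],
}
report = {name: classify(samples) for name, samples in fleet.items()}
print(report)
```

Using the peak rather than the average for the over-provisioned test is a conservative choice: an instance is only flagged for downsizing if it never gets busy.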

Reserved Instances

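Types:

  • Standard RIs: Up to 72% savings; some attributes (such as Availability Zone or scope) can be modified, but the RI cannot be exchanged for a different instance family
  • Convertible RIs: Lower discount than Standard RIs (commonly cited as up to 54%; verify against current AWS pricing), can be exchanged for RIs with different attributes
  • Scheduled RIs: For workloads that recur on a fixed schedule (AWS no longer offers these for new purchases)

Payment Options:

  • All Upfront: Highest discount
  • Partial Upfront: Medium discount
  • No Upfront: Lowest discount, monthly payments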
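Whether a reservation pays off depends on how many hours the instance actually runs, because a reservation is billed for every hour of the term. The break-even point can be worked out as below. The hourly rates are invented for the example and are not real AWS prices.

```python
HOURS_PER_YEAR = 8760


def on_demand_cost(hourly_rate: float, hours_used: float) -> float:
    """On-demand is billed only for the hours the instance runs."""
    return hourly_rate * hours_used


def reserved_cost(upfront: float, hourly_rate: float) -> float:
    """A 1-year reservation is billed for every hour, used or not."""
    return upfront + hourly_rate * HOURS_PER_YEAR


def break_even_utilization(on_demand_rate: float, upfront: float, reserved_rate: float) -> float:
    """Fraction of the year the instance must run for the RI to cost less."""
    return reserved_cost(upfront, reserved_rate) / (on_demand_rate * HOURS_PER_YEAR)


# Hypothetical prices: $0.10/h on demand vs. a No Upfront RI at $0.06/h
threshold = break_even_utilization(on_demand_rate=0.10, upfront=0.0, reserved_rate=0.06)
print(f"RI is cheaper above {threshold:.0%} utilization")
```

With these example prices the reservation only wins if the instance runs more than about 60% of the year, which is why RIs fit steady baseline load rather than bursty workloads.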

Spot Instances

Use Cases:

  • Fault-tolerant workloads
  • Batch processing
  • CI/CD pipelines
  • Development/testing

Best Practices:

  • Use Spot Fleet for availability
  • Implement checkpointing
  • Handle interruptions gracefully
  • Diversify across instance types
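Checkpointing is what makes a batch job safe to run on Spot capacity: progress is saved after each unit of work, so a replacement instance resumes where the interrupted one stopped. The sketch below simulates this locally. The `interrupt_after` parameter stands in for a Spot interruption, the doubling is a placeholder for real work, and the checkpoint is a local JSON file. A real job would write to durable storage such as S3 so that the checkpoint outlives the instance.

```python
import json
import os
import tempfile


def process_batch(items, checkpoint_path, interrupt_after=None):
    """Process items in order, saving the next index after each item.

    interrupt_after simulates an interruption once that many items
    have been processed in this run. Returns (results, finished).
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    processed = []
    for i in range(start, len(items)):
        if interrupt_after is not None and len(processed) >= interrupt_after:
            return processed, False  # interrupted; checkpoint is already saved
        processed.append(items[i] * 2)  # placeholder for real work
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
    return processed, True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    first, first_done = process_batch([1, 2, 3, 4, 5], path, interrupt_after=2)
    second, second_done = process_batch([1, 2, 3, 4, 5], path)

print(first, first_done)
print(second, second_done)
```

The second run picks up at the third item instead of starting over, so no completed work is repeated after the interruption.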

Security Architecture

Defense in Depth

Layers:

1. Network: VPC, subnets, security groups, NACLs
2. Identity: IAM, MFA, least privilege
3. Data: Encryption at rest and in transit
4. Application: WAF, input validation
5. Monitoring: CloudTrail, CloudWatch, GuardDuty

Network Security

VPC Design:

Internet
Internet Gateway
Public Subnet (DMZ)
   ├── NAT Gateway
   └── Bastion Host
Private Subnet (Application)
   └── Application Servers
Private Subnet (Data)
   └── Database (No Internet)

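Security Groups:

  • Stateful firewall (return traffic is allowed automatically)
  • Instance-level security
  • Allow rules only
  • Implicit deny: traffic is blocked unless a rule allows it (a new group allows all outbound traffic by default)

NACLs:

  • Stateless firewall (return traffic needs its own rule)
  • Subnet-level security
  • Explicit allow/deny
  • Rule order matters: rules are evaluated from the lowest number, and the first match wins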
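The difference in rule evaluation can be sketched as follows. This is a simplified model that matches on a single port only and ignores protocol, CIDR ranges, direction, and statefulness. The class and function names are illustrative. It shows why rule order matters for NACLs (lowest rule number first, first match wins) while security groups are just a set of allow rules.

```python
from typing import NamedTuple


class NaclRule(NamedTuple):
    number: int
    port: int
    action: str  # "allow" or "deny"


def nacl_allows(rules, port: int) -> bool:
    """Evaluate rules in ascending rule number; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.action == "allow"
    return False  # nothing matched: implicit deny


def security_group_allows(allowed_ports, port: int) -> bool:
    """Security groups hold allow rules only; anything unlisted is denied."""
    return port in allowed_ports


deny_first = [NaclRule(100, 22, "deny"), NaclRule(200, 22, "allow")]
allow_first = [NaclRule(100, 22, "allow"), NaclRule(200, 22, "deny")]

print(nacl_allows(deny_first, 22))
print(nacl_allows(allow_first, 22))
print(security_group_allows({80, 443}, 22))
```

The two NACLs contain the same pair of rules for port 22, yet they give opposite results because only the lowest-numbered match is applied.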

Comprehensive Interview Questions

Q1: Design a scalable web application architecture

Answer:

Architecture Components:

Users
CloudFront CDN (Static Content)
Route 53 (DNS)
Application Load Balancer (Multi-AZ)
Auto Scaling Group
  ├── Web Servers (Stateless)
  └── Application Servers
ElastiCache (Redis) - Session Store
RDS Multi-AZ (Database)
S3 (File Storage)

Key Design Decisions:

  • CDN: Reduce latency, offload origin
  • Load Balancer: Distribute traffic, health checks
  • Auto Scaling: Handle traffic spikes
  • Stateless Servers: Enable horizontal scaling
  • Caching: Reduce database load
  • Multi-AZ Database: High availability

Q2: How do you ensure 99.99% availability?

Answer:

Requirements:

  • 99.99% = 52.56 minutes downtime per year
  • Requires redundancy at every level

Implementation:

1. Multi-AZ Deployment: All components in multiple AZs
2. Health Checks: Automatic failure detection
3. Auto Scaling: Replace failed instances
4. Database Replication: Automatic failover
5. Monitoring: Real-time alerts
6. Disaster Recovery: Multi-region backup

Example:

  • Load Balancer: Multi-AZ (99.99%)
  • Application: Multi-AZ with auto-scaling (99.95%)
  • Database: Multi-AZ with automatic failover (99.95%)
  • Overall: Components in series multiply, so 0.9999 × 0.9995 × 0.9995 ≈ 99.89%, which falls short of 99.99%. Reaching the target requires further redundancy at every tier, for example independent stacks in more than one region.
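The composite figure follows from two standard rules: availabilities of components in series multiply, and N redundant copies fail only if all N fail at once. The sketch below applies both to the three tiers in the example. The parallel formula assumes failures are independent, which real deployments only approximate, and the function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def serial(*availabilities: float) -> float:
    """Every component must be up, so availabilities multiply."""
    return math.prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """Redundant copies fail only if all fail (assumes independent failures)."""
    return 1 - (1 - availability) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


# Load balancer, application tier, and database tier in series
overall = serial(0.9999, 0.9995, 0.9995)

# Same chain with the application and database tiers each duplicated
redundant = serial(0.9999, parallel(0.9995, 2), parallel(0.9995, 2))

print(f"series:    {overall:.4%}")
print(f"redundant: {redundant:.5%}")
print(f"99.99% budget: {downtime_minutes_per_year(0.9999):.2f} min/year")
```

Duplicating the application and database tiers lifts the total to just under 99.99%. The load balancer tier then caps the result, so it too needs a redundant path to go higher.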

Q3: Explain the difference between availability and reliability

Answer:

Availability:

  • System is operational when needed
  • Measured as uptime percentage
  • Example: 99.9% availability

Reliability:

  • System performs correctly over time
  • Measured as mean time between failures (MTBF)
  • Includes correctness of operations

Relationship:

  • High availability doesn't guarantee reliability
  • System can be available but unreliable (buggy)
  • Both are important for production systems
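The relationship can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). MTTR (mean time to repair) is not discussed above and is introduced here only for the calculation. The two systems are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# System A fails rarely but takes an hour to repair.
system_a = availability(mtbf_hours=1000, mttr_hours=1)

# System B fails 100 times as often but recovers in 36 seconds.
system_b = availability(mtbf_hours=10, mttr_hours=0.01)

print(f"A: {system_a:.4%}  B: {system_b:.4%}")
```

Both systems come out at roughly 99.9% availability, yet System B fails a hundred times more often. Equal uptime percentages can hide very different reliability.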

Books

  • "Designing Data-Intensive Applications" by Martin Kleppmann - System design principles
  • "Site Reliability Engineering" by Google - Production system design
