Site Reliability Engineer AIML - Associate Resources

🤖 SRE for AI/ML Systems

Curated resources on Site Reliability Engineering for AI/ML infrastructure, covering LLM serving, monitoring, and incident response.

📖 Books

SRE & Reliability Engineering
  1. Site Reliability Engineering: How Google Runs Production Systems - Google SRE Team (Free)
  2. The Site Reliability Workbook - Google SRE Team (Free)
  3. Building Secure and Reliable Systems - Google SRE Team (Free)
  4. Reliability Engineering Handbook - Bryan Dodson
  5. Chaos Engineering - Casey Rosenthal & Nora Jones
ML Operations & MLOps
  1. Introducing MLOps - Mark Treveil & the Dataiku Team
  2. Practical MLOps - Noah Gift & Alfredo Deza
  3. Building Machine Learning Powered Applications - Emmanuel Ameisen
  4. Designing Machine Learning Systems - Chip Huyen
AI/ML Infrastructure & Serving
  1. High Performance Browser Networking - Ilya Grigorik
  2. Designing Data-Intensive Applications - Martin Kleppmann
  3. Systems Performance: Enterprise and the Cloud - Brendan Gregg

📄 Research Papers

SLO/SLA for AI Systems
  1. SLOs for ML Systems - Service Level Objectives for Machine Learning
  2. Monitoring and Explainability of Models in Production - ML Monitoring Best Practices
  3. Continuous Training for Production ML - Continuous Learning Systems
  4. The Tail at Scale - Google research on tail latency in large-scale services
ML Model Serving & Inference
  1. Clipper: A Low-Latency Online Prediction Serving System - NSDI 2017
  2. InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines - SoCC 2020
  3. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up - OSDI 2020
  4. Batch Processing for ML Inference - Efficient Batch Inference
Model Drift & Monitoring
  1. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift - Dataset Shift Detection
  2. Monitoring ML Models in Production - Production ML Monitoring
  3. A Survey on Concept Drift Adaptation - Concept Drift in ML
  4. Data Validation for Machine Learning - Data Quality Monitoring
LLM Serving & Performance
  1. Efficiently Scaling Transformer Inference - Transformer Inference Optimization
  2. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM, SOSP 2023) - Efficient LLM Serving
  3. Orca: A Distributed Serving System for Transformer-Based Generative Models - OSDI 2022
  4. Fast Inference from Transformers via Speculative Decoding - Speculative Decoding
Cost Optimization & Resource Management
  1. Gandiva: Introspective Cluster Scheduling for Deep Learning - GPU Scheduling
  2. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters - Resource Scheduling
  3. The Case for Learned Index Structures - Learned Indexes

⭐ GitHub Repositories

ML Observability & Monitoring
  1. OpenInference - Open standard for ML observability
  2. Evidently AI - ML model monitoring and drift detection
  3. Arize AI Phoenix - Open-source ML observability
  4. WhyLabs - Data logging and monitoring for ML
  5. Fiddler AI - ML monitoring and explainability
LLM Serving & Inference
  1. vLLM - Fast LLM inference and serving
  2. Text Generation Inference - Hugging Face LLM serving
  3. TensorRT-LLM - NVIDIA's LLM inference optimization
  4. Ray Serve - Scalable model serving framework
  5. Seldon Core - ML model deployment on Kubernetes
MLOps & Model Deployment
  1. MLflow - ML lifecycle platform
  2. Kubeflow - ML toolkit for Kubernetes
  3. BentoML - Model serving framework
  4. Cortex - ML model serving platform
  5. Triton Inference Server - NVIDIA's inference server
SRE & Reliability Tools
  1. Google SRE Book - Site Reliability Engineering book
  2. Awesome SRE - Curated SRE resources
  3. Chaos Engineering - Chaos engineering for Kubernetes
  4. Litmus - Cloud-native chaos engineering
  5. Gremlin - Chaos engineering platform
OpenTelemetry & Observability
  1. OpenTelemetry - Observability standard
  2. OpenTelemetry Python - Python instrumentation
  3. OpenTelemetry Operator - K8s operator for OTel
  4. Prometheus - Metrics collection and alerting
  5. Grafana - Visualization and dashboards
Cost Optimization & GPU Management
  1. Kubernetes GPU Scheduler - NVIDIA GPU support for K8s
  2. GKE GPU Sharing - GPU sharing in GKE
  3. Kubecost - Kubernetes cost monitoring
  4. Cluster Autoscaler - K8s cluster autoscaling

🎥 Videos & Courses

SRE Courses
  1. Google SRE Training - Google's SRE education resources
  2. Site Reliability Engineering Course (Coursera) - SRE fundamentals
  3. Chaos Engineering Course - O'Reilly training
MLOps & Production ML
  1. Full Stack Deep Learning - Production ML course
  2. MLOps Specialization (Coursera) - MLOps fundamentals
  3. Made With ML - Production ML best practices
  4. Stanford CS329S: Machine Learning Systems Design - ML systems course
LLM Serving & Inference
  1. Efficient LLM Inference (YouTube) - Various tutorials
  2. vLLM Tutorial - vLLM documentation and guides
  3. Hugging Face Inference Course - Model serving with HF
Observability & Monitoring
  1. OpenTelemetry Course - Official OTel documentation
  2. Prometheus & Grafana Tutorials - Monitoring stack
  3. Datadog Learning Center - Observability best practices

📰 Articles & Blogs

SRE for AI/ML
  1. Google AI Blog - ML Reliability - Google's AI reliability practices
  2. Netflix Tech Blog - ML Infrastructure - Netflix ML systems
  3. Uber Engineering - ML Platform - Uber's ML infrastructure
  4. LinkedIn Engineering - ML Serving - LinkedIn ML systems
LLM Serving & Performance
  1. Anyscale Blog - LLM Serving - LLM inference optimization
  2. Together AI Blog - LLM infrastructure insights
  3. vLLM Blog Posts - vLLM performance and features
  4. Hugging Face Blog - Inference - Model serving best practices
ML Monitoring & Observability
  1. Arize AI Blog - ML observability and monitoring
  2. Evidently AI Blog - ML model monitoring
  3. WhyLabs Blog - Data quality and monitoring
  4. Fiddler AI Blog - ML explainability and monitoring
SLO/SLA & Reliability
  1. Google SRE Book - SLOs - SLO best practices
  2. SLO Engineering Guide - SLO documentation
  3. Error Budgets - Error budget management
  4. Reliability Engineering Blog - USENIX reliability articles
Cost Optimization
  1. AWS Cost Optimization for ML - AWS ML cost optimization
  2. Google Cloud - ML Cost Optimization - GCP cost strategies
  3. GPU Cost Optimization - Databricks cost optimization
Incident Response & Chaos Engineering
  1. Chaos Engineering Principles - Chaos engineering manifesto
  2. Netflix Chaos Engineering - Netflix chaos practices
  3. Gremlin Blog - Chaos engineering insights
  4. Incident Response Playbooks - PagerDuty incident response
AI-Specific Observability Tools
  1. OpenInference Specification - ML observability standard
  2. OpenTelemetry for ML - Observability for AI systems
  3. MLflow Tracking - ML experiment tracking
  4. Weights & Biases - ML experiment tracking and monitoring
  5. Neptune AI - ML experiment management
SLO/SLA Frameworks for AI
  1. SLOs for ML Systems (Paper) - Academic research
  2. Google SRE SLO Guide - Practical SLO implementation
  3. SLI/SLO/SLA Best Practices - Google's approach
  4. AI Metrics: Accuracy, Fairness, Latency - Fairness in ML
LLM Performance Metrics
  1. TTFT (Time To First Token) Optimization - vLLM paper
  2. TPOT (Time Per Output Token) Metrics - Ray Serve blog
  3. LLM Latency Benchmarks - LMSYS benchmarks
  4. Inference Performance Guide - Hugging Face optimization
Model Drift & Continuous Evaluation
  1. Data Drift Detection - Research paper
  2. Concept Drift Adaptation - Survey paper
  3. Continuous ML Monitoring - Production monitoring
  4. ML Model Validation - Data validation
AI Incident Response
  1. AI Circuit Breakers - Fault tolerance for ML
  2. Automated Rollback Strategies - Continuous training
  3. ML Incident Response Playbooks - SRE incident management
  4. Chaos Engineering for ML - Testing ML resilience
GPU Scheduling & Resource Management
  1. Kubernetes GPU Scheduling - K8s GPU docs
  2. GPU Sharing Strategies - Multi-tenant GPU sharing
  3. Resource Optimization for ML - Gandiva paper
  4. GPU Cost Optimization - GKE GPU guide
AI Gateways & Load Balancing
  1. AI Gateway Architecture - AI gateway patterns
  2. Load Balancing for ML Inference - Clipper paper
  3. Caching Strategies for LLMs - vLLM caching
  4. Traffic Management for AI - Istio for ML
Cost Optimization for AI Infrastructure
  1. AWS Cost Optimization - AWS strategies
  2. GCP ML Cost Management - GCP best practices
  3. Kubernetes Cost Optimization - Kubecost tooling
  4. GPU Cost Analysis - Databricks insights
Multi-Region & Multi-Cloud AI
  1. Multi-Region ML Deployment - AWS guide
  2. Kubernetes Multi-Cluster - K8s federation
  3. Disaster Recovery for ML - SRE disaster recovery
  4. Failover Strategies - ML system failover
Security for AI Infrastructure
  1. ML Security Best Practices - OWASP ML Top 10
  2. AI Model Security - Adversarial attacks
  3. Secure ML Deployment - NIST guidelines
  4. Kubernetes Security - K8s security practices
Chaos Engineering & Resilience Testing
  1. Chaos Engineering Principles - Core principles
  2. Chaos Mesh Documentation - Chaos engineering tool
  3. Litmus Chaos - Cloud-native chaos
  4. Resilience Testing for ML - ML system resilience
Continuous Integration for ML
  1. ML CI/CD Best Practices - MLOps principles
  2. GitHub Actions for ML - CI/CD automation
  3. ML Testing Strategies - Testing ML systems
  4. Pre-deployment Validation - ML validation