Site Reliability Engineer AIML - Associate Resources
🤖 SRE for AI/ML Systems
A curated list of resources on Site Reliability Engineering for AI/ML infrastructure, covering LLM serving, monitoring, and incident response.
📖 Books
SRE & Reliability Engineering
- Site Reliability Engineering: How Google Runs Production Systems - Google SRE Team (Free)
- The Site Reliability Workbook - Google SRE Team (Free)
- Building Secure and Reliable Systems - Google SRE Team (Free)
- Reliability Engineering Handbook - Bryan Dodson
- Chaos Engineering - Casey Rosenthal & Nora Jones
ML Operations & MLOps
- Introducing MLOps - Mark Treveil & the Dataiku Team
- Practical MLOps - Noah Gift & Alfredo Deza
- Building Machine Learning Powered Applications - Emmanuel Ameisen
- Designing Machine Learning Systems - Chip Huyen
AI/ML Infrastructure & Serving
- High Performance Browser Networking - Ilya Grigorik
- Designing Data-Intensive Applications - Martin Kleppmann
- Systems Performance: Enterprise and the Cloud - Brendan Gregg
📄 Research Papers
SLO/SLA for AI Systems
- SLOs for ML Systems - Service Level Objectives for Machine Learning
- Monitoring and Explainability of Models in Production - ML Monitoring Best Practices
- Continuous Training for Production ML - Continuous Learning Systems
- The Tail at Scale - Google research on latency optimization
ML Model Serving & Inference
- Clipper: A Low-Latency Online Prediction Serving System - NSDI 2017
- InferLine: ML Inference Pipeline Composition Framework - SoCC 2020
- Serving DNNs like Clockwork: Performance Predictability from the Bottom Up - OSDI 2020
- Batch Processing for ML Inference - Efficient Batch Inference
Model Drift & Monitoring
- Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift - Dataset Shift Detection
- Monitoring ML Models in Production - Production ML Monitoring
- A Survey on Concept Drift Adaptation - Concept Drift in ML
- Data Validation for Machine Learning - Data Quality Monitoring
LLM Serving & Performance
- Efficiently Scaling Transformer Inference - Transformer Inference Optimization
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - Efficient LLM Serving
- Orca: A Distributed Serving System for Transformer-Based Generative Models - OSDI 2022
- Fast Inference from Transformers via Speculative Decoding - Speculative Decoding
Cost Optimization & Resource Management
- Gandiva: Introspective Cluster Scheduling for Deep Learning - GPU Scheduling
- Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters - Resource Scheduling
- The Case for Learned Index Structures - Learned Indexes
⭐ GitHub Repositories
ML Observability & Monitoring
- OpenInference - Open standard for ML observability
- Evidently AI - ML model monitoring and drift detection
- Arize AI Phoenix - Open-source ML observability
- WhyLabs - Data logging and monitoring for ML
- Fiddler AI - ML monitoring and explainability
LLM Serving & Inference
- vLLM - Fast LLM inference and serving
- Text Generation Inference - Hugging Face LLM serving
- TensorRT-LLM - NVIDIA's LLM inference optimization
- Ray Serve - Scalable model serving framework
- Seldon Core - ML model deployment on Kubernetes
MLOps & Model Deployment
- MLflow - ML lifecycle platform
- Kubeflow - ML toolkit for Kubernetes
- BentoML - Model serving framework
- Cortex - ML model serving platform
- Triton Inference Server - NVIDIA's inference server
SRE & Reliability Tools
- Google SRE Book - Site Reliability Engineering book
- Awesome SRE - Curated SRE resources
- Chaos Engineering - Chaos engineering for Kubernetes
- Litmus - Cloud-native chaos engineering
- Gremlin - Chaos engineering platform
OpenTelemetry & Observability
- OpenTelemetry - Observability standard
- OpenTelemetry Python - Python instrumentation
- OpenTelemetry Operator - K8s operator for OTel
- Prometheus - Metrics collection and alerting
- Grafana - Visualization and dashboards
Cost Optimization & GPU Management
- Kubernetes GPU Scheduler - NVIDIA GPU support for K8s
- GKE GPU Sharing - GPU sharing in GKE
- Kubecost - Kubernetes cost monitoring
- Cluster Autoscaler - K8s cluster autoscaling
🎥 Videos & Courses
SRE Courses
- Google SRE Training - Google's SRE education resources
- Site Reliability Engineering Course (Coursera) - SRE fundamentals
- Chaos Engineering Course - O'Reilly training
MLOps & Production ML
- Full Stack Deep Learning - Production ML course
- MLOps Specialization (Coursera) - MLOps fundamentals
- Made With ML - Production ML best practices
- Stanford CS329S: Machine Learning Systems Design - ML systems course
LLM Serving & Inference
- Efficient LLM Inference (YouTube) - Various tutorials
- vLLM Tutorial - vLLM documentation and guides
- Hugging Face Inference Course - Model serving with HF
Observability & Monitoring
- OpenTelemetry Course - Official OTel documentation
- Prometheus & Grafana Tutorials - Monitoring stack
- Datadog Learning Center - Observability best practices
📰 Articles & Blogs
SRE for AI/ML
- Google AI Blog - ML Reliability - Google's AI reliability practices
- Netflix Tech Blog - ML Infrastructure - Netflix ML systems
- Uber Engineering - ML Platform - Uber's ML infrastructure
- LinkedIn Engineering - ML Serving - LinkedIn ML systems
LLM Serving & Performance
- Anyscale Blog - LLM Serving - LLM inference optimization
- Together AI Blog - LLM infrastructure insights
- vLLM Blog Posts - vLLM performance and features
- Hugging Face Blog - Inference - Model serving best practices
ML Monitoring & Observability
- Arize AI Blog - ML observability and monitoring
- Evidently AI Blog - ML model monitoring
- WhyLabs Blog - Data quality and monitoring
- Fiddler AI Blog - ML explainability and monitoring
SLO/SLA & Reliability
- Google SRE Book - SLOs - SLO best practices
- SLO Engineering Guide - SLO documentation
- Error Budgets - Error budget management
- Reliability Engineering Blog - USENIX reliability articles
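
As a quick orientation for the SLO and error-budget material above, here is a minimal sketch of the arithmetic those resources cover. It assumes a simple event-based SLI (good events divided by total events). The function name `error_budget_remaining` and its parameters are hypothetical and not taken from any of the listed sources.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Assumption: the SLI is event-based, so the budget is the number of bad
    events the SLO tolerates over the measurement window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if bad_events < 0:
        raise ValueError("bad_events must not be negative")

    # A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 bad requests.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Example: 250 failed inference requests out of 1,000,000 against a 99.9% SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.1%}")
```

For ML services, "bad" might also include responses that miss a latency threshold or fail a quality check; that choice is left to the SLI definition.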
Cost Optimization
- AWS Cost Optimization for ML - AWS ML cost optimization
- Google Cloud - ML Cost Optimization - GCP cost strategies
- GPU Cost Optimization - Databricks cost optimization
Incident Response & Chaos Engineering
- Chaos Engineering Principles - Chaos engineering manifesto
- Netflix Chaos Engineering - Netflix chaos practices
- Gremlin Blog - Chaos engineering insights
- Incident Response Playbooks - PagerDuty incident response
🔗 Recommended Reading
AI-Specific Observability Tools
- OpenInference Specification - ML observability standard
- OpenTelemetry for ML - Observability for AI systems
- MLflow Tracking - ML experiment tracking
- Weights & Biases - ML experiment tracking and monitoring
- Neptune AI - ML experiment management
SLO/SLA Frameworks for AI
- SLOs for ML Systems (Paper) - Academic research
- Google SRE SLO Guide - Practical SLO implementation
- SLI/SLO/SLA Best Practices - Google's approach
- AI Metrics: Accuracy, Fairness, Latency - Fairness in ML
LLM Performance Metrics
- TTFT (Time To First Token) Optimization - vLLM paper
- TPOT (Time Per Output Token) Metrics - Ray Serve blog
- LLM Latency Benchmarks - LMSYS benchmarks
- Inference Performance Guide - Hugging Face optimization
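
To make the two metrics named above concrete, the following sketch derives TTFT and TPOT from per-token arrival timestamps. It assumes one common convention, in which TPOT is the mean gap between output tokens after the first; individual benchmarks may define it differently. The names `LatencyMetrics` and `llm_latency_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencyMetrics:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token after the first, in seconds


def llm_latency_metrics(request_start_s: float,
                        token_times_s: Sequence[float]) -> LatencyMetrics:
    """Compute TTFT and TPOT from a request start time and token arrival times."""
    if not token_times_s:
        raise ValueError("need at least one output token timestamp")

    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times_s[0] - request_start_s

    # TPOT (assumed convention): decode time averaged over tokens after the first.
    tokens_after_first = len(token_times_s) - 1
    if tokens_after_first > 0:
        tpot = (token_times_s[-1] - token_times_s[0]) / tokens_after_first
    else:
        tpot = 0.0  # a single-token response has no inter-token gap

    return LatencyMetrics(ttft_s=ttft, tpot_s=tpot)


# Example: request sent at t=10.0 s, four tokens streamed back.
metrics = llm_latency_metrics(10.0, [10.5, 10.6, 10.7, 10.8])
print(metrics)
```

In practice the timestamps would come from a streaming client or from server-side tracing, and the values are usually reported as percentiles across many requests rather than per request.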
Model Drift & Continuous Evaluation
- Data Drift Detection - Research paper
- Concept Drift Adaptation - Survey paper
- Continuous ML Monitoring - Production monitoring
- ML Model Validation - Data validation
AI Incident Response
- AI Circuit Breakers - Fault tolerance for ML
- Automated Rollback Strategies - Continuous training
- ML Incident Response Playbooks - SRE incident management
- Chaos Engineering for ML - Testing ML resilience
GPU Scheduling & Resource Management
- Kubernetes GPU Scheduling - K8s GPU docs
- GPU Sharing Strategies - Multi-tenant GPU sharing
- Resource Optimization for ML - Gandiva paper
- GPU Cost Optimization - GKE GPU guide
AI Gateways & Load Balancing
- AI Gateway Architecture - AI gateway patterns
- Load Balancing for ML Inference - Clipper paper
- Caching Strategies for LLMs - vLLM caching
- Traffic Management for AI - Istio for ML
Cost Optimization for AI Infrastructure
- AWS Cost Optimization - AWS strategies
- GCP ML Cost Management - GCP best practices
- Kubernetes Cost Optimization - Kubecost tooling
- GPU Cost Analysis - Databricks insights
Multi-Region & Multi-Cloud AI
- Multi-Region ML Deployment - AWS guide
- Kubernetes Multi-Cluster - K8s federation
- Disaster Recovery for ML - SRE disaster recovery
- Failover Strategies - ML system failover
Security for AI Infrastructure
- ML Security Best Practices - OWASP ML Top 10
- AI Model Security - Adversarial attacks
- Secure ML Deployment - NIST guidelines
- Kubernetes Security - K8s security practices
Chaos Engineering & Resilience Testing
- Chaos Engineering Principles - Core principles
- Chaos Mesh Documentation - Chaos engineering tool
- Litmus Chaos - Cloud-native chaos
- Resilience Testing for ML - ML system resilience
Continuous Integration for ML
- ML CI/CD Best Practices - MLOps principles
- GitHub Actions for ML - CI/CD automation
- ML Testing Strategies - Testing ML systems
- Pre-deployment Validation - ML validation