Chapter 14: Transfer Learning¶
🚀 Learning Objectives
- Understand transfer learning concepts
- Implement feature extraction and fine-tuning
- Choose the right pre-trained models
- Optimize transfer learning workflows
Transfer learning leverages knowledge from pre-trained models to solve new tasks with less data and training time.
When to Use Transfer Learning
Use transfer learning when: you have limited data (< 10k samples), your task is similar to ImageNet (classification), or you need quick results. For very different tasks or large datasets, training from scratch might be better.
Feature Extraction vs Fine-tuning
- Feature Extraction: Freeze backbone, train only classifier. Faster, less memory, good for very small datasets.
- Fine-tuning: Update all/some layers. Slower, more memory, better for larger datasets or different domains.
Why Transfer Learning?¶
Benefits: - ✅ Faster training (hours vs days) - ✅ Better performance with limited data - ✅ Leverage pre-trained features - ✅ Reduced computational cost
When to use: - Small datasets - Similar task/domain - Limited computational resources - Quick prototyping
Two Main Approaches¶
1. Feature Extraction (Frozen Backbone)¶
Idea: Use pre-trained model as fixed feature extractor
import torch
import torch.nn as nn
import torchvision.models as models
# Load pretrained model
model = models.resnet18(pretrained=True)
# Freeze all parameters
for param in model.parameters():
param.requires_grad = False
# Replace final layer
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10) # 10 classes
# Only final layer will be trained
print("Trainable parameters:")
for name, param in model.named_parameters():
if param.requires_grad:
print(f" {name}")
2. Fine-Tuning (Update All/Some Layers)¶
Idea: Update pre-trained weights with small learning rate
import torch.optim as optim
# Load pretrained model
model = models.resnet18(pretrained=True)
# Replace final layer
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)
# Different learning rates for different parts
optimizer = optim.SGD([
{'params': model.fc.parameters(), 'lr': 1e-3}, # New layer
{'params': model.layer4.parameters(), 'lr': 1e-4}, # Last block
{'params': model.layer3.parameters(), 'lr': 1e-5}, # Earlier layers
], momentum=0.9)
Complete Transfer Learning Example¶
Feature Extraction¶
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader
# 1. Data preparation
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
train_dataset = datasets.ImageFolder('data/train', transform=transform)
val_dataset = datasets.ImageFolder('data/val', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
# 2. Load pretrained model
model = models.resnet50(pretrained=True)
# 3. Freeze backbone
for param in model.parameters():
param.requires_grad = False
# 4. Replace classifier
num_classes = len(train_dataset.classes)
model.fc = nn.Sequential(
nn.Dropout(0.5),
nn.Linear(model.fc.in_features, num_classes)
)
# 5. Setup training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
# 6. Train
def train_epoch(model, loader, criterion, optimizer, device):
model.train()
total_loss = 0
correct = 0
total = 0
for data, target in loader:
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
total_loss += loss.item()
_, predicted = output.max(1)
total += target.size(0)
correct += predicted.eq(target).sum().item()
return total_loss / len(loader), 100. * correct / total
# Train for a few epochs
for epoch in range(10):
train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
print(f'Epoch {epoch+1}: Loss={train_loss:.4f}, Acc={train_acc:.2f}%')
Progressive Fine-Tuning¶
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
def freeze_layers(model, freeze_until='layer4'):
"""Freeze layers up to specified layer"""
freeze = True
for name, child in model.named_children():
if name == freeze_until:
freeze = False
for param in child.parameters():
param.requires_grad = not freeze
# Load model
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)
# Stage 1: Train only classifier
print("Stage 1: Train classifier only")
freeze_layers(model, freeze_until='fc')
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
# Train for 5 epochs...
# Stage 2: Fine-tune layer4
print("Stage 2: Fine-tune layer4")
freeze_layers(model, freeze_until='layer4')
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
# Train for 5 epochs...
# Stage 3: Fine-tune all layers
print("Stage 3: Fine-tune all layers")
for param in model.parameters():
param.requires_grad = True
optimizer = optim.Adam(model.parameters(), lr=1e-5)
# Train for 5 epochs...
Working with Different Architectures¶
ResNet¶
import torchvision.models as models
# ResNet family
resnet18 = models.resnet18(pretrained=True)
resnet34 = models.resnet34(pretrained=True)
resnet50 = models.resnet50(pretrained=True)
resnet101 = models.resnet101(pretrained=True)
# Modify for custom classes
num_classes = 10
resnet18.fc = nn.Linear(resnet18.fc.in_features, num_classes)
VGG¶
# VGG family
vgg16 = models.vgg16(pretrained=True)
vgg19 = models.vgg19(pretrained=True)
# Modify classifier
num_classes = 10
vgg16.classifier[6] = nn.Linear(4096, num_classes)
EfficientNet¶
# EfficientNet family
efficientnet_b0 = models.efficientnet_b0(pretrained=True)
efficientnet_b7 = models.efficientnet_b7(pretrained=True)
# Modify classifier
num_classes = 10
efficientnet_b0.classifier[1] = nn.Linear(
efficientnet_b0.classifier[1].in_features,
num_classes
)
Vision Transformer (ViT)¶
# Vision Transformer
vit = models.vit_b_16(pretrained=True)
# Modify head
num_classes = 10
vit.heads = nn.Linear(vit.heads.head.in_features, num_classes)
Custom Classifier Head¶
Simple Head¶
class SimpleHead(nn.Module):
def __init__(self, in_features, num_classes):
super().__init__()
self.fc = nn.Linear(in_features, num_classes)
def forward(self, x):
return self.fc(x)
model = models.resnet50(pretrained=True)
model.fc = SimpleHead(model.fc.in_features, 10)
Advanced Head¶
class AdvancedHead(nn.Module):
def __init__(self, in_features, num_classes, hidden_dim=512):
super().__init__()
self.head = nn.Sequential(
nn.Linear(in_features, hidden_dim),
nn.ReLU(inplace=True),
nn.Dropout(0.5),
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ReLU(inplace=True),
nn.Dropout(0.5),
nn.Linear(hidden_dim // 2, num_classes)
)
def forward(self, x):
return self.head(x)
model = models.resnet50(pretrained=True)
model.fc = AdvancedHead(model.fc.in_features, 10)
Feature Extraction for Embeddings¶
Extract Features¶
import torch
import torchvision.models as models
class FeatureExtractor(nn.Module):
def __init__(self, model_name='resnet50'):
super().__init__()
# Load pretrained model
if model_name == 'resnet50':
model = models.resnet50(pretrained=True)
# Remove final FC layer
self.features = nn.Sequential(*list(model.children())[:-1])
# Freeze
for param in self.features.parameters():
param.requires_grad = False
def forward(self, x):
x = self.features(x)
x = x.view(x.size(0), -1)
return x
# Usage
extractor = FeatureExtractor()
extractor.eval()
with torch.no_grad():
features = extractor(images)
print(f"Features shape: {features.shape}") # [batch, 2048]
Similarity Search¶
import torch
import torch.nn.functional as F
def extract_embeddings(model, dataloader, device):
"""Extract embeddings for all images"""
model.eval()
embeddings = []
labels = []
with torch.no_grad():
for data, target in dataloader:
data = data.to(device)
embedding = model(data)
embeddings.append(embedding.cpu())
labels.append(target)
return torch.cat(embeddings), torch.cat(labels)
def find_similar(query_embedding, embeddings, k=5):
"""Find k most similar embeddings"""
# Cosine similarity
similarities = F.cosine_similarity(
query_embedding.unsqueeze(0),
embeddings
)
# Get top k
top_k = similarities.topk(k)
return top_k.indices, top_k.values
# Extract embeddings
extractor = FeatureExtractor().to(device)
embeddings, labels = extract_embeddings(extractor, dataloader, device)
# Find similar images
query_emb = embeddings[0]
indices, scores = find_similar(query_emb, embeddings, k=5)
print(f"Most similar: {indices}")
print(f"Scores: {scores}")
Multi-Task Learning¶
Multiple Heads¶
class MultiTaskModel(nn.Module):
def __init__(self, num_classes_task1, num_classes_task2):
super().__init__()
# Shared backbone
backbone = models.resnet50(pretrained=True)
self.features = nn.Sequential(*list(backbone.children())[:-1])
# Task-specific heads
in_features = 2048
self.head_task1 = nn.Linear(in_features, num_classes_task1)
self.head_task2 = nn.Linear(in_features, num_classes_task2)
def forward(self, x):
# Shared features
features = self.features(x)
features = features.view(features.size(0), -1)
# Task outputs
out1 = self.head_task1(features)
out2 = self.head_task2(features)
return out1, out2
# Training
model = MultiTaskModel(num_classes_task1=10, num_classes_task2=5)
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.CrossEntropyLoss()
for data, (target1, target2) in dataloader:
out1, out2 = model(data)
loss1 = criterion1(out1, target1)
loss2 = criterion2(out2, target2)
# Combined loss
loss = loss1 + 0.5 * loss2 # Weight tasks differently
loss.backward()
optimizer.step()
Domain Adaptation¶
Fine-tune on New Domain¶
def adapt_to_new_domain(model, source_loader, target_loader, epochs=10):
"""Adapt model from source to target domain"""
# Freeze early layers
for name, param in model.named_parameters():
if 'layer1' in name or 'layer2' in name:
param.requires_grad = False
optimizer = optim.Adam(
filter(lambda p: p.requires_grad, model.parameters()),
lr=1e-4
)
for epoch in range(epochs):
model.train()
# Train on target domain
for data, target in target_loader:
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
return model
Best Practices¶
1. Learning Rate Selection¶
def get_optimizer(model, base_lr=1e-3):
"""Different LR for different parts"""
# Identify layer groups
backbone_params = []
head_params = []
for name, param in model.named_parameters():
if 'fc' in name or 'classifier' in name:
head_params.append(param)
else:
backbone_params.append(param)
optimizer = optim.Adam([
{'params': backbone_params, 'lr': base_lr * 0.1}, # 10x smaller
{'params': head_params, 'lr': base_lr}
])
return optimizer
2. Gradual Unfreezing¶
def unfreeze_gradually(model, epoch, unfreeze_schedule):
"""Unfreeze layers according to schedule"""
for layer_name, unfreeze_epoch in unfreeze_schedule.items():
if epoch >= unfreeze_epoch:
for name, param in model.named_parameters():
if layer_name in name:
param.requires_grad = True
# Usage
schedule = {
'layer4': 5,
'layer3': 10,
'layer2': 15,
'layer1': 20
}
for epoch in range(25):
unfreeze_gradually(model, epoch, schedule)
train_epoch(...)
3. Data Augmentation¶
from torchvision import transforms
# Stronger augmentation for fine-tuning
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
Decision Guide¶
When to use Feature Extraction:¶
- Very small dataset (<1000 samples)
- Limited compute resources
- Very similar task to pre-trained model
- Quick prototyping
When to use Fine-Tuning:¶
- Medium dataset (1000-100k samples)
- Sufficient compute resources
- Somewhat different task
- Want best performance
Training from Scratch:¶
- Very large dataset (>100k samples)
- Very different domain
- Unique architecture needed
- Lots of compute resources
Next Steps¶
Continue to Chapter 15: Model Saving & Loading to learn about: - Saving checkpoints - Loading models - Model versioning - Deployment preparation
Key Takeaways¶
- ✅ Transfer learning saves time and improves performance
- ✅ Feature extraction: freeze backbone, train new head
- ✅ Fine-tuning: use smaller learning rate for pre-trained layers
- ✅ Progressive unfreezing often works better
- ✅ Use different learning rates for different parts
- ✅ Always normalize inputs correctly for pre-trained models
Recommended Reads¶
📚 Official Documentation
- Transfer Learning Tutorial - Complete transfer learning guide
- torchvision.models - Pre-trained models
- Model Zoo - Available models
- Fine-Tuning Guide - Fine-tuning strategies
📖 Essential Articles
- Transfer Learning Explained - Transfer learning concepts
- Feature Extraction vs Fine-Tuning - When to use each
- Progressive Unfreezing - Advanced techniques
- Domain Adaptation - Adapting to new domains
🎓 Learning Resources
- Computer Vision Transfer Learning - CV examples
- NLP Transfer Learning - Hugging Face transformers
- Transfer Learning Best Practices - Optimization tips
💡 Best Practices
- Learning Rate Selection - Choosing LRs for fine-tuning
- Freezing Layers - When to freeze
- Data Augmentation - Augmentation for transfer learning
🔬 Research Papers
- How transferable are features in deep neural networks? - Transfer learning analysis
- Rethinking ImageNet Pre-training - Pre-training effectiveness
- BERT: Pre-training of Deep Bidirectional Transformers - NLP transfer learning