Project 3: Transfer Learning Summary

Two-Phase Approach

v1: Feature Extraction

93.40%

Frozen backbone, only FC layer trained

51,300 trainable params (0.5%)

LR: 0.001 • 16-17 epochs

→

v2: Fine-Tuning

96.60%

All layers unfrozen, initialized from v1

11.2M trainable params (100%)

LR: 0.0001 (10x smaller) • 8-12 epochs

v1: Trainable parameters: 51,300 / 11.2M (0.5%)

v2: Trainable parameters: 11.2M / 11.2M (100%)
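The 51,300 figure follows from the shape of the replaced FC layer: 512 inputs times 100 outputs, plus one bias per output. A quick arithmetic check, assuming the rounded 11.2M total quoted in this summary (the variable names are illustrative only):

```python
# Parameter count of the replaced FC head: weights + biases.
in_features = 512    # width of the pooled ResNet18 feature vector
num_classes = 100    # one output per sport

head_params = in_features * num_classes + num_classes
total_params = 11_200_000  # assumption: the rounded ~11.2M total from this summary

trainable_pct = 100 * head_params / total_params
print(f"head params: {head_params:,}")           # 51,300
print(f"trainable share: {trainable_pct:.2f}%")  # about 0.46%, reported as 0.5%
```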

Architecture: ResNet18 → SportsClassifier

Input

224×224 RGB

(N, 3, 224, 224)

→

Conv1 + BN

64 filters, k=7, s=2

(N, 64, 112, 112)

→

Layers 1-4

ResNet BasicBlocks

(N, 512, 7, 7)

→

AdaptiveAvgPool

Global average

(N, 512)

→

FC (replaced)

512 → 100

(N, 100)

Frozen in v1: Conv1 + BN and Layers 1-4 | Always trainable: FC (replaced) | v2 unfreezes everything
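The spatial sizes in the diagram can be reproduced with the standard convolution output-size formula. The sketch below assumes the usual torchvision ResNet18 settings, which are not stated in this summary: padding 3 on Conv1, a 3×3 stride-2 max-pool after it (omitted from the diagram), and one stride-2 3×3 convolution at the start of each of layers 2-4.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a conv/pool layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)  # Conv1: 224 -> 112
after_conv1 = size

size = conv_out(size, kernel=3, stride=2, padding=1)  # MaxPool: 112 -> 56
# Layer 1 keeps the resolution; layers 2-4 each halve it.
for _ in range(3):
    size = conv_out(size, kernel=3, stride=2, padding=1)
after_layer4 = size

print(after_conv1, after_layer4)  # 112 7
```

Global average pooling then collapses each 7×7 map to a single value, which gives the (N, 512) vector fed to the FC layer.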

Dataset: 100 Sports Classification

Split	Images	Per Class	Purpose
Train	13,493	~135	Model training
Validation	500	5	Early stopping & LR decisions
Test	500	5	Final evaluation (never seen during training)

Proper 3-way split: validation guides training decisions, test measures true generalization. This is the first project to use a validation set that is separate from the test set.

Training Results

v1 — Feature Extraction

v1: Rapid convergence in first 6 epochs — only 51K params to tune. Most of the "learning" is already in the frozen backbone.

v1: 100×100 confusion matrix. Mostly diagonal even with frozen features, which shows that ImageNet representations transfer well to sports.

v2 — Fine-Tuning

v2: Starting from v1's weights, accuracy jumps immediately once fine-tuning begins. Early stopping prevents the model from overfitting to the training set.

v2: Near-perfect diagonal. Only 17 errors on 500 test images. Remaining confusions are context-based (similar sports).
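The error count and the accuracy figures are two views of the same numbers. A small check, using only values quoted in this summary (the v1 error count is derived here, not reported by the source):

```python
test_images = 500
v2_errors = 17  # from the v2 confusion-matrix caption

v2_accuracy = (test_images - v2_errors) / test_images
v1_errors = round(test_images * (1 - 0.934))  # implied by the 93.40% v1 score

print(f"v2 accuracy: {v2_accuracy:.1%}")  # 96.6%
print(f"v1 errors:   {v1_errors}")        # 33
```

With only 500 test images, each misclassified image moves accuracy by 0.2 percentage points, so small differences between runs should be read with that granularity in mind.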

Training Configuration Comparison

Parameter	v1 (Feature Extraction)	v2 (Fine-Tuning)
Learning Rate	0.001	0.0001 (10x smaller)
Trainable Params	51,300 (FC only)	11.2M (all layers)
Optimizer	Adam	Adam
Starting Weights	ImageNet pretrained	v1 best checkpoint
Early Stopping	patience=5	patience=5
Augmentation	RandomHorizontalFlip + RandomCrop(224, pad=8) + ImageNet normalization	Same as v1

Concepts Study Guide

⇅

Transfer Learning

Reusing a model trained on one task to solve a different but related task

Instead of training a model from random weights, you start with a model already trained on a large dataset (like ImageNet's 1.2M images, 1000 classes). The early layers have already learned universal visual features (edges, textures, shapes) that transfer to almost any image task. You only need to adapt the final layers to your specific problem.

Analogy

Imagine learning to play tennis after years of badminton. You don't start from scratch — your hand-eye coordination, footwork, and court sense all transfer. You just need to adapt to the heavier racket and different ball physics. Transfer learning works the same way: the "visual coordination" learned on ImageNet transfers to sports classification.

When transfer learning works best:

• Your dataset is small (ours: 13K images vs ImageNet's 1.2M)
• Your task is similar to what the pretrained model learned (both are image classification)
• You don't have the compute resources to train a deep network from scratch

From-scratch CNN on CIFAR-10 (10 classes): 81% ceiling
Transfer learning on Sports (100 classes): 93.4-96.6%

More classes, harder task, WAY better results. That's the power of transfer.

Why in this project

In the previous project, our from-scratch CNN topped out at ~81%. Transfer learning breaks through that ceiling by reusing features from an 11M-parameter network pretrained on more than a million images. It is also how most practical image classifiers are built today; training from scratch has become the exception.

❄

Feature Extraction vs Fine-Tuning

Two strategies for using pretrained models: freeze or unfreeze

Feature Extraction (v1): Freeze the entire pretrained backbone. Only train the new classification head (FC layer). The backbone acts as a fixed feature extractor — like plugging in a camera that already knows how to see.

Fine-Tuning (v2): Start from the feature extraction checkpoint, then unfreeze all layers and train everything with a much smaller learning rate. This lets the backbone adapt its features specifically to your task.

Analogy

Feature extraction is like hiring an experienced photographer (pretrained backbone) and only teaching them what your company considers a "good photo" (new FC head). The photographer's skills are fixed.

Fine-tuning is like letting that photographer adjust their techniques specifically for your industry — they subtly change how they frame shots, handle lighting, etc. The foundation stays, but everything gets slightly customized.

Feature Extraction (v1):
• Trainable: 51,300 / 11.2M params (0.5%)
• LR: 0.001 (can be aggressive, only training the head)
• Result: 93.4%

Fine-Tuning (v2):
• Trainable: 11.2M / 11.2M params (100%)
• LR: 0.0001 (10x smaller to protect pretrained weights)
• Result: 96.6% (+3.2 percentage points)

Always do v1 first, then v2. If you unfreeze everything while the new head is still randomly initialized, its large early gradients flow back into the backbone and can damage the pretrained features. Feature extraction gives the head a stable starting point, then fine-tuning refines the whole model gently.
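The two phases differ only in which parameters have requires_grad set and which learning rate the optimizer uses. The sketch below shows that mechanism on a tiny stand-in model. TinyClassifier, count_trainable, and the attribute names are hypothetical and are not the project's SportsClassifier; the real backbone is a pretrained ResNet18, and v2 would first load the best v1 checkpoint.

```python
import torch
from torch import nn, optim

class TinyClassifier(nn.Module):
    """Stand-in model: a small 'backbone' plus a 512 -> 100 head."""
    def __init__(self, num_classes: int = 100):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.backbone(x))

def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = TinyClassifier()

# Phase 1 (v1): freeze the backbone, train only the head at lr=1e-3.
for p in model.backbone.parameters():
    p.requires_grad = False
opt_v1 = optim.Adam(model.fc.parameters(), lr=1e-3)
v1_trainable = count_trainable(model)  # 512 * 100 + 100

# Phase 2 (v2): unfreeze everything and continue at lr=1e-4.
for p in model.parameters():
    p.requires_grad = True
opt_v2 = optim.Adam(model.parameters(), lr=1e-4)
v2_trainable = count_trainable(model)

print(v1_trainable, v2_trainable)
```

Creating a fresh optimizer for v2 is a deliberate choice in this sketch: it starts the fine-tuning phase with clean Adam statistics and the smaller learning rate for every layer.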

Why in this project

We follow the standard 2-phase recipe: v1 establishes 93% with minimal training (only 51K params), then v2 pushes to 96.6% by letting the backbone adapt to sports-specific visual features. This progression demonstrates why both phases exist.

↪

ResNet & Skip Connections

Shortcut connections that let gradients flow directly through deep networks

Deeper networks should be more powerful, but in practice plain networks much deeper than ~20 layers started performing worse, even on the training set. This is the degradation problem: very deep plain stacks are hard to optimize, and the gradient signal can weaken as it passes through many layers. ResNet addresses this with skip (residual) connections: instead of learning output = F(x), each block learns output = F(x) + x. The + x shortcut gives gradients a direct path around each block.

Analogy

Imagine a 20-floor building with only stairs (regular deep network). Getting from floor 1 to 20 is exhausting, and messages (gradients) between floors get garbled. ResNet adds an elevator (skip connection) alongside the stairs. Important signals can take the elevator directly while each floor still does its processing. The result: even 100-floor buildings remain usable.

Regular block: output = F(x) ← if F learns nothing useful, the input signal is lost
Residual block: output = F(x) + x ← if F learns nothing, output = x (identity)

The "worst case" for a residual block is passing input through unchanged.
This makes it safe to stack many layers — useless layers do no harm.
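The identity fallback can be seen with plain lists standing in for feature vectors. This is a minimal sketch with hypothetical helper names. A real BasicBlock also applies a ReLU after the addition and uses convolutions for F, both of which are left out here.

```python
from typing import Callable, List

Vector = List[float]

def plain_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Plain block: the output is whatever F produces."""
    return f(x)

def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Residual block: F's output is added to the unchanged input."""
    return [fx + xi for fx, xi in zip(f(x), x)]

def dead_layer(x: Vector) -> Vector:
    """A layer that has learned nothing useful: it outputs all zeros."""
    return [0.0 for _ in x]

x = [1.0, -2.0, 3.0]
print(plain_block(x, dead_layer))     # [0.0, 0.0, 0.0]   (signal lost)
print(residual_block(x, dead_layer))  # [1.0, -2.0, 3.0]  (identity)
```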

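ResNet-18 has 18 weighted layers: an initial conv layer, 4 groups of 2 BasicBlocks (each block has 2 conv layers and a skip connection, 16 conv layers in total), and a final FC layer. The ResNet family won the ILSVRC 2015 classification challenge using much deeper variants, and ResNet-18 remains a standard baseline.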

Why in this project

We use ResNet18 as our pretrained backbone. Its skip connections are what keep an 18-layer network easy to train, far deeper than our 3-layer CIFAR-10 CNN. The architecture is the backbone; transfer learning is the strategy.

⚠

Catastrophic Forgetting

When fine-tuning destroys the pretrained knowledge instead of adapting it

If you fine-tune a pretrained model with a learning rate that's too large, the gradient updates overwhelm the carefully learned weights. The model "forgets" the useful features from ImageNet and effectively becomes a randomly initialized network. The pretrained knowledge is lost.

Analogy

Imagine an experienced surgeon retraining to specialize in a new procedure. With gentle, focused practice (small LR), they adapt their existing skills. But if you threw them into a boot camp that completely overrides everything they know (large LR), they'd lose their foundational skills and perform worse than a new student. That's catastrophic forgetting.

v1 (feature extraction) LR: 0.001 ← only FC head, can be aggressive
v2 (fine-tuning) LR: 0.0001 ← 10x smaller to protect backbone

Rule of thumb: fine-tuning LR should be 3-10x smaller than training LR
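Why a smaller learning rate protects the backbone can be seen from a single update step: the distance the weights move away from their pretrained values scales directly with the learning rate. The toy below uses plain gradient descent with made-up weights and gradients. The project itself uses Adam, so this is a simplification that only illustrates the scaling.

```python
import math

def sgd_step(weights, grads, lr):
    """One plain gradient-descent update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]

def drift(before, after):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))

pretrained = [0.50, -0.20, 0.10]  # made-up 'pretrained' weights
grads = [3.0, -4.0, 0.0]          # made-up gradient with norm 5

drift_v1_lr = drift(pretrained, sgd_step(pretrained, grads, lr=1e-3))
drift_v2_lr = drift(pretrained, sgd_step(pretrained, grads, lr=1e-4))

print(f"{drift_v1_lr:.4f} {drift_v2_lr:.4f}")  # 0.0050 0.0005
```

For the same gradient, the v2 learning rate moves the weights one tenth as far per step, which is the sense in which it adapts the backbone "gently".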

Why in this project

We used lr=0.0001 for fine-tuning (v2) vs lr=0.001 for feature extraction (v1). This 10x reduction ensures the backbone's ImageNet features are gently adapted, not destroyed. It's one of the most important practical details in transfer learning.

✂

Train / Validation / Test Split

Three separate datasets, each with a distinct purpose, to prevent data leakage

Training set: Used to update model weights (the model learns from this).
Validation set: Used to make decisions during training — early stopping, LR scheduling, hyperparameter tuning. The model never trains on this data, but your decisions are influenced by it.
Test set: Touched ONLY once, at the very end, to get the final unbiased performance estimate. No decisions are made based on test results.

Analogy

Training set = homework problems (you learn from these).
Validation set = practice exams (you use these to decide if you're ready, adjust study strategy).
Test set = the final exam (seen exactly once, determines your actual grade).

If you peek at the final exam while studying, your grade is no longer a fair measure of what you know. That's data leakage.

Previous projects (MNIST, CIFAR-10): only train/test split
This project: train (13,493) / val (500) / test (500)

Validation guides: early stopping patience, LR scheduling
Test measures: true generalization (96.6% final answer)
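The role of the validation set in early stopping can be sketched as a small helper with patience=5, as in the configuration table. The EarlyStopping class, the choice to monitor validation accuracy rather than validation loss, and the accuracy history are all assumptions made for illustration; the project's actual loop is not shown in this summary.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_acc: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best:
            self.best, self.best_epoch = val_acc, epoch
            self.bad_epochs = 0  # a real loop would save a checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation accuracies, for illustration only.
history = [0.80, 0.88, 0.91, 0.93, 0.92, 0.93, 0.92, 0.91, 0.92]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.step(epoch, acc):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)  # 3 8
```

The test set never appears in this loop. It is evaluated once, after the best checkpoint has been chosen from validation results.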

Why in this project

This is the first project with a proper 3-way split. In previous projects, the test set influenced early stopping decisions (a mild form of data leakage). Here, validation and test are separate — the industry standard for any serious model evaluation.

μ

ImageNet Normalization

Using the exact mean/std from ImageNet so pretrained features work correctly

Pretrained models expect inputs normalized with ImageNet's statistics: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. With different normalization, pixel values land in ranges the backbone never saw during pretraining, so its features become much less reliable.

Analogy

Imagine a translator trained on formal English. If you feed them text in internet slang, they'll produce nonsense — not because they're bad at translating, but because the input format doesn't match what they learned. ImageNet normalization ensures your images "speak the same language" as the training data.

from torchvision import transforms

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],  # ImageNet RGB means
    std=[0.229, 0.224, 0.225],   # ImageNet RGB stds
)

Applied to ALL splits: train, val, and test (this is preprocessing, not augmentation)
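What the normalization does to a single pixel can be worked out by hand: subtract the channel mean, then divide by the channel standard deviation. A dependency-free sketch, assuming pixel values have already been scaled to [0, 1] (the normalize_pixel helper is hypothetical):

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already in [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

mean_pixel = normalize_pixel(IMAGENET_MEAN)     # the 'average' color maps to zero
white_pixel = normalize_pixel((1.0, 1.0, 1.0))

print(mean_pixel)                                # (0.0, 0.0, 0.0)
print(tuple(round(v, 2) for v in white_pixel))   # (2.25, 2.43, 2.64)
```

After normalization, inputs are roughly centered on zero with values of a few units in either direction, which is the range the pretrained weights were fitted to.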

Why in this project

Our ResNet18 was pretrained on ImageNet. Every pixel value must be normalized with ImageNet's statistics before feeding into the model. Forgetting this raises no error; the model simply predicts worse, which makes it one of the most common transfer learning bugs.

Lessons Learned

0.5% of parameters can get you 93%

Feature extraction trains only the final FC layer (51K params out of 11.2M). The pretrained backbone already knows edges, textures, shapes — you just teach it what those features mean for your task.

Fine-tuning LR should be ~10x smaller

Fine-tuning with the head's learning rate (0.001) risks overwriting the pretrained features (catastrophic forgetting). Using 0.0001 lets the model adapt gently while preserving what it already knows.

Only pass trainable parameters to the optimizer

In v1: optim.Adam(model.resnet.fc.parameters(), lr=0.001), not model.parameters(). Passing only the head makes it explicit which weights are being trained and keeps the frozen backbone out of the optimizer entirely.

ImageNet normalization is mandatory for pretrained models

The backbone was trained with specific mean/std values. Using different normalization feeds it inputs in an unfamiliar range, so its features become unreliable, like speaking a different language to the model.

3-way splits prevent data leakage

Previous projects used only train/test. Here, validation guides early stopping and hyperparameter choices, while test remains completely untouched until final evaluation. This is the industry standard.

Remaining confusions are semantic, not visual

Errors like "sidecar racing vs motorcycle racing" or "cheerleading vs football" happen because the sports share visual context (same stadium, similar equipment). These need domain knowledge, not better features.

The Journey So Far

Project	Task	Accuracy	Key Insight
1. MNIST CNN	10 digits	99.35%	Learned the fundamentals
2. CIFAR-10 CNN	10 objects	81.04%	Hit the scratch ceiling
3. Transfer Learning	100 sports	96.60%	Pretrained features shatter the ceiling

The narrative: harder tasks + better techniques = better results. Each project motivates the next.