Project 2: CIFAR-10 CNN Summary

Version Progression

Version	Change	Test accuracy	Epochs	Train/test gap	Notes
v1 Baseline	Plain Conv + ReLU + Pool	77.31%	20	~15%	Train ~92%, Test 77%
v2 +BatchNorm	Conv → BN → ReLU → Pool	79.48%	17 (early stop)	~8%	+2.17% ↑, overfitting halved
v3 +LR Scheduler	ReduceLROnPlateau (factor=0.5, patience=3)	81.04%	32 (early stop)	~16%	+1.56% ↑, diminishing returns

Architecture (v2/v3 with BatchNorm)

Stage	Details	Output shape
Input	Color image	(N, 3, 32, 32)
Conv1+BN+ReLU+Pool	32 filters, k=3	(N, 32, 16, 16)
Conv2+BN+ReLU+Pool	64 filters, k=3	(N, 64, 8, 8)
Conv3+BN+ReLU+Pool	128 filters, k=3	(N, 128, 4, 4)
FC + Dropout	2048 → 256	(N, 256)
Output	256 → 10	(N, 10)

The BatchNorm layers were added in v2; they normalize activations per channel across each mini-batch.
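The layer stack and shapes listed above can be sketched as a PyTorch module. This is a minimal sketch, not the project's actual code: the class name SmallCifarCNN, padding=1, max pooling, the ReLU after the first FC layer, and the dropout rate of 0.5 are all assumptions chosen so that the shapes match the ones in this section.

```python
import torch
from torch import nn


class SmallCifarCNN(nn.Module):
    """Three Conv-BN-ReLU-Pool blocks followed by two fully connected layers."""

    def __init__(self, num_classes: int = 10, dropout_p: float = 0.5):
        super().__init__()

        def block(in_ch: int, out_ch: int) -> nn.Sequential:
            # padding=1 keeps H and W unchanged for a 3x3 kernel (assumed);
            # MaxPool2d(2) then halves them (pooling type assumed).
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # (N, 128, 4, 4) -> (N, 2048)
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),        # assumed activation
            nn.Dropout(dropout_p),        # assumed rate
            nn.Linear(256, num_classes),  # raw logits, one per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


model = SmallCifarCNN()
x = torch.randn(4, 3, 32, 32)       # a fake batch of 4 images
print(model.features(x).shape)      # torch.Size([4, 128, 4, 4])
print(model(x).shape)               # torch.Size([4, 10])
```

Each pooling step halves the spatial size (32 → 16 → 8 → 4), which is where the 128 × 4 × 4 = 2048 input size of the first FC layer comes from.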

v3 Learning Rate Schedule

Epoch Range	Learning Rate	Effect
1 – 19	0.001000	Initial training phase
20 – 25	0.000500	1st reduction — +0.48% accuracy gain
26 – 30	0.000250	2nd reduction — +0.28% gain
31 – 32	0.000125	3rd reduction — no further gain, early stop

Each halving yields less improvement, which suggests that the architecture itself, not the optimizer, is the bottleneck.
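To see how a patience of 3 and a factor of 0.5 turn a stalled metric into a stepwise schedule like this one, here is a simplified pure-Python version of the reduce-on-plateau rule. It is a sketch for intuition only: plateau_schedule is a hypothetical helper, it ignores PyTorch's improvement threshold and cooldown options, and the accuracy sequence is made up rather than taken from this project.

```python
def plateau_schedule(metrics, lr=0.001, factor=0.5, patience=3):
    """Return the LR in effect for each epoch under a simplified
    reduce-on-plateau rule (higher metric is better)."""
    best = float("-inf")
    bad_epochs = 0
    lrs = []
    for value in metrics:
        lrs.append(lr)             # LR used during this epoch
        if value > best:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:  # more than `patience` epochs without a new best
            lr *= factor           # takes effect from the next epoch
            bad_epochs = 0
    return lrs


# Made-up accuracies: improvement, then two plateaus.
toy_accuracies = [70, 75, 78, 78, 77, 78, 77, 79, 79, 79, 79, 79]
lrs = plateau_schedule(toy_accuracies)
for epoch, (acc, lr) in enumerate(zip(toy_accuracies, lrs), start=1):
    print(f"epoch {epoch:2d}  acc {acc}  lr {lr:.6f}")
```

In this toy run the LR stays at 0.001 for epochs 1 to 7 and drops to 0.0005 from epoch 8, because epoch 7 is the fourth epoch in a row without a new best.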

Training Results

v1 — Baseline

v1: Large overfitting gap (~15%) — train accuracy keeps climbing while test stagnates.

v1: Cat/dog confusion dominates. Many classes above 5% error rate.

v2 — BatchNorm

v2: Overfitting gap halved (~8%). BN stabilizes training, and early stopping triggers at epoch 17.

v2: Slightly improved confusion matrix, but same problem pairs persist.

v3 — LR Scheduler

v3: LR reductions create visible "steps" in the curve. Longer training but overfitting returns (~16%).

v3: Best confusion matrix, but cat/dog remain fundamentally hard at 32×32.

Hardest Class Pairs (Consistent Across All Versions)

cat ↔ dog

Similar body shapes & textures

bird ↔ deer

Overlapping backgrounds

truck ↔ automobile

Same object category

These confusions are largely inherent to 32×32 resolution, where even humans struggle. Higher-resolution inputs and pretrained features help with this.
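Pairs like these can be read off a confusion matrix by adding the errors in both directions for every pair of classes. The sketch below shows one way to do that. The helper most_confused_pairs is hypothetical, and the 4-class matrix uses made-up counts purely for illustration; they are not this project's results.

```python
def most_confused_pairs(confusion, class_names, top_k=3):
    """Rank unordered class pairs by total errors (i predicted as j, plus j as i)."""
    pairs = []
    n = len(class_names)
    for i in range(n):
        for j in range(i + 1, n):
            errors = confusion[i][j] + confusion[j][i]
            pairs.append((errors, class_names[i], class_names[j]))
    pairs.sort(reverse=True)
    return [(a, b, errors) for errors, a, b in pairs[:top_k]]


# Toy matrix with invented counts: rows = true class, columns = predicted class.
names = ["cat", "dog", "bird", "deer"]
toy_confusion = [
    [60, 25, 8, 7],
    [22, 65, 6, 7],
    [9, 5, 70, 16],
    [6, 6, 14, 74],
]
top_pairs = most_confused_pairs(toy_confusion, names, top_k=2)
print(top_pairs)  # [('cat', 'dog', 47), ('bird', 'deer', 30)]
```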

Concepts Study Guide

Batch Normalization (BatchNorm)

Normalizes each layer's inputs to stabilize and accelerate training

BatchNorm normalizes the outputs of a layer across the current mini-batch to have mean=0 and variance=1, separately for each channel. It then applies two learnable parameters (γ and β) that let the network rescale or even undo the normalization if it's not helpful. The original motivation was to reduce internal covariate shift: the problem where each layer's input distribution keeps changing as earlier layers update.

Analogy

Imagine a relay race where each runner expects the baton at a certain height. If runner 1 randomly changes their handoff height each lap, runner 2 wastes energy adapting. BatchNorm forces every handoff to happen at a standardized height, so each runner (layer) can focus on their own technique instead of constantly adjusting to unpredictable inputs.

For each channel c:
μ_c = mean of all activations in channel c across the batch and all spatial positions
σ_c² = variance of those same activations
x_norm = (x - μ_c) / √(σ_c² + ε)
output = γ_c * x_norm + β_c (γ_c, β_c are learnable, one pair per channel)

Training vs Evaluation: During training, BatchNorm uses the statistics of the current batch. During evaluation, it uses running averages of those batch statistics, accumulated during training. This is another reason model.eval() matters.
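The formula and the train/eval difference can be checked directly against nn.BatchNorm2d. This is a small sketch with random data: the tensor shape and the scale and offset applied to the input are arbitrary choices, and a freshly created layer is used so that γ=1 and β=0.

```python
import torch
from torch import nn

torch.manual_seed(0)
# (N, C, H, W) input that is deliberately not zero-mean / unit-variance.
x = torch.randn(8, 3, 4, 4) * 3.0 + 2.0

bn = nn.BatchNorm2d(3)  # gamma starts at 1, beta at 0, eps defaults to 1e-5

# Manual version of the formula: per-channel statistics over batch and spatial dims.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

bn.train()
train_out = bn(x)  # normalizes with batch statistics, updates running averages
print(torch.allclose(train_out, manual, atol=1e-5))  # True

bn.eval()
eval_out = bn(x)   # normalizes with the stored running averages instead
print(torch.allclose(eval_out, manual, atol=1e-5))   # False
```

After a single training step the running averages are still close to their initial values (mean 0, variance 1), so the eval-mode output differs clearly from the batch-normalized one. Forgetting model.eval() at test time therefore changes predictions.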

Why in this project

Adding BatchNorm in v2 (Conv→BN→ReLU→Pool) raised test accuracy from 77.31% to 79.48% and roughly halved the overfitting gap (15%→8%). It was the most impactful single change in this project (+2.17% versus +1.56% for the scheduler), and in conv layers it is commonly preferred over Dropout.

Learning Rate Scheduling (ReduceLROnPlateau)

Automatically shrinks the learning rate when progress stalls

A fixed learning rate is a compromise: too large and training overshoots minima; too small and training is slow. ReduceLROnPlateau starts with a larger LR for fast initial progress, then automatically reduces it by a factor when the monitored metric stops improving for patience epochs.

Analogy

Imagine pouring water into a cup. At first, you pour quickly to fill it fast (large LR). As it nears the top, you slow the pour to avoid spilling (smaller LR). ReduceLROnPlateau is like a sensor that says "the cup hasn't gotten fuller in 3 seconds, slow down the pour by half."

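from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(
  optimizer, # the optimizer created earlier (initial LR 0.001)
  mode='max', # monitoring accuracy (higher is better)
  factor=0.5, # multiply LR by 0.5 when triggered
  patience=3, # tolerate 3 epochs of no improvement before reducing
)
# Call once per epoch after evaluation: scheduler.step(accuracy)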

0.001 → 0.0005 → 0.00025 → 0.000125 (each halving)

Key insight: The scheduler doesn't just "optimize better" — it rescues models from premature early stopping. Without it, v2 stopped at epoch 17. With it, v3 trained to epoch 32 because each LR reduction opened a new valley to explore.
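How the scheduler and early stopping interact depends on their two patience values. The sketch below drives a real ReduceLROnPlateau with a made-up accuracy sequence and a simple early-stopping counter. The Adam optimizer, the early-stopping patience of 6, and the accuracy values are all assumptions for illustration; this summary does not state the project's actual settings.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Stand-in parameter and optimizer so the scheduler has something to manage.
# Adam is an assumption; the summary does not name the optimizer.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=0.001)
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=3)

EARLY_STOP_PATIENCE = 6  # hypothetical; larger than the scheduler's patience

# Made-up accuracies for illustration only, not results from this project.
val_accuracies = [70.0, 75.0, 78.0, 78.0, 77.5, 77.8, 77.9,
                  78.6, 78.5, 78.4, 78.6, 78.5, 78.3, 78.4]

best_acc = float("-inf")
epochs_since_best = 0
stopped_at = None
for epoch, val_acc in enumerate(val_accuracies, start=1):
    # (train for one epoch and evaluate here)
    scheduler.step(val_acc)  # may lower the LR used from the next epoch on
    if val_acc > best_acc:
        best_acc, epochs_since_best = val_acc, 0
    else:
        epochs_since_best += 1
    lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {epoch:2d}  acc {val_acc:.1f}  next lr {lr:.6f}")
    if epochs_since_best >= EARLY_STOP_PATIENCE:
        stopped_at = epoch
        break
```

In this toy run the first reduction fires at epoch 7, a new best follows at epoch 8, and training stops at epoch 14. With EARLY_STOP_PATIENCE set to 4 instead, the loop would stop at epoch 7, the same epoch the first reduction fires, so the lower LR would never get a chance to help.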

Why in this project

v3 used ReduceLROnPlateau to squeeze +1.56% more accuracy beyond BatchNorm. But the diminishing returns per reduction (0.48%→0.28%→0%) suggest that the architecture itself, not the optimizer settings, was the bottleneck.

Overfitting vs Underfitting

The central tradeoff in machine learning: memorization vs generalization

Overfitting: The model memorizes training data patterns (including noise) but fails on new data. Symptom: high train accuracy, much lower test accuracy.
Underfitting: The model is too simple to capture the real patterns. Symptom: both train and test accuracy are low.

Analogy

Overfitting is like a student who memorizes every practice exam word-for-word but can't answer rephrased questions. Underfitting is like a student who only skimmed the textbook — they can't answer any questions well. The sweet spot is understanding the concepts behind the examples.

v1: Train 92% / Test 77% → 15% gap = significant overfitting
v2: Train 87% / Test 79% → 8% gap = moderate (BN helped)
v3: Train 97% / Test 81% → 16% gap = overfitting returned

But v3 has the HIGHEST test accuracy — overfitting gap alone
doesn't tell you which model is best. The absolute test metric does.
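The comparison above is simple arithmetic, sketched here with the rounded accuracies from this summary. The dictionary layout and variable names are just one way to organize it.

```python
# Rounded train/test accuracies (%) as reported in this summary.
results = {
    "v1": {"train": 92, "test": 77},
    "v2": {"train": 87, "test": 79},
    "v3": {"train": 97, "test": 81},
}

# Overfitting gap = train accuracy minus test accuracy, in percentage points.
gaps = {name: r["train"] - r["test"] for name, r in results.items()}
smallest_gap = min(gaps, key=gaps.get)
best_test = max(results, key=lambda name: results[name]["test"])

for name in results:
    print(f"{name}: gap = {gaps[name]} points, test = {results[name]['test']}%")
print("smallest gap:", smallest_gap)         # v2
print("highest test accuracy:", best_test)   # v3
```

The two criteria pick different models, which is the point: the gap diagnoses overfitting, while the test metric decides which model to keep.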

Tools to fight overfitting: Dropout, BatchNorm, data augmentation, early stopping, weight decay, more training data, simpler model.

Why in this project

The v1→v2→v3 progression illustrates the overfitting/accuracy tradeoff. v2's BatchNorm reduced the gap and v3's scheduler increased it again, yet v3 was still the best model by test accuracy. The lesson: watch the absolute test metric, not just the gap.

Data Augmentation

Artificially expanding the training set by applying random transformations

Data augmentation applies random transformations (flips, crops, rotations, color jitter) to training images on-the-fly. Each epoch, the model sees slightly different versions of the same images. This acts as a regularizer — the model can't memorize specific pixel patterns.

Analogy

Imagine learning to recognize your friend's face. You've only seen 100 photos. Data augmentation is like seeing those same photos in different lighting, from slightly different angles, sometimes flipped — suddenly your 100 photos feel like 1,000, and you recognize them in any condition.

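from torchvision import transforms

# Commonly cited CIFAR-10 channel statistics (assumed; the project's values are not shown)
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
  transforms.RandomHorizontalFlip(), # 50% chance to mirror
  transforms.RandomCrop(32, padding=4), # pad 4px, then crop 32×32: shifts up to 4px
  transforms.ToTensor(), # PIL image → float tensor in [0, 1]
  transforms.Normalize(mean, std), # per-channel standardization
])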

Note: augmentation is ONLY applied to training data, never test/val
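To make the two training-time transforms concrete, here is a hand-written equivalent using plain tensor operations. It is a sketch of what a random horizontal flip and a padded random crop do to one image, not torchvision's implementation; random_flip_and_crop is a hypothetical helper, and the input is random noise standing in for a CIFAR-10 image.

```python
import torch
import torch.nn.functional as F


def random_flip_and_crop(img: torch.Tensor, padding: int = 4) -> torch.Tensor:
    """Random horizontal flip plus padded random crop for one (C, H, W) image."""
    _, h, w = img.shape
    if torch.rand(1).item() < 0.5:
        img = torch.flip(img, dims=[2])  # mirror along the width axis
    # Zero-pad all four sides, then cut an H x W window at a random offset.
    padded = F.pad(img, (padding, padding, padding, padding))
    top = torch.randint(0, 2 * padding + 1, (1,)).item()
    left = torch.randint(0, 2 * padding + 1, (1,)).item()
    return padded[:, top:top + h, left:left + w]


img = torch.rand(3, 32, 32)  # stand-in for one CIFAR-10 image
views = [random_flip_and_crop(img) for _ in range(4)]
print([tuple(v.shape) for v in views])  # four (3, 32, 32) views of the same image
```

An offset of exactly `padding` in both directions reproduces the original framing; any other offset shifts the content by up to 4 pixels and fills the exposed border with zeros.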

Why in this project

CIFAR-10 has only 50K training images of 32×32 pixels — much harder than MNIST. RandomHorizontalFlip and RandomCrop help the model generalize without needing more data. These same augmentations carry forward to every future project.

The "Scratch CNN Ceiling"

Why shallow CNNs hit a performance wall on complex images

A shallow network (2-3 conv layers) can only learn low-to-mid-level features: edges, corners, simple textures. Distinguishing a cat from a dog requires high-level semantic features (ear shape, fur patterns, body posture) that need many more layers to build up.

Analogy

Imagine describing the Mona Lisa. With 3 words, you might say "woman, dark, painting." With 30 words, you can describe her pose, expression, background. With 300 words, you capture subtleties. A 3-layer CNN is limited to "3-word descriptions" of images — enough for digits (simple shapes), not enough for cats vs dogs (complex objects).

MNIST (1 channel, 10 simple classes): 99.35% with 2 conv layers ✔
CIFAR-10 (3 channels, 10 complex classes): 81% with 3 conv layers ✘
CIFAR-10 with ResNet-18 (18 layers): 93-95% ✔

Depth matters more than width for complex visual recognition.

Why in this project

This entire project exists to experience the ceiling. We tried two standard tricks (BatchNorm, LR scheduling) and still plateaued at about 81%. This makes the motivation for transfer learning visceral, not theoretical. You don't just read "deeper networks work better"; you run into the limit yourself.

Lessons Learned

BatchNorm does two things at once

+2.17% accuracy AND halved overfitting gap (15% → 8%). It stabilizes training so the model generalizes better, not just trains faster.

LR scheduling rescues premature early stopping

Without the scheduler, v2 early-stopped at epoch 17. The scheduler allowed training to continue 15 more epochs by finding new loss valleys with smaller learning rates.

Diminishing returns reveal architecture bottlenecks

v1→v2: +2.17%. v2→v3: +1.56%. The shrinking marginal gain suggests that training tricks alone will not push this 3-layer CNN much past ~81% on CIFAR-10. Further progress needs a fundamentally deeper or better architecture.

The "scratch CNN ceiling" is a real phenomenon

This entire project exists to experience hitting that ceiling firsthand. The 77%→81% progression makes the motivation for transfer learning visceral, not abstract.

Overfitting vs. accuracy is a tradeoff

v3 achieved the highest test accuracy (81%) but also the worst overfitting gap (16%). Sometimes more accuracy means more overfitting — the question is whether the absolute test metric improved.

What's Next

The ceiling is clear: ~81% on CIFAR-10 with a scratch CNN. The next project uses transfer learning (ResNet-18) on a harder task (100 sports classes) to show how pretrained features shatter this ceiling: 91% with feature extraction alone, 96-98% with fine-tuning.