Project 1 of 4

MNIST CNN

From-scratch CNN for handwritten digit classification. The foundation project: understanding convolutions, pooling, dropout, and the PyTorch training loop.

Test Accuracy: 99.35%
Conv Layers: 2
Epochs: 13 (early stop)
Parameters: ~107K

Model Architecture

SimpleCNN

Input: grayscale image → (N, 1, 28, 28)
Conv1 + ReLU + Pool: 32 filters, k=3, pad=1 → (N, 32, 14, 14)
Conv2 + ReLU + Pool: 64 filters, k=3, pad=1 → (N, 64, 7, 7)
Flatten + FC: 3136 → 128, Dropout(0.5) → (N, 128)
Output: 128 → 10 logits → (N, 10)
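A minimal PyTorch sketch of this architecture. Class and attribute names here are illustrative, not necessarily the project's exact code:

```python
import torch
from torch import nn

class SimpleCNN(nn.Module):
    """Two Conv+ReLU+Pool blocks, then an FC head with dropout."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)   # 3136 -> 128
        self.drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))   # (N, 32, 14, 14)
        x = self.pool(torch.relu(self.conv2(x)))   # (N, 64, 7, 7)
        x = x.flatten(1)                           # (N, 3136)
        x = self.drop(torch.relu(self.fc1(x)))     # (N, 128)
        return self.fc2(x)                         # (N, 10) raw logits

model = SimpleCNN()
out = model(torch.zeros(4, 1, 28, 28))
print(out.shape)  # torch.Size([4, 10])
```

Note there is no softmax in forward(): CrossEntropyLoss expects raw logits (see the loss section below).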

Training Configuration

Parameter      | Value                | Why
Optimizer      | Adam (lr=0.001)      | Faster convergence than SGD for this simple task
Loss           | CrossEntropyLoss     | Standard for multi-class classification
Batch Size     | 64                   | Good balance of speed and gradient stability
Dropout        | 0.5 on FC layer      | Prevents overfitting; without it, train accuracy diverges from test
Early Stopping | patience=3           | Stopped at epoch 13; best checkpoint at epoch 10
Data Split     | 60K train / 10K test | Standard MNIST split; the large test set gives reliable accuracy estimates

Training Results

Loss and Accuracy Curves
Loss and accuracy curves over 13 epochs. Test accuracy plateaus around epoch 10 while training accuracy continues rising — the classic overfitting signal that triggers early stopping.
Confusion Matrix
10×10 confusion matrix. Nearly diagonal — most errors cluster around visually similar digit pairs.
Wrong Predictions
Gallery of 16 misclassified digits. Many are genuinely ambiguous even to humans.

Hardest Digit Pairs

4 ↔ 9: similar loop shapes
3 → 5: similar curve structure
7 → 2: angled strokes overlap

Concepts Study Guide

Convolution (Conv2d)

A sliding filter that detects local patterns in the input

A convolution slides a small filter (kernel) across the input image, computing a dot product at each position. Each filter learns to detect a specific pattern — edges, corners, textures, etc.

Analogy

Imagine reading a page with a magnifying glass. You scan a small area at a time (the kernel), looking for specific patterns. One magnifying glass might highlight vertical lines, another might highlight curves. Together, dozens of these "magnifying glasses" (filters) capture everything on the page.

Key parameters:

in_channels / out_channels — how many input feature maps go in, how many filters (output maps) come out.
kernel_size — the size of the sliding window (3×3 is most common).
padding — adding zeros around the border so the output keeps the same spatial size.

Output size = (Input + 2×padding - kernel_size) / stride + 1
Example: (28 + 2×1 - 3) / 1 + 1 = 28  →  same size with padding=1
Why in this project

Conv layers are the backbone of any CNN. Our 2 conv layers extract increasingly abstract features: layer 1 finds edges, layer 2 combines edges into digit shapes.
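The output-size formula above can be checked numerically. This small helper is illustrative, not part of the project:

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution: (size + 2*padding - kernel) / stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out_size(28, kernel=3, padding=1))       # 28: "same" conv, size preserved
print(conv_out_size(28, kernel=3, padding=0))       # 26: shrinks without padding
print(conv_out_size(28, kernel=3, padding=1) // 2)  # 14: after a following MaxPool2d(2)
```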

ReLU (Rectified Linear Unit)

The simplest nonlinearity: keep positives, zero out negatives

ReLU applies f(x) = max(0, x) element-wise. Without a nonlinear activation, stacking multiple linear layers would collapse into a single linear transformation — the network couldn't learn complex patterns.

Analogy

Think of a neuron that only "fires" when excited. If the signal is positive (interesting pattern detected), pass it through. If negative (not relevant), shut it off completely. This simple on/off behavior, when combined across thousands of neurons, creates powerful pattern recognition.

ReLU(x) = max(0, x)
ReLU(-3) = 0  |  ReLU(0) = 0  |  ReLU(5) = 5
Why in this project

ReLU is applied after every Conv layer and the first FC layer. It's the industry default because it's fast, simple, and avoids the vanishing gradient problem that plagues sigmoid/tanh in deep networks.
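Two quick checks with made-up layer sizes: without an activation between them, two Linear layers really do collapse into a single linear map, and ReLU is just max(0, x):

```python
import torch
from torch import nn

torch.manual_seed(0)
a, b = nn.Linear(4, 8), nn.Linear(8, 3)
x = torch.randn(2, 4)

# Stacking b(a(x)) with no nonlinearity equals one combined linear layer:
W = b.weight @ a.weight                 # combined weight matrix (3, 4)
bias = b.weight @ a.bias + b.bias       # combined bias (3,)
collapsed = x @ W.T + bias
print(torch.allclose(b(a(x)), collapsed, atol=1e-5))  # True

# ReLU between the layers is what breaks this collapse:
print(torch.relu(torch.tensor([-3.0, 0.0, 5.0])))  # tensor([0., 0., 5.])
```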

MaxPooling

Downsamples by keeping only the strongest activation in each region

MaxPool2d(2) slides a 2×2 window across the feature map and keeps only the maximum value. This halves the spatial dimensions (H and W), reducing computation and making the model invariant to small translations.

Analogy

Imagine you divided a photo into 2×2 pixel blocks. From each block, you only keep the brightest pixel. You lose fine detail but keep the important structure — and the image is now 1/4 the size. This is exactly what MaxPooling does to feature maps.

Input: (N, 32, 28, 28) → MaxPool2d(2) → (N, 32, 14, 14)
Spatial dims halved: 28/2 = 14. Channels unchanged.
Why in this project

After each Conv+ReLU, we pool to shrink the feature map. This creates a hierarchy: early layers see fine details (14×14), later layers see abstract patterns (7×7). It also adds translation invariance — a digit shifted by 1 pixel still activates the same pooled region.
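A shape check matching the numbers above; the tensor contents are arbitrary:

```python
import torch
from torch import nn

# On a single 2x2 block, MaxPool2d(2) keeps only the maximum:
x = torch.tensor([[1., 2.], [5., 3.]]).reshape(1, 1, 2, 2)
print(nn.MaxPool2d(2)(x))  # tensor([[[[5.]]]])

# On a full feature map, H and W halve while channels are untouched:
feat = torch.randn(8, 32, 28, 28)   # (N, C, H, W)
pooled = nn.MaxPool2d(2)(feat)
print(pooled.shape)  # torch.Size([8, 32, 14, 14])
```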

Dropout

Randomly silences neurons during training to prevent overfitting

During training, Dropout(p=0.5) randomly sets 50% of neuron outputs to zero each forward pass and scales the survivors by 1/(1-p), so the expected activation stays the same. This forces the network to not rely on any single neuron, building redundancy. During evaluation, dropout is a no-op: all neurons are active, and the train-time scaling means no further compensation is needed.

Analogy

Imagine a team where half the members randomly call in sick each day. The team learns to be resilient — no single person becomes a bottleneck, and everyone develops broader skills. On the day of the final presentation (evaluation), everyone shows up and the team performs at its best.

Training: randomly zero 50% of activations each batch, scale the rest by 1/(1-p)
Evaluation: use all activations unchanged (the train-time scaling already compensates)
model.train() enables dropout  |  model.eval() disables it
Why in this project

Applied on the FC layer (3136→128) where overfitting is most likely. Without Dropout, train accuracy hit ~99.5% but test stagnated — the model was memorizing, not generalizing.
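PyTorch's inverted-dropout behavior is easy to see directly. With p=0.5, surviving activations are scaled by 1/(1-p) = 2 during training, and eval mode is the identity:

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()     # training mode: zero ~50%, scale survivors by 1/(1-p) = 2
print(drop(x))   # a mix of 0. and 2. values

drop.eval()      # eval mode: identity, every activation passes through
print(drop(x))   # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```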

CrossEntropyLoss

Measures how far the model's predicted probabilities are from the true label

CrossEntropyLoss combines LogSoftmax + NLLLoss in one step. It takes raw logits (unnormalized scores) and the true class label, then computes how "surprised" the model is by the correct answer. Lower loss = the model assigned high probability to the correct class.

Analogy

Imagine a student taking a multiple-choice test. If the student is 90% confident in the right answer, they barely get penalized. If they're only 10% confident, they get heavily penalized. CrossEntropy measures this "penalty for being wrong" — and the model trains to minimize it.

Loss = -log(p_correct)
If model says P(correct class) = 0.9 → Loss = -log(0.9) = 0.105 (small)
If model says P(correct class) = 0.1 → Loss = -log(0.1) = 2.302 (large)
Why in this project

The standard loss for multi-class classification. Our model outputs 10 raw logits — CrossEntropyLoss converts them to probabilities internally and computes the loss. That's why we don't add softmax in the model's forward() method.
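The -log(p) numbers above, plus a check (with made-up logits) that F.cross_entropy really is LogSoftmax + NLL in one step:

```python
import math
import torch
import torch.nn.functional as F

# The formula from the text: loss = -log(P(correct class))
print(round(-math.log(0.9), 3))  # 0.105 -- confident and correct: small penalty
print(round(-math.log(0.1), 3))  # 2.303 -- unsure: large penalty

# The PyTorch version takes raw logits and a class index:
logits = torch.tensor([[2.0, -1.0, 0.5]])         # unnormalized scores
target = torch.tensor([0])                        # true class index
loss = F.cross_entropy(logits, target)
manual = -torch.log_softmax(logits, dim=1)[0, 0]  # same computation by hand
print(torch.allclose(loss, manual))  # True
```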

Adam Optimizer

Adaptive learning rates per parameter — the go-to optimizer

Adam (Adaptive Moment Estimation) combines two ideas: momentum (smoothing gradient direction over time) and RMSprop (scaling learning rate per parameter based on gradient magnitude). Parameters with large gradients get smaller updates; parameters with small gradients get larger updates.

Analogy

Imagine hiking down a mountain in fog. Plain SGD takes equal-sized steps in whichever direction looks steepest right now. Adam is smarter: it remembers which direction it's been trending (momentum) and adjusts step size based on how rough the terrain is (adaptive rate). Smooth slope? Take bigger steps. Rocky terrain? Take careful small steps.

Adam = Momentum + RMSprop
m_t = β1 * m_(t-1) + (1-β1) * gradient   (direction)
v_t = β2 * v_(t-1) + (1-β2) * gradient²   (scale)
update = lr * m̂_t / (√v̂_t + ε)   where m̂_t, v̂_t are bias-corrected m_t, v_t
Why in this project

Adam with lr=0.001 is the standard starting point for most deep learning tasks. It converges faster than plain SGD on MNIST because it adapts to each parameter's gradient landscape automatically.
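One Adam step worked by hand with made-up numbers, including the bias correction that matters at early steps (when m and v are still near their zero initialization):

```python
import math

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # PyTorch's defaults
m = v = 0.0
w, grad, t = 1.0, 0.5, 1        # hypothetical parameter, gradient, step count

m = beta1 * m + (1 - beta1) * grad       # direction (momentum)
v = beta2 * v + (1 - beta2) * grad ** 2  # scale (RMSprop-style)
m_hat = m / (1 - beta1 ** t)             # bias correction
v_hat = v / (1 - beta2 ** t)
w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(round(w, 6))  # 0.999: roughly an lr-sized step, regardless of gradient scale
```

The takeaway: after bias correction, m̂/√v̂ is close to ±1 for a steady gradient, so the step size is roughly lr itself.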

Early Stopping

Stop training when the model starts getting worse on unseen data

Monitor a validation metric (test accuracy or validation loss) each epoch. If it hasn't improved for patience consecutive epochs, stop training and revert to the best checkpoint. This prevents the model from overfitting by training too long.

Analogy

Imagine studying for an exam. At first, practice test scores improve. But after a while, you start memorizing specific practice questions rather than understanding concepts — your real exam score would actually drop. Early stopping is like a study buddy who says "you peaked 3 days ago, stop cramming and use that version of yourself."

Epoch 10: test_acc = 99.34% ← best so far, save checkpoint
Epoch 11: test_acc = 99.30% ← patience count: 1
Epoch 12: test_acc = 99.32% ← patience count: 2
Epoch 13: test_acc = 99.31% ← patience count: 3 = patience → STOP
Load checkpoint from epoch 10 (99.34%)
Why in this project

With patience=3, training stopped at epoch 13 instead of running all 20. The best model (epoch 10, 99.34%) was better than the final model — proving early stopping rescued us from overfitting.
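The patience logic above as a plain-Python sketch. The accuracies for epochs 1-9 are invented for illustration; epochs 10-13 mirror the trace:

```python
def early_stop_training(accuracies, patience=3):
    """Return (best_epoch, best_acc), stopping after `patience` non-improving epochs."""
    best_acc, best_epoch, bad_epochs = 0.0, 0, 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch, bad_epochs = acc, epoch, 0  # save checkpoint here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                                         # revert to best checkpoint
    return best_epoch, best_acc

accs = [98.10, 98.70, 99.00, 99.10, 99.20, 99.25, 99.28, 99.30, 99.32,  # epochs 1-9 (illustrative)
        99.34, 99.30, 99.32, 99.31]                                     # epochs 10-13 (from the trace)
print(early_stop_training(accs))  # (10, 99.34)
```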

Backpropagation

Computing gradients by applying the chain rule backwards through the network

After the forward pass computes the loss, backpropagation works backwards through every layer using the chain rule of calculus to compute how much each weight contributed to the error. These gradients then tell the optimizer which direction to adjust each weight.

Analogy

Imagine a factory assembly line that produces a defective product. To fix it, you trace backwards through each station asking "how much did YOUR step contribute to this defect?" That's backpropagation — assigning blame to each weight proportional to its contribution to the final error.

Forward: input → conv1 → relu → pool → conv2 → ... → loss
Backward: loss → ∂loss/∂fc → ∂loss/∂conv2 → ... → ∂loss/∂conv1

In PyTorch: loss.backward() computes all gradients automatically
Why in this project

This is the same backprop you implemented by hand in DLFS — but now PyTorch's autograd handles it. The loss.backward() + optimizer.step() + optimizer.zero_grad() trio is the core training loop pattern used in every PyTorch project.
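The trio in a minimal loop body, using a stand-in linear model and random data rather than the project's CNN:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)                                # stand-in for the full CNN
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))   # one fake batch
loss = loss_fn(model(x), y)   # forward pass
loss.backward()               # backward pass: autograd fills .grad on every parameter
print(model.weight.grad is not None)  # True
opt.step()                    # adjust weights along the gradients
opt.zero_grad()               # clear grads so the next batch starts fresh
```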

Lessons Learned

Conv → ReLU → MaxPool is the fundamental building block
Understanding how each operation transforms the tensor shape (spatial dimensions halved by pool, channels set by filter count) is essential for designing any CNN.
model.train() vs model.eval() matters
Dropout behaves differently during training (randomly zeros neurons) vs evaluation (uses all neurons). Forgetting eval() leads to artificially lower test accuracy.
optimizer.zero_grad() is needed every batch
PyTorch accumulates gradients by default. Without zeroing, gradients from previous batches contaminate the current update — a common beginner bug.
Early Stopping + Checkpointing preserves the best model
The model at epoch 10 (99.34%) was better than epoch 13 (99.31%). Without checkpointing, you'd keep the worse final model.
Confusion Matrix reveals systematic weaknesses
Raw accuracy (99.35%) hides that certain digit pairs (4↔9) are consistently confused. This diagnostic skill carries forward to every classification project.
padding=1 with kernel_size=3 preserves spatial dimensions
Without padding, each conv layer shrinks the feature map. Padding=1 keeps the same H×W, making architecture design much simpler.
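The optimizer.zero_grad() lesson can be demonstrated in a few lines; the toy loss here is illustrative:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
for _ in range(3):
    loss = (2 * w).sum()
    loss.backward()        # no zero_grad(): each call ADDS to w.grad

print(w.grad)  # tensor([6.]): three batches' gradients (2 each) piled up
```

Calling optimizer.zero_grad() (or w.grad = None) between iterations keeps each update based on the current batch alone.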