From-scratch CNN for handwritten digit classification. The foundation project: understanding convolutions, pooling, dropout, and the PyTorch training loop.
| Parameter | Value | Why |
|---|---|---|
| Optimizer | Adam (lr=0.001) | Faster convergence than SGD for this simple task |
| Loss | CrossEntropyLoss | Standard for multi-class classification |
| Batch Size | 64 | Good balance of speed and gradient stability |
| Dropout | 0.5 on FC layer | Prevents overfitting; without it, train acc diverges from test |
| Early Stopping | patience=3 | Stopped at epoch 13; best checkpoint at epoch 10 |
| Data Split | 60K train / 10K test | Standard MNIST split; the large test set gives reliable accuracy estimates |
A convolution slides a small filter (kernel) across the input image, computing a dot product at each position. Each filter learns to detect a specific pattern — edges, corners, textures, etc.
Imagine reading a page with a magnifying glass. You scan a small area at a time (the kernel), looking for specific patterns. One magnifying glass might highlight vertical lines, another might highlight curves. Together, dozens of these "magnifying glasses" (filters) capture everything on the page.
Key parameters:
• in_channels / out_channels — how many input feature maps go in, how many filters (output maps) come out.
• kernel_size — the size of the sliding window (3×3 is most common).
• padding — adding zeros around the border so the output keeps the same spatial size.
Conv layers are the backbone of any CNN. Our 2 conv layers extract increasingly abstract features: layer 1 finds edges, layer 2 combines edges into digit shapes.
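The sliding dot product described above can be sketched in plain Python. This is a minimal single-channel, stride-1 toy, not the project's actual layer (which uses `nn.Conv2d`); the `conv2d` name and the edge-detector kernel are illustrative only:

```python
def conv2d(image, kernel, padding=0):
    """Naive 2D convolution (cross-correlation, as PyTorch computes it):
    slide the kernel over the image, taking a dot product at each spot."""
    k = len(kernel)
    if padding:  # surround the image with zeros to preserve output size
        w = len(image[0]) + 2 * padding
        padded = [[0.0] * w for _ in range(padding)]
        padded += [[0.0] * padding + list(row) + [0.0] * padding for row in image]
        padded += [[0.0] * w for _ in range(padding)]
        image = padded
    h, w = len(image), len(image[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]

# A vertical-edge kernel responds where pixel values change left-to-right.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]
print(conv2d(img, edge_kernel))             # strong response at the edge
print(len(conv2d(img, edge_kernel, padding=1)))  # padding=1 keeps 4 rows
```

With `padding=0` a 3×3 kernel shrinks a 4×4 input to 2×2; `padding=1` restores the original spatial size, which is why the project's conv layers use it.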
ReLU applies f(x) = max(0, x) element-wise. Without a nonlinear activation, stacking multiple linear layers would collapse into a single linear transformation — the network couldn't learn complex patterns.
Think of a neuron that only "fires" when excited. If the signal is positive (interesting pattern detected), pass it through. If negative (not relevant), shut it off completely. This simple on/off behavior, when combined across thousands of neurons, creates powerful pattern recognition.
ReLU is applied after every Conv layer and the first FC layer. It's the industry default because it's fast, simple, and avoids the vanishing gradient problem that plagues sigmoid/tanh in deep networks.
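The on/off behavior is one line of code. A toy sketch (the real model calls `F.relu` / `nn.ReLU`):

```python
def relu(x):
    # f(x) = max(0, x): keep positive signals, silence negative ones
    return max(0.0, x)

# Note: without this nonlinearity, w2*(w1*x + b1) + b2 simplifies to
# (w2*w1)*x + (w2*b1 + b2) -- two linear layers collapse into one.
feature_map = [-2.0, -0.5, 0.0, 1.5, 3.0]
print([relu(v) for v in feature_map])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```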
MaxPool2d(2) slides a 2×2 window across the feature map and keeps only the maximum value. This halves the spatial dimensions (H and W), reducing computation and making the model invariant to small translations.
Imagine you divided a photo into 2×2 pixel blocks. From each block, you only keep the brightest pixel. You lose fine detail but keep the important structure — and the image is now 1/4 the size. This is exactly what MaxPooling does to feature maps.
After each Conv+ReLU, we pool to shrink the feature map. This creates a hierarchy: early layers see fine details (14×14), later layers see abstract patterns (7×7). It also adds translation invariance — a digit shifted by 1 pixel still activates the same pooled region.
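The "brightest pixel per 2×2 block" idea can be sketched directly (a toy stand-in for `nn.MaxPool2d(2)`; assumes even height and width):

```python
def maxpool2x2(fmap):
    """Keep the max of each non-overlapping 2x2 block, halving H and W."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
print(maxpool2x2(fmap))  # [[4, 2], [2, 8]] -- 4x4 shrinks to 2x2
```

Shifting a strong activation by one pixel usually lands it in the same 2×2 block, which is where the small-translation invariance comes from.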
During training, Dropout(p=0.5) randomly sets 50% of neuron outputs to zero on each forward pass and scales the survivors by 1/(1−p) so the expected activation stays the same ("inverted dropout", which is what PyTorch implements). This forces the network to not rely on any single neuron, building redundancy. During evaluation, dropout is disabled and all neurons are active — no rescaling needed.
Imagine a team where half the members randomly call in sick each day. The team learns to be resilient — no single person becomes a bottleneck, and everyone develops broader skills. On the day of the final presentation (evaluation), everyone shows up and the team performs at its best.
Applied on the FC layer (3136→128) where overfitting is most likely. Without Dropout, train accuracy hit ~99.5% but test stagnated — the model was memorizing, not generalizing.
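The zero-and-rescale mechanic fits in a few lines. A toy sketch of inverted dropout (the project uses `nn.Dropout`, which does this per-element on tensors):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: during training, zero each value with probability p
    and scale survivors by 1/(1-p); during eval, pass everything through."""
    if not training:
        return list(activations)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in activations]

rng = random.Random(0)  # seeded for a repeatable demo
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng))  # some zeros, rest doubled
print(dropout([1.0, 2.0, 3.0, 4.0], training=False))  # eval: unchanged
```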
CrossEntropyLoss combines LogSoftmax + NLLLoss in one step. It takes raw logits (unnormalized scores) and the true class label, then computes how "surprised" the model is by the correct answer. Lower loss = the model assigned high probability to the correct class.
Imagine a student taking a multiple-choice test. If the student is 90% confident in the right answer, they barely get penalized. If they're only 10% confident, they get heavily penalized. CrossEntropy measures this "penalty for being wrong" — and the model trains to minimize it.
The standard loss for multi-class classification. Our model outputs 10 raw logits — CrossEntropyLoss converts them to probabilities internally and computes the loss. That's why we don't add softmax in the model's forward() method.
Adam (Adaptive Moment Estimation) combines two ideas: momentum (smoothing gradient direction over time) and RMSprop (scaling learning rate per parameter based on gradient magnitude). Parameters with large gradients get smaller updates; parameters with small gradients get larger updates.
Imagine hiking down a mountain in fog. Plain SGD takes equal-sized steps in whichever direction looks steepest right now. Adam is smarter: it remembers which direction it's been trending (momentum) and adjusts step size based on how rough the terrain is (adaptive rate). Smooth slope? Take bigger steps. Rocky terrain? Take careful small steps.
Adam with lr=0.001 is the standard starting point for most deep learning tasks. It converges faster than plain SGD on MNIST because it adapts to each parameter's gradient landscape automatically.
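The two moving averages can be shown for a single scalar parameter. A simplified sketch of the Adam update rule (default hyperparameters match `torch.optim.Adam`; minimizing the toy function f(x) = x² is an illustration, not the project's training):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: m smooths the gradient direction (momentum),
    v tracks gradient magnitude to scale the step (RMSprop idea)."""
    m = b1 * m + (1 - b1) * grad        # EMA of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # EMA of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)  # steadily descends toward the minimum at 0
```

Note how the effective step is roughly lr × sign(gradient): the division by √v̂ normalizes away the gradient's magnitude, which is why Adam behaves consistently across parameters with very different scales.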
Monitor a validation metric (test accuracy or validation loss) each epoch. If it hasn't improved for patience consecutive epochs, stop training and revert to the best checkpoint. This prevents the model from overfitting by training too long.
Imagine studying for an exam. At first, practice test scores improve. But after a while, you start memorizing specific practice questions rather than understanding concepts — your real exam score would actually drop. Early stopping is like a study buddy who says "you peaked 3 days ago, stop cramming and use that version of yourself."
With patience=3, training stopped at epoch 13 instead of running all 20. The best model (epoch 10, 99.34%) was better than the final model — proving early stopping rescued us from overfitting.
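The patience logic is a small loop. A sketch with a made-up accuracy curve (the numbers below are hypothetical, chosen only to mirror the peak-then-plateau shape described above):

```python
def train_with_early_stopping(accuracies, patience=3):
    """Stop once the validation metric hasn't improved for `patience`
    consecutive epochs; report the best epoch so its checkpoint can be
    restored."""
    best_acc, best_epoch, bad_epochs = 0.0, 0, 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch, bad_epochs = acc, epoch, 0  # new best
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, best_acc, epoch  # stop training here
    return best_epoch, best_acc, len(accuracies)

# Hypothetical per-epoch test accuracies: improve, peak, then plateau
accs = [97.1, 98.0, 98.6, 98.9, 99.0, 99.1, 99.2, 99.25, 99.3, 99.34,
        99.30, 99.28, 99.31, 99.33]
print(train_with_early_stopping(accs))  # (10, 99.34, 13): best epoch 10, stopped at 13
```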
After the forward pass computes the loss, backpropagation works backwards through every layer using the chain rule of calculus to compute how much each weight contributed to the error. These gradients then tell the optimizer which direction to adjust each weight.
Imagine a factory assembly line that produces a defective product. To fix it, you trace backwards through each station asking "how much did YOUR step contribute to this defect?" That's backpropagation — assigning blame to each weight proportional to its contribution to the final error.
This is the same backprop you implemented by hand in DLFS — but now PyTorch's autograd handles it. The loss.backward() + optimizer.step() + optimizer.zero_grad() trio is the core training loop pattern used in every PyTorch project.
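Autograd hides the chain-rule bookkeeping, but it can be written by hand for a one-parameter toy "network" (a sketch of the mechanics, not the project model — PyTorch replaces the manual gradient lines with `loss.backward()`):

```python
def forward_and_backward(w, b, x, y):
    """One training step by hand: forward pass to the loss, then the
    chain rule backwards to assign each weight its share of the blame."""
    # Forward: linear prediction and squared-error loss
    pred = w * x + b
    loss = (pred - y) ** 2
    # Backward: chain rule, one step at a time
    dloss_dpred = 2 * (pred - y)   # d/dpred of (pred - y)^2
    dloss_dw = dloss_dpred * x     # pred = w*x + b, so dpred/dw = x
    dloss_db = dloss_dpred * 1.0   # dpred/db = 1
    return loss, dloss_dw, dloss_db

w, b, lr = 0.0, 0.0, 0.1
for _ in range(50):
    loss, gw, gb = forward_and_backward(w, b, x=2.0, y=4.0)
    w -= lr * gw   # the manual version of optimizer.step()
    b -= lr * gb   # (fresh gradients each loop = optimizer.zero_grad())
print(round(w * 2.0 + b, 3))  # 4.0 -- the prediction converges to y
```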
MNIST is "solved" — 99.34% with a simple 2-layer CNN. The real question: can this architecture handle harder images?
CIFAR-10 (32×32 color images, 10 classes) will expose the limitations of shallow networks. Spoiler: the same approach hits a ceiling around 77%, motivating BatchNorm, LR scheduling, and eventually transfer learning.