Project 5 of 5

LSTM Sentiment

IMDB review classification with a packed LSTM baseline in v1, trainable GloVe embeddings in v2, and frozen GloVe embeddings in v3. This page closes the sequence-modeling story by comparing scratch, pretrained-trainable, and pretrained-frozen word representations.

Best Test Accuracy: 84.48%
Best Val Accuracy: 86.68%
Parameters: 2.87M
GloVe Coverage: 63.25%
Epochs (Early Stop): 6
RNN → LSTM: from short-term memory to gated memory

The vanilla RNN in the Shakespeare project learned formatting and short-range character patterns, but it could not preserve meaning over longer spans. IMDB sentiment classification makes that limitation impossible to ignore: the model must read a full review, keep context alive, and reduce the whole sequence to one final decision.

Why LSTM Was the Next Step

What broke in the vanilla RNN

The Shakespeare model could generate local patterns well, but long-range memory was fragile. That was acceptable for short spelling fragments, but not for review sentiment, where phrases like "not good," "slow at first but worth it," or "decent idea, bad execution" depend on sequence-level context.

What LSTM adds

An LSTM introduces a cell state and a gating mechanism (input, forget, and output gates). Instead of forcing one hidden vector to do everything, the model can keep useful information alive longer and update it more selectively. That makes many-to-one review classification much more realistic.

Model Architecture

SentimentLSTM

Input (word indices): (N, seq_len_in_batch)
→ Embedding (100-dim learned vectors): (N, seq_len, 100)
→ Packed LSTM (hidden=256, 1 layer): h_n: (1, N, 256)
→ Linear (256 → 1 logit): (N, 1)

Training uses BCEWithLogitsLoss, and predictions use sigmoid(logits) > 0.5.
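A minimal sketch of the module above, assuming batch_first tensors and a padding index of 0 (the class name matches the page; the exact field names are assumptions about the original code):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class SentimentLSTM(nn.Module):
    """Embedding -> packed LSTM -> linear head with one logit per review."""
    def __init__(self, vocab_size=25002, embed_dim=100, hidden_size=256, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x, lengths):
        # x: (N, seq_len) word indices; lengths: number of real tokens per review
        emb = self.embedding(x)                       # (N, seq_len, 100)
        packed = pack_padded_sequence(
            emb, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)               # h_n: (1, N, 256)
        return self.fc(h_n[-1])                       # (N, 1) logit
```

Training pairs this with nn.BCEWithLogitsLoss; at evaluation time, torch.sigmoid(logits) > 0.5 yields the class prediction.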

Padding Problem and the Fix

This ended up being the most important implementation lesson of the project. The first naive version took the final hidden state after feeding a fully padded review through the LSTM. That meant short reviews were summarized after many meaningless padding steps.

Naive

Use the final hidden state after a globally padded sequence of length 256. Result: near-random performance around 51% test accuracy.

Correctness fix

Switch to pack_padded_sequence so the LSTM ignores padded timesteps. Result: the model now summarizes the review after the final real word.

Efficiency fix

Add dynamic batch trimming in the collate_fn. Each batch is cut to its own longest real review instead of always computing all 256 steps.
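A sketch of such a collate_fn, assuming each dataset item is a (token_ids, label) pair and that 0 is the padding index (both assumptions about the original code):

```python
import torch

PAD_IDX = 0  # assumed padding index

def collate_fn(batch):
    """batch: list of (token_ids, label) pairs. Pad only to the longest
    real review in this batch, not to the global max length of 256."""
    lengths = torch.tensor([len(tokens) for tokens, _ in batch])
    max_len = int(lengths.max())
    padded = torch.full((len(batch), max_len), PAD_IDX, dtype=torch.long)
    for i, (tokens, _) in enumerate(batch):
        padded[i, :len(tokens)] = torch.tensor(tokens, dtype=torch.long)
    labels = torch.tensor([label for _, label in batch], dtype=torch.float)
    return padded, lengths, labels
```

The returned lengths tensor is exactly what pack_padded_sequence needs downstream, so the two fixes compose naturally.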

The big takeaway: in sequence models, padding is not just formatting. If handled carelessly, it can directly damage the representation the classifier depends on.

Training Results

Version Comparison

Version | Main Change | Best Val Acc | Test Acc | Interpretation
v1 | Learn embeddings from scratch | 83.08% | 81.91% | Packed sequences made the baseline viable
v2_glove | Initialize embeddings from pretrained GloVe | 86.68% | 84.48% | Better word representations improved early learning and final accuracy
v3_glove_frozen | Keep the same GloVe vectors frozen | 84.96% | 83.62% | Pretrained features help, but fine-tuning still works better

This comparison is intentionally fair: same split, same sequence length, same hidden size, same optimizer, same learning rate, and the same EPOCHS=15 / PATIENCE=3. The only experiment axis in v2/v3 is what happens to the GloVe embedding layer after initialization.

The final takeaway is clean: frozen GloVe still beats the scratch baseline, but trainable GloVe gives the strongest result overall.

Loss and Accuracy Curves
With GloVe initialization, the model starts stronger much earlier and reaches a better final result. Overfitting still appears after the best checkpoint, so pretrained embeddings help representation quality without removing the need for early stopping.
Confusion Matrix
Compared with v1, v2 sharply reduces false positives. The classifier becomes more balanced overall, suggesting that pretrained word vectors help it resist shallow local praise cues inside otherwise negative reviews.

Error Analysis

Confusion Pattern

Cell | Count
True Negative | 10,685
False Positive | 1,815
False Negative | 2,066
True Positive | 10,434

The key shift from v1 to v2 is that false positives fall from 2,512 to 1,815. In v3, they fall a bit further to 1,707, but false negatives rise to 2,387. That means freezing GloVe makes the model more conservative, which is useful to see but not the best overall tradeoff.

Qualitative Pattern

False positives still exist, but in v2 they are more concentrated in genre-heavy action or franchise reviews where local praise language is strong even though the final judgment is negative.

False negatives stand out more clearly in formal, socially serious, or thematically heavy reviews whose wording feels negative even when the final verdict is positive. v3 reinforces this pattern even more, which fits the idea that frozen embeddings adapt less well to subtle task-specific sentiment cues.

False Positives

Negative reviews predicted as positive. These usually contain local praise or genre enthusiasm that seems to outweigh the final negative judgment.

Example 1 Negative -> Positive Confidence 0.98

Excerpt: "if you haven't enjoyed a van damme movie ... you probably will not like this movie ... i enjoy these kinds ..."

Example 2 Negative -> Positive Confidence 0.96

Excerpt: "has made some of the best western martial arts action movies ever produced ... action classics ... real passion for ..."

Example 3 Negative -> Positive Confidence 0.87

Excerpt: "was a decent film, but i have a few issues with this film ... i have a problem with ..."

False Negatives

Positive reviews predicted as negative. These often use formal, socially serious, or thematically heavy wording even though the final evaluation is favorable.

Example 1 Positive -> Negative Confidence 0.94

Excerpt: "overall, a well done movie ... i came away with something more than i gone in with ..."

Example 2 Positive -> Negative Confidence 0.84

Excerpt: "it is an amazing film because it dares to investigate the hypocrisy ... concerning their women and sexuality ..."

Example 3 Positive -> Negative Confidence 0.94

Excerpt: "the theme is controversial ... lack of continuity and lack of ..."

Training Configuration

Parameter | Value | Why
Vocabulary Size | 25,000 + 2 special tokens | Large enough to cover common review language while staying manageable
Max Length | 256 tokens | Reasonable ceiling for an educational baseline
Embedding Dim | 100 | Good compromise, and it matches the pretrained GloVe 100d vectors used in v2
Hidden Size | 256 | Enough capacity for sequence classification without making v1 too heavy
Batch Size | 64 | Stable and consistent with previous projects
Optimizer | Adam (lr=0.001) | Reliable baseline optimizer for NLP classification
Early Stopping | patience=3 | Good balance between fair comparison and avoiding wasted epochs

Concepts Study Guide

Embeddings

Dense word vectors replace huge sparse one-hot inputs

Character one-hot encoding was acceptable when the vocabulary was only 65 symbols. Word-level NLP changes the scale completely. A 25k-word vocabulary makes one-hot vectors huge and inefficient, so nn.Embedding is the natural next step.

Analogy

One-hot is like giving each word a locker number. Embeddings are like giving each word a meaningful coordinate in a semantic map, where similar words can move closer together.

Why in this project

The review classifier needs word-level information, not just token identity. Learned embeddings let the model build a useful representation of review language directly from the IMDB dataset.
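The scale jump can be made concrete with this project's vocabulary and embedding sizes (the index values are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 25002, 100
embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of word indices becomes a batch of dense vectors in one lookup.
indices = torch.tensor([[12, 408, 3], [7, 7, 99]])   # (N=2, seq_len=3)
vectors = embedding(indices)                          # (2, 3, 100)

# Contrast: one-hot input would need 25,002 dims per word instead of 100,
# and the embedding rows are trainable parameters, not fixed identities.
assert vectors.shape == (2, 3, 100)
```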

Hidden State vs Cell State

LSTM separates short-term output state from longer-lived memory

A vanilla RNN has just one running state. An LSTM carries both a hidden state and a cell state. The cell state is what gives the architecture a more stable memory path across longer sequences.

LSTM output: out, (h_n, c_n)
h_n = final hidden state
c_n = final cell state
Why in this project

For sequence classification, the final hidden state becomes the review summary vector. The cell state is not directly used as output, but it helps the model maintain context more effectively while reading the review.
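A quick shape check with this project's dimensions:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=1, batch_first=True)
x = torch.randn(4, 20, 100)            # (N, seq_len, embed_dim)
out, (h_n, c_n) = lstm(x)

# out: (4, 20, 256)  hidden state at every timestep
# h_n: (1, 4, 256)   final hidden state -> the review summary vector
# c_n: (1, 4, 256)   final cell state -> the longer-lived memory path

# For an unpacked batch, h_n is simply the last timestep of out.
assert torch.allclose(h_n[0], out[:, -1, :])
```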

BCEWithLogitsLoss

Binary classification uses one output logit, not a softmax over many classes

The model outputs one logit per review. That logit is converted to a probability with sigmoid during evaluation, and BCEWithLogitsLoss handles the stable binary-loss computation during training.

logits: (N, 1)
loss = BCEWithLogitsLoss(logits, labels)
preds = sigmoid(logits) > 0.5
Why in this project

This is the first project in the series that uses binary classification rather than multi-class argmax prediction. That shift is part of the conceptual jump from previous CNN projects.
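The three pseudo-code lines above, made concrete with toy values:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.tensor([[2.0], [-1.5], [0.3]])   # (N, 1) raw model outputs
labels = torch.tensor([[1.0], [0.0], [0.0]])    # float labels, same shape

loss = criterion(logits, labels)                 # stable sigmoid + BCE in one op
preds = (torch.sigmoid(logits) > 0.5).float()    # 0.5 threshold for evaluation
```

Note that BCEWithLogitsLoss takes raw logits, so sigmoid is applied only at evaluation time, never before the loss.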

Packed Sequences

Tell the LSTM which timesteps are real and which ones are just padding

pack_padded_sequence removes the padded tail from the LSTM's computation path. That matters because the final hidden state should summarize the last real word, not a long tail of meaningless padding tokens.

Analogy

Without packing, the model reads the review and then flips through dozens of blank pages before writing its final summary. Packing makes it stop reading at the true end of the review.

Why in this project

This single change rescued the baseline from near-random behavior to a usable result above 81% test accuracy. It was not a small optimization; it was a correctness fix.
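The claim is easy to verify in isolation: for the same input, the packed run reproduces the state after the last real timestep, while a naive run over the padded tail drifts away from it (a self-contained check with toy sizes, not the project's code):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

real = torch.randn(1, 5, 8)                               # 5 real timesteps
padded = torch.cat([real, torch.zeros(1, 11, 8)], dim=1)  # plus 11 padding steps

_, (h_real, _) = lstm(real)       # summary after the last real word
_, (h_naive, _) = lstm(padded)    # summary after the padding tail

packed = pack_padded_sequence(padded, torch.tensor([5]), batch_first=True)
_, (h_packed, _) = lstm(packed)

assert torch.allclose(h_packed, h_real, atol=1e-5)      # packing ignores padding
assert not torch.allclose(h_naive, h_real, atol=1e-5)   # the naive run does not
```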

Dynamic Padding

Trim each batch to its own maximum real length before the model sees it

The dataset is still stored at max length 256 for simplicity, but each batch is trimmed to its longest real review inside the custom collate_fn. This keeps the code simple while reducing wasted embedding/LSTM work.

Why in this project

It complements packed sequences nicely: dynamic padding removes batch-level waste, and packing removes sample-level padding inside that batch.

GloVe Initialization

Pretrained word vectors give the embedding layer a stronger semantic starting point

v2 keeps the same LSTM classifier but changes how the embedding layer starts. Instead of learning every word representation from scratch, it loads GloVe vectors for the words it can match and leaves the rest random.

Why in this project

GloVe matched 15,814 of the 25,002 vocabulary entries (63.25%). That was enough to improve early learning and raise test accuracy from 81.91% to 84.48% without changing the rest of the architecture.
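The initialization step can be sketched like this (the file path, vocab format, and function name are assumptions; a manual parse of glove.6B.100d.txt or torchtext's loaders both work):

```python
import torch
import torch.nn as nn

def load_glove_into_embedding(vocab, glove_path="glove.6B.100d.txt",
                              embed_dim=100, freeze=False):
    """Copy pretrained vectors for matched words; leave the rest random."""
    # vocab: dict mapping word -> row index in the embedding matrix
    weights = torch.randn(len(vocab), embed_dim) * 0.1   # random init for misses
    matched = 0
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                weights[vocab[word]] = torch.tensor([float(v) for v in values])
                matched += 1
    # v2: freeze=False (fine-tune the vectors); v3: freeze=True (keep them fixed)
    return nn.Embedding.from_pretrained(weights, freeze=freeze), matched
```

The freeze flag is the entire difference between v2 and v3: with freeze=True, the embedding weights are excluded from gradient updates.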

Lessons Learned

Padding can quietly destroy a sequence summary
The first naive implementation looked perfectly reasonable on paper, but the final hidden state was polluted by long padding tails. Packed sequences fixed the real bug, not just the speed.
The output/loss/metric trio must match the task
This project moved from multi-class argmax and perplexity to one-logit binary classification with sigmoid thresholding, accuracy, and confusion-matrix analysis.
Error analysis revealed a real bias pattern
v1 leaned too positive. v2 reduced that false-positive bias substantially, which suggests that better word representations help the model resist shallow local praise cues.
Pretrained embeddings help, but they do not solve everything
GloVe gave a meaningful gain over the baseline, especially early in training, but the model still struggles with formal tone, mixed sentiment, and socially heavy positive reviews.
Frozen pretrained features are useful, but trainable ones are better here
v3 still beat the scratch baseline, which shows that pretrained semantics help by themselves. But the best result came from letting GloVe adapt to the IMDB task, not from keeping it fixed.
Simple preprocessing still teaches core NLP ideas
Whitespace tokenization and a top-25k vocabulary are not fancy, but they are enough to make vocab building, unknown tokens, padding, variable-length handling, and sequence summarization concrete.
v2 improves the baseline, but sequence understanding is still the real bottleneck
The main gain from v2 came from better word representations. The next major jump will likely require a stronger sequence encoder or a cleaner representation experiment rather than just longer training.

The Full Journey

Project | Domain | Key Metric | Architecture Highlight
1. MNIST CNN | Image Classification | 99.35% acc | Conv + Pool + FC basics
2. CIFAR-10 CNN | Image Classification | 81.04% acc | BatchNorm, LR Scheduler, scratch ceiling
3. Transfer Learning | Image Classification | 96.60% acc | ResNet18, freeze/unfreeze, fine-tuning
4. RNN Shakespeare | Text Generation | 6.33 perplexity | Hidden state, BPTT, temperature
5. LSTM Sentiment | Text Classification | 84.48% acc | Word embeddings, packed LSTM, GloVe initialization

From image classification to sequence generation, then from sequence generation to sequence understanding. Each project responds directly to a limitation exposed by the previous one.

What's Next

Project 5 now has a clean final story: padding correctness mattered, pretrained embeddings helped, and trainable GloVe beat frozen GloVe. That makes this a strong stopping point for the LSTM project, and the next natural move is a Transformer built from scratch.