Project 5: LSTM Sentiment Summary

Why LSTM Was the Next Step

What broke in the vanilla RNN

The Shakespeare model could generate local patterns well, but long-range memory was fragile. That was acceptable for short spelling fragments, but not for review sentiment where phrases like not good, slow at first but worth it, or decent idea, bad execution depend on sequence-level context.

What LSTM adds

LSTM introduces a cell state and gating mechanism. Instead of forcing one hidden vector to do everything, the model can keep useful information alive longer and update it more selectively. That makes many-to-one review classification much more realistic.

Model Architecture

SentimentLSTM

Input

Word indices

(N, seq_len_in_batch)

→

Embedding

100-dim learned vectors

(N, seq_len, 100)

→

Packed LSTM

hidden=256, 1 layer

h_n: (1, N, 256)

→

Linear

256 → 1 logit

(N, 1)

Training uses BCEWithLogitsLoss, and predictions use sigmoid(logits) > 0.5.

Padding Problem and the Fix

This ended up being the most important implementation lesson of the project. The first naive version took the final hidden state after feeding a fully padded review through the LSTM. That meant short reviews were summarized after many meaningless padding steps.

Naive

Use the final hidden state after a globally padded sequence of length 256. Result: near-random performance around 51% test accuracy.

Correctness fix

Switch to pack_padded_sequence so the LSTM ignores padded timesteps. Result: the model now summarizes the review after the final real word.

Efficiency fix

Add dynamic batch trimming in the collate_fn. Each batch is cut to its own longest real review instead of always computing all 256 steps.

The big takeaway: in sequence models, padding is not just formatting. If handled carelessly, it can directly damage the representation the classifier depends on.

Training Results

Version Comparison

Version	Main Change	Best Val Acc	Test Acc	Interpretation
v1	Learn embeddings from scratch	83.08%	81.91%	Packed sequences made the baseline viable
v2_glove	Initialize embeddings from pretrained GloVe	86.68%	84.48%	Better word representations improved early learning and final accuracy
v3_glove_frozen	Keep the same GloVe vectors frozen	84.96%	83.62%	Pretrained features help, but fine-tuning still works better

This comparison is intentionally fair: same split, same sequence length, same hidden size, same optimizer, same learning rate, and the same EPOCHS=15 / PATIENCE=3. The only experiment axis in v2/v3 is what happens to the GloVe embedding layer after initialization.

The final takeaway is clean: frozen GloVe still beats the scratch baseline, but trainable GloVe gives the strongest result overall.

With GloVe initialization, the model starts stronger much earlier and reaches a better final result. Overfitting still appears after the best checkpoint, so pretrained embeddings help representation quality without removing the need for early stopping.

Compared with v1, v2 sharply reduces false positives. The classifier becomes more balanced overall, suggesting that pretrained word vectors help it resist shallow local praise cues inside otherwise negative reviews.

Error Analysis

Confusion Pattern

Cell	Count
True Negative	10,685
False Positive	1,815
False Negative	2,066
True Positive	10,434

The key shift from v1 to v2 is that false positives fall from 2,512 to 1,815. In v3, they fall a bit further to 1,707, but false negatives rise to 2,387. That means freezing GloVe makes the model more conservative, which is useful to see but not the best overall tradeoff.

Qualitative Pattern

False positives still exist, but in v2 they are more concentrated in genre-heavy action or franchise reviews where local praise language is strong even though the final judgment is negative.

False negatives stand out more clearly in formal, socially serious, or thematically heavy reviews whose wording feels negative even when the final verdict is positive. v3 reinforces this pattern even more, which fits the idea that frozen embeddings adapt less well to subtle task-specific sentiment cues.

False Positives

Negative reviews predicted as positive. These usually contain local praise or genre enthusiasm that seems to outweigh the final negative judgment.

Example 1 Negative -> Positive Confidence 0.98

Excerpt: "if you haven't enjoyed a van damme movie ... you probably will not like this movie ... i enjoy these kinds ..."

Example 2 Negative -> Positive Confidence 0.96

Excerpt: "has made some of the best western martial arts action movies ever produced ... action classics ... real passion for ..."

Example 3 Negative -> Positive Confidence 0.87

Excerpt: "was a decent film, but i have a few issues with this film ... i have a problem with ..."

False Negatives

Positive reviews predicted as negative. These often use formal, socially serious, or thematically heavy wording even though the final evaluation is favorable.

Example 1 Positive -> Negative Confidence 0.94

Excerpt: "overall, a well done movie ... i came away with something more than i gone in with ..."

Example 2 Positive -> Negative Confidence 0.84

Excerpt: "it is an amazing film because it dares to investigate the hypocrisy ... concerning their women and sexuality ..."

Example 3 Positive -> Negative Confidence 0.94

Excerpt: "the theme is controversial ... lack of continuity and lack of ..."

Training Configuration

Parameter	Value	Why
Vocabulary Size	25,000 + 2 special tokens	Large enough to cover common review language while staying manageable
Max Length	256 tokens	Reasonable ceiling for an educational baseline
Embedding Dim	100	Good compromise, and it matches the pretrained GloVe 100d vectors used in v2
Hidden Size	256	Enough capacity for sequence classification without making v1 too heavy
Batch Size	64	Stable and consistent with previous projects
Optimizer	Adam (lr=0.001)	Reliable baseline optimizer for NLP classification
Early Stopping	patience=3	Good balance between fair comparison and avoiding wasted epochs

Concepts Study Guide

E

Embeddings

Dense word vectors replace huge sparse one-hot inputs

▼

Character one-hot encoding was acceptable when the vocabulary was only 65 symbols. Word-level NLP changes the scale completely. A 25k-word vocabulary makes one-hot vectors huge and inefficient, so nn.Embedding is the natural next step.

Analogy

One-hot is like giving each word a locker number. Embeddings are like giving each word a meaningful coordinate in a semantic map, where similar words can move closer together.

Why in this project

The review classifier needs word-level information, not just token identity. Learned embeddings let the model build a useful representation of review language directly from the IMDB dataset.

L

Hidden State vs Cell State

LSTM separates short-term output state from longer-lived memory

▼

A vanilla RNN has just one running state. An LSTM carries both a hidden state and a cell state. The cell state is what gives the architecture a more stable memory path across longer sequences.

LSTM output: out, (h_n, c_n)
h_n = final hidden state
c_n = final cell state

Why in this project

For sequence classification, the final hidden state becomes the review summary vector. The cell state is not directly used as output, but it helps the model maintain context more effectively while reading the review.

B

BCEWithLogitsLoss

Binary classification uses one output logit, not a softmax over many classes

▼

The model outputs one logit per review. That logit is converted to a probability with sigmoid during evaluation, and BCEWithLogitsLoss handles the stable binary-loss computation during training.

logits: (N, 1)
loss = BCEWithLogitsLoss(logits, labels)
preds = sigmoid(logits) > 0.5

Why in this project

This is the first project in the series that uses binary classification rather than multi-class argmax prediction. That shift is part of the conceptual jump from previous CNN projects.

P

Packed Sequences

Tell the LSTM which timesteps are real and which ones are just padding

▼

pack_padded_sequence removes the padded tail from the LSTM's computation path. That matters because the final hidden state should summarize the last real word, not a long tail of meaningless padding tokens.

Analogy

Without packing, the model reads the review and then flips through dozens of blank pages before writing its final summary. Packing makes it stop reading at the true end of the review.

Why in this project

This single change rescued the baseline from near-random behavior to a usable result above 81% test accuracy. It was not a small optimization; it was a correctness fix.

D

Dynamic Padding

Trim each batch to its own maximum real length before the model sees it

▼

The dataset is still stored at max length 256 for simplicity, but each batch is trimmed to its longest real review inside the custom collate_fn. This keeps the code simple while reducing wasted embedding/LSTM work.

Why in this project

It complements packed sequences nicely: dynamic padding removes batch-level waste, and packing removes sample-level padding inside that batch.

G

GloVe Initialization

Pretrained word vectors give the embedding layer a stronger semantic starting point

▼

v2 keeps the same LSTM classifier but changes how the embedding layer starts. Instead of learning every word representation from scratch, it loads GloVe vectors for the words it can match and leaves the rest random.

Why in this project

GloVe matched 15,814 of the 25,002 vocabulary entries (63.25%). That was enough to improve early learning and raise test accuracy from 81.91% to 84.48% without changing the rest of the architecture.

Lessons Learned

Padding can quietly destroy a sequence summary

The first naive implementation looked perfectly reasonable on paper, but the final hidden state was polluted by long padding tails. Packed sequences fixed the real bug, not just the speed.

The output/loss/metric trio must match the task

This project moved from multi-class argmax and perplexity to one-logit binary classification with sigmoid thresholding, accuracy, and confusion-matrix analysis.

Error analysis revealed a real bias pattern

v1 leaned too positive. v2 reduced that false-positive bias substantially, which suggests that better word representations help the model resist shallow local praise cues.

Pretrained embeddings help, but they do not solve everything

GloVe gave a meaningful gain over the baseline, especially early in training, but the model still struggles with formal tone, mixed sentiment, and socially heavy positive reviews.

Frozen pretrained features are useful, but trainable ones are better here

v3 still beat the scratch baseline, which shows that pretrained semantics help by themselves. But the best result came from letting GloVe adapt to the IMDB task, not from keeping it fixed.

Simple preprocessing still teaches core NLP ideas

Whitespace tokenization and a top-25k vocabulary are not fancy, but they are enough to make vocab building, unknown tokens, padding, variable-length handling, and sequence summarization concrete.

v2 improves the baseline, but sequence understanding is still the real bottleneck

The main gain from v2 came from better word representations. The next major jump will likely require a stronger sequence encoder or a cleaner representation experiment rather than just longer training.

The Full Journey

Project	Domain	Key Metric	Architecture Highlight
1. MNIST CNN	Image Classification	99.35% acc	Conv + Pool + FC basics
2. CIFAR-10 CNN	Image Classification	81.04% acc	BatchNorm, LR Scheduler, scratch ceiling
3. Transfer Learning	Image Classification	96.60% acc	ResNet18, freeze/unfreeze, fine-tuning
4. RNN Shakespeare	Text Generation	6.33 perplexity	Hidden state, BPTT, temperature
5. LSTM Sentiment	Text Classification	84.48% acc	Word embeddings, packed LSTM, GloVe initialization

From image classification to sequence generation, then from sequence generation to sequence understanding. Each project responds directly to a limitation exposed by the previous one.

What's Next

Project 5 now has a clean final story: padding correctness mattered, pretrained embeddings helped, and trainable GloVe beat frozen GloVe. That makes this a strong stopping point for the LSTM project, and the next natural move is Transformer from scratch.