Project Hub - Tweet Emotion Classification

Assignment

Classify Emotions in Tweets Using an LSTM-based Architecture (PyTorch). Build an emotion classifier for tweets using:

  • Classical baselines (simple averaging embeddings + MLP)

  • An attention-enhanced LSTM

  • Emotions can include: joy, sadness, anger, fear, love, surprise

  • Datasets:
    • EmoTweet Dataset
    • Kaggle: Emotion Dataset for NLP (6 labels)

Brainstorming

Resources and dev log

  • 08/01/2026 - 01:00 : Found and downloaded the emotion-dataset. Project scaffolding done. Downloaded GloVe embeddings. Trained a model. Created a simple dictionary called “word2idx” tied to the GloVe embeddings, plus an “emotion to id” mapping and vice versa (“id to emotion”). Modified the dataset, adding label_id and sequences columns; sequences is the word-by-word conversion to GloVe embedding ids. Created a neural network classifier with a simple architecture: GloVe Embedding -> LSTM (NO ATTENTION) -> Output (size )
  • 08/01/2026 - 14:00: Given my experience with my previous failed project, I think it’s better to keep the old one. Save results somewhere and make a plot. Made a renewed task list and added a section to my notebook.
  • 09/01/2026 - 01:10 - Downloaded the second dataset, removed duplicates, cleaned it, and merged the two; the merged set has no duplicates. Performed exploratory data analysis and found that the classes are imbalanced (this will have consequences for classification and the choice of metric). The data also contains outliers. Computed 1-, 2-, and 3-grams, but they lack an explanation, or there is something I’m not seeing in them. Also looked at the most frequent 4-grams and 5-grams: 5-grams are too rare relative to the dataset size, and the most frequent 4-grams look almost like the 1/2/3-grams, suggesting the data is sparse. Gemini mentioned “stop words” but I don’t know what they are yet.
  • 09/01/2026 - 12:26 Defined a preprocessing pipeline in a PreprocessPipeline class with all the rules and applied it to the text.
  • 09/01/2026 - 18:00 Plotted bi-grams per category (for the bi-gram analysis only, stop words were removed). Removed outliers on the right tail of the distribution. The n-gram analysis was fundamental because I found a lot of noise. Thankfully, of the entire dataset of 423k labeled samples, fewer than ~500 are noise, so removing them is not urgent. From the n-gram visualization I removed “feel like” and “i m feeling”, since these two bigrams were present and predominant across all 6 labels. The original dataset is “unchanged” in the sense that stop words and strings are still present; only outlier removal and the preprocessing pipeline are applied.
  • 10/01/2026 17:10 - Fixed a bug that incorrectly computed which outliers to remove.
  • 10/01/2026 19:18 - Implemented the baseline (averaged embeddings + MLP) and the LSTM with attention. Implemented F1-score tracking instead of loss only; I consider both macro and weighted F1 since the classes are imbalanced and my goal is to maximize F1. Added a confusion-matrix visualization for the 6 labels after each model’s training. Implemented the following test: take a small subset and overfit the LSTM with attention on it; this should be enough to say there are no bugs or errors in the data. For each model’s training, recorded a history of F1 scores (both weighted and macro) and of train/validation loss, plotted the F1 scores, and computed confusion matrices for both models.
  • 11/01/2026 15:18 Switched from GloVe 100d to GloVe 300d, since I have a dictionary of 80k unique words. The model was slightly better: the baseline reached a maximum weighted F1 of 0.8960 (record for the baseline).
  • 11/01/2026 17:48 Added a manual stop via a try-except loop. Added early stopping and model saving to Google Drive. Added prior initialization and Xavier initialization to both models, hopefully improving training. Added class weights to the cross-entropy loss.
  • 11/01/2026 22:34 Added more samples and retrained. The baseline reached a validation macro F1 of 0.8613, while the LSTM (1 layer, bidirectional) with lr=5e-4, weight_decay=1e-5, and hidden_dim=128 reached a maximum macro F1 of 0.9087.
  • 11/01/2026 23:37 Implemented “set_seed” for reproducibility. Documented the dataset composition.
  • 12/01/2026 23:10 Documented the dataset compositions, explained what GloVe is, and documented the weighted cross-entropy loss.
  • 13/01/2026 00:01 In the Colab I wrote: “Validation Loss (Weighted Cross-Entropy) is the loss computed over the validation set during each epoch, after training. Top-2 Validation Accuracy is the accuracy computed over the top 2 highest-probability guesses. In my case both Validation Accuracy and Top-2 Validation Accuracy are high, so the model is confident, but Top-2 is higher.” And: “F1 score can be weighted or unweighted. The unweighted (macro) score is lower but a more honest evaluation, because it does not take the class imbalance into account.” Provided an explanation of the model’s results. Documented in the Colab that precision and recall weren’t balanced across all classes.
  • 13/01/2026 01:05 - Documented the dictionary implementation and the out-of-vocabulary problem. Documented the attention mechanism.
  • 13/01/2026 13:21 - Documented attention; the models now return attention weights, so I modified and verified all the affected functions. Collected a sample of tweets and printed and plotted the per-token weights for each label. Then computed the top words across the validation set by weight value and plotted them too. Commented on the results.
  • 13/01/2026 15:28 Added an Error Analysis section. Printed 10 tweets from the dataset that the model mislabeled, with the same weight heatmaps as before. Added a sarcasm example: selected 5 random samples from another dataset of sarcastic tweets and showed each tweet with the model’s prediction. The model got some right and some wrong, but I think it’s a good addition. Added a text-only Limitations and Future Improvements section at the end.
  • 13/01/2026 16:03 Added the emoji Python library and an emoji-removal (demoji) step to the preprocessing pipeline; a final rerun is needed. Since I have to retrain anyway, I added ReduceLROnPlateau to fine-tune the model and possibly reach a lower validation loss.
  • 28/01/2026 18:21 - Learned what pyproject.toml is and added one to the repository. Structured the repository following a standard data-science project structure.
  • 29/01/2026 18:00 - Created a script for downloading the data.
  • 30/01/2026 13:00 - Created a notebook that merges the two datasets and saved the result to a Parquet file.
  • 30/01/2026 18:00 - I’m halfway through the refactoring. I learned a lot, so it’s going well.
  • 31/01/2026 16:00 - Added a VocabBuilder class in src and a script called make_dictionary that instantiates an object of this class, loads the dataframe, creates the sequences, builds the dictionary, and saves it to vocab.json under data/
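
The word2idx dictionary, the sequences column, and the later VocabBuilder/make_dictionary refactor all revolve around the same step: mapping GloVe rows to integer ids. A minimal sketch of that step, assuming a GloVe .txt file in the standard `word v1 v2 …` format; the function names here are hypothetical, not the ones in src:

```python
import numpy as np

def build_vocab(glove_path, dim=100):
    """Build word2idx and an embedding matrix from a GloVe .txt file.
    Index 0 is reserved for <pad>, index 1 for <unk> (out-of-vocabulary)."""
    word2idx = {"<pad>": 0, "<unk>": 1}
    vectors = [np.zeros(dim, dtype=np.float32), np.zeros(dim, dtype=np.float32)]
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word2idx[parts[0]] = len(vectors)
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return word2idx, np.stack(vectors)

def to_sequence(tokens, word2idx):
    """Word-by-word conversion to ids (the 'sequences' column)."""
    return [word2idx.get(t, word2idx["<unk>"]) for t in tokens]
```

Reserving index 0 for padding matters later, since `nn.Embedding` can be told to keep that row at zero via `padding_idx=0`.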
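
The log’s second model (a bidirectional LSTM with attention, hidden_dim 128, returning per-token weights for the heatmaps) could look roughly like this in PyTorch. This is a sketch under my own naming assumptions (`AttentionLSTM`, additive attention via a single linear layer), not the notebook’s actual code:

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """GloVe embeddings -> BiLSTM -> attention -> 6-way classifier.
    Returns both logits and per-token attention weights (for the heatmaps)."""
    def __init__(self, embeddings, hidden_dim=128, num_classes=6):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True, padding_idx=0)
        self.lstm = nn.LSTM(embeddings.size(1), hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x, mask=None):
        h, _ = self.lstm(self.embed(x))               # (B, T, 2H)
        scores = self.attn(h).squeeze(-1)             # (B, T)
        if mask is not None:                          # ignore <pad> positions
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)        # (B, T), sums to 1 per tweet
        context = (weights.unsqueeze(-1) * h).sum(1)  # (B, 2H) weighted average
        return self.out(context), weights
```

Returning `weights` alongside the logits is what makes the later per-token plots and heatmaps possible without re-running the attention layer.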
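
The log mentions class weights in the cross-entropy loss for the imbalanced classes. One common recipe (an assumption; the log does not say how its weights were computed) is inverse-frequency weighting:

```python
from collections import Counter

def class_weights(labels, num_classes=6):
    """Inverse-frequency weights for weighted cross-entropy: rare emotions
    get a larger weight so the loss doesn't ignore them. Weights are
    normalized so they average to 1."""
    counts = Counter(labels)
    raw = [len(labels) / (num_classes * counts[c]) for c in range(num_classes)]
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]
```

The resulting list can be passed to PyTorch as `nn.CrossEntropyLoss(weight=torch.tensor(w))`.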

Preprocessing pipeline:

  • Lowercase everything
  • Remove noise: html, urls
  • Hashtags are easy to handle: remove the ‘#’ and then separate the words
  • Keep ‘!’ and ‘?’ because they are primary indicators of surprise/anger
  • TweetTokenizer from nltk automatically manages contractions
  • Remove emojis (I didn’t find any in the dataset)
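
The rules above, minus tokenization, can be sketched as a single function. The real pipeline lives in the PreprocessPipeline class and uses nltk’s TweetTokenizer plus the emoji library; this regex-only version is a simplified stand-in, and it only strips the ‘#’ rather than splitting concatenated hashtag words (which needs a wordlist):

```python
import re

def preprocess(tweet):
    """Simplified sketch of the rules: lowercase, strip HTML/URLs,
    drop '#' from hashtags, keep '!' and '?' as emotion cues."""
    t = tweet.lower()
    t = re.sub(r"https?://\S+|www\.\S+", " ", t)   # remove URLs
    t = re.sub(r"<[^>]+>", " ", t)                 # remove HTML tags
    t = re.sub(r"#(\w+)", r"\1", t)                # '#word' -> 'word'
    t = re.sub(r"[^a-z0-9!?'\s]", " ", t)          # keep ! and ? (surprise/anger)
    return " ".join(t.split())                     # collapse whitespace
```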

Experiments

| Model   | lr   | weight_decay | hidden_dim | dropout | F1-macro | Notes                                   |
|---------|------|--------------|------------|---------|----------|-----------------------------------------|
| Base    | 1e-3 | default      | 256        | 0.3     | 0.84     | Overfit                                 |
| Base v2 | 3e-4 | 1e-5         | 256        | 0.0     | 0.8618   | Baseline optimized                      |
| LSTM v1 | 3e-4 | 1e-5         | 128        | 0       | 0.9086   | 1 layer; slower but best, more accurate |

Tasks

Optional/Remaining Tasks

  • Visualize Attention weights
  • Add a significance test (e.g. a paired t-test) between the F1s of the baseline and the F1s of the LSTM model
  • Retrain with smaller hidden dimensionality
  • Ablation Study isolating attention’s contribution
  • Draw architecture graph for the two models
  • Improve Github repository
  • Add project to hugging face

Github Repository and project architecture

  • Add pyproject.toml
  • Rewrite the README
  • Scripts
    • Add script for windows and linux that download dependencies automatically
    • Script to download glove
    • Script to download semeval dataset
    • Script to download eltea
    • Moved under datasets/raw
  • Create a preprocess notebook that merges the two datasets and saves the result in Parquet format to dataset/process
  • Sources
    • Create a class that computes the dictionary, to be used by the training function. It will use the downloaded GloVe vectors to build the vocabulary
  • Scripts
    • Add a script that removes outliers
    • Modify the script so that its output is saved to the datasets/final folder
  • Tests
    • Make a test that overfits the baseline on a smaller sample of 1000 items taken randomly from the final dataset
  • Edit Notebook:
    • Split preprocessing into another, smaller notebook
    • Split outlier removal (which creates the final dataset) and dictionary/vocabulary creation into a smaller script
    • Split training into another notebook
    • Split visualization into another, smaller notebook
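
The overfit-on-a-subset test planned under Tests can be written as a small pytest function. A toy MLP stands in for the real baseline here, and the step count and loss threshold are assumptions:

```python
import torch
import torch.nn as nn

def test_can_overfit_small_subset():
    """Sanity check: if a model cannot drive the training loss near zero
    on a tiny subset, that usually signals a bug in the data or the
    training loop. A toy MLP stands in for the real baseline."""
    torch.manual_seed(0)
    x = torch.randn(64, 20)          # stand-in for averaged embeddings
    y = torch.randint(0, 6, (64,))   # 6 emotion labels
    model = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 6))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(500):             # full-batch training on the subset
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    assert loss.item() < 0.2         # memorized -> pipeline wired correctly
```

On the real project this would draw 1000 random rows from the final dataset and train the actual baseline instead of the toy model.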

Comments

The EmoTweet-28 dataset isn’t available on official platforms.