
Combining Synthetic Minority Over-Sampling Technique and Multinomial Naive Bayes for Sentiment Analysis on Imbalanced Social Media Datasets

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Sentiment analysis on social media poses a significant challenge for researchers in the field of Natural Language Processing (NLP) due to the informal, ambiguous, and dynamic nature of the language used by users. This research proposes a methodology that combines the Synthetic Minority Over-sampling Technique (SMOTE) with the Multinomial Naive Bayes (MNB) classifier to enhance performance in sentiment classification tasks on imbalanced datasets. The methodological process includes text cleaning, stopword removal, and lemmatization, followed by vectorization using Term Frequency–Inverse Document Frequency (TF-IDF) to represent lexical features. The Chi-squared test is applied to select the most discriminative features, and hyperparameter optimization is carried out using GridSearchCV with cross-validation. The method was evaluated using a cyberbullying dataset of posts labeled with positive and negative polarity. Evaluation metrics include accuracy, precision, recall, F1-score, and the confusion matrix. Experimental results demonstrate that the proposed approach improves model performance, achieving an accuracy of 88.99%, a precision of 89.14%, a recall of 88.99%, and an F1-score of 88.85%, showing the effectiveness of the SMOTE + Naive Bayes combination in mitigating class imbalance.
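The oversampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes the standard SMOTE idea of creating a synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours. The function name smote_oversample, the toy two-dimensional vectors, and the parameters k and seed are hypothetical and chosen only for illustration; in the paper's setting the inputs would be TF-IDF feature vectors.

```python
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (assumed data, not taken from the paper).
minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
new_points = smote_oversample(minority, n_new=4)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only, so that evaluation data contains no synthetic samples.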

Original language: English
Title of host publication: Information and Communication Technologies - 13th Ecuadorian Conference, TICEC 2025, Proceedings
Editors: Santiago Berrezueta, Tatiana Gualotuña, Efrain R. Fonseca C., Germania Rodriguez Morales, Jorge Maldonado-Mahauad
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 3-17
Number of pages: 15
ISBN (Print): 9783032083654
State: Published - 2026
Event: 13th Ecuadorian Conference on Information and Communication Technologies, TICEC 2025 - Quito, Ecuador
Duration: 16 Oct 2025 to 17 Oct 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2707 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 13th Ecuadorian Conference on Information and Communication Technologies, TICEC 2025
Country/Territory: Ecuador
City: Quito
Period: 16/10/25 to 17/10/25

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.

Keywords

  • Machine Learning
  • Multinomial Naive Bayes
  • Natural Language Processing (NLP)
  • Sentiment Analysis
  • SMOTE (Synthetic Minority Over-sampling Technique)
