Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data

Paulina Morillo, Diego Bahamonde, Wilian Tapia

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Detecting malware is crucial to avoiding severe damage to a computer system. However, training Machine Learning algorithms for this task can be complicated because the data are often imbalanced. One of the challenges in binary classification is learning to distinguish clearly between two classes when one class has far more instances than the other. To decrease bias and handle imbalance, some techniques increase the number of minority-class instances or reduce the number of majority-class instances. This paper analyzes the performance of three cost-sensitive classifiers, Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), trained with an imbalanced malware detection dataset and four artificial datasets built using the Near Miss, SMOTE, SMOTEENN, and SMOTETomek resampling techniques. The results show that Near Miss achieves a proper balance between the classes, so that the algorithms increase their overall performance, reaching balanced accuracies greater than 95%. The other techniques slightly increase the ability of the classifiers to identify objects of the minority class. Among the classifiers, Random Forest achieved balanced and high performance. In addition, the training and testing times for the oversampling and hybrid techniques are far longer than those for undersampling, since the latter reduces the number of instances processed by the models.
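The abstract reports balanced accuracy rather than plain accuracy, and the keywords also list the G-Mean and the confusion matrix. The following minimal sketch illustrates why these metrics matter on imbalanced data. It is not the authors' code: the helper names (`confusion_counts`, `summarize`), the 95/5 class split, and the choice of label 1 as the minority class are all assumptions made for illustration.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def summarize(y_true, y_pred, positive=1):
    """Compute accuracy, balanced accuracy, and G-mean from the confusion counts."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    # Sensitivity: recall on the positive class; specificity: recall on the negative class.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical toy data: 95 majority-class samples (0) and 5 minority-class samples (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
always_majority = [0] * 100

scores = summarize(y_true, always_majority)
print(scores)
```

On this toy data the degenerate classifier scores 0.95 on plain accuracy but only 0.5 on balanced accuracy and 0.0 on G-mean, because it never identifies a minority-class sample. This is the kind of bias that resampling and cost-sensitive training aim to reduce.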

Original language: English
Title of host publication: Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1
Editors: Kohei Arai
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 12
ISBN (Print): 9783031477201
State: Published - 2024
Event: Intelligent Systems Conference, IntelliSys 2023 - Amsterdam, Netherlands
Duration: 7 Sep 2023 to 8 Sep 2023

Publication series

Name: Lecture Notes in Networks and Systems


Conference: Intelligent Systems Conference, IntelliSys 2023

Bibliographical note

Publisher Copyright:
© 2024, The Author(s), under exclusive license to Springer Nature Switzerland AG.


  • AUC
  • Balanced Accuracy
  • Binary Classification
  • Confusion Matrix
  • G-Mean
  • Hybrid
  • Oversampling
  • Re-Sample
  • Undersampling

