Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data

Paulina Morillo; Diego Bahamonde; Wilian Tapia

doi:10.1007/978-3-031-47721-8_33

Spatial Data Infrastructure Research Group Artificial Intelligence Geoportals and Applied Computing (IDE IA GEO CA)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Detecting malware is crucial to avoid severe damage to a computer system. However, training Machine Learning algorithms for this task can be complicated because the data are often imbalanced. One of the challenges in binary classification is therefore learning to distinguish clearly between two classes when one class has far more instances than the other. To decrease bias and handle imbalance, some techniques increase the number of minority-class cases or reduce the number of majority-class cases. This paper analyzes the performance of three cost-sensitive classifiers, Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), trained with an imbalanced malware detection dataset and four artificial datasets built using the Near Miss, SMOTE, SMOTEENN, and SMOTETomek resampling techniques. The results show that Near Miss achieves a proper balance between the classes, so that the algorithms increase their overall performance, reaching balanced accuracies greater than 95%. The remaining techniques, by contrast, slightly increase the ability of the classifiers to identify objects of the minority class. Random Forest achieved balanced and high performance. In addition, training and testing times for the oversampling and hybrid techniques are far longer than those obtained with undersampling, since the latter reduces the number of instances processed by the models.
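
As an illustration of the ideas in the abstract, the following minimal Python sketch (not the authors' code) shows why plain accuracy can mislead on imbalanced data and how undersampling equalizes class counts. It assumes a toy dataset of 95 benign and 5 malware labels. The helper names `random_undersample` and `balanced_accuracy` are hypothetical, and the sampler is plain random undersampling rather than Near Miss, which selects majority samples by their distance to minority samples.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop items until every class matches the smallest class size.

    Assumption: this is plain random undersampling, used here only as a
    simple stand-in. It is NOT the Near Miss technique named in the abstract.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        out_x.extend(rng.sample(items, target))
        out_y.extend([y] * target)
    return out_x, out_y


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Toy data (hypothetical): 95 benign (0) and 5 malware (1) samples.
labels = [0] * 95 + [1] * 5
samples = list(range(100))

# A trivial model that always predicts "benign".
always_benign = [0] * 100
plain_acc = sum(t == p for t, p in zip(labels, always_benign)) / len(labels)
bal_acc = balanced_accuracy(labels, always_benign)

# After undersampling, both classes have the same number of samples.
x_bal, y_bal = random_undersample(samples, labels)
class_counts = Counter(y_bal)
```

In this toy setting the always-benign model scores high on plain accuracy while its balanced accuracy stays at chance level, which is why the abstract reports balanced accuracy. Undersampling also shrinks the training set, consistent with the shorter training and testing times the abstract attributes to it.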

Original language	English
Title of host publication	Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1
Editors	Kohei Arai
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	496-507
Number of pages	12
ISBN (Print)	9783031477201
DOIs	https://doi.org/10.1007/978-3-031-47721-8_33
State	Published - 2024
Event	Intelligent Systems Conference, IntelliSys 2023 - Amsterdam, Netherlands Duration: 7 Sep 2023 → 8 Sep 2023

Publication series

Name	Lecture Notes in Networks and Systems
Volume	822
ISSN (Print)	2367-3370
ISSN (Electronic)	2367-3389

Conference

Conference	Intelligent Systems Conference, IntelliSys 2023
Country/Territory	Netherlands
City	Amsterdam
Period	7/09/23 → 8/09/23

Bibliographical note

Publisher Copyright:
© 2024, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Keywords

AUC
Balance Accuracy
Binary Classification
Confusion Matrix
G-Mean
Hybrid
Oversampling
Re-Sample
Undersampling
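
Two of the keywords above, Confusion Matrix and G-Mean, can be related with a short sketch. It assumes the common definition of G-Mean as the geometric mean of sensitivity and specificity; the function name `imbalance_metrics` and the confusion-matrix counts are hypothetical and are not results from the paper.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Derive sensitivity, specificity, and G-Mean from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall on the positive (malware) class
    specificity = tn / (tn + fp)  # recall on the negative (benign) class
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical counts: 50 malware samples and 950 benign samples.
metrics = imbalance_metrics(tp=40, fn=10, tn=900, fp=50)
```

Because G-Mean multiplies the two per-class rates, a classifier that ignores the minority class entirely gets a G-Mean of zero regardless of how well it handles the majority class.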

Cite this

Morillo, P., Bahamonde, D., & Tapia, W. (2024). Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data. In K. Arai (Ed.), Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1 (pp. 496-507). (Lecture Notes in Networks and Systems; Vol. 822). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-47721-8_33

Morillo, Paulina ; Bahamonde, Diego ; Tapia, Wilian. / Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data. Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1. editor / Kohei Arai. Springer Science and Business Media Deutschland GmbH, 2024. pp. 496-507 (Lecture Notes in Networks and Systems).

@inproceedings{1013ff9da556411b8164daed348b3407,
  title = "Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data",
  abstract = "Detecting malware is crucial to avoid severe damage to a computer system. However, training Machine Learning algorithms for this task can be complicated because the data are often imbalanced. One of the challenges in binary classification is therefore learning to distinguish clearly between two classes when one class has far more instances than the other. To decrease bias and handle imbalance, some techniques increase the number of minority-class cases or reduce the number of majority-class cases. This paper analyzes the performance of three cost-sensitive classifiers, Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), trained with an imbalanced malware detection dataset and four artificial datasets built using the Near Miss, SMOTE, SMOTEENN, and SMOTETomek resampling techniques. The results show that Near Miss achieves a proper balance between the classes, so that the algorithms increase their overall performance, reaching balanced accuracies greater than 95\%. The remaining techniques, by contrast, slightly increase the ability of the classifiers to identify objects of the minority class. Random Forest achieved balanced and high performance. In addition, training and testing times for the oversampling and hybrid techniques are far longer than those obtained with undersampling, since the latter reduces the number of instances processed by the models.",
  keywords = "AUC, Balance Accuracy, Binary Classification, Confusion Matrix, G-Mean, Hybrid, Oversampling, Re-Sample, Undersampling",
  author = "Paulina Morillo and Diego Bahamonde and Wilian Tapia",
  note = "Publisher Copyright: {\textcopyright} 2024, The Author(s), under exclusive license to Springer Nature Switzerland AG.; Intelligent Systems Conference, IntelliSys 2023 ; Conference date: 07-09-2023 Through 08-09-2023",
  year = "2024",
  doi = "10.1007/978-3-031-47721-8_33",
  language = "English",
  isbn = "9783031477201",
  series = "Lecture Notes in Networks and Systems",
  volume = "822",
  publisher = "Springer Science and Business Media Deutschland GmbH",
  pages = "496--507",
  editor = "Kohei Arai",
  booktitle = "Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1",
  address = "Germany",
}

Morillo, P, Bahamonde, D & Tapia, W 2024, Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data. in K Arai (ed.), Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1. Lecture Notes in Networks and Systems, vol. 822, Springer Science and Business Media Deutschland GmbH, pp. 496-507, Intelligent Systems Conference, IntelliSys 2023, Amsterdam, Netherlands, 7/09/23. https://doi.org/10.1007/978-3-031-47721-8_33

Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data. / Morillo, Paulina; Bahamonde, Diego; Tapia, Wilian.
Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1. ed. / Kohei Arai. Springer Science and Business Media Deutschland GmbH, 2024. p. 496-507 (Lecture Notes in Networks and Systems; Vol. 822).

TY  - GEN
T1  - Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data
AU  - Morillo, Paulina
AU  - Bahamonde, Diego
AU  - Tapia, Wilian
PY  - 2024
Y1  - 2024
AB  - Detecting malware is crucial to avoid severe damage to a computer system. However, training Machine Learning algorithms for this task can be complicated because the data are often imbalanced. One of the challenges in binary classification is therefore learning to distinguish clearly between two classes when one class has far more instances than the other. To decrease bias and handle imbalance, some techniques increase the number of minority-class cases or reduce the number of majority-class cases. This paper analyzes the performance of three cost-sensitive classifiers, Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), trained with an imbalanced malware detection dataset and four artificial datasets built using the Near Miss, SMOTE, SMOTEENN, and SMOTETomek resampling techniques. The results show that Near Miss achieves a proper balance between the classes, so that the algorithms increase their overall performance, reaching balanced accuracies greater than 95%. The remaining techniques, by contrast, slightly increase the ability of the classifiers to identify objects of the minority class. Random Forest achieved balanced and high performance. In addition, training and testing times for the oversampling and hybrid techniques are far longer than those obtained with undersampling, since the latter reduces the number of instances processed by the models.
KW  - AUC
KW  - Balance Accuracy
KW  - Binary Classification
KW  - Confusion Matrix
KW  - G-Mean
KW  - Hybrid
KW  - Oversampling
KW  - Re-Sample
KW  - Undersampling
UR  - http://www.scopus.com/inward/record.url?scp=85182509408&partnerID=8YFLogxK
U2  - 10.1007/978-3-031-47721-8_33
DO  - 10.1007/978-3-031-47721-8_33
M3  - Conference contribution
AN  - SCOPUS:85182509408
SN  - 9783031477201
T3  - Lecture Notes in Networks and Systems
SP  - 496
EP  - 507
BT  - Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1
A2  - Arai, Kohei
PB  - Springer Science and Business Media Deutschland GmbH
T2  - Intelligent Systems Conference, IntelliSys 2023
Y2  - 7 September 2023 through 8 September 2023
ER  -

Morillo P, Bahamonde D, Tapia W. Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data. In Arai K, editor, Intelligent Systems and Applications - Proceedings of the 2023 Intelligent Systems Conference IntelliSys Volume 1. Springer Science and Business Media Deutschland GmbH. 2024. p. 496-507. (Lecture Notes in Networks and Systems). doi: 10.1007/978-3-031-47721-8_33