
A Structured Approach to Software Defect Classification and Explanation: Random Forest and Gradient Boosting Ensembles with a Focus on Prediction Interpretability

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Software defect prediction is crucial for reducing costs and improving quality. According to a Cutter Consortium report, software defects cause an estimated annual loss of $1.56 trillion in global productivity, and Tricentis reported that over 30% of software development projects failed due to undetected defects. Undetected defects can increase maintenance costs, delay deliveries, and compromise security, particularly in critical applications such as financial or medical systems. A significant challenge is class imbalance: defect-free modules far outnumber defective ones, which biases models toward the majority class and makes defects difficult to detect. This study proposes a four-phase approach: loading and transforming the data, balancing the classes, training machine learning models, and explaining their predictions. SMOTE, ADASYN, and random undersampling were used to balance the data before training models such as Random Forest, Gradient Boosting, and SVM. The JM1 dataset, which contains software quality metrics and in which 80% of modules are defect-free, was used for the analysis. Data preprocessing involved imputation, encoding, and normalization. The results show that Random Forest and Gradient Boosting, combined with balancing techniques, achieved the best performance in defect identification. Future work will explore advanced algorithms such as XGBoost and LightGBM and will tune hyperparameters to further improve results. The approach aims to improve software defect detection and could be applied in other fields.
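The balancing phase mentioned in the abstract can be illustrated with a minimal SMOTE-style sketch. This is not the authors' implementation, which is not shown here and most likely relied on an existing library. The function names `smote_oversample` and `balance`, the toy feature values, and the parameter defaults are all assumptions made for illustration. The sketch uses only the Python standard library and creates synthetic minority-class rows by interpolating between a defective module and one of its nearest defective neighbours.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic rows by interpolating between minority samples
    and their k nearest minority neighbours (SMOTE-style sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than peers
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_needed = max(0, n_majority - len(minority))
    new_rows = smote_oversample(minority, n_needed, k=k, seed=seed)
    return X + new_rows, y + [minority_label] * len(new_rows)


# Hypothetical toy data with the 80/20 split described for JM1:
# two made-up metrics per module, label 1 = defective.
X = [[10, 1], [12, 1], [11, 2], [9, 1], [13, 2],
     [10, 2], [12, 2], [11, 1], [40, 9], [44, 11]]
y = [0] * 8 + [1] * 2

X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))  # prints: 8 8
```

In practice, oversampling of this kind is applied only to the training split, so that synthetic rows never leak into the data used to evaluate the classifier.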

Original language: English
Title of host publication: Proceedings of 10th International Congress on Information and Communication Technology - ICICT 2025
Editors: Xin-She Yang, Simon Sherratt, Nilanjan Dey, Amit Joshi
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 409-420
Number of pages: 12
ISBN (Print): 9789819664405
DOIs
State: Published - 2025
Event: 10th International Congress on Information and Communication Technology, ICICT 2025 - London, United Kingdom
Duration: 18 Feb 2025 to 21 Feb 2025

Publication series

Name: Lecture Notes in Networks and Systems
Volume: 1416 LNNS
ISSN (Print): 2367-3370
ISSN (Electronic): 2367-3389

Conference

Conference: 10th International Congress on Information and Communication Technology, ICICT 2025
Country/Territory: United Kingdom
City: London
Period: 18/02/25 to 21/02/25

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

Keywords

  • Class balancing
  • Evaluation metrics
  • Machine learning models
  • Random forest and gradient boosting
  • Software defect prediction
