An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction
Software defect prediction is a practical approach to improving the quality and efficiency of software testing processes. However, establishing robust and trustworthy models for software defect prediction is quite challenging due to the limitation of historical datasets that most developers are capa...
Main Authors: | Agung Fatwanto, Muh Nur Aslam, Rebbecah Ndugi, Muhammad Syafrudin |
---|---|
Format: | Article |
Language: | English |
Published: | Ikatan Ahli Informatika Indonesia, 2024-10-01 |
Series: | Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) |
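Subjects: | software defect prediction; machine learning; classification algorithm; imbalanced data; resampling |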
Online Access: | https://jurnal.iaii.or.id/index.php/RESTI/article/view/5910 |
_version_ | 1841544056255021056 |
---|---|
author | Agung Fatwanto; Muh Nur Aslam; Rebbecah Ndugi; Muhammad Syafrudin |
collection | DOAJ |
description | Software defect prediction is a practical approach to improving the quality and efficiency of software testing processes. However, establishing robust and trustworthy models for software defect prediction is challenging because the historical datasets that most developers are able to collect are limited. The inherently imbalanced nature of most software defect datasets poses another problem. Insight into how to properly construct software defect prediction models on a small yet imbalanced dataset is therefore required. The objective of this study is to provide that insight by investigating and comparing a number of resampling techniques, classification algorithms, and evaluation measurements (metrics) for building software defect prediction models on the CM1 NASA PROMISE data, taken as representative of a small yet imbalanced dataset. This study is comparative descriptive research and follows a positivist (quantitative) approach. Data were collected by observing experiments on four categories of resampling techniques (oversampling, undersampling, ensemble, and combined) paired with three categories of machine learning classification algorithms (traditional, ensemble, and neural network) to predict defective software modules in the CM1 NASA PROMISE dataset. Training was carried out twice, using 5-fold cross-validation and a 70% training / 30% testing holdout split. Our results show that the combined and oversampling techniques have a positive effect on model performance. Among the classification models, ensemble-based algorithms that extend the decision tree mechanism, such as Random Forest and eXtreme Gradient Boosting, achieved sufficiently good performance in predicting defective software modules. Regarding the evaluation measurements, the combined and rank-based performance metrics yielded modest variance values and are therefore deemed suitable for evaluating model performance in this context. |
format | Article |
id | doaj-art-c6162587b34144bfbc7309c87594eae2 |
institution | Kabale University |
issn | 2580-0760 |
language | English |
publishDate | 2024-10-01 |
publisher | Ikatan Ahli Informatika Indonesia |
record_format | Article |
series | Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) |
spelling | doaj-art-c6162587b34144bfbc7309c87594eae2; 2025-01-13T03:31:56Z; eng; Ikatan Ahli Informatika Indonesia; Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi); ISSN 2580-0760; 2024-10-01; vol. 8, no. 5, pp. 631-643; DOI 10.29207/resti.v8i5.5910; authors: Agung Fatwanto, Muh Nur Aslam, Rebbecah Ndugi, Muhammad Syafrudin; affiliations (in the order listed in the record): UIN Sunan Kalijaga Yogyakarta, UIN Sunan Kalijaga Yogyakarta, St. Petersburg State University, Sejong University |
title | An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction |
topic | software defect prediction; machine learning; classification algorithm; imbalanced data; resampling |
url | https://jurnal.iaii.or.id/index.php/RESTI/article/view/5910 |