An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction

Software defect prediction is a practical approach to improving the quality and efficiency of software testing processes. However, establishing robust and trustworthy models for software defect prediction is quite challenging due to the limitation of historical datasets that most developers are capa...

Bibliographic Details
Main Authors:	Agung Fatwanto, Muh Nur Aslam, Rebbecah Ndugi, Muhammad Syafrudin
Format:	Article
Language:	English
Published:	Ikatan Ahli Informatika Indonesia 2024-10-01
Series:	Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
Subjects:	software defect prediction, machine learning, classification algorithm, imbalanced data, resampling
Online Access:	https://jurnal.iaii.or.id/index.php/RESTI/article/view/5910

author	Agung Fatwanto, Muh Nur Aslam, Rebbecah Ndugi, Muhammad Syafrudin
collection	DOAJ
description	Software defect prediction is a practical approach to improving the quality and efficiency of software testing processes. However, establishing robust and trustworthy models for software defect prediction is challenging because the historical datasets that most developers are able to collect are limited. The inherently imbalanced nature of most software defect datasets poses a further problem. Insight into how to properly construct software defect prediction models on a small yet imbalanced dataset is therefore required. The objective of this study is to provide that insight by investigating and comparing a number of resampling techniques, classification algorithms, and evaluation metrics for building software defect prediction models on the CM1 NASA PROMISE dataset, taken as representative of a small yet imbalanced dataset. This study is comparative descriptive research following a positivist (quantitative) approach. Data were collected by observing experiments on four categories of resampling techniques (oversampling, undersampling, ensemble, and combined) paired with three categories of machine learning classification algorithms (traditional, ensemble, and neural network) to predict defective software modules in the CM1 NASA PROMISE dataset. Training was carried out twice, using 5-fold cross-validation and a 70% training / 30% testing holdout split, respectively. Our results show that the combined and oversampling techniques have a positive effect on model performance. Among the classification models, ensemble-based algorithms that extend the decision tree mechanism, such as Random Forest and eXtreme Gradient Boosting, achieved sufficiently good performance in predicting defective software modules. Regarding the evaluation measurements, the combined and rank-based performance metrics yielded modest variance values and are therefore deemed suitable for evaluating the performance of the models in this context.
format	Article
id	doaj-art-c6162587b34144bfbc7309c87594eae2
institution	Kabale University
issn	2580-0760
language	English
publishDate	2024-10-01
publisher	Ikatan Ahli Informatika Indonesia
record_format	Article
series	Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
spelling	doaj-art-c6162587b34144bfbc7309c87594eae2; 2025-01-13T03:31:56Z; eng; Ikatan Ahli Informatika Indonesia; Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi); ISSN 2580-0760; 2024-10-01; vol. 8, no. 5, pp. 631-643; DOI 10.29207/resti.v8i5.5910; Agung Fatwanto (UIN Sunan Kalijaga Yogyakarta), Muh Nur Aslam (UIN Sunan Kalijaga Yogyakarta), Rebbecah Ndugi (St. Petersburg State University), Muhammad Syafrudin (Sejong University)
title	An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction
topic	software defect prediction, machine learning, classification algorithm, imbalanced data, resampling
url	https://jurnal.iaii.or.id/index.php/RESTI/article/view/5910
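
The evaluation setup named in the description (resampling combined with a 70%/30% holdout split and 5-fold cross-validation) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the toy data is not the CM1 dataset, and plain random oversampling stands in for whichever oversampling techniques the study actually used. Applying the resampling to the training portion only is a common practice, not something the record states.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def holdout_indices(n, test_frac=0.3, seed=0):
    """Shuffle indices and split them into (train, test) lists for a holdout evaluation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_frac)
    return idx[n_test:], idx[:n_test]


def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Hypothetical toy data: 100 modules, 10 labelled defective (1), 90 clean (0).
X = [[float(i)] for i in range(100)]
y = [1 if i < 10 else 0 for i in range(100)]

# Holdout: split first, then balance only the training portion,
# so duplicated rows never appear in the test portion.
train_idx, test_idx = holdout_indices(len(y))
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_bal, y_bal = random_oversample(X_train, y_train)

# 5-fold cross-validation: the same balancing step would be repeated inside each fold.
fold_sizes = [(len(tr), len(te)) for tr, te in kfold_indices(len(y))]
```

A classifier would then be fitted on `X_bal`, `y_bal` and scored on the untouched test indices. The splits in this sketch are not stratified, which a real experiment on a small imbalanced dataset would likely want.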

Similar Items