Source Code Error Understanding Using BERT for Multi-Label Classification
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Online Access: | https://ieeexplore.ieee.org/document/10820190/ |
Summary: | Programming is an essential skill in computer science and across a wide range of engineering disciplines. However, errors, often referred to as ‘bugs’ in code, can be challenging to identify and rectify for both students learning to program and experienced professionals. Understanding, identifying, and effectively addressing these errors are critical aspects of programming education and software development. To aid in understanding and classifying these errors, we propose a multi-label error classification approach for source code using fine-tuned BERT models (BERT_Uncased and BERT_Cased). The models achieved average classification accuracies of 90.58% and 90.80%, exact match accuracies of 48.28% and 49.13%, and weighted F1 scores of 0.796 and 0.799, respectively. Precision, Recall, Hamming Loss, and ROC-AUC metrics further evaluate the effectiveness of our models. Additionally, we employed several combinations of large language models (CodeT5, CodeBERT) with machine learning classifiers (Decision Tree, Random Forest, Ensemble Learning, ML-KNN), demonstrating the superiority of our proposed approach. These findings highlight the potential of multi-label error classification to advance programming education, software engineering, and related research fields. |
---|---|
ISSN: | 2169-3536 |
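
The summary reports a high average classification accuracy (about 90%) next to a much lower exact match accuracy (about 48-49%). The sketch below shows, on invented toy data, how multi-label metrics of this kind can diverge. It is not the authors' evaluation code. It assumes that "average classification accuracy" means per-label accuracy (one minus Hamming loss), which is one common reading but is not confirmed by the record. All function names and the toy labels are hypothetical.

```python
from typing import List

Labels = List[List[int]]  # one row per code sample, one 0/1 entry per error label


def exact_match_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of samples whose full label vector is predicted correctly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(t) for t in y_true)
    wrong = sum(
        1
        for t, p in zip(y_true, y_pred)
        for ti, pi in zip(t, p)
        if ti != pi
    )
    return wrong / total


def weighted_f1(y_true: Labels, y_pred: Labels) -> float:
    """Per-label F1 averaged with weights equal to each label's support."""
    n_labels = len(y_true[0])
    weighted_sum = 0.0
    total_support = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        support = tp + fn  # number of true positives for this label in y_true
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Toy data (assumption, not from the paper): 4 samples, 3 hypothetical error labels.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

exact = exact_match_accuracy(y_true, y_pred)
h_loss = hamming_loss(y_true, y_pred)
label_acc = 1.0 - h_loss  # assumed reading of "average classification accuracy"
w_f1 = weighted_f1(y_true, y_pred)

print(f"exact match accuracy: {exact:.3f}")
print(f"Hamming loss:         {h_loss:.3f}")
print(f"per-label accuracy:   {label_acc:.3f}")
print(f"weighted F1:          {w_f1:.3f}")
```

In this toy case only two of twelve label decisions are wrong, yet half of the samples fail the exact match test, because a single wrong label spoils the whole vector. The same effect can explain a gap between per-label and exact match figures in any multi-label setting.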