LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach
Email phishing remains a prevalent and sophisticated cyber threat, targeting individuals and organizations by disguising malicious intent in seemingly legitimate communications. Effective classification of phishing and legitimate emails is crucial for cybersecurity. In this study, we investigated va...
| Main Authors: | Aqsa Khalid, Maria Hanif, Abdul Hameed, Zeeshan Ashraf, Mrim M. Alnfiai, Salma M. Mohsen Alnefaie |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | TF-IDF, Word2Vec, Doc2Vec, LogiTriBlend, SVM, XGBoost |
| Online Access: | https://ieeexplore.ieee.org/document/10804110/ |
| _version_ | 1846107051883757568 |
|---|---|
| author | Aqsa Khalid, Maria Hanif, Abdul Hameed, Zeeshan Ashraf, Mrim M. Alnfiai, Salma M. Mohsen Alnefaie |
| author_sort | Aqsa Khalid |
| collection | DOAJ |
| description | Email phishing remains a prevalent and sophisticated cyber threat, targeting individuals and organizations by disguising malicious intent in seemingly legitimate communications. Effective classification of phishing and legitimate emails is crucial for cybersecurity. In this study, we investigated various text vectorization techniques and machine learning models to address the challenge of email classification. We used three vectorization techniques: TF-IDF, Word2Vec, and Doc2Vec. These techniques were applied to traditional machine learning algorithms, and their performance was evaluated against a proposed stacking model, LogiTriBlend. The dataset comprised 501 phishing and 4090 legitimate emails, which underwent preprocessing steps such as stemming, lemmatization, and noise removal. To handle the dataset’s imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was employed. The model combines multiple base learners, including Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost, with a Logistic Regression meta-learner. The experimental results indicated that the LogiTriBlend model achieved an accuracy of 99.34% using Doc2Vec, outperforming the Word2Vec and TF-IDF feature extraction methods, which obtained accuracies of 99.12% and 98.80%, respectively. The Doc2Vec method thus resulted in superior email classification performance. Among the models tested, the proposed stacking model, LogiTriBlend, demonstrated robust results; however, the highest accuracy was consistently achieved using Doc2Vec. |
| format | Article |
| id | doaj-art-3bd4a7ca1cc94ff5bf5cf9a8b2adb97b |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-3bd4a7ca1cc94ff5bf5cf9a8b2adb97b; 2024-12-27T00:00:52Z; eng; IEEE; IEEE Access; 2169-3536; 2024-01-01; vol. 12, pp. 193807-193821; 10.1109/ACCESS.2024.3518923; 10804110; LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach; Aqsa Khalid (0); Maria Hanif (1); Abdul Hameed (2) https://orcid.org/0000-0002-6842-8631; Zeeshan Ashraf (3) https://orcid.org/0000-0002-2700-5982; Mrim M. Alnfiai (4) https://orcid.org/0000-0003-3837-6313; Salma M. Mohsen Alnefaie (5); (0) School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan; (1) School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan; (2) Department of Computer Science, The University of Chenab Gujrat, Gujrat, Punjab, Pakistan; (3) Department of Computer Science, Faculty of Computing and IT, IISAT, Gujranwala, Punjab, Pakistan; (4) Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia; (5) Physics Department, Taif University, Taif, Saudi Arabia; https://ieeexplore.ieee.org/document/10804110/; TF-IDF, Word2Vec, Doc2Vec, LogiTriBlend, SVM, XGBoost |
| title | LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach |
| topic | TF-IDF, Word2Vec, Doc2Vec, LogiTriBlend, SVM, XGBoost |
| url | https://ieeexplore.ieee.org/document/10804110/ |
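
The description reports a class imbalance of 501 phishing to 4090 legitimate emails, handled with SMOTE. The sketch below illustrates the basic SMOTE idea only: each synthetic point is an interpolation between a minority sample and one of its nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, the toy 2-D vectors, the choice of `k`, and the 1:1 target ratio are all assumptions made for illustration; the record does not state the target ratio or whether oversampling was applied to the training split only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (basic SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Class counts reported in the description.
N_PHISHING, N_LEGITIMATE = 501, 4090
# Assumption: the minority class is oversampled up to a 1:1 ratio.
n_needed = N_LEGITIMATE - N_PHISHING

# Hypothetical 2-D vectors standing in for document embeddings.
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_synthetic = smote(toy_minority, n_synthetic=6, k=2)
```

Under the 1:1 assumption, `n_needed` is the number of synthetic phishing samples required to match the legitimate class. Because every synthetic point lies on a segment between two real minority samples, the toy output stays inside the unit square spanned by `toy_minority`.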