LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach

Email phishing remains a prevalent and sophisticated cyber threat, targeting individuals and organizations by disguising malicious intent in seemingly legitimate communications. Effective classification of phishing and legitimate emails is crucial for cybersecurity. In this study, we investigated va...

Bibliographic Details
Main Authors: Aqsa Khalid, Maria Hanif, Abdul Hameed, Zeeshan Ashraf, Mrim M. Alnfiai, Salma M. Mohsen Alnefaie
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: TF-IDF, Word2Vec, Doc2Vec, LogiTriBlend, SVM, XGBoost
Online Access: https://ieeexplore.ieee.org/document/10804110/
_version_ 1846107051883757568
author Aqsa Khalid
Maria Hanif
Abdul Hameed
Zeeshan Ashraf
Mrim M. Alnfiai
Salma M. Mohsen Alnefaie
author_sort Aqsa Khalid
collection DOAJ
description Email phishing remains a prevalent and sophisticated cyber threat that targets individuals and organizations by disguising malicious intent in seemingly legitimate communications. Effective classification of phishing and legitimate emails is crucial for cybersecurity. In this study, we investigated various text vectorization techniques and machine learning models to address the challenge of email classification. We used three vectorization techniques: TF-IDF, Word2Vec, and Doc2Vec. These techniques were applied to traditional machine learning algorithms, and their performance was evaluated against a proposed stacking model, LogiTriBlend. The dataset comprised 501 phishing and 4090 legitimate emails, which underwent preprocessing steps such as stemming, lemmatization, and noise removal. To handle the dataset’s imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was employed. LogiTriBlend combines multiple base learners, including Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost, with a Logistic Regression meta-learner. The experimental results indicated that LogiTriBlend achieved an accuracy of 99.34% with Doc2Vec, compared with 99.12% with Word2Vec and 98.80% with TF-IDF. The Doc2Vec method thus resulted in the best email classification performance. Among the models tested, the proposed stacking model, LogiTriBlend, demonstrated robust results; however, the highest accuracy was consistently achieved using Doc2Vec.
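The description reports 501 phishing and 4090 legitimate emails and states that SMOTE was used to handle the imbalance. As a rough illustration only, and not the authors' implementation, the pure-Python sketch below shows the basic SMOTE idea: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. The function name `smote`, the parameter `k`, the seeds, and the toy 2-D vectors are assumptions made for illustration; the record does not say which SMOTE implementation or settings were used, or whether oversampling was applied to the full dataset or only to a training split.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea).
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Class counts reported in the description.
N_PHISHING, N_LEGITIMATE = 501, 4090
# Synthetic phishing samples needed for a 1:1 balance, assuming (hypothetically)
# that oversampling is applied to the full dataset.
n_needed = N_LEGITIMATE - N_PHISHING

# Toy 2-D vectors standing in for email embeddings (hypothetical data).
data_rng = random.Random(42)
toy_minority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)]
toy_synthetic = smote(toy_minority, n_new=30, k=5)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class rather than duplicating existing emails outright.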
format Article
id doaj-art-3bd4a7ca1cc94ff5bf5cf9a8b2adb97b
institution Kabale University
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-3bd4a7ca1cc94ff5bf5cf9a8b2adb97b
2024-12-27T00:00:52Z
eng
IEEE
IEEE Access
2169-3536
2024-01-01
Volume 12, pages 193807-193821
10.1109/ACCESS.2024.3518923
10804110
LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach
Aqsa Khalid (0)
Maria Hanif (1)
Abdul Hameed (2), https://orcid.org/0000-0002-6842-8631
Zeeshan Ashraf (3), https://orcid.org/0000-0002-2700-5982
Mrim M. Alnfiai (4), https://orcid.org/0000-0003-3837-6313
Salma M. Mohsen Alnefaie (5)
(0) School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
(1) School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
(2) Department of Computer Science, The University of Chenab Gujrat, Gujrat, Punjab, Pakistan
(3) Department of Computer Science, Faculty of Computing and IT, IISAT, Gujranwala, Punjab, Pakistan
(4) Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
(5) Physics Department, Taif University, Taif, Saudi Arabia
https://ieeexplore.ieee.org/document/10804110/
TF-IDF; Word2Vec; Doc2Vec; LogiTriBlend; SVM; XGBoost
title LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach
topic TF-IDF
Word2Vec
Doc2Vec
LogiTriBlend
SVM
XGBoost
url https://ieeexplore.ieee.org/document/10804110/