Machine Learning Framework for Phishing Detection through Email using Imbalanced Data
Mher Mkrtumyan
Affiliation: Horizon Academic Research Program
IJSCAR Vol. 3, Issue 1 (2026) · pp. 24–30
Abstract
Phishing emails remain a persistent cybersecurity threat as attackers employ increasingly sophisticated techniques to deceive users into disclosing sensitive information. Detection is particularly challenging in real-world environments, where legitimate emails vastly outnumber malicious ones. This study presents a dual-layer machine learning framework for phishing detection that independently analyzes sender metadata and email body content. The sender layer evaluates structural characteristics of email addresses, while the content layer extracts linguistic and statistical features from the email text. Each layer produces a probability score representing the likelihood of phishing; a meta-classification model then combines the two scores to generate a final decision. The framework is evaluated on a large real-world dataset containing over 500,000 emails with a highly imbalanced class distribution. Experimental results demonstrate that the proposed approach provides robust and reliable performance under realistic conditions, highlighting the effectiveness of integrating multiple analytical perspectives for practical phishing detection.
Keywords: Phishing Detection, Machine Learning, Dual-Layer Framework, Email Security, Imbalanced Data, Sender Metadata, Email Content Analysis, Random Forest, XGBoost, LightGBM, ROC Curve, Logistic Regression, Cybersecurity, Dataset Preprocessing
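The dual-layer design summarized in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the study's implementation: the feature choices, the `URGENT_WORDS` list, the `LogisticModel` class, the class-weighting scheme used to handle imbalance, and the toy emails are all assumptions made for illustration. A small logistic regression stands in for each layer and for the meta-classifier.

```python
import math
import re

URGENT_WORDS = {"urgent", "verify", "password", "suspended", "click"}  # illustrative list


def sender_features(address):
    """Structural features of the sender address (hypothetical choices)."""
    local, _, domain = address.partition("@")
    return [
        len(local) / 20.0,                           # local-part length
        sum(ch.isdigit() for ch in address) / 10.0,  # digit count
        domain.count(".") / 3.0,                     # number of dots in the domain
        1.0 if "-" in domain else 0.0,               # hyphenated domain
    ]


def content_features(body):
    """Simple linguistic/statistical features of the body (hypothetical choices)."""
    words = re.findall(r"[a-z']+", body.lower())
    return [
        sum(w in URGENT_WORDS for w in words) / max(len(words), 1),
        len(re.findall(r"https?://", body)) / 3.0,
        body.count("!") / 5.0,
    ]


class LogisticModel:
    """Tiny class-weighted logistic regression trained by SGD."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y, epochs=1000, lr=0.1):
        n_pos = sum(y)
        n_neg = len(y) - n_pos
        # Assumed imbalance handling: up-weight the rare class.
        class_weight = {1: len(y) / (2.0 * n_pos), 0: len(y) / (2.0 * n_neg)}
        for _ in range(epochs):
            for x, t in zip(X, y):
                g = (self.predict_proba(x) - t) * class_weight[t]
                self.w = [wi - lr * g * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * g
        return self


def train_framework(emails, labels):
    """Fit both layers, then a meta-model on their two probability scores."""
    Xs = [sender_features(addr) for addr, _ in emails]
    Xc = [content_features(body) for _, body in emails]
    sender = LogisticModel(4).fit(Xs, labels)
    content = LogisticModel(3).fit(Xc, labels)
    # Simplification: meta-features are in-sample scores; out-of-fold
    # predictions would normally be used to avoid leakage.
    meta_X = [[sender.predict_proba(s), content.predict_proba(c)]
              for s, c in zip(Xs, Xc)]
    meta = LogisticModel(2).fit(meta_X, labels)

    def score(address, body):
        p_sender = sender.predict_proba(sender_features(address))
        p_content = content.predict_proba(content_features(body))
        return meta.predict_proba([p_sender, p_content])

    return score


# Toy, imbalanced data (6 legitimate, 2 phishing); entirely made up.
EMAILS = [
    ("alice@example.com", "Hi team, the meeting notes are attached. See you Monday."),
    ("bob.smith@university.edu", "Please find the syllabus for next term below."),
    ("carol@example.org", "Lunch on Friday? Let me know what works."),
    ("dave@company.com", "The quarterly report is ready for review."),
    ("erin@example.net", "Thanks for the update, I will reply tomorrow."),
    ("frank@library.org", "Your reserved book is available for pickup."),
    ("acc0unt-verify8821@secure-login.example-pay.co",
     "URGENT! Verify your password now or your account is suspended! "
     "Click http://example.test/login"),
    ("supp0rt77123@mail.update-billing.example.biz",
     "Your account is suspended! Click http://example.test/verify "
     "to verify your password!"),
]
LABELS = [0, 0, 0, 0, 0, 0, 1, 1]

score = train_framework(EMAILS, LABELS)
print(score("grace@example.com", "The agenda for Thursday is attached."))
print(score("secur1ty-alert4471@account-verify.example.info",
            "Urgent! Click http://example.test/reset to verify your password!"))
```

The two layers never see each other's features; only their probability scores reach the meta-model, which is the property the abstract emphasizes. On a dataset of realistic size, tree ensembles such as those named in the keywords would be a natural substitute for the toy `LogisticModel` in either layer.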