Explainable Machine Learning Framework for Phishing URL Detection with a Realistic Large-Scale Dataset

Ozcelik E., Elsilk F., OSMANOĞLU M.

8th International Congress on Human-Computer Interaction, Optimization and Robotic Applications, ICHORA 2026, Ankara, Türkiye, 21 - 23 May 2026, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
DOI Number: 10.1109/ichora69329.2026.11537041
City of Publication: Ankara
Country of Publication: Türkiye
Keywords: ensemble learning, explainable artificial intelligence, machine learning, phishing detection
Affiliated with Ankara Üniversitesi: Yes

Abstract

Phishing attacks remain a major cybersecurity threat, often evading traditional blacklist-based defenses through continuously evolving URLs. While machine learning (ML) methods achieve high detection performance, their lack of interpretability limits practical adoption. This study proposes an explainable phishing URL detection framework supported by a newly constructed large-scale dataset of 579,920 URLs. Unlike conventional domain-based collections, realistic and structurally diverse legitimate URLs are generated using protocol variations and a two-stage web crawling strategy, reducing dataset bias and improving generalization. Six ML models and three ensemble strategies are evaluated, where LightGBM achieves the best base performance and stacking ensembles provide marginal gains. For interpretability, global explanations are obtained using SHAP and permutation importance, while local decisions are analyzed with SHAP and LIME. Additionally, a SHAP stability analysis based on Spearman correlation demonstrates highly consistent feature importance rankings across different sample sizes. The results show that the proposed framework enables accurate, robust, and reliable explainable phishing detection for real-world applications.
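
The stability analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the sample sizes, the reference ranking, or the SHAP configuration, so everything below is an assumption. In particular, the attribution matrix is synthetic Gaussian noise standing in for real SHAP values, and the function names (`rank`, `spearman`, `global_importance`, `stability`) are hypothetical. The sketch only shows the general idea: compute a global importance per feature (mean absolute attribution) on subsamples of different sizes and compare each ranking to the full-data ranking with Spearman's rho.

```python
import random
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    ra, rb = rank(a), rank(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


def global_importance(attributions):
    """Mean absolute attribution per feature (one row per sample)."""
    n_features = len(attributions[0])
    return [mean(abs(row[f]) for row in attributions) for f in range(n_features)]


def stability(attributions, sample_sizes, seed=0):
    """Rho between each subsample's importance ranking and the full-data one."""
    rng = random.Random(seed)
    reference = global_importance(attributions)
    result = {}
    for n in sample_sizes:
        subset = rng.sample(attributions, n)
        result[n] = spearman(global_importance(subset), reference)
    return result


# ASSUMPTION: synthetic attributions replace real SHAP values. Each of the
# six hypothetical features gets a different spread, so a "true" ordering exists.
data_rng = random.Random(42)
scales = [2.0, 1.5, 1.0, 0.6, 0.3, 0.1]
attributions = [[data_rng.gauss(0, s) for s in scales] for _ in range(5000)]

scores = stability(attributions, sample_sizes=[100, 500, 2000])
for n, rho in scores.items():
    print(f"n={n}: Spearman rho vs. full data = {rho:.3f}")
```

With real data, the `attributions` matrix would come from a SHAP explainer applied to the trained model. Values of rho close to 1 across sample sizes would indicate that the feature ranking does not depend strongly on how many samples are explained.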