Towards Benchmark Datasets for Machine Learning Based Website Phishing Detection: An experimental study

10/24/2020
by Abdelhakim Hannousse, et al.

In this paper, we present a general scheme for building reproducible and extensible datasets for website phishing detection. The aim is to (1) enable comparison of systems that use different features, (2) overcome the short-lived nature of phishing websites, and (3) keep track of the evolution of phishing tactics. To experiment with the proposed scheme, we start by adopting a refined classification of website phishing features. We systematically select a total of 87 commonly recognized features, classify them, and subject them to relevance and runtime analysis. We use the collected set of features to build a dataset following the proposed scheme. Thereafter, we use a conceptual replication approach to check whether former findings generalize to the built dataset. Specifically, we evaluate the performance of classifiers on individual feature classes and on combinations of classes, we investigate different combinations of models, and we explore the effects of filter and wrapper methods on the selection of discriminative features. The results show that Random Forest is the most predictive classifier. Features gathered from external services are found to be the most discriminative, whereas features extracted from web page contents are found to be less distinguishing. Besides external-service-based features, some web page content features are found to be time-consuming and not suitable for runtime detection. The use of hybrid features provided the best accuracy score of 96.61%. Filter-based ranking together with incremental removal of less important features improved the performance up to 96.83%.
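
The filter-based ranking with incremental removal mentioned in the abstract can be illustrated with a minimal sketch. It makes several assumptions that do not come from the paper: the data is synthetic (two informative features plus four noise features), the filter score is the absolute Pearson correlation with the label, and a nearest-centroid classifier stands in for Random Forest so the example needs only the Python standard library. All function names are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_features(X, y):
    """Filter step: order feature indices by |correlation with the label|, best first."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def centroid_accuracy(X_train, y_train, X_val, y_val, cols):
    """Accuracy of a nearest-centroid classifier restricted to the columns in cols.

    Assumption: this simple classifier replaces Random Forest for illustration only.
    """
    centroids = {}
    for label in (0, 1):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(r[j] for r in rows) for j in cols]
    correct = 0
    for r, t in zip(X_val, y_val):
        dist = {
            label: sum((r[j] - c) ** 2 for j, c in zip(cols, cen))
            for label, cen in centroids.items()
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_val)


def incremental_removal(X_train, y_train, X_val, y_val):
    """Drop the lowest-ranked feature one at a time; keep the best-scoring subset."""
    cols = rank_features(X_train, y_train)
    best_cols = list(cols)
    best_acc = centroid_accuracy(X_train, y_train, X_val, y_val, cols)
    while len(cols) > 1:
        cols = cols[:-1]  # remove the currently least important feature
        acc = centroid_accuracy(X_train, y_train, X_val, y_val, cols)
        if acc >= best_acc:  # prefer the smaller subset on ties
            best_cols, best_acc = list(cols), acc
    return best_cols, best_acc


def make_data(n, seed):
    """Synthetic data: features 0-1 depend on the label, features 2-5 are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = [2.0 * label + rng.gauss(0, 1.0) for _ in range(2)]
        noise = [rng.gauss(0, 3.0) for _ in range(4)]
        X.append(informative + noise)
        y.append(label)
    return X, y


X_train, y_train = make_data(400, seed=0)
X_val, y_val = make_data(200, seed=1)

ranking = rank_features(X_train, y_train)
baseline_acc = centroid_accuracy(X_train, y_train, X_val, y_val, ranking)
best_cols, best_acc = incremental_removal(X_train, y_train, X_val, y_val)

print("ranking:", ranking)
print("all features:", baseline_acc)
print("selected features:", best_cols, best_acc)
```

The ranking is computed once, without consulting the classifier, which is what distinguishes a filter method from a wrapper method. The classifier is used only to decide how many of the top-ranked features to keep.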


