Classification of Spam Emails through Hierarchical Clustering and Supervised Learning

05/18/2020
by Francisco Jañez-Martino, et al.

Spammers take advantage of the popularity of email to indiscriminately send unsolicited messages. Although researchers and organizations continuously develop anti-spam filters based on binary classification, spammers bypass them through new strategies, such as word obfuscation or image-based spam. For the first time in the literature, we propose classifying spam emails into categories, instead of just using a binary model, to improve the handling of spam that has already been detected. First, we applied a hierarchical clustering algorithm to create SPEMC-11K (SPam EMail Classification), the first multi-class dataset, which contains three types of spam emails: Health and Technology, Personal Scams, and Sexual Content. Then, we used SPEMC-11K to evaluate the combination of TF-IDF and BOW encodings with Naïve Bayes (NB), Decision Trees and SVM classifiers. Finally, for the task of multi-class spam classification we recommend (i) TF-IDF combined with SVM for the best micro F1 score, 95.39%, and (ii) TF-IDF combined with NB for the fastest spam classification, analyzing an email in 2.13 ms.
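To make the two encodings and the reported metric concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names (`tokenize`, `bow_encode`, `tfidf_encode`, `micro_f1`), the whitespace tokenizer, and the particular TF-IDF variant (term frequency normalized by document length, natural-log inverse document frequency) are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def bow_encode(docs):
    """Bag-of-words (BOW): one raw term-count mapping per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def tfidf_encode(docs):
    """TF-IDF with tf = count / document length and idf = ln(N / df).

    This is one common variant, assumed here for illustration only.
    """
    counts = bow_encode(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    encoded = []
    for c in counts:
        total = sum(c.values())
        encoded.append(
            {term: (n / total) * math.log(n_docs / df[term]) for term, n in c.items()}
        )
    return encoded


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

In this sketch a term that occurs in every document receives a TF-IDF weight of zero, whereas BOW keeps its raw count; that difference is what distinguishes the two encodings compared in the abstract. When every email receives exactly one predicted label, micro F1 coincides with plain accuracy.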

