Massive Enhanced Extracted Email Features Tailored for Cosine Distance

05/10/2022
by Farshad Barahimi, et al.

This paper explains and evaluates the process of converting the Enron email dataset (the version cited in the preprint) into thousands of features per email for a selected set of 2400 labelled emails. The final features are tailored for Cosine distance, so that the Cosine distance inversely reflects the number of top indicative words that two emails have in common, in an explainable, normalized fashion. The labelling is based on the leaf folder name in the folder tree of the Enron email dataset, and the 2400 selected emails consist of 300 emails for each of the 8 labels. The evaluation is based on the accuracy of a k-nearest-neighbours (KNN) majority voting classification using Cosine distance. In addition to the KNN majority voting classification accuracy and confusion matrix, some statistics for the process are reported. The KNN majority voting classification accuracy using Cosine distance is 76.75% given the 8 labels involved. The conversion yields 48557 features per selected email, of which exactly 40 features per email are non-zero. The result of the conversion is a data set named MeeefTCD (Massive Enhanced Extracted Email Features Tailored for Cosine Distance), available at https://web.cs.dal.ca/~barahimi/data-sets/meeeftcd/ and in a GitHub repository mentioned in this paper.
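
The relationship between Cosine distance and shared indicative words can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the non-zero features all carry equal weight, so each email can be represented as a set of non-zero feature indices; the abstract does not state the actual feature values, and the function names `cosine_distance` and `knn_predict` are hypothetical rather than taken from the paper. Under that assumption, two emails with 40 non-zero features each have a Cosine distance of 1 minus the number of shared features divided by 40.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two binary sparse vectors.

    Each vector is given as a set of non-zero feature indices
    (assumption: all non-zero features have equal weight).
    """
    if not a or not b:
        return 1.0  # convention chosen here for empty vectors
    # For binary vectors: dot product = |a & b|, norms = sqrt(|a|), sqrt(|b|).
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def knn_predict(query, train, k=3):
    """Majority vote over the k training items closest to `query`.

    `train` is a list of (feature_set, label) pairs. Ties in the vote
    are broken in favour of the label encountered first among the neighbours.
    """
    nearest = sorted(train, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two toy "emails" with 40 non-zero features each, sharing 10 of them.
email_a = set(range(0, 40))
email_b = set(range(30, 70))
distance = cosine_distance(email_a, email_b)  # 1 - 10/40
```

With a fixed count of 40 non-zero features per email, the distance depends only on the overlap count, which is one way to read the "explainable normalized fashion" described above; the real data set may use non-binary weights.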
