EMFET: E-mail Features Extraction Tool

EMFET is an open-source, flexible tool for extracting a large number of features from any email corpus whose emails are saved in EML format. The extracted features fall into three main groups: header features, payload (body) features, and attachment features. The purpose of the tool is to help practitioners and researchers build datasets for training machine learning models for spam detection. So far, 140 features can be extracted using EMFET. EMFET is extensible and easy to use. The source code of EMFET is publicly available on GitHub (https://github.com/WadeaHijjawi/EmailFeaturesExtraction).

1 Interface

The interface of EMFET is designed as a group of tab pages. Each tab page represents a feature category and contains checkboxes, each with a unique tag, for selecting the features to extract. The interface also provides a folder browser for selecting the corpus folder from which the features will be extracted, as shown in Fig. 1. A progress bar shows the current status of the extraction process.

Figure 1: Features Extraction Tool interface.

2 Functionality

In EMFET, the user can load an email corpus whose emails are saved in EML format, such as SpamAssassin (available at http://spamassassin.apache.org/downloads.cgi?update=201504291720), CSDMC2010 (available at http://csmining.org/index.php/spam-email-datasets-.html), and Ling-Spam (available at http://www.csmining.org/index.php/ling-spam-datasets.html). EML is the file extension for an email message saved to a file in the Internet Message Format used for electronic mail messages. It is one of the most common file extensions used by email applications such as Outlook, Thunderbird, and Gmail. Several public email corpora in EML format are available on the web and can be used with EMFET.

The tool splits each email into its main parts (From, To, CC, BCC, Subject, Text Body, and HTML Body) and saves each of them to an output file. The tool then extracts the selected features from the corresponding email part and saves them in another file in CSV format. If any error occurs during the extraction process, an error file is generated.
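The splitting step described above can be sketched in Python with the standard `email` package. This is only an illustration and not EMFET's code: the tool itself is a .NET application built on Microsoft CDO, and the function names, the column names in `PART_NAMES`, and the way errors are collected are all assumptions made for this example.

```python
import csv
import io
from email import policy
from email.parser import BytesParser

# Hypothetical column names for the "parts" output file.
PART_NAMES = ["From", "To", "CC", "BCC", "Subject", "TextBody", "HtmlBody"]


def split_email(raw_bytes):
    """Parse one EML message and return its main parts as a dict."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    return {
        "From": str(msg.get("From", "")),
        "To": str(msg.get("To", "")),
        "CC": str(msg.get("Cc", "")),
        "BCC": str(msg.get("Bcc", "")),
        "Subject": str(msg.get("Subject", "")),
        "TextBody": text_part.get_content() if text_part is not None else "",
        "HtmlBody": html_part.get_content() if html_part is not None else "",
    }


def process_corpus(messages):
    """Split every (name, raw_bytes) pair; record failures instead of aborting."""
    rows, errors = [], []
    for name, raw in messages:
        try:
            rows.append(split_email(raw))
        except Exception as exc:  # would be written to the error file
            errors.append((name, repr(exc)))
    return rows, errors


def write_parts_csv(rows, out_file):
    """Write the split parts to an open text file in CSV format."""
    writer = csv.DictWriter(out_file, fieldnames=PART_NAMES)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    sample = b"From: alice@example.com\nTo: bob@example.org\nSubject: Hello\n\nHi Bob!\n"
    rows, errors = process_corpus([("sample.eml", sample)])
    buffer = io.StringIO()
    write_parts_csv(rows, buffer)
    print(buffer.getvalue())
```

Collecting errors per message, rather than stopping at the first malformed file, mirrors the behaviour described above, where problems are reported in a separate error file.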

Figure 2: The implementation structure of EMFET.

3 Implementation

The source code for the feature extraction process is organized in one main class with a controller that calls the procedure of each selected feature; each procedure uses the email object's attributes and methods to extract the desired feature. The variables, packages, and email objects are initialized in separate classes. Figure 2 describes the implementation structure.

Several external libraries are used in the implementation. The Microsoft CDO for Windows 2000 COM library is used to handle each email file as an object and to save the email's parts into an output file. For feature extraction, we used the following libraries:

  • HTML Agility Pack: an external package for parsing HTML. Available at https://htmlagilitypack.codeplex.com/.

  • IKVM.NET: an external package that enables Java and .NET interoperability. Available at https://www.ikvm.net/.

  • Stanford.NLP.CoreNLP: an external package that provides a set of natural language analysis tools. Available at https://www.nuget.org/packages/Stanford.NLP.CoreNLP/.

  • RegularExpressions: an internal package for working with text, such as searching for a pattern within text.

  • LINQ (.NET Language-Integrated Query): an internal package for querying and accessing data in objects such as XML documents and databases.

The simple design and implementation structure of the tool makes it easy to use and easy to extend with new features. A feature can be added by creating a checkbox with a unique tag for it, adding its extraction procedure to the main class, and linking the procedure's name to the checkbox's tag in the controller. Additionally, each email is passed to the extraction procedures as an object, together with the unique tag of the feature, so each procedure can be treated as an independent problem. Moreover, relying on the internal and external packages reduces the number of source code lines and makes the implementation easier to extend and modify.
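The tag-to-procedure linking described above can be illustrated with a small registry. This is a sketch in Python, not the tool's .NET controller; the decorator, the tag strings, and the two example features are hypothetical names chosen for the illustration.

```python
# Maps a feature tag (the checkbox tag in the interface) to its procedure.
FEATURES = {}


def feature(tag):
    """Register an extraction procedure under a unique tag."""
    def register(func):
        if tag in FEATURES:
            raise ValueError(f"duplicate feature tag: {tag}")
        FEATURES[tag] = func
        return func
    return register


@feature("subject_char_count")
def subject_char_count(email):
    # Number of characters in the subject line.
    return len(email["Subject"])


@feature("to_count")
def to_count(email):
    # Number of recipients in the "To" field (naive comma split).
    return len([a for a in email["To"].split(",") if a.strip()])


def extract(email, selected_tags):
    """Controller: call only the procedures whose tags were selected."""
    return {tag: FEATURES[tag](email) for tag in selected_tags}


if __name__ == "__main__":
    demo = {"Subject": "Hello", "To": "a@example.com, b@example.org"}
    print(extract(demo, ["subject_char_count", "to_count"]))
```

With this layout, adding a feature only requires writing one new function and giving it a unique tag; the controller itself does not change.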

4 List of E-mail features

Based on the related studies in the literature, spam features can be classified into three main categories: header features, payload (body) features, and attachment features [1]. Figure 3 shows the hierarchy of the feature categories. So far, 140 features are implemented in EMFET (i.e., 49 header features, 2 attachment features, and 89 payload features). Each feature category is described as follows.

Figure 3: Hierarchy of features groups.

4.1 Header features

The email header is an essential element of any email message; it consists of a set of important fields that help in email delivery. The header features are grouped into two classes: the email's metadata and the subject. Table 1 lists the header features included in EMFET with brief descriptions.

ID | Feature | Type | Ref.
---|---------|------|-----
1 | Year | Metadata | [2]
2 | Month | Metadata | [2]
3 | Day | Metadata | [3, 2]
4 | Hour | Metadata | [3, 2]
5 | Minute | Metadata | [3, 2]
6 | Second | Metadata | [3, 2]
7 | From Google? | Metadata | [2]
8 | From AOL? | Metadata | [2]
9 | From Gov? | Metadata | [2]
10 | From HTML? | Metadata | [2]
11 | From MIL? | Metadata | [2]
12 | From Yahoo? | Metadata | [2]
13 | From Example? | Metadata | [2]
14 | To Hotmail? | Metadata | [2]
15 | To Yahoo? | Metadata | [2]
16 | To Example? | Metadata | [2]
17 | To MSN? | Metadata | [2]
18 | To Localhost? | Metadata | [2]
19 | To Google? | Metadata | [2]
20 | To AOL? | Metadata | [2]
21 | To Gov? | Metadata | [2]
22 | To MIL? | Metadata | [2]
23 | Count of "To" emails | Metadata | [3]
24 | Reply to Google? | Metadata | [2]
25 | Reply to Hotmail? | Metadata | [2]
26 | Reply to MIL? | Metadata | [2]
27 | Reply to Yahoo? | Metadata | [2]
28 | Reply to AOL? | Metadata | [2]
29 | Reply to Gov? | Metadata | [2]
30 | X-Mailman-Version | Metadata | [2]
31 | Exist Text/Plain? | Metadata | [2]
32 | Exist Multipart/Mixed? | Metadata | [2]
33 | Exist Multipart/Alternative? | Metadata | [2]
34 | No. of characters | Subject | [3]
35 | No. of capitalised words | Subject | [3]
36 | No. of words in all uppercase | Subject | [3]
37 | No. of words that are digits | Subject | [3]
38 | No. of words containing only letters | Subject | [3]
39 | No. of words containing letters and numbers | Subject | [3]
40 | No. of words that are single letters | Subject | [3]
41 | No. of words that are single digits | Subject | [3]
42 | No. of words that are single characters | Subject | [3]
43 | Max ratio of uppercase to lowercase letters of each word | Subject | [3]
44 | Min character diversity of each word | Subject | [3]
45 | Max ratio of uppercase letters to all characters of each word | Subject | [3]
46 | Max ratio of digit characters to all characters of each word | Subject | [3]
47 | Max ratio of non-alphanumerics to all characters of each word | Subject | [3]
48 | Max of the longest repeated character | Subject | [3]
49 | Max of the character lengths of words | Subject | [3]

Table 1: The header features
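A few of the header features in Table 1 can be sketched as follows. The code is a Python illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, the domain test (a keyword contained in the address domain) is a guess at how flags such as "From Google?" might be computed, and words are split on whitespace.

```python
from email.utils import getaddresses, parsedate_to_datetime


def date_features(date_header):
    """Features 1-6: components of the Date header.

    Raises an exception if the header cannot be parsed as a date.
    """
    dt = parsedate_to_datetime(date_header)
    return {"Year": dt.year, "Month": dt.month, "Day": dt.day,
            "Hour": dt.hour, "Minute": dt.minute, "Second": dt.second}


def domain_flag(address_header, keyword):
    """1 if any address in the header has a domain containing `keyword`."""
    addresses = getaddresses([address_header])
    return int(any(keyword in addr.rpartition("@")[2].lower()
                   for _, addr in addresses))


def subject_features(subject):
    """A handful of the subject features (IDs 34-37 and 49)."""
    words = subject.split()
    return {
        "chars": len(subject),
        "capitalised_words": sum(w[0].isupper() for w in words),
        "all_uppercase_words": sum(w.isupper() for w in words),
        "digit_words": sum(w.isdigit() for w in words),
        "max_word_length": max((len(w) for w in words), default=0),
    }


if __name__ == "__main__":
    print(date_features("Tue, 14 Apr 2015 17:20:00 +0000"))
    print(domain_flag("Alice <alice@gmail.com>", "gmail"))
    print(subject_features("WIN 100 Dollars now"))
```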

4.2 Payload features

These features are categorized into three subcategories: email body features, readability features, and lexical diversity features.

4.2.1 Email body features

The e-mail body contains unstructured data such as images, HTML tags, text, and other objects. This feature set contains 59 features, which are described in Table 2.

ID | Feature | Ref.
---|---------|-----
1 | Count of spam words | [4, 2, 5]
2 | Count of function words | [4, 5]
3 | Count of HTML anchors | [3, 4, 2, 5]
4 | Count of unique HTML anchors | [3, 2]
5 | Count of HTML non-anchor tags | [4]
6 | Count of HTML images | [5]
7 | Count of all HTML tags | [4, 5]
8 | Count of alphanumeric words | [3, 4, 5]
9 | TF-ISF | [4]
10 | TF-ISF without stopwords | [4]
11 | Count of duplicate words | [5]
12 | Minimum word length | [5]
13 | Count of lowercase letters | [5]
14 | Longest sequence of adjacent capital letters | [2, 5]
15 | Count of lines | [2, 5]
16 | Total No. of digit characters | [2]
17 | Total No. of white spaces | [2]
18 | Total No. of uppercase characters | [2, 5]
19 | Total No. of characters | [3, 2]
20 | Total No. of tabs | [2]
21 | Total No. of special characters | [2]
22 | Total No. of alpha characters | [2]
23 | Total No. of words | [2, 5]
24 | Average word length | [2, 5]
25 | Words longer than 6 characters | [2]
26 | Total No. of words (1 - 3 characters) | [2]
27 | No. of single quotes | [2, 5]
28 | No. of commas | [2, 5]
29 | No. of periods | [2, 5]
30 | No. of semicolons | [2, 5]
31 | No. of question marks | [2, 5]
32 | No. of multiple question marks | [2]
33 | No. of exclamation marks | [2, 5]
34 | No. of multiple exclamation marks | [2]
35 | No. of colons | [2, 5]
36 | No. of ellipses | [2, 5]
37 | Total No. of sentences | [4, 2, 5]
38 | Total No. of paragraphs | [2]
39 | Average No. of sentences per paragraph | [2]
40 | Average No. of words per paragraph | [2]
41 | Average No. of characters per paragraph | [2]
42 | Average No. of words per sentence | [2]
43 | No. of sentences beginning with uppercase | [2]
44 | No. of sentences beginning with lowercase | [2]
45 | Character frequency of "$" | [2, 5]
46 | No. of capitalized words | [3]
47 | No. of words in all uppercase | [3]
48 | No. of words that are digits | [3]
49 | No. of words containing only letters | [3]
50 | No. of words that are single letters | [3]
51 | No. of words that are single digits | [3]
52 | No. of words that are single characters | [3]
53 | Max ratio of uppercase letters to lowercase letters of each word | [3]
54 | Min character diversity of each word | [3]
55 | Max ratio of uppercase letters to all characters of each word | [3]
56 | Max ratio of digit characters to all characters of each word | [3]
57 | Max ratio of non-alphanumerics to all characters of each word | [3]
58 | Max of the longest repeating character | [3]
59 | Max of the character lengths of words | [3, 5]

Table 2: Email body features
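Several of the counts in Table 2 reduce to simple string and tag counting. The sketch below uses Python's standard `html.parser` and `re` modules in place of the HTML Agility Pack used by the tool; the tokenization rule (runs of letters, digits, and apostrophes) and all names are assumptions of this example, so the values may differ from what EMFET reports.

```python
import re
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts anchors, images, and all opening tags in an HTML body."""

    def __init__(self):
        super().__init__()
        self.anchors = 0
        self.images = 0
        self.all_tags = 0

    def handle_starttag(self, tag, attrs):
        self.all_tags += 1
        if tag == "a":
            self.anchors += 1
        elif tag == "img":
            self.images += 1


def html_features(html):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return {"anchors": counter.anchors, "images": counter.images,
            "all_tags": counter.all_tags}


def text_features(text):
    words = re.findall(r"[A-Za-z0-9']+", text)   # assumed tokenization
    capital_runs = re.findall(r"[A-Z]+", text)
    return {
        "question_marks": text.count("?"),
        "multiple_question_marks": len(re.findall(r"\?{2,}", text)),
        "exclamation_marks": text.count("!"),
        "dollar_signs": text.count("$"),
        "lines": len(text.splitlines()),
        "words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "words_longer_than_6": sum(len(w) > 6 for w in words),
        "longest_capital_run": max(map(len, capital_runs), default=0),
    }


if __name__ == "__main__":
    print(html_features('<p>Hi <a href="x">a</a> <img src="z"></p>'))
    print(text_features("FREE money!!! Really?? Call NOW\nWin $100"))
```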

4.2.2 Readability features

Readability features capture how difficult it is to read a word, a sentence, or a paragraph in the email's body [4], [5]. They are computed from syllables (sequences of speech sounds), which are used to distinguish between simple and complex words. This set contains 23 features that measure the difficulty of reading the email's body. The readability features are as follows:

  • Number of simple words features (with and without stopwords), where a simple word has at most two syllables.

  • Number of complex words features (with and without stopwords), where a complex word has three or more syllables.

  • Word length features (with and without stopwords), based on the number of syllables and the number of words in the text.

  • Fog Index (FI) features (with and without stopwords), a widely used readability measure that estimates the years of education needed to understand the text at a glance.

  • Flesch Reading Ease Score (FRES) features (with and without stopwords), which are used to assess the textual difficulty.

  • SMOG index features (with and without stopwords), which are used to measure the difficulty of the written text.

  • FORCAST index features (with and without stopwords), which are used to measure the reading skill required for a text that contains a high percentage of simple words.

  • Flesch-Kincaid Readability Index (FKRI) features (with and without stopwords), which are similar to the FRES features but use different weighting factors.

  • Simple Word FI features (with and without stopwords), which are similar to the Fog Index but are computed over the simple words.

  • Inverse FI features (with and without stopwords).

  • SMOG-I feature, which is used to measure the difficulty of the written text.

  • Automated Readability Index (ARI), which is used to measure the understandability of a given text.

  • Coleman-Liau Index (CLI), which is similar to ARI but uses a syllable factor instead of a character factor.
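To make the syllable-based measures above concrete, the following Python sketch computes a few of them with the commonly published formulas for the Fog Index, FRES, Flesch-Kincaid, SMOG, and ARI. It is an illustration under assumptions: the syllable counter is a rough vowel-group heuristic, sentence splitting is naive, stopword removal is omitted, and EMFET's own formulas and tokenization may differ.

```python
import math
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None  # the scores are undefined for empty text

    syllables = [count_syllables(w) for w in words]
    n_words, n_sentences = len(words), len(sentences)
    complex_words = sum(s >= 3 for s in syllables)   # three or more syllables
    words_per_sentence = n_words / n_sentences
    syllables_per_word = sum(syllables) / n_words
    chars_per_word = sum(len(w) for w in words) / n_words

    return {
        "simple_words": n_words - complex_words,
        "complex_words": complex_words,
        "fog_index": 0.4 * (words_per_sentence + 100 * complex_words / n_words),
        "fres": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "fkri": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "smog": 1.043 * math.sqrt(complex_words * 30 / n_sentences) + 3.1291,
        "ari": 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43,
    }


if __name__ == "__main__":
    for name, value in readability("The cat sat. The dog ran.").items():
        print(name, round(value, 3))
```

The "with stopwords" and "without stopwords" variants listed above would call the same function twice, once on the raw word list and once after filtering it against a stopword list.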

4.2.3 Lexical diversity features

Lexical diversity features are extracted based on the vocabulary size of the text, where word occurrences are counted under different constraints [3], [6], [7]. This set contains seven features:

  • Vocabulary Richness, which represents the number of distinct words in a text.

  • Hapax legomena or V(1,N), which represents the number of words occurring once in a text of N words.

  • Hapax dislegomena or V(2,N), which represents the number of words occurring twice in a text of N words.

  • Entropy measure, which assesses the average amount of information in the text.

  • YuleK, which is a text characteristic measure that is independent of text length.

  • SichelS, which is the ratio of dislegomena to the number of distinct words.

  • Honore, which is the ratio of hapax legomena (V(1,N)) to the vocabulary size N.
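The lexical diversity measures above can all be derived from a single word-frequency table. The Python sketch below uses the commonly cited forms of Yule's K, Sichel's S, and Honoré's R; these are assumptions of the example, and in particular the R statistic shown here is the usual logarithmic form, which may not match the exact ratio EMFET computes. Lowercasing the words before counting is also an assumption.

```python
import math
from collections import Counter


def lexical_diversity(words):
    """Compute lexical diversity measures from a list of word tokens."""
    n = len(words)
    if n == 0:
        return None
    freq = Counter(w.lower() for w in words)
    v = len(freq)                       # vocabulary size (distinct words)
    spectrum = Counter(freq.values())   # spectrum[i] = V(i, N)
    v1 = spectrum.get(1, 0)             # hapax legomena
    v2 = spectrum.get(2, 0)             # hapax dislegomena

    entropy = -sum((c / n) * math.log2(c / n) for c in freq.values())
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    sichel_s = v2 / v
    # Honore's R is undefined when every distinct word occurs exactly once.
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")

    return {"vocabulary": v, "hapax_legomena": v1, "hapax_dislegomena": v2,
            "entropy": entropy, "yule_k": yule_k,
            "sichel_s": sichel_s, "honore_r": honore_r}


if __name__ == "__main__":
    for name, value in lexical_diversity("a a b b c d".split()).items():
        print(name, value)
```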

4.3 Attachment Features

We use two features related to attachments: the number of all attachment files in an email and the number of unique content types of the attachment files in an email.
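Both attachment features can be sketched with Python's standard `email` package. This is an illustration rather than the tool's CDO-based code; the function names and the demo message are hypothetical, and what counts as an attachment follows the rules of `EmailMessage.iter_attachments`, which may differ from EMFET's.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def attachment_features(raw_bytes):
    """Return the attachment count and the number of distinct content types."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    attachments = list(msg.iter_attachments())
    content_types = {part.get_content_type() for part in attachments}
    return {"attachment_count": len(attachments),
            "unique_content_types": len(content_types)}


def build_demo_message():
    """Hypothetical message with two PDF attachments and one text attachment."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.org"
    msg["Subject"] = "Reports"
    msg.set_content("See attached.")
    msg.add_attachment(b"%PDF-1.4 demo", maintype="application",
                       subtype="pdf", filename="q1.pdf")
    msg.add_attachment(b"%PDF-1.4 demo", maintype="application",
                       subtype="pdf", filename="q2.pdf")
    msg.add_attachment("plain notes", filename="notes.txt")
    return msg.as_bytes()


if __name__ == "__main__":
    print(attachment_features(build_demo_message()))
```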

5 How to install and use

This section lists the instructions for downloading, installing, and using EMFET version 1.0.0:

  1. The tool runs on the Windows operating system, either 32-bit or 64-bit, with .NET Framework 4.0 or above.

  2. Make sure you have an Internet connection and a web browser (such as Google Chrome or Firefox).

  3. Type the address of the tool on GitHub in the address bar: https://github.com/WadeaHijjawi/EmailFeaturesExtraction

  4. In the middle of the page, click the “Clone or download” button; in the popup menu that appears, click the “Download ZIP” button.

  5. Decompress the downloaded ZIP file using archive manager software such as WinRAR on Windows.

  6. Open the extracted folder and double-click the “EmailFeaturesExtraction” folder, then open the bin folder and then the Debug folder. Lastly, double-click the executable file “EmailFeaturesExtraction.exe”.

  7. A new window will appear. Browse to your corpus folder (make sure that it contains files with the EML extension). The features can be selected from three tabs: Header Features, Payload/Email Body Features, and Attachment Features.

  8. Click the “Extract” button. A new folder is generated inside your corpus folder; it contains a comma-separated values (CSV) file with the extracted features.

6 Citing EMFET

Hijawi, W., Faris, H., Alqatawna, J., Aljarah, I., Ala’M, A. Z., and M. Habib (2017). E-mail Features Extraction Tool. Available at: (https://github.com/WadeaHijjawi/EmailFeaturesExtraction).

W. Hijawi, H. Faris, J. Alqatawna, Ala’M, A. Z., and I. Aljarah, “Improving email spam detection using content based feature engineering approach,” in Applied Electrical Engineering and Computing Technologies (AEECT), IEEE, 2017.

7 Acknowledgment

This work has been supported in part by a project funded by The University of Jordan.

References

  • [1] W. Hijawi, H. Faris, J. Alqatawna, A. Z. Ala’M, and I. Aljarah, “Improving email spam detection using content based feature engineering approach,” in Applied Electrical Engineering and Computing Technologies (AEECT), IEEE, 2017.
  • [2] J. Alqatawna, H. Faris, K. Jaradat, M. Al-Zewairi, and A. Omar, “Improving knowledge based spam detection methods: The effect of malicious related features in imbalance data distribution,” International Journal of Communications, Network and System Sciences, vol. 8, no. 5, pp. 118–129, 2015.
  • [3] K.-N. Tran, M. Alazab, and R. Broadhurst, “Towards a feature rich model for predicting spam emails containing malicious attachments and urls,” in Eleventh Australasian Data Mining Conference Canberra, ACT, vol. 146, 2013.
  • [4] R. Shams and R. E. Mercer, “Supervised classification of spam emails with natural language stylometry,” Neural Computing and Applications, vol. 27, no. 8, pp. 2315–2331, 2016.
  • [5] B. A. Al-Shboul, H. Hakh, H. Faris, I. Aljarah, and H. Alsawalqah, “Voting-based classification for e-mail spam detection,” Journal of ICT Research and Applications, vol. 10, no. 1, pp. 29–42, 2016.
  • [6] F. J. Tweedie and R. H. Baayen, “How variable may a constant be? measures of lexical richness in perspective,” Computers and the Humanities, vol. 32, no. 5, pp. 323–352, 1998.
  • [7] W. H. Choi, “Finding appropriate lexical diversity measurements for small-size corpus,” in Applied Mechanics and Materials, vol. 121, pp. 1244–1248, Trans Tech Publ, 2012.