Autonomous devices are an integral part of Cyber-Physical Systems (CPS). These devices are powered by elegant software applications and utilities that protect them against adversarial attacks. Information-theoretic solutions, along with AI-enabled autonomous agents, are capable of protecting these critical components from cyber attacks. However, in addition to these hardware and software devices, there is another autonomous entity in CPS whose security defenses are much harder to harden: humans.
It is well known that humans are the weakest link in the information security chain. Adversaries often exploit this weakness by sending malicious links to individuals who operate critical infrastructure, in the hope that the operators will visit the malicious links and websites. These phishing attacks can take the form of emails, adware, or even fake malicious websites.
In spite of advances in security technologies and controls, phishing remains the number one choice of attackers due to its higher success rate. Even though many antivirus programs and blocking strategies can detect the spamicity of emails, some malicious emails still pass through the spam detection tools, get the attention of the targeted receivers, and are thus activated.
Machine learning-based approaches have revolutionized the detection of spam and phishing emails. However, the accuracy and detection power of these learning-based algorithms rely heavily on the historical data used to fit them, also known as training data. Existing studies confirm that phishing emails are distinguishable from genuine ones based on features. However, it is unclear whether the historical data are rich enough to train the classifiers properly. More specifically, an important question is whether machine learning-based detection models are content-aware or content-agnostic.
To address this problem, this paper studies the performance of email embeddings (i.e., email vectorization) in detecting phishing emails. We created 12 genuine and 12 phishing emails. To take the content of the emails into account, we embed the emails using Doc2Vec, a vectorization approach that captures the semantics of a given text. The systematically crafted emails are thus vectorized and fed into well-known machine learning classifiers. The results demonstrate that, even though the contents of the phishing and legitimate emails are similar, the email embeddings technique is able to distinguish between phishing and genuine emails.
In our previous work [12, 1], we studied the usage of linguistic features in the context of fake reviews. We observed that it is possible to detect fake reviews using linguistic features. In this paper, we are interested in exploring whether it is possible to detect phishing emails when the contents of both phishing and legitimate emails are somewhat similar. The key contributions of this paper are as follows:
We introduce a carefully and systematically crafted set of phishing and legitimate emails to support this line of research.
The performance of email embeddings techniques is presented through a number of machine learning classifiers.
The results show that even in the presence of similar contents, the email embeddings are able to distinguish legitimate emails from the phishing ones.
This paper is organized as follows: Section II reviews the related work in this line of research. In Section III, a brief technical background of the machine learning models and embeddings is presented. Section IV presents the experimental setup and procedure. The results of the study are presented in Section V. Section VI concludes the paper and highlights future research directions.
II Related Work
Machine learning techniques are broadly used for spam detection. The most important step in developing such a detection framework is extracting the features that are fed into the classifiers. A very common method for this step is Bag of Words (BOW), which uses the occurrence frequency of words or groups of words. Although this method is computationally fast and easy to implement, it has several limitations, such as ignoring semantics and word order, and producing a huge number of features, which imposes the curse of dimensionality on machine learning classifiers.
Term frequency-inverse document frequency (tf-idf) is another method for feature extraction that is commonly used in search engines and is useful for spam detection. It weights a word by its occurrence frequency in a given text or document, discounted by how common the word is across the whole corpus. Therefore, words that are concentrated in a few documents receive higher scores than words that are common throughout the corpus.
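As a rough illustration of the weighting described above, the following minimal sketch computes one common tf-idf variant (relative term frequency times the logarithm of the inverse document frequency). The toy corpus and the function name `tf_idf` are assumptions made for this example only; libraries often apply additional smoothing or normalization.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus (not the emails used in this paper).
corpus = [
    "verify your account now".split(),
    "your meeting is now confirmed".split(),
    "your invoice is attached".split(),
]
scores = tf_idf(corpus)
```

In this toy corpus, a word that appears in every document (such as "your") receives a weight of zero, while a word that appears in a single document (such as "verify") receives the highest weight in its document.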
BOW and tf-idf are less effective for detecting specific kinds of spam, such as phishing emails, which have more specific content properties such as a URL link to a malicious source. Given that it is sometimes desirable to pick out phishing emails specifically from other spam emails, more robust features are needed. Doc2Vec is a document embedding technique in which semantically similar documents have similar encodings. It is based on an unsupervised neural network that predicts the words in a document.
Duzi et al. proposed an ensemble method that detects spam emails by combining the classification results obtained from two feature sets. These feature sets were extracted using the Doc2Vec and tf-idf methods. They compared different classifiers with these feature sets on two datasets: emails from the Enron dataset and emails from the Ling spam corpus. Using support vector machines, they achieved an accuracy of 98.27% and an F-score of 98.97% on the Ling spam dataset, and an accuracy of 96.16% and an F-score of 96.07% on the Enron dataset.
Akinyelu and Adewumi 
Unnithan et al. examined several classifiers for phishing email detection. They used the dataset shared by the "IWSPA-AP 2018" workshop, which includes train and test subsets for emails with headers and emails without headers separately. The subset of no-header emails contains samples for the test data and emails as train data, including legitimate and phishing emails. Two sets of features were generated using the tf-idf and Doc2Vec techniques. They trained seven different models: Decision Tree (DT), Naive Bayes (NB), AdaBoost, Logistic Regression (LR), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), and Random Forest (RF). They applied the models to both subsets of the data (emails with and without headers) using both feature sets. Using the SVM classifier with the Doc2Vec feature set on both datasets, they achieved the best results, including: true positives (TP), 0 true negatives (TN), false positives (FP), and false negatives (FN) on emails without headers.
III Models and Algorithms
III-A Feature Extraction
Doc2Vec is a technique to generate document embeddings through a neural network-based approach, similar to what Word2Vec achieves with words. Figure 1 shows the Doc2Vec model, adapted from the original Doc2Vec paper. As can be observed, the inclusion of a paragraph id in the training phase is the main difference with respect to the original Word2Vec model. Note that although the term used in the original formulation is paragraph id, this method is normally used to embed full documents. Given that all words in a specific document are trained using the id assigned to the document, the resulting vector (i.e., embedding) for the document encodes meaning according to the words it contains. Furthermore, as both word and document embeddings are generated during the same training procedure, both words and documents are embedded in the same semantic vector space.
Hence, the resulting vector space has the capability to encode semantic similarities, i.e., vectors that are close together tend to share semantic properties, as Doc2Vec works under the Distributional Hypothesis of linguistics (i.e., words that are used in similar contexts tend to hold similar meanings).
Doc2Vec generates dense feature vectors, in contrast to the sparse representations produced by frequency-based techniques (e.g., BOW). Because of this, there exists a trade-off between sparsity of the features and interpretability, as the features extracted by Doc2Vec are considered opaque, since there is no direct interpretation for the values in each dimension.
III-B Dimensionality Reduction
III-B1 Principal Component Analysis
Principal Component Analysis (PCA) is a technique that finds the first $k$ principal components in the dataset, while retaining as much variance of the original dataset as possible. However, there are three important assumptions about the given data when PCA is utilized:
Linearity of the data should be present,
Large variances denote an important structure, and
The principal components are orthogonal.
In addition, when PCA is used for visualization purposes, the value of $k$ (i.e., the number of principal components) is 2 or 3. In this work, we set $k = 2$ to visualize the data and also to explore the results of classification using the 2-D projection of the document embeddings.
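A minimal sketch of such a 2-D projection is given below, computing PCA through the singular value decomposition of the centered data with NumPy. The random matrix is only a stand-in for 24 emails embedded in 20 dimensions, and the function name `pca_project` is an assumption made for this example.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the first k principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the orthonormal principal directions, ordered by variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:k].T, explained_ratio[:k]

# Random stand-in for 24 emails embedded in 20 dimensions (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 20))
Z, ratio = pca_project(X, k=2)
```

Here `Z` holds the 2-D coordinates of each sample and `ratio` holds the fraction of the total variance explained by each of the two retained components.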
III-B2 Kernel Principal Component Analysis
In some cases, the linearity that PCA assumes might not hold, and a non-linear structure could explain the underlying phenomena more accurately. PCA can still be applied to such datasets by using a non-linear mapping that takes the samples from the original space, often called the input space, to a higher-dimensional space, called the feature space, and then performing PCA in the resulting feature space. The function that evaluates inner products under this mapping is referred to as the kernel, and the full algorithm as Kernel PCA. As in the case of SVMs, several types of kernels can be utilized to model different patterns in the data. In our work, we used the Gaussian (RBF) kernel.
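For reference, the Gaussian (RBF) kernel is commonly written as follows, where $\gamma > 0$ is a width hyperparameter and $\phi$ denotes the implicit mapping to the feature space. These symbols are notation chosen for this sketch.

```latex
k(\mathbf{x}, \mathbf{y})
  = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle
  = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{y} \rVert^{2}\right)
```

Kernel PCA only requires the matrix of pairwise kernel values between samples, so the mapping $\phi$ never has to be computed explicitly.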
Brief descriptions of the classifiers that we experimentally compared are as follows:
III-C1 Random Forest
The random forest classifier, an ensemble of decision trees, combines the predictions of several weak classifiers (i.e., decision trees) and produces a final classification through voting. The main hyperparameter of this model is the number of weak classifiers to combine. Because the weak classifiers are decision trees, the hyperparameters of decision trees are also present in this model.
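The voting step can be sketched as follows. The per-tree predictions are hypothetical, and the helper `majority_vote` is a name introduced for this example; some libraries average the per-tree class probabilities instead of counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single email.
label = majority_vote(["phishing", "legitimate", "phishing", "phishing", "legitimate"])
```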
III-C2 Support Vector Machine
Support Vector Machines (SVMs) determine a hyperplane that maximizes the margin between itself and the nearest samples of each class. These nearest samples are known as support vectors. The construction can be performed in the original input space, or in a feature space induced by a kernel function evaluated on each pair of samples, which allows the decision boundary to be non-linear. In our work, we used the linear, Radial Basis Function (RBF), polynomial, and sigmoid kernels. An SVM handles classes that are not linearly separable through the $C$ hyperparameter, where $C$ controls the tolerance for misclassified samples, and its optimal value is problem-dependent. Moreover, there exist additional hyperparameters for each kernel, such as the degree in the polynomial kernel or the gamma term in the RBF kernel.
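The role of $C$ can be seen in the standard soft-margin formulation, sketched below for labels $y_i \in \{-1, +1\}$. The slack variables $\xi_i$ and the remaining symbols are notation chosen for this sketch.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
  y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 - \xi_i,
  \quad \xi_i \ge 0
```

A large $C$ penalizes margin violations heavily and yields a narrower margin, whereas a small $C$ tolerates more violations in exchange for a wider margin.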
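III-C3 Logistic Regression
In binary classification, logistic regression assigns the probability that a sample, denoted by the feature vector $\mathbf{x}$, belongs to a class $C$ as:
\[ P(C \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x})}}, \]
where $\beta_0$ and $\boldsymbol{\beta}$ are the intercept and the coefficient vector learned from the training data.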
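As a small numerical illustration of this probability, the sketch below applies the logistic function to a linear combination of the features. The sample, the coefficients, and the function name `predict_proba` are hypothetical values chosen for this example.

```python
import math

def predict_proba(x, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x_i for w, x_i in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 2-D sample and coefficients, chosen only for illustration.
p = predict_proba([1.0, -2.0], weights=[0.5, 0.25], bias=0.0)
```

With these values the linear term is zero, so the sample lies exactly on the decision boundary and receives a probability of 0.5.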
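III-C4 Naive Bayes
The Naive Bayes classifier performs classification using the strong assumption that all features in a vector $\mathbf{x} = (x_1, \dots, x_n)$ are statistically independent with respect to the class $C$. That is,
\[ P(\mathbf{x} \mid C) = \prod_{i=1}^{n} P(x_i \mid C). \]
This model assigns a predicted class label $\hat{y}$ according to
\[ \hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k), \]
where $C_k$ is the $k$-th class.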
Because the features of our dataset are generated using Doc2Vec document embeddings, we used Gaussian Naive Bayes in order to estimate the conditional probabilities with a dataset of continuous features. Given a class $C_k$, a variance $\sigma_k^2$, a mean $\mu_k$ (both $\sigma_k^2$ and $\mu_k$ are estimated from the samples), and an observed value $v$ for the feature $x_i$, the probability can be estimated using
\[ P(x_i = v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(v - \mu_k)^2}{2\sigma_k^2}\right). \]
Although this independence assumption is unnatural and does not hold in most situations, the Naive Bayes classifier has been adopted successfully in tasks such as spam filtering, medical diagnosis, and text classification.
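The Gaussian Naive Bayes classifier described above can be sketched in a few lines of standard-library Python. The 2-D features, the small variance floor of `1e-9`, and the function names are assumptions made for this example; the computation is done in log space to avoid numerical underflow.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [mean(col) for col in columns],
            # A small constant keeps the variance strictly positive.
            "var": [pvariance(col) + 1e-9 for col in columns],
        }
    return model

def log_gaussian(v, mu, var):
    """Log of the Gaussian density with mean mu and variance var at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)

def predict(model, x):
    """Return the class with the highest log-posterior score."""
    def score(label):
        params = model[label]
        return math.log(params["prior"]) + sum(
            log_gaussian(v, mu, var)
            for v, mu, var in zip(x, params["mean"], params["var"])
        )
    return max(model, key=score)

# Hypothetical 2-D features standing in for projected email embeddings.
X = [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9], [-1.0, -0.8], [-1.1, -1.2], [-0.9, -1.0]]
y = ["phishing"] * 3 + ["legitimate"] * 3
model = fit_gaussian_nb(X, y)
```

Calling `predict(model, [1.0, 1.0])` then returns the label whose class-conditional Gaussians best explain the sample, weighted by the class prior.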
IV Experimental Setup
IV-A Scripts and Libraries
IV-B Data Collection
Twenty-four email stimuli were created for an experiment with human subjects. The content of the emails was modeled on legitimate emails found in the authors’ inboxes or online. For all emails, a sender address, subject line, email body, and URL were created. The body of every email was 50-100 words long and contained a hyperlink. Because the email stimuli were created to be shown later to research participants, the method for creating them and their characteristics are similar to those used in other phishing experiments [5, 9, 19].
Twelve of the 24 email stimuli were legitimate emails, and the other 12 emails were phishing. All legitimate emails shared two characteristics. First, the URL in the legitimate emails went to a real and safe web address. All legitimate email URLs began with HTTPS. Second, the body of legitimate emails contained targeted messaging. All email stimuli were addressed to a fictional university student. Email stimuli with targeted messaging referenced the student by name, referenced their position as a student, or referenced the city in which the student attended university.
The 12 phishing emails were designed to mimic non-targeted phishing emails. The non-targeted phishing emails differed from the other types of emails in terms of both their URL and email body. The URLs of the non-targeted phishing emails did not use HTTPS and had at least two additional characteristics of suspicious links: five or more dots in the domain, a length of over 75 characters, an IP address, or misspellings. The email body contained no targeted messaging; non-targeted phishing emails were designed to be emails that could be sent to anyone.
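The URL characteristics listed above were applied by hand when the stimuli were written. Purely as an illustration, they could be checked programmatically along the following lines; the function `suspicious_url_flags`, the flag names, and the example URLs are hypothetical, and the misspelling criterion is omitted because it cannot be tested with a simple rule.

```python
import re
from urllib.parse import urlparse

def suspicious_url_flags(url):
    """Return the heuristic flags from the text that a URL triggers."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    flags = []
    if parsed.scheme != "https":
        flags.append("no_https")
    if host.count(".") >= 5:
        flags.append("many_dots")
    if len(url) > 75:
        flags.append("long_url")
    # Host written as a dotted IPv4 address instead of a domain name.
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        flags.append("ip_address")
    return flags

# Hypothetical URL used only to exercise the checks.
flags = suspicious_url_flags("http://192.168.10.5/login")
```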
IV-C Data Preprocessing
We used standard techniques for text preprocessing in order to clean the dataset. These are removal of 1) punctuation marks, 2) URLs, 3) email addresses, and 4) stopwords. We also used the Porter stemmer to perform stemming for each remaining token.
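The four removal steps can be sketched with the Python standard library as shown below. This is an approximation under stated assumptions: the stopword list is a tiny illustrative subset, lowercasing is added for convenience, the regular expressions are simple heuristics, and the Porter stemming step is left out because it requires an external library.

```python
import re
import string

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "to", "your", "is", "at", "and", "or", "of", "for", "please"}

def preprocess(text):
    """Lowercase, then strip URLs, email addresses, punctuation, and stopwords."""
    text = text.lower()
    # URLs and email addresses are removed before punctuation is stripped,
    # since both rely on punctuation to be recognized.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOPWORDS]

tokens = preprocess(
    "Please verify your account at http://example.com/login or email help@example.com."
)
```

For the sample sentence, only the content words "verify", "account", and "email" remain.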
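IV-D Classification Metrics
Consider a confusion matrix with counts of samples classified as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We report the results of our experiments using accuracy, $F_1$ measure, precision, and recall, which are defined as
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \]
\[ \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \]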
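These four metrics follow directly from the confusion-matrix counts, as the short sketch below shows. The counts are hypothetical and are not results from this paper; the guards against division by zero are a convenience added for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts for 24 emails (not results from this paper).
accuracy, precision, recall, f1 = classification_metrics(tp=9, tn=9, fp=3, fn=3)
```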
IV-E Hyperparameter Tuning
We performed several exhaustive grid searches in order to find the best set of hyperparameters for SVM, Logistic Regression, and Random Forest. For SVM, we tried different values for the $C$ constant, the kernel type, the degree of the polynomial when using the polynomial kernel, and the gamma value when using the Gaussian kernel. For logistic regression, we tried different values for the regularization parameter. For random forests, we tried different numbers of estimators, maximum tree depths, minimum numbers of samples per leaf, minimum numbers of samples per split, and criteria for choosing one split over others.
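An exhaustive grid search simply evaluates every combination of candidate values and keeps the best one. The sketch below illustrates the idea for the SVM hyperparameters; the candidate values, the helper `grid_search`, and the placeholder scoring function `toy_score` are assumptions, since the values tried in this paper are not listed.

```python
from itertools import product

# Hypothetical search space; the values tried in the paper are not listed.
svm_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "degree": [2, 3],
    "gamma": [0.01, 0.1],
}

def grid_search(grid, evaluate):
    """Score every combination in `grid` and return the best (score, params)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Placeholder scoring function; a real run would return a cross-validated metric.
def toy_score(params):
    return -abs(params["C"] - 1) - abs(params["gamma"] - 0.1)

best_score, best_params = grid_search(svm_grid, toy_score)
```

In a real experiment, `toy_score` would be replaced by a function that trains the classifier with the given hyperparameters and returns its cross-validated accuracy or $F_1$ score.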
IV-F Experiments Flowchart
Figure 2 shows the flowchart of the experiments carried out in our work. After the feature extraction step, there are three scenarios in which the classification task was applied: 1) directly using the 20-dimensional Doc2Vec email embeddings, 2) using the linear PCA 2-D projections of the embeddings, and 3) using the Kernel PCA 2-D projections of the embeddings. Whether the full high-dimensional features or a 2-D projection of them is used, the next step is hyperparameter tuning, in which the best hyperparameters for the classifiers are selected. Once the hyperparameters are found, the classifiers are fit using them, and the classification metrics are reported using 10-fold cross-validation.
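The 10-fold cross-validation step can be sketched as follows for a dataset of 24 samples. The generator `k_fold_indices` is a name introduced for this example; whether the folds were shuffled or stratified is not stated in the text, so this sketch uses plain contiguous folds.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # The first (n_samples % n_folds) folds receive one extra sample.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(24, n_folds=10))
```

With 24 samples, each test fold contains only two or three emails, so the per-fold metrics are coarse and are averaged across the ten folds.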
V Results and Discussion
We report the averages of accuracy and $F_1$ score obtained after running the experiments using 10-fold cross-validation. Table I shows the results for the classifiers using the full 20-dimensional space provided by the document embeddings. SVM reports an accuracy and $F_1$ score of 81.6% and 76.6%, respectively, which is considerably higher than that of the next-best classifier, the random forest.
Table II shows the results for the classifiers trained using only the projections onto the first two components of linear PCA. In general, these results are consistently higher than those in Table I. This is an interesting finding, as it suggests that the variance present in the first two PCA components is not only enough to maintain the performance of the classifiers but also increases it. Random forest reports the highest accuracy and $F_1$ score, with 91.6% and 90.0%, respectively. However, the margin between the random forest and the rest of the classifiers is reduced, with the SVM and Naive Bayes classifiers performing similarly.
Table III shows the results for the classifiers trained using the projections onto the first two components of the RBF Kernel PCA. Unlike the results for linear PCA, the performance of the classifiers is diminished with respect to the results in the original 20-dimensional space. In this case, the best results are reported by SVM, with an accuracy and $F_1$ score of 78.3% and 76.6%, respectively.
One interesting finding is that the linear PCA representation of the document embeddings yields better classification results. This is worth noting, as document embeddings often contain an underlying non-linear structure. Because Doc2Vec embeds the document vectors in a semantic vector space, this suggests that the content of the emails can be segmented according to the class through a linear transformation of the vectors.
V-B Visualization of Email Embeddings
Figure 3 shows the 2-dimensional plot of the email embeddings using the first two linear PCA components, where the x-axis represents the first principal component and the y-axis the second. Additionally, we plotted the decision boundary of an SVM with an RBF kernel that was fitted using all 24 emails, for visualization purposes only. The distribution of the document embeddings follows a pattern that is easily captured by the decision boundary of the SVM. In this projection, phishing emails are contained in a blob located near the center of the plot, with only two legitimate emails within or on the border of the decision boundary. This finding agrees with the notably high performance of the classifiers using the linear PCA 2-D projections. However, note that even though the projection is linear, the decision boundary is not, which also explains the high performance of the non-linear classifiers.
Figure 4 shows the 2-dimensional plot using the first two RBF Kernel PCA components. As in Figure 3, we also plotted the decision boundary of an SVM with an RBF kernel that was fitted using all the samples. However, unlike Figure 3, the document embeddings are not clearly segmented by class. Legitimate emails are grouped closely together with the exception of emails 9 and 2, which are the same emails that lie within or near the wrong side of the decision boundary in Figure 3. Nevertheless, there are several phishing emails on the wrong side of the decision boundary. In this sense, this projection is a less useful representation for class segmentation than linear PCA, which is reflected in its classification results in Table III. The slight difference between the results in Table I and Table III might suggest that the document embeddings follow a similar configuration in the 20-dimensional feature space.
V-C Explained Variance of Linear PCA
Given that we obtained better classification results using the linear PCA projection, we performed an explained variance analysis for this PCA. Figure 5 shows the cumulative explained variance per number of components using PCA. The x-axis represents the number of components, whereas the y-axis shows the cumulative explained variance ratio $R_k$, defined as $R_k = \sum_{i=1}^{k} r_i$, where $k$ is the number of components and $r_i$ is the portion of variance explained by the $i$-th component. In this plot, 90% of the explained variance is reached using 12 components. The results in Table II are reported using the first two PCA components, which together account for approximately 25% of the total variance. Hence, one surprising finding is that the 91.6% accuracy of the random forest classifier was achieved using only 25% of the original variance.
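The cumulative explained variance ratio is a running sum over the per-component ratios, which makes it easy to find how many components are needed to reach a target such as 90%. The sketch below uses hypothetical per-component ratios, not the values behind Figure 5, and the helper name `components_for_variance` is an assumption.

```python
from itertools import accumulate

def components_for_variance(ratios, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return len(ratios)

# Hypothetical per-component ratios (not the values behind Figure 5).
ratios = [0.4, 0.3, 0.2, 0.1]
cumulative = list(accumulate(ratios))
k_needed = components_for_variance(ratios, threshold=0.85)
```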
VI Conclusion and Future Work
In this work, we generated document embeddings using Doc2Vec and performed binary classification using SVM, Logistic Regression, Random Forest, and Naive Bayes. We executed the experiments on the full 20-dimensional feature space generated by Doc2Vec, and additionally we calculated the 2-D projections of linear PCA and RBF Kernel PCA in order to run the experiments in the low-dimensional projections.
In the original feature space, SVM reports the highest classification performance, with an accuracy and $F_1$ score of 81.6% and 76.6%, respectively. The results using the 2-D linear PCA projections are higher than those of the 20-dimensional feature space: the Random Forest reports an accuracy and $F_1$ score of 91.6% and 90.0%, respectively, using only 25% of the original variance. The results using the 2-D RBF Kernel PCA projections are slightly lower than those of the 20-dimensional feature space: the SVM reports an accuracy and $F_1$ score of 78.3% and 76.6%, respectively. The highest results, obtained with the linear PCA projections, suggest that the underlying structure of the Doc2Vec document embeddings is likely to be linear. The overall high classification results suggest that the semantic vector space in which the document vectors lie is appropriate for this classification task. Moreover, the semantics of the emails’ content are well suited for the class segmentation.
As future work, we will explore the use of features that permit an easier interpretation and provide deeper insight into phishing and legitimate emails. There are other intriguing approaches to the phishing email detection problem. One possible approach is the use of evidence theory and fusion in formulating the problem, where a set of linguistic features and evidence can be used to decide the pignistic probability of whether an email is phishing. It is also possible to model phishing email detection by exploring other machine learning techniques or emerging deep/machine learning techniques such as reinforcement learning.
This research work is supported by the National Science Foundation (NSF) under Grant No. 1723765.
[1] (2020) Linguistic features for detecting fake reviews. In ICMLA.
[2] Can machine/deep learning classifiers detect zero-day malware with high accuracy? In IEEE Big Data, pp. 3252–3259.
[3] (2014-04) Classification of phishing email using random forest machine learning technique. Journal of Applied Mathematics 2014.
[4] (2006) Apache spamassassin project.
[5] (2016) Quantifying phishing susceptibility for detection and behavior decisions. Human Factors 58 (8), pp. 1158–1172.
[6] (2018) Evidence fusion for malicious bot detection in IoT. In IEEE Big Data, pp. 4545–4548.
[7] (2019) Detecting phishing websites through deep reinforcement learning. In IEEE COMPSAC, Vol. 2, pp. 227–232.
[8] Hybrid email spam detection model using artificial intelligence. International Journal of Machine Learning and Computing 10, pp. 316–322.
[9] (2007) Behavioral response to phishing risk. In Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, pp. 37–44.
[10] (2002) Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics 35 (5-6), pp. 352–359.
[11] (2006) Dimensionality reduction a short tutorial. Department of Statistics and Actuarial Science, Univ. of Waterloo, Ontario, Canada 37 (38), pp. 2006.
[12] (2020) Ensemble learning for detecting fake reviews. In IEEE COMPSAC, pp. 1320–1325.
[13] (2004-09) The enron corpus: a new dataset for email classification research. Vol. 3201, pp. 217–226.
[14] (2014) Distributed representations of sentences and documents. In ICML, pp. 1188–1196.
[15] (2002) NLTK: the natural language toolkit. arXiv preprint cs/0205028.
[16] (2019) Machine learning for email spam filtering: review, approaches and open research problems. Heliyon 5 (6), pp. e01802.
[17] (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[18] (2006) Phishingcorpus.
[19] (2015) The design of phishing studies: challenges for researchers. Computers & Security 52, pp. 194–206.
[20] (1980) An algorithm for suffix stripping. Program 14 (3), pp. 130–137.
[21] (2010) Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50.
[22] (2001) An empirical study of the naive bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence, Vol. 3, pp. 41–46.
[23] (2014) A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100.
[24] (2018-03) Detecting phishing e-mail using machine learning techniques cen-securenlp.