HTMLPhish: Enabling Accurate Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis

28 August 2019 · by Chidimma Opara, et al. · Northumbria University, Teesside University

Recently, the development and implementation of phishing attacks require little technical skill and cost. This has led to an ever-growing number of phishing attacks on the World Wide Web daily. Consequently, proactive techniques to fight phishing attacks have become extremely necessary. In this paper, we propose HTMLPhish, a deep learning model based on the HTML analysis of a web page for accurate phishing attack detection. On a dataset of over 300,000 web pages, HTMLPhish yielded 97.2% accuracy, outperforming traditional machine learning methods such as Support Vector Machine, Random Forest, and Logistic Regression. We also show the temporal stability and robustness of HTMLPhish by testing our proposed model on a dataset collected two months after the model was trained. In addition, HTMLPhish is a completely language-independent, client-side strategy which can therefore conduct web page phishing detection regardless of the textual language.


1. Introduction

The infamous phishing attack is a social engineering technique that manipulates internet users into revealing private information that may be exploited for fraudulent purposes (Lopez and Rubio, 2018). This form of cybercrime has recently become common because it can be carried out with little technical ability and at little cost (Sahingoz et al., 2019). The proliferation of phishing attacks is evident in the 46% increase in the number of phishing websites identified between October 2017 and March 2018 by the Anti-Phishing Working Group (APWG) (APWG, 2018). The impact of phishing attacks on businesses, such as decreased productivity, loss of proprietary data, and damage to reputation, can be devastating.

1.1. Problem Definition

Recent research in phishing detection has produced multiple technical countermeasures, such as augmented password logins (Chattaraj et al., 2018) and multi-factor authentication (Acar et al., 2013). However, these can be easily imitated by phishing websites as well.

Machine learning-based anti-phishing techniques, which rely on the assumption that phishing pages contain statistically distinct patterns from legitimate pages, are becoming popular. For example, in (Amrutkar et al., 2017), phishing web pages are automatically detected based on handcrafted features extracted from the URL, HTML content, network, and JavaScript of a web page. These machine learning-based techniques compare the features of the phishing websites with a set of predefined features. Therefore, the accuracy of these models depends on how comprehensive the feature set is and how impervious it remains to future attacks. Furthermore, current phishing detection research has focused on integrating natural language processing techniques with deep learning (LeCun et al., 2015; Gutierrez et al., 2018; Buber et al., 2017) to extract specific features from emails and URLs.

While feature-engineered techniques have proven successful, they nevertheless have their limitations. Manual feature engineering is a tedious process, and handcrafted features are often targeted and bypassed in subsequent attacks. It is also challenging to know the best features for a particular application.

1.2. Proposed Solution

This paper presents HTMLPhish, a deep learning based, data-driven, end-to-end automatic phishing web page classification approach. Our proposed approach does not require manual feature engineering. Instead, HTMLPhish directly analyses the HTML content of a web page. HTMLPhish first uses the word embedding technique to learn a representation of each HTML element and textual content, and a recurrent neural network (RNN) is then employed to model the sequential dependencies.

The following characteristics highlight the relevance of HTMLPhish to web page phishing detection: (1) every web page, whether legitimate or phishing, uses HTML to render its content in a web browser (Orman, 2012); (2) HTMLPhish analyses HTML directly, which helps preserve useful information and removes the arduous manual feature engineering process; (3) HTMLPhish takes both the HTML and textual contents into consideration when training the deep learning model.

To train and thoroughly evaluate our proposed model, we collected a dataset of the labelled HTML content of over 300,000 web pages in two batches two months apart. We propose a state-of-the-art HTML phishing detection method based on RNN techniques that achieves excellent performance. Furthermore, extensive evaluations are conducted to demonstrate the ability of HTMLPhish to detect zero-day attacks: the model was trained two months before being applied to the test dataset, and we recorded only a minimal 2% accuracy decline, which confirms that HTMLPhish remains reliable and temporally robust over a long period. Our experiments also demonstrate the ability of our model to detect phishing websites it has not been exposed to, thereby preventing zero-day phishing attacks. Finally, we demonstrate that HTMLPhish outperforms other baseline models, namely Logistic Regression (Cox, 1958), kernel SVM (Drucker et al., 1999), and a Random Forest Classifier (Kam, 1995).

The main contributions of this paper are summarised as follows:

  • Different from existing methods, our proposed model HTMLPhish is, to the best of our knowledge, the first to use HTML elements and textual content without any feature selection to train a deep learning model for phishing detection.

  • We conducted extensive evaluations of HTMLPhish using a dataset of more than 300,000 HTML contents and URLs of phishing and legitimate web pages created during the investigation.

  • Also, we applied NLP data pre-processing techniques that tailor the dataset to our model by removing insignificant HTML markup while preserving punctuation marks.

  • Furthermore, we demonstrated the robustness of HTMLPhish in detecting zero-day attacks. The model was tested on a dataset collected months after the deep learning model had been trained, and its accuracy remained similar, which shows the temporal robustness of our proposed model.

We organised the remainder of the paper as follows: the next section provides an overview of related work on techniques for detecting phishing web pages. Section 3 gives the preliminary knowledge on recurrent neural networks used in our model, while Section 4 provides an in-depth description of our proposed model. Section 5 elaborates on the dataset collection and pre-processing, while the detailed results of the evaluations of our proposed model are found in Section 6. Finally, we conclude our paper in Section 7.

2. Related Works

Ding et al. (Ding et al., 2019) proposed SHLR, a search- and heuristics-based approach trained with Logistic Regression. Their approach involves searching the title tag of a web page in the Baidu search engine to check whether it matches the top 10 search results. Subsequently, a set of heuristic rules is applied to the characters of the URL of the web page before a logistic regression classifier is used to evaluate the accuracy of the proposed method. Their experimental results yield 98.9% accuracy.

LPD, a client-side web page phishing detection mechanism, was proposed by Varshney et al. (Varshney et al., 2016). Strings from the URL and page title of a specified web page are extracted and searched on the Google search engine. If there is a match between the domain names of the top T search results and the domain name of the specified URL, the web page is considered legitimate. Their evaluations gave a true positive rate of 99.5%.

Barraclough et al. (Barraclough et al., 2013) built a model that combined fuzzy logic and a neural network to expose phishing in online transactions. The novelty is the combination of 288 features derived from the user-behaviour profile, legitimate website rules, PhishTank (https://www.phishtank.com), user-specific sites, and pop-ups from emails. These attributes were trained using supervised learning and yielded an accuracy of 98.5%. However, this method is also heavily dependent on manual feature engineering.

Smadi et al. (Smadi et al., 2018) proposed a neural network model that can adapt to the dynamic nature of phishing emails using reinforcement learning. The proposed model can handle zero-day phishing attacks and also mitigates the problem of a limited dataset using an updated offline database. Their experiment yielded a high accuracy of 98.63% on fifty features extracted from a dataset of 12,266 emails.

Bahnsen et al. (Bahnsen et al., 2017) proposed a phishing classification scheme that used features of the URL of a web page as input and implemented the model on an LSTM network. The results gave an accuracy of 98.7% on a corpus of 2 million phishing and legitimate URLs.

Unlike these works, we built HTMLPhish using only the HTML content of a web page on an LSTM neural network that does not require manual feature engineering. HTMLPhish learns sequences of data as part of its overall procedure for learning to accurately detect phishing web pages.

3. Preliminaries

In this section, we give a brief overview of the deep learning algorithms employed in our phishing detection classification problem.

3.1. Recurrent Neural Networks

Deep learning models have layers of hidden networks between their input and output layers. These hidden layers allow the extraction of more representative characteristics from the lower layers. In addition to the typical feed-forward layers, researchers have created effective deep learning models with recurrent layers (LeCun et al., 2015). Recurrent units have links between neural units that form a directed cycle, enabling them to display dynamic temporal behaviour over arbitrary sequential inputs. This characteristic has led to the application of recurrent units in speech recognition and handwriting recognition.

3.1.1. Vanilla RNN

A Vanilla RNN (Goodfellow et al., 2016) is an internally looped neural network, which can process sequences of inputs by iterating through them and keeping a state that contains information on what it has been exposed to in the network. Given a sequence of inputs, the output of the Vanilla RNN is given by:

h_t = tanh(W x_t + U h_{t-1} + b)    (1)

where W and U are the weights and b is the bias. x_t is the input at time step t and h_{t-1} is the hidden state from the previous time step.
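For concreteness, the recurrence in Equation 1 can be written out in a few lines of NumPy; the dimensions below are illustrative rather than those used by HTMLPhish:

```python
import numpy as np

def vanilla_rnn_step(x_t, h_prev, W, U, b):
    """One Vanilla RNN step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

# Illustrative dimensions: 8-dimensional inputs, 4 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # input-to-hidden weights
U = rng.normal(size=(4, 4))   # hidden-to-hidden weights
b = np.zeros(4)               # bias

h = np.zeros(4)               # initial hidden state
for x_t in rng.normal(size=(5, 8)):  # a sequence of 5 embedded tokens
    h = vanilla_rnn_step(x_t, h, W, U, b)
```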

3.1.2. Long short-term memory

Vanilla RNN applications still face the challenge of being unable to associate past data with the present task when there is a large gap between them. The Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber, 1997) was proposed to solve this long-term dependency issue in RNNs. The default aim of LSTMs is to recall information for prolonged periods. To achieve this, the LSTM has a cell that tunes the following output and the next state. The following equations describe how a layer of memory cells is updated at every time step t in the LSTM layer:

Gating variables:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (2)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (3)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (4)

New cell state:

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t    (6)
h_t = o_t ⊙ tanh(c_t)    (7)

in which x_t denotes inputs, i_t denotes input gates, f_t denotes forget gates, c_t denotes cell states, o_t denotes outputs, and h_t denotes the hidden state.
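A single LSTM step following Equations 2-7 can likewise be sketched in NumPy, again with illustrative dimensions rather than the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Equations 2-7 (p holds the weights)."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate state
    c = f * c_prev + i * c_tilde                                   # new cell state
    h = o * np.tanh(c)                                             # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, n = 8, 4  # input and hidden sizes (illustrative)
p = {f"W{g}": rng.normal(size=(n, d)) for g in "ifoc"}
p.update({f"U{g}": rng.normal(size=(n, n)) for g in "ifoc"})
p.update({f"b{g}": np.zeros(n) for g in "ifoc"})

h = c = np.zeros(n)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(x_t, h, c, p)
```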

3.1.3. GRU

In 2014, Cho et al. (Cho et al., 2014) proposed the GRU, which is similar to the LSTM but more computationally efficient. The GRU cell is made up of an update gate z_t and a reset gate r_t. The update gate is calculated using:

z_t = σ(W_z x_t + U_z h_{t-1})    (8)

where x_t is the input at the current time step and h_{t-1} is information from the previous time step t-1. W_z and U_z are the corresponding weights. The reset gate is very similar to the update gate except for the weights applied. It is calculated by:

r_t = σ(W_r x_t + U_r h_{t-1})    (9)

The final output of the cell is calculated in Equation 10 using the value of the update gate z_t, the output from the previous time step h_{t-1}, and the candidate state h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})) from the current cell:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t    (10)
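A GRU step following Equations 8-10 differs from the LSTM sketch above only in its gating; a minimal NumPy version (with the same illustrative dimensions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step following Equations 8-10."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)              # update gate (Eq. 8)
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)              # reset gate (Eq. 9)
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # Eq. 10

rng = np.random.default_rng(0)
d, n = 8, 4
p = {k: rng.normal(size=(n, d)) for k in ("Wz", "Wr", "Wh")}
p.update({k: rng.normal(size=(n, n)) for k in ("Uz", "Ur", "Uh")})

h = np.zeros(n)
for x_t in rng.normal(size=(5, d)):
    h = gru_step(x_t, h, p)
```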

4. The Proposed Model

In this section, we elaborate on the architecture of our proposed deep learning model HTMLPhish.

To model the sequential dependencies of each HTML element and textual content in our dataset, we choose to use an RNN due to its capability to process sequences of inputs and automate the feature engineering process. We define the problem of detecting phishing web pages using their HTML content as a binary classification task for prediction of two classes: legitimate or phishing.

Given a dataset with R web pages {(x_r, y_r)}, x_r for r = 1, . . . , R represents the HTML content of the r-th web page, while y_r ∈ {0, 1} is its label: y_r = 1 corresponds to a phishing HTML content while y_r = 0 is a legitimate HTML content. Our proposed model aims to detect the phishing websites accurately.

4.1. The Deep Learning Model

As detailed in Figure 1, HTMLPhish is essentially a deep learning model comprised of recurrent layers and Fully Connected (FC) layers. We employ recurrent layers to learn the representation in the input features for the HTML website classification. We also apply an Embedding Layer to extract useful features from the HTML content while the FC layers serve as an output layer for the deep learning model.

Figure 1.

A schematic overview of HTMLPhish. An 8-dimensional embedding transforms each tokenised word. Using a batch size of 20, the transformed HTML tokens are fed into an RNN layer. Then, the classification is carried out in the FC layer using a sigmoid function.

Embedding Layer: One of the main advantages of our model is its capacity to operate directly on unprocessed words. Therefore, the ability of our technique to learn representations of words is essential to our approach. First, we tokenise the HTML contents and segment the strings of HTML into tokens. Each token is associated with an index from a finite dictionary M. Next, we apply the word embedding technique. A simple index does not contain much valuable word information, so the word embedding layer maps each word index to a feature vector. A relevant representation of each word in the HTML and textual content of our dataset is then provided by the corresponding lookup-table feature vector, which is trained by backpropagation from a random initialisation. As these word vectors are gradually modified during training, they are organised into a vector space that is relevant to our phishing detection model and is exploited by the neural network layers.

The lookup table layer in the embedding layer gives a d-dimensional vector representation for each word w. Given an HTML content from a web page of length N words (w_1, ..., w_N), the lookup table applies the following function to each word to produce a resulting matrix:

Z = [S_{w_1}, S_{w_2}, ..., S_{w_N}] ∈ R^{d×N}    (11)

where S ∈ R^{d×|M|} are parameters to be learned, S_w is the w-th column of S, and d is the word vector size.

We compared the HTML length distributions and counted the number of unique words in our dataset to ensure the hyperparameters for this layer are well-tailored to our model. We have about 321,009 unique tokens in our dataset. We also observed that phishing pages tend to be shorter in terms of the number of words, with the average number of words across phishing and legitimate web pages being 3,500. Therefore, we empirically set the d-dimension of the word vectors to 8, the maximum number of words per web page to 5,000, and the size of the finite dictionary M to 320,000.
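As an illustration, this tokenisation and embedding set-up might be expressed with the tf.keras preprocessing utilities; the sample document and variable names below are ours, and filters='' stands in for the punctuation-preserving tokenisation described in Section 5:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

VOCAB_SIZE = 320_000   # size of the finite dictionary M
MAX_WORDS = 5_000      # maximum number of words per web page
EMBED_DIM = 8          # d-dimension of the word vectors

html_docs = ["<html><body><form action='login.php'>...</form></body></html>"]

# filters='' keeps punctuation attached to tokens, which the paper
# found improves phishing detection accuracy.
tokenizer = Tokenizer(num_words=VOCAB_SIZE, filters='', lower=True)
tokenizer.fit_on_texts(html_docs)
sequences = pad_sequences(tokenizer.texts_to_sequences(html_docs),
                          maxlen=MAX_WORDS)

# The embedding lookup table S is trained by backpropagation from a
# random initialisation.
embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM,
                      input_length=MAX_WORDS)
```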

Recurrent Layer: RNN layers follow the word embedding layer. Consider HTML content of length N, with the input to the network denoted as x = (x_1, ..., x_N), where each x_i ∈ R^d is a real-valued vector containing the embedded word information of the HTML content at position i. The RNN accepts x_t as input at each time step t. Therefore, once the RNN layers are trained, their output encodes the sequential relationships of the input sequence up to time t as h = (h_1, h_2, ..., h_t). This output is passed to the FC layers.

The RNNs were used to learn the interaction between the HTML content and the associated label of each web page. At each time step, the network receives a token from the HTML content as input and outputs its relevant label.

FC Layer: For the FC layer in our model, we chose the sigmoid activation to output the result of the model. This last layer, which comes after the RNN layer, squashes the output of the model into the range 0 to 1, according to the expression:

ŷ = σ(W h_t + b) = 1 / (1 + e^{-(W h_t + b)})

giving the probability of the two classes, legitimate or phishing, where ŷ ∈ [0, 1]. W and b are model parameters, and h_t is the input at time step t.
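Putting the three layers together, a minimal sketch of an Embedding → LSTM → sigmoid pipeline consistent with Figure 1 and the hyperparameters in Table 3 might look as follows in tf.keras; this is an illustration of the described architecture, not the authors' released code:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Embedding(input_dim=320_000, output_dim=8, input_length=5_000),  # word embedding
    LSTM(4),                         # 4 hidden units, as selected by the grid search
    Dense(1, activation="sigmoid"),  # FC layer: P(phishing) in [0, 1]
])
model.compile(optimizer=Adam(learning_rate=0.0015),
              loss="binary_crossentropy", metrics=["accuracy"])
```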

Optimisation: HTMLPhish was trained using Adam, a method proposed by (Kingma and Ba, 2015) for stochastic gradient optimisation. Adam is a blend of two common optimisation methods: AdaGrad (Duchi et al., 2011), the adaptive gradient algorithm, and RMSProp (Tieleman and Hinton, 2012), which adds a decay term. For the different network parameters, Adam computes individual adaptive learning rates based on estimates of the first and second moments of the gradients. Adam has been shown to perform as well as or better than other optimisation methods (Kingma and Ba, 2014), regardless of the hyperparameter setting. Since our deep learning model HTMLPhish is a binary classification network, we used binary cross-entropy to monitor the performance of the model:

L(y, ŷ) = -(1/R) Σ_{r=1}^{R} [y_r log ŷ_r + (1 - y_r) log(1 - ŷ_r)]    (12)

where y_r and ŷ_r are the true and predicted labels, respectively. This score was then used as a feedback signal to adjust the values of the weights using the Adam optimiser in Equation 13:

θ_{t+1} = θ_t - η m̂_t / (√(v̂_t) + ε)    (13)

where η is the learning rate and m̂_t and v̂_t are the bias-corrected first and second moments of the gradients, with default values of 0.9 for β_1, 0.999 for β_2, and 10^{-8} for ε.
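One Adam update (Equation 13) can be written out directly in NumPy; this is a generic sketch of the optimiser, not code from HTMLPhish:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.0015,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Equation 13) for a parameter vector theta."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One illustrative step on a 3-parameter vector.
theta = np.zeros(3)
m = v = np.zeros(3)
theta, m, v = adam_step(theta, grad=np.array([0.1, -0.2, 0.05]), m=m, v=v, t=1)
```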

5. Dataset

Data collection plays an essential role in phishing web page detection using only HTML contents. In our approach, we gathered the HTML and textual contents using a web crawler. We used the Beautiful Soup (Richardson, 2017) library in Python to create a parser that dynamically extracted the HTML content from each final landing page. We chose Beautiful Soup for the following reasons: (1) it offers functional versatility and speed in parsing HTML contents, and (2) Beautiful Soup does not correct errors when analysing the HTML Document Object Model (DOM). Figure 2 shows an overview of the data collection stage.

5.0.1. Data Collection

Figure 2. A schematic overview of the stages involved in our proposed model.

To evaluate the performance of HTMLPhish in this paper, we focus specifically on the HTML elements and textual contents generated from the pre-labelled phishing and legitimate web pages.

To carefully explore HTMLPhish's accuracy, stability, and reusability, we collected a dataset of URLs and HTML content from phishing and legitimate web pages over 60 days. The first dataset, D1, consisting of HTML content from 149,751 legitimate URLs and 46,862 phishing URLs, was collected between 11 November 2018 and 18 November 2018. The D1 dataset was used to train, validate, and test our model and to make a comparison against three baseline methods (see the overall result in Section 6.2). From 10 January 2019 to 17 January 2019, dataset D2, consisting of HTML content from 99,833 legitimate URLs and 39,920 phishing URLs, was generated. Note that D1 ∩ D2 = ∅. The D2 dataset, which was generated two months later, enables us to check whether HTMLPhish maintains temporal robustness. In total, we had HTML content from 249,584 legitimate URLs and 86,782 phishing URLs in our corpus, as shown in Table 1.

The legitimate URLs were drawn from the Alexa.com top 500,000 domains, while the phishing URLs were gathered by continuously monitoring Phishtank.com. The web pages in our dataset were written in different languages; our model is therefore not limited to detecting English web pages. Alexa.com offers a list of working websites that internet users frequently visit, so it is a good source for our aim. Note that the top three most targeted legitimate domains in our dataset were "google.com", "facebook.com", and "paypal.com".

Table 1. Summary of the HTML and URL datasets used in this paper

Dataset                 D1                    D2
Date generated          11-18 Nov 2018        10-17 Jan 2019
Legitimate web pages    149,751               99,833
Phishing web pages      46,862                39,920
Total                   196,613               139,753

5.0.2. Data Pre-processing

In consideration of the individual characters and elements that make up our dataset, we carried out the following data pre-processing procedures. First, we validated the importance of punctuation marks in our dataset. Recent studies (Spitkovsky et al., 2011) have highlighted that punctuation marks improve the linguistic and semantic meaning and dependency of sentences; however, most of those models were trained on English-language datasets such as the Wall Street Journal (WSJ) corpus (Marcus et al., 1993). HTML contains a sequence of markup tags that frame the elements of a website. Tags contain keywords and punctuation marks that define the formatting and display of the content in the web browser. A preliminary experiment on 2% of our dataset revealed that retaining punctuation marks in the HTML contents improves phishing detection accuracy by over 2%. This shows that HTML markup and punctuation marks can lend meaning to the syntactic structure of the HTML document and therefore match linguistic constituents.
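A tokeniser that keeps markup characters and punctuation as tokens, in the spirit of the pre-processing described above, might look like the sketch below; the regular expression is our illustrative choice, not the paper's exact rule:

```python
import re

def tokenize_html(html: str):
    """Split HTML content into word tokens and single punctuation tokens,
    keeping markup characters such as <, >, / and = rather than dropping them."""
    return re.findall(r"[A-Za-z0-9_-]+|[^\sA-Za-z0-9]", html)

tokens = tokenize_html('<a href="http://example.com">Login</a>')
# -> ['<', 'a', 'href', '=', '"', 'http', ':', '/', '/', 'example', '.',
#     'com', '"', '>', 'Login', '<', '/', 'a', '>']
```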

Additionally, we performed a selective removal of stop words. Following standard practice in NLP, we removed stop words from the HTML contents to increase the efficiency of our model. Although predefined libraries such as the Natural Language Toolkit in Python (Bird et al., 2009) provide stop-word lists tailored to 16 different languages and closely tied to the semantics of each language, these corpora are not relevant to our model. Since our dataset is made up of HTML content, which is a web language, we propose an efficient dictionary method that creates a list of stop words generated from our corpus.

To achieve this, we computed the term frequency-inverse document frequency (tf-idf) weight of each term s in a web page r, given by

tf-idf_{s,r} = tf_{s,r} × idf_s    (14)

where the weight tf_{s,r} is equal to the number of occurrences of term s in web page r, and idf_s is calculated by idf_s = log(R / df_s), where df_s is the number of web pages in the dataset that contain the term s and R is the number of web pages in the dataset. Because the stop-word list is built from our own corpus rather than from any natural language, our model can be used with websites in other languages.

In other words, tf-idf assigns to term s a weight in each web page r that is: (1) high when s occurs frequently in a small set of web pages (thus giving those web pages high discriminatory authority); or (2) low when almost all the web pages use the term s.

We removed the words with tf-idf ≤ 0.001 from our dataset to construct a new dataset for analysing the web pages. Table 2 shows an excerpt of the stop words removed from our dataset. In total, we removed 437 words, and 90% of these words were HTML attributes such as 'IE' (which means 'Internet Explorer'), '<b>', and the 'X-UA-Compatible' meta tag, which allows web developers to choose the version of the web browser in which the web page should be rendered.
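As a sketch of this corpus-specific stop-word construction, one could compute the weights with scikit-learn's TfidfVectorizer (version ≥ 1.0 for get_feature_names_out) and threshold them; averaging the per-page weights across the corpus is our assumption, as the paper does not state the exact aggregation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny placeholder corpus standing in for the tokenised HTML documents.
corpus = ["<div> <b> login </b> </div>", "<ul> <li> account </li> </ul>"]

# token_pattern=r"\S+" keeps markup tokens such as <div> intact.
vec = TfidfVectorizer(lowercase=True, token_pattern=r"\S+")
X = vec.fit_transform(corpus)

# Average tf-idf weight of each term across the corpus; terms at or below
# the threshold become corpus-specific stop words (threshold from the paper).
mean_weight = np.asarray(X.mean(axis=0)).ravel()
stop_words = [t for t, w in zip(vec.get_feature_names_out(), mean_weight)
              if w <= 0.001]
```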

Table 2. HTML stop words

<b> — Used to draw attention to the attached text without suggesting any additional significance or emphasis; text enclosed by <b> tags is displayed in a bold typeface.
<i> — Used to distinguish text from the adjacent text by styling the marked text in italics, without suggesting any new emphasis on the italic text.
X-UA-Compatible — A meta tag that enables web developers to select which Internet Explorer version the website should be rendered as.
IE — Internet Explorer web browser.
Chrome — Google's Chrome Frame.
<span> — The inline element that corresponds to the block-level <div>; used for purely stylistic purposes to select inline content.
<div> — The HTML division element, a generic container for continuous content; it has no impact on the content or design until it is styled using CSS.
<li> — Used to illustrate items in a list.
<ul> — Renders an unordered list of items, usually represented as a bulleted list.

6. Evaluation

In this section, we discuss the selection of hyperparameters used to tune our proposed model. Furthermore, we elaborate on the evaluations carried out to analyse the performance of HTMLPhish.

6.1. Experimental Setup

6.1.1. Implementation.

A suitable combination of hyperparameters was needed to tune the Vanilla RNN, GRU, and LSTM models and determine the best RNN algorithm for HTMLPhish. A grid search was used to select the best number of hidden units (ranging from 2 to 32) and the best optimisation algorithm for the models (varying between RMSProp and Adam) within a range of learning rates (from 0.0001 to 0.1).
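Such a grid search might be sketched as follows; build_model is our illustrative helper, and fitting and evaluation on the validation split are elided:

```python
from itertools import product
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam, RMSprop

def build_model(units, opt_name, lr):
    """One candidate configuration (Embedding -> LSTM -> sigmoid)."""
    opt = Adam(learning_rate=lr) if opt_name == "adam" else RMSprop(learning_rate=lr)
    model = Sequential([
        Embedding(input_dim=320_000, output_dim=8, input_length=5_000),
        LSTM(units),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Grid mirroring the search ranges reported in the paper (values sampled).
for units, opt_name, lr in product([2, 4, 8, 16, 32],
                                   ["rmsprop", "adam"],
                                   [0.0001, 0.001, 0.0015, 0.01, 0.1]):
    model = build_model(units, opt_name, lr)
    # model.fit(...) on the training split and model.evaluate(...) on the
    # validation split would go here; the best-scoring configuration is kept.
```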

Table 3 details the selected parameters that we found gave the best performance on our dataset, bearing in mind unavoidable hardware limitations. The three RNN models were implemented in Python 3.5 on a TensorFlow 1.2.1 backend. The batch size for training and testing the models was set to 20. The Adam optimiser (Kingma and Ba, 2015) with a learning rate of 0.0015 was used to update the network weights. As introduced above, we used binary cross-entropy to monitor the performance of the model. All HTMLPhish and baseline experiments were conducted on an HP desktop with an Intel(R) Core CPU, an Nvidia Quadro P600 GPU, and the CUDA 9.0 toolkit installed.

Table 3. Hyperparameters selected to tune the HTMLPhish + RNN models

Hyperparameter                             Potential choices                                    Selected
Number of hidden units in RNN layer        2-32                                                 4
Vector size of embedding layer             -                                                    8
Optimizer                                  RMSProp, Adam                                        Adam
Learning rate                              0.0001-0.1                                           0.0015
Number of epochs                           -                                                    20
Batch size                                 -                                                    20
Loss function                              binary cross-entropy, categorical cross-entropy     binary cross-entropy
Maximum number of words per web page       -                                                    5,000
Size of the embedding lookup dictionary    -                                                    320,000

6.1.2. Evaluation Setup.

We implemented a validation strategy to ensure a well-rounded experiment for both HTMLPhish and the baseline methods. For our model, we used a 5-fold cross-validation strategy, which provides an extensive evaluation of the model.
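A 5-fold strategy of this kind can be set up with scikit-learn's StratifiedKFold; the placeholder arrays below stand in for the padded token sequences and labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data standing in for the padded token sequences and labels.
X = np.random.randint(0, 320_000, size=(100, 5_000))
y = np.random.randint(0, 2, size=100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, te) in enumerate(skf.split(X, y)):
    X_train, y_train = X[tr], y[tr]   # train the model on this split
    X_test, y_test = X[te], y[te]     # evaluate on the held-out fold
```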

6.1.3. Evaluation Metrics

In our experiment, we used the precision, recall, and F1 score to evaluate the performance of HTMLPhish and the three baseline models. These metrics were calculated using the numbers of true positives, false positives, and false negatives. We also calculated the accuracy of each model, using Equation 15 for each validation round, to analyse the number of instances in the dataset that were correctly classified by the models:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (15)

where TP, FP, TN, and FN stand for the numbers of true positives, false positives, true negatives, and false negatives, respectively.

We also use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) in our evaluation. The ROC curve is a probability curve, while the AUC depicts how well the model can distinguish between the two classes, which for our model are legitimate and phishing. The higher the AUC value, the better the performance of the model. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), where TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
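All of these metrics are available in scikit-learn; a small sketch with made-up predictions (1 = phishing, 0 = legitimate):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, auc)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.1, 0.4, 0.7]   # sigmoid outputs
y_pred = [int(p >= 0.5) for p in y_prob]            # thresholded predictions

acc = accuracy_score(y_true, y_pred)     # Equation 15
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)
fpr, tpr, _ = roc_curve(y_true, y_prob)  # points of the ROC curve
roc_auc = auc(fpr, tpr)                  # area under the ROC curve
```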

6.2. Overall Result

In this section, we evaluate the performance of HTMLPhish using the Vanilla RNN, GRU, and LSTM in detecting phishing web pages, and its effectiveness compared to other baseline models. We trained and tested HTMLPhish with the RNNs on the D1 dataset using the evaluation setup and metrics detailed above. The ROC curves from this experiment are shown in Figure 3. From the results detailed in Table 4, HTMLPhish + LSTM yielded over 95% across its precision, recall, and F1 score metrics. HTMLPhish + LSTM also accurately classified 97.2% of our test set as cases of phishing or legitimate web pages. Furthermore, the results show that the precision, recall, and F1 scores of the three RNN models are well-balanced, as their values are very similar. This indicates that HTMLPhish can accurately detect phishing web pages when deployed in the wild.

Note: Given that HTMLPhish + LSTM outperformed the other configurations, we use HTMLPhish + LSTM as the default setting. For the rest of this section, we will use the term HTMLPhish to indicate HTMLPhish trained with the LSTM algorithm unless otherwise stated.

Figure 3. The ROC curve of HTMLPhish on the RNN models
Figure 4. The ROC curve of HTMLPhish and the baseline models
Models                                           Accuracy  Precision  Recall  F1 Score  Training time
HTMLPhish + LSTM                                 0.972     0.974      0.950   0.964     22 minutes
HTMLPhish + RNN                                  0.931     0.942      0.920   0.942     5 minutes
HTMLPhish + GRU                                  0.951     0.940      0.940   0.950     8.3 minutes
HTMLPhish + LSTM + D2 Dataset                    0.953     0.945      0.945   0.958     16 minutes
Kernel SVM + TF-IDF + D1 Dataset                 0.914     0.795      0.854   0.823     7 minutes
Logistic Regression + TF-IDF + D1 Dataset        0.849     0.835      0.889   0.861     45 seconds
Random Forest Classifier + TF-IDF + D1 Dataset   0.945     0.901      0.932   0.924     2 minutes
Table 4. Results of HTMLPhish with the RNNs and the baseline models on the D1 dataset
(a) Random Forest Classifier
(b) Kernel SVM
(c) Logistic Regression
(d) HTMLPhish
Figure 5. The confusion matrices from testing the HTMLPhish, Random Forest Classifier, Kernel SVM, and Logistic Regression algorithms on the D1 dataset.

6.3. Comparison Study

Furthermore, we investigated the effect of the LSTM in improving phishing detection on web pages using HTML content, compared to simpler baseline models. For our comparative study, we used three machine learning models: logistic regression, kernel SVM, and a random forest classifier. We chose these models because these traditional classifiers are commonly used in sequence detection systems (Mirończuk and Protasiewicz, 2018) and are therefore relevant baselines to compare with HTMLPhish. We used the TF-IDF scores of the words in the vocabulary of our corpus as features to train the baseline models.

6.3.1. Experimental Setup and Result.

For the comparative study, we also used the D1 dataset and the evaluation metrics explained above to carry out our analysis. We empirically set the number of trees to 70 for the random forest classifier, the penalty for the logistic regression to L1, and the penalty parameter of the non-linear SVM with a radial basis function (RBF) kernel to 50.0.
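A sketch of the three baselines with the stated settings, using scikit-learn; the placeholder arrays stand in for the TF-IDF features of the D1 dataset, and C = 50.0 reflects our reading of the SVM penalty setting:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Placeholder TF-IDF features and labels standing in for the D1 dataset.
X_tfidf = np.random.rand(200, 50)
y = np.random.randint(0, 2, size=200)

baselines = {
    "Random Forest": RandomForestClassifier(n_estimators=70),
    "Logistic Regression": LogisticRegression(penalty="l1", solver="liblinear"),
    # C=50.0 is our reading of the paper's RBF-kernel SVM setting.
    "Kernel SVM": SVC(kernel="rbf", C=50.0),
}
for name, clf in baselines.items():
    clf.fit(X_tfidf, y)
    print(name, clf.score(X_tfidf, y))
```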

Table 4 shows the precision, recall, and F1 score of HTMLPhish against the three baseline models, while their ROC curves are shown in Figure 4. HTMLPhish outperforms the three baseline models (with over 5% improvement in accuracy), while the random forest classifier yielded better results than the logistic regression and SVM classifiers. The advantage of the RNN can be attributed to its long-term memory and temporal robustness. In addition, we measured the training and evaluation times of HTMLPhish and the baseline models. On the D1 dataset, HTMLPhish takes 22 minutes to train, while the logistic regression model was trained in less than a minute. Once trained, HTMLPhish can conduct phishing detection on 564 HTML contents within one second.

6.4. Temporal Robustness

In this section, we evaluate the temporal robustness of HTMLPhish on a dataset collected two months after it was trained.

6.4.1. Experimental Setup and Result.

For this experiment, we saved the architecture and weights of HTMLPhish trained on the D1 dataset. Subsequently, we tested the trained model on the D2 dataset collected two months later. The evaluation setup and metrics detailed in Section 6.1.2 were once again applied to measure the performance of HTMLPhish on the D2 dataset.

We observe that HTMLPhish has equivalent performance on the D2 dataset, where the accuracy decreases by only 2% compared with that on the D1 dataset. This shows that HTMLPhish remains temporally robust, and there is no need to retrain the model within two months.

7. Conclusion

In this paper, we proposed HTMLPhish, a deep learning-based model that detects phishing web pages using HTML contents. We evaluated our model using a comprehensive dataset of HTML contents generated from over 300,000 phishing and legitimate web pages. HTMLPhish provided a high precision rate, showing a temporally stable result even when it was trained two months before being applied to a test dataset. This shows that HTMLPhish can effectively detect zero-day phishing attacks whose HTML contents it has not previously encountered.

We also experimentally demonstrated the added advantage of the long-term memory in recurrent neural networks over baseline models in detecting phishing web pages. Furthermore, we showed that our proposed method outperforms feature-based detection methods. Our dataset was generated from web pages written in English and non-English languages, which shows that the performance of HTMLPhish is independent of the language of the HTML content.

Acknowledgements.
The authors hereby acknowledge the Petroleum Technology Development Fund (PTDF), Nigeria for the funding and support provided for this work.

References

  • T. Acar, M. Belenkiy, and A. Küpçü (2013) Single password authentication. Computer Networks 57 (13), pp. 2597–2614. Cited by: §1.1.
  • C. Amrutkar, Y. S. Kim, and P. Traynor (2017) Detecting mobile malicious webpages in real time. IEEE Transactions on Mobile Computing (8), pp. 2184–2197. Cited by: §1.1.
  • APWG (2018) Phishing activity trends report, 1st quarter 2018. Technical report Cited by: §1.
  • A. C. Bahnsen, E. C. Bohorquez, S. Villegas, J. Vargas, and F. A. González (2017) Classifying phishing urls using recurrent neural networks. In Electronic Crime Research (eCrime), 2017 APWG Symposium on, pp. 1–8. Cited by: §2.
  • P. A. Barraclough, M. A. Hossain, M. Tahir, G. Sexton, and N. Aslam (2013) Intelligent phishing detection and protection scheme for online transactions. Expert Systems with Applications 40 (11), pp. 4697–4706. Cited by: §2.
  • S. Bird, E. Klein, and E. Loper (2009) Natural language processing with Python: analyzing text with the Natural Language Toolkit. O'Reilly Media, Inc. Cited by: §5.0.2.
  • E. Buber, B. Dırı, and O. K. Sahingoz (2017) Detecting phishing attacks from url by using nlp techniques. In International Conference on Computer Science and Engineering (UBMK), 2017, pp. 337–342. Cited by: §1.1.
  • D. Chattaraj, M. Sarma, and A. K. Das (2018) A new two-server authentication and key agreement protocol for accessing secure cloud services. Computer Networks 131, pp. 144–164. Cited by: §1.1.
  • K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §3.1.3.
  • D. R. Cox (1958) The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B (Methodological), pp. 215–242. Cited by: §1.2.
  • Y. Ding, N. Luktarhan, K. Li, and W. Slamu (2019) A keyword-based combination approach for detecting phishing webpages. computers & security 84, pp. 256–275. Cited by: §2.
  • H. Drucker, D. Wu, and V. N. Vapnik (1999) Support vector machines for spam categorization. IEEE Transactions on Neural networks 10 (5), pp. 1048–1054. Cited by: §1.2.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §4.1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §3.1.1.
  • C. N. Gutierrez, T. Kim, R. Della Corte, J. Avery, D. Goldwasser, M. Cinque, and S. Bagchi (2018) Learning from the ones that got away: detecting new forms of phishing attacks. IEEE Transactions on Dependable and Secure Computing 15 (6), pp. 988–1001. Cited by: §1.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.1.2.
  • H. T. Kam (1995) Random decision forest. In Proc. of the 3rd Int’l Conf. on Document Analysis and Recognition, Montreal, Canada, August, pp. 14–18. Cited by: §1.2.
  • D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980 15. Cited by: §4.1, §6.1.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §1.1, §3.1.
  • J. Lopez and J. E. Rubio (2018) Access control for cyber-physical systems interconnected to the cloud. Computer Networks 134, pp. 46–54. Cited by: §1.
  • M. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of english: the penn treebank. Cited by: §5.0.2.
  • M. M. Mirończuk and J. Protasiewicz (2018) A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications 106, pp. 36–54. Cited by: §6.3.
  • H. Orman (2012) Towards a semantics of phish. In 2012 IEEE Symposium on Security and Privacy Workshops, pp. 91–96. Cited by: §1.2.
  • L. Richardson (2017) Beautiful soup. 4.6.0 edition. External Links: Link Cited by: §5.
  • O. K. Sahingoz, E. Buber, O. Demir, and B. Diri (2019) Machine learning based phishing detection from urls. Expert Systems with Applications 117, pp. 345–357. Cited by: §1.
  • S. Smadi, N. Aslam, and L. Zhang (2018) Detection of online phishing email using dynamic evolving neural network based on reinforcement learning. Decision Support Systems 107, pp. 88–102. Cited by: §2.
  • V. I. Spitkovsky, H. Alshawi, and D. Jurafsky (2011) Punctuation: making a point in unsupervised dependency parsing. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 19–28. Cited by: §5.0.2.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §4.1.
  • G. Varshney, M. Misra, and P. K. Atrey (2016) A phish detector using lightweight search features. Computers & Security 62, pp. 213–228. Cited by: §2.