Cyberattacks cause enormous damage to our society, and many of them start with phishing, which tricks people into revealing their sensitive information to the attacker. In particular, phishing URLs are camouflaged to look familiar so that careless people click them and leak their private information. Therefore, many detection methods have been developed, and in response, attackers have started to adopt evasion techniques that camouflage phishing URLs with legitimate patterns (see Section 3 for more details) (Oliveira:2017:DSP:3025453.3025831; Lin:2019:SSE:3349608.3336141; Ho:2019:DCL:3361338.3361427; adv; Ehab). Thus, it is of utmost importance to prevent phishing attacks that use evasion.
Machine learning methods to detect phishing can be categorized into two types: content-based and URL string-based. Content-based methods download and analyze web page contents (8015116; DBLP:conf/icitst/MohammadTM12; 2017arXiv170107179S). However, they require non-trivial computation to process many web pages and are vulnerable to web browser-based exploits (because the web pages must be accessed). Most importantly, it is not easy to collect such training data. For all these reasons, content-based methods are not always preferred. String-based methods mainly rely on URL string pattern analyses, because it is well known that phishing URLs have very distinguishable string patterns (Ma09beyondblacklists; Blum:2010:LFB:1866423.1866434; 6061361; DBLP:conf/icitst/MohammadTM12; Mohammad2014; 7207281; Verma:2015:CPU:2699026.2699115; 7945048; 2017arXiv170107179S; hong2020phishing; anand2018phishing). Accordingly, many lexical features have been proposed to detect phishing URLs (see Section 2), and these features are known to be effective. Because string-based methods are computationally lightweight and provide high accuracy, many researchers prefer them for their efficiency (2017arXiv170107179S). Some researchers instead rely on a blacklist of IP addresses and domains, but its accuracy is known to be mediocre.
Almost all existing string-based methods hardly consider evasion (adv), the technique by which an attacker creates seemingly legitimate phishing URLs by manipulating their patterns to deceive defenders' detection methods. In this work, we consider two more key patterns of phishing attacks to design an advanced string-based detection method that outperforms existing methods and is robust to evasion. First, the attacker is sensitive to cost efficiency (apwg). In many cases, attackers (partially) reuse phishing attack materials and prefer specific hosting companies for their looser policies (e.g., not requiring identification information) and relatively cheaper prices. When a private server is used instead of a hosting company, the attacker prefers shared hosting, i.e., one server is used for multiple phishing campaigns and multiple domains — in our data, 15.8% of IP addresses are connected to multiple domains. Second, the attacker creates phishing URLs on top of benign servers, domains, IP addresses, and/or substrings to evade existing detection methods (apwg).
Considering all these facts, we design a novel unified framework of natural language processing and a network-based approach to detect phishing URLs — its overall workflow is shown in Fig. 1. We regard each URL as a sentence and segment it into substrings (words) considering the syntax and punctuation symbols of URLs — URLs have a well-defined syntax, as English does. After that, we build one large network consisting of heterogeneous entities, such as URLs, domains, IP addresses, authoritative name servers, and substrings, and perform our customized belief propagation to detect phishing URLs (see Section 4.3.1). We note that none of the related works listed above includes a network-based inference scheme. Similar network-based inference methods have been used in other domains (Manadhata2014; chau2011polonium); however, our method differs from them in how it defines edge potentials, which determine the penalty when two neighboring entities have different predicted labels.
Our approach effectively infers that seemingly unrelated phishing URLs are actually related and is robust to evasion. Because we perform inference on a network of heterogeneous entities, an evasion attempt for a phishing URL is unlikely to succeed unless a majority of its neighbors in the network are evaded at the same time (see Section 5 for detailed discussions with theorems and proofs), which is our main contribution in comparison with existing works.
We crawled many suspicious URLs and also downloaded a couple of datasets released by other researchers (Sorio2013DetectionOH; ahmad). In total, we have about 120K phishy and 380K benign URLs. We compare our approach with state-of-the-art baseline methods, including graph convolutional networks (GCNs) and feature engineering-based methods, and our method shows the best detection performance among them. Furthermore, in additional evasion tests, our method shows better F1 scores than the baseline methods. Because evasion incurs non-trivial expenses for the attacker to gain access to benign domains, IP addresses, and so forth, our robust detection method greatly increases the attacker's financial burden.
Our contributions can be summarized as follows:
We design a novel network-based inference method equipped with our proposed robust edge potential assignment mechanism. Our network inference on top of the edge potential assignment outperforms many baseline methods including feature engineering-based and network-based classifiers.
Our proposed network-based method has a theoretical ground on why it is robust to evasion (see Section 5).
We conduct experiments with a large set of URLs collected by us and downloaded from other work. Our data covers a wide variety of phishy/benign URL patterns.
In the following, we first review the literature in Section 2 and describe the motivation of this work in Section 3. Then, in Sections 4 and 5, we design a novel network-based detection method robust to evasion and analyze its theoretical robustness. After that, we conduct extensive experiments on phishing URL detection with and without evasion in Section 6. Lastly, in Sections 7 and 8, we describe our crawled data and conclude the paper. For reference, in Appendix A, we introduce a set of lexical features widely used to detect phishing URLs, sorted in descending order of the feature importance extracted from the best performing baseline method.
2. Related Work
In this section, we review phishing URL detection models and attackers’ behavioral pattern analyses.
2.1. Methods to Detect Phishing URLs
Extensive work has been done to counter phishing attacks (Ma09beyondblacklists; Blum:2010:LFB:1866423.1866434; 6061361; DBLP:conf/icitst/MohammadTM12; Mohammad2014; 7207281; Verma:2015:CPU:2699026.2699115; 7945048; 8015116; 2017arXiv170107179S). Typically, researchers have explored machine learning techniques to automatically detect phishing URLs. It is vital to have a well-defined set of features for classification algorithms to be effective, so we introduce a widely used set of 19 URL features, collected from related papers, in Appendix A. All these features are used by some baseline methods in our experiments. None of the mentioned works is based on network-based inference; they rely on feature engineering.
Mao et al. designed a phishing URL detection method robust to evasion based on web page content features (8015116). However, it is not easy to collect such training data in many cases because phishing attacks do not last long and web pages are quickly removed, which is one common drawback of all content-based detection methods (6061361).
Several sequence (e.g., URL in our context) classification models have been proposed (7945048; 2018arXiv180203162L; melissa-dl). Some of them have advanced architectures combining components such as recurrent neural networks, convolutional neural networks, word embeddings, and multiple hierarchical layers. We use their ideas as additional baselines: the first uses long short-term memory (LSTM) cells, the second uses one-dimensional convolution (1DConv), and the third uses both (1DConv+LSTM).
Network-based methods have been used for a couple of related problems (Manadhata2014; chau2011polonium). In (Manadhata2014), the authors tried to detect malicious domains (rather than URLs), and the authors of (chau2011polonium) proposed a heuristic belief propagation method to detect malicious code. The two works differ in how they create networks but use the same belief propagation method; both correspond to the baseline marked as 'POL' in our experiments. Peng et al. and Khalil et al. also tried a network approach for malicious domain detection (10.1007/978-3-030-12981-1_34; Khalil:2018:DOG:3176258.3176329). However, their methods are not directly applicable to our phishing URL data.
2.2. Attackers’ Behavioral Patterns
The Phishing Activity Trends Report (apwg) by the Anti-Phishing Working Group is one of the most reputable reports in the field, and we analyzed its quarterly editions. The two most important observations from the reports are: i) some web hosting companies are preferred by attackers due to their low prices and anonymity, and ii) many phishing URLs have similar string patterns because they are created by similar tools or reused from old phishing campaigns. There are many other interesting observations, as follows:
There has been an increase in the number of phishing attacks using free hosting providers or website builders. It has been reported that 81.7% of malicious websites are hosted on free hosting providers (de2021compromised). These free hosts are easy to use and also allow threat actors to create subdomains spoofing a targeted brand, resulting in a more legitimate-looking phishing site. Free hosts also afford phishers additional anonymity, because these services hide registrant information.
The attacker prefers shared hosting which means multiple domains share the same hosting server. Therefore, seemingly unrelated domains may belong to the same host or IP address.
Most attacks target only a few hundred vendors, continuing a years-long trend in which the same few hundred companies are attacked regularly. Considering this fact, we crawled URLs from phishtank.com for the three most frequently attacked vendors: Bank of America, eBay, and PayPal.
53% of phishing attacks use 'com' domains; 'net', 'org', and 'br' domains are the next most preferred, at roughly equal rates.
Definition 1 (Evasion).
Evasion is an effective technique that one can adopt to disturb a machine learning task by creating a ‘counter-evident’ sample, e.g., a phishing URL hosted by a benign domain or IP address. This evasion can be done in various ways. For detailed evasion techniques that we consider, refer to Section 6.6.
Shirazi et al. showed that existing phishing URL detection methods are adversely impacted by evasion, without suggesting a countermeasure (adv). Specifically, they conducted evasion tests that randomly select up to four features of phishing URLs and change the selected features to other, benign values. In their non-evasion tests, most classifiers showed high accuracy. In their evasion tests, however, the best performing classifier's accuracy (recall) decreased from 82-97% to 45-79% with one feature change, and to 0% with four feature changes.
To our knowledge, designing a non-content-based phishing URL detection method robust to evasion has not been actively studied. We consider many aspects of URLs, including domains, IP addresses, name servers, and string patterns, but not contents — because collecting phishing web page contents requires non-trivial effort. Most importantly, our method is based on a network of these entities. Intuitively speaking, attackers cannot disturb our network-based inference even after evasion if many neighbors of a phishing URL in the network remain the same as before (see Section 5). Some large-scale evasion can still neutralize our method; however, it requires non-trivial expenses, thus decreasing the attacker's motivation for such evasion.
While the evasion cost is hard to measure in money, it includes various intangible efforts, such as exploiting benign web servers to implant phishing pages and maintaining a custom domain without any phishing campaigns until D-Day to prevent it from being blacklisted. In particular, how long it takes for an attacker to successfully exploit an administrator's account on a benign server depends on the security environment and the attacker's skills.
4. Proposed Method
After introducing the overall workflow of our method, we describe its detailed steps with some key visualization results.
4.1. Overall Method
Fig. 1 shows our overall workflow. The entire process can be divided into the following steps:
We crawl many URLs from phishtank.com and download other works’ open datasets.
As mentioned earlier, we create a heterogeneous network of URLs, domains, IP addresses, name servers, and substrings (words). We use a standard natural language processing technique to segment URLs into substrings (words) and draw edges between a URL and substrings.
We run our customized belief propagation algorithm to infer unknown URLs' phishy/benign labels, which is our main contribution. This type of inference is called transductive: both training and testing samples co-exist in one network, and testing samples' labels are inferred from known training samples' labels following the network structure.
4.2. Network Construction
We perform network-based classification rather than feature engineering-based classification. As mentioned earlier, phishing URLs share many common string patterns and various entities are cross-related, so we create a network to represent the complicated relationships among multiple entities (vertices) such as URLs, their domains, IP addresses, authoritative name servers, and substrings.
We draw an edge between a URL and its domain.
We draw an edge between a domain and its resolved IP address. We use domains.google and virustotal.com to retrieve the domain-IP resolution history. They return not only current but also all past resolution results with timestamps, which enables correct connections. Sometimes one domain is connected to multiple IP addresses.
We draw an edge between a domain and its authoritative name servers. In general, there exist multiple authoritative name servers for a domain, and one authoritative name server provides resolution services for multiple domains.
We draw an edge between a URL (i.e., sentence) and a substring (i.e., word) if the URL contains the substring. For these edges, it is very crucial how to segment a URL into substrings. We will shortly describe this in the following section.
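The four edge types above can be sketched as the construction of a single edge set; the record fields (`url`, `domain`, `ips`, `name_servers`, `words`) are hypothetical names rather than the actual crawler schema:

```python
# Sketch of the heterogeneous network construction. Each record is assumed
# to be a dict with hypothetical keys: url, domain, ips, name_servers, words.
def build_network(records):
    edges = set()
    for r in records:
        edges.add((r["url"], r["domain"]))   # URL -- domain
        for ip in r["ips"]:                  # domain -- resolved IP addresses
            edges.add((r["domain"], ip))
        for ns in r["name_servers"]:         # domain -- authoritative name servers
            edges.add((r["domain"], ns))
        for w in r["words"]:                 # URL -- contained substrings (words)
            edges.add((r["url"], w))
    return edges

example = [{
    "url": "http://example.org/login",
    "domain": "example.org",
    "ips": ["93.184.216.34"],
    "name_servers": ["a.iana-servers.net"],
    "words": ["login"],
}]
print(len(build_network(example)))  # 4
```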
4.2.1. How to segment a URL into words
A URL is used to locate resources on the Internet. It consists of several parts: scheme, username, password, host, port number, path, and query string — some of them can be missing. We use customized word segmentation policies for each part, as follows:
Scheme means the protocol, e.g., http and https, so only two words are possible. However, since these words have very high frequencies, we do not use them in our network. Stop words do not carry meaning but have high frequency values in English, such as 'a', 'the', and 'is', and removing them is a standard step in natural language processing algorithms; we use a frequency-based elbow method to detect the stop words of URLs, as described shortly.
Username and password can be specified before host. We segment them using the punctuation symbols, i.e., ‘//’, ‘:’, and ‘@’. An example is ‘http://username:firstname.lastname@example.org’.
Hostname can be simply segmented into words by ‘.’.
Sometimes path can be very long, separated by ‘/’. We use all possible punctuation symbols, such as ‘/’, ‘.’, ‘!’, ‘&’, ‘,’, ‘#’, ‘$’, ‘%’, and ‘;’, to segment the path part into words.
Query string can contain multiple queries separated by '&', and each query consists of a name and a value, e.g., 'term=bluebird&source=browser-search'. We extract words using the two punctuation symbols '=' and '&'.
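The per-part segmentation policy above can be sketched with Python's standard `urllib.parse`; the exact delimiter handling in the actual implementation may differ:

```python
import re
from urllib.parse import urlsplit

PATH_DELIMS = r"[/.!&,#$%;]"  # punctuation symbols used for the path part

def segment_url(url):
    parts = urlsplit(url)
    words = []
    if parts.username:                      # userinfo, delimited by '//', ':', '@'
        words.append(parts.username)
    if parts.password:
        words.append(parts.password)
    if parts.hostname:                      # hostname: split by '.'
        words += parts.hostname.split(".")
    words += re.split(PATH_DELIMS, parts.path)   # path: split by all punctuation
    words += re.split(r"[=&]", parts.query)      # query: split by '=' and '&'
    return [w for w in words if w]          # drop empty strings; scheme is omitted

print(segment_url("http://example.org/a/b.html?term=bluebird&source=browser-search"))
# ['example', 'org', 'a', 'b', 'html', 'term', 'bluebird', 'source', 'browser-search']
```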
Because the syntax of URLs is well defined, extracting words can be done very efficiently. However, many meaningless words are also extracted, so before drawing edges between URLs and words, those words should be removed. In the field of natural language processing, it is well known that word frequency follows Zipf's law — frequency decays rapidly with rank (33858). In particular, this pattern describes stop words in English very well. For instance, the most popular stop word 'the' accounts for 7% of all word occurrences in the Brown Corpus of American English (francis79browncorpus), and the second most popular stop word 'of' accounts for 3.5%. We found that the words extracted from URLs show similar statistics (cf. Fig. 2). We therefore remove high-frequency words using the elbow method (ketchen1996application), which selects as the saturation point the point whose perpendicular distance to the line segment connecting the two ends of the frequency curve is the largest — 800 in our data. We remove all words whose frequency values are larger than this point.
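The elbow-based stop-word removal above can be sketched as follows (largest perpendicular distance to the line segment connecting the two ends of the sorted frequency curve); the function names and toy frequencies are illustrative:

```python
import math

def elbow_threshold(frequencies):
    """Pick the elbow of a descending frequency curve: the point with the
    largest perpendicular distance to the line joining the two endpoints."""
    ys = sorted(frequencies, reverse=True)
    x1, y1, x2, y2 = 0, ys[0], len(ys) - 1, ys[-1]
    norm = math.hypot(x2 - x1, y2 - y1)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(ys):
        # distance from (i, y) to the line through (x1, y1)-(x2, y2)
        d = abs((y2 - y1) * i - (x2 - x1) * (y - y1)) / norm
        if d > best_d:
            best_i, best_d = i, d
    return ys[best_i]

def remove_stop_words(word_counts):
    # keep only words whose frequency is not larger than the elbow point
    t = elbow_threshold(list(word_counts.values()))
    return {w: c for w, c in word_counts.items() if c <= t}

print(elbow_threshold([100, 90, 10, 9, 8, 7]))  # 10
```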
Fig. 3 shows the network created by the proposed method. Note that there exists a strong correlation between the cluster structure and the ground-truth phishy/benign labels, which justifies the network-based inference method described shortly. In this regard, the main intuition of our work is that it is hard to evade our natural language processing and network-based approach unless a majority of entities in a cluster are evaded simultaneously.
4.3. Network-based Inference
We employ loopy belief propagation (LBP) (Bishop:2006:PRM:1162264) for our network-based inference. Our key contribution in this step is a more advanced edge potential assignment mechanism than those of the state-of-the-art methods (chau2011polonium; Manadhata2014). Because these methods not only follow a majority vote of neighbors but also assign a fixed edge potential regardless of the similarity of the two connected vertices, a vertex with many benign neighbors is usually classified as benign. However, we want to correctly classify a phishing vertex even if it has many benign neighbors. Therefore, we define a more advanced edge potential assignment mechanism that enables more sophisticated classification and achieves evasion-robustness. We describe our edge potential definition in Section 4.3.1.
LBP is a message passing algorithm for network-based inference problems. Let $x_i$ be a hidden variable, $N(x_i)$ be the set of its neighboring variables, and $y_j$ be an observed variable. In our context, an observed variable is a training sample and a hidden variable is a testing sample. We use $H$ and $O$ to denote the sets of all hidden and observed variables, respectively. Each variable represents the phishy/benign label of an entity in our case. $x_i$ sends a message to another hidden variable $x_j \in N(x_i)$ after collecting all messages from $N(x_i) \setminus \{x_j\}$. Note that observed variables never receive any messages; they only broadcast messages to their neighboring hidden variables. In our case, phishy and benign URLs in the training set are observed variables.
As mentioned, we need to calculate a message $m_{i \to j}(\ell)$ from a variable $x_i$ to another variable $x_j$ regarding a phishy/benign label $\ell \in L$, where $L$ is the set of all possible labels. There exist several message passing strategies: sum-product, max-product, and min-sum. We use the min-sum algorithm, which has better computational stability than the other two: for some high-degree vertices, message values tend to quickly decay to zero (i.e., floating-point underflow) in the sum-product and max-product, whereas their product operation is reduced to a sum in the min-sum algorithm. The message in the min-sum algorithm is calculated as:
$$m_{i \to j}(\ell) = \min_{\ell' \in L} \Big( -\log \phi_i(\ell') + \psi_{i,j}(\ell', \ell) + \sum_{x_k \in N(x_i) \setminus \{x_j\}} m_{k \to i}(\ell') \Big), \quad (1)$$
where $\phi_i(\ell')$ is a prior that the variable $x_i$ has the label $\ell'$ and $\psi_{i,j}(\ell', \ell)$ is an edge potential, expressed as a cost (a negative log joint-probability) that $x_i$'s label is $\ell'$ and $x_j$'s label is $\ell$. Note that there is a log function in the message definition, so the min-sum is equivalent to performing the max-product in the log space for better computational stability.
After exchanging messages many times, we first calculate a cost for each variable and label pair and then, for each variable, choose the label that yields the lowest cost (the min-sum tries to minimize 'cost', as the name 'min' suggests, whereas both the sum-product and max-product maximize 'belief'). The cost when $x_i$ has the label $\ell$ is computed as:
$$c_i(\ell) = -\log \phi_i(\ell) + \sum_{x_k \in N(x_i)} m_{k \to i}(\ell). \quad (2)$$
Then, the formal definition of the problem that the min-sum algorithm solves is:
$$\min_{A} \sum_{x_i \in H} c_i(A(x_i)), \quad (3)$$
where $H$ is the set of hidden variables and $A: H \to L$ is a label assignment function. It is worth mentioning that in our setting, $x_i$ can be a hidden variable representing a URL, domain, IP address, name server, or word. Our final target is to infer the labels of testing URLs. To this end, we need to infer the labels of the other non-URL entities as well because they connect URLs. Therefore, the min-sum algorithm can be described as a process of finding the label assignment to hidden variables that minimizes the sum of the costs.
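As a tiny runnable illustration of the cost computation and label choice above, consider a star network with one hidden URL and three observed neighbors; the priors, the compatibility cost (0 when labels agree, a fixed penalty otherwise), and the entity names are illustrative assumptions, not the tuned values of our method:

```python
import math

LABELS = ["phishy", "benign"]

def edge_cost(lp, l, penalty=0.9, eps=0.001):
    # Illustrative min-sum compatibility: 0 when labels agree,
    # a fixed penalty otherwise (the similarity-based version comes later).
    return 0.0 if lp == l else max(penalty, eps)

def cost(i, graph, priors, messages):
    """Cost of assigning label l to variable i: negative log prior
    plus the sum of incoming messages."""
    return {l: -math.log(priors[i][l]) + sum(messages[(k, i)][l] for k in graph[i])
            for l in LABELS}

# Star network: hidden URL 'u' with three observed neighbors (two phishy, one benign).
graph = {"u": ["n1", "n2", "n3"]}
observed = {"n1": "phishy", "n2": "phishy", "n3": "benign"}
priors = {"u": {"phishy": 0.5, "benign": 0.5}}
# Observed variables only broadcast: their message is simply the edge cost.
messages = {(k, "u"): {l: edge_cost(y, l) for l in LABELS} for k, y in observed.items()}

c = cost("u", graph, priors, messages)
print(min(c, key=c.get))  # phishy (the majority label of the neighbors)
```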
4.3.1. Edge Potential Assignment
The definition of edge potential is the key factor in the LBP method. (chau2011polonium) used the heuristics of homophily and heterophily: for example, they assign an edge potential of $0.5 - \epsilon$ (resp. $0.5 + \epsilon$) if two neighboring variables $x_i$ and $x_j$ have different (resp. the same) labels, as shown in the compatibility matrix in Table 2, where $\epsilon$ is usually set very small, e.g., 0.001. We use two labels, phishy and benign. At the end of the network-based inference process, one label is assigned to each entity as its prediction result. The final label assignments are greatly influenced by the edge potential definition.
In contrast to (chau2011polonium), we incorporate more factors, such as similarity among entities and an improved compatibility matrix, to derive reliable edge potentials — we prove shortly in Section 5 that reliable similarity definitions lead to the evasion-robustness of our method. The similarity can be measured via various embedding approaches, such as Doc2Vec (le2014distributed) and Node2Vec (grover2016node2vec). We discuss how to calculate vector representations of URLs, their domains, IP addresses, authoritative name servers, and words in Section 4.3.2.
To calculate the similarity based on those vector representations, we adopt several similarity measures, including the cosine similarity and various kernels. Our proposed definition of edge potential is shown in Table 2. In the table, we denote vector representations of entities in boldface, and $sim(\mathbf{u}, \mathbf{v})$ indicates a similarity between two vectors that can be defined in various ways. Two such examples are the cosine similarity and the RBF kernel:
$$sim_{\cos}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}, \qquad sim_{\mathrm{rbf}}(\mathbf{u}, \mathbf{v}) = \exp\big(-\gamma \|\mathbf{u} - \mathbf{v}\|^2\big).$$
After that, we use a concept inspired by the hinge loss (Rosasco:2004:LFS:996933.996940) to assign edge potential values. For instance, $\max(1 - sim(\mathbf{u}, \mathbf{v}), \epsilon)$ in the table limits the minimum edge potential to $\epsilon$, a lower bound set by the user, when two entities have the same label. When $sim(\mathbf{u}, \mathbf{v})$ is low (resp. high), the proposed definition imposes a large (resp. small) penalty close to 1 (resp. $\epsilon$). Therefore, the proposed mechanism assigns much more sophisticated edge potentials than existing methods.
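A minimal sketch of this hinge-inspired assignment, assuming a same-label potential of the form max(1 - sim, eps) and a mirrored form for differing labels (an assumption; the full Table 2 is not reproduced in this text):

```python
def edge_potential(sim, same_label, eps=0.001):
    """Hinge-inspired min-sum edge cost (a sketch, not the paper's full
    compatibility matrix). eps is the user-set lower bound.
    - same label: cheap for similar entities, up to ~1 for dissimilar ones
    - different labels: mirrored form (an assumption in this sketch)
    """
    if same_label:
        return max(1.0 - sim, eps)
    return max(sim, eps)

# A dissimilar neighbor with a different label gets only a tiny penalty,
# so a phishing URL can keep its label despite many benign neighbors:
print(edge_potential(sim=0.05, same_label=False))  # 0.05
```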
One should be very careful when applying our compatibility matrix to other applications. Recall that we use the min-sum algorithm, so in our compatibility matrix $\psi$, we assign 0 (which corresponds to 1 in the sum-product and max-product algorithms) when the two labels are the same. For the sum-product and max-product algorithms, $1 - \psi_{i,j}(\ell', \ell)$ should be used.
4.3.2. Vector Representations of Entities
We describe how we can calculate reliable vector representations of various entities. These embedding methods are known to be effective in discovering latent relationships among entities (mikolov2013efficient; le2014distributed; 2014arXiv1403.6652P; grover2016node2vec; yoo2022directed; lee2020asine; lee2020negative; DBLP:conf/iclr/0002JJH0JSP22), which is a good fit to our network-based detection under the presence of evasions.
Word Embedding-based Methods
In the area of natural language processing, various semantic embedding methods have been proposed, such as Word2Vec (mikolov2013efficient) and Doc2Vec (le2014distributed). As mentioned earlier, we segment URLs into words, so we can directly apply these methods to calculate the vector representations of URLs and words. However, we cannot directly calculate vector representations of domains, IP addresses, and name servers in this approach because it considers only strings. Inspired by locally linear embedding (LLE) (Roweis2000), we instead propose a heuristic that represents a domain, IP address, or name server as the mean vector of its neighbors' vectors. LLE states that the vector representation of an entity is a weighted combination of its neighbors' vectors, equally weighted in our case. Given URLs' vector representations calculated by Word2Vec or Doc2Vec, we first calculate the mean vector representations of domains, then IP addresses, and so forth.
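The LLE-inspired heuristic above reduces to an equally weighted mean over neighbor vectors; for example, with hypothetical 4-dimensional URL embeddings:

```python
def mean_vector(vectors):
    """LLE-inspired heuristic: represent a domain (or IP address, or name
    server) as the equally weighted mean of its neighbors' vectors."""
    n = len(vectors)
    return [sum(xs) / n for xs in zip(*vectors)]

# Hypothetical 4-dim URL embeddings (e.g., from Doc2Vec) for two URLs
# sharing a domain; the domain's vector is their mean.
url_vecs = [[1.0, 0.0, 2.0, 0.0], [3.0, 2.0, 0.0, 0.0]]
print(mean_vector(url_vecs))  # [2.0, 1.0, 1.0, 0.0]
```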
Network Embedding-based Methods
Another reliable approach to finding vector representations is to use network embedding methods, many of which have been proposed by social network researchers. One advantage of this approach is that we can find the vector representations of all entities simultaneously because these methods run on our network directly. We use Node2Vec (grover2016node2vec) and DeepWalk (2014arXiv1403.6652P). In Fig. 4 (a), we show a pairwise similarity plot that intuitively justifies our embedding and similarity-based edge potential assignments. However, a small portion of phishy-benign pairs in the green circle have high similarities. This is corrected by our proposed edge potential assignment mechanism, as shown in Fig. 4 (b).
5. Evasion-Robustness of Our Network-based Approach
In this section, we formally prove that a hidden variable’s phishy/benign label follows its similar neighbors’ majority label, which improves the robustness to evasion.
Lemma 1.
Suppose a small network that consists of a hidden variable $x_i$ and its neighbors $N(x_i)$. Let $\ell_k$ be the phishy/benign label of $x_k \in N(x_i)$. When $A(x_i) = \arg\max_{\ell \in L} \sum_{x_k \in N(x_i)} \mathbb{1}(x_k, \ell)$, where $\mathbb{1}(x_k, \ell)$ is an indicator function saying if $x_k$ has the label $\ell$, the min-sum algorithm in Eq. (3) is optimized.
This lemma can be generalized to the following theorem for larger general networks:
Theorem 3.
Given a large network, the min-sum algorithm is optimized if, for each hidden variable $x_i$ and its neighbors $N(x_i)$, $A(x_i) = \arg\max_{\ell \in L} \sum_{x_k \in N(x_i)} \mathbb{1}(x_k, \ell)$.
Proof. If we can achieve $A(x_i) = \arg\max_{\ell \in L} \sum_{x_k \in N(x_i)} \mathbb{1}(x_k, \ell)$ for each hidden variable $x_i$, it is immediate that the overall cost in Eq. (3) is minimized because the overall cost is defined as the sum of each hidden variable's cost. ∎
This theorem gives a sufficient condition for the optimal min-sum solution, but sometimes the condition is not achievable for every hidden variable. In such a case, however, the min-sum strategically drops the condition for some hidden variables to better minimize the sum of costs over the remaining majority of hidden variables. Therefore, the sufficient condition is generally achievable for the majority of hidden variables in any network. In particular, our embedding and hinge-loss based edge potential assignment brings large flexibility to this process, so the cost sum can be effectively minimized with the proposed method. Fig. 4 shows one such example in which our method achieves the sufficient condition in most cases by ignoring some minor edges with high similarity. Because of this property, our approach is robust to evasion unless the attacker collectively evades neighboring URLs, domains, IP addresses, and name servers (see Fig. 6 for an example). However, such collective evasion incurs non-trivial expenses for the attacker.
6. Experiments

In this section, we introduce our detailed experimental environments and results. We collected many URLs from crowd-sourced repositories and other papers. We then conducted experiments with ten baselines, ranging from classical classifiers and graphical methods to graph convolutional networks. Our method shows the best accuracy and robustness.
The source codes, data, and reproducibility information of our method are available at https://github.com/taerikkk/BPE.
| Dataset | Phishy | Benign |
|---|---|---|
| Bank of America | 4,610 | 9,408 |
| Sorio et al. (Sorio2013DetectionOH) | 40,439 | 3,637 |
| Ahmad et al. (ahmad) | 62,231 | 344,800 |
6.1. Datasets

Several phishing URL detection datasets have been created (Ma09beyondblacklists; 35580; Mohammad2014). However, almost none of them releases raw URL strings, so we cannot use them. We found only two open datasets with raw URL strings (Sorio2013DetectionOH; ahmad). In addition, we crawled phishtank.com and collected three sets of URLs recently reported over a couple of months for Bank of America, eBay, and PayPal, the top-3 most popular targets on the website (see Section 7 for more details). Phishtank.com is a crowdsourced repository of suspicious URLs that does not provide ground-truth labels — users can upvote or downvote reported URLs, but the voting system is unreliable because anyone (even attackers) can participate. We therefore used virustotal.com to tag the collected URLs. This website returns the prediction results of over 60 anti-virus (AV) products for a given URL. We selected the seven most reliable and popular AV products (such as McAfee, Norton, Kaspersky, Avast, and Trend Micro), and a URL is considered phishy if more than half of them indicate so, i.e., tagging by majority vote. Finally, we merged these datasets into one large URL dataset whose statistics are shown in Table 3. In total, we have about 500K URLs, 172K domains, and 66K IP addresses.
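The majority-vote tagging over the seven selected AV products can be sketched as follows; the boolean verdict encoding is an assumption:

```python
def tag_url(av_verdicts):
    """Label a URL phishy if more than half of the selected AV products
    say so. av_verdicts: list of booleans, True meaning 'phishy'."""
    return "phishy" if sum(av_verdicts) > len(av_verdicts) / 2 else "benign"

print(tag_url([True, True, True, True, False, False, False]))  # phishy (4 of 7)
print(tag_url([True, True, False, False, False, False, False]))  # benign (2 of 7)
```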
We split the combined set in the standard 80:20 ratio for training and testing. Only 10% of the URLs have timestamps; with them, we also tried a chronological split. Our method shows good accuracy in this configuration as well, but we do not include the results because i) they are similar to those of the random split, ii) the data size is small, and iii) space is limited.
6.2. Baselines and Hyperparameters
Among the many methods proposed, we consider the following baselines in our experiments. First, we test many feature-based prediction models. For this, we surveyed the literature and collected 19 features (see Appendix A). We then predict with various classifiers after under/oversampling to address the imbalanced nature of our dataset — benign URLs far outnumber phishing URLs in the training set. In addition to synthetic minority oversampling (DBLP:journals/corr/abs-1106-1813) and adaptive synthetic sampling (He08adasyn:adaptive), we consider the five undersampling methods, six oversampling methods, and one ensemble method below.
Naive random undersampling randomly chooses samples to drop.
Tomek's link is a representative undersampling method.
Clustering uses the centroids of clusters after dropping the other cluster members.
NearMiss is also popular for undersampling.
Various nearest neighbor methods can also be used for undersampling.
Naive random oversampling randomly chooses samples to duplicate.
SMOTE (DBLP:journals/corr/abs-1106-1813) and its variants are a family of the most popular oversampling methods, which include five variations.
ADASYN (He08adasyn:adaptive) is also popular for oversampling.
The ensemble method uses both oversampling and undersampling at the same time.
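As a concrete sketch of the simplest option above, naive random undersampling can be implemented in a few lines; real experiments would typically rely on a dedicated imbalanced-learning library:

```python
import random

def random_undersample(samples, labels, seed=0):
    """Naive random undersampling: keep only as many samples per class
    as the minority class has, chosen at random."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    n_min = min(len(v) for v in by_label.values())
    out = []
    for y, group in by_label.items():
        for s in rng.sample(group, n_min):  # n_min random samples per class
            out.append((s, y))
    return out

# 5 benign vs 2 phishy samples -> 2 of each after undersampling
data = list(range(7))
labels = ["benign"] * 5 + ["phishy"] * 2
balanced = random_undersample(data, labels)
print(len(balanced))  # 4
```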
We refer to a survey paper (JMLR:v18:16-365) for more detailed information. The combinations of classifiers, under/oversampling methods, and their hyperparameters create a huge number of possible options. So we first perform 5-fold cross validation to choose the best performing classifier/sampling method and its hyperparameters. Second, we test the three deep learning-based sequence classification methods mentioned in Section 2. These neural networks are based on recurrent or convolutional layers. We use the hyperparameters recommended in their original publications.
Third, on the simple network that consists only of URLs and their words, we run the following graphical methods: i) Random Walk with Restart (RWR): this method runs many random walks from training URLs and counts the number of visits to each testing URL; it has been very successful in recommender systems (Cooper:2014:RWR:2567948.2579244). ii) Polonium (POL): Polonium, based on a simple belief propagation strategy, showed great success in predicting malware and malicious domains; we run belief propagation on our network with Polonium's compatibility matrix definition in Table 2. iii) Belief Propagation with Enhancements (BPE): this is our method, which runs belief propagation with our improved definition of the compatibility matrix. We test various embedding techniques and, for calculating the vector similarity, both the cosine similarity and the RBF kernel. We set the dimension of the embeddings to 128.
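The two vector similarities used for the edge potentials can be sketched as follows; the RBF bandwidth `gamma` below is an illustrative choice, not our exact setting:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rbf_similarity(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2), in (0, 1]."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.exp(-gamma * np.sum((u - v) ** 2)))
```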
Fourth, on the extended network that consists of all entities (cf. Section 4.2), we test the same set of graphical methods: RWR, POL, and BPE. For this, we use the blacklist of 41,881 IP addresses and 158,271 domains provided by virustotal.com; these blacklisted entities are converted into observed variables and excluded from the inference process. We also test BPE on the noisy network where stop words are not removed. Last, we test state-of-the-art graph convolutional networks (GCNs), such as LGCN (Gao:2018:LLG:3219819.3219947) and GAT (velickovic2018graph), on the extended network. For each vertex, we feed a feature vector concatenating i) the 19 features used in the feature-based prediction, ii) a binary value denoting whether the vertex is blacklisted, and iii) a one-hot vector where only the index of the vertex is one. If some items are missing, we fill them with zeros, e.g., a domain does not have the 19 features, so we zero them out. We test the hyperparameters recommended in their original papers. To prevent overfitting, we also add an L2 regularization on the neural network weights. In all these graphical models (RWR, POL, GCNs, and our method, i.e., BPE), the labels of training URLs are fixed and only the unknown labels of testing URLs are inferred.
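The per-vertex input to the GCNs can be sketched as follows; the function name and the dense one-hot representation are illustrative simplifications:

```python
import numpy as np

def vertex_feature(lexical, blacklisted, vertex_idx, n_vertices, n_lexical=19):
    """Concatenate i) the 19 lexical features (zeros for non-URL vertices),
    ii) a blacklist indicator bit, and iii) a one-hot vertex identity."""
    lex = np.zeros(n_lexical) if lexical is None else np.asarray(lexical, dtype=float)
    onehot = np.zeros(n_vertices)
    onehot[vertex_idx] = 1.0
    return np.concatenate([lex, [float(blacklisted)], onehot])
```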
We exclude content-based detection methods from our experiments because it is hard to obtain web page contents in general: recall that phishing attacks do not last long, and attackers usually clean up their traces from the Internet after accomplishing their goal. The two datasets we downloaded from (Sorio2013DetectionOH; ahmad) do not include any content information, and we also could not collect web page contents in HTML from phishtank.com in a stable manner.
We conducted our experiments on machines with an Intel i9-9900K CPU, 64GB RAM, and a GTX 1070 GPU.
As our experiments utilize many different types of baseline methods, our software environments are rather complicated. The selected list of important software/libraries is as follows:
Python ver 3.8.1.
Scikit Learn ver 0.22.1.
TensorFlow ver 1.5.1.
CUDA ver 10.
NetworkX ver 2.4.
6.4. Experimental Results
We summarize the results shown in Table 4 as follows. Among all feature-based methods, RandomForest performs the best. For all metrics, it outperforms AdaBoost, SGDClassifier, and others, e.g., the F-1 score of 0.840 for RandomForest vs. 0.830 for AdaBoost vs. 0.734 for SGDClassifier. However, all these feature-based baseline methods are clearly beaten by the network-based methods. This supports the efficacy of our network-based approach.
RWR's precision for the phishy class on the extended network is the best (0.930), but its recall is worse than that of the other network-based inference methods. POL shows a balanced performance between recall and precision, as in its original task of detecting malware. LGCN's recall for the phishy class is the best (0.999). We found that LGCN and GAT are sensitive to hyperparameters and hard to regularize against overfitting. Surprisingly, their best F-1 was achieved when we allowed some degree of overfitting to the phishy class; when we increase the coefficient of the L2 regularizer to prevent overfitting, their F-1 scores drastically decrease. We also found that training with subgraphs is not effective for processing our large network. We therefore set the subgraph size as large as possible on our GPU (due to the GPU memory limitation, whole-graph training is impossible for our network), but its performance is still inferior to our method.
Our method with the RBF kernel and DeepWalk, marked as 'BPE (RBF, DeepWalk)', shows the best F-1 and accuracy. Although BPE's precision for the phishy class (0.832) is a little lower than that of the best feature-based method, RandomForest (0.850), BPE's recall for the phishy class (0.958) is much higher than that of RandomForest (0.840). One may worry that our method misclassifies benign URLs as phishy due to its relatively low precision. To address this, we measure the false positive rate (FPR) of BPE and RandomForest, obtaining 0.031 for BPE and 0.306 for RandomForest. Therefore, we expect BPE to be the most useful for accurately detecting phishing URLs in practice.
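The FPR above is FP / (FP + TN), i.e., the fraction of benign URLs misclassified as phishy, which can be computed as:

```python
def false_positive_rate(y_true, y_pred, positive=1):
    """FPR = FP / (FP + TN): fraction of negatives predicted as positive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return fp / (fp + tn)
```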
The same network-based method on the noisy network shows poor performance (e.g., an F-1 of 0.01), which shows that our network definition also plays an important role.
For statistical significance, we conduct paired t-tests with a 95% confidence level between BPE and each baseline, and obtain a p-value less than 0.05 in all cases.
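A minimal sketch of the paired t statistic over matched score lists (in practice, `scipy.stats.ttest_rel` returns the p-value directly):

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired t-test between two equal-length score lists."""
    d = [x - y for x, y in zip(a, b)]       # per-pair differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```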
Transductive vs. Inductive
Transductive and inductive inference are two popular paradigms of machine learning (Vapnik:1995:NSL:211359). Among all the baseline methods, RandomForest and some other classifiers are inductive, while many network-based methods are transductive. In many cases, people rely on inductive inference, where a generalized prediction model trained on a training set predicts unknown testing samples. In our work, however, we adopt a transductive method, where the class label of a specific unknown testing sample is inferred from related training samples in the network. Fig. 3 justifies our transductive approach because a cluster usually consists of vertices from the same class. However, not all transductive methods are successful in Table 4.
BPE is an advanced LBP-based method with our novel similarity-based edge potential assignments. Therefore, the time complexity of BPE is O(S · |E| · k), where S is the cost of one similarity calculation, |E| is the number of edges, and k is the number of iterations required for convergence; k is typically small in our setting. The time complexity of RandomForest (i.e., the best feature-based method) is O(F · N log N), where F is the number of features and N is the number of URLs. In our experiments, the training (wall-clock) time of BPE is 4.7 times faster than that of RandomForest.
6.5. Parameter Sensitivity
Sensitivity to thresholds
The following threshold combinations perform very well and are comparable to each other in our experiments: (0.7, 0.7), (0.3, 0.9), (0.3, 0.5), (0.7, 0.9), (0.5, 0.3), and so on. One common characteristic is that the two extreme values, 0 and 1, are not preferred. This supports our decision to adopt thresholds, because two dissimilar neighbors do not always mean that their labels should differ; in other words, the one you are not close to is not necessarily your enemy. By limiting the penalty, we achieved the best accuracy in our experiments.
Sensitivity to embedding
It turns out that network embeddings are more effective than word or document embedding methods. All highly ranked results are produced by DeepWalk; Doc2Vec produces the best result only for the simple network with the RBF kernel. We think this is because our network definition considers common words among URLs, and DeepWalk is able to capture the semantics of words closely located in the network.
Cosine similarity vs. RBF kernel
The cosine similarity and the RBF kernel are comparable to each other in our experiments: when sorting all results, the highly ranked ones are evenly distributed between the two.
6.6. Evasion Tests
For our evasion testing, we consider all possible variations of the parts of phishing URLs, i.e., domain, path, and query. Specifically, we define seven evasion methods (M1-7) as follows: M1) the phishing URL's domain is changed to another random benign domain (and, as a result, the IP address changes too); M2) the phishing URL's path string (cf. Section 4.2.1) is changed to another random benign one; M3) the phishing URL's query string (cf. Section 4.2.1) is changed to another random benign one; M4) the phishing URL's domain and path string are changed to other random benign ones; M5) the phishing URL's domain and query string are changed to other random benign ones; M6) the phishing URL's path and query strings are changed to other random benign ones; M7) each part of the phishing URL is independently changed to another random benign one, i.e., the phishing URL becomes an entirely new URL that looks benign.
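These evasion methods can be simulated as part replacements, e.g., as in the sketch below; the URL decomposition via urllib is a simplification of our segmentation in Section 4.2.1:

```python
import random
from urllib.parse import urlsplit, urlunsplit

def evade(phish_url, benign_urls, parts, seed=0):
    """Replace the chosen parts ('domain', 'path', 'query') of a phishing URL
    with the corresponding parts of random benign URLs, each drawn independently."""
    rng = random.Random(seed)
    s = urlsplit(phish_url)
    netloc, path, query = s.netloc, s.path, s.query
    if "domain" in parts:
        netloc = urlsplit(rng.choice(benign_urls)).netloc
    if "path" in parts:
        path = urlsplit(rng.choice(benign_urls)).path
    if "query" in parts:
        query = urlsplit(rng.choice(benign_urls)).query
    return urlunsplit((s.scheme, netloc, path, query, s.fragment))

# M1 replaces only the domain; M7 replaces every part independently:
# evade(url, benign, parts=("domain",))
# evade(url, benign, parts=("domain", "path", "query"))
```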
Note that our evasion tests embrace Shirazi et al.'s evasion settings (cf. Section 3). Also note that M7 is the most challenging setting. As mentioned earlier, the attackers' motivation for M7 evasion may be low because it incurs non-trivial expenses. Nevertheless, it is worth noting that we also consider the case where the domain, path, and query strings are all evaded simultaneously.
Some spear phishing attacks aiming at particular targets, however, prepare sophisticated URLs with entirely benign string patterns and web page contents, in which case more advanced detection techniques are required. It is well known that attackers invest large efforts in spear phishing, considering even the psychological and habitual characteristics of the targets after hijacking benign user accounts (Oliveira:2017:DSP:3025453.3025831; Lin:2019:SSE:3349608.3336141; Ho:2019:DCL:3361338.3361427). However, this is out of the scope of this paper, and we leave it as future work.
To simulate evasions, we modify a random 5-15% of our testing phishing URLs using one of the seven evasion methods and reconstruct the network accordingly. We compare BPE (our method) with POL and RandomForest (RF), which represent the network-based and the best feature-based baselines, respectively. Because all entities are connected in our network and one evasion may, in the worst case, affect other neighboring non-evaded URLs, simply counting the number of successful detections for the evaded phishing URLs is not a correct metric. We therefore re-evaluate all testing URLs after evasion and report the results in Tables 5 and 6.
As shown in Tables 5 and 6, our BPE outperforms the other baselines by non-trivial margins. In particular, BPE outperforms RandomForest by up to 13.29% in the most challenging setting, i.e., M7 with an evasion ratio of 15%. Moreover, although M7 independently changes every part of a phishing URL to a benign one, BPE still shows relatively high F-1 scores (0.803-0.861). This is because the benign parts newly connected to a phishing URL come from different benign URLs and thus are unlikely to have high similarity scores, so the phishing URL has low similarity to each newly connected vertex. Therefore, BPE with our similarity-based edge potentials will not predict this phishing URL as benign. On the other hand, POL, which uses a majority vote of neighbors, shows low F-1 scores (0.762-0.827) under M7 evasion.
Furthermore, we found that BPE under various evasion settings outperforms most baselines under non-evasion settings. Specifically, except for M1 with an evasion ratio of 15% and M4 (resp. M7) with evasion ratios of 10% and 15%, the minimum F-1 score of BPE across the evasion settings is 0.847 (i.e., M1 with an evasion ratio of 10%), which surpasses that of the best baseline under non-evasion settings, i.e., 0.840 for RandomForest. Another important fact is that evasion incurs additional costs for the attacker. To make a domain whitelisted, for instance, the attacker must pay hosting fees and maintain the domain for a considerable amount of time without any attack campaigns, or must compromise other benign web servers. Some attackers do this and switch to phishing web pages on D-Day to launch a phishing attack (apwg). Even against such attacker efforts, our experiments show that our method is good at detecting these evasion cases.
Evasion case study
Fig. 7 shows eight 2-hop ego networks for a phishing URL randomly selected for our evasion settings. The first shows the original network connection in our dataset: the target URL (the largest red vertex) is connected to a phishy domain and phishy words, in which case it is straightforward to classify the target URL as phishy. In the other seven networks, however, the target URL is connected to a benign domain or/and word(s). Even after these evasions, our method correctly infers that the target URL is still phishy, whereas POL and RandomForest fail to detect all the evasion cases. Our method is equipped with a sophisticated edge potential assignment mechanism, whereas POL does not consider edge potentials. Our theoretical analyses in Section 5 also support the robust nature of our method.
We also present other visualizations with real prediction results. Fig. 8 shows three visualizations including our method's and RandomForest's predictions. To emphasize their differences, we choose some important domain/IP/word vertices from our network and show their URL neighbors (rather than the full network). In Fig. 8 (a), we observe a strong pattern that the ground-truth label follows the network connectivity in many cases. Sometimes red (phishy) and blue (benign) vertices are mixed in a cluster, but this is mainly because we find the clusters in the sub-network only. Our method in Fig. 8 (b) complies with the network connectivity better than RandomForest in Fig. 8 (c). To evade our method, therefore, the majority of URLs in the same cluster would have to be evaded at the same time, which burdens the attacker with non-trivial costs (see our evasion cost discussion in Section 3).
7. Data Crawling
To collect as many phishing URL samples as possible, we monitored phishtank.com for a couple of months while searching for other researchers' available data. There are several online datasets; many of them were released by Ma et al., who published several papers on phishing URL detection (Ma:2009:ISU:1553374.1553462; Ma09beyondblacklists). However, their data does not include raw string patterns. We contacted them, but they replied that they cannot share the raw data. Mohammad et al. also released their data at https://archive.ics.uci.edu/ml/datasets/Phishing+Websites, but they too do not release the raw data used in their research (DBLP:conf/icitst/MohammadTM12; Mohammad2014). As mentioned earlier, we need the string patterns of phishing URLs, so we could not utilize any of the above data.
Therefore, we programmed a web crawler using an automated web browser library and collected all the URLs reported for Bank of America, eBay, and PayPal. To retrieve additional information from virustotal.com, we received an academic license for their APIs and collected the information listed in the main paper. The academic license was active for three months, which was more than enough for us to retrieve all the needed information.
8. Conclusions & Future Work
Although many (machine learning) methods have been proposed to detect phishing URLs, it had been overlooked that attackers can use evasion techniques to neutralize them. In this paper, we tackled the significant problem of detecting phishing URLs after evasion. After segmenting URLs into words and creating a heterogeneous network of cross-related entities, we performed belief propagation equipped with our customized edge potential mechanism, which is our main contribution. Furthermore, we showed that our design is theoretically robust to evasion. We collected recent URLs and downloaded two other datasets for extensive experiments. Our experiments with about 500K URLs verify that our method is the most effective in detecting phishing URLs and more robust to evasion than all baselines. Besides, we expect that our method can be easily applied to similar network-based problems (e.g., detecting fake accounts in social networks and email spam) as long as they can be represented as classification on graphs.
In the future, we will study a robust detection method combining string and content information. Some evasion techniques are hard to counter with string-based detection methods alone; however, collecting web page contents requires non-trivial effort. Therefore, we think hybrid methods will be the most useful for real-world applications.
The work of Sang-Wook Kim was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155586) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1A5A7059549). The work of Noseong Park was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University)).
Appendix A Baseline Lexical, Host, and Domain Features
We did an extensive literature survey and collected 19 features from the papers mentioned in our related work section. The complete list of the features we used in our experiments (sorted by the feature importance extracted from RandomForest, the best performing feature-based classifier in our experiments) is as follows:
Kullback-Leibler (KL) divergence
The Kullback-Leibler (KL) divergence is a popular metric to measure the difference between two probability distributions. We calculate the KL divergence between the character distribution of a URL and that of the English language. The reference character distribution for English is obtained from (characterFreq).
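A sketch of this feature, assuming a reference letter-frequency table `english_freq` (the actual table comes from (characterFreq)); characters absent from the reference are ignored in this simplification:

```python
import math
from collections import Counter

def kl_divergence_to_english(url, english_freq):
    """KL divergence from the URL's character distribution to a reference
    English letter distribution. english_freq maps letters to probabilities."""
    chars = [c for c in url.lower() if c in english_freq]
    counts = Counter(chars)
    total = len(chars)
    kl = 0.0
    for c, n in counts.items():
        p = n / total
        kl += p * math.log(p / english_freq[c])
    return kl
```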
Entropy of URL
It was found that the string entropy of a URL is an important feature: many phishing URLs contain random text, so their entropies tend to be higher than those of benign URLs.
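This feature can be sketched as the Shannon entropy of the URL's character distribution:

```python
import math
from collections import Counter

def url_entropy(url):
    """Shannon entropy (bits per character) of a URL string; random-looking
    phishing URLs tend to score higher than readable benign ones."""
    counts = Counter(url)
    n = len(url)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```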
Digit/Letter Ratio in the whole URL
The ratio of digits to letters in the whole URL is also important.
Top-level domain numbers in path
Attackers often try to impersonate legitimate websites by adding multiple top-level domains in the path of a URL. If the count of top-level domains in the path exceeds one, then it is likely to be phishy.
The number of dashes in path
This counts the occurrences of '-' in the path. Many dashes indicate phishing URLs.
Presence in a blacklist
A blacklist contains a set of malicious domains and IP addresses. If a URL has such a domain or an IP address, it can be immediately predicted as phishy. However, blacklists are often incomplete, missing many malicious domains and IP addresses. In general, we do not use a whitelist because attackers sometimes compromise whitelisted servers to implant their phishing web pages, so we cannot always trust whitelisted entries.
Length of URL
Attackers may use long URLs to mask the phishy appearance of phishing URLs. The length of URLs plays an important role in distinguishing phishing URLs from benign ones. We use the same length standards as (DBLP:conf/icitst/MohammadTM12).
Presence of digits in domain
Benign URLs rarely have digits in their domains, while the presence of digits in the domain is a common characteristic of phishing URLs. We set this feature to true if any digit appears in the domain name part of the URL.
Frequency of suspicious words
We keep track of the frequencies of suspicious and most common words occurring in URLs. We choose several suspicious words like ‘confirm’, ‘account’, ‘signin’, ‘update’, ‘logon’, ‘cmd’, and ‘admin’. These words are selected after surveying the literature and real-world datasets including ours.
The number of sub-domains
(DBLP:conf/icitst/MohammadTM12) stated criteria for classifying a URL as phishy based on the count of its sub-domains: if a URL's resource name part has more than three dots, it is likely to be phishy. An example of such a URL is 'http://www.outlook.3uwin.com'.
Brand name modifications with ‘-’
We downloaded the top-1000 most visited websites from Alexa and used them as popular brand names. Phishing URLs create similar names with prefixes or suffixes. For example, ‘microsoft-x.com’ and ‘x-microsoft.com’ are phishing URLs.
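A sketch of this check, assuming `brands` holds the Alexa top-1000 names; inspecting only the first domain label is a simplification:

```python
def brand_modification(domain, brands):
    """True if the first domain label is a known brand with a '-'-joined
    prefix or suffix, e.g. 'microsoft-x.com' or 'x-microsoft.com'."""
    label = domain.split(".")[0]  # simplification: first label only
    for b in brands:
        if label != b and (label.startswith(b + "-") or label.endswith("-" + b)):
            return True
    return False
```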
Very long hostname
A very long hostname typically indicates phishyness. If the length of the hostname exceeds 22 characters, we consider it phishy.
Prefix or suffix separated by ‘-’ to domain
It is well known that phishing URLs tend to add prefixes or suffixes separated by ‘-’ to their domain to lure users into believing that the website is legitimate. For instance, an attacker may use Amazon’s domain separated by a prefix as ‘http://www.hello-amazon.com’.
Frequency of punctuation symbols
We count the occurrences of symbols such as '.', '!', '&', ',', '#', '$', and '%'; (Verma:2015:CPU:2699026.2699115) observed a high percentage of punctuation symbols in phishing URLs.
The number of ‘:’ in hostname
The number of ':' in the hostname part also implies phishyness; in particular, it is used for port number manipulation.
Using Internet Protocol (IP) address
Usage of IP addresses in place of domain names usually indicates fraudulent websites.
Vowel/Consonant ratio in hostname
This feature calculates the ratio of vowels to consonants in the hostname part. Phishing URLs often deviate from the standard ratio.
Very short hostname
If the hostname is very short (e.g., fewer than five characters), it is an indicator of phishyness.
Existence of ‘@’ symbol
Attackers can use the '@' symbol to trick users by exploiting the property that browsers ignore everything before '@' in the address bar. For example, an attacker can use a URL such as 'http://www.google.com@atc.com', which causes the browser to ignore 'www.google.com' and proceed to 'atc.com'.
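Python's standard urllib illustrates this behavior: everything before '@' in the authority component is treated as user information, and the real host is what follows it:

```python
from urllib.parse import urlsplit

def real_hostname(url):
    """The host actually contacted; any 'user info' before '@' in the
    authority (often a lookalike benign domain) is discarded."""
    return urlsplit(url).hostname

def has_at_trick(url):
    """True if the URL's authority component contains an '@' symbol."""
    return "@" in urlsplit(url).netloc
```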