Cognitive Triaging of Phishing Attacks

05/06/2019 · Amber van der Heijden, et al. · TU Eindhoven

In this paper we employ quantitative measurements of cognitive vulnerability triggers in phishing emails to predict the degree of success of an attack. To achieve this we rely on the cognitive psychology literature and develop an automated and fully quantitative method based on machine learning and econometrics to construct a triaging mechanism built around the cognitive features of a phishing email; we showcase our approach relying on data from the anti-phishing division of a large financial organization in Europe. Our evaluation shows empirically that an effective triaging mechanism for phishing success can be put in place by response teams to effectively prioritize remediation efforts (e.g. domain takedowns), by first acting on those attacks that are more likely to collect high response rates from potential victims.


1 Introduction

Phishing attacks represent a significant threat to organizations and their customers [36]. The problem of phishing detection has been addressed multiple times in the literature [16, 18, 32], yet classification is only part of the issue. A timely and efficient reaction to phishing attempts (e.g. performing takedown actions on phishing domains, blacklisting, or notifying customers) could save hundreds or thousands of customers from fraud or theft, and the associated costs for all involved stakeholders. For this reason, most ‘large enough’ organizations operate a phishing-response team whose task is to promptly investigate potential impacts, identify rogue domains and attack vectors, and act to contain or neutralize the attack [8]. The size of this effort often requires the full-time operation of several experts within the response team [46].

Unfortunately, these teams currently lack an objective and quantitative way of prioritizing response activities, which can lead to large inefficiencies in the response process. Technical mechanisms are often in place to quantify the success of a phishing attack a posteriori, but these are technically limited to attacks ‘in scope’ of the measuring mechanism (e.g. evaluating the requests for internal resources received by the organization’s servers and originating from remote domains) and, importantly, cannot predict how successful the attack is likely to be if no immediate mitigation is put in place.

Key to predicting phishing success is the likelihood that a human will comply with whatever instruction is in the phishing email. Cialdini pioneered the definition of ‘principles of influence’, namely Reciprocity, Consistency, Social Proof, Authority, Liking, and Scarcity, as ‘cognitive triggers’ that, once engaged, can greatly impact the likelihood of a human’s decision to comply with what he or she is being requested to do [6]. These principles have been used as a theoretical framework to investigate persuasion in different domains, such as sales and marketing [7], organizational behaviour [41], and wellbeing [51], as well as being linked to phishing effectiveness [53, 54] in (synthetic) experimental settings [55]; however, no means to automatically measure the cognitive features of a phishing email, and estimate their relation to phishing success ‘in the wild’, currently exists.

In this paper we employ techniques from natural language processing and econometrics to build a method and estimation process to measure cognitive triggers in phishing emails, and to build a cognitive triaging model of how successful an attack can be expected to be. We demonstrate empirically that the resulting estimations can be used to efficiently prioritize phishing response actions, by addressing first the (few) attacks that are likely to be highly successful. To do this, we extensively analyze more than eighty thousand phishing emails received by the anti-phishing division of a very large European financial organization, quantify the ‘cognitive vulnerability triggers’ embedded in the attacks, and relate them to the number of accesses to the remote phishing domain that the anti-phishing division measured. This allows us to empirically derive a triaging model that, based only on the cognitive features of an incoming phishing email, can predict how many ‘clicks’ it can be expected to generate.

Scope and contribution of this work.

With this work we aim at building a principled analysis that explains why one can expect a certain phishing email to be successful, as opposed to building a method that ‘blindly’ maps mail bodies to attack success. Importantly, with this work we do not aim to build a classifier to distinguish phishing from non-phishing emails; instead, we propose a method to predict to what extent a known phishing attack can be expected to lure users into falling for it. Our contributions can be summarized as follows:

  • we provide the first empirical analysis of cognitive vulnerabilities as exploited in the wild by attackers launching phishing attacks;

  • we employ a robust measurement methodology to identify cognitive vulnerability triggers in phishing emails, using supervised Latent Dirichlet Allocation, and a set of bootstrapped econometric simulations to build robust estimations of model coefficients and predictions;

  • we show empirically the correlation of exploited cognitive factors and spoofed From: addresses with an objective measure of phishing success;

  • we quantitatively show that triaging phishing emails to prioritize remediation action is possible and effective in an operational setting.

This paper proceeds as follows: Section 2 sets the background for this work in both the cognitive psychology and information security literature; Section 3 details the employed data and methodology, and Section 4 reports the exploratory and cognitive analysis of the data. The cognitive model and predictions are presented in Section 5. Section 6 provides a discussion of our results, and Section 7 concludes the paper.

2 Background and Related Work

The general objective of a phishing attack is to convince a target to comply with a request, such as clicking a link to a phishing domain, downloading malware, or providing personal credentials. The effectiveness of these attacks significantly relies on how quickly the message can generate the desired response [55]. Moreover, both cognitive [56, 54] and technical [27, 30, 42] features are employed to lure users into falling for the phish and are known to be relevant to explain phishing effectiveness.

2.1 Cognitive characterizations

Believability. Phishers apply several techniques to increase believability of their phishing messages. For example, they may craft their phishing messages to resemble communications of the impersonated organizations as closely as possible [56]. This is commonly done by duplicating the look and feel of these communications by including logos and other branded graphics extracted from their legitimate counterparts, and by adopting a formal writing style [15]. Furthermore, the context of phishing messages is generally highly personalized to appeal to the targeted population [55]. These practices are enhanced by more technical measures, such as spoofing of the phishing source address, and the use of shortened URLs to hide the destination of the embedded phishing link [24].

Persuasiveness. Persuasiveness is associated with the text content of the email. These techniques work by exploiting fundamental vulnerabilities of human cognition [31] that can be explained by ‘shortcuts’ in human cognitive processes that determine decisions on the basis of previous experiences, biases, or beliefs [48]. Despite the clear benefits of these mental shortcuts, they can also result in irrational decision-making [52]. Cialdini [6] identified several principles that explain how these mental shortcuts can be exploited for the persuasion of others (e.g. for marketing purposes). Indeed, these principles are applied regularly in multiple domains, including marketing (e.g. to purchase a product or solution) [7], organizational behaviour (e.g. to comply with policies) [41], and health and wellbeing (e.g. to adopt healthy lifestyles) [51]. As these are foundational to human decision-making processes [33], these principles may not be effectively applied to distinguish legitimate from illegitimate resources (e.g. a website, email, or conversation): any activity aiming at ‘influencing’ one’s behaviour (be it through spam or organization policies, phishing or advertisement) will employ some variation of these principles. On the other hand, they provide a solid foundation to evaluate how effective an attempt at convincing a human can be expected to be. Table 1 provides definitions and examples of these principles.
Principle Definition [6] Phishing text example 2
Reciprocity Tendency to feel obliged to repay favours from others. “I do something for you, you do something for me.” “While we work hard to keep our network secure, we’re asking you to help us keep your account safe.”
Consistency Tendency to behave in a way consistent with past decisions and behaviours. After committing to a certain view, company or product, people will act in accordance with those commitments. “You agreed to the terms and conditions before using our service, so we ask you to stop all activities that violate them. Click here to unflag your account for suspension.”
Social Proof Tendency to reference the behaviour of others, by using the majority behaviour to guide their own actions. “We are introducing new security features to our services. All customers must get their accounts verified again.”
Authority Tendency to obey people in authoritative positions, following from the possibility of punishment for not complying with the authoritative requests. “Best regards, Executive Vice President of <company name>”
Liking Preference for saying “yes” to the requests of people they know and like. People are programmed to like others who like them back and who are similar to them. “We care for our customers and their online security. Confirm your identity … so we can continue protecting you.”
Scarcity Tendency to assign more value to items and opportunities when their availability is limited, so as not to waste the opportunity. “If your account information is not updated within 48 hours then your ability to access your account will be restricted.”
Table 1: Definitions and examples of Cialdini’s principles of influence in phishing emails


Cialdini’s principles of persuasion are strongly related to the success of face-to-face social engineering efforts in the real world [44] as well. Akbar [2] performed a quantitative analysis of 207 unique phishing emails to identify the application of Cialdini’s persuasion principles in phishing emails. The results show the Authority, Scarcity, and Liking principles to be the most popular. A similar study was performed by Ferreira et al. [12], who found the Liking principle to be the most popularly used, followed distantly by the principles of Scarcity and Authority. The differences can be explained by the different experimental settings and application domains. Several other studies [55, 3, 13] have addressed the prevalence and efficacy of Cialdini’s principles in phishing attacks. Others have evaluated phishing campaigns against specific users [23], discussing some of the techniques used by phishers to lure their victims. Unlike these works, we integrate quantitative measures of cognitive attacks and measures of phishing success to predict attack effectiveness in operational settings.

2.2 Phishing effectiveness

Previous work considered the inclusion of forged quality marks, images, and logos from trusted organizations, as well as other signals of credibility, as means to increase the effectiveness of a phishing attack [9]. Other more technical measures are employed to enhance the credibility of phishing as well, for example spoofing of the source email address, adoption of HTTPS instead of HTTP to convince the user the webpage is ‘safe’ [36], or cloning of the original webpage. Several works have considered such visual similarities between phishing landing pages and their legitimate counterparts based on different features, including DOM tree structures [42], CSS styling [27, 30], content signatures [1, 17], and pixel and/or image properties [5, 10]. Whereas these technical features constitute additional relevant information for the identification of a phishing attack, in this study we focus on the cognitive attacks embedded in an email text (as opposed to the visual clues included in a landing webpage) that affect human decision making. Additionally, a number of user studies have been conducted on the impact of client-side detection-assistance tools [57, 20] and on how people evaluate phishing web pages [9]. Various phishing detection mechanisms have been proposed based on technical features such as signatures of user email behaviour [49], email-header properties [16], impersonation limitations of attackers [28], search engine rankings [25], and botnet effects [35]. Additionally, [29] presents a set of research guidelines for the design and evaluation of such detection systems. These works have predominantly focused on the detection of phishing domains and emails by means of technical traces, in order to prevent phishing attacks from happening in the first place. Unlike these studies, we focus on evaluating the potential of those attacks that, despite the countermeasures in place, make it through and must be timely addressed.

On the cognitive-side we can consider the impact of user demographics. Oliveira et al. [34] found age to be an important feature, finding younger adults to be more susceptible to Scarcity, whereas older adults were more susceptible to Reciprocity. Other results of this study indicate the relevance of gender by finding older women to be most susceptible of all of the studied user groups. Furthermore, Wash and Cooper [53] demonstrated the impact of message presentation by showing how phishing training methods based on giving facts-and-advice were more effective when presented by an expert figure (Authority), whereas methods based on personal stories benefited more from presentation by people perceived as similar to the user (Liking). In the context of social media, user activity, consumption behaviour, and clicking norms in the social network were found to be important factors for phishing success [40]. As opposed to focusing on the characteristics of the individuals that receive the phishing (as this information for the population of customers is generally unknown to organizations, or may be impossible to collect due to legal and ethical challenges), in this work we consider the expected aggregate responses of the phishing recipients as a function of the phishing emails.

3 Methodology and Data collection

Our analysis relies on a unique dataset from a large phishing email database provided by Org, a large financial organization in Europe with more than 8 million customers and a multi-billion Euro turnover. Org customers that suspect they have received a phishing email in their personal email accounts are instructed by the organization to forward these emails to an internal Org functional mailbox. In parallel, Org’s phishing response team runs a service to detect phishing domains (not necessarily linked with the received phishing emails) by means of internal heuristics, limited to external domains requesting resources internal to Org (e.g. images, forms, logos, CSS files/javascript, etc.). This data is generated by a third party service hired by Org that monitors all requests generated towards Org’s resources. Through this mechanism Org can detect the number of visits to the detected domains by accounting for the unique sessions opened between the (rogue) external and the (legitimate) internal services. Access to this data allows us to perform a rich analysis of the arrival of phishing emails and their characteristics, and to evaluate how often users have accessed malicious domains as a proxy measure of ‘phishing success’. Figure 1 depicts Org’s internal process to handle suspect phishing emails.

SOC Operators collect evidence on the maliciousness of the web domain under investigation such that an external party can perform the notice and take-down requests for the malicious domains.
Figure 1: Overview of phishing-related activities at Org 

Overall, we extracted 115,698 reported emails and 11,936 alerts for malicious links between February 1, 2018 and December 15, 2018, with the exception of the period August-September 2018 due to infrastructural limitations at Org. For this same reason, our sample only includes data for ‘clicks’ collected from the end of July onwards.

Data limitations and ethical aspects.

From the data structure, the link between a clicked URL and the specific email from which that click originated is not explicit and can only be reconstructed by exact match of the destination URL. This has the effect of limiting the scope of this study to the comparison of the effectiveness of cognitive influence techniques between phishing emails that are likely to have generated the click (as we cannot fully reproduce the process generating the detection of URLs that could have been clicked, but have not). This also limits the number of matches between URLs reported in event alerts and URLs linked in emails. Further, the results of this work are limited to the emails that have been reported (and therefore identified at least once) by Org’s customers. Despite the large number of active reporting customers, particularly well-crafted emails may not be represented in our dataset. Further, we can only observe data captured by the User Session Monitoring System, i.e. related to emails pointing to domains that ‘call back’ to Org’s systems. This may represent a limitation if emails that do not ‘call back’ also exploit different ‘cognitive vulnerabilities’, or with different distributions. However, an analysis on the available data does not show apparent biases between emails for which a ‘click’ has been recorded, and those for which we do not know of any (ref. Figure 10). These limitations are akin to those outlined by Pitsillidis et al. [37]. Aware of these, we compensate by means of the analysis methodology that explicitly accounts for the potential biases in the data. Finally, the collected data did not contain sensitive subject information and all data handling has been performed within allowance from Org and within the scope of work previously approved by the department IRB.

3.1 Data sanitization and processing

As our email dataset contains messages forwarded by users, we first sanitize the data by removing mobile text messages that likely result from erroneous forwards to the functional mailbox from a related banking service; as they are irrelevant in our setting, we discarded them. Further, users may have reported emails that target financial organizations other than Org. To capture this, we identify the targeted organizations in our dataset by a string search within email bodies for the names of the most prominent financial organizations in the country where Org is located, and remove all records that do not belong to Org. To identify phishing email subjects, dates, and recipient/sender information, we recursively searched through each raw email message to find header matches of the first original email that arrived in the user’s inbox (this is necessary as emails can be forwarded multiple times, e.g. if originally forwarded by the customer to an Org employee, before ending up in the phishing inbox), and extracted the From, To, Date, and Subject values. Table 2 reports summary statistics of the final dataset.

We notice that the upper 2.5% of the distribution of email length is disproportionally long w.r.t. the remainder of the distribution, suggesting a few outliers in the data. Manual inspection reveals malformed email corpora (e.g. with HTML tags embedded in the body); as no obvious ‘upper limit’ for email length is apparent, we keep these in the dataset for the sake of transparency.

The column ‘type’ indicates whether the variable is a factor (f) or numeric (n). The column ‘n’ reports the number of levels for factors, and the number of records with at least one observation for numerical variables. We do not report summary statistics for factors. The standard deviation for the variable Date is reported in days. All dates are in 2018 and in MM-DD format.
Feb-Jul 2018 Oct-Dec 2018
Variable type n Min 0.025q Median 0.975q Max n Min 0.025q Median 0.975q Max
Language f 3 2
To f 38760 2239
From f 1641 330
Date n 69800 02-02 03-07 05-30 07-28 07-31 11458 10-01 10-01 11-29 12-11 12-11
Length n 69800 160 446 1068 3973 67246 11458 173 329 1320 5480 15685
Vuln. triggers Reciprocity n 69800 0 0 3 67 149 11458 0 0 2 37 153
Consistency n 69800 0 0 13 84 132 11458 0 0 19 88 176
Social Proof n 69800 0 0 2 17 52 11458 0 0 0 17 90
Authority n 69800 0 0 5 55 121 11458 0 0 5 27 83
Liking n 69800 0 0 0 7 504 11458 0 0 0 8 198
Scarcity n 69800 0 1 40 107 157 11458 0 0 11 91 189
Spoof dist. n 61911 0 0 7 14 23 10604 0 0 6 14 24
Clicks n 4 9 9 28.5 78 78 35 1 1 37 220 220
Emails Reported f 69800 11458
   of which susp. f   61079   9419
Unique f 1293 424
   of which susp. f   952   329
Table 2: Descriptive statistics of the collected dataset

3.1.1 Identification of suspicious and landing URLs

Suspicious URLs. We check emails for the presence of suspicious URLs that point to any domain that does not belong to Org, as these would not normally appear in a legitimate email originated by the organization. We exclude from the heuristic general-purpose domains with no direct phishing correlation (e.g. youtube.com). Based on this classification we flag emails that contain at least one suspicious URL as Suspicious, whereas the remaining ones are considered uninteresting within our scope (as we can neither count nor estimate clicks for URLs that do not exist).
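As an illustration, a minimal sketch of this flagging heuristic is given below. It assumes email bodies are available as plain text; ORG_DOMAINS and BENIGN_DOMAINS are hypothetical placeholders for the organization’s own domains and the general-purpose whitelist, which are not disclosed in this paper.

```python
import re
from urllib.parse import urlparse

# Hypothetical whitelists: Org's own domains and general-purpose domains
# with no direct phishing correlation (e.g. youtube.com).
ORG_DOMAINS = {"org.example"}
BENIGN_DOMAINS = {"youtube.com", "google.com"}

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def suspicious_urls(email_body: str) -> list[str]:
    """Return URLs pointing to domains that belong neither to Org
    nor to the general-purpose whitelist."""
    flagged = []
    for url in URL_RE.findall(email_body):
        domain = urlparse(url).netloc.lower().split(":")[0]
        if domain and domain not in ORG_DOMAINS and domain not in BENIGN_DOMAINS:
            flagged.append(url)
    return flagged

def is_suspicious_email(email_body: str) -> bool:
    # An email is flagged as Suspicious if it contains at least one suspicious URL.
    return len(suspicious_urls(email_body)) > 0
```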

Landing URLs. These are landing URLs that load resources internal to Org, as detected and reported by the User Session Monitoring System (ref. Fig.1). Whereas they are related to a click on a suspicious URL, this relation is not immediate in the data and needs to be reconstructed.

3.1.2 Landing URL extraction

To reconstruct the association between Landing URLs and Suspicious URLs we adopt the following method (a code sketch is given after the list):

  1. First, we traverse the suspicious URL embedded in the phishing email (suspiciousURL) multiple times by visiting all URLs arriving to Org’s inbox. These typically generate a number of redirections (generally HTTP 3xx) that lead to a landing webpage, where the actual phishing resource is located. We record the association suspiciousURL → landingURL for all visited URLs, and for all emails; if the redirection mechanism is not deterministic, we obtain a one-to-many association between a suspiciousURL and a set of landingURLs. As we cannot know how many ‘redirection chains’ exist from a single suspiciousURL, we traverse the URL opportunistically every time it appears in Org’s inbox. To minimize confounding effects in the redirection, each visit session is independent from the previous one. Figure 2 shows that the number of different redirections stops growing quickly regardless of how many times we traverse a given URL, suggesting that the dataset of collected landingURLs does not suffer from systematic censoring problems.

    Figure 2: Redirection count (left) and density ratio (right) from observed suspiciousURLs.

  2. When landingURL is visited, a third party contractor of Org records a ‘click’ for landingURL (see Fig 1 and discussion in data limitations), and reports it to Org.

  3. We link clicked landingURLs with the original email body by matching them with the landingURLs we found by traversing the suspiciousURLs in the mail corpus; if there are multiple clicked landingURLs for a single suspiciousURL, we keep record of all matches.

  4. To aggregate clicks to a single suspiciousURL, we considered: average, sum, and max no. of clicks across all landingURLs for a given suspiciousURL. We ran our experiments using all aggregation strategies, and obtained qualitatively identical results. In this paper we report average clicks as it is the most conservative choice to make (e.g. summing landingURL clicks is more susceptible to over-reporting multiple clicks by the same user).
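A minimal sketch of steps 1, 3, and 4 above is given below, assuming the requests library is available; url_to_landing and clicks_per_landing are hypothetical data structures standing in for the traversal records and the click reports produced by the User Session Monitoring System.

```python
import statistics
import requests

def resolve_landing_urls(suspicious_url: str, visits: int = 3) -> set[str]:
    """Step 1: opportunistically traverse the redirection chain of a suspiciousURL.
    Repeated, independent visits capture non-deterministic redirections."""
    landing = set()
    for _ in range(visits):
        try:
            with requests.Session() as session:          # fresh session per visit
                resp = session.get(suspicious_url, timeout=10, allow_redirects=True)
                landing.add(resp.url)                    # final URL after the HTTP 3xx chain
        except requests.RequestException:
            continue
    return landing

def aggregate_clicks(suspicious_url: str,
                     url_to_landing: dict[str, set[str]],
                     clicks_per_landing: dict[str, int]) -> float:
    """Steps 3-4: match clicked landingURLs back to the suspiciousURL and
    aggregate clicks; the average is the most conservative aggregation."""
    matched = [clicks_per_landing[lu]
               for lu in url_to_landing.get(suspicious_url, set())
               if lu in clicks_per_landing]
    return statistics.mean(matched) if matched else 0.0
```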

Figure 3 provides a bird’s eye view of the data generation process for the landing URL extraction: users may click on the suspiciousURL link in the email and, usually through a series of redirections, reach the phishing domain hosted at one of the landingURLs. Through the dynamics described in Figure 1, an association between each distinct landingURL and the recorded number of clicks is reported. The same email can also be reported to Org’s phishing inbox. When it arrives, we opportunistically traverse the redirection chain and record the association between the original email and the final landingURL(s). To reconstruct the association between suspiciousURLs and clicked landingURLs, we aggregate the two datasets.

Figure 3: Data generation process for matched URLs and click data aggregation

3.1.3 Duplicate detection

One complexity of our unstructured dataset is the possible occurrence of multiple duplicates of the same suspect phishing email. In this paper we consider ‘similar’ emails received by users over long periods of time as belonging to the same ‘campaign’ (this grouping is based only on the email text, and we use the term to group together emails that are likely to have a common denominator, e.g. a phishing tool, a specific market/phishing pool, or an actual attacker). Although the overall textual content of these duplicate emails is similar, they can still contain slight differences, for instance because of the presence of a recipient’s name in the salutation of an email or other minor syntactic features. In order to detect, and subsequently remove, as many of these duplicate emails as possible, we used a fuzzy string matching approach to determine the pairwise similarity of the emails in our dataset. We employ a bag-of-words model to calculate, for each document, the frequency of each unique word in the document, and build the word-by-document matrix of term frequency values for all emails in our dataset. As an additional pre-processing step, all input was cleaned by removing special characters, URLs, email addresses, and line breaks from the text. We normalize the term frequencies to limit the impact of differences in email lengths [47].

To evaluate email similarity we employ a measure of cosine similarity. This measure expresses the similarity between two vectors in terms of the cosine of the angle between them; the evaluation results in a score between 0 and 1, where 0 constitutes low similarity and 1 constitutes high similarity. To define the cutoff threshold for similar emails we manually marked 300 randomly sampled emails from the dataset and assigned them to ‘similarity IDs’ to track which emails were replicas of which others. We then performed a bootstrapped sensitivity analysis of the threshold level to determine the optimal level for the cutoff. This procedure tunes the categorization to very satisfactory sensitivity and specificity levels, both higher than 90%. Full details on the procedure and results are reported in the Appendix.
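A minimal sketch of this duplicate-detection step is given below, assuming scikit-learn is available; the 0.9 cutoff is only illustrative, since the actual threshold was tuned through the bootstrapped sensitivity analysis described above.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

def clean(text: str) -> str:
    # Pre-processing: strip URLs, email addresses, special characters, line breaks.
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)
    return re.sub(r"[^\w\s]", " ", text.replace("\n", " ")).lower()

def duplicate_pairs(emails: list[str], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs of emails whose cosine similarity exceeds the cutoff."""
    tf = CountVectorizer().fit_transform(clean(e) for e in emails)  # bag-of-words counts
    tf = normalize(tf)               # length-normalised term frequencies
    sim = cosine_similarity(tf)
    return [(i, j) for i in range(len(emails))
            for j in range(i + 1, len(emails)) if sim[i, j] >= threshold]
```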

The duplicate detection procedure identifies 1,293 and 424 unique emails in the data collections of Feb-Jul 2018 and Oct-Dec 2018, respectively (ref. Table 2). Of these, 952 and 329 respectively are classified as ‘suspicious’ (note that otherwise identical emails may lead to different phishing domains).

3.2 Cognitive evaluation

We perform LLDA using Gibbs sampling iterations for parameter estimation and inference.
Macro (sd) Micro (sd)
Sensitivity 0.709 (0.016) 0.807 (0.016)
Specificity 0.714 (0.042) 0.813 (0.038)
Precision 0.718 (0.025) 0.755 (0.024)
F1 0.725 (0.020) 0.760 (0.020)
Table 3: Topic model performance results

To identify the presence of cognitive vulnerabilities in email bodies and the intensity of the employed cognitive attacks, we construct a supervised topic model based on Labeled LDA (LLDA) [39]. LLDA models each input document as a mixture of topics inferred from labeled input data and outputs probabilistic estimates of the label-document distributions, as well as word counts of label-specific triggers for each input document. In our application the labels correspond to Cialdini’s principles of influence, detailed in Table 1, whereas the documents correspond to the emails.

For model training, we randomly sampled 99 emails (38 with clicks and 61 suspicious) out of the set of unique and suspicious emails in the dataset, and manually labelled them for the presence of cognitive vulnerabilities. (To have an indication of the effect of sample size on model performance, we first ran the training on 70 emails and added 29 more (+40%) at a second time, obtaining virtually identical results. To rule out sampling issues, we also performed a cross-validation procedure, reported below, which suggested stable results. Finally, manual checks on a random sample of predicted labels found no obvious miscategorization.) Due to language restrictions, we adopted a mixed approach whereby one author performed the labelling on the original data, and the second author blindly re-performed the labelling on an automatically-translated random sample (20 emails) of the labelled data. To assess model performance we performed a 5 times repeated 5-fold cross validation over the data. Numerous approaches exist to evaluate the performance of multilabel classification problems like ours. Following [43], we consider our problem as a label-pivoted binary classification problem, where the aim is to generate for each label strict yes/no predictions based on the document ranking for that label. For each label, we sort on the per-document prediction values, and use the PROPORTIONAL method [43, 14] to define a rank-cutoff value k_l that determines the top ranked items that will receive a positive prediction. For each label l, we set k_l equal to the expected number of positive predictions based on training-data frequencies: k_l = (N_test / N_train) · n_l, where N_train and N_test refer to the total number of training and testing documents and n_l is the number of training documents assigned label l.
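A minimal sketch of this PROPORTIONAL rank-cutoff, using illustrative variable names for the per-document scores produced by the topic model:

```python
def proportional_cutoff(n_train_with_label: int, n_train: int, n_test: int) -> int:
    """k_l: expected number of positive test predictions for a label,
    scaled from its training-set frequency."""
    return round(n_test * n_train_with_label / n_train)

def positive_predictions(scores_per_doc: dict[str, float], k: int) -> set[str]:
    """Give a positive prediction to the k test documents ranking highest for this label."""
    ranked = sorted(scores_per_doc, key=scores_per_doc.get, reverse=True)
    return set(ranked[:k])

# Example: a label assigned to 12 of 79 training documents, evaluated on 20 test documents.
k = proportional_cutoff(n_train_with_label=12, n_train=79, n_test=20)  # -> 3
```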

We have aggregated the performance results of our topic model using the PROPORTIONAL rank-cutoff method in Table 3. Unlike other rank-cutoff methods, this approach relies solely on labeling information from the training set, which makes it appropriate for use in real-world production settings as well. We report both macro scores (averages computed over each result of the cross-validation procedure), and micro scores (computed over the aggregate of all cross-validation results). The obtained scores indicate a satisfactory fit over both projections. A manual analysis on randomly sampled emails confirms that the procedure appropriately assigns ‘topics’ to emails. The final model is trained on the complete set of 99 labeled training documents that were previously used in cross-validation, and then applied to the unseen and unlabeled remainder of the full dataset. Standard text cleaning procedures have been applied for removal of special characters and stop-words, sentence tokenization, and word stemming.

In this paper we refer to the ‘topics’ assigned by LLDA to an email as the cognitive vulnerabilities exploited in that text, and to the words associated with that topic and present in the text as the vulnerability triggers for that cognitive vulnerability. With this we aim at distinguishing the presence of a cognitive attack from its intensity in the email text.
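The sketch below illustrates, on a hypothetical per-email output of the labelling step, how the presence of a cognitive vulnerability (how many principles are triggered at all) is separated from its intensity (how many trigger words are matched per principle):

```python
PRINCIPLES = ["Reciprocity", "Consistency", "Social Proof",
              "Authority", "Liking", "Scarcity"]

def summarize_email(triggers_per_principle: dict[str, list[str]]) -> dict:
    """Summarise an email as vulnerability presence vs. trigger intensity."""
    counts = {p: len(triggers_per_principle.get(p, [])) for p in PRINCIPLES}
    return {
        "n_vulnerabilities": sum(1 for c in counts.values() if c > 0),  # presence
        "trigger_counts": counts,                                       # intensity
    }

# Hypothetical output of the LLDA labelling step for one email:
example = {"Scarcity": ["after", "charge"], "Consistency": ["update"]}
print(summarize_email(example))  # 2 vulnerabilities; 2 Scarcity triggers, 1 Consistency trigger
```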

Example of training results

We report below an example of a phishing email (translated to English) and its association with different cognitive vulnerabilities. We have indicated the relevant vulnerability triggers in italics and refer to (1) Liking, (2) Consistency, (3) Authority, (4) Social Proof, (5) Reciprocity and (6) Scarcity:

  • (1) As a valued customer of Org we always want to inform you of the latest updates and innovations in our system. We have recently switched to a new system that requires (4) all current customers to replace their (2) current debit cards by our newly-produced ones.

    In connection with the new changes to the (3) European Safety Regulations, Org wishes to alert all its customers to the availability of the new and improved debit cards that adhere to all (3) environmental and safety regulations.

    (1) Org strives to be environmentally friendly. Therefore, our service team will recycle all current debit cards by mounting your (2) current AES Encryption Chip on your renewed biological RFID payment card. For this reason, all current payment cards must be replaced. (5) By participating in our recycling program, the new debit card can be requested free of charge. (6) After October 19th, 2018, a direct debit will be charged.

From the example we can observe that the different cognitive vulnerabilities often appear alongside each other, and that a single vulnerability can even occur multiple times within an email body. Table 4 reports an excerpt of the classification results for the above message, and the learned keywords (translated into English) for each topic (as the original text is not in English, to provide an accurate translation we report keyword matches for an example).

Reciprocity Consistency Social Proof
Word p Word p Word p
free 0.024 update 0.026 all 0.035
participate 0.016 improve 0.024 customer 0.011
program 0.011 recycle 0.018 current 0.005
request 0.010 renew 0.015 require 0.004
Authority Liking Scarcity
Word p Word p Word p
safety 0.017 valued 0.022 after 0.031
regulate 0.013 friendly 0.012 charge 0.027
european 0.010 strive 0.008 direct 0.020
must 0.007 environment 0.005 debit 0.019
Table 4: Example of extracted keywords for each topic

4 Exploratory analysis

In this section we provide an exploratory analysis of the obtained email data set reported in Table 2.

We first look at the times at which suspicious emails arrive in victims’ inboxes. Figure 4 reports the CDF of email arrivals to Org’s phishing inbox. We observe a steady arrival rate from April through the first cutoff date in July 2018, suggesting that email arrival is approximately constant and uniformly distributed in time. As for the time of day of arrival (not depicted here for brevity), we observe that few suspicious emails arrive in users’ inboxes during the weekend, with most phishing activity happening during working days. This may suggest a strategic aspect of these campaigns aimed at increasing the credibility of the email source. Along the same line, we find that most emails arrive between 9am and 5pm (business hours), with the largest share arriving between 9am and 11am. Interestingly, these findings are all in line with the optimal email send days and times for newsletters as reported by analyses from multiple popular online email marketing services [26, 4, 45, 38], indicating that attackers may follow similar strategies.

Figure 4: Arrival of notified emails to Org’s inbox

4.1 Spoofing and victimization

Figure 5 depicts the distribution of suspicious and non-suspicious reported emails per reporting address. The CDF is on a log scale to better represent the distribution’s long tail. The vast majority of users report only one email, and almost all report fewer than 10 emails. This suggests that the distribution of phishing emails is uniform across victims, as is generally the case with untargeted phishing attacks [23, 36]. Only 122 addresses out of about 40 thousand report more than 10 emails, and only nine report more than 100 emails.

Figure 5: CDF of emails reported by victim addresses

Figure 6 reports the distribution of spoofed and non-spoofed From: domains for reported emails with and without a suspicious URL in the body. An email is classified as spoofed based on the Levenshtein distance between the (spoofed) From: domain the original attack was sent from and the actual domain name of the organization. This captures exact string matches as well as small variations that may remain undetected by the user [50]. We find attacks employing a range of domains resembling Org’s: from less similar (e.g. org-safety.com, org-customersupport.com) to more closely spoofed domain variations (e.g. theorg.com, 0rg.com). We observe a clear differentiation: emails with no suspicious URL are approximately as likely to have a spoofed From: address as a non-spoofed one. On the other hand, emails with suspicious URLs are more likely to be delivered from non-spoofed than from spoofed addresses, as can be observed from the areas under the two curves. This is compatible with a model of a relatively unsophisticated attacker. Here it is also relevant to consider that the pool of ‘spoofed’ addresses is much smaller than the pool of ‘non-spoofed’ addresses (as there are many fewer viable choices similar to Org than otherwise), suggesting that as spoofed domains get blacklisted, attackers may be forced to move to less well-spoofed From: addresses.

Figure 6: CDF of spoofed and non-spoofed From: domains
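A minimal sketch of this spoofing classification, with a plain dynamic-programming edit distance; ‘org.example’ and the distance threshold are illustrative placeholders, not the values used by Org.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def spoof_distance(from_domain: str, org_domain: str = "org.example") -> int:
    return levenshtein(from_domain.lower(), org_domain.lower())

def is_spoofed(from_domain: str, max_distance: int = 5) -> bool:
    # Small distances capture exact matches as well as slight variations
    # (e.g. "0rg.example"); the threshold here is illustrative.
    return spoof_distance(from_domain) <= max_distance
```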

4.2 Phishing campaigns

Figure 7 reports a visualization of the pair-wise similarity scores between emails received during the observation period; dark red indicates high similarity above the threshold (for details on the identification of similar emails see the Appendix). For visualization purposes we plot, for each week, a random sample of 10% of the emails received in that week. We do not observe specific and systematic cycles of campaigns emerging with repeating patterns across several weeks. This also suggests that any sufficiently long observation period (in the order of 3-4 weeks) may suffice to collect a diverse set of attacks for analysis. A first look suggests that some attacks re-appear after a few weeks in slightly different forms, perhaps to increase the chances of passing updated spam filters (see for example emails from week 21 reappearing slightly modified in week 26, or those from week 18 reappearing in week 24).

Figure 7: Pair-wise cosine similarity between email samples

To evaluate this, Figure 8 reports the distribution of time intervals between suspicious emails that likely belong to the same campaign. Most campaigns are relatively long, with approximately 50% of similar emails arriving more than 120 days apart, and 25% arriving more than 150 days apart, with a relatively long left tail. From the distribution it appears that single-day campaigns are relatively common, whereas long campaigns extend for more than 100 days. Mid-range campaigns lasting between 2 and 100 days are comparatively few, suggesting that attacks either are extremely quick and disappear the next day, or last for long periods.

Figure 8: Duration of a phishing campaign

SINGLE-DAY campaigns last up to one day; SHORT campaigns up to 100 days; LONG campaigns more than 100 days. Most phishing campaigns are either very short (one day) or long, with only a handful lasting more than one day but less than 100.
Phishing samples (#reported emails) Campaign duration (days)
Type n Min 1stQ Mean Med 3rdQ Max sd Min 1stQ Mean Med 3rdQ Max sd
SING. 10 1 1.0 1.3 1.0 1.0 3 0.7 0.0 0.0 0.1 0.0 0.0 1.0 0.3
SHORT 4 2 2.0 36.0 3.0 37.0 136 66.7 18.1 18.2 53.3 52.1 87.2 90.8 40.6
LONG 24 46 86.2 783.4 226.5 929.5 4827 1207.5 116.1 145.2 150.9 150.6 164.2 175.6 17.3
Table 5: Descriptive statistics of duration and intensity of phishing campaigns

Table 5 reports summary statistics of suspected phishing campaigns. We identify 38 distinct campaigns lasting on average 150 days (approximately 5 months), and up to 175 days, in the observation period.

To investigate how address spoofing evolves during campaigns, Figure 9 reports the weekly average similarity between the domain of the attacker’s From: address and the domain of the victim organization (measured as their Levenshtein distance) for LONG campaigns. Lower scores indicate more closely spoofed domains. We observe an average increase in dissimilarity between spoofed From: addresses and the organization domain, which suggests an overall deterioration of a phishing campaign as it progresses or is replicated by phishers. This is in line with the intuition that spoofed domains are limited in number, and attackers may therefore run out of options as domains get blacklisted while the campaign progresses.

As phishing campaigns progress, the spoofed From: domains appear to become more dissimilar w.r.t. the original domain.
Figure 9: Average weekly decrease in similarity between spoofed domains and name of target organization

4.3 Cognitive effects

Figure 10 reports the distribution of triggered cognitive vulnerabilities in each unique email (left) and the corresponding vulnerability triggers identified in the corpus (right). We observe a clear relation between the two plots: the most common vulnerabilities and triggers in emails appear to be linked to the Consistency and Scarcity vulnerabilities, regardless of whether a ‘click’ has been recorded for that link or not. Liking and Social proof triggers appear to be particularly rare on average, with most emails targeting none. (The descriptive statistics reported in Table 2 also suggest stable distributions between the collection periods; for Liking we observe more extreme values in the upper tail of the Feb-Jul data collection; this is caused by the outliers in the email corpora for which we measure disproportionate email lengths.) This is consistent with the intuition that in one-shot interactions (as opposed to prolonged or repeated exchanges as in spear-phishing attacks [23]) cognitive attacks linked to the target’s social context and personal preferences (ref. Table 1) are rare. By contrast, exploiting Consistency may only require reference to previous actions that the group of potential victims will likely have performed, such as buying an insurance or receiving a debit card from the organization. Authority appears to be a relatively common trigger in our sample, albeit not for all emails. Common triggers here refer to European and national-level legislation and often come together with the threat of a punishment if certain actions are not completed. Overall, we find that few cognitive triggers are present in the median email, suggesting that the median reported attack may not be highly effective, whereas a few emails embed more ‘intense’ cognitive attacks.

Figure 10: Distribution of triggered cognitive vulns. (left), and of vuln. triggers (right) for emails

Effect of cognitive vulnerabilities on phishing success.

To evaluate the effect of the cognitive features of the email(s) embedding the ‘clicked’ URL links, we first report in Figure 11 the distribution of average clicks generated by emails for which at least one click has been recorded (n = 38).

Figure 11: Histogram distribution of clicks per email

Most emails generate fewer than 150 clicks, with two emails generating more than 200 clicks. Figure 12 displays the relation between triggered cognitive vulnerabilities and generated clicks, for which we observe a clear positive relation. (A possibility is that some emails may be distributed to substantially more users than others, generating greater aggregate click counts. As we have no access to the victims’ inboxes, we cannot directly measure this. However, the data does not show specific biases in the likelihood of users reporting emails (Figure 5), suggesting that major skews are not realistic. This is consistent with previous findings in the literature [58, 36]. Further, due to the very low click-through rates of spam and phishing campaigns [19], this difference should be of several orders of magnitude to have a visible effect, as opposed to being undetectable noise in the data generation process. Regardless, in the Appendix we build a data generation model to evaluate the effect this bias would have on the data if present; our analysis finds no evidence of it.) Following common practice [21], to avoid dispersion we here only consider URLs clicked at least ten times, removing six emails. A simple Poisson regression of clicks on the number of exploited cognitive vulnerabilities reveals a strong positive correlation between the variables. This suggests that the more cognitive vulnerabilities are exploited in an email body, the more that email can be expected to generate compliant user behaviour, even without considering the type of cognitive attack or its intensity.

We observe a clear relation between the presence of exploited cognitive vulnerabilities and the clicks generated by the embedded URL(s). The points in the plot are slightly shifted to avoid overlaps; this is only presentational.
Figure 12: Relation between number of cognitive vulnerabilities in an email and average clicks
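A minimal sketch of such a regression, assuming statsmodels is available; the click counts and vulnerability counts below are made-up illustrative values, not the paper’s data.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-email data: average clicks and number of distinct
# cognitive vulnerabilities identified by the topic model.
clicks  = np.array([12, 45, 30, 78, 150, 22, 60, 95])
n_vulns = np.array([ 1,  3,  2,  4,   6,  1,  3,  5])

# Poisson GLM: log E[clicks] = b0 + b1 * n_vulns
X = sm.add_constant(n_vulns)
fit = sm.GLM(clicks, X, family=sm.families.Poisson()).fit()
print(fit.summary())
print(np.exp(fit.params[1]))  # multiplicative change in expected clicks per extra vulnerability
```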

Effect of vulnerability triggers.

We now consider the relation between the intensity of each cognitive attack (i.e. measured by the presence of vulnerability triggers) and the measured ‘success’ of the phishing email. Figure 13 reports the results. The data shows a clear positive relation of the Consistency and Scarcity vulnerability triggers with the expected (log) number of clicks. Reciprocity shows a negative relationship. Additionally, Social proof, Liking and Authority show no evident effect, as the majority of emails have relatively small counts of associated vulnerability triggers (see also Figure 10). On the other hand, looking at the right extreme of the scale, the few available data points are always related to highly-clicked emails; this may indicate that triggering these vulnerabilities (in this application domain) may be particularly difficult, for example as decisions related to personal finance may have a smaller attached ‘social’ component, or as adding additional ‘authoritative’ effects in the banking domain may be challenging for an attacker.

The data shows the effect of different cognitive vulnerability triggers on the expected number of clicks. Consistency and Scarcity have a clear positive association with the expected number of clicks they generate. Social proof, Authority and Liking do not show any evident trend. Interestingly, we find that Reciprocity appears to be counterproductive.
Figure 13: Correlation between vulnerability triggers and observed clicks

Effect of spoofing distance.

Apart from the cognitive vulnerabilities exploited in the text, a second relevant factor could be the similarity between the From: address displayed to a user and Org’s legitimate one. Figure 14 reports the relation between Levenshtein distance of the spoofed From: domain and the expected number of clicks.

We identify a negative relation between the dissimilarity of the spoofed From: domain in an email from the original one, and the expected number of clicks the email entices.
Figure 14: Relation between spoofing dissimilarity and average clicks

We find an inverse relation between the two variables, suggesting that the greater the dissimilarity between the spoofed and the original domain, the lower the average number of generated clicks. This suggests that both the cognitive attacks and the degree of spoofing in an email may have an effect on the relative success of a phishing email, and both could be considered to build a triaging model for phishing emails.

5 Modelling phishing success

We now evaluate the relative impact of each cognitive variable in the collected dataset. We estimate coefficients for a Poisson process of the (aggregate) form:

log E[clicks_i] = β0 + Σ_j βj · triggers_{i,j} + βs · spoofdist_i + ε_i        (1)

whereby, for each email i, clicks_i represents the number of measured clicks, triggers_{i,j} is the count of vulnerability triggers of type j identified in the email body, spoofdist_i indicates the degree of (dis-)similarity between the spoofed From: address and the original Org domain, and ε_i is the error term. To monitor and account for overfitting problems related to the few available datapoints, we combine a step analysis of each model (M1..M7) with regression bootstrapping to generate robust confidence intervals for the coefficient estimations. For model selection we report coefficients, 95% confidence intervals, residual deviance, and the adjusted McFadden pseudo-R², to reduce the statistical bias in the performance metrics for model selection. (Importantly, with this procedure we do not aim at identifying a definitive model and coefficients to forecast phishing success: regardless of the amount of observations in the dataset, that would not be possible because the ‘click generation process’ generating the observations necessarily varies from domain to domain (e.g. finance vs health), from organization to organization (e.g. national vs international), and from customer base to customer base (e.g. sensibility of the application domain). Therefore, coefficient estimations from this type of model cannot be ‘plug-and-play’ across organizations and domains and will require tuning before being applied in-house.) Results are reported in Table 6.

All model coefficient estimations are relatively stable across the seven models. Coefficients for the Poisson models are presented with 95% confidence intervals in parentheses. Social proof and the Spoof distance of From: addresses appear to have the largest effects on the predicted number of clicks. Higher spoof distances (i.e. higher dissimilarity between the From: domain and the original domain) result in a lower number of expected clicks. We only report coefficient significance for the reader’s reference; due to the relatively small sample size, coefficient estimations should only be interpreted relative to each other, as opposed to in absolute terms. Model power w.r.t. the baseline model is reported by the adjusted McFadden pseudo-R²; an ANOVA test is employed for model comparison. Standard model checks do not reveal issues or biases in the model fit.
M1 M2 M3 M4 M5 M6 M7
(Intercept) 4.38 3.89 3.79 3.63 3.37 3.37 4.22
(4.33, 4.42) (3.81, 3.97) (3.71, 3.87) (3.54, 3.73) (3.17, 3.44) (3.23, 3.51) (4.02, 4.42)
Reciprocity -0.02 -0.01 -0.02 -0.02 -0.02 -0.02 -0.02
(-0.02, -0.02) (-0.02, -0.01) (-0.03, -0.02) (-0.02, -0.01) (-0.02, -0.01) (-0.02, -0.01) (-0.02, -0.01)
Consistency 0.02 0.02 0.02 0.03 0.03 0.01
(0.02, 0.02) (0.02, 0.02) (0.02, 0.02) (0.02, 0.03) (0.02, 0.03) ( 0.01, 0.02)
Social proof 0.14 0.11 0.04 0.04 0.10
(0.11, 0.16) (0.08, 0.14) (0.01, 0.08) (0.01, 0.07) (0.06, 0.13)
Authority 0.01 0.02 0.02 0.00
(0.01, 0.02) (0.02, 0.03) (0.02, 0.02) (0.00, 0.01)
Scarcity 0.02 0.02 0.02
(0.02, 0.03) (0.02, 0.03) (0.01, 0.02)
Liking -0.02 0.04
(-0.04, -0.01) (0.02, 0.06)
Spoof dist. -0.10
(-0.12, -0.08)
Adj. Pseudo-R² 0.09 0.23 0.28 0.30 0.33 0.33 0.41
Resid. deviance 1390 1136 1054 1012 958 951 814
N 38 38 38 38 38 38 38
Table 6: Regression results for Eq. 1

All models have relatively stable coefficient estimations, showing no evident interaction effects between the regressors (correlation matrix presented in Table 9 in the Appendix). Coefficients should be interpreted relative to each other, as opposed to in absolute terms. Because of the relatively small sample size, we refrain from drawing direct conclusions on the model coefficients. For this reason statistical significance is better served by the analysis reported in Figure 13 and is only detailed in Table 6 for the reader’s reference. Within our sample, model coefficients can be interpreted as the relative change in the number of clicks for every additional vulnerability trigger of that type in an email. For example, the M7 coefficient for Scarcity (0.02) indicates an increase of approximately 2% in the number of expected clicks for every new trigger of that category. Likewise, an increase of one point on the Levenshtein distance scale is related to a decrease in clicks of approximately 10% (M7 coefficient -0.10). From a first informal look at the McFadden pseudo-R²s, Reciprocity, Consistency, and Spoof dist. appear to have the strongest effect in increasing the explanatory power of the model. Scarcity appears to contribute modestly, whereas Liking appears to have the smallest effect on the model. The negative effect of Reciprocity shown in Figure 13 is confirmed in the model as well.
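A minimal sketch of this bootstrapped estimation is given below, assuming a pandas DataFrame df with one row per email and illustratively named columns for the per-principle trigger counts, the spoofing distance, and the measured clicks; M7 / Eq. (1) is expressed as a statsmodels formula.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

FORMULA = ("clicks ~ reciprocity + consistency + social_proof + "
           "authority + scarcity + liking + spoof_dist")   # model M7 / Eq. (1)

def bootstrap_coefficients(df: pd.DataFrame, n_boot: int = 1000) -> pd.DataFrame:
    """Refit the Poisson model on resampled data; one row of coefficients per bootstrap draw."""
    rng = np.random.default_rng(42)
    draws = []
    for _ in range(n_boot):
        sample = df.sample(n=len(df), replace=True,
                           random_state=int(rng.integers(1 << 31)))
        fit = smf.glm(FORMULA, data=sample, family=sm.families.Poisson()).fit()
        draws.append(fit.params)
    return pd.DataFrame(draws)

# bootstrap_coefficients(df).quantile([0.025, 0.5, 0.975]) summarises the draws
# in the style of Table 7.
```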

5.1 Cognitive triaging of phishing success

We now extend the model evaluation to estimate the number of clicks generated by other emails for which Org has detected no click (e.g. because no call-back to Org resources has originated from the phishing website, which therefore remains invisible to Org’s detection infrastructure, ref. Fig 1). Recall however that our model estimates are likely subject to overfitting issues due to the inevitably small sample size. This only means that predicted outcomes could be unreliable over arbitrarily diverse email corpora (i.e. not represented in the training data); on the other hand, predictions over emails similar to those provided to the fitted models will not suffer from unmodelled biases and will generate reliable estimations. For this reason we limit our analysis to emails with a distribution of vulnerability triggers within plus or minus one standard deviation from the mean for that trigger in the model’s respective training set.

To choose the model for the prediction we perform a set of ANOVA tests, which indicate that all factors add significant information to the model, albeit Liking only marginally. However, due to the statistical limitations of estimations in our dataset, we also consider a second model that includes only the isolated factors for which we observe a clear effect in Figure 13. Based on these observations we consider two different prediction models (PM), each with different regressors, namely: PM1: Reciprocity, Consistency, Scarcity and Spoofing distance; PM2: all six cognitive vulnerabilities + Spoofing distance (i.e. equal to M7, as suggested by the ANOVA tests). This leaves us with two sets of suspicious emails on which to run the predictions for PM1 and PM2, respectively. To build robust confidence intervals around the estimations, we run a bootstrap simulation.

PM1 PM2
0.025q Med 0.975q 0.025q Med 0.975q
(Intercept) 3.37 4.35 4.90 2.84 4.22 5.17
Recip. -0.05 -0.01 0.00 -0.08 -0.02 0.00
Cons. 0.00 0.01 0.04 0.00 0.01 0.05
Soc.Pr. -0.18 0.10 0.37
Auth -0.03 0.00 0.05
Scar. 0.00 0.02 0.04 -0.03 0.02 0.05
Liking -0.02 0.04 0.05
Sp.dist. -0.17 -0.09 0.03 -0.12 -0.10 0.18
Table 7: Bootstrapped regression coefficients

Table 7 reports the median coefficients and 95% confidence intervals of the estimations. Notice that the estimated coefficients remain largely similar to those of the original models for most coefficients. PM1 shows much tighter confidence intervals for the estimated coefficients in comparison with PM2, suggesting more reliable predictions. Notice also that the distribution of the coefficient estimations in PM1 tends to remain on the same side of zero, again suggesting statistically robust results for this model. This suggests that the exclusion of Liking, Authority and Social proof in PM1 may lead to more realistic estimations.

Estimations are generated from 50,000 simulations run on the bootstrapped model coefficients (Table 7).
PM1
Min. 1st Qu. Median Mean 3rd Qu. Max.
30 48 54 56 62 99
PM2
Min. 1st Qu. Median Mean 3rd Qu. Max.
26 43 50 53 60 129
Table 8: Descriptive statistics of average predicted clicks
Figure 15: Distribution of predicted average clicks

We simulate model predictions for the undetected clicks by randomly sampling model coefficients from the two distributions and report aggregate statistics (Table 8) of the estimated number of generated clicks. Figure 15 reports the results. The simulation results indicate that the average ‘undetected’ email has potentially generated 50-55 clicks, with a long tail of (few) emails generating up to 100 clicks. (Notice that additional organization-specific features of the email, e.g. the presence of the company logo, may also have an effect on the number of clicks. Whereas this is out of the scope of this paper, which only looks at the cognitive effects, a fully-operative model within an organization can easily integrate other factors in the prediction.) This suggests that prioritization efforts based on the cognitive characteristics of a phishing email could help in addressing attacks more efficiently (e.g. by means of takedown actions), by targeting first the attacker resources that are likely to generate more impact on the organization’s customer base: by targeting first the emails that are most likely to engage users in compliant behaviour, organizations can effectively triage the stream of incoming phishing attacks to minimize the impact on their customer base.
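A minimal sketch of this prediction step, reusing the bootstrapped coefficient draws from the previous sketch (boot_coefs) on a DataFrame of ‘undetected’ emails with matching, illustratively named regressor columns:

```python
import numpy as np
import pandas as pd

def simulate_predicted_clicks(boot_coefs: pd.DataFrame,
                              emails: pd.DataFrame,
                              n_sim: int = 50_000,
                              seed: int = 0) -> np.ndarray:
    """Sample coefficient vectors from the bootstrapped draws and apply the
    Poisson mean function exp(X @ beta) to emails with no recorded click."""
    rng = np.random.default_rng(seed)
    # Design matrix: intercept column plus the regressors in the order of the draws.
    X = np.column_stack([np.ones(len(emails)),
                         emails[boot_coefs.columns[1:]].to_numpy()])
    idx = rng.integers(0, len(boot_coefs), size=n_sim)
    preds = np.exp(X @ boot_coefs.to_numpy()[idx].T)   # shape: (n_emails, n_sim)
    return preds.mean(axis=1)                          # expected clicks per email
```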

6 Discussion

The previous sections have demonstrated how quantitative measurements of the cognitive vulnerabilities employed in phishing attacks can be used to build a model that predicts the expected efficacy of these attacks. This characterization allows one to assess the threat of an attack in an automated way, making instant prioritization of phishing incident responses possible. This paper's contributions go beyond earlier work on cognitive factors in phishing by providing an empirical estimation of phishing success and an operable implementation to predict it.

In this work we identified several correlations between different cognitive vulnerabilities and the average number of clicks an email can be expected to generate. In line with the hypothesis that the presence of any individual cognitive vulnerability increases user response to the phish, we found that Consistency and Scarcity exert a clear positive effect on the number of generated clicks. We find no evident effect for Social proof and Liking, whereas Authority appears to have a positive effect, albeit driven by only a few non-zero data points. Interestingly, Reciprocity even shows a counterproductive, if marginal, effect. These differences may well be explained by the specific application domain, as corporate customers subject to financial threats from phishing can generally be expected to have a different sensitivity to specific principles of influence than other groups [22]. Although this suggests that full generalizability cannot be expected for any one set of results, conclusions similar to ours could be drawn for specific contexts close in nature to the one in which Org operates. In particular, our finding of a reduced effect of Authority in the banking domain contrasts with results from Wash and Cooper [53], who found Authority to be the most effective strategy for the presentation of certain phishing education materials. This difference may illustrate the context-dependency of the relative efficacy of these influence tactics, and indicates a need for careful consideration of such differences across domains. For example, the effect of Authority may be mediated here by the already relatively high authoritative position a bank holds over its customers; this suggests, on the one hand, that depending on the domain it may be harder for an attacker to devise attacks that add to baseline cognitive effects and, on the other, that a relevant metric to evaluate could be the relative increase (or decrease) in the cognitive effect w.r.t. that baseline. A similar consideration applies to Reciprocity, for which we observe a negative effect on the generated 'clicks'. An explanation could be that these types of triggers raise a red flag in the context of banking operations, for example because a bank's 'environmental friendliness' may not be a convincing enough reason to act on a request (e.g. to renew one's debit card).

These observations also provide useful input to the training campaigns regularly run by medium and large organizations to increase their customers' and employees' awareness of the social engineering threat. Replications of this study in specific domains could reveal which principles of influence the 'average' customer of an organization is most vulnerable to; awareness campaigns run by the organization could then target those specific traits by providing concrete examples or information material built ad hoc for the consumer base (or for sections of it). For example, consumers particularly vulnerable to Scarcity may benefit from knowing the organization's policies in terms of change deadlines and processes, such that an email stating unrealistically short cutoff dates for action loses credibility.

Operationally, the presented procedure could be applied both client- and server-side to automate the risk evaluation of potential phishing emails and enforce security policies; for example, mail client plugins or server-side processes could automatically divert or forward high-risk emails to phishing investigation and response teams for further evaluation, while delaying the delivery of messages awaiting a diagnosis. Furthermore, we have described how the observed effects can be used to construct a prediction model for the triaging of incoming phishing attacks. Such triaging enables incident response teams to focus on the most prominent threats immediately, without having to manually filter the noise from the bulk of low-priority emails in their phishing abuse inbox, thereby minimizing reaction costs and increasing response effectiveness. The practicality of this is evidenced in Figure 15: by addressing the small fraction of emails associated with the highest expected click counts, one can mitigate a large fraction of potential attacks. This is critical to minimize overall victimization rates, as the short-lived nature of phishing domains demands prompt identification of the domains most likely to be reached by customers falling for the phish. By contrast, addressing attacks in no particular order would most likely waste valuable time and resources on the vast majority of attacks that are likely to generate only a few clicks (ref. Figure 15 and Table 8).
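The following sketch illustrates what such a server-side hook might look like; all function names and the threshold are hypothetical placeholders for organization-specific components, not part of our implementation.

```python
# Illustrative server-side triage hook (all names and the threshold are
# hypothetical placeholders for organization-specific components).
HOLD_THRESHOLD = 60  # expected clicks above which delivery is delayed (assumed value)

def triage(email_text, extract_triggers, predict_clicks, queue_for_analysts, deliver):
    """Route one suspicious email based on its predicted impact."""
    triggers = extract_triggers(email_text)       # cognitive trigger scores
    expected_clicks = predict_clicks(triggers)    # output of the prediction model
    if expected_clicks >= HOLD_THRESHOLD:
        # hold delivery and hand the email to the response team, highest impact first
        queue_for_analysts(email_text, priority=expected_clicks)
    else:
        deliver(email_text)
    return expected_clicks
```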

Finally, our method opens up new opportunities in terms of automated incident handling and security orchestration, e.g. by enabling incident handlers to apply automated follow-up procedures to incoming phishing attacks that fall within a certain threat range; for example, the measured vulnerability triggers could be used to implement dynamic risk-based access control policies that limit immediate follow-up actions. Similarly, CSIRTs (Computer Security Incident Response Teams) could implement automated network-level containment procedures based on the profile of incoming emails, and avoid additional (and unnecessary) victimization by delaying users' follow-up actions until the risk is cleared.

7 Conclusions

In this work we presented an empirical method and evaluation of the effect of cognitive vulnerability triggers in phishing emails on the expected 'success' of an attack. We employed a unique dataset from the phishing response division of a large European financial organization. Our results indicate that response team operations, such as take-down actions against rogue phishing domains, could benefit substantially from a (fully automated) cognitive assessment of the email body to predict the relative success of the attack for the relevant user base. Our findings and method could also be employed to deploy more effective training and awareness campaigns in response to the most prominent threats faced by potential victims. Future work could explore automated response strategies to contain potential attacks and/or delay user response where needed.

References

  • [1] Afroz, S., and Greenstadt, R. PhishZoo: Detecting phishing websites by looking at them. In Proc. of ICSC (2011), IEEE, pp. 368–375.
  • [2] Akbar, N. Analysing Persuasion Principles in Phishing Emails. PhD thesis, 2014.
  • [3] Butavicius, M., Parsons, K., Pattinson, M., and McCormac, A. Breaching the Human Firewall: Social engineering in Phishing and Spear-Phishing Emails. In Proc. of ACIS (2015), pp. 1–11.
  • [4] Campaign Monitor. What Our Data Told Us about the Best Time to Send Email Campaigns, 2014.
  • [5] Chen, K.-T., Chen, J.-Y., Huang, C.-R., and Chen, C.-S. Fighting Phishing with Discriminative Keypoint Features. IEEE Internet Comput. (2007), 1–6.
  • [6] Cialdini, R. Influence: The Psychology of Persuasion. 1984.
  • [7] Cialdini, R., and Goldstein, N. The science and practice of persuasion. Cornell Hosp. Q. 43, 2 (2002), 40–50.
  • [8] Cichonski, P., Millar, T., Grance, T., and Scarfone, K. Computer security incident handling guide. NIST Special Publication 800, 61 (2012), 1–147.
  • [9] Dhamija, R., Tygar, J. D., and Hearst, M. Why phishing works. In Proc. of CHI (2006), p. 581.
  • [10] Dunlop, M., Groat, S., and Shelly, D. GoldPhish: Using images for content-based phishing analysis. In Proc. of ICIMP (2010), IEEE, pp. 123–128.
  • [11] Efron, B., and Tibshirani, R. J. Introduction to the Bootstrap. 1993.
  • [12] Ferreira, A., Coventry, L., and Lenzini, G. Principles of persuasion in social engineering and their use in phishing. In Lecture Notes in Computer Science, vol. 9190. 2015, pp. 36–47.
  • [13] Ferreira, A., and Lenzini, G. An analysis of social engineering principles in effective phishing. In Proc. of STAST (2015), pp. 9–16.
  • [14] Fürnkranz, J., Brinker, K., Mencía, E. L., and Hüllermeier, E. Multilabel Classification via Calibrated Label Ranking. Mach. Learn. 73, 2 (2008), 1–23.
  • [15] Hale, M. L., Gamble, R. F., and Gamble, P. CyberPhishing: A game-based platform for phishing awareness testing. In Proc. of HICSS (2015), IEEE, pp. 5260–5269.
  • [16] Ho, G., Javed, A. S. M., Paxson, V., and Wagner, D. Detecting Credential Spearphishing Attacks in Enterprise Settings. In Proc. of USENIX-Security (2017), pp. 469–485.
  • [17] Huang, C. Y., Ma, S. P., Yeh, W. L., Lin, C. Y., and Liu, C. T. Mitigate web phishing using site signatures. In Proc. of TENCON (2010), IEEE, pp. 803–808.
  • [18] Jain, A. K., and Gupta, B. B. Phishing detection: Analysis of visual similarity based approaches. Secur. Commun. Netw. 2017 (2017).
  • [19] Kanich, C., Kreibich, C., Levchenko, K., Enright, B., Voelker, G. M., Paxson, V., and Savage, S. Spamalytics: an empirical analysis of spam marketing conversion. In Proc. of CCS (2008), pp. 3–14.
  • [20] Kumaraguru, P., Cranshaw, J., Acquisti, A., Cranor, L., Hong, J., Blair, M. A., and Pham, T. School of phish: a real-world evaluation of anti-phishing training. In Proc. of SOUPS (2009), p. 3.
  • [21] Lawless, J. F. Regression methods for Poisson process data. J. of the Am. Stat. Assoc. 82, 399 (1987), 808–815.
  • [22] Lawson, P., Zielinska, O., Pearson, C., and Mayhorn, C. B. Interaction of personality and persuasion tactics in email phishing attacks. In Proc. of HFES (2017), vol. 61, pp. 1331–1333.
  • [23] Le Blond, S., Uritesc, A., Gilbert, C., Chua, Z. L., Saxena, P., and Kirda, E. A Look at Targeted Attacks Through the Lense of an NGO. In Proc. of USENIX-Security (2014), pp. 543–558.
  • [24] Le Page, S., Jourdan, G. V., Bochmann, G. V., Flood, J., and Onut, I. V. Using URL shorteners to compare phishing and malware attacks. In Proc. of eCrime (2018), pp. 1–13.
  • [25] Liu, G., Qiu, B., and Wenyin, L. Automatic detection of phishing target from phishing webpage. In Proc. of ICPR (2010), pp. 4153–4156.
  • [26] Mailchimp. Insights from Mailchimp’s Send Time Optimization System, 2014.
  • [27] Mao, J., Li, P., Li, K., Wei, T., and Liang, Z. BaitAlarm: Detecting phishing sites using similarity in fundamental visual features. In Proc. of INCoS (2013), IEEE, pp. 790–795.
  • [28] Marchal, S., Armano, G., Gröndahl, T., Saari, K., Singh, N., and Asokan, N. Off-the-Hook: an efficient and usable client-side phishing prevention application. IEEE Trans. Comput. 66, 10 (2017), 1717–1733.
  • [29] Marchal, S., and Asokan, N. On Designing and Evaluating Phishing Webpage Detection Techniques for the Real World. In Proc. of USENIX-Security (2018).
  • [30] Mishra, A., and Gupta, B. B. Hybrid Solution to Detect and Filter Zero-day Phishing Attacks. In Proc. of ERCICA (2014), pp. 373–379.
  • [31] Mitnick, K. D., and Simon, W. L. The Art of Deception: Controlling the Human Element in Security. 2002.
  • [32] Moghimi, M., and Varjani, A. Y. New rule-based phishing detection method. Expert Syst. Appl. 53 (2016), 231–242.
  • [33] O’Keefe, D. J. Elaboration likelihood model. The International Encyclopedia of Communication (2008).
  • [34] Oliveira, D., Rocha, H., Yang, H., Ellis, D., Dommaraju, S., Muradoglu, M., Weir, D. H., and Ebner, N. C. Dissecting Spear Phishing Emails for Older vs Young Adults: On the Interplay of Weapons of Influence and Life Domains in Predicting Susceptibility to Phishing. In Proc. of CHI (2017), pp. 1–13.
  • [35] Pearce, P., Dave, V., Grier, C., Levchenko, K., Guha, S., McCoy, D., Paxson, V., Savage, S., and Voelker, G. M. Characterizing large-scale click fraud in zeroaccess. In Proc. of CCS (2014), pp. 141–152.
  • [36] PhishLabs. Phishing Trends & Intelligence Report: Hacking the Human. Tech. rep., 2018.
  • [37] Pitsillidis, A., Kanich, C., Voelker, G. M., Levchenko, K., and Savage, S. Taster’s choice: A comparative analysis of spam feeds. In Proc. of IMC (2012), pp. 427–440.
  • [38] Propeller. The 2017 Email Marketing Field Guide: The Best Times and Days to Send Your Message and Get It Read, 2017.
  • [39] Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. Labeled LDA. In Proc. of EMNLP (2009), vol. 1, p. 248.
  • [40] Redmiles, E. M., Chachra, N., and Waismeyer, B. Examining the Demand for Spam: Who Clicks? In Proc. of CHI (2018), pp. 1–10.
  • [41] Robertson, J. L., and Barling, J. Greening organizations through leaders' influence on employees' pro-environmental behaviors. J. of Organ. Behav. 34, 2 (2013), 176–194.
  • [42] Rosiello, A. P., Kirda, E., Kruegel, C., and Ferrandi, F. A layout-similarity-based approach for detecting phishing pages. In Proc. of SecureComm (2007), pp. 454–463.
  • [43] Rubin, T. N., Chambers, A., Smyth, P., and Steyvers, M. Statistical topic models for multi-label document classification. Mach. Learn. 88, 1-2 (2012), 157–208.
  • [44] Sagarin, B. J., and Mitnick, K. D. The Path of Least Resistance. In Six Degrees Of Social Influence: Science, Application, and the Psychology of Robert Cialdini. 2012, ch. 3.
  • [45] SendInBlue. Best Time to Send an Email: User Data Study by Industry, 2017.
  • [46] Shah, A., Ganesan, R., Jajodia, S., and Cam, H. Understanding tradeoffs between throughput, quality, and cost of alert analysis in a csoc. IEEE Transactions on Information Forensics and Security 14, 5 (2019), 1155–1170.
  • [47] Singhal, A., Buckley, C., and Mitra, M. Pivoted Document Length Normalization. ACM SIGIR Forum 51, 2 (2011), 21–29.
  • [48] Stanovich, K. E., and West, R. F. Individual differences in reasoning: Implications for the rationality debate? Behav. Brain Sci. 23 (2000), 645–726.
  • [49] Stringhini, G., and Thonnard, O. That ain’t you: Blocking spearphishing through behavioral modelling. In Proc. of DIMVA (2015), pp. 78–97.
  • [50] Szurdi, J., Kocso, B., Cseh, G., Felegyhazi, M., and Kanich, C. The Long Taile of Typosquatting Domain Names. In Proc. of USENIX-Security (2014), pp. 191–206.
  • [51] Tay, L., Tan, K., Diener, E., and Gonzalez, E. Social Relations, Health Behaviors, and Health Outcomes: A Survey and Synthesis. Appl. Psychol. Health Well-Being 5, 1 (2013), 28–78.
  • [52] Tversky, A., and Kahneman, D. Judgment under Uncertainty: Heuristics and Biases. Science 185, 4157 (1974), 141–162.
  • [53] Wash, R., and Cooper, M. M. Who Provides Phishing Training? Facts, Stories, and People Like Me. In Proc. of CHI (2018), pp. 1–12.
  • [54] Workman, M. Wisecrackers: A Theory-Grounded Investigation of Phishing and Pretext Social Engineering Threats to Information Security. J. of Am. Soc. Inf. Sci. 59, 4 (2008), 1–12.
  • [55] Wright, R. T., Jensen, M. L., Thatcher, J. B., Dinger, M., and Marett, K. Influence techniques in phishing attacks: An examination of vulnerability and resistance. Inf. Syst. Res. 25, 2 (2014), 385–400.
  • [56] Wright, R. T., and Marett, K. The Influence of Experiential and Dispositional Factors in Phishing: An Empirical Investigation of the Deceived. J. of Manag. Inf. Syst. 27, 1 (2010), 273–303.
  • [57] Wu, M., Miller, R. C., and Garfinkel, S. L. Do security toolbars actually prevent phishing attacks? In Proc. of CHI (2006), p. 601.
  • [58] Yip, M., Shadbolt, N., and Webber, C. Why forums? An empirical analysis into the facilitating factors of carding forums. In Proc. of WebSci (2013).

Appendix

We find no concerning correlations between the regressors. Only Scarcity and Social proof, and Liking and Spoof dist., show higher-than-average correlations of 0.50 and 0.57 respectively, which are unlikely to affect the estimation results (Table 9); a minimal sketch of this check follows the table.
                   (1)    (2)    (3)    (4)    (5)    (6)    (7)
(1) Reciprocity   1.00  -0.19   0.13  -0.06  -0.11  -0.09  -0.17
(2) Consistency          1.00  -0.08  -0.24   0.10  -0.09  -0.41
(3) Social proof                1.00   0.25  -0.04   0.50   0.13
(4) Authority                          1.00   0.06  -0.08  -0.10
(5) Liking                                    1.00   0.09   0.57
(6) Scarcity                                         1.00   0.24
(7) Spoof dist.                                             1.00

Table 9: Correlations between regression variables
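This check amounts to computing pairwise Pearson correlations between the regressors; a minimal sketch, assuming the scores are columns of a pandas DataFrame (column names are illustrative):

```python
# Minimal sketch of the multicollinearity check behind Table 9
# (regressor column names are illustrative).
import pandas as pd

REGRESSORS = ["reciprocity", "consistency", "social_proof",
              "authority", "liking", "scarcity", "spoof_dist"]

def regressor_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between the regression variables."""
    return df[REGRESSORS].corr().round(2)
```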

Bootstrap analysis for similar email detection

The optimal threshold was found at 0.91 based on the intersection of the mean sensitivity and specificity metrics at all decimal thresholds in [0,1] across 10,000 bootstrap simulations with sample size 300.
Figure 16: Simulated optimal cosine similarity threshold for duplicate detection

After computing the full pairwise similarity matrix for all suspect emails in our dataset, a threshold value was used as the lower bound on the similarity score of emails we consider to be duplicates. To determine the optimal threshold value for our dataset we performed a bootstrap analysis [11]. A bootstrap analysis involves repeatedly running simulations on samples drawn with replacement from an original sample in order to estimate statistics of a larger population. This is a fitting solution for datasets of large size like ours, which generally do not allow for efficient derivation of the full set of results that would qualify as "ground truth".

We started our bootstrap analysis with a random sample of 300 suspect phishing emails for which we made a manual assessment of all pairwise similarities to test the performance of our cosine similarity algorithm across different thresholds. Then, we repeatedly (10,000 times) drew samples of size 300 with replacement from our manually classified sample and evaluated the pairwise cosine similarities at all decimal thresholds in the interval [0,1]. For each combination of bootstrap sample and threshold value we computed performance using the sensitivity (true positive rate) and specificity (true negative rate) metrics. A high sensitivity score corresponds to a high probability of duplicate detection, measured by the proportion of actual duplicates correctly identified as similar, whereas a high specificity score corresponds to a high probability of non-duplicate rejection, measured by the proportion of actual non-duplicates correctly identified as not similar. The intersection of the mean results for these two performance measures indicates that 0.91 is the optimal threshold value for our dataset, as visualized in Figure 16.
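A hedged sketch of this procedure is given below; it operates on manually labelled email pairs and uses a 0.01 threshold grid, which is one plausible reading of "all decimal thresholds", and is not the exact code of our analysis.

```python
# Hedged sketch of the threshold bootstrap: resample labelled email pairs,
# compute sensitivity and specificity at each candidate threshold, and pick the
# threshold where the two mean curves intersect (0.91 in our data).
import numpy as np

def bootstrap_threshold(similarities, labels, n_boot=10_000, seed=0):
    """similarities: cosine similarity per labelled pair; labels: 1 = duplicate."""
    rng = np.random.default_rng(seed)
    s_all, y_all = np.asarray(similarities), np.asarray(labels)
    thresholds = np.round(np.arange(0.0, 1.001, 0.01), 2)
    sens = np.zeros((n_boot, len(thresholds)))
    spec = np.zeros((n_boot, len(thresholds)))
    for b in range(n_boot):
        idx = rng.integers(0, len(s_all), size=len(s_all))   # sample with replacement
        s, y = s_all[idx], y_all[idx]
        for j, t in enumerate(thresholds):
            pred = s >= t
            sens[b, j] = (pred & (y == 1)).sum() / max((y == 1).sum(), 1)
            spec[b, j] = (~pred & (y == 0)).sum() / max((y == 0).sum(), 1)
    gap = np.abs(sens.mean(axis=0) - spec.mean(axis=0))
    return thresholds[gap.argmin()]   # threshold closest to the intersection
```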

We use this threshold to calculate the pairwise similarity matrix for all emails in our dataset and assign the same duplicate ID to emails that are found to be similar, allowing us to filter for unique emails.
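One plausible way to implement this grouping is to treat emails as nodes and above-threshold similarities as edges, assigning every connected component a shared duplicate ID; the sketch below uses a small union-find for this purpose (this is an illustration, not necessarily the exact grouping rule of our pipeline).

```python
# One plausible grouping: emails are nodes, above-threshold similarities are
# edges, and every connected component shares a duplicate ID (union-find).
import numpy as np

def assign_duplicate_ids(sim_matrix: np.ndarray, threshold: float = 0.91) -> np.ndarray:
    """Return one duplicate ID per email from the full pairwise similarity matrix."""
    n = sim_matrix.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim_matrix[i, j] >= threshold:
                parent[find(i)] = find(j)    # merge the two groups
    return np.array([find(i) for i in range(n)])
```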

Distribution of phishing emails by user

Figure 17: Estimation of rates of arrival of phishing emails to users
Figure 18: Number of users reporting emails

We perform a robustness check to evaluate our results against possible large biases in the distribution of phishing emails across potential victims (whereby higher reported clicks may simply reflect higher email delivery volumes). We base the following on the sole assumption that attackers "sample" victims from the same pool (i.e. Org customers). As we cannot measure email delivery rates in user inboxes, our aim is to ask what the dataset would look like if large sampling biases were present, and to look for such evidence in the data.

We developed the following data generation model to formalize and test this: a phishing email $e$ can reach a user $u$ with probability $p_{arr}(u,e)$. A suspicious email will be detected by the user with a certain probability $p_{det}(u,e)$. Notice that this probability depends on both the email and the specific user that receives it, as different users may have different sensibilities to emails with different characteristics. The probability of the email remaining undetected is simply the complement, $1 - p_{det}(u,e)$.

Further, each user has a certain probability $p_{rep}(u)$ and $p_{click}(u)$ of, respectively, reporting a detected email and clicking on a link in an undetected one. Hence, the probability of an email being reported by a user is $p_{arr}(u,e) \cdot p_{det}(u,e) \cdot p_{rep}(u)$. Conversely, $p_{arr}(u,e) \cdot (1 - p_{det}(u,e)) \cdot p_{click}(u)$ is the probability of a click, for each user $u$ and email $e$.

Let $C$ be the set of clicked emails and $R$ the set of reported emails; we then have:

$P(e \in R) = p_{arr}(u,e) \cdot p_{det}(u,e) \cdot p_{rep}(u)$   (2)

$P(e \in C) = p_{arr}(u,e) \cdot (1 - p_{det}(u,e)) \cdot p_{click}(u)$   (3)

$R$ corresponds to the whole set of suspicious emails reported in the organization's phishing inbox; $C$ corresponds to the set of clicked emails (of which we only observe the subset detected by Org's infrastructure).

Notice that $p_{arr}(u,e)$ (i.e. the probability of an email arriving in a user's inbox) is the only variable outside the direct influence of the user. By contrast, $p_{det}$, $p_{rep}$, and $p_{click}$ directly depend on the characteristics of the user and of the email. Hence, we would expect them to remain approximately constant as long as the users under comparison are the same and the emails are similar to each other. As we can control for 'similar' emails (Section 3.1.3) and all users are drawn from the pool of Org's clients, we can attribute large effects on $P(e \in R)$ and $P(e \in C)$ to large fluctuations in $p_{arr}$.

Specifically, we would expect $P(e \in C)$ to grow iff $P(e \in R)$ grows as well, as all other terms in the two equations remain approximately the same for similar emails and users sampled from the same pool. Hence, we measure the ratio $|C|/|R|$ as a proxy to estimate how much $p_{arr}$ can be expected to vary across emails. This holds under the sole assumption that emails in $C$ are indistinguishable (from the perspective of the user) from those in $R$; this is uncontroversial, as the detection mechanism determining the inclusion of an email in $C$ only depends on the phishing webpage and not on the email per se.
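A minimal sketch of the resulting ratio computation, assuming one row per reported email with a binary 'clicked' flag and a similarity-group identifier (column names are illustrative):

```python
# Hedged sketch of the ratio check behind Figure 17: for each group of similar
# emails, the fraction of reported emails with at least one detected click; an
# approximately constant ratio across groups argues against strongly skewed delivery.
import pandas as pd

def click_report_ratio(reported: pd.DataFrame, group_col: str = "duplicate_id") -> pd.Series:
    """reported: one row per reported email, with 'clicked' in {0, 1}."""
    grouped = reported.groupby(group_col)["clicked"]
    return grouped.sum() / grouped.count()   # |C| / |R| per email group
```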

Figure 17 reports the ratio distribution calculated over emails with cosine similarity above the defined threshold (left), and over emails employing the same cognitive attacks (right). The ratios are small, i.e. only a small fraction of reported emails of a certain type can be expected to generate at least one click; this is qualitatively and quantitatively in line with previous findings in the literature [19]. Importantly, we observe that under both measures of similarity the ratio is essentially constant and settles around 1-5% for all emails. This observation is incompatible with a significantly skewed distribution of emails per user. Breaking down the figure by users receiving the phishing email does not reveal any additional pattern, as most users report only a few emails each (ref. Figure 5 and Figure 18).

This is in line with previous literature on phishing attacks [9, 16] suggesting that no specific pre-selection of users characterizes untargeted phishing attacks. We therefore do not expect significant biases in our analysis to emerge from the (otherwise unmeasurable) distribution of emails in users' inboxes.