Online services are an integral part of our personal and professional lives. To support widespread adoption and improve usability, large-scale online service providers (LSOSPs) have made it simple for users to access any of the provided services using a single credential. Such “single sign-on” systems make it much easier for users to manage their interactions through a single account and sign-in interface. As users become more invested in the platform, the single login credential becomes a valuable key to a whole set of services, as well as the ‘key’ to their digital identity and the personal information stored on the platform. As a consequence, these credentials are highly attractive targets to attackers as well.
As LSOSPs improve their defense systems to protect their user base, attackers have shifted their efforts to social engineering attacks, e.g, attacks that exploit incorrect decisions made by individual users to trick them into disclosing their login credentials (Jagatic et al., 2007). Once an account is compromised, the attackers hijack the account from its legitimate owner and, typically, use it for their own purposes (Moore et al., 2009): for example, to evade detection while perpetuating an attack (e.g., multi-stage phishing, or malware distribution campaigns) or to carry out other fraudulent activity (e.g., sending out spam email).
Thus, detecting compromised accounts early and giving back control to their legitimate owners quickly, and even designing defense mechanisms that add additional layers of defense to protect users likely to fall prey to social engineering attacks, is crucial. Doing so can mitigate the damage an attacker can do while in control of a compromised account, protect the account owner’s digital identity, and reduce the damage inflicted by an automated large-scale social-engineering attack to a LSOSP and its user community. It should be noted that, detecting compromised accounts is much more challenging than just identifying fake ones (i.e., those created by an attacker) since, in such cases, suspicious activity is typically interleaved with the account owner’s legitimate activity (Egele et al., 2013).
This paper tests the hypothesis that it is feasible to identify likely future victims of mass-scale social-engineering attacks. In a nutshell, we postulate that the behavioral patterns of the users that have little incentives or low ability to fend off social-engineering attacks can be learned. To this end we propose an early warning system based on a completely automated pipeline using machine learning (ML) to identify the accounts with similar behavioral patterns to accounts that have been flagged as suspicious in the past.
We postulate that predicting accounts that are more likely to be compromised in the future can be used to develop new defenses, to fine-tune and better target existing defense mechanisms, as well as to better protect vulnerable users (Halawa et al., 2016). For example, in the context of online social networks, identifying potentially vulnerable accounts - even with low accuracy - was sufficient to develop a defense system for detecting fake accounts that significantly outperformed other state of the art approaches (Boshmaf et al., 2015). While we briefly discuss the intuition behind some of these defense mechanisms in the discussion section (§8), their design and evaluation, however, is beyond the scope of this paper and we focus here solely on evaluating our conjecture that predicting accounts which are more likely to be compromised is feasible.
We have tested our hypothesis using real-world data from a large LSOSP (i.e., at the scale of Amazon, Facebook, Google, Yahoo, etc.). Throughout this paper we will refer to it as a LSOSP (in italics, the non-italicized LSOSP refers to a generic Large-Scale Online Service Provider). Our experiments were carried out over four months worth of production data covering hundreds of millions of users generating hundreds of billions of login events to LSOSP’s platform.
Overall, our results indicate that it is feasible to achieve good classification accuracy as well as a low false positive rate. Our evaluation demonstrates that the proposed approach is not only feasible but also offers promising performance, based on which new or improved defense mechanisms can be developed. It is important to note that our results should be seen as a lower bound of achievable classification performance, which can likely be further improved by using richer data or additional computational resources (e.g., to support more sophisticated learning methods) as we discuss in §8.
This paper makes the following contributions:
We demonstrate the feasibility and applicability of the proposed approach on real-world production data (§6
). We show that, even using low-cost features extracted from two basic datasets (§5) and a simple model selection approach (§4) for acceptable training runtime, the proposed classifier can be tuned to achieve good classification quality based on recall, precision, and false positive rate metrics (§6). For example (§6.3), using only one week of login event history and predicting one month in advance, our classifier predicts more than half of the accounts later flagged as having suspicious behaviour (i.e., achieves a recall of 50.62%) and, at the same time, around one in five of the accounts predicted to generate suspicious behaviour is actually labeled as suspicious at LSOSP within a 30-day prediction horizon (i.e., precision of 18.33%, with a corresponding false positive rate of 0.49%).
2. Background: Account Life Cycle
Figure 1 provides an overview of the typical life cycle of an account that at some point is compromised (e.g., through a social engineering attack). After registering an account, a legitimate user starts using the provided services. We refer to the behavioral patterns over this time as the user’s normal behavior. Attackers may carry out automated social engineering attacks that attempt to compromise user accounts (e.g., falling for a phishing attack) (Jagatic et al., 2007; Boshmaf et al., 2011; Yan et al., 2011).
If the attack is successful, the user becomes a victim of the attack and her account is considered compromised, and eventually the attacker may start to (mis)use the account. Typically, some time is required before the defense systems employed by LSOSPs can identify such activity and, in turn, flag the account as suspicious.
At that stage, remedial procedures at the LSOSP start to take place (Freeman et al., 2016). For example, a popular low-cost measure is to use CAPTCHAs to make it more difficult for an attacker to automate the use of compromised accounts. The protective measures can be escalated and extra challenges can be then thrown at identified suspicious accounts such as prompting for answers to secret questions, or second factor authentication (2FA) through another service or through a mobile device. Their goal is to prevent the attacker from continuing to use the account while providing the legitimate user with some means of regaining control. After remediation, the user takes back control of her account. The account is no longer compromised and the observed behavioral patterns return back to normal over time.
If the suspicious behavior persists, the employed protective measures can be escalated again to limit potential damage to other users on the platform. For example, the LSOSP can quarantine the suspicious account (e.g., impose rate limits on the outbound messages or deny certain types of service) or even suspend it.
We note that fake accounts, i.e., accounts created by the attacker from the beginning, share some of the same phases during their lifetime as those discussed above for legitimate accounts. Fake accounts are typically disguised to appear legitimate in order to evade detection for as long as possible. For example, realistic looking profiles can be faked, normal behavioral patterns can be emulated, and fake accounts can be registered through unique IP addresses to avoid clustering.
3. Problem Formulation
We present a formal description of our problem by abstracting away from all company- and experiment-specific details. We describe those details in §4, §5 and §6 respectively. Here, we go over the assumptions and objectives that influenced our approach, we elaborate on the datasets required to carry out the classification task (§3.2), and we introduce our classification exercises (CEs), which are the means by which we organize our experiments (§3.3).
Our goal is to develop an early warning system that can be used by LSOSPs to harness observable legitimate user behavior to identify accounts likely to be labeled as suspicious in the future. Our intuition is the following: over the course of everyday use, the history of user interactions encapsulates information from which one can infer whether an account is more likely to be compromised (e.g., because the user does not have the interest or the ability to fend off social-engineering attacks); eventually (some of) these accounts are compromised, generate suspicious activity, and may later be flagged. In other words, to forecast future suspicious activity, we aim for features that approximate user behavioral patterns to infer similarity to legitimate accounts that are later flagged as suspicious.
We are developing a binary classifier to act as an early warning system. We chose a supervised machine learning classifier as, over the past few years, such approaches have been shown to achieve good performance for a variety of classification tasks (Liu et al., 2015; Soska and Christin, 2014; Shon and Moon, 2007).
3.2. Assumptions, Objectives, and Datasets
Assumptions. We treat the prediction of suspicious accounts as a binary classification problem (suspicious vs. non-suspicious). We assume that only a small subset of the overall population is likely to exhibit suspicious activity. We believe that this is true for large providers that offer services to a large number of users around the world (up to billions of users) and dedicate resources to maintain a ‘healthy’ user population. The direct implication is that the ML techniques used, the data selection for the training of the classifiers, and the success metrics used are all tuned for imbalanced data.
Objectives. We aim to meet the following objectives when designing and tuning the binary classifier. First, a low rate of false positives: accounts incorrectly predicted as suspicious (i.e., false positives) should be minimized even at the cost of decreasing the number of correctly predicted suspicious accounts (i.e., true positives). This trade-off can be controlled by tuning the classifier’s prediction threshold when generating the final binary classification. We also discuss tuning for a low rate of false negatives in (§8).
Second, and crucially for deployment at a LSOSP, the classifier should be optimized for runtime efficiency during both training (feature extraction and model building) and testing/use (prediction and classification). This can be accomplished by employing features that can be easily extracted/computed from the raw data, and by choosing ML models that offer a good trade-off between the quality of prediction and performance. Balancing this trade-off is crucial for timely forecasting of suspicious activity and thus faster remediation (as well as adoption in realistic settings).
Required Datasets. We assume that the LSOSP has access to at least two types of data. First, data that can be mined to extract behavioral patterns. Second, a sample of accounts previously flagged as suspicious is required as ground truth. We detail the data we use from LSOSP in §5.
3.3. Experiment Organization
Here we establish the terminology we use for the rest of this paper. We define the means by which we organize our experiments (Classification Exercises) and we detail the categories of accounts that can be observed in the datasets and how we use them.
A Classification Exercise
(CE) is our way of grouping together all parameters of a binary classification experiment (e.g., training time interval, testing time interval, ML model hyperparameters) and the associated results. As with any typical ML approach, a CE is divided into two distinct phases:training and testing (Figure. 2 provides an overview). During training, our goal is to fit a model that learns user behavioral patterns that can be used as early predictors of suspicious account activity. During testing, the fitted model is applied to new data not seen during training and the classifier’s performance is evaluated against a labeled ground truth.
|U||Set of all registered accounts|
|Set of accounts with login activity on day d|
|Set of accounts flagged as suspicious on day d|
|Set of accounts under attacker control on day d|
|Set of fake accounts on day d|
|Set of compromised accounts on day d|
|Set of accounts labeled as suspicious during a|
|Set of accounts predicted as suspicious during a|
Categories of Accounts. Table 1 presents a summary of the set notation used to categorize user accounts. We consider as the set of all accounts registered with the LSOSP. Depending on the scale and popularity of the LSOSP, can be extremely large potentially exceeding a billion users.
We use days as a coarse-grain measure of time. We consider as the set of users with login activity on day d. For the set , we extract easy-to-compute low-cost features representing the users’ login behavior on day d. We aim to learn the behavioral patterns of legitimate accounts prior to them being flagged as suspicious and, as such, require some information about accounts later flagged as having suspicious behaviour. We denote with the set of user accounts flagged as suspicious on day d. Existence of an account in set on day d is a clear indication that the account exhibited some suspicious activity prior to or on day d.
However, it is important to note that the opposite is not true: if an account is absent from the set on day d that does not imply that it did not exhibit any irregular activity prior to or on day d. The reason for this is that the pipeline used for detecting suspicious accounts at the LSOSP is expected to have some lag. In other words, it takes time for an account to be flagged as suspicious after it first starts exhibiting irregular behavior.
Avoiding Attacker-Controlled Accounts. The set contains not only legitimate user accounts but also those that are under the control of an attacker (the set ). These include fake as well as compromised accounts (considered as sets and respectively). We implement several heuristics to prune such accounts and avoid learning user behavioral patterns from accounts that may be under attacker control
We implement several heuristics to prune such accounts and avoid learning user behavioral patterns from accounts that may be under attacker control. We discuss this preprocessing step in detail in §4.3.
Classification exercise data. This data cleaning leads to , the set of labeled accounts during our CE (where d d Training Interval ). We denote the set of accounts our pipeline forecasts as suspicious during a CE as . To evaluate the quality of our predictions, we then compare the accounts in to the ground truth in the testing interval ( d d Testing Interval ).
Formalization. During training, a CE uses as input both and d d Training Interval with some pre-processing to avoid learning from attacker controlled accounts. The output after training is the fitted model . The set of labels used as ground truth when fitting the model is d d Training Interval . During testing, the inputs are the fitted model and d d Testing Interval . The output of the CE is the set of predicted suspicious accounts that are evaluated against a labeled ground truth from d d Testing Interval .
Success Metrics. Based on our formalization above, the outcome of the binary classifier can only fall in one of four possible categories: True Positive , False Positive , True Negative , and False Negative , where d d Testing Interval . Based on these one can derive classification precision, recall, accuracy and false positive rate.
4. Proposed Approach
This section outlines our proposed approach: the details of our classification exercises (§4.1), the proposed ML pipeline (§4.2), and the heuristics we implement to avoid learning from accounts under the control of an attacker and to reduce bias when evaluating our approach (§4.3). The following sections describe our datasets (§5) and the evaluation results (§6).
4.1. Classification Exercise Composition
We organize our classification exercises (CEs) as outlined in Figure 2. During training, we attempt to fit a model () that learns which behavioral patterns during the training Data Window (training-DW111Where the context makes the notation unambiguous, we skip the prefix and use DW only for training-DW or testing-DW. Similarly for LW.) correlated to the account being labeled as suspicious later in the Label Window (LW). We introduce a Buffer Window (BW) between the DW and LW, to account for any lag (delay) in the suspicious account flagging pipeline used to generate the ground truth of suspicious accounts. The reason is that, in the absence of the BW, a lag in the pipeline will cause the fitted model to learn user behavioral patterns from accounts that are already under the control of an attacker. In §4.3
, we present our heuristics to estimate the width of the Buffer Window (BW).
During testing, the fitted model () obtained during training, is applied during the testing-DW to forecast the set of accounts that are likely to have suspicious behaviour (). The quality of those predictions is then evaluated against the ground truth of labeled suspicious accounts extracted from the testing-LW.
4.2. The Early Warning Pipeline
Our system is composed of a pipeline that can be easily integrated into existing systems (outlined in Figure 3). We note that our pipeline design stresses efficiency, scalability, and, ultimately, achieving a practical training runtime sometimes even to the detriment of the learned classifiers (e.g., using simple low-cost features as opposed to sophisticated feature extraction). With production data, similar in scale to what we have access to at LSOSP, our pipeline is designed to extract behavioral patterns and to train in reasonable time on log traces from hundreds of millions of accounts leading to hundreds of billions of log entries over the duration of each CE.
We developed our pipeline in Scala 2.11, employed SparkML for all our developed classifiers, and ran our CEs on Spark 2.0.2 (Zaharia et al., 2010).
Data Pre-Processing. We pre-process the datasets from which we extract the user behavioral patterns (e.g., login activity dataset) as well as the ground truth (e.g., accounts flagged as suspicious). Importantly, we also carry out a series of pruning operations in order to exclude accounts that may bias either learning or evaluation as discussed in §4.3. During this stage, for each account, we extract features at the day level and aggregate them for the intervals associated with the classification exercise. There is an inherent tradeoff here: extracting and computing a large number of features over a long duration of time could potentially include more behavioral information thereby increasing the prediction accuracy. However, this comes at the cost of longer run-time and might affect prediction timeliness. At LSOSP, we find that extracting only a relatively small set of low-cost features that are both simple and quick enough to compute is both sufficient and also more practical from a performance perspective in a production environment (details in §5).
Pre-processing Imbalanced Data. Typically at LSOSPs, suspicious accounts (the positive class) are a minority compared to the overall population. Naively training an ML classifier on such imbalanced data will typically result in a classifier that always predicts the dominant class (the negative class in our case) to achieve the highest accuracy (Provost and Fawcett, 2001). Approaches to mitigate this problem include simple preprocessing techniques such as undersampling the majority class or oversampling the minority class (He and Garcia, 2009), or Cost-Sensitive Learning (Lomax and Vadera, 2013) that attempts to minimize the cost of misclassifications by assigning asymmetrical costs during the training process. At LSOSP, given the scale of the data and our focus on building a practical pipeline with good balance between runtime and classification performance, we use undersampling during training (however, we test on the whole set of labeled data in the test set).
Second, during the hyperparameter optimization stage, model selection is carried out in order to find the best model (or set of parameters) for the classification task. This only needs to be done once during training (or periodically, with low frequency and offline, to learn new user behavioral patterns) and is not carried out during inference using the fitted model in production. We use a Random Forest (RF) classifier considering the good trade-off it offers between runtime and classification accuracy(Fernández-Delgado et al., 2014). We carry out the hyperparameter optimization on an independent dataset extracted from the available history and specifically reserved for this purpose (§6.1). The extracted model parameters are then fixed for all the subsequent CEs.
Model Fitting and Inference. Third, after data preprocessing and hyperparameter tuning, a ML model is fitted and later applied to make predictions on new data (i.e., inference). On the one hand, this data could be one for which there already exists labeled ground truth. In that case, the goal is to evaluate the performance of the developed classifier. On the other hand, this could be new data from production for which no ground truth exists (i.e., during the real-world deployment) and in this case, the goal is to put the classifier into practice to predict accounts likely to generate suspicious activity in the future based on their recent behavioral patterns.
Finally, we carry out model evaluation as the last stage in our pipeline. We obtain the confusion matrix comparing the ground truth with the resulting predictions and collect statistical measures of the classifier’s performance.
Our goal is to learn behaviour from legitimate accounts (i.e., that are not attacker-controlled: fake and compromised accounts — d d Training Interval ) and predict which legitimate accounts may later get compromised and become labeled as suspicious. To this end we use a number of heuristics. We also implement additional heuristics to increase the confidence in our evaluation results.
Heuristics to increase the chance that we capture only the behaviour of accounts under the control of legitimate users. During training, we attempt to exclude all accounts that are potentially under the control of an attacker. In practice, the set of accounts is unknown, even for historical data for which there is collected ground truth, as this set may include not-yet-detected fakes and compromised accounts. We take advantage of the fact that we have an extremely large dataset to carry out aggressive exclusions that reduce the chance that we capture behaviour from attacker-controlled accounts. We use three heuristics: First, we exclude any account flagged as suspicious during the training DW or at a later point of time within the Buffer Window (BW). By excluding these accounts, we reduce the likelihood that our classifier learns behavioral patterns stemming from detected compromised accounts. Second, to the same end, for the classification exercises where there is available data before the start of the training interval (§6.3), we exclude accounts flagged as suspicious before the start of training (as they are more likely to be compromised in the future). Finally, to eliminate fakes, one of our classification exercises (§6.3) attempts to eliminate all recently-created or dormant fakes by selecting for training only accounts that are older than two months old and have at least one month of activity (our assumption is that once fakes generate enough activity the LSOSP can detect them through existing techniques (Yang et al., 2011; Wang et al., 2013) as detecting fakes is easier than detecting compromised accounts (Egele et al., 2013)).
|Lag (in days)||Cumulative Percentage of Accounts (%)|
Heuristics to reduce bias during classifier evaluation. Our preliminary experiments suggest that user accounts that have been flagged as suspicious in the past are more likely to be flagged again in the future (a possible indication that their users are more vulnerable to attacks than the general user population). To provide a conservative (lower-bound) evaluation of the developed classifier’s performance, we exclude all accounts that have been previously labeled as suspicious during training (i.e., flagged at any point during the training-LW or before). Moreover, one of our classification exercises (§6.3), also excludes any accounts flagged as suspicious during the first month of the data collection. As a result, the developed classifier is evaluated on never seen before true positives.
Heuristics to size the buffer window (BW). It is expected that, at any LSOSP, detection of suspicious activity is not instantaneous, thus accounts may be under the control of an attacker for a while before they are flagged. We developed an experiment to estimate how aggressive is LSOSP’s suspicious activity flagging pipeline. For this experiment, we only rely on two types of events: flagging events for accounts marked as suspicious on day (extracted from set ) and login events for these accounts (extracted from set ). For this experiment we include only user accounts that have at least one login event and at least one flagging event within the period of time over which we run the experiment. We define the lag per flagged user as the number of days between the first time that account is flagged and the most recent previous login event.
We run this experiment to gain some intuition for the lag of the flagging pipeline over a period of 30 days. As shown in Table 2, 90% of accounts flagged within that period have a lag of at most one week and 98.6% have a lag of less than three weeks. As such, we decided on a 1-week buffer window (BW) for most of our CEs, yet we also experiment with a 3-week BW (§6.4).
Overall, we have access to 118 days (4 months or 16 weeks) worth of production data collected from September 1st, 2016 to the December 27th, 2016. Overall, these datasets are representative of any LSOSP with a global user base, an extensive set of offered online services, as well as the latest techniques to identify potentially compromised accounts.
5.1. Extracting Features
We have access to two datasets extracted and updated daily from production servers. The first is a dataset that includes features associated with all login events. Whenever a user logs-in to a service offered by LSOSP or has her session re-authenticated, a login event is recorded into this dataset with all relevant features that can be associated with the event at that time. We use this dataset to extract a minimal set of 13 basic and easy to compute features that reflect users’ behavioral patterns (summarized in Table 3) from login traces at a day-level granularity, and then aggregate them for each user account as a way of characterizing its behavioral pattern over the DW. It is important to note that we do not have access to any fine-grained account features such as account/user details. Importantly, we do not have access to any personally identifiable information. Moreover, given the diversity of the login methods as well as the services offered at LSOSP, the features extracted for each login event are not uniform and the set of features extracted for each user is sparse.
|# Login Attempts||Numeric|
|# Unique Login Sources (e.g., Web Login, Mobile Login, etc.)|
|# Unique Login Types (e.g., Password Login, Account Switch, etc.)|
|# Unique Login Statuses (e.g., Success, Session Extension, etc.)|
|# Unique Password Login Statuses (e.g., Success, Invalid Password, etc.)|
|# Unique Actions (e.g., Login/Logout, Device Authentication, etc.)|
|# Unique Login Geographical Locations|
|# Unique Login Geographical Location Statuses (e.g., Neutral Location, White-listed Location, etc.)|
|# Unique Login Autonomous Systems (ASNs)|
|# Unique Login User Agents (e.g., Browser, Mobile App, etc.)|
|# Successful Logins|
|# Unsuccessful Logins|
|User has a ’verified’ mobile number||2-Categorical|
5.2. Groundtruth: Suspicious Account Flagging
At LSOSP, a list of accounts flagged as suspicious is generated daily by combining information from various sources that include human content moderators, manual reports from internal teams, user reporting, in addition to automated systems employing heuristics (that include clustering techniques to identify anomalies, and regression models to identify spammers). We use this daily list of accounts flagged as suspicious as our ground truth: we extract labels for accounts indicating whether it has been flagged as suspicious or not during the training/testing intervals.
Accounts flagged as suspicious may include compromised accounts, as well as fake accounts. It is worth noting that, on the one hand, flagging is a result of undesired behavior exhibited by an account (e.g., sending spam), that is then subsequently detected either manually by moderators or automatically by some internal security system. On the other hand, this may be due to association or similarity with prior flagged suspicious accounts in terms of account behavioral characteristics (e.g., accounts logging in from a blacklisted autonomous system).
For this study, we had access to the daily list of accounts flagged as suspicious and a high-level description of the system. The detailed internals of the flagging pipeline were not available. As a consequence, we are neither able to distinguish between the different classes of suspicious accounts nor to identify the reason why a particular account had been flagged. We believe that, the lack of such fine-grained information poses limited threats to the validity of our findings: on the one side we have developed heuristics to exclude attacker-controlled accounts from training (see §4.3), and, on the other side, at this point our machine learning model aims to provide only predictive power (will an account be flagged as suspicious?) rather than explanatory power (why will the account be flagged?). We extend this discussion in §8.
6. Evaluation Results
The Objectives of our Classification Exercises. We present four of the classification exercises (CEs) carried out at LSOSP labeled , , , and in Table 4. The table outlines the Training and Testing intervals assigned to each CE and their respective Data Window (DW), Buffer Window (BW), and Label Window (LW). For each CE, we have a specific objective:
: evaluating the feasibility of our proposed pipeline, its applicability at LSOSP, and optimizing hyperparameters.
: testing the tuned model on new data to ensure that no overfitting occurred in .
: investigating how the performance of our classifier changes when excluding accounts previously flagged as suspicious (higher chance to be flagged again) or accounts that have little previous activity (lower chance to include fakes).
: evaluating the impact of more training data (longer data and label windows) and more aggressive exclusion of potentially not-yet-flagged attacker-controlled accounts (longer buffer window).
|DW||BW||LW||DW||LW||Extended LW (H = [7, 90] days)|
Performance Evaluation Metrics
|D||21||34||0.947||89.41%||H ¡ H||10.64%||57.42%||11.68%||48.96%||12.34%||48.13%|
|CE||Performance Evaluation Metrics|
Summary of Results.
Tables 5 and 6 summarize the results for all CEs carried out (their setup is outlined in Table 4). For conciseness, we focus here only on the most relevant metrics
222 Notation used in Tables 5 and 6 : AUC-Area Under Receiver Operating Characteristic Curve, BTR-%-tile better than a random classifier, PRE-Precision, REC-Recall, ACC-Accuracy, FPR-False Positive Rate. Values in bold represent the best result for that performance metric.
: AUC-Area Under Receiver Operating Characteristic Curve, BTR-%-tile better than a random classifier, PRE-Precision, REC-Recall, ACC-Accuracy, FPR-False Positive Rate. Values in bold represent the best result for that performance metric.we collected. The two tables highlight how several metrics are impacted by the selected operating threshold of the classifier as well as by the duration of the prediction horizon (presented as Test-LW and Extended-Test-LW in Table 4 and whose combined size in days is denoted as the prediction horizon: ). The tables present results for and and the values of , in separate columns. Note that the minimum and maximum values of vary, depending on the CE.
In summary, these results show:
High accuracy (ACC) 99.9% and low false positive rate (FPR) ¡0.01% for an operating threshold ,
Good evidence for the absence of overfiting ( in §6.2),
Good balance between precision (PRE) and recall (REC): 18.33% and 50.62% respectively, when forecasting with a Horizon H = 30 days and Operating Threshold T = 0.5 ( in §6.3),
A small improvement after excluding recent/no activity accounts (more likely to be fakes) and those flagged as suspicious before training (Comparing and ),
As the Horizon (H) increases, precision increases while recall stays roughly constant (We expand on this in §6.5),
More training data and a more aggressive exclusion of not-yet-flagged attacker-controlled accounts do not significantly impact classification performance ( in §6.4).
6.1. Pipeline Tuning ()
uses data from the first five weeks of our traces for hyperparameter optimization. After a grid search, we extracted the hyperparameters of the best performing RF Classifier and fixed these for all the other CEs.
Our classifier performs well even for this heavily imbalanced dataset. The AUC attained is 0.928 that is 85.61% better than the performance of a completely random classifier (with its AUC of 0.5). Note that at deployment, a LSOSP can tune the classifier to achieve the desired balance between precision and recall by adjusting the classifier’s operating thresholdas highlighted by comparing the corresponding PRE/REC results in Tables 5 and 6, and in Figure 5.
To pick only one datapoint from this experiment and highlight the performance of our classifier: more than one in four of the predicted accounts are actually later flagged to have suspicious activity within the next 90 days (PRE=24.99%), and our classifier uncovers almost one third of the accounts later flagged as suspicious (REC=31.02%) with a low false positive rate (FPR=0.42%).
6.2. Testing for Over-fitting ()
To test for overfitting, through data reuse due to the hyperparameter optimization in , we repeated the same CE on entirely new data, while using exactly the same experimental parameters. The model parameters identified during the hyperparameter optimization stage () are reused for the rest of our CEs as well. uses the data collected over the last eight weeks of our traces: five weeks of data are used for training/testing similar to ; this leaves three extra weeks of ground truth to explore the impact of a longer prediction horizon (in §6.5).
The classifier maintains the same levels of performance as in (and these general trends are consistent across all CEs, none reusing data). This suggests no overfitting for , and that the performance of the classifier is generally stable with different parameters (training/testing intervals, time windows, etc.).
6.3. Excluding New/Pre-Flagged Accounts ()
Next, we investigate how the performance of our proposed classifier changes when we aggressively exclude accounts previously flagged as suspicious and accounts that have little past activity to increase the chance that we learn behaviours from accounts controlled only by legitimate users as explained in §4.3. excludes accounts that do not have any login activity within the one month pre-processing interval at the start of the experiment and any accounts that are flagged as suspicious within that same one month interval. With this extra pre-processing step, while there was a small improvement in performance, we did not observe any major impact. This is an indication that fakes are unlikely to be a major problem at LSOSP.
6.4. Expanding the Training Data and BW ()
Finally, as an extra experiment, we test our proposed classifier’s performance when we increase the volume of training data, and aggressively increase the duration of the Buffer Window. To this end, we decided to employ three week windows (for DW, BW, LW during training/testing respectively) as opposed to one week. For , the proposed classifier attained the highest AUC among all our CEs 0.947 (89.41% better than random). The classifier also achieved a PRE and REC of 10.64% and 57.42% respectively at = 21 days and = 0.5. At , precision reached 28.47%, recall 11.3% and the FPR an impressive 0.04% (at days).
6.5. The Impact of the Prediction Horizon
Our classifier’s precision markedly improves with the depth of the prediction horizon (Figure 4). Some of the accounts that are false positives for a small precision window then become true positives as the prediction window increases. We speculate that those accounts are owned by users that do not have the ability or the interest to fend off social engineering attacks, and thus a longer horizon increases the chance that they fall victim to an attack, and then generate suspicious activity which gets them flagged during the longer prediction horizon.
We note that although precision increases with the prediction horizon, recall only stays stable (Figure 4). We hypothesize that the reason is that, during training, our classifier learns behavioral patterns from a discrete set of attacks only. As new attacks are developed, new categories of users become compromised and then their behavior is labeled as suspicious (yet our classifier can not label them as it has not been exposed to this data).
6.6. Feature Importance
Since we chose a Random Forest binary classifier, we can extract information regarding the relative importance of the features selected. Forthe top four most important features are: the number of distinct geographic locations (20.07%), the number of successful login events (15.94%), the number of distinct ASNs (15.14%), and the number of unsuccessful logins (8.697%). While the relative importance of the features does not give an intuition as to why those features are suitable for the early identification of suspicious accounts in particular, they indicate that the same set of features that are typically used to flag suspicious accounts after the fact can also be used as early warning signs. We omit the relative feature importances for the rest of the CEs as they do not differ significantly from those presented for .
7. Related Work
Statistical methods (including ML) have achieved widespread adoption within LSOSPs not only to provide rich business features (e.g., product recommendations) but also for cybersecurity purposes. For instance, such approaches have been used for detecting compromised accounts, fake accounts, spam, and phishing. None of these approaches has focused on evaluating the feasibility of predicting which legitimate accounts are more vulnerable and likely to be compromised in the future (our long term aim). In this section each paragraph focuses on a specific area, surveys some related approaches, and outlines the statistical methods and features used.
Compromised Accounts. Egele et al. (Egele et al., 2013)
combined statistical modeling and anomaly detection techniques in order to detect compromised accounts on Online Social Networks (OSNs). Their approach was based on identifying sudden changes in user behavioral patterns in addition to observing whether those changes are common to a large group of accounts therefore potentially a result of a malicious campaign. Thomas et al.(Thomas et al., 2014)
employed clustering and classification (via logistic regression) in order to detect account hijacking on Twitter. Their approach was based on the observation that legitimate account owners frequently delete tweets posted via their accounts after recognizing the compromise. Those deletions are thus used as a feature to retroactively identify hijacked accounts and clustering is then used to detect similarly compromised accounts. Zhang et al.(Zhang et al., 2012) made use of a ML-based approach to automatically detect compromised accounts at a large academic institution. Their approach employed logistic regression on features extracted from web login and VPN authentication logs.
Fake Accounts. Yang et al. (Yang et al., 2011)
proposed approaches to identify Sybil (i.e., fake) accounts on the Renren OSN. One approach was based on ML and employed Support Vector Machines (SVMs) on basic user-level features (e.g., the frequency of friendship requests and the fraction of accepted incoming friendship requests). Wang el al.(Wang et al., 2013) instead used clustering to identify fake accounts on Renren. Their approach clustered users with similar behavior based on features extracted from their clickstreams (e.g., the average session length, the average number of clicks per session).
Spam. Benevenuto et al. (Benevenuto et al., 2010)
developed an ML-based approach to identify spammers on Twitter. Their approach was based on a non-linear Support Vector Machine (SVM) classifer with the Radial Basis Function (RBF) kernel and made use of both content- and user-level features (e.g., the age of the user account, the number of followers, the average number of URLs per tweet). Castillo et al.(Castillo et al., 2007)
developed a ML-based approach using cost-sensitive decision trees to detect spam pages on the Web. Their approach makes use of content- and link-based features extracted from the Web graph (e.g., the ratio between the average degree of a page and that of its neighbours, number of words in the page/title). In the context of email spam, Blanzieri et al.(Blanzieri and Bryl, 2008) carried out a survey of many of the approaches to detect email spam proposed in the literature based on statistical methods (including ML).
Phishing. Ludl et al. (Ludl et al., 2007) developed a ML-based approach to identify phishing web pages. Their approach was based on the C4.5 decision tree algorithm and made use of features extracted from a page’s content as well as its URL (e.g., the number of forms/fields tags on the page, whether the page is served over HTTPS, whether the URL’s domain appears on a Google whitelist). Whittaker et al. (Whittaker et al., 2010) developed a scalable ML-based approach to detect phishing websites that is used to maintain Google’s phishing blacklist automatically. Their approach is based on a Random Forest (RF) classifier and employed both content-, host- and URL-based features (e.g., PageRank, the host geolocation/ASN).
8. Summary and Discussion
Summary. We explore the feasibility of predicting the legitimate (i.e., not attacker-controlled) accounts more likely to generate suspicious activity in the future, a likely indication that they have fallen for a mass-scale social engineering attack. To this end, we propose an early warning system that employs supervised machine learning to identify the accounts whose behavioral patterns indicate that they are similar to other legitimate accounts that have been eventually labeled as suspicious in the past. We implement this early warning system at a Large-Scale Online Service Platform (LSOSP) and evaluate it on four months of real-world production data covering hundreds of millions of users. Our evaluation demonstrates that our approach is not only feasible but that it also offers promising classification performance based on which further defense mechanisms can be developed as we discuss below.
Discussion. We continue by exploring several interrelated topics:
How can a defense system use information about which users are likely to be compromised in the future, and thus more ‘vulnerable’, to enhance its robustness? User vulnerability can be thought of as an additional ‘signal’ that can inform a number of defense mechanisms. For example, it can: (i) serve as an indicator to prioritize the allocation of limited defense resources (e.g., use of human analyst time (Ho et al., 2017), or compute-intensive filters (Stein et al., 2011)), (ii) support differentiated defenses that take into account user vulnerability (e.g., additional CAPTCHAs on login attempts into vulnerable accounts (Ahn et al., 2003), or imposing rate limits on the outbound messages of vulnerable users to slow-down the spread of multi-stage — and potentially epidemic — phishing attacks), (iii) enable faster remediation of compromised accounts (e.g., by enabling more efficient inspection campaigns that focus on the accounts of vulnerable users instead of the entire user population (Liu et al., 2011)), (iv) facilitate the detection of the origin of an attack (as, in effect, the differentiated response between vulnerable and robust users to similar interactions initiated by the same source can be used as a weak yet effective signal (Boshmaf et al., 2015)); and (v) even facilitate the detection of new attacks (as, in effect, the differentiated response between vulnerable and robust — yet otherwise similar — user groups to the same ‘stimuli’ is an indication of an attack). We explore the use of such information for several cybersecurity domains in (Halawa et al., 2016).
Is the prediction quality good enough? Even if defense mechanisms based on vulnerability predictions can be imagined, an immediate subsequent question is whether the classification quality implied by our results (e.g., , , and ) is good enough to support such mechanisms. While we have not yet extensively studied such mechanisms, our intuition is that this signal, although noisy, is useful. Consider, for example, defense resource prioritization - it is evident that a heuristic that uses this signal, as weak as it is, to prioritize resources is better than randomly allocating resources (when capacity is constrained). Others have also experimented with a heuristic that harnesses the different response to similar requests between vulnerable and robust users (Boshmaf et al., 2015) to infer attack source(s) (although in the context of a social network). In this case, even a vulnerability predictor significantly weaker than the one we have obtained here has proven useful, leading to a technique that improves over the state-of-the-art. While the above indicates that even low quality predictions can still be used to improve defenses, we believe that the prediction quality threshold above which these mechanisms become valuable is context specific and we are studying this issue in a related project.
Why do we focus on minimizing the false positive rate (FPR)? What if the focus were on maximizing recall instead? We envisage that the predictions made by our early warning system will be used to better target existing defenses. As many of these defenses are not lightweight and may lead to increased friction for users (e.g., rate-limiting outbound emails of vulnerable users to prevent an attack outbreak, delaying incoming suspicious email addressed to vulnerable users to give enough time for more robust users to report mass-phishing emails), or allocating costly resources (e.g., human analyst time), the resulting cost of false positives is high: thus, we have focused on minimizing the FPR at the expense of lower recall. Other situations, however, offer a different cost/benefit balance between the false positive rate and recall. For these situations, our classifier can be tuned by either using lower threshold values ( as highlighted by the ROC across all CEs available in Figure 5), or by specifically optimizing for recall.
What are the threats to validity? Our study indicates that it is feasible to harness account behaviour to predict the accounts that are more likely to generate suspicious traffic in the future (an indicator that they may be compromised). There are two main concerns regarding the validity of our conclusions. The first one relates to the quality of the ground truth we use — this is a threat to validity common to any study using a methodology based on machine learning.
The second one relates to the accuracy of the heuristics used to avoid learning behavioural patterns from accounts that may be controlled by an attacker (i.e., compromised or fake accounts) detailed in §4.3. We prune: (i) all accounts flagged for suspicious activity in the data window (DW) - as they are highly likely to be compromised, (ii) all accounts flagged as suspicious in the buffer window (BW) - as these accounts are more likely to have been compromised but not yet flagged as such (thus contaminating our training data), (iii) all accounts which have been labeled as suspicious at any point before the training data window - as our experience shows that these accounts are more likely to be compromised again (in experiment ); and, finally (iv) new / low activity accounts (for which the system may not have enough history to determine whether the accounts are fakes). We run various experiments that compare the impact of these heuristics - even the most conservative experiments appear to support our conclusions.
It is worth discussing, however, the alternative: assume that our heuristics fail to eliminate a large portion of attacker controlled accounts. Even in this case, we believe that our pipeline provides value through forecasting. Assume, for example, that these accounts are predominantly (dormant) fakes that mimic legitimate user behaviour. In this case, our pipeline predicts the fakes that will likely be ‘awakened’ by the attacker and start generating suspicious activity. Assume, on the other side, that these are compromised accounts not yet exploited by the attacker, then our pipeline predicts which compromised accounts are under the control of the attacker but not yet exploited. In this case as well the forecasting pipeline can give an early sign of the attacker resources and strategy.
A final concern may be that our proposed approach may be learning the heuristics by which some accounts are flagged as suspicious in the ground truth (other accounts in the ground truth are flagged by humans). We believe that this represents a limited threat due to the way we formulated our forecasting problem (i.e., making future predictions) as opposed to the underlying heuristics which operate in real-time by design.
Why are the presented results positioned as lower-bounds? Our goal was to test the feasibility of our proposed approach within constraints related to access to data and computational resources. We believe that classification performance can be improved by: (i) access to data beyond the login traces (e.g., email traces, or browsing patterns) which are likely to contain additional information that characterizes user behaviour; and (ii)
additional computational resources - as our exploration was constrained by run-time feasibility in terms of data preprocessing (e.g., to extract complex aggregate features), model optimization, or sophisticated learning methods (e.g., deep neural networks) even though we used tens of computational nodes.
- Ahn et al. (2003) Luis Von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. 2003. CAPTCHA: Using Hard AI Problems for Security. In Proceedings of the 22Nd International Conference on Theory and Applications of Cryptographic Techniques (EUROCRYPT’03). Springer-Verlag, Berlin, Heidelberg, 294–311. http://dl.acm.org/citation.cfm?id=1766171.1766196
- Benevenuto et al. (2010) Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida. 2010. Detecting spammers on twitter. In Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), Vol. 6. 12.
- Blanzieri and Bryl (2008) Enrico Blanzieri and Anton Bryl. 2008. A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review 29, 1 (2008), 63–92. https://doi.org/10.1007/s10462-009-9109-6
- Boshmaf et al. (2015) Yazan Boshmaf, Dionysios Logothetis, Georgos Siganos, Jorge Lería, José Lorenzo, Matei Ripeanu, and Konstantin Beznosov. 2015. Integro: Leveraging Victim Prediction for Robust Fake Account Detection in OSNs. In 22nd Annual Network and Distributed System Security Symposium (NDSS). San Diego, California, USA, 1–15. http://www.internetsociety.org/doc/integro-leveraging-victim-prediction-robust-fake-account-detection-osns
- Boshmaf et al. (2011) Yazan Boshmaf, Ildar Muslukhov, Konstantin Beznosov, and Matei Ripeanu. 2011. The socialbot network: when bots socialize for fame and money. In Proceedings of the 27th Annual Computer Security Applications Conference (ACSAC ’11). ACM, New York, NY, USA, 93–102.
- Castillo et al. (2007) Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. 2007. Know Your Neighbors: Web Spam Detection Using the Web Topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’07). ACM, New York, NY, USA, 423–430. https://doi.org/10.1145/1277741.1277814
- Egele et al. (2013) Manuel Egele, Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. 2013. COMPA: Detecting Compromised Accounts on Social Networks. In Proceedings of the Network & Distributed System Security Symposium (NDSS ’13). ISOC.
- Fernández-Delgado et al. (2014) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? J. Mach. Learn. Res. 15, 1 (Jan. 2014), 3133–3181. http://dl.acm.org/citation.cfm?id=2627435.2697065
- Freeman et al. (2016) David Mandell Freeman, Sakshi Jain, Markus Dürmuth, Battista Biggio, and Giorgio Giacinto. 2016. Who Are You? A Statistical Approach to Measuring User Authenticity. In Proceedings of the 23rd Annual Network and Distributed System Security Symposium (NDSS Symposium’16). ISOC, San Diego, CA, USA. http://theory.stanford.edu/~dfreeman/papers/ato-model.pdf
- Halawa et al. (2016) Hassan Halawa, Konstantin Beznosov, Yazan Boshmaf, Baris Coskun, Matei Ripeanu, and Elizeu Santos-Neto. 2016. Harvesting the Low-hanging Fruits: Defending Against Automated Large-scale Cyber-intrusions by Focusing on the Vulnerable Population. In Proceedings of the 2016 New Security Paradigms Workshop (NSPW ’16). ACM, New York, NY, USA, 11–22. https://doi.org/10.1145/3011883.3011885
- He and Garcia (2009) Haibo He and Edwardo A. Garcia. 2009. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21, 9 (Sept 2009), 1263–1284. https://doi.org/10.1109/TKDE.2008.239
- Ho et al. (2017) Grant Ho, Aashish Sharma Mobin Javed, Vern Paxson, and David Wagner. 2017. Detecting Credential Spearphishing Attacks in Enterprise Settings. In Proceedings of the 26rd USENIX Security Symposium (USENIX Security ’17). 469–485.
- Jagatic et al. (2007) T. N. Jagatic, N. A. Johnson, M. Jakobsson, and F. Menczer. 2007. Social phishing. Commun. ACM 50, 10 (2007), 94–100.
- Liu et al. (2011) Gang Liu, Guang Xiang, Bryan A. Pendleton, Jason I. Hong, and Wenyin Liu. 2011. Smartening the Crowds: Computational Techniques for Improving Human Verification to Fight Phishing Scams. In Proceedings of the Seventh Symposium on Usable Privacy and Security (SOUPS ’11). ACM, New York, NY, USA, Article 8, 13 pages. https://doi.org/10.1145/2078827.2078838
- Liu et al. (2015) Yang Liu, Armin Sarabi, Jing Zhang, Parinaz Naghizadeh, Manish Karir, Michael Bailey, and Mingyan Liu. 2015. Cloudy with a chance of breach: Forecasting cyber security incidents. In Proceedings of the 24th USENIX Security Symposium (USENIX Security ’15). 1009–1024.
Lomax and Vadera (2013)
Susan Lomax and Sunil
A Survey of Cost-sensitive Decision Tree Induction Algorithms.ACM Comput. Surv. 45, 2, Article 16 (March 2013), 35 pages. https://doi.org/10.1145/2431211.2431215
- Ludl et al. (2007) Christian Ludl, Sean McAllister, Engin Kirda, and Christopher Kruegel. 2007. On the Effectiveness of Techniques to Detect Phishing Sites. Springer Berlin Heidelberg, Berlin, Heidelberg, 20–39. https://doi.org/10.1007/978-3-540-73614-1_2
- Moore et al. (2009) Tyler Moore, Richard Clayton, and Ross Anderson. 2009. The Economics of Online Crime. Journal of Economic Perspectives 23, 3 (September 2009), 3–20. https://doi.org/10.1257/jep.23.3.3
- Provost and Fawcett (2001) Foster Provost and Tom Fawcett. 2001. Robust Classification for Imprecise Environments. Mach. Learn. 42, 3 (March 2001), 203–231. https://doi.org/10.1023/A:1007601015854
- Shon and Moon (2007) Taeshik Shon and Jongsub Moon. 2007. A hybrid machine learning approach to network anomaly detection. Information Sciences 177, 18 (2007), 3799–3821.
- Soska and Christin (2014) Kyle Soska and Nicolas Christin. 2014. Automatically detecting vulnerable websites before they turn malicious. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security ’14). 625–640.
- Stein et al. (2011) Tao Stein, Erdong Chen, and Karan Mangla. 2011. Facebook immune system. In Proceedings of the 4th Workshop on Social Network Systems (SNS ’11). ACM, New York, NY, USA, Article 8, 8 pages. https://doi.org/10.1145/1989656.1989664
- Thomas et al. (2014) Kurt Thomas, Frank Li, Chris Grier, and Vern Paxson. 2014. Consequences of Connectivity: Characterizing Account Hijacking on Twitter. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS ’14). ACM, New York, NY, USA, 489–500. https://doi.org/10.1145/2660267.2660282
- Wang et al. (2013) Gang Wang, Tristan Konolige, Christo Wilson, Xiao Wang, Haitao Zheng, and Ben Y. Zhao. 2013. You Are How You Click: Clickstream Analysis for Sybil Detection. In Proceedings of the 22Nd USENIX Conference on Security (SEC’13). USENIX Association, Berkeley, CA, USA, 241–256. http://dl.acm.org/citation.cfm?id=2534766.2534788
- Whittaker et al. (2010) Colin Whittaker, Brian Ryner, and Marria Nazif. 2010. Large-Scale Automatic Classification of Phishing Pages.. In Proceedings of the 17th Annual Network and Distributed System Security Symposium (NDSS Symposium’10). San Diego, CA, USA.
- Yan et al. (2011) Guanhua Yan, Guanling Chen, Stephan Eidenbenz, and Nan Li. 2011. Malware propagation in online social networks: nature, dynamics, and defense implications. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security. ACM, 196–206.
- Yang et al. (2011) Zhi Yang, Christo Wilson, Xiao Wang, Tingting Gao, Ben Y. Zhao, and Yafei Dai. 2011. Uncovering Social Network Sybils in the Wild. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC ’11). ACM, New York, NY, USA, 259–268. https://doi.org/10.1145/2068816.2068841
- Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud’10). USENIX Association, Berkeley, CA, USA, 10–10. http://dl.acm.org/citation.cfm?id=1863103.1863113
- Zhang et al. (2012) Jing Zhang, R. Berthier, W. Rhee, M. Bailey, P. Pal, F. Jahanian, and W. H. Sanders. 2012. Safeguarding academic accounts and resources with the University Credential Abuse Auditing System. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012). 1–8. https://doi.org/10.1109/DSN.2012.6263961