With the internet becoming ubiquitous, cyber attacks have become increasingly prevalent. They have had a significant negative impact on major sectors such as healthcare, finance, and manufacturing. According to a study by the Internet Society, over 5 billion data records were exposed in 2018, and the total cost of these attacks was estimated to be over $45 billion. There are several types of attacks, such as phishing, wiretapping, denial of service, and ransomware. In particular, with the advent of cryptocurrencies, which are used for making payments to attackers, ransomware attacks have increased exponentially over the last decade. Ransomware encrypts and locks the victims’ files until they pay money to the attackers. The financial damage due to ransomware attacks is estimated to increase from $8 billion in 2018 to $20 billion in 2021 (isoc-report). Several cities and organizations have been impacted by these attacks (nytimes-report), affecting millions of people in the process.
Given the significance of these attacks, it’s critical to understand their impact and scale. Most of the information available today comes from manually curated public repositories (kharraz2015cutting) or from cyber security vendors (mcafee-report). However, these sources suffer from fragmentation and delays, so it’s important to explore alternative data sources and methods to improve our understanding of these attacks. Web search query logs offer a unique way to do population-scale analysis (jhaver2019measuring; bansal2019usage). For instance, Paparrizos et al. (paparrizos2016detecting) and Xu et al. (xu2011predicting) have analyzed health-related search queries to predict trends for various diseases and epidemics. Chancellor et al. (chancellor2018measuring) have shown that macro-economic factors like employment demand can be characterized using query logs. In the security domain, Canali et al. (canali2014effectiveness) used the browsing history of users to predict the risk of visiting malicious websites.
In this work, we present the first study to analyze cyber attacks, specifically ransomware, using query logs from Bing, a major web search engine. We mine ransomware-related queries from anonymized query logs and use machine learning models to extract queries where users are seeking support for ransomware attacks. This step is critical because we want to analyze queries and sessions where the users were likely attacked, rather than those where users were only looking for information about these attacks. Next, we perform a feature correlation analysis to understand whether search behavior and user attributes are correlated with attacks. We also report on temporal and geographic trends for users who were seeking support for ransomware attacks. Lastly, we present a case study on the Nemty ransomware (Nemty) and show that query log analysis alone lets us learn about the origin and the effectiveness of the attack.
We perform our analysis using the anonymized query logs from Bing. The analysis is conducted over a four-month time span between July 1st, 2019 and October 31st, 2019. As web search patterns tend to vary significantly based on several factors, we focus this study on queries from the US region with the English locale. However, the methodology is generic and can be extended to other regions and locales.
Below, we define some key terms that we use throughout the paper:
Ransomware Queries - Queries related to ransomware that contain the keyword ‘ransomware’ in the query or the clicked URL(s).
Support Queries - Ransomware queries which indicate that the user is trying to seek solutions for an attack. Sample queries: ‘how to recover encrypted files’, ‘.besub ransomware decryption software’.
Non-support Queries - Ransomware queries where the intent is not to find support or a solution for an attack but to seek general information, facts, etc. Sample queries: ‘top ransomware attacks’, ‘20 Texas cities attacked with ransomware’.
Attacked Users - Users who searched for at least one support query for ransomware attacks.
Safe Users - Users who did not search for any support queries for attacks. We have randomly sampled one million safe users for the study.
Limitations: Please note that since the query logs are anonymized, we lack ground truth about individual users to validate our observations. However, in Section 5, we show that the insights from this study are consistent with public information about the attacks.
2.2. Manual Annotation
Owing to the large volume of query logs, manually labeling each query would be a mammoth task. Instead, we manually label 1000 queries and then train a machine learning model to find the support queries and the attacked users. First, four annotators individually label a random sample of ransomware queries as either support queries or non-support queries. We then calculate the inter-annotator agreement score using Fleiss’ kappa (fleiss1971measuring). With the resulting score being , translating to almost perfect agreement, each of the four annotators was asked to label a disjoint set of samples. Including the initial set of samples, a labeled dataset of samples was created with support queries being % of the data.
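The agreement computation described above can be illustrated with a minimal sketch of Fleiss' kappa. The function name `fleiss_kappa` and the toy count matrices are hypothetical and are not taken from the study; each row holds, for one query, how many of the four annotators chose each of the two labels (support, non-support).

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a matrix where counts[i][j] is the number of
    annotators who assigned item i to category j (hypothetical helper)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two annotators")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return 1.0  # every rating is in one category; agreement is trivial
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data: columns are (support, non-support) votes from four annotators.
unanimous = [[4, 0], [0, 4], [4, 0], [0, 4]]
mixed = [[4, 0], [3, 1], [0, 4], [1, 3]]
print(fleiss_kappa(unanimous), fleiss_kappa(mixed))
```

On the commonly used Landis and Koch scale, values above 0.8 are read as almost perfect agreement, which is the interpretation the text refers to.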
2.3. Support Query Classification
The labeled data (see Section 2.2) is then processed before we train a binary classification model. We tokenize the query string and the clicked URLs and compute the word embeddings of tokens that are not stopwords using a pretrained Word2Vec model (word2vec). The individual token embeddings are then aggregated together resulting in a dimension feature vector. Several classification models are trained on the data and the five-fold cross validation scores are reported in Table 1.
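As a rough illustration of the feature construction described above, the sketch below tokenizes a query and its clicked URLs, drops stopwords, and averages the remaining token embeddings. The text does not state how the embeddings are aggregated, so the mean is an assumption; the stopword list, the tiny two-dimensional embedding table, and the function names are likewise hypothetical stand-ins for a pretrained Word2Vec model.

```python
import re
from typing import Dict, List, Sequence

# Hypothetical toy stopword list; a real pipeline would use a standard one.
STOPWORDS = {"how", "to", "the", "a", "of", "for", "www", "com", "https"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_vector(
    query: str,
    clicked_urls: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the embeddings of non-stopword tokens (assumed aggregation)."""
    tokens = tokenize(query)
    for url in clicked_urls:
        tokens.extend(tokenize(url))
    vectors = [
        embeddings[t] for t in tokens if t not in STOPWORDS and t in embeddings
    ]
    if not vectors:
        return [0.0] * dim  # no known tokens: fall back to a zero vector
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy embedding table standing in for a pretrained Word2Vec model.
toy_embeddings = {
    "recover": [1.0, 0.0],
    "encrypted": [0.0, 1.0],
    "files": [1.0, 1.0],
}
print(query_vector("how to recover encrypted files", [], toy_embeddings, 2))
```

The resulting fixed-length vectors are what a binary classifier would be trained and cross-validated on.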
We observe that LinearSVC is the best performing model with the highest five-fold cross validation accuracy of % and an F1 score of . For the ransomware queries that were found in the four-month duration, the trained LinearSVC model was used to derive the inference labels. A total of unique users were identified as attacked users which corresponds to % of the total users that searched for ransomware queries. The resulting dataset, which is a union of all the queries searched for by the attacked users and the safe users, comprises queries out of which queries belong to attacked users.
3. User Behavior Analysis
The data collected in the previous section is analysed to identify the behavioral differences between attacked users and safe users. To this end, we identify different features and group them into categories based on the type of behaviour they indicate. The list of categories and the corresponding features is as follows:
Volume of search - number of queries, number of adult queries, dwell time, clicks, sat clicks (clicks where the dwell time is more than s (fox2005evaluating)).
Diversity in searches - unique URL domains.
Time of search - morning (AM - PM), evening (PM - AM) or night (AM - AM)
Day of the week - weekday (Monday to Friday) or weekend (Saturday and Sunday).
Device used - device type, operating system and browser type.
Along with the total counts, we normalize the features at a session level as well as the user level. The feature values are computed for all users in the dataset. We then analyse the differences in distribution of feature values for all attacked users and safe users. Table 2 summarises the percentage difference in mean values of the feature distributions of attacked users and safe users where the feature values are aggregated at a session level. Note that only the features where the percentage difference was higher than % are shown in the table.
| Feature | Difference in mean (%) |
|---|---|
| Total Number Of Queries | 192.16 |
| Total Number Of Adult Queries | 191.91 |
| Total Number Of Clicks | 193.25 |
| Total Number Of Unique URL Domains | 193.22 |
| Total Number Of Sat Clicks | 193.49 |
| Total Dwell Time | 193.55 |
| Total Number Of Requests At Morning | 184.22 |
| Total Number Of Requests At Evening | 196.72 |
| Total Number Of Requests At Night | 188.81 |
| Total Number Of Requests On Weekday | 189.21 |
| Total Number Of Requests On Weekend | 189.13 |
| Mean Total Number Of Queries | 107.84 |
| Mean Total Number Of Adult Queries | 103.21 |
| Mean Total Number Of Clicks | 110.38 |
| Mean Total Number Of Unique URL Domains | 110.64 |
| Mean Total Number Of Sat Clicks | 127.80 |
| Mean Total Dwell Time | 130.40 |
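For readers who want to reproduce this kind of comparison on their own data, the sketch below computes a percentage difference in means between two groups. The text does not give the exact formula, so using the safe-user mean as the baseline is an assumption (a symmetric definition that divides by the average of the two means is another possibility). The function name and the toy values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def pct_difference_in_means(
    attacked: Sequence[float], safe: Sequence[float]
) -> float:
    """Percentage by which the attacked-user mean differs from the
    safe-user mean. ASSUMPTION: safe users are the baseline."""
    safe_mean = mean(safe)
    if safe_mean == 0:
        raise ValueError("safe-user mean is zero; the ratio is undefined")
    return (mean(attacked) - safe_mean) / safe_mean * 100.0


# Toy per-session query counts for each group (not data from the study).
attacked_queries = [30, 30]
safe_queries = [10, 10]
print(pct_difference_in_means(attacked_queries, safe_queries))
```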
Following the feature comparison, a feature correlation analysis is carried out using Spearman’s correlation coefficient (spearmancorrelation), as it is able to capture monotonic relationships between variables without assuming that the data are normally distributed. The values of the coefficient range from -1 to 1, denoting negative and positive correlations respectively. Once the coefficients are computed, the confidence of the results obtained is tested via a standard significance test. The correlation value of a feature is considered statistically significant if the significance level (or p-value) is less than indicating a confidence level of %. Figure 1 summarizes the set of features which satisfy this threshold condition. It is evident from the coefficient values that there is very weak or no correlation between the variables and the likelihood of being attacked by ransomware.
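To make the rank-based nature of the coefficient concrete, here is a minimal pure-Python sketch: Spearman's coefficient is the Pearson correlation of the ranks, with tied values receiving their average rank. The function names and toy data are hypothetical, and the significance test mentioned in the text is not included.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation coefficient."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: a feature value per user against a 0/1 attacked label.
feature = [3, 10, 7, 25, 40, 12]
attacked = [0, 0, 0, 1, 1, 0]
print(round(spearman(feature, attacked), 3))
```

Because only ranks enter the computation, any monotonic transformation of a feature (for example, taking logarithms of query counts) leaves the coefficient unchanged.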
An interesting observation is that attacked users generally had a much higher search volume than safe users, which suggests that the more a user searches the web, the more likely they are to be attacked. There was also a statistically significant positive correlation with the percentage of queries searched at night time and a negative correlation with the percentage of queries searched in the morning, suggesting that users are more likely to get attacked at night time. Another interesting behavior was that being attacked correlated positively with the number of adult queries.
4. Trend Analysis
4.1. Hourly Trends
We analyzed how the behaviour of attacked users seeking solutions to ransomware attacks changes over the day by plotting hourly trends from our dataset. Figure 2 shows that users were searching for solutions mostly during non-working hours (outside of the 9AM - 5PM window). A plausible explanation is that users who are determined to mitigate ransomware attacks on their own, by leveraging web search, set aside focused time outside of their regular working hours to find solutions. However, it could also be that the regular web search volume is very high during those hours. To better understand the trend, Figure 3 plots the normalized distribution of the ratio of support queries and how it varies across the hours of the day. This bolsters our earlier finding (i.e., users spend more time searching for solutions outside of working hours) and highlights that much of the search activity to find solutions to ransomware attacks happens between 6PM and 11PM, the time window in which users usually set aside focused time for their non-work problems.
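One way to control for overall search volume, as described above, is to divide the number of support queries in each hour by the total number of queries in that hour and then rescale the ratios so they sum to one. The exact normalization behind Figure 3 is not specified in the text, so this definition, the function name, and the toy records are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def normalized_support_ratio_by_hour(
    records: Iterable[Tuple[int, bool]]
) -> Dict[int, float]:
    """records holds (hour_of_day, is_support_query) pairs.
    Returns, per hour, the support-query ratio rescaled to sum to 1."""
    total: Counter = Counter()
    support: Counter = Counter()
    for hour, is_support in records:
        total[hour] += 1
        if is_support:
            support[hour] += 1
    ratios = {hour: support[hour] / total[hour] for hour in total}
    ratio_sum = sum(ratios.values())
    if ratio_sum == 0:
        return {hour: 0.0 for hour in ratios}
    return {hour: r / ratio_sum for hour, r in ratios.items()}


# Toy log: one of four queries at 9AM is a support query, one of two at 8PM.
toy_records = [(9, False), (9, False), (9, True), (9, False),
               (20, True), (20, False)]
print(normalized_support_ratio_by_hour(toy_records))
```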
4.2. Geographical Trends
In this study, we focused on understanding how the search trends vary across different states in the US. In Figure 4 (left), we plot a heat map of how the volume of support queries varies across states. Noticeably, large states (in size, population, or internet penetration), such as California, Texas, and New York, are where much of the activity is seen. However, this could also be because the total search volume in these states is generally higher. To better understand which states record higher volumes of ransomware queries, we normalized the data by calculating the ratio of support queries to non-support queries in each state. This yielded some interesting insights, as seen in Figure 4 (right): states like North Dakota, Arkansas, and Oklahoma have a high ratio of support queries even though their overall search volume is low (compared to states like California or Texas). This is consistent with the massive ransomware attacks on various schools and public offices in states like North Dakota and Arkansas in 2019 (ND-Ransomware; AR-Ransomware), which could have led users in these states to record a higher normalized support query ratio.
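The state-level normalization can be sketched as a ratio of support to non-support queries per state, ranked from highest to lowest. The function name and toy records are hypothetical, and skipping states with no non-support queries is a choice made here to avoid division by zero, not something the text specifies.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def support_ratio_by_state(
    records: Iterable[Tuple[str, bool]]
) -> List[Tuple[str, float]]:
    """records holds (state, is_support_query) pairs. Returns states sorted
    by support / non-support ratio, highest first."""
    support: Dict[str, int] = defaultdict(int)
    non_support: Dict[str, int] = defaultdict(int)
    for state, is_support in records:
        if is_support:
            support[state] += 1
        else:
            non_support[state] += 1
    ratios = {}
    for state in set(support) | set(non_support):
        if non_support[state] > 0:  # skip states where the ratio is undefined
            ratios[state] = support[state] / non_support[state]
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Toy records: CA has more queries overall, ND has a higher support ratio.
toy = ([("CA", True)] + [("CA", False)] * 4
       + [("ND", True)] * 2 + [("ND", False)] * 2)
print(support_ratio_by_state(toy))
```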
5. Case Study
We present a case study to see if we can learn insights about specific attacks using query log analysis. We looked at all recent (second half of 2019) ransomware attacks with significant impact listed by the NJ Cybersecurity and Communications Integration Cell (NJCCIC) (https://www.cyber.nj.gov/threat-profiles/ransomware-variants/). For this study, we focus on the Nemty (Nemty) ransomware; however, our technique generalizes to any attack.
Nemty is a ransomware that infects Windows OS users, encrypts their files, searches for and deletes any shadow copies of these files, and finally asks victims to pay a ransom to restore their data. Nemty started affecting users at the end of August 2019 and spread worldwide through distribution campaigns during September, October and November 2019, as seen in the timeline of Figure 5 (black boxes above the timeline show published news about Nemty (Nemty)).
We gathered attack-related search engine query logs (with English locale) about Nemty from the start of August to the end of November 2019 from all countries. We then classified the queries as support and non-support using our ML model (see Section 2.3). Finally, we analyzed the results to gain insights about Nemty, such as when the ransomware started infecting users and how its distribution evolved over time (see blue boxes below the timeline in Figure 5). Our analysis found trends related to the distribution of Nemty that start days before they are reported in the news (Nemty). This result shows that our query log analysis technique could be used to learn about the origin and effectiveness of a ransomware attack in a timely manner, even in the early days of its distribution.
In Figure 6, we show the number of such queries per 4-day period between August and November 2019. We see that until 08/19, there were no queries about Nemty. However, users started submitting support queries on 08/20, which is likely when the ransomware first started spreading. Indeed, on 08/26 the first news about Nemty was published (Nemty) (see Figure 5). On 09/14, an article (Nemty) discussed how the Nemty authors were enhancing the ransomware with the goal of achieving a wider distribution. Our analysis found that although the support queries started declining in early September, there was a spike in such queries on 09/09 (see Figure 6). This observation is consistent with the news about the ransomware becoming more efficient and sophisticated.
There was a tremendous increase in support and non-support queries about Nemty between 10/03 and 10/06 (see Figure 6). Published news confirms this finding, as there was a new distribution campaign during October that targeted enterprise users (Nemty). We found that Nemty was now spreading worldwide, as support queries were being submitted from an increasing number of countries. After this period, the support queries started decreasing, most likely because more people became aware of the ransomware and how to protect against it. In early November there was a minor increase in the queries (see Figure 6), which corresponds to a new distribution method of Nemty via the Trik botnet. However, this method does not appear to have been very effective, because after this minor increase the query volume continued to decline.
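The 4-day aggregation used in this case study can be sketched as follows: each query date is mapped to a period index counted from a chosen start date, and the earliest date marks the first observed support query. The function names, the start date, and the toy dates are hypothetical and only illustrate the bucketing.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Optional


def counts_per_period(
    query_dates: Iterable[date], start: date, period_days: int = 4
) -> Dict[int, int]:
    """Count queries per consecutive period of `period_days` days,
    where period 0 begins on `start`. Dates before `start` are ignored."""
    counts: Counter = Counter()
    for day in query_dates:
        offset = (day - start).days
        if offset >= 0:
            counts[offset // period_days] += 1
    return dict(counts)


def first_query_date(query_dates: Iterable[date]) -> Optional[date]:
    """Earliest date on which a query was seen, or None if there are none."""
    dates = list(query_dates)
    return min(dates) if dates else None


# Toy support-query dates (illustrative only).
toy_dates = [date(2019, 8, 20), date(2019, 8, 21), date(2019, 8, 24)]
print(counts_per_period(toy_dates, start=date(2019, 8, 1)))
print(first_query_date(toy_dates))
```

A spike then shows up as a period whose count is well above its neighbours, and the first non-empty period gives an estimate of when the attack began to spread.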
In this work, we presented the first study to find insights about ransomware attacks using web search logs. We analyzed query logs from a major web search engine using a machine learning classifier to extract support queries by users who were attacked by ransomware. We performed a correlation analysis and found that certain features such as query volume and click counts are correlated with attacks. Further, we analyzed geographical and temporal trends and validated our findings against publicly available information. Lastly, we presented a case study on the Nemty ransomware and showed that with query log analysis, it is possible to mine key insights about the origin and spread of specific attacks.