Many modern malware families communicate with a centralized command and control (C&C) server. To do this, the malware must know the location of the C&C server to connect to; simple approaches might hardcode an IP address or a domain name. However, these are easy to mitigate: traffic to a specific IP can be trivially blocked, and domain names can be easily seized. Therefore, modern malware authors use domain generation algorithms (DGAs) to generate a large set of possible domain names where the C&C server may exist.
Typically, an infected machine will use the DGA to serially generate domain names. Each of these domain names will be resolved, and if the DNS resolution does not result in an NXDOMAIN response (i.e. if the domain name is registered), then the machine will attempt to connect to the resolved IP as if it is the C&C server. If any step of that process is not successful, then the machine will generate another domain name with the DGA and try again, until it is successful. Some DGA families generate random-looking domain names such as xobxceagb[.]biz; others generate difficult-to-distinguish domain names like dutykind[.]net.
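The lookup loop described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the algorithm of any real family: `toy_dga`, `find_c2`, the seed string, and the hash-based label scheme are all hypothetical, the reserved `.example` TLD is used, and DNS resolution is replaced by a dictionary lookup so that no network traffic is generated.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count):
    """Yield `count` random-looking names derived from a seed and a date (hypothetical scheme)."""
    for i in range(count):
        digest = hashlib.sha256(f"{seed}|{day.isoformat()}|{i}".encode()).hexdigest()
        # Map the first ten hex digits to the letters a..p to form the label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:10])
        yield f"{label}.example"


def find_c2(candidates, resolve):
    """Try candidates in order; return the first (domain, ip) that resolves, else None.

    `resolve` returns an IP string, or None to stand in for an NXDOMAIN response.
    """
    for domain in candidates:
        ip = resolve(domain)
        if ip is not None:
            return domain, ip
    return None


domains = list(toy_dga("demo-seed", date(2018, 1, 1), 20))
# Simulated DNS: only the eighth candidate is "registered" (192.0.2.0/24 is a documentation range).
registered = {domains[7]: "192.0.2.10"}
print(find_c2(domains, registered.get))
```

The sketch shows why seizing a single name is ineffective: the client simply continues down the candidate list until one name resolves.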
This DGA-based approach to finding the C&C server is robust to IP blocking and domain name seizure; the C&C server operator can use any IP they have access to (and they may use different IPs at different times), and typically the number of unique domain names a DGA can generate is quite large, and sometimes the DGA itself may be hard to reverse engineer. Therefore, it is not generally feasible to pre-emptively seize all domain names that a DGA could generate. In fact, DGAs may even generate domain names that are not malicious or compromised, and this does not affect the malware’s ability to reach the C&C server eventually.
As a consequence, the task of determining whether or not a given domain name is produced by a DGA is an integral part of modern malware defenses. Simultaneously, as DGA authors create DGAs that generate domain names that look increasingly benign, the challenge of detecting these domain names increases.
A large body of related work seeks to use machine learning techniques directly to classify domains as generated by DGAs or not. Our contribution adds to this lineage of work; here, our machine learning detector is one component of an effective malware detection system. To this end, we describe a machine learning system that is able to accurately classify a domain name as DGA-generated or clean using only the domain name itself and some simple additional features derived from WHOIS data. This system is especially effective on DGA families that generate domain names based on English wordlists (i.e., domains that look benign to a human observer). Compared to previous approaches, our system performs better on difficult-to-detect DGA families that resemble English words (such as the matsnu and suppobox families), and the system is not difficult to deploy in a real-world environment, either as a standalone detector or as part of a larger malware detection system.
Overall, this paper makes the following contributions:
We provide a novel machine learning system built partially on recurrent neural networks that is capable of classifying DGA-generated domain names even from families traditionally understood as difficult. To achieve this degree of performance, our model takes advantage of side information such as WHOIS.
This model is robust: although it is trained with WHOIS information, predictions can still be made if WHOIS or other network level information is not available. This is crucial for real-time detection and prevention of malware outbreaks.
We devise a new measure that we term the smashword score. We rank 41 DGAs in terms of detection difficulty using this measure, giving an intuitive measure of difficulty related to how closely the domain resembles English words. Our approach can be re-used for new DGA families, and we believe our measure is useful for other DGA detection works in the future.
We successfully classify difficult DGA-generated domains using our model that other state-of-the-art approaches could not conclusively label; this includes domains with high smashword scores (e.g., those that are composed of combinations of English words). Note that these domains can even be difficult for humans to classify correctly.
2. Related Work
The problem of distinguishing legitimate domain names from algorithmically generated ones is certainly not new, and has been studied for a number of years. DGAs first became widely known to the community with the introduction of Kraken (kraken09, ) and Conficker (conficker2009, ) in 2008. Since that time, DGAs in malware have proliferated.
Early efforts to stop this threat were hampered by a lack of sufficient training data for machine learning approaches (avivbotnet11, ). A decade later, this remains a problem, but to a smaller extent. Early proposed techniques were therefore largely statistical. For example, Yadav et al. (domainflux12, ) applied such a technique to show differences between valid domain names and algorithmically generated ones. The limitation of such an approach is that it often does not transfer to a different DGA family.
Another milestone in detection techniques came from more extensive use of DNS data. For example, Zhou et al. (Zhou2013DGABasedBD, ) gathered DNS NXDOMAIN data from RDNSs and then used it to assemble a set of suspicious algorithmically generated domain names.
A different approach was proposed by Jian et al. (dnsgraph10, ). It relied on DNS traffic analysis, but only for failed lookups. In this technique, interactions between hosts and failed domain names are extracted. A graph decomposition algorithm based on tri-nonnegative matrix factorization is then applied to iteratively extract coherent co-clusters. The resulting sub-graphs are further analyzed by exploring the temporal properties of the generated clusters. The authors claim that their anomaly-based approach can detect new and previously undiscovered threats.
Further research efforts evolved towards increasingly extensive use of machine learning techniques. At a large scale, this was pioneered by Phoenix (phoenix14, ), which was able to use both the URLs and other side information to detect DGA botnets. The list of parameters observed by this system includes handcrafted features such as pronounceability, blacklist information, and DNS query information. This approach does not use any recurrent neural network (RNN) or other powerful modeling technique for the domains themselves, leaving room for improvement. Tong and Nguyen (semanticdga16, ) have already proposed extensions to the Phoenix system. They included additional measures such as entropy, n-grams, and a modified distance metric for domain classification.
Further progress in DGA detection was reported when using machine learning techniques. For example, Zhao et al. (aptdns15, ) addressed the problem in the context of detecting APT malware. The authors proposed 14 features, based on their big data research, to characterize different properties of malware-related DNS and the ways that they are queried, and also defined network traffic features that can identify the traffic of compromised clients that have been remotely controlled. The features comprise a signature-based engine, an anomaly-based engine, and so-called dynamic DNS features. The data was filtered by Alexa (see https://www.alexa.com/topsites) popularity and by prevalence based on the number of hosts connecting to domains. As the outcome, an engine was built and used to compute reputation scores for IP addresses from the extracted feature vectors. The results are produced using the J48 decision tree algorithm.
A comparable approach was presented by Luo et al. (dgasensor17, ) who described a system using lexical patterns that were extracted from clean domains listed in the Alexa top 100k domains as well as confirmed malicious DGA cases. The proposed approach is machine learning-based and achieves 93% accuracy for detecting malicious domains on the test dataset.
Additional improvements over state-of-the-art results were reported by Woodbridge et al. (woodbridge2016predicting, ). Despite the relatively simple Long Short-Term Memory (LSTM) network used to classify DGA domains, the approach was reported to have a high level of effectiveness. The presented results still have certain shortcomings, especially for difficult DGA classes that resemble English words, on which the approach did not perform satisfactorily.
The same problem was approached from a different angle by Anderson et al. (deepdga16, ). The authors use a Generative Adversarial Network (GAN) to generate adversarial DGA domain names to try and deceive a classifier. The authors were able to achieve this goal. Then the GAN-generated domain names were added to the training set, which resulted in improved DGA detection performance. However, the authors did not test on any DGA families that look like they are made up of English words.
Shibahara et al. (Shibahara2016EfficientDM, ) proposed a slightly different algorithm that applies an RNN to changes in network communication, with the goal of reducing malware analysis time. This approach is not specific to DGAs but rather generic, and attempts to cover other types of malware. It could nonetheless be used successfully against DGA-type threats based on their communication patterns. However, this technique requires malware sandboxing, and thus run-time data that many of the other approaches do not need. The authors claim that without their optimization the analysis takes as much as 15 min, and that their approach reduces this time by 67% while preserving a detection rate for malicious URLs of 97.9%.
The area of DGA defense and evasion techniques constantly evolves. Various improvements in blocking have resulted in the recent concept of domain shadowing. This attack relies on using existing and legitimate domains, which most likely were hacked or compromised. The malware will generate attacker-controlled subdomains and use them for further communication. It was specifically exploited during an outbreak of the Angler malware (see https://blogs.cisco.com/security/talos/angler-update and https://www.symantec.com/security_response/attacksignatures/detail.jsp?asid=27430). To combat this threat, Liu et al. (ccs17, ) proposed a machine learning-based approach. It demonstrated promising results with just 17 features that fell into 4 categories: usage, hosting, activity, and name. Some of their features include whether many subdomains were created around the same time or rather gradually; whether subdomains are hosted close to the original page; and whether a given subdomain is linked from the homepage. Side information is also supplied, such as a list of the top 50 legitimate subdomains. The authors show compelling results for the shadowed domain detection problem; however, shadowed domains are typically algorithmically generated and thus do not present the same difficult challenge as DGA families that produce domains that look like ordinary English words.
Overall, though the task of DGA detection is certainly not new, there has not been much focus on directly detecting DGA families made from English wordlists using the domain itself as a feature. This task has been described as ‘extremely difficult’ in some previous works (woodbridge2016predicting, ). Here, our focus is specifically on those DGA families.
3. Measuring the difficulty of detection of a DGA family
Since data-based approaches for the detection of malicious domains have been a recurrent trend during recent years, it is inevitable that malware authors would shift to generation algorithms that overlap with lexical patterns commonly found in clean datasets to avoid being detected. Taking into account this adversarial environment, we need to be able to measure how our DGA detection models will perform not only overall, but also against the most difficult samples. In this context, ‘difficult’ samples can be understood to be those that trick existing detectors; the most relevant examples are DGA families that combine English words, like the matsnu family (skuratovich2015, ), which was one of the first of many families to build domain names from English wordlists. These families generate natural-looking domains like songneckspiritprintmetal[.]com and westassociatereplacerisk[.]com, which present a much harder challenge to the many detection systems that depend on lexical features (yu2018character, ; woodbridge2016predicting, ; phoenix14, ; deepdga16, ).
An exploratory data analysis of our dataset shows that DGA families have characteristics that can affect the performance of different classification approaches. From an information theory point of view, both the average length and the average character entropy (shannon48, ) of the domain names seem likely to be interesting features to compare. The entropy $H(d)$ of a single domain $d$ is calculated as below:
$$H(d) = -\sum_{c \in d} p_d(c) \log p_d(c),$$
where $p_d(c)$ is the empirical probability of the character $c$ in the string $d$. However, in our experiments, we found no serious correlation between the average character entropy of a DGA family and whether that family was made up of difficult English-like words. Thus, we cannot use $H$ as a proxy for the difficulty of detecting a family.
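As an illustration of the character entropy described above, the following minimal sketch computes the empirical Shannon entropy of a string. The function name `char_entropy` is hypothetical, and the use of a base-2 logarithm (entropy in bits) is an assumption, since the base is not stated in the text.

```python
import math
from collections import Counter


def char_entropy(s):
    """Empirical Shannon entropy of the characters in `s`, in bits (base 2 is an assumption)."""
    if not s:
        return 0.0
    n = len(s)
    # p(c) is the fraction of positions in s occupied by character c.
    return -sum((k / n) * math.log2(k / n) for k in Counter(s).values())


for name in ("xobxceagb", "dutykind"):
    print(name, round(char_entropy(name), 3))
```

A string of one repeated character has entropy 0, and a string of four distinct characters has entropy 2 bits; short wordlist-based and random-looking labels can therefore land in a similar range, which fits the observation that entropy is a poor proxy for difficulty.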
Therefore, we have developed the smashword score $S$, which is the average $n$-gram overlap (with $n$ ranging from 3 to 5) with words from an English dictionary. The computation of the smashword score amounts to calculating term-frequency inverse-document-frequency (TF-IDF) (sparck1972statistical, ) scores for a domain name using the English wordlist as a reference document set. Specifically:
In this equation, $N_d$ refers to the set of character $n$-grams in the domain $d$ of length 3, 4, or 5, $W$ refers to the English wordlist, and $N_W$ refers to the set of character $n$-grams in the entire wordlist of length 3, 4, or 5. The term $\mathrm{count}(g, W)$ is the count of times an $n$-gram $g$ appears in the entire wordlist $W$. Thus, the smashword score is bounded below by 0, which is attained if there is no overlap in any $n$-grams between the domain and the wordlist, and the upper bound depends on the wordlist used. The score is normalized by the number of $n$-grams in the domain.
Computing the smashword score for a string $d$ will generally take time linear in the length of $d$, since a string of length $\ell$ contains $\ell - 2$ 3-grams, $\ell - 3$ 4-grams, and $\ell - 4$ 5-grams.
The average smashword score of a DGA family is then calculated by simply taking the average smashword score of all of the domains in that family that are present in the data.
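The idea of scoring a domain by its $n$-gram overlap with an English wordlist can be sketched as follows. This is a deliberately simplified stand-in and not the smashword score itself: it computes the unweighted fraction of the domain's 3-, 4-, and 5-grams that occur anywhere in the wordlist, without the TF-IDF weighting described above, so it is bounded by 1 rather than by a wordlist-dependent value. The function names and the four-word wordlist are hypothetical.

```python
from collections import Counter


def char_ngrams(s, sizes=(3, 4, 5)):
    """All character n-grams of the given sizes; a string of length l has l-n+1 n-grams."""
    return [s[i:i + n] for n in sizes for i in range(len(s) - n + 1)]


def build_wordlist_counts(words, sizes=(3, 4, 5)):
    """Count how often each n-gram appears across the whole wordlist."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word.lower(), sizes))
    return counts


def overlap_fraction(label, counts, sizes=(3, 4, 5)):
    """Fraction of the label's n-grams that occur in the wordlist (unweighted simplification)."""
    grams = char_ngrams(label.lower(), sizes)
    if not grams:
        return 0.0
    return sum(1 for g in grams if counts[g] > 0) / len(grams)


def family_average(labels, counts):
    """Average score over all labels of one family."""
    return sum(overlap_fraction(x, counts) for x in labels) / len(labels)


counts = build_wordlist_counts(["duty", "kind", "song", "neck"])
print(overlap_fraction("dutykind", counts))   # shares n-grams with the wordlist
print(overlap_fraction("xobxceagb", counts))  # shares none
```

Each n-gram lookup is a constant-time hash lookup, so scoring one label takes time linear in its length, in line with the cost argument above.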
We can expect that domains with a high smashword score will resemble English words, and thus we expect that the smashword score is a good indicator of the difficulty of detecting a DGA domain. Indeed, in the following section we find that our data bears out this expectation.
4. DGA Families
[Table: sample domains for each DGA family; columns: DGA family, sample 1, sample 2.]
Before introducing our proposed classifier and experiments, we introduce our dataset of DGA families and clean domains in order to perform some exploratory analyses. In this section we establish the Ground Truth (GT) datasets for both confirmed DGA and non-DGA domains. Each of our sources is publicly available, making our dataset straightforward to reproduce.
4.1. DGA Ground Truth Set
The GT for DGA domains consists of domains generated using Python implementations of real-world malware families, with various seeds as input where necessary, as well as domains collected from the wild. To achieve sufficiently diverse coverage, the following categories were selected to represent several of the most widespread patterns of DGA domains seen in the wild in 2017/2018:
random-looking 2nd level domain names
random-looking 3rd level domain names with generic 2nd level domain (usually dynamic DNS provider)
domain names comprised of random words (generally English)
The last type of domain was of particular interest to us: lexically, such domains are virtually indistinguishable from legitimate domains, which means that extra techniques are required in order to deal with them.
In almost all cases, DGAs use some sort of input seed, either to randomize the output so that the same domains are not generated twice, or to make the output unpredictable for researchers in order to avoid blocking or sinkholing. Here are some of the popular values used as seeds:
current date and/or time
value embedded into a sample/group of samples by campaign (usually one DWORD)
string(s) available online, either on a malware author's server or on a public server
3rd party public online document (for example, The US Declaration of Independence, the Apple license, etc)
In addition, some work has been done to make sure that diverse top level domains (TLDs) are represented, as malware authors tend to use only some particular ones, which may introduce substantial skew to our dataset. Overall, we have collected 41 DGA families. Information on each family is given in Table 1, including the average entropy and average smashword score. The families are collected from multiple sources, which are denoted in the table: DGArchive (see https://dgarchive.caad.fkie.fraunhofer.de/), an implementation for the locky family found on GitHub (see https://github.com/sourcekris/locky), Andrey Abakumov’s DGA repository on GitHub (see https://github.com/andrewaeva/DGA), and Johannes Bader’s DGA implementations (see https://johannesbader.ch and https://github.com/baderj/domain_generation_algorithms). Smaller or unknown families were grouped as others_dga and others_dga_b (sinkholed domains collected from public WHOIS registration information containing firstname.lastname@example.org as the contact email). All of this data is publicly available and thus our set of DGA domains is reproducible.
4.2. Non-DGA Ground Truth Set
For non-DGA domains, the GT comprises domains found in the Alexa top 1 million sites and the OpenDNS public domain lists (available at https://github.com/opendns/public-domain-lists), giving 1.02M clean domains for the clean GT set. Several major problems must be handled when using such an approach:
some prevalent DGA domains can manage to get into the list of world top popular domains,
3rd level domains should be covered separately,
some DGA domains are so short that they collided with the non-DGA domains, and
some DGA domains use combinations of English words, so the chances that they collide with non-DGA domains are quite high.
In the last two cases, malware authors have no problem when collisions take place, since in any case the malware will wait for a valid response from the C&C server before following up. On the contrary, such cases complicate the work of security engineers, as they cannot simply ban all domains generated by the DGA, since some of those domains are known to be clean. In our dataset, we found only 12 such domains that existed in both the non-DGA and DGA sets. It presents no modeling problem to leave these points in both sets.
5. Side Information
DGA families with a high average smashword score are very hard to classify based on the domain names alone. In fact, human analysts may even have a difficult time differentiating—for instance, it is plausible at first glance that darkhope.net might be a personal website for a 1990s-era teenaged computer enthusiast. In reality, that domain name is generated by the suppobox DGA. Thus, we cannot hope to build an effective classification system using single domain names alone.
Therefore, we augment our domain names with side information, which we collect from the WHOIS database (rfc3912, ). Specifically, given a domain name, we perform a WHOIS lookup, and extract the following numeric or Boolean features:
has_registrarname: Boolean, indicates whether a registrar name is available.
has_contactemail: Boolean, indicates if any contact email is available.
days_until_expiration: numeric, the number of days since the domain was created, updated, or until expiration
status_length: numeric, length of the “status” field
has_zonecontact_info: Boolean, indicates whether each of the types of contact information are available.
has_registrar_iana_id: Boolean, indicates whether a registrar IANA ID is given.
Note that for a non-registered domain name (NXDOMAIN), the Boolean features will all be false, and the numeric features will all be taken as 0.
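The feature extraction listed above can be sketched as a small function. This is a minimal illustration under stated assumptions: the record is taken to be an already-parsed dictionary, and its field names (`registrar`, `contact_email`, `expiration_date`, `status`, `zone_contact`, `registrar_iana_id`) are hypothetical, since the raw WHOIS format varies between registries. Only the expiration-based temporal feature is shown.

```python
from datetime import date

FEATURE_NAMES = [
    "has_registrarname", "has_contactemail", "days_until_expiration",
    "status_length", "has_zonecontact_info", "has_registrar_iana_id",
]


def whois_features(record, today):
    """Map a parsed WHOIS record (dict with hypothetical keys) to the feature dict.

    `record` is None when no WHOIS data exists (e.g. NXDOMAIN); then every
    Boolean feature is False and every numeric feature is 0.
    """
    if record is None:
        return {
            "has_registrarname": False, "has_contactemail": False,
            "days_until_expiration": 0, "status_length": 0,
            "has_zonecontact_info": False, "has_registrar_iana_id": False,
        }
    expires = record.get("expiration_date")
    return {
        "has_registrarname": bool(record.get("registrar")),
        "has_contactemail": bool(record.get("contact_email")),
        "days_until_expiration": (expires - today).days if expires else 0,
        "status_length": len(record.get("status") or ""),
        "has_zonecontact_info": bool(record.get("zone_contact")),
        "has_registrar_iana_id": bool(record.get("registrar_iana_id")),
    }


example = {"registrar": "Example Registrar", "expiration_date": date(2019, 1, 1),
           "status": "ok", "registrar_iana_id": "9999"}
print(whois_features(example, today=date(2018, 12, 2)))
print(whois_features(None, today=date(2018, 12, 2)))
```

Returning a fixed all-zero vector for missing records is what lets the downstream classifier still produce a prediction when no WHOIS data is available.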
We do not perform any semantic analysis on the content of the WHOIS record; instead, we focus on those features most likely to give us information relevant to DGAs and C&C servers: temporal information about the registration, and whether the domain itself is registered. The features we are using roughly match the type of features used by Ma et al. (ma2009beyond, ).
For our dataset, we used a snapshot of collected WHOIS data with 245M records. For our clean domain names, we matched 927k domains (91.7%) to WHOIS data, and for the DGA domains, we matched only 2.3k domains (0.18%) to WHOIS data. This is expected, given that most DGA domains are never registered.
In our dataset, DGA families have an average of 3.5% of their domains matched to WHOIS data; with the ramnit family matching the highest percentage at 84%, and the pandex family matching the lowest nonzero percentage at 0.008% (only 7 out of 91758 domains registered). 19 families, totaling 321k domains, have no domains matched to any WHOIS data.
Although having matching WHOIS data for a domain is strongly correlated with whether or not the domain arises from a DGA, note that a detector built to classify a domain as malicious simply if there is no WHOIS data would not be very effective: with our data, it would achieve a true positive rate (TPR) of 96.5%, but with an unacceptably high false positive rate (FPR) of 8.3%. Thus, though WHOIS data gives us good information, it is not sufficient for prediction by itself.
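The arithmetic behind these two rates can be checked from the match percentages quoted earlier in this section. The sketch below is an interpretation rather than the authors' stated computation: the reported FPR equals one minus the clean match rate, and the reported TPR equals one minus the per-family average match rate of 3.5%; the rate pooled over all DGA domains (0.18% matched) would be higher. The variable names are hypothetical.

```python
# Match rates quoted in the text above.
clean_matched = 0.917        # fraction of clean domains with WHOIS data
family_avg_matched = 0.035   # average fraction per DGA family with WHOIS data
pooled_dga_matched = 0.0018  # fraction of all DGA domains with WHOIS data

# Rule under test: flag a domain as DGA exactly when it has no WHOIS data.
fpr = 1 - clean_matched                  # clean domains wrongly flagged
tpr_family_avg = 1 - family_avg_matched  # averaged over families
tpr_pooled = 1 - pooled_dga_matched      # pooled over all DGA domains

print(f"FPR = {fpr:.1%}")
print(f"TPR (family average) = {tpr_family_avg:.1%}")
print(f"TPR (pooled) = {tpr_pooled:.2%}")
```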
5.1. WHOIS and GDPR
After the passing of the European privacy bill GDPR (gdpr, ), it is unclear how WHOIS lookups will be affected (icann-gdpr, ). At the time of our experiments, WHOIS data was still publicly available. However, if this is not the case in the future, it would be easy to find alternatives. Given that the important features we extract depend more on the temporal registration information than the contact details of the registrant, we could replace the WHOIS features we use here with DNS tracking systems like Active DNS (kountouras2016enabling, ) or the Alembic system (lever2016domain, ).
At the time of this writing, it is not clear what the long-term solution for WHOIS data will be. But, since WHOIS data is widely used for security applications (ma2009beyond, ; bilge2011exposure, ; canali2011prophiler, ), it seems very unlikely that the types of features we are using for our system will become unavailable.
6. Model Architecture
Given the effectiveness of deep learning classifiers for character-level DGA modeling (woodbridge2016predicting, ; yu2018character, ), we have designed our DGA detector on character-level RNNs (karpathy2016visualizing, ). Instead of training the RNNs to predict the class of the domain, we train two RNNs to predict the next character in the domain and combine these predictions via a generalized likelihood ratio test (GLRT). In addition, our model also incorporates the WHOIS side information discussed in the previous section via model stacking. This allows us to achieve significantly better performance on more difficult DGA families.
Overall, our model is a logistic regression classifier built on the output of four different models:
A character-level RNN GLRT model built only on the subdomains in the training set.
A character-level RNN GLRT model built only on the domains in the training set.
One-hot encoded top-level domain features (for the most popular 250 TLDs).
Extracted features from the WHOIS information.
The overall architecture of the model can be seen in Figure 2. In the following subsections we describe the details of the full model.
6.1. Character-level RNN GLRT
The core of the model is the character-level RNN that uses the generalized likelihood ratio test to classify a domain or a subdomain as DGA or non-DGA. Previous approaches and other uses of RNNs often predict the class of the output directly (woodbridge2016predicting, ; graves2005framewise, ); however, this only allows backpropagation of the error signal at the end of the entire sequence, which can slow the learning process.
Therefore, we build one RNN on each class in the input dataset (in our case, there are only two classes: DGA and non-DGA). Each input sequence is converted to a one-hot character encoding, and the label or expected output of the RNN for each time step is the one-hot encoding of the next character in the sequence. This means the RNN is trained to predict the next character in the sequence. Thus, backpropagation can be done at every timestep, instead of waiting until the end of the sequence to compare the output of the RNN with the desired label. Our model’s architecture is a single LSTM layer (hochreiter1997long, ) followed by a single dense layer, pictured in Figure 3. We use LSTMs to help avoid the vanishing and exploding gradient phenomena (pascanu2013difficulty, ). Although it is possible to build a more complex network, we found that this provides a good balance between training time and the accuracy of the model.
In order to perform the one-hot encoding, we first build a dictionary $D$ on the entire training set, including the ‘unknown’ character ‘?’. If a character is encountered at prediction time that is not in $D$, then it is encoded as ‘?’.
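The encoding and next-character targets described above can be sketched in plain Python. This is a minimal illustration: the helper names are hypothetical, and details such as start or end-of-sequence markers, padding, and batching are not specified in the text and are omitted here.

```python
UNKNOWN = "?"


def build_vocab(strings):
    """Character dictionary over the training strings, always including the unknown marker."""
    chars = sorted(set("".join(strings)) | {UNKNOWN})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(ch, vocab):
    """One-hot vector for `ch`; characters outside the dictionary map to the unknown marker."""
    vec = [0] * len(vocab)
    vec[vocab.get(ch, vocab[UNKNOWN])] = 1
    return vec


def next_char_pairs(s, vocab):
    """Inputs are characters 0..n-2 and targets are characters 1..n-1 (the next character)."""
    inputs = [one_hot(c, vocab) for c in s[:-1]]
    targets = [one_hot(c, vocab) for c in s[1:]]
    return inputs, targets


vocab = build_vocab(["abc", "bcd"])
inputs, targets = next_char_pairs("abc", vocab)
print(vocab)
print(inputs, targets)
print(one_hot("z", vocab))  # unseen character, encoded as '?'
```

Because every position has its own target, a loss term is available at every time step, which is the property the next-character formulation relies on.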
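We use the categorical cross-entropy (goodfellow2016deep, ) for the loss function. Then, during prediction, at each time step $t$ for the input $x = (x_1, \ldots, x_T)$, the output of the model $M$ is a probability
$$P(x_{t+1} \mid x_1, \ldots, x_t, M),$$
and with this we can construct an estimate of the likelihood of the point $x$ arising from the model $M$:
$$L(x \mid M) = \prod_{t=1}^{T-1} P(x_{t+1} \mid x_1, \ldots, x_t, M).$$
For the generalized likelihood ratio test (neyman1933ix, ), if we calculate both likelihood estimates $L(x \mid M_{\mathrm{DGA}})$ and $L(x \mid M_{\mathrm{non}})$, we can then set a threshold $\tau$ and compute
$$\Lambda(x) = \frac{L(x \mid M_{\mathrm{DGA}})}{L(x \mid M_{\mathrm{non}})},$$
and if $\Lambda(x) > \tau$, we classify the point as a DGA domain; otherwise, we classify the point as non-DGA. The value of $\tau$ can be swept in order to control the false positive and true positive rate.
$\Lambda(x)$ is directly related to the typical posterior probability of a classifier; in fact, if we normalize the likelihood estimates we can produce a posterior probability of $x$ being a DGA domain:
$$P(\mathrm{DGA} \mid x) = \frac{L(x \mid M_{\mathrm{DGA}})}{L(x \mid M_{\mathrm{DGA}}) + L(x \mid M_{\mathrm{non}})}.$$
Then, setting a threshold for $P(\mathrm{DGA} \mid x)$ is reducible to setting a GLRT threshold $\tau$.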
For our DGA classifier, we build two separate RNN-GLRT models as described above: one on the subdomains of our training set, and one on the domains. Each of these two models, in turn, contains a separately-trained LSTM RNN, whose outputs are combined to perform the GLRT as shown above.
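To make the likelihood ratio test concrete, the sketch below trains one next-character model per class and compares the likelihood each assigns to a string. As a loudly stated substitution, the per-class LSTM is replaced by an add-one smoothed character bigram model so that the example runs with the standard library only; the class and function names, the `^` start marker, the tiny training sets, and the use of log-likelihoods are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


class BigramModel:
    """Add-one smoothed character bigram model; a stand-in for one per-class LSTM."""

    def __init__(self, strings):
        self.counts = defaultdict(Counter)
        for s in strings:
            for prev, nxt in zip("^" + s, s):  # '^' marks the start of the string
                self.counts[prev][nxt] += 1

    def log_likelihood(self, s):
        """Sum of log P(next char | previous char) over the string."""
        total = 0.0
        for prev, nxt in zip("^" + s, s):
            row = self.counts[prev]
            total += math.log((row[nxt] + 1) / (sum(row.values()) + len(ALPHABET)))
        return total


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def glrt(s, dga_model, clean_model, log_tau=0.0):
    """Return (is_dga, log likelihood ratio, posterior of DGA with normalized likelihoods)."""
    log_ratio = dga_model.log_likelihood(s) - clean_model.log_likelihood(s)
    # L_dga / (L_dga + L_clean) equals sigmoid(log L_dga - log L_clean).
    return log_ratio > log_tau, log_ratio, sigmoid(log_ratio)


dga_model = BigramModel(["xkqzjv", "qzxvkj", "zvqkxj"])
clean_model = BigramModel(["google", "mail", "news", "shop"])
for name in ("xkqzjv", "google"):
    print(name, glrt(name, dga_model, clean_model))
```

Working with log-likelihoods avoids numerical underflow from multiplying many small probabilities, and sweeping `log_tau` trades off the true positive rate against the false positive rate.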
As input to the logistic regression model, we extract six features from each RNN-GLRT model, giving a total of twelve features. The features are listed below.
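A Boolean feature indicating whether a domain or subdomain could be extracted from the input domain $x$.
The likelihood estimate under the DGA model, $L(x \mid M_{\mathrm{DGA}})$.
The likelihood estimate under the non-DGA model, $L(x \mid M_{\mathrm{non}})$.
The posterior probability of the DGA class, $P(\mathrm{DGA} \mid x)$.
The posterior probability of the non-DGA class, $P(\text{non-DGA} \mid x)$.
The likelihood ratio $\Lambda(x)$.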
Since we are extracting the likelihood estimates and posterior probabilities into a logistic regression model, we actually have no need to select a threshold $\tau$; that is only needed for a standalone GLRT LSTM model. Instead, in our combined model, the logistic regression will learn directly from the probabilities and likelihoods.
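The six features per RNN-GLRT model can be assembled from the two per-class likelihoods, as in the sketch below. Several choices here are assumptions rather than details from the text: the function name, the use of log-likelihoods and a log ratio in place of raw values, and the all-zero vector returned when no domain or subdomain part could be extracted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def glrt_features(log_l_dga, log_l_clean):
    """Six features for one RNN-GLRT model, given the two per-class log-likelihoods.

    Pass None for both arguments when the part (domain or subdomain) could not
    be extracted; the zero vector used in that case is an assumption.
    """
    if log_l_dga is None or log_l_clean is None:
        return [0.0] * 6
    log_ratio = log_l_dga - log_l_clean
    p_dga = sigmoid(log_ratio)  # normalized likelihood of the DGA class
    return [1.0, log_l_dga, log_l_clean, p_dga, 1.0 - p_dga, log_ratio]


# Twelve features in total: six from the domain model, six from the subdomain model.
domain_part = glrt_features(-10.0, -12.0)
subdomain_part = glrt_features(None, None)  # e.g. the input has no subdomain
print(domain_part + subdomain_part)
```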
6.2. Top-level domain features
Since TLDs are so short (usually two or three characters), it is excessive to train an RNN on them. Therefore, we use a one-hot encoding of the TLD, matching against the 249 most frequent TLDs in our training dataset; if there is no match, the TLD is encoded as ‘other’, giving a total of 250 binary features out of the TLD.
In order to perform the conversion, we used the TLD list available from http://publicsuffix.org. The most common TLDs in our dataset were .com, .org, .ru, .net, and .info. We found that the .ru, .info, .biz, and .cc TLDs contained significantly higher concentrations of DGA domains, with each of those TLDs containing at least 3 times as many DGA as non-DGA domains. Since we have split these into separate features, we can expect our model to learn which TLDs domain generation algorithms are more likely to use.
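The TLD encoding described in this subsection can be sketched as follows. The function names are hypothetical, and the sketch assumes the TLD has already been extracted from each domain (for example with the public suffix list); the demonstration uses `top_k=2` only to keep the output short, whereas the text uses the 249 most frequent TLDs.

```python
from collections import Counter


def fit_tld_encoder(train_tlds, top_k=249):
    """Index the `top_k` most frequent TLDs of the training set; the rest share one 'other' slot."""
    most_common = [tld for tld, _ in Counter(train_tlds).most_common(top_k)]
    return {tld: i for i, tld in enumerate(most_common)}


def encode_tld(tld, index):
    """One-hot vector of length len(index) + 1; the last position means 'other'."""
    vec = [0] * (len(index) + 1)
    vec[index.get(tld, len(index))] = 1
    return vec


index = fit_tld_encoder(["com", "com", "com", "ru", "ru", "net"], top_k=2)
print(index)
print(encode_tld("ru", index))
print(encode_tld("net", index))  # not among the top 2, so encoded as 'other'
```

With `top_k=249`, the encoder yields the 250 binary features mentioned above.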
6.3. WHOIS side information
The last input to our logistic regression model is the WHOIS data described earlier, in Section 5.
The WHOIS data makes up the rest of the input to the logistic regression model. It is concatenated with the RNN-GLRT features for the domain, the RNN-GLRT features for the subdomain, and the one-hot encoded TLD features.
Before all of these concatenated features are fed into the logistic regression model, we perform whitening via PCA for decorrelation and scaling (kessy2015optimal, ). This step can improve the performance of the model, although it generally also makes interpretability more difficult.
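For reference, one standard form of PCA whitening followed by logistic regression is written out below. This is a sketch under assumptions: the symbols are introduced here for illustration, and the text does not state whether any components are dropped or whether a small constant is added to the eigenvalues for numerical stability.

```latex
% Concatenated feature vectors x_1, ..., x_N (RNN-GLRT, TLD, and WHOIS features).
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^{\top} = U \Lambda_{\Sigma} U^{\top}

% PCA whitening: rotate onto the principal axes, then rescale each axis to unit variance.
z = \Lambda_{\Sigma}^{-1/2} \, U^{\top} (x - \mu),
\qquad
\operatorname{Cov}(z) = I

% Logistic regression on the whitened features, with learned weights w and bias b.
P(\mathrm{DGA} \mid x) = \frac{1}{1 + \exp\left(-(w^{\top} z + b)\right)}
```

Because each whitened coordinate mixes several of the original features, the learned weights no longer correspond to individual inputs, which is the loss of interpretability noted above.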
6.4. Computational concerns
Recurrent neural networks, especially those with complex memory cells like LSTMs, are well-known to be time-consuming to train (li2015fpga, ; doetsch2014fast, ). Our model is not exempt from this; for large datasets, it may take many hours to train (our training was conducted on a high-end consumer-grade system with a single GPU). However, in practice this is not a concern: a single forward pass through the model for classification is comparatively very fast, and once our model is trained, there are no computational difficulties with deployment in a low-latency or high-throughput detection system. This means that the model can be, e.g., deployed into a consumer endpoint security product without problems.
7. Adversarial samples
In recent years, the phenomenon of adversarial samples has surfaced in the deep learning community (szegedy2013intriguing, ; goodfellow2014explaining, ). In essence, a malicious actor could take a sample that was correctly classified by the model, perturb the input slightly, and the perturbed sample would be misclassified. When images are used, these perturbations are often invisible to the eye. These adversarial attacks have been successfully applied to fields outside of images, including audio (carlini2018audio, ) and malware classification (grosse2017adversarial, ). Though there are some defense mechanisms that have been developed (papernot2016distillation, ; feinman2017detecting, ), many of these are later found to be circumventable (carlini2017adversarial, ).
Given that adversarial samples are not limited to images, it is reasonable to believe that neural network-based DGA detectors could also suffer from this vulnerability. In our situation, a malicious actor would wish to take a domain that is detected as DGA-generated and have it labeled as a non-DGA domain. It would be very straightforward to perform an attack like the Fast Gradient Sign Method (goodfellow2014explaining, ) to modify the characters in a domain name. In fact, it is not (generally) important to DGA authors what the domain name looks like, so there is no cost to modifying the letters of the domain itself.
Such a technique would likely prove effective against an approach that only incorporated the domain name itself. However, note that our model also incorporates domain registration side information from WHOIS. Although a malware author can change the domain name they are using at will and nearly arbitrarily, it is significantly more difficult to cause the WHOIS registration information (such as registration date) to have specific values. To do that, a malware author might need to register a domain perhaps months in advance and host a clean website on it, which is both expensive and time-consuming. Thus, it would be more difficult for a malware author to work around our proposed model.
The most important situation for any DGA detection model is when it encounters an entirely new DGA family that it has never seen before. This is the situation that we focus on in our experiments, since it reflects the real-world ‘zero-day’ situation. We compare our model to several baselines that reflect the state-of-the-art for machine learning systems that do not use network traffic data.
Leave-one-out models. To simulate the situation where a DGA family has not been seen, we validate the performance of our DGA detection model by performing leave-one-out experiments, where we train the model on all DGA families except one, and then the test set consists entirely of the left-out DGA family combined with some never-before-seen non-DGA domains. This shows us how well the model is able to generalize to unseen DGA family types.
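The leave-one-out protocol can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the function name `leave_one_family_out`, the argument names, and the (domain, label) representation are all assumptions made for the example.

```python
from collections import defaultdict


def leave_one_family_out(dga_samples, benign_train, benign_test):
    """Yield (held_out_family, train_set, test_set) for each DGA family.

    dga_samples: list of (domain, family) pairs.
    benign_train / benign_test: disjoint lists of non-DGA domains, so the
    non-DGA test domains are never seen during training.
    Each returned set is a list of (domain, label), 1 = DGA, 0 = non-DGA.
    """
    by_family = defaultdict(list)
    for domain, family in dga_samples:
        by_family[family].append(domain)

    for held_out in sorted(by_family):
        # Train on non-DGA domains plus every DGA family except the held-out one.
        train = [(d, 0) for d in benign_train]
        for family, domains in by_family.items():
            if family != held_out:
                train.extend((d, 1) for d in domains)
        # Test only on the held-out family plus unseen non-DGA domains.
        test = [(d, 1) for d in by_family[held_out]]
        test.extend((d, 0) for d in benign_test)
        yield held_out, train, test
```

Keeping the non-DGA test domains disjoint from the training domains matters here: otherwise the false positive rate on the test set would be measured on data the model has already seen.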
Dataset details. Our collected dataset, as described in Section 4, includes 41 DGA families plus non-DGA data, totaling 2.3 million domain names (1.01 million non-DGA, 1.28 million DGA). Of these 41 DGA families, many of those with a high average smashword score have been specifically mentioned in related work as difficult. The LSTM model of Woodbridge et al. (woodbridge2016predicting, ) is specifically shown to perform very poorly on the matsnu, suppobox, and beebone families, each of which has an above-average to very large average smashword score. Mac et al. (mac2017dga, ) claim that matsnu is not differentiable from non-DGA domains at all, and show very poor performance for all of their surveyed algorithms on the nymaim DGA, which is very similar to the gozi DGA that we use here. Because our model has been specifically designed for DGA families that are understood to be more difficult, we will focus our results on these families.
Baseline models. We wish to compare the performance of our proposed model with existing and baseline approaches. Therefore, we compare our model with four other models, which we now introduce. Two of these are simple baseline models, with and without WHOIS information, and the other two are based on LSTM architectures that represent the most closely related state-of-the-art work of Woodbridge et al. (woodbridge2016predicting, ).
lr-tfidf: logistic regression on TF-IDF features extracted from the domain name using 2-grams. (We did not use 3-grams, because the memory usage on our system was too large.) No WHOIS side information is used here, so this model presents a reasonable baseline using only the domain name.
lr-tfidf-aug: logistic regression on TF-IDF features extracted from the domain name using 2-grams, and augmented with the WHOIS features. This is a reasonable baseline for classification using both the domain name and the side information (WHOIS features).
glrt-lstm: a GLRT LSTM model built only on the full domain name (no side information). This can be considered to be a slight improvement over the model of Woodbridge et al. (woodbridge2016predicting, ) due to the use of the GLRT.
glrt-lstm-aug: a GLRT LSTM model built only on the full domain name, whose output is then used as input to a logistic regression model together with the WHOIS features.
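To make the feature construction of the two logistic regression baselines concrete, the following dependency-free sketch computes TF-IDF weights over character 2-grams and appends WHOIS features. It is an illustration only: the baselines themselves were built with scikit-learn, and the smoothed IDF formula, the L2 normalisation, the helper names, and the WHOIS feature name `age_days` are assumptions chosen for the example.

```python
import math
from collections import Counter


def char_bigrams(domain):
    """Return the overlapping character 2-grams of a domain name."""
    return [domain[i:i + 2] for i in range(len(domain) - 1)]


def fit_idf(domains):
    """Smoothed inverse document frequency for every 2-gram in the corpus."""
    n = len(domains)
    doc_freq = Counter()
    for d in domains:
        doc_freq.update(set(char_bigrams(d)))  # count each 2-gram once per domain
    return {g: math.log((1 + n) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(domain, idf):
    """Map a domain to an L2-normalised sparse vector {2-gram: weight}."""
    counts = Counter(g for g in char_bigrams(domain) if g in idf)
    weights = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm > 0 else {}


def augment(tfidf, whois_features):
    """Append named WHOIS features to the sparse vector (the '-aug' variant)."""
    combined = dict(tfidf)
    for name, value in whois_features.items():
        combined["whois:" + name] = float(value)
    return combined


idf = fit_idf(["dutykind", "xobxceagb"])
features = augment(tfidf_vector("dutykind", idf), {"age_days": 30})
```

The resulting sparse vectors would then be passed to a logistic regression classifier. The number of distinct n-grams grows quickly with n, which is consistent with the memory limitation noted for 3-grams.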
Our model. We refer to our model as the split-glrt-lstm-aug model; this is the model from Section 6.
Training and implementation details. The lr-tfidf and lr-tfidf-aug models were implemented with scikit-learn (pedregosa2011scikit, ), and the three LSTM-based models were implemented with Keras (chollet2015keras, ) using the TensorFlow backend (abadi2016tensorflow, ). Each LSTM model used 500 LSTM units and a dropout of 0.2, and was trained with the RMSprop optimizer for up to 100 epochs (passes over the dataset) with early stopping. With our setup (one NVIDIA GeForce GTX TITAN X), each LSTM model took approximately 8-10 hours to train. In our experiments, we found that changing the optimizer made little difference to the resulting model, and that either increasing or decreasing the number of LSTM units decreased performance slightly.
The accompanying figure shows receiver operating characteristic curves (ROC curves) on the four datasets with the highest smashword scores. We can see that the split-glrt-lstm-aug model (our proposed model) outperforms each of the other models, providing better performance at lower false positive rates. For instance, on the difficult matsnu family, when the false positive rate is chosen to be 0.5%, the split-glrt-lstm-aug model operates at a true positive rate of 95%, whereas the next best model (lr-tfidf-aug) operates at a true positive rate of only 70%.
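An operating point such as "true positive rate at a false positive rate of 0.5%" can be read off a list of classifier scores as in the following minimal sketch. The function name `tpr_at_fpr` and the convention that higher scores mean "more likely DGA" are assumptions for illustration; this is not the evaluation code used to produce the curves.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest true positive rate achievable with FPR <= max_fpr.

    scores: classifier outputs, higher = more likely DGA.
    labels: 1 for DGA, 0 for non-DGA.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one DGA and one non-DGA sample")

    # Lower the decision threshold step by step, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with identical scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows from here, so stop
        best_tpr = tp / n_pos
    return best_tpr
```

For example, `tpr_at_fpr(scores, labels, 0.005)` would give the true positive rate at the 0.5% false positive rate used in the comparison above.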
In typical application scenarios, we care only about running our classifier at false positive rates less than or equal to 1% (FPR ≤ 0.01). Therefore, we study the performance of the classifiers using the partial AUC (mcclish1989analyzing, ) measure, which is the standard area-under-the-curve (AUC) measure restricted to false positive rates at or below a given threshold. In Table 2, we show the partial AUC of each model for each leave-one-out family experiment, sorted by decreasing average smashword score.
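As an illustration of the measure, the sketch below computes a partial AUC by building the ROC curve from scores and integrating it with the trapezoidal rule up to a false positive rate cutoff. The function names are hypothetical, and returning the raw (unnormalised) area is an assumption; with a cutoff of 0.01 the largest possible value is therefore 0.01.

```python
def roc_points(scores, labels):
    """Return ROC points (fpr, tpr), starting at (0, 0), for 0/1 labels."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for k, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(scores, labels, max_fpr=0.01):
    """Area under the ROC curve restricted to FPR in [0, max_fpr]."""
    points = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at the cutoff by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Setting `max_fpr=1.0` recovers the ordinary AUC, which gives a simple sanity check for the implementation.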
On the most difficult families (those with a large average smashword score), the proposed split-glrt-lstm-aug reliably and significantly outperforms all other compared models. This is the region of most interest in our work, as these families are difficult to detect, even with WHOIS data. Each of these difficult families generates domains that resemble English words; see Table 1. Note that the lr-tfidf-aug and glrt-lstm-aug models both have access to the WHOIS features; however, only split-glrt-lstm-aug is able to take advantage of these to provide good performance for families with a high average smashword score.
For ‘easier’ families with a lower average smashword score, where the generated domains typically look more like random characters, classification can be performed more reliably with only the text of the domain itself; thus, the glrt-lstm model is dominant in this regime.
Overall, we see that our model is successful in detecting DGA-generated domains that resemble English words. The model appears to generalize well to different families, given the nature of our leave-one-out experiments.
In this paper we have considered the problem of DGA domain detection. We introduced a measure of complexity for DGA families called the smashword score, which reflects how closely a DGA’s generated domains resemble English words. Because DGA families with higher smashword scores have typically posed greater difficulty for detection, we built a novel machine learning model consisting of recurrent neural networks (RNNs) combined through the generalized likelihood ratio test (GLRT), and augmented these with a logistic regression model that also includes side information such as WHOIS information.
This combined model notably outperforms existing state-of-the-art approaches on DGA families with high smashword scores, such as the difficult matsnu and suppobox families. We believe that this model could be used either as a standalone model or as part of a larger DGA detection system that also incorporates network traffic, such as the Pleiades system (antonakakis2012throw, ).
There is room for future improvement in our work. The model we have used is specialized for DGA families based on English words, and is therefore less effective for DGA families whose domains do not look like natural domain names. Thus, in a production environment or in an improved system, our model could be ensembled with other techniques that are more effective for DGA families with lower smashword scores. In addition, our model uses LSTMs for memory units, which are effective, but it is possible that more complex memory units such as Neural Turing Machines (graves2014neural, ) or attention-based models (mnih2014recurrent, ) could provide better performance.
The authors would like to thank Symantec STAR members for fruitful discussions, valuable feedback and support, specifically to Robert Leyden and Sean Kiernan. The authors would also like to thank Nikolaos Vasiloglou and Panagiotis Kintis for helpful feedback.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, pages 265–283. USENIX Association, 2016.
-  Hyrum S. Anderson, Jonathan Woodbridge, and Bobby Filar. DeepDGA: Adversarially-Tuned Domain Generation and Detection. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec ’16, pages 13–21, New York, NY, USA, 2016. ACM.
-  Manos Antonakakis, Roberto Perdisci, Yacin Nadji, Nikolaos Vasiloglou, Saeed Abu-Nimeh, Wenke Lee, and David Dagon. From throw-away traffic to bots: detecting the rise of DGA-based malware. In Proceedings of the 21st USENIX conference on Security symposium, pages 24–24. USENIX Association, 2012.
-  Adam J. Aviv and Andreas Haeberlen. Challenges in experimenting with botnet detection systems. In Proceedings of the 4th Conference on Cyber Security Experimentation and Test, CSET’11, page 6, Berkeley, CA, USA, 2011. USENIX Association.
-  Leyla Bilge, Engin Kirda, Christopher Kruegel, and Marco Balduzzi. EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis. In Proceedings of the 18th Annual Network and Distributed System Security Symposium (NDSS 2011), 2011.
-  Davide Canali, Marco Cova, Giovanni Vigna, and Christopher Kruegel. Prophiler: a fast filter for the large-scale detection of malicious web pages. In Proceedings of the 20th International World Wide Web Conference (WWW 2011), pages 197–206. ACM, 2011.
-  Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
-  Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944, 2018.
-  François Chollet et al. Keras. https://keras.io, 2015.
-  Council of European Union. Council regulation (EU) no. 2016/679 (General Data Protection Regulation), 2016. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L_.2016.119.01.0001.01.ENG.
-  L. Daigle. WHOIS Protocol Specification. RFC 3912, RFC Editor, September 2004.
-  Patrick Doetsch, Michal Kozielski, and Hermann Ney. Fast and robust training of recurrent neural networks for offline handwriting recognition. In Proceedings of the 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 279–284. IEEE, 2014.
-  Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press Cambridge, 2016.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
-  Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
-  Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.
-  Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial examples for malware detection. In European Symposium on Research in Computer Security, pages 62–79. Springer, 2017.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  ICANN. Data Protection/Privacy Issues, May 2018. https://www.icann.org/dataprotectionprivacy.
-  Nan Jiang, Jin Cao, Yu Jin, Li Erran Li, and Zhi-Li Zhang. Identifying Suspicious Activities Through DNS Failure Graph Analysis. In Proceedings of the 18th IEEE International Conference on Network Protocols, ICNP ’10, pages 144–153, Washington, DC, USA, 2010. IEEE Computer Society.
-  Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2016.
-  Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, pages 1–6, 2018.
-  Athanasios Kountouras, Panagiotis Kintis, Chaz Lever, Yizheng Chen, Yacin Nadji, David Dagon, Manos Antonakakis, and Rodney Joffe. Enabling network security through active DNS datasets. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 188–208. Springer, 2016.
-  Felix S. Leder and Peter Martini. NGBPA next generation botnet protocol analysis. In Dimitris Gritzalis and Javier Lopez, editors, Emerging Challenges for Security, Privacy and Trust, pages 307–317, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
-  Chaz Lever, Robert Walls, Yacin Nadji, David Dagon, Patrick McDaniel, and Manos Antonakakis. Domain-Z: 28 registrations later measuring the exploitation of residual trust in domains. In 2016 IEEE Symposium on Security and Privacy (S&P), pages 691–706. IEEE, 2016.
-  Sicheng Li, Chunpeng Wu, Hai Li, Boxun Li, Yu Wang, and Qinru Qiu. FPGA acceleration of recurrent neural network based language model. In 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 111–118. IEEE, 2015.
-  Daiping Liu, Zhou Li, Kun Du, Haining Wang, Baojun Liu, and Hai-Xin Duan. Don’t let one rotten apple spoil the whole barrel: Towards automated detection of shadowed domains. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, pages 537–552, 2017.
-  Xi Luo, Liming Wang, Zhen Xu, Jing Yang, Mo Sun, and Jing Wang. DGASensor: Fast Detection for DGA-Based Malwares. In Proceedings of the 5th International Conference on Communications and Broadband Networking, ICCBN ’17, pages 47–53, New York, NY, USA, 2017. ACM.
-  Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), pages 1245–1254. ACM, 2009.
-  Hieu Mac, Duc Tran, Van Tong, Linh Giang Nguyen, and Hai Anh Tran. DGA Botnet Detection Using Supervised Learning Methods. In Proceedings of the Eighth International Symposium on Information and Communication Technology, SoICT 2017, pages 211–218, New York, NY, USA, 2017. ACM.
-  Donna K. McClish. Analyzing a portion of the ROC curve. Medical Decision Making, 9(3):190–195, 1989.
-  Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 2204–2212, 2014.
-  Jerzy Neyman and Egon S Pearson. IX. On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A, 231(694-706):289–337, 1933.
-  Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.
-  Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML ’13), volume 28, pages 1310–1318, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
-  Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
-  Phillip Porras, Hassen Saïdi, and Vinod Yegneswaran. A Foray into Conficker’s Logic and Rendezvous Points. In Proceedings of the 2nd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, LEET’09, pages 7–7, Berkeley, CA, USA, 2009. USENIX Association.
-  Stefano Schiavoni, Federico Maggi, Lorenzo Cavallaro, and Stefano Zanero. Phoenix: DGA-Based Botnet Tracking and Intelligence. In Sven Dietrich, editor, Detection of Intrusions and Malware, and Vulnerability Assessment, pages 192–211, Cham, 2014. Springer International Publishing.
-  C. E. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:623–656, 1948.
-  Toshiki Shibahara, Takeshi Yagi, Mitsuaki Akiyama, Daiki Chiba, and Takeshi Yada. Efficient dynamic malware analysis based on network behavior using deep learning. 2016 IEEE Global Communications Conference (GLOBECOM), pages 1–7, 2016.
-  Stanislav Skuratovich. MATSNU. Technical report, Check Point Software Technologies Ltd., May 2015.
-  Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Van Tong and Giang Nguyen. A method for detecting DGA botnet based on semantic and cluster analysis. In Proceedings of the Seventh Symposium on Information and Communication Technology, SoICT 2016, Ho Chi Minh City, Vietnam, December 8-9, 2016, pages 272–277, 2016.
-  Jonathan Woodbridge, Hyrum S Anderson, Anjum Ahuja, and Daniel Grant. Predicting domain generation algorithms with long short-term memory networks. arXiv preprint arXiv:1611.00791, 2016.
-  Sandeep Yadav, Ashwath Kumar, Krishna Reddy, A L. Narasimha Reddy, and Supranamaya Ranjan. Detecting Algorithmically Generated Domain-Flux Attacks With DNS Traffic Analysis. 20, 10 2012.
-  Bin Yu, Jie Pan, Jiaming Hu, Anderson Nascimento, and Martine De Cock. Character level based detection of DGA domain names. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN ’18), 2018.
-  Guodong Zhao, Ke Xu, Lei Xu, and Bo Wu. Detecting APT Malware Infections Based on Malicious DNS and Traffic Analysis. 3:1132–1142, 01 2015.
-  Yonglin Zhou, Qing-Shan Li, Qidi Miao, and Kangbin Yim. DGA-Based Botnet Detection Using DNS Traffic. J. Internet Serv. Inf. Secur., 3:116–123, 2013.