Inline Detection of DGA Domains Using Side Information

03/12/2020 ∙ by Raaghavi Sivaguru, et al. ∙ 0

Malware applications typically use a command and control (C&C) server to manage bots to perform malicious activities. Domain Generation Algorithms (DGAs) are popular methods for generating pseudo-random domain names that can be used to establish communication between an infected bot and the C&C server. In recent years, machine learning based systems have been widely used to detect DGAs. There are several well known state-of-the-art classifiers in the literature that can detect DGA domain names in real-time applications with high predictive performance. However, these DGA classifiers are highly vulnerable to adversarial attacks in which adversaries purposely craft domain names to evade DGA detection classifiers. In our work, we focus on hardening DGA classifiers against adversarial attacks. To this end, we train and evaluate state-of-the-art deep learning and random forest (RF) classifiers for DGA detection using side information that is harder for adversaries to manipulate than the domain name itself. Additionally, the side information features are selected such that they are easily obtainable in practice to perform inline DGA detection. The performance and robustness of these models are assessed by exposing them to one day of real-traffic data as well as domains generated by adversarial attack algorithms. We found that the DGA classifiers that rely on both the domain name and side information have high performance and are more robust against adversaries.


I Introduction

Domain Generation Algorithms (DGAs) are subroutines that generate pseudo-random combinations of characters or words, and output domain name strings [19]. DGAs often use a seed input such as a number, which is embedded as part of the code, or a time-based element such as the system date, time etc., or a combination of both, to generate random strings. These strings are then concatenated with an available top level domain (TLD) to form domain names. The key idea behind DGAs is to generate the same set of domain names when executed by two different machines, such as by a botmaster and on an infected machine, at a given time. The botmaster registers one of the generated domain names, while the infected machines systematically query the domains from the generated list until one of them is resolved. The domains from the list that have not been registered by the botmaster will typically result in an NXDomain (non-existent domain) response when queried, and can be discarded by the infected machine. This technique is often used by a command and control (C&C) center and an infected bot to establish communication and perform malicious activities as instructed by the C&C server.
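
The shared-seed idea described above can be illustrated with a minimal sketch. It is a toy generator written for this illustration, not the algorithm of any real malware family: the function name `toy_dga`, the seed string and the hashing scheme are all assumptions.

```python
import hashlib
from datetime import date

def toy_dga(seed, day, count=5, tld="com"):
    """Derive `count` pseudo-random domain names from a shared seed and a date."""
    domains = []
    for i in range(count):
        # The same (seed, date, index) triple always yields the same digest.
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits to the letters 'a'..'p' to form the label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(f"{label}.{tld}")
    return domains

# Two independent machines with the same seed and date obtain the same list.
botmaster_view = toy_dga("example-seed", date(2019, 2, 1))
bot_view = toy_dga("example-seed", date(2019, 2, 1))
```

Because both sides compute identical lists, registering any one of the names is enough for the two parties to meet, which is why blocking has to happen at the level of the generated names.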

Once communication between the infected machines and the botmaster has been established, the C&C server can issue commands to the bots to perform malicious activities such as distributed denial of service (DDoS) attacks, spamming, stealing sensitive information from the compromised machines, etc. In the past, malware authors used a predefined list of domain names, which was embedded in the malware, to communicate with the bots. This technique made it easy for the defenders to blacklist the malicious domain names and block further communication, effectively rendering the malware useless. To overcome this, modern C&Cs use DGAs to randomly generate domain names that are registered on the go, making them harder to detect. It is therefore important to identify the domains generated by DGAs and block them before they can be used to establish communication between the bot and the C&C center. There are several machine learning approaches proposed in the literature to address this issue including [4, 21, 22, 31, 33, 12, 27, 20] and other work that we cite later in this paper. These well known state-of-the-art classifiers can be deployed in real-world DNS applications to detect DGA domain names and block them. While some work focuses on detecting DGAs from NXDomains [22], our work aims to detect DGAs from traffic to domains that have already been resolved.

Commonly used approaches for DGA detection can be categorized according to how fast they are able to flag malicious activity in DNS traffic. As illustrated in table I, some techniques work in a retrospective manner, in which past DNS traffic, which is logged over a certain window, is analyzed in batches to detect anomalies. Other techniques work inline, meaning that they can detect DGA domains as soon as they are queried. There are two ways in which inline DGA detection can happen:

  • The domain first reaches the DGA classifier. If the classifier flags the domain as benign, the query is passed to the DNS resolver to fetch the resolved IP address of the domain. If the classifier flags the domain as DGA, the query is not forwarded to the DNS resolver and communication with that domain is blocked.

  • The domain is first resolved by the DNS resolver; the DGA classifier then uses features derived from the DNS response to decide whether the domain is DGA or not.

Our work fits into the second category of inline detection, where both the domain name and the side information features learned from the DNS query/response are used by the classifier for DGA detection. The side information features are carefully selected to allow inline DGA detection in the broader sense. In the strict sense, inline DGA detection means that the information required to determine whether a domain name is DGA or not is available from the DNS query data alone. A DNS resolver can use the strictly inline DGA classifier’s decision to determine if it is safe or not to resolve the query. A less conventional version of inline detection, which we refer to as “inline DGA detection in the broader sense”, is one where data attributes from DNS responses are required (in addition to DNS queries). This means that the DNS resolver must resolve the query first, feed the information obtained to the DGA classifier, and then use the DGA classifier’s decision to determine if it is safe to get the DNS response to the client or not. As we observe in our experimental results, taking information from DNS responses into account improves the ability of DGA classifiers to correctly detect DGA domains among resolvable traffic. We note that any dependence on data requiring queries to additional sources, such as the WHOIS database (as used for instance in [8, 6, 14]), would disqualify the approach from inline detection, even in the broader sense.
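
The difference between the strict and the broader sense of inline detection can be sketched as two control flows. The callables `resolve`, `classify_name` and `classify_with_response` are hypothetical interfaces introduced only for this illustration; they stand in for a DNS resolver and for trained classifiers.

```python
# Hypothetical interfaces, for illustration only:
#   resolve(domain)                         -> DNS response (dict) or None for NXDomain
#   classify_name(domain)                   -> True if the name alone looks like a DGA
#   classify_with_response(domain, resp)    -> True if name + side information look like a DGA

def strict_inline(domain, resolve, classify_name):
    """Strict sense: decide from the query alone, before resolving."""
    if classify_name(domain):
        return None              # blocked; the resolver is never contacted
    return resolve(domain)

def broad_inline(domain, resolve, classify_with_response):
    """Broader sense: resolve first, then decide whether to release the answer."""
    response = resolve(domain)
    if response is None:
        return None              # NXDomain: nothing to release
    if classify_with_response(domain, response):
        return None              # blocked; the response is withheld from the client
    return response
```

In the broader variant the resolver is always queried, so the classifier can use attributes of the response, at the cost of one resolution per blocked domain.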

Machine learning based approaches to detect DGA domain names in practice can also be categorized according to the information they leverage. One way is to train classifiers to detect DGA domain names using only the domain name string itself, see e.g. [22, 31, 33, 27, 7, 25, 35]. The alternative is to train the classifiers using context information such as the IP address of the domain, its geographic location, attributes from DNS response records etc. in addition to the domain name [21, 8, 32, 5, 15, 14]. In our work, we combine both approaches. The advantage of the former approach is that it does not require gathering of additional information, which may be expensive to collect in real time, and that it allows the defenders to detect the DGA domain names and block them even before they can be resolved. The advantage of the latter approach is that side information is a lot harder for the attacker to manipulate than the domain name string itself, making machine learning models trained on side information potentially more robust against adversarial attacks.

Adversarial machine learning is a research area focused on problems introduced by the use of machine learning techniques in adversarial environments in which an intelligent adversary attempts to exploit the weaknesses in such techniques [28]. The adversarial attacks of interest in this paper are evasion attacks in which an adversary uses artificially crafted instances, called adversarial samples, that are intentionally used to mislead a machine learning system and produce erroneous results. The goal of evasion attacks in the context of DGA detection is to generate domains that will be labeled as benign by the DGA classifier. The vulnerability of a classifier against evasion attacks is measured in terms of DGA detection rate, which is the proportion of the adversarial samples predicted as malicious by the classifier. Lower DGA detection rates indicate high vulnerability of the classifier to the attack. There exist several evasion attacks against DGA classifiers such as CharBot [17], DeepDGA [2], DeceptionDGA [26], MaskDGA [23] and the DGAs (HMM & PCFG-based) proposed by [9]. CharBot and MaskDGA are black-box targeted evasion attacks that do not require any knowledge about the DGA classifier and are intended to generate samples that can evade detection by any classifier. On the other hand, DeceptionDGA is a white-box attack algorithm that uses the knowledge of features used by the DGA classifier to generate evading instances specific to a given classifier. Both types of attacks are found to be extremely powerful in generating domains that can evade detection by the DGA classifiers with high probability.
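
As a small illustration of the DGA detection rate defined above, the following sketch computes it from a list of classifier predictions on adversarial samples. The function name and the 0/1 label encoding (1 for DGA) are assumptions made for this example.

```python
def dga_detection_rate(predictions):
    """Fraction of adversarial samples that the classifier labels as DGA (1)."""
    predictions = list(predictions)
    if not predictions:
        raise ValueError("need at least one adversarial sample")
    return sum(1 for p in predictions if p == 1) / len(predictions)

# Five adversarial domains, two of which are still flagged as DGA.
rate = dga_detection_rate([1, 0, 0, 1, 0])
```

Here two of five samples are detected, so the rate is 0.4; a lower value means the attack evades the classifier more often.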

The main contributions of our work are:

  • A comprehensive survey of lexical and side information features proposed in the literature on DGA detection.

  • An experimental evaluation of the feasibility in collecting the features and their effectiveness when deployed for inline detection of DGAs in real streams of passive DNS traffic, which leads to a shortlist of features that are actually beneficial in practice.

  • Experimental results that show how the side information features can make DGA classifiers more robust against adversarial attacks.

The paper is organized as follows. Section II gives an overview of related work in the fields of adversarial machine learning and DGA detection. Section III provides a detailed overview of side information features that can be extracted from DNS traffic to aid in the detection of DGA domains. In section IV, we list the 26 human engineered lexical features that are extracted manually from the domain name string in order to train the RF classifier for DGA detection. Section V gives an overview of the different classifiers we will be studying and attempting to harden against adversarial attacks. Section VI describes the experimental setup and reports all of our empirical results. Finally, section VII concludes the work.

II Related Work

Given the importance of being able to detect and block DGA domain related traffic, it comes as no surprise that the problem of automatic DGA detection has received a considerable amount of attention over the last decade. There are various ways in which existing DGA detection approaches differ from each other. As illustrated in table I, DGA detection can be categorized according to the kind of input that is required. Some techniques require just the domain name string, while other techniques require side information, or a combination of both. Both kinds of input have their own advantages and disadvantages. Methods that rely only on the domain name string are popular because side information is typically harder to obtain. On the other hand, features extracted from side information are harder to manipulate, making methods based on them more robust against adversarial attacks. All the approaches presented in our paper perform inline DGA detection using domain name only, side information only and a combination of both domain name & side information features.

Furthermore, the classifiers can be trained in two ways to detect if a given domain name is generated by a DGA or not. The first technique is the featureful approach, where the classifier relies on human engineered features extracted from the domain names. The second technique is the featureless approach, where the classifier learns the features automatically during the training process. Classifiers that are based on deep learning architectures like Long Short-Term Memory (LSTM) [31, 27] and Convolutional Neural Network (CNN) models [33, 20] leverage the featureless approach, whereas models such as random forests (RFs) adopt the featureful approach. In our work, we will be using both featureful and featureless approaches to train random forest and deep learning classifiers for DGA detection.

Input | Retrospective | Inline
Domain name string | [4, 18] | [27, 31, 22, 35, 10, 34, 33, 25, 7, 29]
Side information features | [13, 3, 32, 30, 11] | our work
Domain name string + side information features | [15, 16, 21, 24, 1, 5, 8, 6, 14] | our work
TABLE I: Overview of existing work on DGA detection

III Side Information Features

Fig. 1: An example DNS resource record

In this section we provide a detailed overview of side information features that can be extracted from DNS traffic to aid in the detection of DGA domains. An overview of all features is presented in table II, accompanied by a list of citations that illustrates the popularity of each kind of feature in the literature. The order of the side information features listed in table II indicates the importance of those features in DGA detection as ranked by the Random Forest model (see section V). Not all features are equally easy to obtain in practice, and their contribution to the predictive accuracy of DGA classifiers varies. The last column of table II indicates whether we retained the feature in our DGA classifiers. Figure 1 shows a sample resource record from which the side information features are extracted. In figure 1, the attribute “name” represents the fully qualified domain name (FQDN), “ttl” represents the time-to-live of the DNS query, “type” represents the resource record type, “class” represents the class of resource record and “data” represents the resolved IP address. Below we give a more in-depth description of each kind of feature, and its typical use in the literature on DGA detection. Figure 2 shows a comparison of density plots for some of the side information features extracted from benign and DGA domain names, illustrating their predictive power. The different side information features are as follows:

Feature Description Reference Retained
rrlength Resource record length [30]
country Country name that the domain maps to [6, 14, 15, 16]
ttl Time-to-live of the DNS query [16, 30]
n_ip Number of distinct IP addresses the domain maps to [5, 6, 14, 15]
qtype Type of DNS packet requested [30]
rtype Record type of the DNS response [30]
n_asn Number of distinct ASNs the domain maps to [11]
subnet Do all IPs belong to same subnet [14, 6]
n_countries Number of distinct countries the domain maps to [5, 6, 14, 15, 24]
timestamp Features derived from timestamp of the DNS query [13, 5, 15]
opcode Kind of DNS query [30]
AA Authoritative answer [30]
QDCOUNT Number of entries in question section [30]
ANCOUNT Number of resource records in answer section [30]
NSCOUNT Number of name servers in authoritative section [30]
ARCOUNT Number of resource records in additional record section [30]
RCODE Response code [30]
rDNS Reverse DNS query results [5, 6, 14]
TTL statistics Mean, standard deviation etc. of time-to-live [5, 6, 14, 15]
n_domains Number of distinct domains associated with the IP [5, 6, 14, 15]
n_queries Number of queries for the domain and (domain, IP) pair [15]
WHOIS features Registrar, domain creation/expiration date etc. [8, 6, 14, 16]
TABLE II: side information features
Fig. 2: Comparison of values for side information features extracted from benign and DGA domains

  • rrlength: This feature measures the length of the RData field, which is extracted directly from the DNS response resource record. The RData in a DNS response encompasses a list of resolved IP addresses, the time-to-live value of the query and the type of resource record.

  • country: This feature refers to the geographic location that the resolved IP address maps to. If the DNS resource record contains multiple IP addresses, the country for each of the IP addresses is first identified. If all of the IP addresses belong to the same country, then this feature takes up that name. On the other hand, if any of the IP addresses map to a different location, then the value of this feature would be “multi-valued”. Alternatively, if the location could not be identified, then this feature takes the value “unknown”. This feature is then converted to categorical values that range between 0 and 185, which means that the domains in our data set map to 184 different countries plus the values “multi-valued” and “unknown”.

  • ttl: This feature represents the time-to-live value of the DNS query, which is the time interval for which the resource record can be cached by the DNS resolver, and is directly obtained from the DNS response resource record. Table III compares the distribution of TTL values (in seconds), in terms of mean, standard deviation and median, for benign and DGA domains in our data set (see section VI-A). It can be seen that DGA domains are in general far more short-lived than benign domains. For better visibility in figure 2, the density plot for TTL values is shown in hours instead of seconds.

    Type Mean TTL SD TTL Median TTL
    Benign 109,447 1,421,829 3,600
    DGA 29,255 4,701,205 900
    TABLE III: TTL distribution (in seconds) for benign and DGA domains
  • n_ip: This feature indicates the number of distinct IP addresses that are returned for the DNS domain lookup. It is computed directly from the list of IPs contained in the RData field of the DNS response resource record.

  • qtype: This feature represents the DNS query type that can be extracted from the question section of the DNS query. Figure 2 shows the different values for this feature in our data set.

  • rtype: This feature represents the resource record type that can be extracted directly from the RData field in the DNS response resource record. Figure 2 shows the different values for this feature in our data set.

  • n_asn: This feature indicates the number of distinct autonomous system numbers (ASNs) that the IP addresses map to. The ASN for a given IP address is obtained by using the Python Geolite2 Maxmind API (https://geoip2.readthedocs.io/en/latest/).

  • subnet: This feature is a boolean value that represents if all the IP addresses belong to the same subnet. A value of 0 indicates that one or more of the IP addresses, returned in the DNS response, belong to a different subnet and value of 1 indicates that all the IP addresses map to the same subnet.

  • n_countries: This feature represents the distinct number of countries that the resolved IP addresses map to. This feature has a very similar distribution when compared to the “n_asn” feature, which can be observed in figure 2.

  • timestamp: The timestamp denotes the time at which the DNS query was issued by a host. This feature in itself may not be useful in detecting DGAs. Some of the past studies record all of the timestamps at which a particular domain name was queried and construct time-series data to analyze the periodicity at which the benign and DGA domains are queried [5, 13], whereas [15] computes the lifespan of a domain by subtracting the first and last seen timestamps of the domain name. Such approaches require access to past DNS traffic and hence are regarded as “retrospective”. Since we only focus on performing inline DGA detection in our work, we do not use the timestamp feature to perform DGA classification.

  • opcode: This feature represents the kind of query, such as standard query, inverse query, request for server status etc. In our data set, all the domains being queried belong to the standard query type and hence this feature does not contribute to the prediction of DGA domain names.

  • aa: This feature is a boolean flag which represents if the responding name server is an authority for the domain name being queried. The AA flag for all DNS responses in our data set has the same value “True” and hence we do not leverage the AA flag information while training our DGA classifiers.

  • qdcount, ancount, nscount, arcount: At this time, our DNS traffic collector does not capture this information and hence we do not use these features to train our model. However, they can be easily obtained from the DNS query and resource records.

  • rcode: Since our data set consists of resolved domain names only, the rcode remains “0” for all the samples and hence we discard this information.

  • TTL statistics: This refers to a collection of features such as standard deviation, mean, minimum, maximum etc. of all time-to-live values extracted from the DNS response. While these features are relevant in a retrospective approach that investigates a domain based on all DNS resource records related to it, say during the past 24 hours, they are not meaningful for fast inline DGA detection. Indeed, since all of the TTL values in a single response record have constant values, it would not add value to include these statistics as features.

  • n_domains: This feature represents the number of distinct domain names that are mapped to a given IP address. In order to use this feature, one needs to maintain a bipartite graph that depicts the mapping for each (domain, IP) pair. Again, this method of performing graph inference is computationally intensive and does not contribute towards inline detection of DGA domains. Therefore we refrain from using this side information feature while training our DGA classifiers.

  • n_queries: Similar to “timestamp” and “n_domains”, this feature also requires storing and fetching of information from past DNS traffic and hence n_queries cannot be used for inline detection of DGAs.

  • WHOIS features: Extracting WHOIS features such as registrar, domain creation/expiration date etc. involves very expensive WHOIS queries. This affects the capability of the classifier to perform inline DGA detection on-the-go and hence we do not use any feature that requires WHOIS queries.
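
To make the feature descriptions above more concrete, here is a minimal sketch that derives several of them from a single DNS response. It is an illustration under stated assumptions, not the extraction code used in this work: the record layout follows the attribute names of figure 1, `geo` is a hypothetical lookup table standing in for a GeoIP database, rrlength is approximated by the character length of the data fields, "same subnet" is taken to mean the same /24 (IPv4) or /64 (IPv6) network, and IPs without a known location are ignored when deriving the country.

```python
import ipaddress

def side_features(records, geo):
    """Compute a few side information features from one DNS response.

    records: non-empty list of dicts with the attributes of figure 1
             ("name", "ttl", "type", "class", "data").
    geo:     hypothetical lookup table {ip: (country, asn)}.
    """
    ips = sorted({r["data"] for r in records if r["type"] in ("A", "AAAA")})
    countries = {geo[ip][0] for ip in ips if ip in geo}
    asns = {geo[ip][1] for ip in ips if ip in geo}

    def network(ip):
        # Assumption: /24 for IPv4 and /64 for IPv6 define "the same subnet".
        prefix = 24 if ipaddress.ip_address(ip).version == 4 else 64
        return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

    if not countries:
        country = "unknown"
    elif len(countries) == 1:
        country = next(iter(countries))
    else:
        country = "multi-valued"

    return {
        "rrlength": sum(len(r["data"]) for r in records),  # assumed measure
        "country": country,
        "ttl": records[0]["ttl"],        # TTLs within one response are constant
        "n_ip": len(ips),
        "rtype": records[0]["type"],
        "n_asn": len(asns),
        "subnet": int(len({network(ip) for ip in ips}) <= 1),
        "n_countries": len(countries),
    }

records = [
    {"name": "example.com", "ttl": 300, "type": "A", "class": "IN", "data": "192.0.2.10"},
    {"name": "example.com", "ttl": 300, "type": "A", "class": "IN", "data": "192.0.2.20"},
    {"name": "example.com", "ttl": 300, "type": "A", "class": "IN", "data": "198.51.100.7"},
]
geo = {"192.0.2.10": ("US", 64500), "192.0.2.20": ("US", 64500),
       "198.51.100.7": ("DE", 64501)}
features = side_features(records, geo)
```

All of these values come from one response plus a local lookup table, which is what keeps them compatible with inline detection in the broader sense.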

IV Lexical Features

In this section we list the 26 human engineered lexical features that are extracted manually from the domain name string in order to train the RF classifier for DGA detection. Table IV shows a list of the lexical features used in B-RF and details on how the feature values are calculated are given below:

Feature Description Reference Retained
domain_len Domain name length [17, 25, 33, 7, 22, 10, 29, 6, 14]
sld_len Second level domain length [17, 25, 7]
tld_len Top level domain length [17, 25, 7]
uni_domain Domain Unique Characters length [17, 25, 7]
uni_sld SLD Unique Characters length [17, 25, 7]
uni_tld TLD Unique Characters length [17, 25, 7]
flag_dga Has malicious TLD [17, 25, 7, 10]
tld_hash TLD Hash [17, 25, 7, 33]
flag_dig Starts with Digit [17, 25, 7, 33]
sym Symbol ratio [17, 25, 7, 33]
hex Hex ratio [17, 25, 7, 33]
dig Digit Ratio [17, 25, 7, 22, 29, 5, 6, 14]
vow Vowel Ratio [17, 25, 7, 33, 22, 29]
con Consonant Ratio [17, 25, 7]
rep_char_ratio Ratio of Repeated Characters [17, 25, 22]
cons_con_ratio Ratio of Consecutive Consonants [17, 25, 22, 29]
cons_dig_ratio Ratio of Consecutive Digits [17, 25, 22]
tokens_sld Number of tokens in SLD [17, 25, 7, 10]
digits_sld Number of digits in SLD [17, 25, 7, 10]
ent Entropy of characters in SLD [17, 25, 7, 33, 22, 29]
gni Gini Index of characters in SLD [17, 25, 7, 33]
cer Classification error of characters in SLD [17, 25, 7, 33]
2gram_med 2-Gram Median of characters in SLD [17, 25, 7, 33]
3gram_med 3-Gram Median of characters in SLD [17, 25, 7, 33]
2gram_cmed 2-Gram Circle Median of characters in SLD [17, 25, 7]
3gram_cmed 3-Gram Circle Median of characters in SLD [17, 25, 7]
TABLE IV: Lexical features used by B-RF

  • domain_len: This feature represents the length of the domain name, which is the number of characters in the SLD.TLD pair. For example, we refer to “google.com” as the domain name, where “google” indicates the SLD (second level domain) and “com” indicates the TLD (top level domain). The value of the feature domain_len for the domain name “google.com” is 10.

  • sld_len: This feature represents the number of characters in the second level domain.

  • tld_len: This feature represents the number of characters in the top level domain.

  • uni_domain: This feature represents the number of unique characters in the domain name, after removing special characters such as ‘.’ & ‘-’ from the domain name.

  • uni_sld: This feature represents the number of unique characters in the second level domain, after removing special characters such as ‘.’ & ‘-’ from the SLD.

  • uni_tld: This feature represents the number of unique characters in the top level domain, after removing special characters such as ‘.’ & ‘-’ from the TLD.

  • flag_dga: This feature represents a boolean value (0 or 1) that indicates if the domain name contains any of the following TLDs, which are known to be frequently associated with malicious activities (https://www.spamhaus.org/statistics/tlds/): “study”, “party”, “click”, “top”, “gdn”, “gq”, “asia”, “cricket”, “biz”, “cf”.

  • tld_hash: This feature represents the hash value of top level domain.

  • flag_dig: This feature represents a boolean value that indicates if the domain name starts with a digit/number (0-9).

  • sym: This feature represents the ratio of number of special characters in the SLD to the total number of characters in SLD (sld_len).

  • hex: This feature represents the ratio of number of hexadecimal characters (0-9 & a-f) in the SLD to the total number of characters in the SLD.

  • dig: This feature represents the ratio of number of digits (0-9) in the SLD to the total number of characters in the SLD.

  • vow: This feature represents the ratio of number of vowels (‘a’, ‘e’, ‘i’, ‘o’, ‘u’) in the SLD to the total number of characters in the SLD.

  • con: This feature represents the ratio of number of consonants in the SLD to the total number of characters in the SLD.

  • rep_char_ratio: This feature represents the ratio of the number of characters that occur more than once in the SLD to the total number of unique characters in the SLD.

  • cons_con_ratio: This feature represents the ratio of consecutive consonants (such as “ct”, “fk”, “ns” etc.) to the length of the domain (domain_len).

  • cons_dig_ratio: This feature represents the ratio of consecutive digits (such as “92”, “24”, “75” etc.) to the length of the domain (domain_len).

  • tokens_sld: This feature represents the number of tokens in the SLD, where a token indicates sequence of characters separated by ‘-’.

  • digits_sld: This feature represents the total number of digits in the SLD.

  • ent: This feature represents the normalized entropy value of the characters in SLD and is calculated using the formula:

    $ent = -\frac{1}{\log n} \sum_{i=1}^{n} p_i \log p_i$

    where $n$ represents the number of unique characters in the SLD and $p_i$ represents the ratio of the frequency of the $i$-th unique character in the SLD to the total number of unique characters in the SLD.

  • gni: This feature represents the Gini value of the characters in SLD and is calculated using the formula:

    $gni = 1 - \sum_{i=1}^{n} p_i^2$

    where $n$ represents the number of unique characters in the SLD and $p_i$ represents the ratio of the frequency of the $i$-th unique character in the SLD to the total number of unique characters in the SLD.

  • cer: This feature represents the classification error of characters in SLD, which is computed using the formula:

    $cer = 1 - \max_{i} p_i$

    where $p_i$ represents the ratio of the frequency of the $i$-th unique character in the SLD to the total number of unique characters in the SLD.

  • 2gram_med: This feature represents the median of 2-gram frequencies in SLD.

  • 3gram_med: This feature represents the median of 3-gram frequencies in SLD.

  • 2gram_cmed: In order to compute this feature, the SLD of the domain is concatenated with itself. For example, if “google” is the SLD, the string “googlegoogle” is formed. The 2gram_med is then calculated on this newly formed string “googlegoogle” to obtain the value of this feature.

  • 3gram_cmed: In order to compute this feature, the SLD of the domain is concatenated with itself. For example, if “yahoo” is the SLD, the string “yahooyahoo” is formed. The 3gram_med is then calculated on this newly formed string “yahooyahoo” to obtain the value of this feature.
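
A subset of the lexical features described above can be sketched in a few lines. This is an illustrative approximation rather than the feature extractor used for B-RF. It assumes a non-empty SLD and a single-label TLD, treats ‘-’ as the only special character in the SLD, and, as a deliberate simplification, takes the proportion of each unique character to be its count divided by the SLD length so that the proportions sum to 1. The n-gram median features are omitted because they depend on how n-gram frequencies are obtained; only the “circle” string they operate on is shown.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
DIGITS = set("0123456789")
HEX_CHARS = DIGITS | set("abcdef")

def lexical_features(domain):
    """Compute a subset of the lexical features for an 'sld.tld' domain name."""
    domain = domain.lower()
    sld, tld = domain.rsplit(".", 1)          # assumes a single-label TLD
    n = len(sld)
    counts = Counter(sld)
    # Assumption: proportion = character count / SLD length.
    probs = [k / n for k in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    norm = math.log(len(counts)) if len(counts) > 1 else 1.0
    return {
        "domain_len": len(domain),
        "sld_len": n,
        "tld_len": len(tld),
        "uni_sld": len(set(sld) - {".", "-"}),
        "flag_dig": int(domain[0] in DIGITS),
        "sym": sum(c == "-" for c in sld) / n,
        "hex": sum(c in HEX_CHARS for c in sld) / n,
        "dig": sum(c in DIGITS for c in sld) / n,
        "vow": sum(c in VOWELS for c in sld) / n,
        "con": sum(c.isalpha() and c not in VOWELS for c in sld) / n,
        "tokens_sld": len(sld.split("-")),
        "digits_sld": sum(c in DIGITS for c in sld),
        "ent": entropy / norm,                 # entropy normalized by log(#unique)
        "gni": 1.0 - sum(p * p for p in probs),
        "cer": 1.0 - max(probs),
        "circle": sld + sld,                   # input string for the *_cmed features
    }

google = lexical_features("google.com")
```

For “google.com” this gives domain_len 10, sld_len 6, tld_len 3 and four unique SLD characters, matching the worked example in the domain_len description.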

V DGA Classifiers

We consider three different DGA classifiers in this work, which we detail below. We chose one model representative of the featureful approach (B-RF), one deep learning model which represents the featureless approach (LSTM.MI) and finally a hybrid model which combines both approaches (LSTM.MI+B-RF).

V-A B-RF

B-RF is a DGA classifier based on random forests. It consists of 100 trees and each tree is trained using a subset of the feature space to avoid overfitting. Entropy is used as the criterion to decide the split attribute while growing the trees in the random forest. There are 3 variants of B-RF classifier, each trained either on lexical features (as the RF classifier in [25]) or DNS features, or a combination of both lexical and DNS features. The performance of these variants of the B-RF classifier is listed in the first 3 rows of table VI.

V-B LSTM.MI

Woodbridge et al. [31] were the first to propose deep learning for DGA domain name detection. Their DGA classifier is a neural network consisting of an embedding layer, an LSTM layer, and a single node output layer with sigmoid activation. In this paper, we use the LSTM.MI model that was proposed recently by Tran et al. [27]. Its architecture is very similar to that of Woodbridge et al. [31]; the main distinction is that the LSTM.MI model is trained with a cost-sensitive learning algorithm that takes class imbalances into account. This allows the LSTM.MI approach to achieve slightly better results than the original LSTM approach (see [27, 25]). The 4th row in table VI shows the performance of the LSTM.MI classifier. It operates directly on the domain name string, instead of on lexical features extracted from it. Characters in the domain name are converted to lower case and are encoded with categorical values, ranging from 1 to 38, to represent ‘.’, ‘-’, digits from 0 to 9 & characters from ‘a’ to ‘z’. All the domains in our data are fixed to a length of 77 characters, which is the length of the longest domain name in our data set. Domains that are shorter than 77 characters are padded with zeroes on the left.
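
The input encoding described above can be sketched as follows. The exact assignment of integers to characters is an assumption (the text only fixes the range 1 to 38 and the character set); the names `ALPHABET`, `CHAR_TO_INT` and `encode_domain` are introduced for this illustration.

```python
# 2 symbols + 10 digits + 26 letters = 38 characters.
ALPHABET = ".-0123456789abcdefghijklmnopqrstuvwxyz"
# Codes start at 1 so that 0 stays free for padding (ordering is an assumption).
CHAR_TO_INT = {c: i + 1 for i, c in enumerate(ALPHABET)}
MAX_LEN = 77

def encode_domain(domain, max_len=MAX_LEN):
    """Lower-case the domain, map characters to 1..38 and left-pad with zeros."""
    codes = [CHAR_TO_INT[c] for c in domain.lower()]
    if len(codes) > max_len:
        raise ValueError("domain longer than max_len")
    return [0] * (max_len - len(codes)) + codes

encoded = encode_domain("Google.com")
```

Left padding keeps the informative characters at the end of the sequence, closest to the final LSTM state that feeds the output layer.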

V-C LSTM.MI+B-RF

The hybrid LSTM.MI+B-RF classifier combines both LSTM.MI and B-RF architectures by training a B-RF classifier with features listed in tables IV and II, in addition to the confidence score obtained from the LSTM.MI model for that domain name. The confidence score ranges between 0 and 1, signifying the probability of the domain being a DGA as predicted by the LSTM.MI classifier. The above workflow of DGA detection using LSTM.MI+B-RF setup is depicted in figure 3. The last two rows in table VI represent the performance of this DGA classifier.

Fig. 3: DGA detection using LSTM.MI + RF model

VI Experimental Results

VI-A Dataset

In the first experiment, we train and evaluate the DGA classifiers from section V on a dataset with 600,000 DGAs (positive) and 600,000 benign (negative) samples. Table V shows some examples of DGA & benign domains. The training data points originate from a real-time stream of passive DNS data, consisting of roughly 10-12 billion DNS queries per day collected from subscribers including ISPs (Internet Service Providers), schools, and businesses. From this traffic, the positive samples are collected by retaining resolved domain names that are listed in DGArchive (https://dgarchive.caad.fkie.fraunhofer.de/), a blacklist containing known DGA domains [19]. Dictionary DGAs, which are human-readable DGA domains belonging to malware families such as suppobox, gozi, matsnu and nymaim2 are discarded from the training set. This is because these DGAs look more like benign domains and confuse the DGA classifiers [18]. Since this work is primarily aimed at measuring the impact of adversarial instances such as CharBot, we exclude samples from Dictionary DGAs. The benign samples are collected based on a predefined set of heuristics as listed below:


  • Domain name should have valid DNS characters only (digits, letters, dot and hyphen)

  • Domain has to be resolved at least once for every day between June 01, 2019 and July 31, 2019.

  • Domain name should have a valid public suffix

  • Characters in the domain name are not all digits (after removing ‘.’ and ‘-’)

  • Domain should have at most four labels (a label is a sequence of characters delimited by dots)

  • Length of the domain name is at most 255 characters

  • Longest label is between 7 and 64 characters

  • Longest label is more than twice the length of the TLD

  • Longest label is more than 70% of the combined length of all labels

  • Domain is not an IDN (Internationalized Domain Name), such as a domain starting with xn--

  • Domain must not exist in DGArchive

Both the DGA and benign domains in the data set are collected from real-time passive DNS traffic that was observed in February 2019. The domains in the data set are then preprocessed by following the two steps mentioned below:


  • Retain only the SLD and TLD of the domain name and discard any 3LD (third level domain) or any other label if present. For example, for the domain name “www.google.com”, the 3LD “www” is removed and the SLD.TLD “google.com” is retained.

  • All the alphabetical characters in the domain name are converted to their corresponding lower case characters.
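The two preprocessing steps can be sketched as follows. The helper name `to_sld_tld` is hypothetical, and the sketch assumes a single-label TLD; multi-label public suffixes would need a public suffix list to be handled correctly.

```python
def to_sld_tld(domain: str) -> str:
    """Lower-case the name and keep only its last two labels (SLD.TLD)."""
    labels = domain.lower().strip(".").split(".")
    return ".".join(labels[-2:])


example = to_sld_tld("www.google.com")
```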

Benign domains (labeled 0) | DGA domains (labeled 1)
7ft4.com | vocom.eu
sgtobel.ch | leadhelp.net
intimvoronezh.net | 1b6a95e6b5d4.com
essc-tabriz.com | korpncyeajsgeatkopoqs.info
konsaltbezopasnost.ru | kndydusmrlrofrcmfuayfmswrkytl.biz
TABLE V: Some examples of benign vs DGA domain names

VI-B Performance evaluation of DGA classifiers

The true positive rate (TPR) and false positive rate (FPR) for the DGA classifiers are calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

where TP, TN, FP and FN represent the number of true positives, true negatives, false positives and false negatives respectively. The predictive performance of the classifiers is evaluated using 5-fold cross-validation for metrics such as TPR and Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), as tabulated in table VI. In cybersecurity applications, it is important to achieve a high TPR at a very low FPR, because blocking a large number of benign domains in real-world traffic interferes with users’ legitimate business. Hence all the reported metrics are thresholded at a very low FPR of 0.1%. We obtain the ROC curve by plotting the TPR against the FPR of the classifiers, and the AUC is the integral of the ROC curve. The AUC is a measure of how well the trained classifier can distinguish between the classes. Specifically, it can be interpreted as the probability that the classifier will output a higher score for a randomly chosen DGA domain than it would for a randomly chosen benign domain. An ideal classifier has an AUC score of 1, indicating it will always rank DGA domains higher than benign domains. This makes it possible to use the classifier to perfectly separate the classes via an appropriate threshold on its output scores. A classifier that just randomly guesses the outcome achieves an AUC of 0.5, and a classifier with AUC 0 has basically inverted all predictions, i.e. samples labeled as 0 are predicted as 1 by the classifier and vice versa. In addition to the AUC score, the AUC at a fixed FPR of 0.1% is also reported. This thresholded AUC represents the integral of the ROC curve for an FPR from 0 to 0.001.
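To make these metrics concrete, the following sketch computes ROC points, the TPR at a maximum FPR, and the thresholded (partial) AUC from labels and scores. The function names are hypothetical, and the partial AUC is returned as the raw area under the curve between FPR 0 and `max_fpr`; whether the values reported in the paper are additionally rescaled is not stated in the text, so no rescaling is applied here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (FPR, TPR)


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Point]:
    """Sweep the decision threshold from high to low and collect (FPR, TPR)."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def tpr_at_fpr(points: Sequence[Point], max_fpr: float) -> float:
    """Highest TPR reachable without exceeding the given FPR."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


def partial_auc(points: Sequence[Point], max_fpr: float = 1.0) -> float:
    """Trapezoidal integral of the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the curve at the cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): 1 = DGA, 0 = benign.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_points = roc_points(toy_labels, toy_scores)
```

Calling `partial_auc(toy_points)` with the default `max_fpr=1.0` gives the ordinary AUC, while `partial_auc(points, 0.001)` corresponds to the 0.1% FPR cut-off used in the paper.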

Model | Features | AUC@0.1%FPR | TPR@0.1%FPR
B-RF | DNS | 53.23% | 16.21%
B-RF | Lexical | 89.78% | 97.44%
B-RF | DNS + Lexical | 98.19% | 99.42%
LSTM.MI | Domain name string | 94.47% | 98.80%
LSTM.MI + B-RF | Domain name string + DNS | 96.51% | 99.89%
LSTM.MI + B-RF | Domain name string + DNS + Lexical | 99.17% | 99.91%
TABLE VI: Performance evaluation of DGA classifiers using 5-fold cross-validation

Several observations follow from table VI. First, looking at the AUC@0.1%FPR column, inline DGA detection based on DNS features alone performs poorly: the B-RF/DNS model achieves an AUC@0.1%FPR of only 53.23%. Second, for DGA detection based on the domain string alone, the deep learning approach (LSTM.MI) clearly outperforms the random forest approach (B-RF/Lexical), at 94.47% vs. 89.78%. This is fully in line with previous findings [27, 34]. Third, the most interesting and novel result from table VI is that the DGA classifiers trained with both lexical and side information features have the best overall performance: the architecture from figure 3 reaches an AUC@0.1%FPR of 99.17% and a TPR@0.1%FPR of 99.91%.

VI-C Real Traffic Analysis

Next, we apply the best performing classifiers in table VI to one day of real DNS traffic to evaluate their predictive performance in real time. We collected a set of resolved domains that were observed on March 26, 2019 to perform this analysis. As part of pre-processing, the fully qualified domain names are validated against the heuristics mentioned in section VI-A, in order to maintain consistency with the training data set. The domains that satisfy the heuristics are then retained in this experiment after discarding the third level domain (3LD/subdomain) from the domain name, if present. This resulted in a set consisting of 66,440,681 domains (including duplicate SLD.TLD pairs), out of which 1,159,662 domains were found in DGArchive and 14,653,217 domains were found in Alexa. There is also an overlap of 1,124,467 domains between the Alexa whitelist and the DGArchive blacklist.

Table VII compares the number of domains that were flagged as DGA by the LSTM.MI, B-RF and LSTM.MI+B-RF classifiers. The B-RF model has the highest true positive rate among the three models, measured here as the fraction of flagged domains that are listed in DGArchive: out of the 1.87M domains it flagged as DGA, approximately 61% were found in DGArchive. Although the LSTM.MI classifier catches the highest number of DGAs in real traffic, its true positive rate is 34%, which is 27 percentage points lower than that of the B-RF classifier. However, as seen in the last row of table VII, the B-RF also has the highest number of false positives, i.e. flagged domains that appear in Alexa. This could be due to the large overlap between Alexa and DGArchive mentioned earlier in this section. A good workaround to reduce the number of false positives during deployment is to check the flagged domains against Alexa before making the final decision.

Model | LSTM.MI | B-RF | LSTM.MI+B-RF
Features | Domain name | DNS + Lexical | Domain name + DNS + Lexical
Out of the 66M domains in real traffic, number of domains flagged as DGA by the classifier | 3,400,017 | 1,877,784 | 2,170,056
Out of the domains flagged as DGA by the classifier, number of domains found in DGArchive | 1,151,750 | 1,149,689 | 1,150,116
Out of the domains flagged as DGA by the classifier, number of domains found in Alexa | 1,626,232 | 1,717,638 | 1,420,319
TABLE VII: Real traffic analysis of DGA classifiers on 66,440,662 (66M) domains
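The percentages quoted in the discussion of table VII can be re-derived from its counts. The short sketch below divides the number of flagged domains found in DGArchive by the total number of flagged domains for each model; the variable names are illustrative, and the counts are copied from the table.

```python
# Counts copied from table VII.
flagged = {"LSTM.MI": 3_400_017, "B-RF": 1_877_784, "LSTM.MI+B-RF": 2_170_056}
in_dgarchive = {"LSTM.MI": 1_151_750, "B-RF": 1_149_689, "LSTM.MI+B-RF": 1_150_116}

# Fraction of flagged domains that are confirmed by DGArchive, per model.
fraction_confirmed = {model: in_dgarchive[model] / flagged[model] for model in flagged}

for model, fraction in fraction_confirmed.items():
    print(f"{model}: {fraction:.1%} of flagged domains are listed in DGArchive")
```

All three models recover a similar number of DGArchive domains, so the differences in this fraction come almost entirely from how many domains each model flags in total.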

VI-D Defense against Adversarial ML

The use of side information is important in the context of adversarial ML because side information is a lot harder to manipulate than the domain name string itself [8]. In order to test this, we generated 1,000 DGA domains with CharBot [17], a simple DGA algorithm that was written specifically to evade existing DGA classifiers. Since, to the best of our knowledge, CharBot has not been deployed yet in the wild, we cannot collect side information for CharBot domains from real traffic. Instead, we pair up the CharBot domains with the DNS features obtained from 1,000 randomly sampled DGA domains in real traffic. To avoid any bias in the selection of DNS features for CharBot domains, we perform the random sampling for 5 trials and create 5 sets of CharBot DNS features. The lexical features extracted for CharBot are appended with the DNS features, which can then be exposed to DGA classifiers for detection of malicious domains. The idea here is to test if the DGA classifiers trained on side information features are successful in detecting CharBot domains.
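The pairing procedure can be sketched as follows. The function name `pair_with_sampled_dns_features`, the fixed seed and the choice of sampling without replacement within a trial are assumptions; the text only states that DNS features of 1,000 randomly sampled real DGA domains are attached to the CharBot domains in each of 5 trials.

```python
import random
from typing import List, Sequence, Tuple

Features = Tuple[float, ...]


def pair_with_sampled_dns_features(
    charbot_domains: Sequence[str],
    dga_dns_features: Sequence[Features],
    n_trials: int = 5,
    seed: int = 0,
) -> List[List[Tuple[str, Features]]]:
    """For each trial, attach DNS features of randomly sampled real DGA domains."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    trials = []
    for _ in range(n_trials):
        # Assumption: sample without replacement, one feature vector per domain.
        sampled = rng.sample(list(dga_dns_features), k=len(charbot_domains))
        trials.append(list(zip(charbot_domains, sampled)))
    return trials


# Toy inputs (hypothetical values, not taken from the paper).
toy_domains = ["abcdxfgh.com", "exzmple.net", "domqin.org"]
toy_pool = [(float(i), float(i) * 2.0) for i in range(10)]
toy_trials = pair_with_sampled_dns_features(toy_domains, toy_pool)
```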

Table VIII shows the CharBot detection rate, which is the average proportion of CharBot domains that were flagged as DGA by the classifiers over the 5 randomized trials. Higher values of the CharBot detection rate indicate that the classifier is more robust against new DGAs or adversarial attacks. As expected, the B-RF model trained on both lexical and side information features detects 20% of CharBot domains as DGA/malicious, which is 12% more than the LSTM.MI model. This clearly indicates that the use of side information features to train the DGA classifier makes it more robust against adversarial samples like CharBot domains, when compared to classifiers that rely only on the domain name for DGA detection.
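The detection rate and its spread over trials can be computed as in the sketch below. The function names and the toy flags are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator underlies the ± values.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def detection_rate(flags: Sequence[bool]) -> float:
    """Proportion of CharBot domains flagged as DGA in one trial."""
    return sum(flags) / len(flags)


def summarize_trials(trial_flags: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Mean and (population) standard deviation of the per-trial rates."""
    rates: List[float] = [detection_rate(flags) for flags in trial_flags]
    return mean(rates), pstdev(rates)


# Hypothetical classifier decisions for two small trials.
toy_flags = [[True, False, False, False], [True, True, False, False]]
toy_mean, toy_std = summarize_trials(toy_flags)
```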

Classifier | Features | DGA (CharBot) detection rate
B-RF | DNS | ±
B-RF | Lexical | ±
B-RF | Lexical + DNS | ±
LSTM.MI | Domain name string | ±
LSTM.MI+B-RF | Domain name string + DNS | ±
LSTM.MI+B-RF | Domain name string + Lexical + DNS | ±
TABLE VIII: Detection rate of CharBot domains as DGA

VII Conclusion

In this paper, we proposed and evaluated state-of-the-art classifiers for inline DGA detection using side information features that are easily obtained from DNS queries and responses. Results from tables VI and VIII show that using side information in addition to the domain name to train classifiers not only improves their predictive performance, but also makes them more robust against adversaries like CharBot, when compared to classifiers that use just the domain name to detect DGAs. Additionally, the side information features in our approach are carefully chosen to perform lightweight inline detection of DGA domains, and do not rely on external sources such as WHOIS for feature extraction.

Acknowledgement. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Jonathan Peck is sponsored by a fellowship of the Research Foundation Flanders (FWO).

References

  • [1] J. Abbink and C. Doerr (2017) Popularity-based Detection of Domain Generation Algorithms. In Proceedings of the 12th International Conference on Availability, Reliability and Security.
  • [2] H. S. Anderson, J. Woodbridge, and B. Filar (2016) DeepDGA: Adversarially-Tuned Domain Generation and Detection. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, pp. 13–21.
  • [3] M. Antonakakis, R. Perdisci, W. Lee, N. Vasiloglou, and D. Dagon (2011) Detecting Malware Domains at the Upper DNS Hierarchy. In USENIX Security Symposium, Vol. 11, pp. 1–16.
  • [4] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, and D. Dagon (2012) From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware. In USENIX Security Symposium, Vol. 12, pp. 491–506.
  • [5] L. Bilge, S. Sen, D. Balzarotti, E. Kirda, and C. Kruegel (2014) Exposure: A Passive DNS Analysis Service to Detect and Report Malicious Domains. ACM Transactions on Information and System Security (TISSEC) 16 (4).
  • [6] T. Chin, K. Xiong, C. Hu, and Y. Li (2018) A Machine Learning Framework for Studying Domain Generation Algorithm DGA-based Malware. In International Conference on Security and Privacy in Communication Systems, pp. 433–448.
  • [7] C. Choudhary, R. Sivaguru, M. Pereira, B. Yu, A. C. Nascimento, and M. De Cock (2018) Algorithmically Generated Domain Detection and Malware Family classification. In International Symposium on Security in Computing and Communication, pp. 640–655.
  • [8] R. R. Curtin, A. B. Gardner, S. Grzonkowski, A. Kleymenov, and A. Mosquera (2019) Detecting DGA Domains with Recurrent Neural Networks and Side Information. In Proceedings of the 14th International Conference on Availability, Reliability and Security.
  • [9] Y. Fu, L. Yu, O. Hambolu, I. Ozcelik, B. Husain, J. Sun, K. Sapra, D. Du, C. T. Beasley, and R. R. Brooks (2017) Stealthy Domain Generation Algorithms. IEEE Transactions on Information Forensics and Security 12 (6), pp. 1430–1443.
  • [10] A. Joshi, L. Lloyd, P. Westin, and S. Seethapathy (2019) Using Lexical Features for Malicious URL Detection–A Machine Learning Approach. arXiv preprint arXiv:1910.06277.
  • [11] I. Khalil, T. Yu, and B. Guan (2016) Discovering Malicious Domains through Passive DNS Data Graph Analysis. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pp. 663–674.
  • [12] J. Koh and B. Rhodes (2018) Inline Detection of Domain Generation Algorithms with Context-Sensitive Word Embeddings. In Proceedings of 2018 IEEE International Conference on Big Data, pp. 2965–2970.
  • [13] J. Kwon, J. Lee, H. Lee, and A. Perrig (2016) PsyBoG: A Scalable Botnet Detection Method for Large-Scale DNS Traffic. Computer Networks 97, pp. 48–73.
  • [14] Y. Li, K. Xiong, T. Chin, and C. Hu (2019) A Machine Learning Framework for Domain Generation Algorithm DGA-Based Malware Detection. IEEE Access 7, pp. 32765–32782.
  • [15] P. Lison and V. Mavroeidis (2017) Neural Reputation Models learned from Passive DNS Data. In 2017 IEEE International Conference on Big Data, pp. 3662–3671.
  • [16] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker (2009) Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1245–1254.
  • [17] J. Peck, C. Nie, R. Sivaguru, C. Grumer, F. Olumofin, B. Yu, A. Nascimento, and M. De Cock (2019) CharBot: A Simple and Effective Method for Evading DGA Classifiers. IEEE Access 7, pp. 91759–91771.
  • [18] M. Pereira, S. Coleman, B. Yu, M. De Cock, and A. Nascimento (2018) Dictionary Extraction and Detection of Algorithmically Generated Domain Names in Passive DNS Traffic. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 295–314.
  • [19] D. Plohmann, K. Yakdan, M. Klatt, J. Bader, and E. Gerhards-Padilla (2016) A Comprehensive Measurement Study of Domain Generating Malware. In 25th USENIX Security Symposium, pp. 263–278.
  • [20] J. Saxe and K. Berlin (2017) eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys. arXiv preprint arXiv:1702.08568.
  • [21] S. Schiavoni, F. Maggi, L. Cavallaro, and S. Zanero (2014) Phoenix: DGA-based Botnet Tracking and Intelligence. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 192–211.
  • [22] S. Schüppen, D. Teubert, P. Herrmann, and U. Meyer (2018) FANCI: Feature-based Automated NXDomain Classification and Intelligence. In 27th USENIX Security Symposium, pp. 1165–1181.
  • [23] L. Sidi, A. Nadler, and A. Shabtai (2019) MaskDGA: A Black-box Evasion Technique Against DGA Classifiers and Adversarial Defenses. arXiv preprint arXiv:1902.08909.
  • [24] M. Singh, M. Singh, and S. Kaur (2019) Detecting bot-infected machines using DNS fingerprinting. Digital Investigation 28, pp. 14–33.
  • [25] R. Sivaguru, C. Choudhary, B. Yu, V. Tymchenko, A. Nascimento, and M. De Cock (2018) An Evaluation of DGA Classifiers. In 2018 IEEE International Conference on Big Data, pp. 5058–5067.
  • [26] J. Spooren, D. Preuveneers, L. Desmet, P. Janssen, and W. Joosen (2019) Detection of Algorithmically Generated Domain Names used by Botnets: A Dual Arms Race. In Proceedings of the 34th ACM/SIGAPP Symposium On Applied Computing, pp. 1902–1910.
  • [27] D. Tran, H. Mac, V. Tong, H. A. Tran, and L. G. Nguyen (2018) A LSTM based framework for handling multiclass imbalance in DGA botnet detection. Neurocomputing 275, pp. 2401–2413.
  • [28] Y. Vorobeychik and M. Kantarcioglu (2018) Adversarial Machine Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 12 (3), pp. 1–169.
  • [29] Z. Wang, Z. Jia, and B. Zhang (2018) A Detection Scheme for DGA Domain Names Based on SVM. In 2018 International Conference on Mathematics, Modelling, Simulation and Algorithms (MMSA 2018).
  • [30] L. Watkins, S. Beck, J. Zook, A. Buczak, J. Chavis, W. H. Robinson, J. A. Morales, and S. Mishra (2017) Using Semi-supervised Machine Learning to Address the Big Data Problem in DNS Networks. In 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC).
  • [31] J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant (2016) Predicting Domain Generation Algorithms with Long Short-Term Memory Networks. arXiv preprint arXiv:1611.00791.
  • [32] S. Yadav, A. K. K. Reddy, A. N. Reddy, and S. Ranjan (2012) Detecting Algorithmically Generated Domain-Flux Attacks with DNS Traffic Analysis. IEEE/ACM Transactions on Networking 20 (5), pp. 1663–1677.
  • [33] B. Yu, D. L. Gray, J. Pan, M. De Cock, and A. C. Nascimento (2017) Inline DGA Detection with Deep Networks. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 683–692.
  • [34] B. Yu, J. Pan, D. Gray, J. Hu, C. Choudhary, A. C. Nascimento, and M. De Cock (2019) Weakly Supervised Deep Learning for the Detection of Domain Generation Algorithms. IEEE Access 7, pp. 51542–51556.
  • [35] B. Yu, J. Pan, J. Hu, A. Nascimento, and M. De Cock (2018) Character Level Based Detection of DGA Domain Names. In Proc. WCCI, pp. 4168–4175.