This paper presents a method to identify infected computers from their HTTP traffic, which can be collected on the network perimeter, in a cloud proxy, or by an antivirus installed on monitored computers. Although the model assumes visibility of URL strings, which is decreasing due to the popularity of HTTPS, we believe the problem is still important and interesting because: (i) the solution presented here is general, and the framework can be used for other problems with a hierarchical structure; (ii) there is an established prior art which is well developed and non-trivial to outperform; (iii) URLs can still be collected directly at the endpoint before being encrypted, for example by an antivirus engine, virtual private network agents, browser extensions, or a white-hat version of the man-in-the-middle attack employed by some companies.
The prior art on using machine learning (ML) on URL strings is vast, and relevant works are briefly reviewed in Section VI. URL strings are challenging for ML models since they have an internal structure consisting of several blocks, each containing a variable number of tokens (possibly zero). Coping with this variability is a challenge solved in most prior art by two orthogonal approaches: the first [17, 40, 22, 18] converts the URL to a Euclidean space using hand-designed features; the second [33] avoids human-designed features by converting the URL string to a matrix with one-hot encoded characters. (In this representation, each column corresponds to one character of the URL, with the value one on the row given by the character's index and zeros elsewhere.)
Although this allows using convolutional neural networks (CNNs) or recurrent neural networks, these have to learn how to parse and interpret the structure of the string, which makes the learning unnecessarily complicated and opaque, since the structure is known.
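As an illustration, the one-hot representation can be sketched in a few lines of Python; the alphabet below is a hypothetical subset of the character set a real system would cover.

```python
import numpy as np

# Hypothetical character vocabulary; real systems cover the full allowed set.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789:/.-_?=&"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_url(url: str) -> np.ndarray:
    """Return a |alphabet| x len(url) matrix: column j has a single one
    on the row indexed by the j-th character of the URL."""
    m = np.zeros((len(ALPHABET), len(url)), dtype=np.int8)
    for j, c in enumerate(url.lower()):
        if c in CHAR_INDEX:          # unknown characters stay all-zero
            m[CHAR_INDEX[c], j] = 1
    return m

m = one_hot_url("evil.com/a")
```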
This work tries to fix the shortcomings of the prior art and extend it along three directions.
It presents a new model of URL strings, which (i) avoids human-designed features (as CNN-based models do), (ii) removes the need to truncate or pad URL strings to fit a predefined number of characters, and (iii) exploits the structure of URL strings.
Unlike most prior art, which models just individual URL strings, the model here utilizes the complete HTTP traffic of a single computer. This gives it the ability to detect infections that change the distribution of traffic to otherwise legitimate servers (e.g., adware), which models scrutinizing URL strings independently cannot do.
The hierarchy of the data reflected in the model allows extracting indicators of compromise and explaining the decision by identifying the parts of a sample responsible for a positive classification.
The model is learned end-to-end from labels on the level of a computer. The motivation for this is manifold: (i) according to Ker’s laws [13, 14], it should improve the accuracy of classification, which is based on more data; (ii) it moves the granularity of labeling from individual URL strings to the whole computer, which is simpler and more accurate. (Imagine that you know a computer is infected and you should label an HTTP request to google.com. This request can be due to a search invoked by the user or by malware checking its connection to the internet.)
(iii) using multiple URL strings gives the classifier the ability to detect malware that communicates only with legitimate servers but changes the distribution over them (see  for a formal proof). This type of malware might be undetectable by classifiers using only single URLs, as those look perfectly normal.
The proposed model is experimentally compared to representative prior art based on (i) hand-designed features and random forests (further called the R. Forest model) and (ii) a convolutional neural network (further called by its name, eXpose). Experiments are carried out on public and private datasets, and they are designed to evaluate accuracy on future and unseen malware samples (Grill test), accuracy in identifying the type of infection, and the dependency on the number of labeled samples.
The paper is structured as follows. Section II reviews multi-instance learning and particularly the method  on which the proposed structured model is based. Section III describes the proposed structured model of URL requests and the traffic of computers. Implementation of the model and prior art, datasets, and other experimental details are summarized in Section IV. Experimental results are presented in Section V and Section VII concludes the work.
II Multi-instance learning
The problem of multi-instance learning (MIL) has been introduced in  to solve problems where a sample (in MIL nomenclature called a bag) is composed of a set of vectors (in MIL nomenclature called instances) of an arbitrary number but of a fixed size (dimension). The pioneering work  assumed that labels exist on individual instances, but during training only labels on the level of bags are known. Later works adopt a more general formulation of the problem , which assumes the sample to be a probability distribution observed through a finite number of realizations of the corresponding random variable. The very same work has proposed to solve this problem using a combination of support vector machines with a probabilistic kernel. While their classifier converges to the optimal detector, it does not scale, as its worst-case complexity is $\mathcal{O}(n^2\bar{b}^2)$, where $n$ is the number of samples in the training set and $\bar{b}$ is the average size of a bag.
Refs. [30, 9] have independently proposed the simple classifier utilized in this work and outlined in Figure 1, which according to extensive experiments outperforms most prior art . Denoting a single sample (bag) as $x = \{x_1, \ldots, x_n\}$, the classifier is implemented by two feed-forward neural networks $\phi$ and $\psi$ with an element-wise aggregation between them. The first network $\phi$ embeds each instance into an $m$-dimensional space as $z_i = \phi(x_i)$; the element-wise aggregation then combines all projected instances into a single vector $\bar{z}$ of the same dimension $m$, which means the whole bag, with its arbitrary number of instances, is represented by a single vector. Finally, the network $\psi$ provides the final decision.
The simplest implementation of this classifier uses a single non-linear layer with rectified linear units for $\phi$, an element-wise average for the aggregation, and a linear function for $\psi$. The final classifier can be written as
$$f(x) = \psi\left(\frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\right).$$
Its main advantage with respect to prior art  is that since the whole scheme is differentiable, all parameters, including those of the instance-projection function, can be optimized using stochastic gradient descent and its variants. Furthermore, it has been shown in  that the construction is dense in the space of continuous functions from probability measures over the instance space to $\mathbb{R}$.
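As a concrete illustration of this construction, here is a minimal NumPy sketch with randomly initialized weights: one ReLU layer for the instance projection, an element-wise mean as the aggregation, and a linear output. The paper's implementation is in Julia; the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 16                       # instance dim, embedding dim (illustrative)
W_phi = rng.normal(size=(m, d))    # phi: a single ReLU layer
b_phi = np.zeros(m)
w_psi = rng.normal(size=m)         # psi: a linear output
b_psi = 0.0

def classify_bag(bag: np.ndarray) -> float:
    """bag: (n_instances, d). Embed each instance with phi, average
    element-wise, then apply the linear psi to get the bag score."""
    z = np.maximum(bag @ W_phi.T + b_phi, 0.0)   # phi on every instance
    pooled = z.mean(axis=0)                      # element-wise mean aggregation
    return float(w_psi @ pooled + b_psi)         # psi: final decision

score = classify_bag(rng.normal(size=(5, d)))
```

Since the aggregation is a mean, the score is invariant to the order of instances in the bag, reflecting the set (distribution) semantics of MIL.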
III Hierarchical model of the traffic of a computer
This section describes an application of the multi-instance learning approach to identifying an infected computer based on its HTTP network traffic observed on a network perimeter. The explanation starts by modeling the traffic under the assumption that each connection (message) is already represented by a vector. Then, a model of a URL is presented under the assumption that each token within it is represented by a vector. Finally, the representation of a token is detailed. Putting all three pieces together in reverse order gives rise to the full model demonstrated in this work. It also shows the power and flexibility of the modeling framework proposed in .
III-A Embedding the network traffic of computers
The model of computers’ traffic is adopted from our previous work . It assumes that the network traffic of a single computer can be modeled by the set of remote servers (identified by a domain) with which the computer has communicated, i.e. to which it has issued a request over the HTTP protocol. In the example in Figure 2, the computer is modeled by communication to three domains: evil.com, bbc.co.uk, and adnxs.com. Similarly, each hostname is modeled by the set of requests the particular computer has issued to the server. Again, in Figure 2 the domain bbc.co.uk is modeled by three HTTP requests: bbc.co.uk/index.html, bbc.co.uk/favicon.ico, and bbc.co.uk/banner.jpg.
The proposed model simplifies the complexity of the network traffic along two axes: (i) it assumes that domains do not interact with each other, i.e. it cannot properly model mash-up sites; (ii) it does not take time continuity into account, which means that it cannot model time dependencies within sequences. The first simplification is mainly technical, as it is difficult to attribute an HTTP request in a mash-up to its originator. The second removes problems with catastrophic forgetting and gradient explosion/vanishing in recurrent neural networks.
Despite these simplifications, the experiments below demonstrate that the expressive power remains high enough to identify almost all infected computers. We hypothesize that this is because an infected computer either has a very distinct probability distribution of requests issued by malware (for example, URLs sent to a command-and-control server have a special format), or the infection changes the distribution of the types of servers with which the computer interacts (for example, adware downloads a large number of ads, albeit from legitimate services).
III-B Embedding URL strings
The above model of computer traffic assumes that URL strings are already embedded in a Euclidean space. Since this is not straightforward, this section fills the gap by showing how this embedding is implemented (and optimized) using the MIL formalism.
The URL string is viewed as a Cartesian product (concatenation) of models of the hostname, the path and filename, and the query parameters. (The scheme, port, and user are not represented, but they can be added if needed.) Since each part can contain an arbitrary number of tokens (parts of the hostname, the file path, key-value pairs), the MIL framework is used again to handle this variability. Finally, each key-value pair in the query part is represented as a Cartesian product (concatenation) of the representations of its key token and its value token.
In the URL string evil.com/path/file?key=val&get=explit.js used in Figure 2, the hostname evil.com consists of tokens evil and com (modeled by the first MIL problem); the path and filename consist of tokens path and file (modeled by the second MIL problem); and finally the query parameters consist of pairs key=val and get=explit.js (modeled by the third MIL problem). The pair get=explit.js is represented by the key token get (one model) and the value token explit.js (another model). What remains to explain is how to embed individual string tokens into Euclidean space, which is left to the next section.
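The decomposition of a URL into the three token bags can be sketched with the Python standard library; this is a simplified sketch, not the paper's Julia/Mill.jl implementation.

```python
from urllib.parse import urlsplit, parse_qsl

def url_to_bags(url: str):
    """Split a URL into the three token bags used by the hierarchical model:
    hostname tokens, path/filename tokens, and (key, value) query pairs."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host_tokens = [t for t in parts.hostname.split(".") if t] if parts.hostname else []
    path_tokens = [t for t in parts.path.split("/") if t]
    query_pairs = parse_qsl(parts.query, keep_blank_values=True)
    return host_tokens, path_tokens, query_pairs

h, p, q = url_to_bags("evil.com/path/file?key=val&get=explit.js")
```

Each returned bag can have any number of tokens (including zero for a missing path or query), which is exactly the variability the nested MIL problems absorb.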
The proposed representation of URL strings and the network traffic offers a lot of flexibility, as the representation of each bag on each level of the hierarchy is parametrized by a feed-forward neural network. Specifically (see Figure 2), dedicated functions embed the tokens of the hostname, the path, and the keys and values in the query. Similarly, higher-level functions embed concepts such as key-value pairs in the query, URL strings, all traffic to a domain, and all domains the computer has contacted; finally, the topmost function utilizes the topmost embedding and classifies the computer. All feed-forward neural networks can contain an arbitrary number of layers and can implement recently proposed extensions such as dropout , layer normalization [11], etc. Needless to say, in the experiments below none of these extensions have been used, and all functions are simple single-layer networks, as detailed in Section IV.
III-C Embedding string tokens
What remains to explain is how to embed individual string tokens into Euclidean space. Although this is the subject of extensive research in natural language processing and machine translation, not all solutions can be trivially adapted to the domain of URL modeling, as the vocabulary of tokens in URLs is several orders of magnitude larger, because a token can be an arbitrary sequence of allowed characters, including Unicode. Since this problem is outside the scope of this work, the experimental section uses a very simple representation: each token is represented as a histogram of indexes of trigrams. Since the number of trigrams can be very high (in theory $256^3$, as Unicode is converted to its byte representation), the dimension is decreased by applying the modulo operation. In the experimental section, the modulo is a fixed prime. The rationale behind this choice is that (i) a prime is important for the distribution in hashing, and (ii) the value seems to be a good trade-off between the complexity of the network and the accuracy of the solution.
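A minimal Python sketch of this trigram hashing follows; the modulus D below is a hypothetical prime chosen for illustration, as the exact value used in the experiments is not restated here.

```python
import numpy as np

D = 2053  # hypothetical prime modulus, for illustration only

def trigram_histogram(token: str, dim: int = D) -> np.ndarray:
    """Histogram of byte-level trigrams, folded into `dim` bins by modulo."""
    b = token.encode("utf-8")        # Unicode tokens become byte sequences
    h = np.zeros(dim, dtype=np.float32)
    for i in range(len(b) - 2):
        # index of the trigram in the 256^3-sized space of byte triples
        tri = b[i] | (b[i + 1] << 8) | (b[i + 2] << 16)
        h[tri % dim] += 1.0
    return h

v = trigram_histogram("explit.js")
```

A token of k bytes contributes k - 2 trigrams, so the histogram mass equals the number of trigrams regardless of hash collisions.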
We have experimented with an alternative representation inspired by eXpose , where a token is seen as a sequence of one-hot encoded characters. This sequence is embedded into Euclidean space by first applying convolution windows of size three and then reducing their output by an element-wise average and maximum. Although the complexity of this solution is 4-5 times higher than that of the trigram-based embedding, and it required the implementation of a custom convolution operator that can handle tokens of different sizes, in our setting it was inferior to the embedding based on trigrams. It might be possible to improve this approach with sufficient tweaking, but we leave this option to future work.
III-D Extracting indicators of compromise
If we assume that the HTTP requests caused by malware are added to those caused by user interaction, then infected computers emit a mixture of normal and malicious requests. This means that neurons inside the trained network should be sensitive to malware tokens / URL strings / domains and insensitive to those of clean computers; each neuron is thus a weak indicator of compromise (IOC). Inspecting the tokens / URL strings / domains to which neurons at different levels of the hierarchy are sensitive has several benefits: (i) it helps to understand and verify the function of the neural network; (ii) it can reveal phenomena security analysts were not aware of; and finally (iii) it can help explain the decision and provide context, especially if the type of traffic can be described in a human-understandable language. Notice that the hierarchical structure of the proposed model allows selecting the level of granularity (tokens / URL strings / domains) a security researcher is interested in.
The exact algorithm for extracting the important parts of the traffic at a particular level of detail works as follows.
1. Calculate the average output of the neurons (at the given level) on normal traffic, further denoted by $\bar{a}$.
2. Calculate the output of the same neurons on the traffic of infected computers, further denoted as $B$. (Note that while $\bar{a}$ is a single vector with dimension equal to the number of neurons, $B$ is a set with cardinality equal to the number of URL strings / tokens or domains.)
3. For each neuron $i$ (dimension of the vectors $\bar{a}$ and $b \in B$), calculate the normalized score $s_i = b_i / \bar{a}_i$. The URL strings / tokens or domains with scores much greater or much smaller than one are the most characteristic of a given type of infection. Note that scores around one are non-indicative, as they are similar to the traffic of normal computers.
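The scoring step can be sketched in NumPy as follows; this is a simplified sketch, and the small epsilon guard against division by zero on dead neurons is our addition.

```python
import numpy as np

def ioc_scores(normal_acts: np.ndarray, infected_acts: np.ndarray) -> np.ndarray:
    """normal_acts: (n_normal, k) neuron outputs on normal traffic.
    infected_acts: (n_items, k) outputs on the items (tokens / URL strings /
    domains) of infected computers. Returns per-item, per-neuron ratios:
    values far from one mark items characteristic of the infection."""
    a_bar = normal_acts.mean(axis=0)   # average activation on normal traffic
    eps = 1e-8                         # guard against neurons that never fire
    return infected_acts / (a_bar + eps)
```

Sorting items by how far their scores deviate from one (in either direction) yields a ranked list of candidate indicators of compromise at the chosen level of the hierarchy.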
The above algorithm assumes that malware adds but does not remove HTTP requests. We can imagine cases where such removal occurs, for example if malware disables an anti-virus engine, which then ceases updating itself. But missing updates are not indicative of infection, since not all users have an anti-virus installed. We therefore believe that the additivity assumption does not have a significant impact on the extraction of IOCs.
III-E Explaining the decision
Neural networks have a bad reputation of being black-box models without any possibility of extracting an explanation of their decisions. In intrusion detection, this might prevent widespread adoption, as people tend not to trust systems they do not understand. Moreover, providing an explanation of a security incident to the analyst might simplify and speed up their investigation.
Similarly to the extraction of IOCs, the explanation exploits the structure of the model, and it also relies on the assumption that malware can only add HTTP requests but cannot remove them. (The question underlying this assumption is whether interfering with the user’s HTTP requests would make the malware visible to the user, who might notice that the device works differently, triggering an investigation leading to the identification of the infection.) This implies that by removing malware HTTP requests, the decision of the neural network can be changed from infected to clean. Although finding the smallest number of such requests is likely an NP-complete problem, a greedy approximation inspired by  performs surprisingly well.
The explanation is iterative: in each iteration, the set of requests to the hostname causing the biggest decrease of the classifier’s output is removed (in our implementation, positive means infected). The algorithm stops when no requests remain. The set of all removed requests is grouped by hostname, and the returned explanation has the form: “This computer was found infected because it has communicated with these servers.” This explanation can be further augmented by annotating the type of traffic the neurons are sensitive to, as described in the previous subsection, and using the activity of these neurons to improve the description. (The explanation on the level of requests to servers has been shown in our prior work in . Due to privacy concerns, this work demonstrates it on the publicly available CSIC  dataset at the lower level of tokens in the URL.)
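A minimal Python sketch of this greedy procedure follows. Stopping once the classifier's output drops below a threshold is one plausible reading of the iteration; all names here are illustrative, not the paper's API.

```python
def greedy_explanation(requests, classify, threshold=0.0):
    """Greedily remove, per iteration, the hostname whose requests cause the
    largest drop in the classifier output (positive = infected).
    requests: list of (hostname, url) pairs; classify: traffic -> score."""
    remaining = list(requests)
    removed_hosts = []
    while remaining and classify(remaining) > threshold:
        hosts = {h for h, _ in remaining}
        # try removing each hostname's requests; keep the most effective removal
        best = min(hosts,
                   key=lambda h: classify([r for r in remaining if r[0] != h]))
        removed_hosts.append(best)
        remaining = [r for r in remaining if r[0] != best]
    # "infected because it communicated with these servers"
    return removed_hosts
```

Each iteration costs one classifier evaluation per remaining hostname, so the approximation stays cheap even for verbose traffic.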
IV Implementation and experimental details
TABLE I: Statistics of the Corporate dataset per malware class.

Malware class           | Training set                            | Testing set
3 ad injector           | 46 | 15196 | 566709 | 451 | 9262340     | 33 | 1804 | 11684 | 214 | 89593
11 information stealer  | 13 | 5614  | 154761 | 175 | 1154500     | 7  | 526  | 7240  | 19  | 65701
12 mal. cont. dist.     | 18 | 8962  | 35993  | 716 | 121558      | 15 | 646  | 1379  | 138 | 6559
14 money scam           | 2  | 494   | 947    | 74  | 1220        | 2  | 23   | 42    | 20  | 53
15 anon. software       | 2  | 550   | 19368  | 69  | 31089       | 1  | 84   | 926   | 53  | 1716
16 banking trojan       | 2  | 6     | 5106   | 12  | 10578       | 1  | 2    | 294   | 10  | 620
17 spam tracking        | 8  | 1530  | 2860   | 305 | 20889       | 7  | 108  | 143   | 64  | 598
18 click fraud          | 8  | 584   | 3882   | 60  | 10344       | 5  | 34   | 186   | 23  | 385
This section details the implementation of the presented models together with the selected prior art, the datasets used for evaluation, the Grill test used to estimate accuracy on unseen malware, and finally the metric used to compare the classifiers. Unless said otherwise, each experiment has been repeated ten times.
IV-A The hierarchical model
The proposed hierarchical model based on multi-instance learning was implemented in the Julia language  using the Flux.jl  library for automatic differentiation and the authors’ library supporting nested multiple instance learning models and their Cartesian products available at https://github.com/pevnak/Mill.jl. The experimental section compares two models based on the proposed MIL framework.
MIL-5min is the classifier advocated in this work. It classifies a computer based on all network traffic observed during 5-minute windows, as has been described in the previous section and as is outlined in Figure 2. Its main advantage is that (i) it can be trained on coarse labels (whole computer vs. individual URLs) and (ii) it has more information (multiple URLs) upon which it can base its decision.
MIL-URL is a submodule of the MIL-5min model providing the decision on individual URL strings. It is outlined in the upper part of Figure 2 showing the representation of URL strings. The rationale behind introducing this model is that it allows direct comparison to the prior art, which works mostly on individual URL strings. The main drawback of this model is that it requires labels on the level of individual URL strings, similarly to most of the state of the art. The comparison of its accuracy to MIL-5min also shows that the drop in accuracy by training on coarse labels is negligible.
Unless said otherwise, the individual feed-forward networks consisted of a single layer of 80 neurons with ReLU non-linearity. The aggregation of bags used the mean and the maximum simultaneously, therefore increasing the dimensionality from 80 to 160. The rationale behind this is that both aggregation functions have their advantages. The mean is very good when malware abuses legitimate services and the infected computer exhibits a change in the probability distribution of the types of contacted servers . The maximum is good when malware contacts a single or a few servers with a very distinct pattern, for example a command and control channel. Utilizing both aggregation functions simultaneously provides the best of both worlds.
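The combined aggregation can be sketched in a few lines of NumPy:

```python
import numpy as np

def aggregate(embeddings: np.ndarray) -> np.ndarray:
    """Concatenate the element-wise mean and maximum over a bag of 80-d
    instance embeddings, producing a 160-d bag representation."""
    return np.concatenate([embeddings.mean(axis=0), embeddings.max(axis=0)])
```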
The last layer of the neural network is augmented by a linear layer with 16 neurons, as computers were classified into 16 classes: 15 classes according to the type of malware and one class deeming the computer clean. The loss function was the usual cross-entropy, with the only exception being that the error on clean computers was weighted higher than the error on infected computers. The rationale behind this is that in real use, the detection system will observe a two to three orders of magnitude higher number of clean computers than infected ones. A high false-positive rate is therefore devastating, as the network operator would be flooded with false alarms.
The ADAM stochastic gradient descent method with default settings was used, with the batch composed of computers for the MIL-5min model and of single URL strings for the MIL-URL model. The gradient descent was run for 50 000 iterations. Because loading the data was very time-consuming, the stochastic gradient descent used a circular buffer of size 5, which means that every minibatch was reused 5 times. Thus, although the SGD used 50 000 steps, it has seen only 10 000 "new" mini-batches.
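The buffering scheme can be sketched in Python; this is a simplified sketch, since the actual training loop is in Julia/Flux.jl.

```python
from collections import deque
import itertools

def minibatches_with_reuse(loader, buffer_size=5, reuse=5):
    """Yield minibatches from `loader`, revisiting buffered batches so that
    each newly loaded batch results in `reuse` optimizer steps; this keeps
    the optimizer busy while data loading is slow."""
    buf = deque(maxlen=buffer_size)
    for batch in loader:
        buf.append(batch)
        # cycle through the buffer, taking `reuse` batches per new batch
        for b in itertools.islice(itertools.cycle(buf), reuse):
            yield b
```

With 10 000 loaded batches and reuse=5, the generator yields 50 000 steps, matching the reported ratio of SGD steps to fresh mini-batches.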
IV-B Prior art
The proposed solution has been compared to two approaches: manually designed features used in a random forest classifier  (called R. Forest) and an approach based on a convolutional neural network  (called eXpose). They were selected as they represent complementary state-of-the-art approaches in the prior art.
The R. Forest classifier  uses a set of 398 hand-designed features with a random forest classifier to separate URL strings of benign and malicious applications. This approach is a good prototype of an industry workhorse, as random forests are very robust and hand-designed features allow incorporating a lot of domain knowledge into the solution. The set of features proposed in  has also been used, for example, in . We have reimplemented these features ourselves, and thanks to the authors we could verify their correctness, as they provided us with a set of URLs and the corresponding feature values. Random forests used the implementation from https://github.com/bensadeghi/DecisionTree.jl in the Julia  language. In all experiments below, each forest contained 100 trees with a maximum depth of 30, and we left all other settings at their defaults, which the authors of  confirmed as reasonable. To maximize the diversity of trees within the forest, each tree has been trained on its own random subset of all training URLs. Although we were not able to train the forest on all URLs in the training set, as we were limited by the 64 GB of memory of a single m4.x8 instance, the model still used more labeled samples than the MIL-URL classifier. In all experiments, each tree has been trained on a random subset of negative samples and the same number of positive samples.
eXpose , inspired by the success of convolutions in digital images, builds a classifier of URLs using a convolutional neural network. Specifically, eXpose truncates or pads all URL strings to 200 characters so that they all have the same size. Using a one-hot encoding of characters, a URL is then converted into a binary matrix of fixed size, which allows using a common stack of convolution, reduction, and fully connected layers known from image recognition. eXpose was implemented exactly as described in , with the only difference being that samples have been classified into 16 categories (clean + 15 malware categories) instead of benign/malicious. eXpose was implemented in the Julia  language using the Flux.jl  library for neural networks. We have used the ADAM optimizer  with a minibatch of size 256 (mandated by the limit of GPU memory), and it was allowed to train on Amazon’s p2.xlarge GPU instance.
IV-C Corporate dataset
The main dataset of HTTP traffic was collected from more than 500 large customers of Cisco’s Cognitive Threat Analytics (CTA)  during one month from October 2017 till November 2017. An inherent limitation of the CTA engine is that it discards 93% of the observed traffic and keeps only the most suspicious 7%. Nevertheless, the dataset still contains a vast number of URL strings. The training part contains all traffic collected in October, and the testing set contains traffic collected in November.
URL strings were labeled using Cisco’s internal blacklist based on hostnames combined with regular expressions. The blacklist is curated by senior security officers, and it is accurate in the sense that it contains minimal false positives (connections made by legitimate applications but attributed to malware), but we admit that there might be false negatives, i.e. it can contain URL strings made by malware yet classified as legitimate. We argue that almost all datasets will suffer from this type of error, for two reasons. First, it is at this moment impossible to obtain an accurately labeled large dataset. Second, as was mentioned in the motivation of this paper, even senior officers can have difficulties labeling some URL strings. A typical example is connections to google.com made by malware to check if it is connected to the internet.
We have preferred the private blacklist over labeling using public services such as VirusTotal (VT), since (i) they are prone to false positives, (ii) we do not have sufficiently high quotas to query VT for every observed domain, (iii) it is not trivial to infer labels from the results provided by VirusTotal , and (iv) the labeling would suffer from the same problems as our internal blacklist. Again, we do not think that possible false negatives in the dataset make the results less credible. Details and statistics about the dataset are shown in Table I. Although the testing set does not contain any new malware campaign, approximately 12.57% of its domains are not present in the training set.
IV-D HTTP dataset CSIC 2010
The CSIC 2010 dataset  is a public collection of URL strings created to test web attack protection systems. It contains automatically generated web requests targeted to an e-commerce web application developed by the Group of Information and Communication Technologies at the Institute for Physical and Information technologies.
The dataset contains 36 000 normal URL strings and more than 25 000 anomalous ones, including samples of the following attacks: SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, XSS, server-side include, parameter tampering, etc.
The dataset contains traffic to only a single domain from a single computer, and the number of attack types is very limited. It therefore allows assessing accuracy only on individual URLs and does not mimic well the scenario of interest in this work. On the other hand, it is accurately labeled, and its public availability is good for reproducible science, which is difficult in the field of network security. The experiments on this dataset presented in Section V-E should therefore be treated as a supplement to the main experiments on the Corporate dataset, which contains orders of magnitude more URL strings.
IV-E Grill tests
In this paper, all classifiers have always been evaluated on future data to observe the effect of aging. Yet, since the labels in the future data are created using the same blacklist as the data in the training set, they will be correlated, as a large number of domains and malware families will be present in both the training and testing sets. This makes it difficult to estimate how well a classifier detects new types of malware, infections, and migrations to new domains, which is precisely the type of accuracy practitioners are interested in.
An experimental protocol originally proposed in  aims to rigorously measure this type of generalization. Below, we present its minor variation allowing a straightforward comparison to the baseline, where the labels in the training set are not manipulated. Below, the test is called the Grill test as a tribute to its main author and also because it puts classifiers at the edge of their capabilities.
The Grill test was executed on two levels: hostname and malware families. For simplicity, it is explained for hostnames, but its variant with malware families is straightforward.
The Grill test assumes that each positive (malicious) sample is attached to a hostname, which in the case of URLs is trivial. In the beginning, all hostnames of positive samples from the testing set are randomly divided into $k$ folds. Then, $k$ classifiers are trained, where the training set of each classifier has the samples with a hostname matching those in the corresponding fold either relabelled or removed. During classification, if the true label of a sample is positive, its hostname has to belong to one of the folds (this property is ensured by the fact that the folds are created from data in the testing set), and the sample is classified using the classifier which did not have this hostname in its training set. If a sample is negative, its hostname does not belong to any fold, and the output is calculated as an average over all classifiers. Thus, although a total of $k$ classifiers are trained in the Grill test, they act as a single classifier, and positive samples are always classified using the classifier with the corresponding hostnames removed or relabelled in its training set.
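The evaluation logic of the protocol can be sketched in Python. The hash-based fold assignment below is an illustrative, deterministic stand-in for the random division, and all names are hypothetical.

```python
import hashlib

def grill_folds(hostnames, k=5):
    """Deterministically assign each positive-sample hostname to one of k
    folds (a stand-in for the random division used in the protocol)."""
    return {h: int(hashlib.md5(h.encode()).hexdigest(), 16) % k
            for h in set(hostnames)}

def grill_predict(sample, fold_of, classifiers):
    """Positive samples are scored by the classifier whose training set had
    the sample's hostname removed or relabelled; negative samples are scored
    by the average over all k classifiers."""
    host, label = sample["host"], sample["label"]
    if label == 1:
        return classifiers[fold_of[host]](sample)
    return sum(c(sample) for c in classifiers) / len(classifiers)
```

The k classifiers thereby behave as a single classifier whose training set never contained the hostname of the positive sample being scored.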
The rationale behind relabelling and removing is the following. Relabelling simulates a scenario where the blacklist is not accurate and contains false negatives, which means that the training set contains malware samples labeled as benign. Removing simulates a scenario where a new malware family appears after the classifier was trained, i.e. its samples are not present in the training set at all.
We believe that the Grill test is important for practitioners, as it demonstrates how well a classifier can detect new threats. Moreover, since the protocol used in this paper preserves the number of positive and negative samples in testing, it can be directly compared to the case where training and testing data are shifted just by time.
The Grill test used the Corporate dataset, with training data collected in October and testing data in November.
IV-F Evaluation metrics
Although all classifiers are trained to solve a multi-class problem, they are first compared on the binary problem of identifying clean vs. infected computers. This is because, in practice, false alarms and missed detections are more important than an incorrect identification of the malware type. On this binary problem, classifiers are compared using the Precision-Recall curve  known from information retrieval. The rationale behind this is that, unlike the ROC curve, it is sensitive to the class ratio, and it shows how many infected computers can be identified (recall) and what fraction of the computers classified as infected are truly infected (precision). Since most experiments are repeated ten times, the plots contain average PR curves together with a band of one standard deviation.
Precision-Recall curves (PR curves) are estimated and plotted on two levels of granularity. The microscopic operates on the level of individual URL strings, and its main purpose is to observe how well are classifiers trained. However, for practical purposes the macroscopic precision-recall curve on the level of users is more important, where all traffic of a single user is treated as a single sample. This shows how many computers would require attention by a diligent staff investigating every security incident. The decisions of classifiers on individual URLs (MIL-URL, Conv. Random Forest) or on individual five-minute windows (MIL-5min) are aggregated using maximum, i.e.
f(u) = max_{x in u} f(x), where f(x) is the output of the classifier on a sample x (URL or five-minute window) and u is the set of samples of one user. An alternative to the maximum is the mean, f(u) = (1/|u|) sum_{x in u} f(x), which would correspond to counting alarms, or an aggregation learned from the data. But according to discussions with security officers, the maximum mimics well the functionality of real intrusion detection systems, which trigger on every alarm.
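The user-level aggregation can be sketched as follows; the per-sample scores are hypothetical classifier outputs, and both the maximum (used here) and the mean (the alternative mentioned above) are shown for contrast.

```python
# Sketch of the user-level aggregation: per-sample classifier outputs
# f(x) are reduced to one score per user; max mimics an IDS that fires
# on any single alarm, mean corresponds to counting alarms.
def aggregate(per_sample_scores, how="max"):
    if how == "max":
        return max(per_sample_scores)
    if how == "mean":
        return sum(per_sample_scores) / len(per_sample_scores)
    raise ValueError(how)

# hypothetical outputs on four URLs of one user
scores = [0.1, 0.05, 0.97, 0.2]
user_score_max = aggregate(scores, "max")    # 0.97
user_score_mean = aggregate(scores, "mean")  # ~0.33
```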
V Experimental results
This section presents the experimental comparison of the proposed classifiers MIL-URL and MIL-5min with the prior art based on human-designed features and Random Forests (further called Random Forest) and with the eXpose classifier based on convolutional neural networks (further called Convolution).
The classifiers are compared on four problems (or datasets). The first, called classification in the future, corresponds to the case when the classifier is trained on data from the past (October 2017) and used on data from the future (November 2017). This comparison mimics the application in practice well. The second, called the Grill test, uses the modified training sets described above to estimate the accuracy on unseen or unlabelled data. The third, called CSIC, is similar to the first case but is performed on the publicly available CSIC dataset. The fourth problem uses the same scenario as the first one but measures the accuracy of the classifiers in identifying the type of infection.
V-A Classification in the future
Figure 3(a) shows precision-recall curves of the classifiers of individual URLs (the MIL-5min classifier is therefore missing). The proposed MIL-URL model offers only a slightly lower recall than the detector based on Random Forests, but with markedly higher precision. eXpose, based on convolutional neural networks, is inferior to both R. Forest and MIL-URL. Figure 3(b) shows precision-recall curves of the same classifiers and also of MIL-5min when one sample corresponds to one computer, a scenario important for practitioners. We observe that both MIL-5min and MIL-URL dominate the prior art across the whole precision-recall space. Both classifiers keep precision above 95 percent with similar recall, whereas the prior-art solutions have precision below 50 percent with worse recall.
MIL-5min seems to be slightly better than the MIL-URL classifier. This should not be surprising, as MIL-5min has more information about the infected computer on which to base its decision, as is suggested by Ker's laws [13, 14]. The superiority of MIL-5min is also important for label acquisition, since to train this classifier it is sufficient to have labels on the level of 5-minute windows. These labels are simpler to obtain and therefore more precise than labels on the level of individual URLs.
The fact that Random Forests achieve good precision in classifying individual URLs but worse performance when the results are aggregated on the level of computers suggests that they suffer from a higher false-positive rate. (While the loss in precision might seem puzzling, it is caused by the presence of computers infected by noisy but easily detectable malware. When the granularity moves to a single computer, this large number of malicious URLs collapses to one sample, but a single false positive from a clean computer is still counted as a single false positive.) To conclude, the proposed solutions based on multiple-instance learning seem to be more precise in metrics that better simulate practical applications.
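The precision collapse described above can be illustrated with a small worked example; the numbers below are illustrative, not measured values from the paper.

```python
# Worked toy example: one infected computer emitting 1000 correctly
# detected malicious URLs, one clean computer with a single
# false-positive URL. URL-level precision looks excellent...
url_tp, url_fp = 1000, 1
url_precision = url_tp / (url_tp + url_fp)  # ~0.999

# ...but on the computer level, both collapse to a single decision each,
# so the same classifier shows precision of only one half.
computer_tp, computer_fp = 1, 1
computer_precision = computer_tp / (computer_tp + computer_fp)  # 0.5
```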
V-B Dependency on training set size
Models based on neural networks, particularly deep ones, have a reputation for requiring a large number of samples. This section demonstrates that in the problem of URL classification, the Random Forest classifier requires even more. In fact, the R. Forest classifier used in the above section (Figure 3) used more samples than the MIL-URL classifier. To study how the accuracy of the different classifiers depends on the number of samples, we varied (i) the number of training steps of the MIL classifiers and (ii) the number of samples per tree in the R. Forest classifier, as these two hyper-parameters control the number of samples used in training the corresponding classifiers. All other experimental settings were kept the same as above. Note, though, that in MIL-URL and in Random Forests one sample corresponds to one URL string, whereas in MIL-5min one sample corresponds to all URL strings collected during 5 minutes. Figure 4 shows the PR curves of the classifiers. We observe that MIL-5min features the highest stability, as the model trained on 4 million samples has almost the same accuracy as the one trained on the 10 million samples used in Figure 3. In contrast, the R. Forest classifier features the lowest stability, as it continues to improve significantly even at the largest sample budget considered, far larger than that of either MIL model. The eXpose classifier was omitted from this study due to its poor performance and very high computational complexity (see Section V-H below).
V-C Generalization to unseen malware and new domains
In this section, the detection quality of the classifiers on new hostnames / malware families is evaluated, which shows how well the classifiers generalize outside the training set. Again, due to its poor performance and high computational cost, eXpose was omitted from this experiment. (One experiment requires 5 classifiers and was repeated 10 times.)
PR curves of the Grill test are shown in Figure 5, where in the top row a sample corresponds to a single URL and in the bottom row a sample corresponds to a single computer. We observe that in both cases, the MIL models generalize better than the R. Forest models. In almost all scenarios, the behavior is as expected: classifiers trained on all known samples are better than classifiers with missing / mislabelled samples. The exception is the R. Forest classifier evaluated on the level of computers, whose accuracy improved when some malware families were missing or mislabelled in the training data. We believe this is a result of overfitting to some malware families prevalent in the training set yet rare in the testing set of 3rd November. Contrary to our expectations, the model classifying 5-minute windows is less robust than the model classifying individual URLs, which we cannot explain at the moment, but it may be caused by the smaller number of available training samples.
V-D Identification of the type of infection

| Infection type | eXpose | R. Forest | MIL-5min |
| mal. cont. dist. | 8.7 | 4.7 | 71.5 |
Table II shows the accuracy of the eXpose, R. Forest, and MIL-5min classifiers in identifying the type of infection of the computer. In almost all cases (except Trojan, ad-injector, and click fraud), the MIL-5min classifier is better than the prior art, sometimes very significantly. In the cases where the MIL approach is worse, it lags behind the best classifier only by a small margin.
V-E CSIC dataset
This section compares all three evaluated approaches on the publicly available CSIC dataset of HTTP requests. Since the dataset is rather small, contains only binary labels, and all HTTP requests target the same host (http://localhost:8080/), the architectures of the classifiers were mildly altered to reflect this.
Specifically, the MIL-URL model modeled only the path and query parts, as there is no reason to model the hostname, which is constant. The number of neurons in each layer was decreased to 40 (from the 80 used on the corporate dataset) and the number of training iterations was decreased to 10 000. Finally, the cross-entropy loss was computed over just two classes (benign / malicious).
In eXpose, the hostname was removed from the URL string, since it is constant across the dataset and, since the maximum length of a URL is limited to 200 characters, keeping the hostname would decrease the expressive power of the model. Similarly to MIL-URL, the number of training iterations was decreased to 10 000.
The Random Forest classifier was left almost intact, with the exception that the training set for each tree was a random subset containing 80% of the training set. The rationale was to increase the diversity of the ensemble and thereby improve robustness.
Figure 6 shows the average Precision-Recall curves of 10 repetitions, where all available data were randomly split into a training set containing 80% of the samples and a testing set containing 20%. The sampling was stratified, which means that the class ratio was preserved. On this small problem, eXpose was mildly better in terms of recall than the MIL classifier, and the R. Forest classifier was the worst. We believe that the superiority of eXpose is due to the simplicity of this problem: the number of possible attacks was much lower, the attacks had very distinct signatures, and the problem is nine orders of magnitude smaller than the previous problems on corporate networks.
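The stratified 80/20 split used above can be sketched as follows; this is a minimal illustration of class-ratio-preserving sampling, not the paper's experimental code.

```python
# Sketch of a stratified 80/20 split: sampling within each class
# separately preserves the overall class ratio in both partitions.
import random

def stratified_split(labels, train_frac=0.8, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return train, test

labels = [0] * 80 + [1] * 20  # toy dataset with a 4:1 class ratio
train, test = stratified_split(labels)
```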
V-F Extraction of IOCs
Indicators of compromise were extracted as described in Section III-D on the level of tokens in the path and in the query. The extraction on the level of traffic to domains was shown in  and we cannot repeat it here due to privacy restrictions on the Corporate dataset. Figure 7 therefore shows the score, i.e. the ratio of the outputs of neurons on anomalous and on normal URL strings. According to these values, tokens in the path and values in the query seem to be more important for the detection of malicious URL strings than keys in the query string.
For the path, the token with the overall highest output was 6909030637832563290.jsp.old, for keys cantidadA, and for values Espriella+Morcossessionid%3D12312312%26+username%3D%253C%2573%2563%2572%2569%2570%2574%253E%2564%256F%2563%2575%256D%2565%256E%2574%252E%256C%256F%2563%2561%2574%2569%256F%256E%253D%2527%2568%2574%2574%2570%253A%252F%252F%2561%2574%2574%2561%2563%256B%2565%2572%2568%256F%2573%2574%252E%2565%2578%2561%256D%2570%256C%2565%252F%2563%2567%2569%252D%2562%2569%256E%252F%2563%256F%256F%256B%2569%2565%2573%2574%2565%2561%256C%252E%2563%2567%2569%253F%2527%252B%2564%256F%2563%2575%256D%2565%256E%2574%252E%2563%256F%256F%256B%2569%2565%253C%252F%2573+%2563%2572%2569%2570%2574%253E%3F. Although we do not know the precise meaning of the key ("cantidad" is Spanish for "amount, quantity"), the value is a doubly-escaped string which, after decoding twice (for example with the service https://www.motobit.com/util/url-decoder.asp) and treating + as an encoded space, reads sessionid=12312312& username=<script>document.location='http://attackerhost.example/cgi-bin/cookiesteal.cgi?'+document.cookie</script>?, suggesting a likely cookie stealing through cross-site scripting.
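The double unescaping can be verified programmatically; the sketch below applies Python's urllib.parse.unquote twice to the leading bytes of the observed value, recovering the opening HTML tag of the injected script.

```python
# Double URL-decoding of the doubly-escaped value: applying unquote
# twice peels off both layers of percent-encoding.
from urllib.parse import unquote

doubly_escaped = "%253C%2573%2563%2572%2569%2570%2574%253E"  # prefix only
once = unquote(doubly_escaped)   # '%3C%73%63%72%69%70%74%3E'
twice = unquote(once)            # '<script>'
```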
V-G Explaining the decision
The explanation algorithm described in Section III-E has been used on the CSIC dataset on tokens of the filename and on keys and values. Again, the privacy restrictions on the Corporate dataset prevent us from using it in this test. As an example, the explainer is applied to the positively classified URL string http://localhost:8080/tienda1/miembros/imagenes/zarauz.jpg/6909030637832563290.jsp.old. The contributions of the individual tokens to the prediction of the model are listed in Table III. The most important token is 6909030637832563290.jsp.old, which nicely corresponds to the indicators of compromise extracted in the previous section.
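The paper's explanation algorithm (Section III-E) is not reproduced here; the snippet below only sketches the generic leave-one-token-out idea of attributing a prediction to individual tokens, with a hypothetical scorer standing in for the trained model.

```python
# Generic leave-one-token-out attribution sketch (NOT the paper's
# Section III-E algorithm): a token's contribution is the score drop
# observed when it is removed from the input.
def token_contributions(tokens, score_fn):
    full = score_fn(tokens)
    return {t: full - score_fn([u for u in tokens if u != t])
            for t in tokens}

# hypothetical scorer: flags the suspicious '.jsp.old' double extension
def toy_score(tokens):
    return 0.9 if any(t.endswith(".jsp.old") for t in tokens) else 0.1

contrib = token_contributions(
    ["tienda1", "zarauz.jpg", "6909030637832563290.jsp.old"], toy_score)
```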
V-H Computational complexity
This section compares the complexity of all four models, namely the time to train the model, the time to classify all URLs in the traffic of 5-minute windows of 1000 infected users (81090 URLs in total), and the size of the model, measured as the size of the serialized Julia model. Since the serialized model contains the data and all structures defining the model, we believe it to be a good proxy for the model's description length.
The training and classification times of the MIL and Random Forest classifiers were measured on a single m5d.4xlarge AWS instance (64 GB of memory, 16 virtual Intel Xeon Platinum 8175M CPUs); the times of the Convolution NN were measured on a single p2.xlarge AWS instance (60 GB of memory, 4 virtual Intel Xeon E5-2686 CPUs with a single Tesla K80 GPU). Times are measured end-to-end, which means that they include all preprocessing, including the feature extraction in the case of Random Forest.
| Classifier | Training time | Classification time | Model size |
| R. Forest | 126 100 | 145.8 | 7.9M |
The measurements are shown in Table IV. The solution based on eXpose is the fastest to train, which is caused by the small mini-batches mandated by the limited GPU memory and also by employing a GPU, which for this type of task easily leads to 10 times faster training. On the other hand, eXpose is the largest model and its classification time is the second highest, caused by slow loading of data onto the GPU. Random Forest was the most expensive to train, and its classification time is the highest, more than 20 times that of MIL and 1.5 times that of eXpose.
Among the remaining models, the training time of MIL-URL is the fastest, while that of MIL-5min is the second slowest. This discrepancy is caused by the size of the mini-batches of 5-minute windows in MIL-5min, which is approximately 50 000 URLs (as opposed to the 5000 URLs of the MIL-URL model). On the other hand, in both cases the classification time is more than an order of magnitude faster than that of the prior art, and the MIL models also have the lowest complexity as measured by the size of the stored serialized model.
VI Related work
The evolution of methods to detect malicious URLs follows the evolution of machine learning. Early, but still used, methods [17, 40, 22, 18] rely on human-designed features. For example,  uses the length of the URL, the frequency of selected characters, or the occurrence of special tokens in the query part. For each type of feature, it designs a method to detect anomalies, such as the chi-square test for the frequency of characters. A detector of spam, phishing, and malware in  uses the number, average length, and maximum length of the domain, path, and query tokens, together with spam, phishing, and malware SLD hit ratios and brand-name presence. The most complete set of features known to us [22, 18] was used in the experimental section in the Random Forest classifier. The feature set contains (i) characterizations of distinct patterns in the URL string, including the length of the URL, the vowel ratio, the consonant ratio, the number of special characters ('!', '-', '_', ',', '@', '#', '%', '+', ':', ';'), the upper-case ratio, the lower-case ratio, and the proportion of digits; (ii) common statistical features of domain names, including the number of domain-name levels, the character-type distribution ratio, and the top-level domain name; (iii) the overall length of the path and the number of directories; (iv) and finally the filename suffix and its length.
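A few of the human-designed features listed above can be sketched as follows; the exact definitions and feature names are illustrative, not those of the cited works.

```python
# Hedged sketch of human-designed URL features of the kind listed above
# (length, vowel ratio, special characters, digit proportion, number of
# directories, filename suffix); definitions here are illustrative only.
from urllib.parse import urlparse

SPECIALS = set("!-_,@#%+:;")

def url_features(url):
    s = url.lower()
    letters = [c for c in s if c.isalpha()]
    path = urlparse(url).path
    filename = path.rsplit("/", 1)[-1]
    suffix = filename.rsplit(".", 1)[1] if "." in filename else ""
    return {
        "length": len(s),
        "vowel_ratio": sum(c in "aeiou" for c in letters) / max(len(letters), 1),
        "special_chars": sum(c in SPECIALS for c in s),
        "digit_ratio": sum(c.isdigit() for c in s) / len(s),
        "num_dirs": path.count("/"),
        "suffix_len": len(suffix),
    }

f = url_features("http://example.com/a/b/evil.php?id=123")
```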
Refs.  and  avoid the design of features by creating a dictionary of observed tokens or of their 3- to 8-grams, and represent the URL by the one-hot encoded presence of these entries. These works build upon the progress in training high-dimensional linear models on very sparse samples.
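The dictionary / n-gram representation can be sketched as follows; the tiny dictionary is illustrative, whereas the cited works build dictionaries over whole training corpora and use sparse vectors.

```python
# Sketch of the character-n-gram bag representation: the URL becomes an
# indicator vector over a dictionary of observed 3- to 8-grams, suitable
# for large sparse linear models (dense here only for readability).
def char_ngrams(s, n_lo=3, n_hi=8):
    return {s[i:i + n] for n in range(n_lo, n_hi + 1)
            for i in range(len(s) - n + 1)}

def one_hot(url, dictionary):
    grams = char_ngrams(url)
    return [1 if g in grams else 0 for g in dictionary]

# toy dictionary built from two strings seen "in training"
dictionary = sorted(char_ngrams("evil.php") | char_ngrams("index.html"))
vec = one_hot("http://x.com/evil.php", dictionary)
```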
Recent advances in neural networks from vision and language modeling were used in [33, 19] and , where the URL is treated as a sequence of characters truncated to 200 characters, as according to the references 95% of URLs are shorter. The CNN from  is used in the comparison. Unlike the proposed method, it has to learn to parse the URL in order to utilize its structure.
The closest works to this one are  and our earlier work , which divide the URL into a hostname, path, and query, modeled separately either by a convolution over tokens embedded into a Euclidean space using word2vec  or by the multiple-instance learning framework . The proposed model can be seen as an evolution that uses a more sophisticated model of the query and models all traffic of the computer instead of a single URL (which gives the model the ability to be trained from coarse labels and to detect infections that merely change the distribution of traffic to otherwise legitimate servers).
In , URL features are supplemented by information about the server, such as data from WHOIS, IP prefix, autonomous system number, geographic location, and connection speed. Although the presented model can be modified to use these features, they were avoided since we wanted to compare to the prior art on URLs only.
As some malware has a very specific header of the HTTP request, Ref.  creates a template for each family and measures the distance from it. Thus, it essentially builds a 1-nearest-neighbor classifier with a custom distance. The proposed hierarchical models can include a model of the HTTP header, yet such data were not available to us, and, as mentioned above, utilizing them would limit the comparison to the prior art.
So far, we have reviewed models detecting individual URL strings. Recognizing their deficiencies, in  the subject of classification is the set of connections between a host and a particular server. The representation is inspired by models from computer vision, but it cannot be easily extended to traffic to multiple servers. A model for the same subject is proposed in , but its scaling to higher dimensions is dubious. Moreover, unlike the presented model, both methods rely on hand-crafted features.
The clustering of binaries executed in a sandbox based on their URLs is treated in . To calculate the distance, it proposes a weighted sum of the Levenshtein distance between strings, the Jaccard index between parameters, etc. Since the goal of the proposed work is classification, the distance function of  is not well suited for this problem, as k-NN or SVM classifiers do not scale.
Last, we mention our previous work , where a network host is modeled by a set of domains and each domain by the set of HTTP messages exchanged with it (see Section III-A). While  requires URL strings to be described by a set of features, the proposed model extends it such that it requires a representation of the individual string tokens only (for example by 3-grams).
VII Conclusion
The main goal of this work was to replicate, in the field of network security, the success of convolutional neural networks in computer vision and other areas in removing human-designed features. This has been achieved by removing temporal and spatial dependencies and by nesting multiple-instance learning problems. The proposed framework can classify the set of all connections from a single computer while relying only on features describing string tokens. To the best of our knowledge, this is the first work of its type.
Experimental results have demonstrated that the proposed model outperforms, sometimes significantly, the prior art in the problem of identifying infected computers within a computer network and classifying the type of infection.
We believe that the proposed approach will serve as a blueprint for how domains with complicated hierarchies can be elegantly handled by straightforward nesting of multiple-instance learning problems. We have therefore released a library simplifying this task at https://github.com/pevnak/Mill.jl. We also believe that there is a lot of space for further improvement, for example by utilizing local dependencies through convolutions.
-  Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artif. Intell., 201:81–105, August 2013.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
-  Karel Bartos, Michal Sofka, and Vojtech Franc. Optimized invariant representation of network traffic for detecting unseen malware variants. In 25th USENIX Security Symposium (USENIX Security 16), pages 807–822, 2016.
-  Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. julialang.org/publications/julia-fresh-approach-BEKS.pdf, 2017.
-  Hyunsang Choi, Bin B Zhu, and Heejo Lee. Detecting malicious web links and identifying their attack types. WebApps, 11(11):218, 2011.
-  Andreas Christmann and Ingo Steinwart. Universal kernels on non-standard input spaces. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 406–414. Curran Associates, Inc., 2010.
-  Marek Dědič. Hierarchical models of network traffic. B.S. thesis, České vysoké učení technické v Praze. Vypočetní a informační centrum., June 2017.
-  Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1):31–71, 1997.
-  Harrison Edwards and Amos Storkey. Towards a Neural Statistician. 2 2017.
-  Martin Grill and Tomáš Pevný. Learning combination of anomaly detectors for security domain. Computer Networks, 107:55–63, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Flux: Elegant machine learning with Julia. Journal of Open Source Software, 2018.
-  Andrew D Ker. Batch steganography and pooled steganalysis. In International Workshop on Information Hiding, pages 265–281. Springer, 2006.
-  Andrew D. Ker. The square root law of steganography: Bringing theory closer to practice. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, IH&MMSec ’17, pages 33–44, New York, NY, USA, 2017. ACM.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Jan Kohout and Tomáš Pevný. Network traffic fingerprinting based on approximated kernel two-sample test. IEEE Transactions on Information Forensics and Security, 13(3):788–801, 2018.
-  Christopher Kruegel and Giovanni Vigna. Anomaly detection of web-based attacks. In Proceedings of the 10th ACM conference on Computer and communications security, pages 251–261. ACM, 2003.
-  Ke Li, Rongliang Chen, Liang Gu, Chaoge Liu, and Jie Yin. A method based on statistical characteristics for detection malware requests in network traffic. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pages 527–532. IEEE, 2018.
-  Steven Z Lin, Yong Shi, and Zhi Xue. Character-level intrusion detection based on convolutional neural networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.
-  Justin Ma, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. Beyond blacklists: learning to detect malicious web sites from suspicious urls. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1245–1254. ACM, 2009.
-  Justin Ma, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. Identifying suspicious urls: an application of large-scale online learning. In Proceedings of the 26th annual international conference on machine learning, pages 681–688. ACM, 2009.
-  Lukas Machlica, Karel Bartos, and Michal Sofka. Learning detectors of malicious web requests for intrusion detection in network traffic. arXiv preprint arXiv:1702.02530, 2017.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
-  Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 10–18. Curran Associates, Inc., 2012.
-  Roberto Perdisci, Wenke Lee, and Nick Feamster. Behavioral clustering of http-based malware and signature generation using malicious network traces. In NSDI, volume 10, page 14, 2010.
-  A Perez-Villegas, C Torrano-Gimenez, and G Alvarez. Applying Markov chains to web intrusion detection. In Proceedings of Reunión Española sobre Criptología y Seguridad de la Información (RECSI 2010), pages 361–366, 2010.
-  Tomás Pevný and Vojtech Kovarík. Approximation capability of neural networks on spaces of probability measures and tree-structured domains. CoRR, abs/1906.00764, 2019.
-  Tomáš Pevný and Ivan Nikolaev. Optimizing pooling function for pooled steganalysis. In 2015 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2015.
-  Tomas Pevný and Petr Somol. Discriminative models for multi-instance problems with tree structure. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec ’16, pages 83–91, New York, NY, USA, 2016. ACM.
-  Tomáš Pevný and Petr Somol. Using neural network formalism to solve multiple-instance problems. In Fengyu Cong, Andrew Leung, and Qinglai Wei, editors, Advances in Neural Networks - ISNN 2017, pages 135–142, Cham, 2017. Springer International Publishing.
-  R Rajalakshmi and Chandrabose Aravindan. An effective and discriminative feature learning for url based web page classification. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1374–1379. IEEE, 2018.
-  J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465 – 471, 1978.
-  Joshua Saxe and Konstantin Berlin. eXpose: A character-level convolutional neural network with embeddings for detecting malicious urls, file paths and registry keys. arXiv preprint arXiv:1702.08568, 2017.
-  Marcos Sebastián, Richard Rivera, Platon Kotzias, and Juan Caballero. Avclass: A tool for massive malware labeling. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 230–253. Springer, 2016.
-  Fernando Silveira and Christophe Diot. Urca: Pulling out anomalies by their root causes. In 2010 Proceedings IEEE INFOCOM, pages 1–9. IEEE, 2010.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
-  Cisco systems Inc. Cognitive threat analytics, 2019.
-  C. J. van Rijsbergen. Information Retrieval. London: Butterworths, 1975.
-  Jonathan Woodbridge, Hyrum S Anderson, Anjum Ahuja, and Daniel Grant. Predicting domain generation algorithms with long short-term memory networks. arXiv preprint arXiv:1611.00791, 2016.
-  Apostolis Zarras, Antonis Papadogiannakis, Robert Gawlik, and Thorsten Holz. Automated generation of models for fast and precise detection of http-based malware. In 2014 Twelfth Annual International Conference on Privacy, Security and Trust, pages 249–256. IEEE, 2014.
-  Ming Zhang, Boyi Xu, Shuai Bai, Shuaibing Lu, and Zhechao Lin. A deep learning method to detect web attacks using a specially designed CNN. In ICONIP 2017, pages 828–836, 2017.