Deep Anomaly Detection in Packet Payload

12/05/2019 ∙ by Jiaxin Liu, et al. ∙ Sichuan University

With the widespread adoption of cloud services, especially the extensive deployment of Web applications, it is important and challenging to detect anomalies in the packet payload. For example, the anomalies in the packet payload can be expressed as a number of specific strings which may cause attacks. Although some approaches have achieved remarkable progress, their applicability is limited because they depend on in-depth expert knowledge, e.g., signatures describing anomalies or application-level communication protocols. Moreover, they might fail to detect payload anomalies that have long-term dependency relationships. To overcome these limitations and adaptively detect anomalies in the packet payload, we propose a deep learning based framework which consists of two steps. First, a novel feature engineering method is proposed to obtain block-based features via block sequence extraction and block embedding. The block-based features encapsulate both the high-dimension information and the underlying sequential information, which facilitates anomaly detection. Second, a neural network is designed to learn the representation of the packet payload based on Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). Furthermore, we cast anomaly detection as a classification problem and stack a Multi-Layer Perceptron (MLP) on the above representation learning network to detect anomalies. Extensive experimental results on three public datasets indicate that our model achieves a higher detection rate while keeping a lower false positive rate compared with five state-of-the-art methods.

1 Introduction

The rapid increase of cloud services brings remarkable convenience to our daily life and promotes the Internet economy. However, cloud services face abundant threats from malicious attackers. According to the annual report of Micro Focus [18], disclosed vulnerabilities related to Web applications grew by almost 51% in 2017, and nearly 95% of Web applications are vulnerable to sensitive data exposure, which could cause great harm to the usage of cloud services. Therefore, accurate detection of anomalies in network traffic is highly desirable. To this end, a variety of methods have been developed, which can be roughly classified into three categories: rule-based methods, flow-based methods, and packet-based methods.

As one typical method of rule-based anomaly detection, Torrano-Gimenez et al. (2011) applied a feature selection method called Generic-Feature-Selection to construct domain-specific rules for Web application firewalls. By adopting and integrating these technologies (Torrano-Gimenez et al. (2015); Lin and Tseng (2004)), a number of powerful tools have been developed for constructing domain-specific rules from known threats, such as Suricata [27] and Snort [26]. These tools use a highly efficient engine to discover malicious traffic by comparing the extracted signatures with the predefined rules. If malicious traffic is detected, actions can be taken to protect the cloud services. Although the rule-based methods are effective for known threats, they heavily depend on in-depth expert knowledge, e.g., signatures describing anomalies.

Recently, some machine learning methods have been proposed to detect traffic anomalies. There are two popular directions, which use flow-based information and packet-based information, respectively. Flow-based anomaly detection usually treats the representation of network traffic as a type of time series Zhou et al. (2019); Zhang et al. (2018); Ren et al. (2017). Xu et al. (2005) used the five-tuple to construct comprehensive behavior profiles of network traffic in terms of communication patterns of end-hosts and services. The anomalies are detected by exploring the correlation between the traffic behaviors and the corresponding characteristics. da Silva et al. (2016) presented a framework called ATLANTIC which uses similarities of flows to detect threats in traffic flows. These methods can achieve competitive performance when the anomalies manifest in flow-based behaviors. However, they do not perform well for some kinds of attacks, e.g., shell-code and SQL injection, which do not express abnormal characteristics in flow-based information.

Packet-based anomaly detection can unveil anomalies by inspecting the packet payload, which refers to the user data of a network packet. The objective of packet-based anomaly detection is to discover possible attacks that have potential abnormal characteristics in the packet payload. The anomalies might appear as a number of specific strings. For example, SQL injection, one of the most common attacks, injects anomalous code such as “' or 1=1 --” into the conditional statements of SQL queries to make them always true. To detect this kind of anomaly in the packet payload, a variety of methods have been proposed. PAYL was proposed in Wang and Stolfo (2004), which used the 1-gram frequency distribution of the packet payload as features to detect network anomalies. McPAD was then proposed in Perdisci et al. (2009), which developed a modified feature extraction method for accurate anomaly detection. More recently, deep learning technology has been explored for payload anomaly detection. Several studies (Marin et al. (2018); Zhang et al. (2017); Qin et al. (2018)) investigated using raw measurements to detect payload anomalies. These methods use deep learning technologies to automatically extract features from the packet payload. However, the performance of payload anomaly detection is still unsatisfactory due to the incomplete representation of features. As shown in Figure 1, there are two types of packet payload anomalies that have different distributions of anomalous bytes. Unlike short-term packet payload anomalies, whose anomalous bytes are concentrated, the anomalous bytes of long-term anomalies are scattered and their abnormal characteristics cannot be addressed by existing works. Most of the existing detection methods ignore the long-term dependency relationships among the anomalous bytes.

Figure 1: Two examples of the packet payload anomalies that have different distributions of anomalous bytes. For the short-term packet payload anomaly, the potential anomalous bytes are concentrated and their abnormal characteristics may be obvious, which could be detected by existing methods. In contrast, for the long-term packet payload anomaly, the potential anomalous bytes are scattered and their abnormal characteristics may not be addressed by existing works. Long-term anomalies in packet payload are more difficult to be detected than short-term ones for the existing methods.

To tackle this, we propose a payload anomaly detection framework, which consists of two parts. The former part of the proposed framework is a feature engineering method, which consists of two steps. First, it introduces a sliding block to construct block sequences from the packet payload. Second, the low-frequency items of the block sequences are filtered out by a dictionary and the high-frequency items are encoded into low-dimension embedded vectors by a self-learning block embedding layer. The proposed feature engineering method constructs the block-based features, which contain both the high-dimension information and the underlying sequential information to reveal the characteristics of the payload. The latter part of the proposed framework is a detection model, which has an LSTM and CNN based neural network for learning both the potential long-term and short-term dependency relationships among the block-based features and an MLP based classifier to discover potential attacks. The major contributions of this paper can be summarized as follows:

  • We propose a feature engineering method that constructs block-based features of the packet payload, which can reveal the long-term dependency relationships among the anomalous bytes in the packet payload. Our feature engineering method does not depend on in-depth expert knowledge. To the best of our knowledge, this could be the first work that explores the long-term dependency relationships among the anomalous bytes for payload anomaly detection.

  • We design a detection model that contains an LSTM and CNN based neural network to learn both the long-term and short-term dependency relationships in the block-based features and an MLP based classifier to discover potential attacks in the packet payload.

  • We evaluate the proposed framework that integrates the feature engineering method and the detection model by using three public datasets.

The rest of the paper is organized as follows. Section 2 introduces the related work, including the traditional technology and the deep learning technology for network anomaly detection. Section 3 presents the proposed framework, which integrates a feature engineering method and a detection model. In Section 4, we evaluate the proposed framework by using three public datasets. We conclude the paper in Section 5.

2 Related work

2.1 Network Anomaly Detection

Network anomaly detection is a fundamental task for the quality of service (QoS) of the Internet. A lot of previous work focused on the anomaly detection of low-level network flows or high-level backbone networks. To detect anomalies through flow-based information, ATLANTIC da Silva et al. (2016) used deviations in the entropy of traffic flow tables to detect threats in traffic flows. Xu et al. (2005) detected anomalies by exploring the correlation between traffic behaviors and the corresponding characteristics in backbone networks. These methods can achieve a high detection accuracy for flow-based anomalies, but they are unlikely to detect attacks that insert anomalies in packet payloads, e.g., shell-code and SQL injection. Packet-based anomaly detection methods focus on inspecting the abnormal information in the packet payload. Wang and Stolfo (2004) proposed PAYL, which uses the 1-gram frequency distribution of the payload as features to detect anomalies. Perdisci et al. (2009) proposed McPAD to construct modified 2-gram features that contain abundant information for accurate anomaly detection. However, the accuracy of these methods heavily depends on feature construction that is complex and requires in-depth expert knowledge.

2.2 Deep Learning Methods for Network Anomaly Detection

Deep learning technology, which can automatically learn representations of data, was recently explored to address the limitations of the traditional machine learning methods. To detect flow-based threats in network traffic, many studies investigated the power of deep learning for flow-based anomaly detection. Kim and Cho (2018) proposed the C-LSTM neural network for effectively modeling the spatial and temporal information contained in raw data to detect anomalies in traffic. Tang et al. (2016) proposed a flow-based Deep Neural Network (DNN) model for intrusion detection in a software defined networking environment. To detect payload-based attacks, several detection models using the raw payload data as input have been investigated in the literature. Marin et al. (2018) applied deep CNN and LSTM neural networks for network intrusion detection with different representations of payload data. Bochem et al. (2017) applied LSTM neural networks to learn latent characteristics of normal requests. Liu et al. (2019) implemented end-to-end deep learning detection models using raw payload data. Wang et al. (2017) proposed a hierarchical spatial-temporal features-based intrusion detection system, which applied a deep CNN to learn the low-level spatial features of network traffic and used LSTM to learn the high-level temporal features. Naseer et al. (2018) developed several neural networks to build network anomaly detection models, including CNN, autoencoders and recurrent neural networks (RNN).

2.3 Summary

The work most related to our paper is Qin et al. (2018), which proposed an RNN model with the attention mechanism, called ATPAD, to detect anomalies in the packet payload. ATPAD employs word embedding and an RNN to extract features, which are used at the attention calculation stage to capture the correlation between potential bytes of the payload and the detection results. Different from the ATPAD model, we propose a novel feature engineering method which utilizes the raw packet payload data to construct the block-based features. The block-based features contain two different kinds of information that retain both long-term and short-term dependency relationships within the packet payload. We also employ a neural network based on LSTM and CNN, rather than the RNN model with the attention mechanism, to capture the long-term dependency relationships among the anomalous bytes. To the best of our knowledge, our model achieves state-of-the-art performance on the CSIC 2010 dataset [6].

3 Proposed framework

The proposed anomaly detection framework is shown in Figure 2. There are four modules in this framework. The first two modules make up the former part of the proposed framework, which aims to construct block-based features for efficient feature extraction. In the first module, the payload is extracted and labeled through a preprocessing process. Then, the block sequence is constructed by the sliding block and, in order to remove redundant information, the high-frequency items in the block sequence are selected by a dictionary. In the block embedding process, the block-based features are constructed by encoding each item in the block sequence into an embedded vector. The last two modules form the latter part of the proposed framework, which aims to adaptively detect anomalies in the packet payload. Specifically, a neural network based on the LSTM and the CNN is designed to learn both the long-term and short-term dependency relationships in the block-based features, and an MLP is adopted as a classifier to detect anomalies in each sample. The framework is described in detail in the following subsections.

Figure 2: Overview of the proposed framework. The proposed framework contains four modules. First, the payload is extracted and labeled through a preprocessing process. Then, the block-based features are constructed for each payload. In the last two modules, a detection model based on the LSTM, CNN and MLP is designed for packet payload anomaly detection.

3.1 Packet Payload Preprocessing

The objective of the packet payload preprocessing is to extract the payload from the packet and to convert the payload into a suitable form for the following feature engineering method. The payload extraction is conducted by packet parsing based on the low-level communication protocols. The following process will try to construct efficient expression for the extracted payload. Thus, instead of employing the encoding method, e.g., popular one-hot encoding, to transform the extracted payload to an embedding vector with fixed length and possible zero padding, we directly process the whole payload to a byte stream, which is a string with variable length. The byte stream and the label with respect to the same packet make up a sample for the preprocessed packet payload data.
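As a minimal sketch of the sample format described above, the snippet below pairs a variable-length byte stream with its label. The `Sample` class, the `make_sample` helper and the choice of the `latin-1` codec (which maps each byte value 0..255 to exactly one character) are illustrative assumptions, not details given in the paper; packet parsing itself is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    stream: str  # payload as a variable-length string, one character per byte
    label: int   # 0 = normal, 1 = anomalous


def make_sample(payload: bytes, label: int) -> Sample:
    """Turn already-extracted payload bytes into a (byte stream, label) sample."""
    if label not in (0, 1):
        raise ValueError("label must be 0 (normal) or 1 (anomalous)")
    # Assumption: latin-1 keeps every byte as one character, so nothing is
    # dropped and no fixed-length padding is introduced.
    return Sample(stream=payload.decode("latin-1"), label=label)
```

No padding or truncation is applied, so two payloads of different sizes yield streams of different lengths.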

3.2 Block-Based Feature Extraction

Instead of using the payload byte stream as features, we propose a feature engineering method to extract the block-based features, which contain the high-dimension information and the underlying sequential information for anomaly detection. The block-based feature extraction has two steps, i.e., block sequence construction and block embedding. First, a block sequence is constructed by using the sliding block to extract numerous items that can be considered short subsequences. To retain the sequential information, the items are arranged in order. Second, to reduce the redundant information unrelated to anomalous bytes in block sequences, the high-frequency items of each block sequence are selected by a dictionary and encoded into embedded vectors through the block embedding process.

Figure 3: An example for the process of block sequence construction. With a sliding block of length 3 and a fixed stride, the blocks extracted from a packet payload form a block sequence.

The process of block sequence extraction is shown in Figure 3. A sliding block of specific length slides on each sample consecutively. When the sliding block slides to a certain position, an item would be extracted, then the sliding block would move with a fixed stride to extract items repeatedly. Finally, the block sequence is constructed by arranging blocks in a sequence according to the order of extraction process.
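The extraction process can be sketched in a few lines of Python. The function name `block_sequence` is hypothetical, and the handling of edge cases (a trailing partial block is dropped, and a payload shorter than the block yields an empty sequence) is an assumption, since the text does not specify it.

```python
from typing import List


def block_sequence(payload: str, length: int = 3, stride: int = 1) -> List[str]:
    """Slide a block of `length` characters over `payload` in steps of `stride`.

    Items are returned in extraction order, which preserves the sequential
    information of the payload.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    last_start = len(payload) - length  # last position where a full block fits
    return [payload[i:i + length] for i in range(0, last_start + 1, stride)]
```

For instance, `block_sequence("abcde", length=3, stride=1)` yields `["abc", "bcd", "cde"]`.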

As mentioned above, the high-dimension information and the underlying sequential information are retained in the block sequences, which is useful not only for detecting general anomalies in the payload but also for detecting anomalous bytes that have long-term dependency relationships. First of all, the high-dimension information can be considered a kind of semantic information, which is affected by the length of the sliding block. Intuitively, the longer the sliding block is, the more high-dimension information each item contains. As shown in Figure 4(a)&(b), for the same part of the packet payload, the two-character items extracted when the block length equals 2 carry more information than the single characters extracted when the block length equals 1. However, when the sliding block is too long, the extracted features contain a mixture of normal and abnormal information, which might confuse the learning process for anomaly detection. Thus, a suitable length of the sliding block should be chosen.

Moreover, as the length of the sliding block increases, the block sequence can contain more abundant sequential relationships. To be specific, under the extended ASCII standard, the number of possible sequential relationships between items grows with the item length. As shown in Figure 4(c), when the block length equals 2, an item has 2 different sequential relationships, whereas when the block length equals 1, an item contains only one sequential relationship. Furthermore, the expression of both the short-term and long-term dependency relationships in the block sequence is enhanced as the block length increases. This benefits practical payload anomaly detection, especially for anomalies that have long-term dependency relationships, such as the Union Query Attack Halfond et al. (2006).

The high-frequency items in the block sequence are selected by a dictionary because there is plenty of redundant information unrelated to the anomalous bytes. During the extraction process of the sliding block, a dictionary is constructed to record the frequency of occurrence of each item, and a threshold is set to limit the number of high-frequency items in the dictionary. By using the dictionary, each high-frequency item in the block sequence is selected and kept in its original order. Finally, each sample is reconstructed into a sequence of selected items, which represents the significant information of the sample.
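A minimal sketch of the dictionary step, assuming the threshold is applied as a cap on the number of most frequent items kept (15000 in the experiments reported later). The helper names `build_dictionary` and `select_high_frequency` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_dictionary(sequences: Iterable[List[str]], max_items: int) -> Set[str]:
    """Count every item over all block sequences and keep the most frequent ones."""
    counts = Counter(item for seq in sequences for item in seq)
    return {item for item, _ in counts.most_common(max_items)}


def select_high_frequency(seq: List[str], dictionary: Set[str]) -> List[str]:
    """Drop low-frequency items while keeping the original order of the rest."""
    return [item for item in seq if item in dictionary]
```

In this sketch the dictionary would be built from the training block sequences and then reused unchanged for validation and test samples; that choice is also an assumption.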

Figure 4: An example of the block sequence construction process and the information variations under different sliding blocks. Using a sliding block with a fixed stride, the blocks extracted from the payload can be regarded as a set of short strings. Comparing sliding blocks of different lengths over the same field of the payload, the block with length 2 extracts more abundant sequential relationships than the block with length 1.

Furthermore, the high-frequency items in the block sequence are encoded by a block embedding layer to better express the high-dimension information and the underlying sequential information. One-hot encoding does not work for this task, because it cannot represent the similarity between different items and, as the number of items increases, it suffers from the curse of dimensionality. Inspired by distributed representation Paccanaro and Hinton (2001), the items in each block sequence are encoded into low-dimension embedded vectors by a self-learning block embedding layer. The block-based features are constructed by concatenating all the vectors in order.
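The lookup performed by an embedding layer can be illustrated as follows. This is only a sketch of the forward lookup with a randomly initialized table; in the actual framework the table entries would be learned jointly with the detection model. The names `build_index`, `init_embedding` and `embed`, the initialization range, and the sorted index order are assumptions; the 64-dimensional default follows the implementation settings reported later.

```python
import random
from typing import Dict, Iterable, List


def build_index(dictionary: Iterable[str]) -> Dict[str, int]:
    """Assign each dictionary item a row index (sorted for reproducibility)."""
    return {item: idx for idx, item in enumerate(sorted(dictionary))}


def init_embedding(num_items: int, dim: int = 64, seed: int = 0) -> List[List[float]]:
    """Create a num_items x dim table of small random values (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_items)]


def embed(seq: List[str], index: Dict[str, int], table: List[List[float]]) -> List[List[float]]:
    """Replace every known item by its embedded vector, keeping the order."""
    return [table[index[item]] for item in seq if item in index]
```

Unlike one-hot encoding, the vector length stays at `dim` no matter how many items the dictionary holds.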

The proposed feature engineering method builds the block-based features, which do not rely on in-depth expert knowledge, as several low-dimension embedded vectors to form a valid expression of packet payload.

Figure 5: An illustration of our proposed anomaly detection model.

3.3 Model Construction and Anomaly Detection

The structure of the proposed anomaly detection model is presented in Figure 5. A neural network based on LSTM and CNN is designed to learn the high-dimension information and the underlying sequential information contained in the block-based features. LSTM is used to learn the sequential dependency relationships among the block-based features, which are captured in its hidden state at each time step. In order to learn both the long-term and the short-term dependency relationships in block-based features, we make use of chosen LSTM hidden states at different time steps instead of only using the last hidden state, which is widely adopted in classification tasks. A CNN based structure is adopted to extract the local spatial information in the chosen hidden states, and an MLP connected with a softmax layer is used as a classifier to detect anomalies.

In recent years, LSTM has been applied to machine translation Bahdanau et al. (2014), speech recognition Graves et al. (2013), and so on, for its capability of processing persistent information. Benefiting from its special memory cell structure and gating mechanism, it mitigates the exploding and vanishing gradient problems, which enables efficient learning on long sequences. Therefore, LSTM is adopted to learn the long-term dependency relationships in the block-based features, i.e., the features constructed by our feature engineering method. At each time step, an embedded vector $x_t$ of the block-based features is fed into the LSTM. The LSTM updates its cell state and outputs the current hidden state according to the previous hidden state and the current input through its inner non-linear operations. For the output of the LSTM at each time step $t$, the hidden state $h_t$ is calculated as follows Zhou et al. (2015):

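$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1}$$
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2}$$
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{4}$$
$$h_t = o_t \odot \tanh(c_t) \tag{5}$$

Here, $f_t$, $i_t$ and $o_t$ are the forget, input and output gates, respectively, and $c_t$ is the cell state. The gates control the process for updating the LSTM hidden state. $\sigma$ is the logistic sigmoid function Yin et al. (2003) and tanh is the hyperbolic tangent function Xiao et al. (2005). $W$ is the weight matrix and $b$ is the bias. The notations $\cdot$ and $\odot$ represent the Matmul product and the Hadamard product Manevitz and Yousef (2000), respectively.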
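To make the gate computations concrete, here is a dependency-free sketch of one step of the standard LSTM update, written with plain Python lists. It assumes the common formulation in which every gate acts on the concatenation of the previous hidden state and the current input; the parameter names (`W_f`, `b_f`, and so on) and the `lstm_step` helper are illustrative and not taken from the paper.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(W: List[Vector], v: Vector, b: Vector) -> Vector:
    """Matrix-vector product plus bias: W . v + b."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = sigmoid(affine(params["W_f"], z, params["b_f"]))   # forget gate
    i = sigmoid(affine(params["W_i"], z, params["b_i"]))   # input gate
    o = sigmoid(affine(params["W_o"], z, params["b_o"]))   # output gate
    g = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]  # candidate
    # Element-wise (Hadamard) products update the cell state.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

Running `lstm_step` over all embedded vectors of a sample and collecting each returned `h` gives one hidden state per time step.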

Assume that the length of the block-based features is $T$, which varies across samples; there will then be $T$ hidden states. The last hidden state of the LSTM is widely adopted in classification tasks; however, it cannot adequately express the long-term relationships in the block-based features. To tackle this, we choose $k$ candidates from the $T$ hidden states, equally spaced along the time steps in ascending order. This process not only preserves the long-term relationships, but also reduces the complexity of feature expression.
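One way to pick equally spaced hidden states is sketched below. The paper does not say how the spacing is rounded or what happens when a sample has fewer hidden states than candidates, so this sketch makes assumptions: integer spacing that always includes the first and last time step, and returning every index for short sequences. The function name is hypothetical.

```python
from typing import List


def equally_spaced_indices(total: int, k: int) -> List[int]:
    """Return up to k ascending, equally spaced time-step indices in [0, total)."""
    if total <= 0 or k <= 0:
        raise ValueError("total and k must be positive")
    if k >= total:
        return list(range(total))   # assumption: keep all states of short samples
    if k == 1:
        return [total - 1]          # fall back to the last hidden state
    # Integer arithmetic keeps indices distinct and ends exactly at total - 1.
    return [(i * (total - 1)) // (k - 1) for i in range(k)]
```

With 200 hidden states and 50 candidates, for example, this returns 50 distinct indices from 0 to 199.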

CNN is powerful for its capability to learn spatial features and reduce the feature space. Benefiting from sparse connectivity, shared weights and pooling, CNN extracts the spatial correlation information via convolution without any complex processing Krizhevsky et al. (2012). The CNN based structure in our model is used to extract high-level spatial information from the chosen hidden states. The chosen hidden states are concatenated in order and reshaped into a two-dimensional matrix. In the convolution layer, multiple convolution filters slide over the matrix to perform the convolution operations, which extract the local spatial features. The learning process of the CNN based structure is progressive: the first convolution layer extracts low-level features and the next convolution layer extracts high-level features. After each convolution layer, a max-pooling layer is adopted to obtain the largest value of a small region, which preserves the important values and enhances the generalization ability of the model. In addition, the rectified linear unit (ReLU) Nair and Hinton (2010) is used as the activation function to add nonlinearity to the process. After the convolution and pooling, the spatial features of each sample are extracted and flattened into a vector, which is passed on to the classifier.
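The three operations named above (convolution, ReLU and max-pooling) can be illustrated on a small matrix with plain Python. This is a single-filter sketch without padding; the function names are hypothetical, and the filter and pooling sizes are left as parameters because their values are not recoverable from the text. As in most deep learning libraries, the "convolution" here is a cross-correlation (the kernel is not flipped).

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(matrix: Matrix, kernel: Matrix) -> Matrix:
    """Slide one filter over the matrix (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [[sum(matrix[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fm: Matrix) -> Matrix:
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, v) for v in row] for row in fm]


def max_pool(fm: Matrix, size: int = 2) -> Matrix:
    """Keep the largest value of each non-overlapping size x size region."""
    return [[max(fm[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fm[0]) - size + 1, size)]
            for r in range(0, len(fm) - size + 1, size)]
```

A full layer would apply many such filters and stack the resulting feature maps before flattening.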

By casting payload-based anomaly detection as a classification problem, an MLP is stacked on the above neural network to detect anomalies. The MLP has two layers, which convert the flattened vector into a two-dimension vector $z = (z_0, z_1)$, and the softmax function maps it into a two-dimension distribution $p = (p_0, p_1)$ by Eq. (6), whose values lie between 0 and 1 and sum to 1. The sample is labeled by Eq. (7). The label 0 means the classifier judges the sample to be normal, while the label 1 means the classifier judges the sample to be anomalous.

$$p_i = \frac{e^{z_i}}{\sum_{j=0}^{1} e^{z_j}}, \quad i \in \{0, 1\} \tag{6}$$
$$\hat{y} = \arg\max_{i \in \{0, 1\}} p_i \tag{7}$$
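The final classification step amounts to a softmax followed by an argmax, which can be sketched as below. The function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability measure rather than something stated in the paper.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Map raw scores to a probability distribution that sums to 1."""
    m = max(z)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z: List[float]) -> int:
    """Return 0 (normal) or 1 (anomalous): the index of the larger probability."""
    p = softmax(z)
    return max(range(len(p)), key=p.__getitem__)
```

Because softmax is monotonic, the predicted label is the same as the argmax of the raw scores; the probabilities matter mainly for training and for thresholding.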

3.4 Implementation

In the proposed framework, four hyper-parameters need to be set: the length of the sliding block, the stride of the sliding block, the number of high-frequency items in the dictionary, and the number of chosen LSTM hidden states. In the experiments, they are set to 3, 1, 15000, and 50, respectively.

Regarding the parameters of the neural network we have designed, the LSTM has 128 hidden units and is fed with an embedded vector of 64 dimensions at each time step. Two convolution layers and two pooling layers are implemented in the CNN based structure. These two convolution layers have 32 and 64 filters, respectively. All the filters in the convolution layers are with size of . Each pooling layer uses max-pooling with a filter.

An MLP with two layers is used as a classifier, which has 128 and 2 hidden units respectively. It finally converts the feature maps into a two-dimension vector for classification. During the training process, we set the learning rate to 0.0001 for a stable training. The dropout rate for the MLP is 0.1.

4 Evaluation

In this section, we conduct various experiments to evaluate the performance and effectiveness of the proposed framework for the payload anomaly detection. We first describe the datasets and metrics used for the evaluation. Then, experiments are conducted to evaluate the performance of the proposed framework on different aspects.

4.1 Datasets

We conduct experiments on three datasets to evaluate the performance of the proposed method. These three datasets contain various types of network traffic attacks. We randomly divided each dataset into three parts: the training set, the validation set and the testing set. These three sets account for 70%, 10% and 20% of the total data in each dataset, respectively. An overview of the three datasets is shown in Table 1 and a detailed description of each dataset follows.
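A random 70/10/20 split of this kind can be sketched as follows. The function name, the fixed seed and the rounding of the split sizes are assumptions; the paper does not state whether the split was stratified by class, and this sketch is not stratified.

```python
import random
from typing import Iterable, List, Tuple


def split_dataset(samples: Iterable, seed: int = 0,
                  train: float = 0.7, val: float = 0.1) -> Tuple[List, List, List]:
    """Shuffle the samples and cut them into train / validation / test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    # Whatever remains after train and validation becomes the test set.
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

Every sample lands in exactly one of the three parts.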

4.1.1 CSIC 2010

The CSIC 2010 dataset [6] was developed at the Information Security Institute of the Spanish National Research Council (CSIC) and contains thousands of automatically generated Web requests. The dataset consists of 72,000 normal requests and more than 25,000 anomalous requests, and all HTTP requests are labeled as normal or anomalous. The CSIC 2010 dataset contains various types of Web attacks such as SQL injection, buffer overflow, information gathering and so on.

4.1.2 CICIDS 2017

The CICIDS 2017 dataset [5] contains both normal traffic and up-to-date attacks which resemble real-world data. The dataset was developed by the Canadian Institute for Cybersecurity. Various types of attacks, including DoS, DDoS, Heartbleed, Web attack, infiltration and botnet, are collected in this dataset. In the experiments, we only use the traffic data collected on July 6, which contains three types of Web attacks related to the packet payload: Brute Force, XSS and SQL injection.

4.1.3 ISCX 2012

The ISCX 2012 dataset [10] contains network traffic which aims to describe network behaviors and intrusion patterns. This dataset contains actual traffic types such as HTTP, SMTP, SSH, IMAP, POP3, and FTP. It records the packet payloads of traffic traces in PCAP form, and the relevant profiles are publicly available for researchers. In our experiments, we use the traffic data collected on June 17th, which contains Brute Force SSH anomalies related to the packet payload.

Dataset        Category   Train     Validation   Test
CSIC 2010      Anomaly    17,617    2,439        5,009
CSIC 2010      Normal     50,328    7,268        14,404
CICIDS 2017    Anomaly    6,374     951          1,822
CICIDS 2017    Normal     14,188    1,987        4,053
ISCX 2012      Anomaly    1,618     241          482
ISCX 2012      Normal     33,722    4,807        9,616
Table 1: Number of samples per category and split in the three experimental datasets.

4.2 Performance Metric

In our experiments, we consider the abnormal packet payload to be a positive sample and the normal packet payload to be a negative sample. The methods performed in the experiments are evaluated on five metrics, i.e., Precision, Detection Rate (DR), False Positive Rate (FPR), Accuracy and F1-Score. These metrics are defined based on four related parameters, i.e., TP, TN, FP, FN, where TP represents the number of true positive samples, FN represents the number of false negative samples, FP represents the number of false positive samples and the TN represents the number of true negative samples.

The definitions of the five metrics are listed as follows:

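$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$
$$\text{DR} = \frac{TP}{TP + FN} \tag{9}$$
$$\text{FPR} = \frac{FP}{FP + TN} \tag{10}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{DR}}{\text{Precision} + \text{DR}} \tag{12}$$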
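The five metrics follow the standard confusion-matrix definitions, which the sketch below computes from the four counts. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division: return 0.0 when the denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Precision, DR, FPR, Accuracy and F1-Score (anomalous = positive)."""
    precision = _ratio(tp, tp + fp)
    dr = _ratio(tp, tp + fn)        # detection rate, also known as recall
    fpr = _ratio(fp, fp + tn)
    accuracy = _ratio(tp + tn, tp + tn + fp + fn)
    f1 = _ratio(2 * precision * dr, precision + dr)
    return {"precision": precision, "detection_rate": dr,
            "false_positive_rate": fpr, "accuracy": accuracy, "f1_score": f1}
```

With hypothetical counts `tp=90, tn=880, fp=20, fn=10`, the detection rate is 0.9 and the accuracy is 0.97.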

4.3 Results and Discussion

In this section, we have implemented five experiments to evaluate the performance of the proposed framework in the following five aspects:

  • Experiment A: How is the performance of the proposed framework compared with the traditional machine learning methods and other state-of-the-art methods?

  • Experiment B: Are the block-based features a good expression of the packet payload characteristics?

  • Experiment C: Whether the proposed detection model can extract the long-term dependency relationships in payload anomalies?

  • Experiment D: What is the influence of the hyper-parameters in the proposed framework?

  • Experiment E: How does the proposed model perform on other public datasets?

4.3.1 Experiment A: Performance compared with other methods

In this experiment, we test the proposed framework on the CSIC 2010 dataset and compare the results with those of five other methods: two classical machine learning methods and three recently published methods. We use the scikit-learn library Pedregosa et al. (2011) to implement the two traditional machine learning methods, support vector machine (SVM) Chang and Lin (2011) and random forest (RF) Breiman (2001), and test them on the CSIC 2010 dataset. The other three methods are an RNN based method (Qin’18, Qin et al. (2018)), a CNN based method (Zhang’17, Zhang et al. (2017)) and an LSTM based method (Bochem’17, Bochem et al. (2017)). In the experiments, we simplify the HTTP payload by ignoring the request header fields, which removes redundant information and reduces computational complexity. The detection results of each method are shown in Figure 6(a)&(b).

Figure 6: (a) Detection Rates and (b) False Positive Rates of the compared methods and the proposed framework.

As shown in Figure 6(a)&(b), the proposed framework and the three deep learning based methods outperform the two classical machine learning based methods in DR, and the proposed framework achieves the highest DR of 99.12%. The FPR of the proposed framework is 0.22%, which is lower than that of every compared method. To the best of our knowledge, our model achieves state-of-the-art performance on the CSIC 2010 dataset.

4.3.2 Experiment B: Performance analysis of the block-based features

In this experiment, we compare the performance of three models with and without the block-based features. We aim to investigate whether the block-based feature engineering method improves the anomaly detection performance of these models. The first model, the LSTM-CNN based model, uses the proposed network structure described in Section 3.3. The other two models use only the LSTM network or only the CNN network described in Section 3.3. We call them the LSTM based model and the CNN based model, respectively. The “BL” prefix in a model’s name indicates that the model uses the block-based features; otherwise, the model uses the raw payload byte stream. The CSIC 2010 dataset is used to test each model in this experiment.

Figure 7: (a) Detection Rates and (b) False Positive Rates of three models with or without the adoption of the block-based features.

The experimental results are shown in Figure 7(a)&(b). When the block-based features are not used, the CNN based model achieves a DR of 84.35% and an FPR of 3.56%, the LSTM based model achieves a DR of 91.5% and an FPR of 4.76%, and the LSTM-CNN based model achieves a DR of 96.57% and an FPR of 0.72%. When the block-based features are used, the BL-CNN based model achieves a DR of 98.82% and an FPR of 0.15%, the BL-LSTM based model achieves a DR of 99.08% and an FPR of 0.44%, and the proposed model achieves a DR of 99.12% and an FPR of 0.22%. These results show that when the block-based feature extraction method is applied, the DR of the three models increases by 14.47, 7.58 and 2.55 percentage points, respectively, while the FPR decreases by 3.42, 4.31 and 0.5 percentage points, respectively.

The above results demonstrate that the block-based features help all three models improve their detection performance on the CSIC 2010 dataset. Moreover, the block-based feature extraction method could be easily combined with other anomaly detection methods and has the potential to improve their performance.

4.3.3 Experiment C: Performance analysis for the long-term dependency relationships in payload anomalies

In this experiment, we design a more challenging anomaly detection task, compared to the basic task in experiment A, to evaluate the proposed framework. The purpose of this task is to investigate whether the proposed framework has the capability of learning the long-term dependency relationships for packet payload anomalies.

To construct this task, we use a random-insertion method to alter the sequential dependency relationships of anomalous bytes in samples of the CSIC 2010 dataset. A segment of noise is inserted into each sample: its length is 20% of the sample length, the insertion index is random, and the noise consists entirely of the character ‘0’. On the one hand, the inserted noise disrupts the original short-term dependency relationships of anomalous bytes. On the other hand, the identical redundant content increases the similarity between samples, which makes it more difficult to extract effective features and detect anomalies.
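The random-insertion preprocessing can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: the noise length is truncated to an integer, the payload is handled as a character string, and the function name `insert_noise` and the sample payload are hypothetical.

```python
import random


def insert_noise(payload, ratio=0.2, noise_char="0", rng=None):
    """Insert one run of `noise_char` at a random index of `payload`.

    The run length is `ratio` times the payload length, truncated to an
    integer (the rounding rule is an assumption).
    """
    rng = rng or random.Random()
    noise_len = int(len(payload) * ratio)
    # randint is inclusive, so the noise may also be appended at the very end.
    index = rng.randint(0, len(payload))
    return payload[:index] + noise_char * noise_len + payload[index:]


# Hypothetical payload, for illustration only.
sample = "id=1' OR '1'='1&name=alice&cart=42"
noisy = insert_noise(sample, rng=random.Random(7))
print(len(sample), len(noisy))
print(noisy)
```

Because the inserted run can fall in the middle of an attack string, the bytes that form the anomaly may end up far apart, which is what makes the task depend on longer-range context.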

We calculate the performance metrics for the BL-CNN based model, the BL-LSTM based model and the proposed framework. All the packet payloads are preprocessed by the random-insertion method introduced above. The experimental results, shown in Table 2, indicate that the proposed framework still achieves excellent detection performance. Compared with the results on the original CSIC 2010 dataset, the DR of the proposed framework only decreases 0.08%, while the DR of the BL-CNN based model decreases 3.1% and that of the BL-LSTM based model decreases 1.6%. The proposed framework thus shows the ability to extract the long-term dependency relationships in the packet payload anomalies and still performs well on the task in this experiment.

Models         DR       FPR     Precision   F1-score   Accuracy
BL-CNN         95.72%   0.22%   99.34%      97.50%     98.73%
BL-LSTM        97.47%   0.45%   98.68%      98.07%     99.02%
BL-LSTM-CNN    98.67%   0.17%   99.52%      99.29%     99.53%
Table 2: Comparison of the results of the three models in Experiment C.

4.3.4 Experiment D: Influence of the hyper-parameters on the proposed framework

In this experiment, we evaluate the influence of four hyper-parameters on the proposed framework. We vary one hyper-parameter at a time and keep the other hyper-parameters at the values used in Experiment A. The experiment is also performed on the CSIC 2010 dataset. The four hyper-parameters are the length of the sliding block, the stride of the sliding block, the number of high-frequency items in the dictionary and the number of chosen LSTM hidden states. The length of the sliding block is set to 1, 2, 3, 4 and 5, respectively. The stride of the sliding block is set to 1, 2 and 3, respectively. The number of high-frequency items in the dictionary is set to 5000, 10000, 15000 and 20000, respectively. The number of chosen LSTM hidden states is set to 5, 20, 50 and 100, respectively.
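The one-at-a-time protocol can be sketched as a small configuration generator. The candidate values in `GRID` are the ones listed above. The baseline values in `BASE` are hypothetical placeholders, since the Experiment A settings are not restated in this section, and the key names are also assumptions.

```python
def one_at_a_time(base, grid):
    """Yield (name, value, config) with exactly one hyper-parameter changed."""
    for name, values in grid.items():
        for value in values:
            config = dict(base)  # copy, so the baseline stays untouched
            config[name] = value
            yield name, value, config


# Hypothetical baseline; replace with the actual Experiment A settings.
BASE = {"block_length": 3, "stride": 1, "dict_size": 10000, "lstm_states": 50}

# Candidate values evaluated in Experiment D.
GRID = {
    "block_length": [1, 2, 3, 4, 5],
    "stride": [1, 2, 3],
    "dict_size": [5000, 10000, 15000, 20000],
    "lstm_states": [5, 20, 50, 100],
}

runs = list(one_at_a_time(BASE, GRID))
print(len(runs), "training runs")
for name, value, _config in runs[:3]:
    print(name, "=", value)
```

Varying one hyper-parameter at a time keeps the number of training runs equal to the sum of the candidate counts rather than their product, at the cost of not observing interactions between hyper-parameters.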

The influence of the hyper-parameters on the proposed framework is shown in Figures 8&9. When the length of the sliding block is too small, the information contained in the block-based features is limited. If it is too large, both normal and abnormal information might be mixed in the block-based features, which will cause poor performance. The stride of the sliding block affects the amount of information extracted from the packet payload and should be set to 1. The number of high-frequency items in the dictionary affects the amount of information in the block-based features. If the number is too large, the dictionary will contain too much redundant information; if it is too small, the dictionary may not contain enough valuable information. The number of chosen LSTM hidden states affects the amount of sequential feature information used by the proposed framework. The experimental results indicate that the model achieves its best detection performance when the number of chosen states is 50.
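To make the roles of the block length, the stride and the dictionary size concrete, the sketch below slides a fixed-length window over a payload and maps each block to an index in a dictionary of the most frequent blocks. It is only an illustration under assumptions: the actual block sequence extraction and embedding are defined in the feature engineering section, and the function names, the reserved index 0 for out-of-dictionary blocks and the toy payloads are hypothetical.

```python
from collections import Counter


def extract_blocks(payload, length=3, stride=1):
    """Slide a window of `length` bytes over `payload` with step `stride`."""
    return [payload[i:i + length]
            for i in range(0, len(payload) - length + 1, stride)]


def build_dictionary(payloads, length, stride, max_items):
    """Keep the `max_items` most frequent blocks; index 0 is reserved."""
    counts = Counter()
    for payload in payloads:
        counts.update(extract_blocks(payload, length, stride))
    return {block: rank + 1
            for rank, (block, _) in enumerate(counts.most_common(max_items))}


def encode(payload, dictionary, length, stride):
    """Map each block to its dictionary index (0 if not in the dictionary)."""
    return [dictionary.get(block, 0)
            for block in extract_blocks(payload, length, stride)]


# Toy payloads, for illustration only.
training = [b"aaaa", b"aab"]
dictionary = build_dictionary(training, length=2, stride=1, max_items=1)
print(extract_blocks(b"abcdef", length=3, stride=1))
print(extract_blocks(b"abcdef", length=3, stride=2))
print(encode(b"aab", dictionary, length=2, stride=1))
```

With a stride larger than 1 some overlapping blocks are skipped, which is one way to see why a larger stride extracts less information from the same payload.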

Figure 8: The influence of model parameters on DR.
Figure 9: The influence of model parameters on FPR.

4.3.5 Experiment E: Performance evaluation on other public datasets

In this experiment, we evaluate the performance of the proposed framework on other public datasets. For each dataset, we use a subset that contains attacks related to the packet payload. For the CICIDS 2017 dataset, we use the Web attack data, which includes Brute Force, XSS and SQL injection, from one day’s record to build the attack dataset, and use the normal HTTP traffic data from that day to build the normal dataset. For the ISCX 2012 dataset, we only use the Brute Force SSH attack data to form the attack dataset. The normal SSH packet payloads from that day are used to form the normal dataset.

The detection results on the two datasets are shown in Table 3. Our proposed method achieves excellent performance on both datasets. On the CICIDS 2017 dataset, the proposed model achieves a DR of 99.78% and an FPR of 0.0165%. On the ISCX 2012 dataset, the proposed model achieves a DR of 99.17% and an FPR of 0.332%.

Datasets       DR       FPR       Precision   F1-score   Accuracy
CICIDS 2017    99.78%   0.0165%   99.34%      99.56%     99.92%
ISCX 2012      99.17%   0.332%    98.68%      98.92%     99.64%
Table 3: Experimental results of the proposed model on other datasets.

5 Conclusion

This paper proposed a payload-based anomaly detection framework that constructs block-based features for efficient feature extraction and adaptively detects anomalies. The block-based features are constructed by the first part of the framework, a feature engineering method implemented via block sequence extraction and block embedding. The second part of the framework, i.e., the anomaly detection model, is designed to learn both the long-term and short-term dependency relationships in the block-based features and to discover potential attacks in the packet payload. Experimental results on three public datasets showed that the proposed framework achieves a higher detection rate and a lower false positive rate than existing methods in the literature. In future work, we will consider other kinds of anomalies, e.g., anomalies in video surveillance (Luo et al., 2019), and try to explore a unified framework to detect them.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • A. Bochem, H. Zhang, and D. Hogrefe (2017) Streamlined anomaly detection in web requests using recurrent neural networks. In 2017 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1016–1017.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32.
  • C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (3), pp. 27.
  • [5] CICIDS 2017 dataset. Note: https://www.unb.ca/cic/datasets/ids-2017.html
  • [6] CSIC 2010 http dataset. Note: https://www.isi.csic.es/dataset/
  • A. S. da Silva, J. A. Wickboldt, L. Z. Granville, and A. Schaeffer-Filho (2016) ATLANTIC: a framework for anomaly traffic detection, classification, and mitigation in sdn. In NOMS 2016-2016 IEEE/IFIP Network Operations and Management Symposium, pp. 27–35.
  • A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649.
  • W. G. Halfond, J. Viegas, A. Orso, et al. (2006) A classification of sql-injection attacks and countermeasures. In Proceedings of the IEEE International Symposium on Secure Software Engineering, pp. 13–15.
  • [10] ISCXIDS 2012 dataset. Note: https://www.unb.ca/cic/datasets/ids.html
  • T. Kim and S. Cho (2018) Web traffic anomaly detection using c-lstm neural networks. Expert Systems with Applications 106, pp. 66–76.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
  • S. Lin and S. Tseng (2004) Constructing detection knowledge for ddos intrusion tolerance. Expert Systems with Applications 27 (3), pp. 379–390.
  • H. Liu, B. Lang, M. Liu, and H. Yan (2019) CNN and rnn based payload classification methods for attack detection. Knowledge-Based Systems 163, pp. 332–341.
  • W. Luo, W. Liu, D. Lian, J. Tang, L. Duan, X. Peng, and S. Gao (2019) Video anomaly detection with sparse coding inspired deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
  • L. M. Manevitz and M. Yousef (2000) Document classification on neural networks using only positive examples. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 304–306.
  • G. Marin, P. Casas, and G. Capdehourat (2018) Rawpower: deep learning based anomaly detection from raw network traffic measurements. In Proceedings of the ACM SIGCOMM 2018 Conference on Posters and Demos, pp. 75–77.
  • [18] MICRO focus 2018 application security research update. Note: https://www.microfocus.com/media/report/application/_security/_research/_update/_report.pdf
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning, pp. 807–814.
  • S. Naseer, Y. Saleem, S. Khalid, M. K. Bashir, J. Han, M. M. Iqbal, and K. Han (2018) Enhanced network anomaly detection based on deep neural networks. IEEE Access 6, pp. 48231–48246.
  • A. Paccanaro and G. E. Hinton (2001) Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering 13 (2), pp. 232–244.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. J. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • R. Perdisci, D. Ariu, P. Fogla, G. Giacinto, and W. Lee (2009) McPAD: a multiple classifier system for accurate payload-based anomaly detection. Computer Networks 53 (6), pp. 864–881.
  • Z. Qin, X. Ma, and Y. Wang (2018) Attentional payload anomaly detector for web applications. In International Conference on Neural Information Processing, pp. 588–599.
  • H. Ren, M. Liu, Z. Li, and W. Pedrycz (2017) A piecewise aggregate pattern representation approach for anomaly detection in time series. Knowledge-Based Systems 135, pp. 29–39.
  • [26] Snort. Note: https://www.snort.org/
  • [27] Suricata. Note: https://suricata-ids.org/
  • T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho (2016) Deep learning approach for network intrusion detection in software defined networking. In 2016 International Conference on Wireless Networks and Mobile Communications (WINCOM), pp. 258–263.
  • C. Torrano-Gimenez, H. T. Nguyen, G. Alvarez, and K. Franke (2015) Combining expert knowledge with automatic feature extraction for reliable web attack detection. Security and Communication Networks 8 (16), pp. 2750–2767.
  • C. Torrano-Gimenez, H. T. Nguyen, G. Alvarez, S. Petrović, and K. Franke (2011) Applying feature selection to payload-based web application firewalls. In 2011 Third International Workshop on Security and Communication Networks (IWSCN), pp. 75–81.
  • K. Wang and S. J. Stolfo (2004) Anomalous payload-based network intrusion detection. In International Workshop on Recent Advances in Intrusion Detection, pp. 203–222.
  • W. Wang, Y. Sheng, J. Wang, X. Zeng, X. Ye, Y. Huang, and M. Zhu (2017) HAST-ids: learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection. IEEE Access 6, pp. 1792–1806.
  • F. Xiao, Y. Honma, and T. Kono (2005) A simple algebraic interface capturing scheme using hyperbolic tangent function. International Journal for Numerical Methods in Fluids 48 (9), pp. 1023–1040.
  • K. Xu, Z. Zhang, and S. Bhattacharyya (2005) Profiling internet backbone traffic: behavior models and applications. In ACM special interest group on data communication, pp. 169–180.
  • X. Yin, J. Goudriaan, E. A. Lantinga, J. Vos, and H. J. Spiertz (2003) A flexible sigmoid function of determinate growth. Annals of Botany 91 (3), pp. 361–371.
  • L. Zhang, J. Lin, and R. Karim (2018) Adaptive kernel density-based anomaly detection for nonlinear systems. Knowledge-Based Systems 139, pp. 50–63.
  • M. Zhang, B. Xu, S. Bai, S. Lu, and Z. Lin (2017) A deep learning method to detect web attacks using a specially designed cnn. In International Conference on Neural Information Processing, pp. 828–836.
  • C. Zhou, C. Sun, Z. Liu, and F. C. M. Lau (2015) A c-lstm neural network for text classification. Computer Science 1 (4), pp. 39–44.
  • J. T. Zhou, H. Zhang, D. Jin, and X. Peng (2019) Dual adversarial transfer for sequence labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.