
Autoencoder-based Unsupervised Intrusion Detection using Multi-Scale Convolutional Recurrent Networks

The massive growth of network traffic data leads to very large datasets. Labeling these datasets to identify intrusion attacks is laborious and error-prone. Furthermore, network traffic data have complex, time-varying, non-linear relationships. The existing state-of-the-art intrusion detection solutions combine various supervised approaches with fused feature subsets based on correlations in traffic data. These solutions often require high computational cost, manual support in fine-tuning intrusion detection models, and labeling of data, all of which limit real-time processing of network traffic. Unsupervised solutions reduce computational complexity and the manual effort of labeling data, but current unsupervised solutions do not consider spatio-temporal correlations in traffic data. To address this, we propose a unified Autoencoder that combines a multi-scale convolutional neural network and long short-term memory (MSCNN-LSTM-AE) for anomaly detection in network traffic. The model first employs a Multiscale Convolutional Neural Network Autoencoder (MSCNN-AE) to analyze the spatial features of the dataset; the latent space features learned by the MSCNN-AE are then passed to a Long Short-Term Memory (LSTM) based Autoencoder to process the temporal features. Our model further employs two Isolation Forest algorithms as error correction mechanisms that detect false positives and false negatives to improve detection accuracy. A Riemannian manifold is naturally embedded with distance metrics that facilitate discriminative patterns for detecting malicious network traffic. We evaluated our model on the NSL-KDD, UNSW-NB15, and CICDDoS2019 datasets and showed that our proposed method significantly outperforms conventional unsupervised methods and other existing studies on these datasets.





1 Introduction

An intrusion detection system (IDS) is a primary defense mechanism against expected or unexpected cyberattacks and helps adapt and secure computing infrastructures 8369054; jang2014survey. IDS based on machine learning and deep learning are the main focus of recent intrusion detection research because they are more effective against large-scale attacks. Various models, including support vector machines, K-nearest neighbors, decision trees, random forests, and deep learning models such as convolutional and recurrent neural networks, have been applied to build IDS electronics9010173; article11; article13. These intrusion detection systems suffer from complexity issues (a large amount of high-dimensional, complex data) and are insufficient for learning the complex nonlinear relationships that change over time within large datasets IERACITANO202051. As Hogo 6987012 stated, there are temporal and spatial dependencies in traffic data article4, but existing IDS mostly focus on the last snapshot. For example, the source IP and destination IP define the subject and object of the behavior in the data streams, and the duration describes how long the behavior lasted Rieman1. Similarly, the packet volume and packet size indicate the traffic flow, and their size varies between protocols Rieman1. This information should therefore be analyzed in the context of the communication protocol to explain the impact of the behavior on the communication capability. This spatio-temporal relationship within the multivariate data helps in detecting various attack characteristics that look rather benign under different network protocols Rieman1.
The existing state-of-the-art methods address this issue by proposing feature fusion and association to improve the ability to identify malicious network behaviors li2018data. For example, Li et al. LI2020107450 proposed an IDS that divides features into four subsets based on the correlation among features and uses a multi-convolutional neural network to detect intrusion attacks. Similarly, Wei et al. 8171733 proposed a fusion of a convolutional neural network (CNN) to learn spatial features and long short-term memory (LSTM) to learn the temporal features among multiple network packets. Most machine learning (ML) or deep learning (DL) solutions proposed for IDS are based on a supervised learning approach. Unfortunately, IDS models based on supervised ML techniques rely heavily on manually labeled raw traffic and on fine-tuning of ML models with the labeled datasets. This manual labeling of network traffic is very laborious given the massive growth of traffic data and may lead to error-prone data labels binbusayyis2021unsupervised. This stems from the "unsupervised intrusion detection" approach of trying to define what is anomalous rather than what is normal 5485505. In recent years, with the advent of deep architectures such as fully connected Autoencoders and self-organizing maps, there have been substantial advances in the domain of unsupervised IDS. However, existing unsupervised models do not consider the spatial and temporal characteristics of traffic data.

We address this issue by proposing a new model that utilizes three different deep learning models. The main contributions of this paper can be summarised as follows.

  • We propose a unified Autoencoder based on combining multi-scale convolutional neural network and long short-term memory (MSCNN-LSTM-AE) for anomaly detection in network traffic. Our proposed model runs in an unsupervised manner by effectively removing the requirement of having to manually label the data.

  • The MSCNN-AE part of our model is capable of extracting inherited spatial patterns from network traffic. The LSTM-AE part of our model extracts the temporal patterns from network traffic in addition to the spatial patterns obtained through the MSCNN-AE. We utilize Autoencoder for both MSCNN and LSTM training without labels (i.e. unsupervised).

  • To further improve classification accuracy, we utilize a two-staged detection technique using isolation forest to effectively reduce the false positives and false negatives created by the threshold-based approach used in the Autoencoder.

  • Our experimental results, benchmarked on NSL-KDD, UNSW-NB15, and CICDDoS2019 datasets, show that our proposed model achieves a higher recall and f-score compared to other state-of-the-art models.

The rest of the paper is organized as follows. Section 2 presents related work on intrusion detection. Section 3 gives a brief description of the typologies of deep neural networks employed in the proposed approach. Section 4 introduces our proposed model. Section 5 describes the datasets, experimental setup, and evaluation metrics. In the subsequent section, the experimental results are discussed and compared to existing studies. Lastly, the final section draws the conclusions of our proposed method.

2 Related work

The recent literature on cybersecurity shows how advancements in unsupervised AI techniques have driven intrusion detection research. For example, Auskalnis et al. article343 use the Local Outlier Factor during data preprocessing to exclude normal packets that overlap with the density position of anomalous packets. A cleaned (reduced) set of normal packets is then used to train another local outlier model to detect anomalous packets. Similarly, Rathore et al. article314 proposed an unsupervised IDS based on semi-supervised fuzzy C-means clustering with single-hidden-layer feedforward neural networks (also known as Extreme Learning Machines) to detect intrusions in real time. Aliakbarisani et al. article308 proposed a method that learns a transformation matrix based on the Laplacian eigenmap technique to map the features of samples into a new feature space, where samples can be clustered into different classes using data-driven distance metrics.

With the advent of deep learning methods and cheap hardware (graphical processing units), unsupervised deep learning methods such as deep belief networks (DBN), self-organizing maps, and autoencoders are increasingly being used. Karami article315 proposed an IDS model based on a self-organizing map that not only removes benign outliers but also improves the detection of anomalous patterns. Alom et al. 7443094 proposed an unsupervised deep belief network (DBN) for intrusion detection. Similarly, for in-vehicle network security, Kang et al. kang2016intrusion leveraged the unsupervised pretraining process of the DBN model. A considerable number of further studies have investigated the application of DBN in IDS design gao2014intrusion; zhang2017deep. Nevertheless, recent studies have focused primarily on the autoencoder (AE) to develop efficient IDS because AEs are easy to implement and computationally inexpensive. A number of studies have attempted to develop AE variants with improved discriminative ability for intrusion detection. For instance, Hassan et al. electronics9020259 optimized the hyper-parameters of a sparse AE to extract a better feature embedding for classifying intrusion attacks. Similarly, Song et al. proposed an Autoencoder model (trained on normal samples) based on the principle that the reconstruction loss of normal traffic samples is lower than that of abnormal (attack) samples, so that a threshold can be set for detecting future attacks. In addition, that work evaluates various hyperparameters, model architectures, and latent size settings in terms of attack detection performance.

Various researchers have proposed methods that combine unsupervised and supervised learning to get the best of both techniques. For example, Shone et al. article331 proposed unsupervised feature learning using a non-symmetric stacked deep autoencoder; the learned features are then used for intrusion detection with a random forest model. Similarly, Hawawreh et al. article321 combined an autoencoder with a deep feed-forward neural network for intrusion detection. Sheng et al. developed a framework that combines generative adversarial networks and an Autoencoder to improve intrusion detection performance. Aygun et al. aygun2017network introduced an AE variant that stochastically determines the threshold on the reconstruction error used to flag intrusion attacks, improving the discriminative ability of the AE on the NSL-KDD intrusion dataset. Ieracitano et al. IERACITANO202051 used statistical analysis to select more relevant features and filtered out local outliers before training the AE variant. This methodology improved performance on the NSL-KDD dataset. Mirsky et al. mirsky2018kitsune proposed an ensemble of autoencoders to detect intrusion attacks. In the same vein, Al-Qatf et al. al2018deep proposed a combination of an AE with a support vector machine (SVM) model. They used a sparse AE to extract a meaningful feature embedding and an SVM classifier on the encoded features to classify intrusion attacks. Their combination of AE and SVM can be used efficiently in binary and multi-class scenarios. Moreover, the authors in qureshi2020intrusion applied a sparse AE model exploiting the concept of self-taught learning to learn useful features for intrusion detection. In addition, they combined the original features with the extracted features to improve the model's ability to generalize in recognizing network attacks. Furthermore, in kherlenchimeg2020deep, a two-stage framework that combines a sparse AE with long short-term memory is investigated for building an efficient IDS. Here, the framework employs the sparse AE for learning an effective feature representation and the LSTM model for classifying normal and malicious traffic. In a work by Shuaixin shuaixin2020intrusion, the viability of combining stacked AEs with an SVM classifier configured with a piece-wise radial basis function is examined as a way to improve the classification performance of the SVM for intrusion detection. Similarly, the authors in yu2017network combined the advantages of stacked AEs with a CNN to meet the high-performance demands of network IDS. Likewise, the authors in 8418451 studied the effectiveness of a stacked sparse AE model for extracting useful features of intrusion behavior. The results indicated that the model can extract more discriminative features and accelerate the detection process. Relatedly, the authors in kim2019designing proposed online deep learning systems that apply an AE as the function approximator in the Q-network of reinforcement learning (RL) to achieve a higher detection accuracy for network intrusion detection.
From the above literature review, it is evident that despite the significant performance gains achieved with unsupervised approaches in IDS design, there is still room for improvement. One cause of weakness is the focus on the last snapshot, even though there are temporal and spatial dependencies in the traffic data. On these grounds, the existing approaches are prone to overfitting and show poor generalization performance toward unseen cyberattacks. Research on unsupervised IDS is thus still in its infancy. Hence, the proposed research is expected to make a valuable contribution to the existing body of knowledge.

3 Preliminaries

The approach presented in our study makes use of different typologies of neural networks arranged to provide a powerful network traffic classifier suitable for most of the tasks characterizing modern ISP activities. In this section, the basic theoretical backgrounds of the employed networks are presented.

3.1 Autoencoder

An Autoencoder (AE) is an unsupervised feed-forward neural network trained to reconstruct its input. The AE attempts to find an optimal subspace in which normal data and anomalous data appear very different. Let us assume that the normal training set is $\{x_1, x_2, \ldots, x_n\}$, each element of which is a $d$-dimensional vector ($x_i \in \mathbb{R}^d$). In the training phase, we construct a model to project these training data into a lower-dimensional subspace and reproduce the data to get the output $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\}$. We therefore optimize the model to minimize the reconstruction error so as to obtain the optimal subspace. The reconstruction error is defined as:

$$\varepsilon(x_i, \hat{x}_i) = \sum_{j=1}^{d} \left(x_{ij} - \hat{x}_{ij}\right)^2 \qquad (1)$$

As the normal data in the test dataset match the normal profile built in the training phase, their reconstruction error is small, whereas anomalous data have a relatively higher reconstruction error. As a result, by thresholding the reconstruction error with a threshold $\tau$, we can classify the anomalous data:

$$c(x_i) = \begin{cases} \text{normal}, & \varepsilon(x_i, \hat{x}_i) < \tau \\ \text{anomalous}, & \varepsilon(x_i, \hat{x}_i) > \tau \end{cases}$$

The architecture of the Autoencoder consists of the encoder and the decoder. The encoder and decoder are composed of an input layer, an output layer, and one or more hidden layers. The architecture is symmetrical: the output layer of the decoder has the same size as the input layer of the encoder. Mathematically, an encoder with input vectors $x \in \mathbb{R}^d$ and an output (hidden) layer of size $m$ can be defined as:

$$h = f_{\theta}(x) = \sigma(Wx + b)$$

where $x$ is the input vector, $\theta$ is the parameter set $\{W, b\}$, $W$ represents the encoder weight matrix of size $m \times d$, and $b$ is a bias vector of dimensionality $m$. The input vector is thereby encoded to a lower-dimensional vector. The resulting hidden representation $h$ is then decoded back to the original input space $\mathbb{R}^d$ using a decoder. The mapping function is as follows:

$$\hat{x} = g_{\theta'}(h) = \sigma(W'h + b')$$

The parameter set of the decoder is $\theta' = \{W', b'\}$. We optimize the autoencoder to minimize the average reconstruction error with respect to $\theta$ and $\theta'$:

$$\theta^{*}, \theta'^{*} = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} \varepsilon\left(x_i, g_{\theta'}(f_{\theta}(x_i))\right)$$

where $\varepsilon$ is the reconstruction error in Equation (1). After we finish training this Autoencoder, we can feed the test data into it to compute the reconstruction error of each sample. The anomalous data can be determined by utilizing Equation (3). Note that the activation functions in $f_{\theta}$ and $g_{\theta'}$ should be non-linear so as to reveal the non-linear correlations between the input features.
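To make the reconstruct-then-threshold idea concrete, the following is a minimal sketch in plain Python. It assumes a single hidden layer with a logistic activation, hand-picked toy weights (`ENC`, `DEC`), and an arbitrary threshold of 0.5; none of these values come from the paper, and a sample whose error equals the threshold is treated as normal here by assumption.

```python
import math

def sigmoid(v):
    """Logistic activation, one common non-linear choice (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-v))

def layer(weights, bias, vec):
    """Compute sigmoid(W vec + b); `weights` is a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def reconstruct(x, enc, dec):
    """Encode x into the hidden layer, then decode back to input space."""
    hidden = layer(enc[0], enc[1], x)
    return layer(dec[0], dec[1], hidden)

def reconstruction_error(x, x_hat):
    """Sum of squared differences between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def classify(error, threshold):
    """Label a sample by thresholding its reconstruction error."""
    return "anomalous" if error > threshold else "normal"

# Hypothetical toy parameters: 3 input features, 2 hidden units.
ENC = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])
DEC = ([[0.4, -0.6], [0.7, 0.2], [-0.3, 0.9]], [0.0, 0.0, 0.1])

x = [0.2, 0.9, 0.4]
x_hat = reconstruct(x, ENC, DEC)
print(classify(reconstruction_error(x, x_hat), threshold=0.5))
```

In practice the weights would be learned by minimizing the average reconstruction error over normal traffic, and the threshold would be chosen from the error distribution on that data.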

3.2 Isolation Forest

Isolation Forest (iForest) sadaf2020intrusion is an unsupervised machine learning method that finds anomalies by randomly partitioning the data points. It assumes that instances lying far from the center of the data are anomalies. It builds an ensemble of binary isolation trees (iTrees), each grown on a random sample of the given dataset. The key idea of an isolation tree is that unusual samples (anomalies) are isolated more easily, which helps detect unknown attacks that deviate from normal behavior. First, each iTree is built from a random subset of the training set, and a modest subsample was found to be sufficient in practice. Secondly, iForest uses no distance or density measures to detect an anomaly, which eliminates the computational cost of the distance calculations involved in clustering and gives linear time complexity. Lastly, iForest requires little memory and, being an ensemble method, tolerates individual iTrees that do not yield good results, because the ensemble combines weak trees into an effective detector. Owing to these benefits, iForest is well suited to detecting anomalies in huge datasets with complex features. It calculates the anomaly score $s$ as

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where $h(x)$ is the number of edges in a tree for a certain point $x$, $E(h(x))$ is its average over the iTrees, and $c(n)$ is the normalization constant for a dataset of size $n$. The two classes are separated based on a threshold value on the anomaly score in supervised classification and without a threshold value in unsupervised classification.
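The anomaly score can be sketched in a few lines of plain Python. The normalization constant below follows the standard Isolation Forest definition, c(n) = 2H(n-1) - 2(n-1)/n with the harmonic number approximated by ln(i) plus Euler's constant; the path lengths and the sample size of 100 are made-up illustrative values, not figures from the paper.

```python
import math

EULER_GAMMA = 0.5772156649

def normalization_constant(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    mean_depth = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_depth / normalization_constant(n))

# Hypothetical path lengths of two points across three isolation trees.
easily_isolated = anomaly_score([2, 3, 2], n=100)
deeply_buried = anomaly_score([11, 12, 10], n=100)
print(easily_isolated > deeply_buried)
```

A point whose average path length equals c(n) scores exactly 0.5, which is why scores well above 0.5 are usually read as anomalous.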

Figure 1: Spatio-temporal features integration in the architecture of the MSCNN-LSTM-AE model
Figure 2: A block diagram of MSCNN-LSTM-AE model learning process

4 The Proposed methodology

Our proposed approach consists of data preprocessing and spatio-temporal feature integration using multi-scale CNN-AE and LSTM-AE, as shown in Figure 1. Our model computes anomaly scores based on the reconstruction error of traffic data to identify malicious traffic. The method involves two stages of anomaly detection, where the output of the first stage acts as the input to the second stage. As depicted in the flowchart in Figure 2, the test dataset is supplied to the autoencoder (MSCNN-LSTM) in stage 1. The unified MSCNN-LSTM-AE identifies attacks based on a threshold and segregates the attack and normal network traffic data into two sets. However, the resulting sets contain some data points that do not belong to them. The Isolation Forests in stage 2 attempt to identify these misfit (outlier) data points, which improves the overall accuracy.

4.1 Data Pre-processing

Some features are symbolic and others are continuous. They need to be converted into a single numeric type for feature extraction. In addition, features are not uniformly distributed and therefore need to be scaled to obtain better results with machine learning models. Data standardization in pre-processing deals with the numeralization of categorical features. The most common method is to encode symbolic values as numeric values. For example, if a feature contains three unique symbolic values, these values can be mapped to 1, 2, and 3, respectively. Generally, features in traffic data flows are highly variable and not uniformly distributed. To achieve better results with the machine learning model, the attribute values are scaled to the common interval $[0, 1]$. For this purpose, the min-max data normalization method is used:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\max}$ and $x_{\min}$ represent the maximum and minimum values of the feature vector $x$, and $x'$ is the normalized feature value in $[0, 1]$.
Usually, the first-order features captured from network traffic are arranged as a one-dimensional (1D) vector. Let $d$ be its dimension; the $d$-dimensional input needs to be arranged into a 2D matrix before being fed into a 2D-CNN. Accordingly, let $x \in \mathbb{R}^{d}$ be the pre-processed input and let $X \in \mathbb{R}^{X_r \times X_c}$ be the derived 2D matrix (feature maps), as shown in Figure 1.
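The three pre-processing steps (symbol encoding, min-max scaling, and reshaping a 1D record into a 2D matrix) can be sketched as follows. The protocol names, the sample values, and the choice to zero-pad a record that does not fill the matrix are assumptions made for illustration only.

```python
def encode_symbols(values):
    """Map each distinct symbol to 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping

def min_max_scale(column):
    """Scale one feature column into [0, 1]; a constant column maps to 0."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]

def to_matrix(vector, rows, cols):
    """Arrange a 1D record row by row into a rows x cols matrix (zero-padded)."""
    if rows * cols < len(vector):
        raise ValueError("matrix is too small for the input vector")
    padded = list(vector) + [0.0] * (rows * cols - len(vector))
    return [padded[r * cols:(r + 1) * cols] for r in range(rows)]

# Hypothetical symbolic column and numeric column.
codes, mapping = encode_symbols(["tcp", "udp", "tcp", "icmp"])
scaled = min_max_scale([2, 4, 6])
grid = to_matrix([0.1, 0.2, 0.3, 0.4, 0.5], rows=2, cols=3)
print(codes, scaled, grid)
```

The minimum and maximum would normally be computed on the training set only and then reused for the test set, so that both are scaled consistently.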

4.2 Multi-scale CNN-AE based spatial feature extraction

The convolutional neural network (CNN) architecture performs well in image processing. However, when processing an image, a CNN focuses on local features such as edge information. Network traffic identification cannot rely only on a few discrete local features; it needs to combine multiple local features to perform the classification. Therefore, the CNN is adapted into a multi-scale CNN (MSCNN) to accomplish this task. When the human visual system maps an image in the brain, it first forms a complete set of images from far to near and from fuzzy to clear. The MSCNN simulates these different projections of objects at different distances on the retina during human eye recognition. Similarly, network traffic is a high-dimensional dataset that cannot be identified by only a few discrete features.
Convolutional Autoencoder (CAE) yu2017network is a special kind of autoencoder that does not apply the fully connected neural layer. The CAE model consists of convolutional and deconvolution layers from CNN architecture. It utilizes a convolutional layer in the encoder part and a deconvolutional layer in the decoder part. Apparently, the convolutional layer is able to decrease the feature number while the deconvolution layer can increase the feature number. As a result, in CAE, the convolutional layer takes the role of the encoder to perform the dimensionality reduction, while the deconvolution layer is applied here to reconstruct the data. CAE takes advantage of the Convolutional and Deconvolution layers. As these layers utilize the kernel filters, a parameter sharing scheme is applied here to control the number of parameters. Therefore, compared with conventional Autoencoder, CAE has a smaller number of parameters so the training time of CAE is much smaller.
In the MSCNN-AE, we use multiple convolution kernels of different sizes to extract feature maps and combine them to obtain multiple sets of local features for accurate identification. The MSCNN structure is based on three multi-scale convolutional layers, as shown in Figure 1. The multi-scale convolution layer extracts features of the dataset using convolution kernels of three different sizes. The aim of this configuration is to mine spatial features expressing relevant relations among the basic features. With reference to Figure 1, the encoder and decoder of the AE include an input layer, convolutional layers, a pooling layer, a fully connected layer (latent space), and an output layer whose size equals that of the input layer.
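Let $X \in \mathbb{R}^{X_r \times X_c}$ and $F \in \mathbb{R}^{F_r \times F_c}$ be the CNN-input and the filter, respectively. The convolution operation between the CNN-input and the filter is defined by:

$$Y = X \circledast F$$

with $y_{i,j}$ the components of the filtered input $Y$. The size of $Y$ is defined through its row and column dimensions by:

$$Y_r = \left\lfloor \frac{X_r - F_r + 2P}{S_r} \right\rfloor + 1, \qquad Y_c = \left\lfloor \frac{X_c - F_c + 2P}{S_c} \right\rfloor + 1$$

where $S_r$ and $S_c$ are the strides on the row and column, respectively, which control the shifting of the filter on the input. In addition, $P$ is the padding, which controls the number of zeros around the border of $X$. Padding is used to change the size of the CNN-output without compromising the convolution result. By assuming $d = X_r \times X_c$, it is possible to map any $x_k$ to a point $x_{i,j}$ in a two-dimensional array (that looks like a matrix). Thus, with abuse of notation, we can assert that $x_k \equiv x_{i,j}$ with $k = 1 \ldots d$, $i = 1 \ldots X_r$, and $j = 1 \ldots X_c$. Each component of the filtered input is then

$$y_{i,j} = \sum_{p=1}^{F_r} \sum_{q=1}^{F_c} f_{p,q}\, \tilde{x}_{(i-1)S_r + p,\ (j-1)S_c + q}$$

with $i = 1 \ldots Y_r$ and $j = 1 \ldots Y_c$, where $\tilde{x}$ denotes the zero-padded input. The $y_{i,j}$ represent the new extracted features and act as input to the LSTM Autoencoder. These latent space features express more complex knowledge because they are linear combinations of the original features.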
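As a rough illustration of the convolution step and of how stride and padding determine the output size, the sketch below implements a single-channel 2D convolution in plain Python and applies kernels of three sizes to the same input. The kernel sizes 1, 2, and 3, the all-ones kernels, and the 4 x 4 input are hypothetical choices for the example, not the configuration used in the paper.

```python
def output_size(in_size, filt_size, stride=1, padding=0):
    """Number of filter positions along one axis."""
    return (in_size - filt_size + 2 * padding) // stride + 1

def conv2d(X, F, stride=1, padding=0):
    """Slide filter F over the zero-padded input X and sum the products."""
    rows, cols = len(X), len(X[0])
    f_rows, f_cols = len(F), len(F[0])
    padded = [[0.0] * (cols + 2 * padding) for _ in range(rows + 2 * padding)]
    for i in range(rows):
        for j in range(cols):
            padded[i + padding][j + padding] = X[i][j]
    out_rows = output_size(rows, f_rows, stride, padding)
    out_cols = output_size(cols, f_cols, stride, padding)
    return [[sum(F[p][q] * padded[i * stride + p][j * stride + q]
                 for p in range(f_rows) for q in range(f_cols))
             for j in range(out_cols)]
            for i in range(out_rows)]

def ones_kernel(size):
    """Hypothetical all-ones kernel, used only to make the sums easy to read."""
    return [[1] * size for _ in range(size)]

# Multi-scale view of one input: one feature map per kernel size.
X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
feature_maps = {k: conv2d(X, ones_kernel(k)) for k in (1, 2, 3)}
print({k: (len(m), len(m[0])) for k, m in feature_maps.items()})
```

Because the three feature maps have different shapes, a real multi-scale layer would typically use padding to align them before they are combined.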

Figure 3: Schematic representation of a LSTM cell

4.3 LSTM-AE based feature extraction

LSTM-based Autoencoder models are used to form a sequence-to-sequence architecture. The LSTM-AE consists of an encoder and a decoder function. The encoder learns the prominent characteristics of the input sample and creates an encoded version of it. The decoder aims to reconstruct the input from this internal representation.
In our case, the encoder function of the LSTM-AE maps a sequence of latent space features extracted from network traffic (high-level features) by the MSCNN-AE into a fixed-length vector of new features (latent space), which in turn is converted back to the same input sequence by the decoder function. This configuration is able to mine short- and long-distance dependencies within the sequence of basic features. An LSTM can remember long-term dependencies using the "LSTM cell", which allows it to remember or forget past information (shown in Figure 3). The cell state is updated through four internal activation layers called gates, implemented by a sigmoid neural network layer and a point-wise operation. Each gate is devoted to a specific goal, as described in Equations 13, 14, 15, 16, 17, and 18 below:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (13)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (14)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (15)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (16)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (17)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (18)$$

where $W_{\ast}$ and $U_{\ast}$ are linear transformations, $b_{\ast}$ are bias vectors, and $c_t$ and $h_t$ are the cell memory and output at time $t$.
The LSTM-AE receives a traffic sequence $X^{(i)}$ of length $W$, where $i$ is the normal training example index such that $1 \le i \le n$. The LSTM encoder produces a synthesized output vector $z^{(i)}$ of a pre-determined dimension based on Equations 13 to 18, which can be expressed by:

$$z^{(i)} = \phi_{enc}\left(X^{(i)}\right)$$

where $\phi_{enc}$ represents the non-linear encoder function of the LSTM architecture. These new latent space features express a compact representation of the temporal behavior of the basic features. They are used by the decoder to reconstruct the input sample, as depicted by Equation 19:

$$\hat{X}^{(i)} = \phi_{dec}\left(z^{(i)}\right) \qquad (19)$$

where $\phi_{dec}$ is the decoder function of the LSTM autoencoder (LSTM-AE). The objective of the decoder is to reconstruct the input sequence with minimum loss, where the loss is calculated in terms of the mean square error as shown in Equation (1).
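The gate updates of a single LSTM cell, and the way an encoder folds a whole sequence into one fixed-length vector, can be sketched in plain Python as below. This uses the standard LSTM formulation (input, forget, and output gates plus a tanh candidate); the dictionary layout of `params`, the gate names, and the toy weights are hypothetical and only serve the illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step; params[name] = (W, U, b) for gates i, f, o and candidate g."""
    acts = {}
    for name in ("i", "f", "o", "g"):
        W, U, b = params[name]
        pre = [wx + uh + bias
               for wx, uh, bias in zip(matvec(W, x_t), matvec(U, h_prev), b)]
        squash = math.tanh if name == "g" else sigmoid
        acts[name] = [squash(v) for v in pre]
    # New cell memory: keep part of the old memory, add part of the candidate.
    c_t = [f * c_old + i * g
           for f, c_old, i, g in zip(acts["f"], c_prev, acts["i"], acts["g"])]
    # New output: gated, squashed view of the cell memory.
    h_t = [o * math.tanh(c_new) for o, c_new in zip(acts["o"], c_t)]
    return h_t, c_t

def encode_sequence(sequence, params, hidden_size):
    """Run the cell over a sequence and return the last output as the code."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    return h

# Hypothetical toy cell: 2 input features, 1 hidden unit, same weights per gate.
toy = {name: ([[0.5, -0.3]], [[0.2]], [0.0]) for name in ("i", "f", "o", "g")}
code = encode_sequence([[0.1, 0.4], [0.3, 0.2], [0.9, 0.7]], toy, hidden_size=1)
print(len(code))
```

A decoder would run a second LSTM from this code vector to regenerate the sequence, and training would minimize the mean squared error between the input and the regenerated sequence.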

4.4 Classification

We train our Autoencoder model on latent space features based on normal data extracted from the LSTM-AE, as shown in Figure 2. Since the AE is trained only on "normal" data, the reconstruction loss for attack data is much higher than for normal data. If the reconstruction loss $\varepsilon(x_i)$ of a data point is higher than the threshold value $\tau$, the data point is classified as "attack"; otherwise, it is classified as "normal":

$$N = \{x_i : \varepsilon(x_i) < \tau\}, \qquad A = \{x_i : \varepsilon(x_i) > \tau\}$$

where $N$ represents the "normal" data packets, whose reconstruction error is less than the threshold, and $A$ contains the data points whose reconstruction error is greater than the threshold and which are considered "attacks". Since the result of the AE is not one hundred percent accurate, $N$ and $A$ also contain attack and normal data, respectively.
To achieve higher accuracy, i.e. to detect more intrusions, these two sets are then supplied as inputs to two Isolation Forest modules (Figure 2). The first module (Isolation Forest 1) takes the "attack" output $A$ of the AE and searches for anomalies, which in this case are normal data points. Similarly, the second module (Isolation Forest 2) takes the "normal" output $N$ of the AE and searches for anomalies, which in this case are attack data points. The attack data in the "normal" set and the normal data in the "attack" set are simply outliers. Since the AE has already identified most of the normal and attack packets in stage 1, $N$ contains only a small number of attack packets and $A$ contains only a small number of actual normal packets. Writing $O_1 \subseteq A$ and $O_2 \subseteq N$ for the outliers found by Isolation Forest 1 and Isolation Forest 2, respectively, the final sets are:

$$N' = (N \setminus O_2) \cup O_1, \qquad A' = (A \setminus O_1) \cup O_2$$

At the end, $N'$ and $A'$ are the final sets of normal and malicious network traffic packets.
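The two-stage logic can be summarized as simple set operations. In the sketch below, samples are identified by index, and the outlier sets that the two Isolation Forests would return are passed in directly as hypothetical inputs, since the point is only to show how the stage-1 split is corrected; the error values and the threshold of 0.5 are made up.

```python
def split_by_threshold(errors, threshold):
    """Stage 1: partition sample indices by reconstruction error."""
    normal = {i for i, err in enumerate(errors) if err <= threshold}
    attack = {i for i, err in enumerate(errors) if err > threshold}
    return normal, attack

def refine(normal, attack, outliers_in_attack, outliers_in_normal):
    """Stage 2: move the outliers of each set over to the other set."""
    final_normal = (normal - outliers_in_normal) | outliers_in_attack
    final_attack = (attack - outliers_in_attack) | outliers_in_normal
    return final_normal, final_attack

# Hypothetical reconstruction errors for four packets.
errors = [0.1, 0.9, 0.2, 0.8]
normal, attack = split_by_threshold(errors, threshold=0.5)

# Assumed stage-2 findings: packet 3 is an outlier in the "attack" set
# (likely normal) and packet 2 is an outlier in the "normal" set (likely attack).
final_normal, final_attack = refine(normal, attack,
                                    outliers_in_attack={3},
                                    outliers_in_normal={2})
print(sorted(final_normal), sorted(final_attack))
```

Each packet ends up in exactly one final set, so the correction step changes labels without dropping or duplicating any sample.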

(a) NSL-KDD (b) UNSW-NB15 (c) CICDDoS2019
Figure 4: Anomaly scores based on the reconstruction error of the MSCNN-LSTM autoencoder on the NSL-KDD, UNSW-NB15, and CICDDoS2019 test datasets, respectively.

5 Experiment setup and evaluation metrics

This section describes the datasets, experimental setup, and performance metrics used to evaluate the proposed approach.

5.1 Dataset Description

To evaluate our proposed approach, we considered three well-known public datasets, namely NSL-KDD nslkdd, UNSW-NB15 unswnb, and CICDDoS2019. The NSL-KDD benchmark dataset is freely available from the Canadian Institute for Cybersecurity. It was obtained by removing redundant records from the KDDCUP99 dataset so that machine learning-based models can produce unbiased results. In addition to normal traffic, this dataset consists of traffic from four attack categories, namely DoS, U2R, R2L, and PROBE.

Class    Training  Testing
Normal   67343     9711
Attack   57738     12833
Total    125081    22544
Table 1: NSL-KDD dataset

Table  1 shows the count of samples from the training and testing set from the NSL-KDD dataset. As shown in Table  1, it is highly imbalanced with fewer instances from U2R and R2L attack classes. Furthermore, the test dataset contains unknown attacks samples that do not appear in the training dataset.

Each traffic record in the NSL-KDD dataset is a vector of 41 continuous and nominal values. These 41 values can be further subdivided into four categories. The first category is the intrinsic type, which essentially refers to the inherent characteristics of an individual connection. The second category contains indicators that relate to the content of the network connection. The third category receives a set of values based on the study of the content of the connections in the time segment of 2 seconds. Finally, the fourth category is based on the destination host.
The UNSW-NB15 dataset is a benchmark dataset that contains nine families of intrusion attacks, namely Shellcode, Fuzzers, Generic, DoS, Backdoors, Analysis, Exploits, Worms, and Reconnaissance unswnb. This dataset is freely provided by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). We used the pre-configured training and testing sets from ACCS, as shown in Table 2. The training set has 175,341 samples and the test set has 82,332 samples, covering normal traffic and the nine types of attacks. The dataset has a total of 42 features, which can be subdivided into four categories. Similar to the NSL-KDD dataset, the first category is the content features; the second category has features that refer to the basic and general operation of the internet; the third category is connection features; and the fourth category is time-based features.

Class    Training  Testing
Normal   56000     37000
Attack   119341    45332
Total    175341    82332
Table 2: UNSW-NB15 dataset

The third dataset we used is CICDDoS2019, which has been widely used for DDoS attack detection and classification. The dataset contains a large number of up-to-date, realistic DDoS attack samples as well as benign samples. The total number of records contained in CICDDoS2019 is given in Table 3. We used all the benign and malicious samples from the training and test sets.

dataset total benign malicious
Training day 50,063,112 56,863 50,006,249
Testing day 20,364,525 56,965 20,307,560
Table 3: The number of records in CICDDoS2019

Each record of the dataset contains statistical features (e.g., timestamp, source, and destination IP addresses, source, and destination port numbers, the protocol used for the attack, and a label for a type of DDoS attack). The training dataset contains a total of 12 different types of DDoS attacks (i.e., NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN, and TFTP) while only 7 DDoS attacks are included in the testing dataset (i.e., PortScan, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag and SYN).

5.2 Experimental Setup

This study was carried out using a 2.3 GHz 8-core Intel Core i9 processor with 16 GB of memory running macOS Big Sur 11.4. The proposed approach is developed in the Python programming language with several statistical and visualization packages such as Scikit-learn, NumPy, Pandas, TensorFlow, and Matplotlib. Table 4 summarizes our system configuration.

Unit Description
Processor 2.3 GHz 8-core Intel Core i9
Operating System macOS Big Sur 11.4
Packages TensorFlow, Scikit-learn, NumPy, Pandas, Pyriemannian and Matplotlib
Table 4: Implementation environment specification
(a) NSL-KDD (b) UNSW-NB15 (c) CICDDoS2019
Figure 5: Receiver operating characteristic (ROC) curves showing the area under the curve (AUC) for our proposed model.

5.3 Evaluation Metrics

The proposed method is compared and evaluated using Accuracy, Precision, Recall, F1-score, and the area under the receiver operating characteristic (ROC) curve. In this work, we use the macro and micro averages of Recall, Precision, and F1-score for multi-class classification. All of the above metrics can be obtained from the confusion matrix (CM). Table 5 illustrates the CM for binary classes, which can be extended to multiple classes.

Predicted Positive Class Predicted Negative Class
Actual Positive Class True Positive (TP) False Negative (FN)
Actual Negative Class False Positive (FP) True Negative (TN)
Table 5: Illustration of confusion matrix
(a) NSL-KDD (b) UNSW-NB15 (c) CICDDoS2019
Figure 6: Confusion matrices for the NSL-KDD, UNSW-NB15, and CICDDoS2019 test datasets.

In Table 5, true positive (TP) is the number of samples predicted as the positive class that actually belong to the positive class; true negative (TN) is the number of samples predicted as the negative class that actually belong to the negative class; false positive (FP) is the number of samples predicted as the positive class that actually belong to the negative class; and false negative (FN) is the number of samples predicted as the negative class that actually belong to the positive class. Based on these terms, the evaluation metrics are calculated as follows.

Accuracy (ACC) measures the fraction of data samples that are correctly classified, as shown in Equation 24. For a balanced test dataset, higher accuracy indicates that the model has learned well, but for an unbalanced test dataset, relying on accuracy alone can give a misleading impression of the model’s performance.

\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{24} \]


Recall (also known as true positive rate) is the ratio of the correctly predicted samples of a class to the overall number of instances of that class. It can be computed using Equation 25. A higher recall value indicates better performance of the machine learning model.

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{25} \]


Precision measures the quality of the positive predictions. Mathematically, it is the ratio of correctly predicted samples to the number of all samples predicted as that particular class, as shown in Equation 26. Precision is usually paired with recall to evaluate the performance of the model. Because the two can pull in opposite directions, the F1-score is used as a combined measure for unbalanced test datasets.

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{26} \]


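The F1-score captures the trade-off between precision and recall. Mathematically, it is the harmonic mean of precision and recall, as shown in Equation 27.

\[ \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{27} \]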


The area under the curve (AUC) is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate on the y-axis against the false positive rate on the x-axis across different thresholds. Mathematically, AUC is computed as shown in Equation 29.

\[ \mathrm{AUC} = \int_{0}^{1} \frac{TP}{TP + FN} \, d\!\left(\frac{FP}{FP + TN}\right) \tag{29} \]
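As a rough illustration of this definition, the sketch below sweeps a threshold over anomaly scores, collects the resulting (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The function names `roc_points` and `auc_trapezoid` and the toy scores are assumptions made for this example; they are not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, ordered from the strictest to the loosest threshold.

    A sample is predicted positive when its score is >= the threshold.
    labels: 1 = positive class, 0 = negative class.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC needs at least one positive and one negative sample")
    # Start above every score so that the curve begins at (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Integrate TPR over FPR with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical anomaly scores (e.g. reconstruction errors) and ground-truth labels.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc_trapezoid(roc_points(scores, labels)))  # 0.75
```

Each distinct score is used as a threshold, so the curve is exact for the given sample; on large test sets a library routine would be the more practical choice.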


For classification on unbalanced test data, the performance of models is usually evaluated using macro- and micro-averaging of recall, precision, and F1-score. Macro-averaging is the arithmetic mean of the per-class precision, recall, and F1-scores, while micro-averaging sums the per-class TPs, FPs, and FNs before computing the metrics.
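The difference between the two averaging schemes can be seen in a small sketch. It assumes a confusion matrix in which `cm[i][j]` counts samples of actual class `i` predicted as class `j`; the helper names and the toy matrix are illustrative only and do not come from the paper.

```python
from typing import Dict, List, Tuple


def per_class_counts(cm: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Return (TP, FP, FN) for every class of a square confusion matrix."""
    n = len(cm)
    counts = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        counts.append((tp, fp, fn))
    return counts


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


def macro_micro(cm: List[List[int]]) -> Dict[str, Dict[str, float]]:
    counts = per_class_counts(cm)
    precisions = [safe_div(tp, tp + fp) for tp, fp, _ in counts]
    recalls = [safe_div(tp, tp + fn) for tp, _, fn in counts]
    f1s = [f1_score(p, r) for p, r in zip(precisions, recalls)]
    n = len(counts)
    # Macro: average the per-class metrics, so every class weighs the same.
    macro = {"precision": sum(precisions) / n, "recall": sum(recalls) / n, "f1": sum(f1s) / n}
    # Micro: pool the counts first, so every sample weighs the same.
    tp_sum = sum(c[0] for c in counts)
    fp_sum = sum(c[1] for c in counts)
    fn_sum = sum(c[2] for c in counts)
    micro_p = safe_div(tp_sum, tp_sum + fp_sum)
    micro_r = safe_div(tp_sum, tp_sum + fn_sum)
    micro = {"precision": micro_p, "recall": micro_r, "f1": f1_score(micro_p, micro_r)}
    return {"macro": macro, "micro": micro}


# Toy unbalanced example: 100 samples of class 0 and 10 samples of class 1.
cm = [[90, 10],
      [5, 5]]
result = macro_micro(cm)
print(result["macro"])
print(result["micro"])
```

On this toy matrix the macro scores are pulled down by the poorly detected minority class, whereas the micro scores are dominated by the majority class, which is why both are reported for unbalanced data.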

6 Results and Discussion

We made a number of observations to understand the performance implications during both the training and testing phases. Figure 4 shows the anomaly score, in terms of reconstruction error, for each data point in the NSL-KDD test dataset under our model. Some data points in both normal and abnormal traffic are clearly misclassified by the threshold alone. In our case, any point whose reconstruction error exceeds the mean plus two standard deviations of the reconstruction loss on normal training samples is classified as an abnormal data point. To rectify the misclassification caused by the threshold, we employ a second-stage classification based on isolation forest that removes outlier data points from the abnormal and normal sets produced by the Autoencoder’s threshold-based classification. Table 6 shows the performance of our proposed approach (MSCNN-LSTM-AE with isolation forest) on the NSL-KDD dataset, where it achieves 93.76 percent accuracy and 92.26 percent recall. In addition, we compared our method’s performance with other state-of-the-art methods using four metrics, namely accuracy, precision, recall, and F1-score. As the results in the table show, our approach obtains high recall and the highest F1-score among the NSL-KDD methods compared.
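The two-stage decision described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: the reconstruction error is used as the only input feature of the isolation forests, the `contamination` value is arbitrary, outliers found inside each predicted set are relabeled as the opposite class, and the errors are synthetic. The function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def threshold_from_normal(train_errors: np.ndarray, k: float = 2.0) -> float:
    """Mean plus k standard deviations of the errors on normal training samples."""
    return float(train_errors.mean() + k * train_errors.std())


def two_stage_predict(train_errors, test_errors, test_features,
                      contamination=0.05, seed=0):
    """Stage 1: threshold on reconstruction error. Stage 2: one isolation
    forest per predicted set; outliers are flipped (an assumption)."""
    threshold = threshold_from_normal(train_errors)
    first = (test_errors > threshold).astype(int)  # 1 = abnormal, 0 = normal
    corrected = first.copy()
    for cls in (0, 1):
        idx = np.flatnonzero(first == cls)
        if idx.size < 2:
            continue  # too few points to fit a forest
        forest = IsolationForest(contamination=contamination, random_state=seed)
        is_outlier = forest.fit_predict(test_features[idx]) == -1
        corrected[idx[is_outlier]] = 1 - cls
    return threshold, first, corrected


# Synthetic reconstruction errors standing in for the Autoencoder's output.
rng = np.random.default_rng(0)
train_errors = np.clip(rng.normal(0.10, 0.02, size=500), 0.0, None)
test_errors = np.clip(
    np.concatenate([rng.normal(0.10, 0.02, 200), rng.normal(0.30, 0.05, 50)]),
    0.0, None)
test_features = test_errors.reshape(-1, 1)  # assumed feature choice

threshold, first, corrected = two_stage_predict(train_errors, test_errors, test_features)
print(f"threshold={threshold:.4f}, relabeled={int((first != corrected).sum())}")
```

In practice the second stage would more plausibly operate on richer features (for example, the latent representation), so the one-dimensional feature used here should be read only as a placeholder.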

Similarly, we benchmarked our model on the UNSW-NB15 and CICDDoS2019 datasets. Figure 6 shows the confusion matrices for the NSL-KDD, UNSW-NB15, and CICDDoS2019 datasets obtained with our model. Some data points are still wrongly classified; we try to correct these through the second-stage classification by isolation forest, with the results shown in Table 6. Our model obtains high recall and F-score compared to other state-of-the-art methods. In this work, we do not study the effects of different hyper-parameters, hidden layers, and loss functions on the performance of our model. Fine-tuning the model’s parameters with an optimization technique, as is mostly done for the other state-of-the-art methods in the literature, could further increase the performance of our intrusion detection model.


Paper Dataset Techniques Acc Precision Recall F1-score
Sharafaldin et al. sharafaldin2019developing CICDDoS2019 Random Forest - 77 56 62
Rajagopal et al. rajagopal2021towards CICDDoS2019 Extended Decision Tree 97 99.0 97.0 97.8
Gohil et al. gohil2020evaluation CICDDoS2019 Extended Naive Bayes 96.25 96 96 96
Shieh et al. shieh2021detection CICDDoS2019 Bi-LSTM 98.18 97.93 99.84 -
De Assis et al. de2020near CICDDoS2019 CNN 95.4 93.3 92.4 92.8
De Assis et al. de2020near CICDDoS2019 MLP 92.5 84.4 94.2 89.0
Javaid et al. deep2 CICDDoS2019 AE + Regression 88.39 85.44 95.95 90.4
Sadaf et al. 9189883 CICDDoS2019 AE + Isolation Forest 88.98 87.92 93.48 90.61
Can et al. can2021detection CICDDoS2019 FS + MLP - 91.16 79.41 79.39
Wei et al. 9591559 CICDDoS2019 AE + MLP 98.38 97.91 98.48 98.18
Our method CICDDoS2019 MSCNN-LSTM-AE 99.56 98.91 98.81 98.46
B. Ingre 7058223 NSL-KDD ANN 81.2 96.59 69.35 80.73
M. Al-Qatf 8463474 NSL-KDD Sparse-AE + SVM 84.96 96.23 76.57 85.28
Ieracitano et al. IERACITANO202051 NSL-KDD AE 84.21 87 80.37 81.98
Ieracitano et al. IERACITANO202051 NSL-KDD LSTM 82.04 85.13 77.70 79.24
Ieracitano et al. IERACITANO202051 NSL-KDD MLP 81.65 85.03 77.13 78.67
Sadaf et al. 9189883 NSL-KDD AE 88.98 87.92 93.48 90.61
Javed et al. 7966342 NSL-KDD AE 88.39 85.44 95.95 90.04
Wen et al. 9552882 NSL-KDD AE 90.61 86.83 98.43 92.26
Our method NSL-KDD MSCNN-LSTM-AE 93.30 95.75 92.33 94.01
H. Zhang article4 UNSW-NB15 AdaBoost 86.41 72.83 95.96 82.81
H. Zhang article4 UNSW-NB15 Extra Trees 86.73 72.60 97.14 83.10
Kasongo and Sun kasongo2020performance UNSW-NB15 ANN 86.71 81.54 98.06 89.04
Hammad et al. 9312002 UNSW-NB15 Naive Bayes 76.04 76 83.4 76.8
Dickson and Thomas 10.1007/978-981-16-0419-5_16 UNSW-NB15 Logistic regression 84 - - -
Amaizu et al. 9289329 UNSW-NB15 DNN 88 90 88 87
Our method UNSW-NB15 MSCNN-LSTM-AE 89 88 89 87
Table 6: Comparison to other similar methods

7 Conclusion

In this paper, we present an unsupervised IDS that captures inter-dependencies in high-level basic features in the traffic data. Our IDS uses a novel Autoencoder that combines a multi-scale Convolutional Neural Network with Long Short-Term Memory (MSCNN-LSTM) to capture spatial-temporal dependencies in traffic data. The performance of the proposed approach was evaluated using the NSL-KDD, UNSW-NB15, and CICDDoS2019 datasets. The experimental results show that the proposed approach not only has good detection performance but also obtains high recall and F-score compared to the existing IDS models in the literature. The proposed approach can be further improved by using different feature selection techniques, optimizing hyper-parameters, and fine-tuning the model’s architecture. In future work, we will extend this model to multi-class scenarios to detect different classes of intrusion attacks, specifically minority attacks. We also plan to apply the proposed method to Android-based malware detection zhu2020multi; zhu2021task, or to ransomware detection and classification tasks zhu2021few; mcintosh2018large; mcintosh2019inadequacy, to evaluate its generalizability and practicability.


This research is supported by the Cyber Security Research Programme—Artificial Intelligence for Automating Response to Threats from the Ministry of Business, Innovation, and Employment (MBIE) of New Zealand as a part of the Catalyst Strategy Funds under the grant number MAUX1912.