With the advent of the 5G era, the importance of network security has become increasingly prominent. Thanks to the development of network terminals such as IoT devices and of cloud computing [3, 4], the scale of network traffic has grown exponentially. In response to cyberspace attacks, and in contrast to server-based defense methods [5, 6], network intrusion detection (ID) has become a research hotspot that benefits from the development of machine learning. Most studies generalize feature sets from network traffic as the basis for further detection, represent each session as a tensor, train classifiers on labeled datasets, and then find the behavioral characteristics of suspicious attacks. With efficient machine learning tools, the detection task is transformed into a learning task on the feature sets that is solved by a classification model, such as a deep network with high computing power.
However, adequate data cannot be guaranteed in most practical scenarios, so conflicts emerge between data-hungry models and data-insufficient application scenarios. Further, downstream tasks such as unknown attack detection under few-shot prior information are greatly affected. Relevant prior research is also hard to find, because ID is a topic restricted by application scenarios. In related fields such as image classification, there is extensive research on few-shot learning and unknown sample detection. However, session samples are coupled, as described in Fig. 1, and this phenomenon does not exist in image samples, so we cannot directly use the research results of the image classification field. The specific difficulties of unknown attack detection under few-shot conditions are as follows:
Difficulty in finding the trade-off between model depth and data volume. Intrusion traffic with sufficient data can be accurately detected, but this is not the common situation. For subdivided tasks such as detection under insufficient data, a deep model cannot be fully trained, while a shallow model cannot fully fit the feature sets. Concretely, the current state-of-the-art methods require a lot of prior knowledge, which is not available in the scenario targeted by this article. We therefore need to design a framework for these more finely subdivided detection tasks.
Imperfection of data augmentation methods. Due to the fragmentation of application scenarios, model-based or metric-based few-shot methods cannot be used directly. Therefore, data augmentation for insufficient data is the solution to the few-shot problem. Among augmentation methods, the Generative Adversarial Network (GAN) is one of the most widely used. However, a GAN cannot guarantee the deviation between easily confused categories: it can only guarantee the similarity between the generated samples and the target category, not the deviation between the generated samples and the samples of similar categories.
It can be seen from these difficulties that unknown attack detection under few-shot conditions is a comprehensive problem, so we designed a full-stack method to solve these problems together. We first propose SFE to summarize the context of session features, then propose GACN to implement intra-category generation, and finally improve the unknown attack detection method for the detection task. Our proposed method has the following advantages:
Insufficient samples will be augmented by prior knowledge. We propose a method for embedding session features to decouple the sessions to make them independent from the session context, bringing prior contextual information to the target sample. Further, through the pre-trained embedded model, few-shot traffic information will be augmented through prior knowledge, which can initially settle the few-shot problem.
Generated samples will not be confused. We propose GACN to solve the problem of confusing generated samples. Compared with using only a GAN, the adversarially generated samples are constrained by the cooperative model, which guarantees the deviation between generated samples and samples of similar categories.
More customized unknown attack detection method. Based on the proposed SFE and GACN, we obtain accurately augmented traffic data. Furthermore, we improve the existing unknown traffic detection method RTC to make it more suitable for unknown attack detection scenarios under few-shot conditions.
Therefore, this paper considers both factors, few-shot data and unknown attacks, and proposes a solution that addresses them simultaneously. The proposed method can mine unknown attacks that occur in the traffic to be detected in a scenario with few prior samples. In the evaluation stage, the effectiveness of SFE and GACN was evaluated separately, and unknown attack detection under few-shot conditions was completed successfully on a public data set.
2 Related Work
Few studies on intrusion detection have considered few-shot learning and unknown attack detection concurrently; therefore, we discuss the related literature separately.
Considering the scenarios for IDSs, when prior knowledge or target data is insufficient, large-scale machine learning tools such as deep neural networks cannot be adequately trained. One line of work used three conditional GANs to generate embedding features step by step. Xian et al. combined a GAN and a VAE, constrained the input of the generator to the VAE, and improved generation accuracy. Schönfeld et al. used two VAEs of the same structure, one to encode the image and the other to decode the class embedding. Annadani et al. showed that introducing semantics into the embedding space is beneficial to few-shot learning. Kodirov et al. used a semantic autoencoder to realize zero-shot learning, which to a certain extent solved the problem of domain shift between the training set and the test set. Another work proposed two methods, Deep-RIS and Deep-RULE, to solve the problem in different few-shot situations.
IDSs can provide a certain level of protection to computer networks; however, such security practices often fall short in the face of zero-day attacks. One study proposed a probabilistic approach and implemented a prototype system, ZePro, for zero-day attack path identification. Another presented a new data representation that integrates syntactic and sequential features of payloads in a unified feature space, providing a good solution for context-aware intrusion detection. Zhang et al. took the first step toward formally modeling network diversity as a security metric by designing and evaluating a series of diversity metrics. A further work devised a biodiversity-inspired metric based on the effective number of distinct resources. The work of [23] proposed a probabilistic approach to identify zero-day attack paths and implemented a prototype system named ZePro. Zhang et al. proposed a new scheme of Robust statistical Traffic Classification (RTC) that combines supervised and unsupervised machine learning techniques to meet the challenge of unknown network traffic classification.
However, if this method is directly applied to attack detection, it causes a high false positive rate, for two reasons. First, the shallow model used by RTC is not sufficient to fit the session feature set. Second, the method RTC uses to judge clusters during clustering is too simple to be extended. In this paper, we therefore improve RTC to make it more suitable for unknown attack detection: we design a category classification method for a single cluster and use a deep model instead of a shallow model, which reduces the false positive rate.
3 SFE-GACN: The Framework of Unknown Attack Detection
3.1 Session Features Embedding
Algorithm 1 presents the proposed Session Features Embedding method, which takes a feature set and produces its embedded feature set. First, a binary transformation is used to convert session features to a binary representation. Since different features in a sample have inconsistent data types and scales, we set a maximum bit width based on the longest binary code of each feature. We convert the features to integer type and map them to sequences of 0s and 1s; to keep a uniform coding length, shorter codes are padded with 0 up to the maximum bit width. After that, the session features are mapped into sequences of different lengths. We synthesize these small sequences into large sequences and average the weight of each feature. This process is completed by column embedding, which is elaborated in the next paragraph.
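The binary transformation step can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the per-column choice of bit width, the helper names `column_widths` and `encode_sample`, and the example rows are assumptions, and features are assumed to be non-negative and castable to integers.

```python
def column_widths(samples):
    """Bit width per feature column: the longest binary code in that column (at least 1)."""
    n_features = len(samples[0])
    return [max(1, max(int(row[j]).bit_length() for row in samples))
            for j in range(n_features)]


def encode_sample(sample, widths):
    """Concatenate the zero-padded binary codes of all features of one sample."""
    bits = []
    for value, width in zip(sample, widths):
        value = int(value)
        if value < 0:
            raise ValueError("this sketch only handles non-negative features")
        # format(..., "b") gives the binary digits; zfill pads with leading zeros.
        bits.extend(int(b) for b in format(value, "b").zfill(width))
    return bits


# Hypothetical session rows: (destination port, protocol number, byte count).
samples = [[80, 6, 1500], [443, 17, 64]]
widths = column_widths(samples)
encoded = [encode_sample(row, widths) for row in samples]
```

Every encoded row has the same length (the sum of the column widths), which is what allows the rows to be fed to a fixed-size embedding layer.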
In order to get the embedding space of the sample set itself, we regard the column vector of each feature in the sample set as a sentence set. Each sentence consists of all features of one sample, so there are M sentences in total, and each word in a sentence is given in its binary representation. For each sentence, we traverse every word and use its contextual information to predict it, which yields the embedding vector of each word. Specifically, we build two trainable matrices, one per layer, each with its own output dimension. We then use Stochastic Gradient Descent (SGD) to update the weights through backpropagation. To obtain the embedding vector, we use the first matrix to transform the binary-encoded sentence into a vector. The algorithm is applied to each sentence, and the resulting embedding vectors are merged vertically to obtain the total embedding matrix, in which each row contains the total vector of all the features of a sample after embedding. The process is shown in Fig. 2.
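To make the context-prediction idea concrete, the following is a small pure-Python sketch of training two matrices with SGD so that the averaged binary codes of the context words predict the center word. It is an assumption-laden illustration, not the paper's algorithm: the averaging of context codes, the softmax over distinct words, the function names `train_embedding` and `embed`, and all hyperparameters (`dim`, `window`, `lr`, `epochs`) are hypothetical choices.

```python
import math
import random


def train_embedding(sentences, dim=4, window=1, lr=0.1, epochs=50, seed=0):
    """sentences: lists of words; each word is a tuple of 0/1 bits of equal length."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    n_bits, n_words = len(vocab[0]), len(vocab)
    # First matrix: bits -> hidden vector. Second matrix: hidden vector -> word scores.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_bits)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_words)] for _ in range(dim)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for sent in sentences:
            for pos, target in enumerate(sent):
                lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
                ctx = [sent[j] for j in range(lo, hi) if j != pos]
                if not ctx:
                    continue
                # Average the binary codes of the context words (assumed input form).
                x = [sum(w[b] for w in ctx) / len(ctx) for b in range(n_bits)]
                h = [sum(x[b] * w1[b][d] for b in range(n_bits)) for d in range(dim)]
                scores = [sum(h[d] * w2[d][k] for d in range(dim)) for k in range(n_words)]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                norm = sum(exps)
                probs = [e / norm for e in exps]
                t = index[target]
                total -= math.log(probs[t])
                # Cross-entropy gradient w.r.t. the scores, then back to the hidden layer.
                g = [p - (1.0 if k == t else 0.0) for k, p in enumerate(probs)]
                gh = [sum(g[k] * w2[d][k] for k in range(n_words)) for d in range(dim)]
                for d in range(dim):
                    for k in range(n_words):
                        w2[d][k] -= lr * h[d] * g[k]
                for b in range(n_bits):
                    for d in range(dim):
                        w1[b][d] -= lr * x[b] * gh[d]
        losses.append(total)
    return w1, losses


def embed(word, w1):
    """Project one binary-encoded word through the first matrix."""
    return [sum(bit * row[d] for bit, row in zip(word, w1)) for d in range(len(w1[0]))]


# Hypothetical toy corpus of 2-bit words.
toy = [[(0, 1), (1, 0), (1, 1)], [(0, 1), (1, 0), (1, 1)], [(1, 1), (1, 0), (0, 1)]]
w1, losses = train_embedding(toy)
vector = embed((1, 0), w1)
```

Only the first matrix is kept after training, mirroring the text: it maps a binary-encoded input to its embedding vector.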
3.2 Generative Adversarial-Cooperative Network
We aim to conduct data augmentation with the GACN for each category of samples separately rather than uniformly; the specific method is presented in Algorithm 2. We first generate each kind of sample effectively while maintaining the deviation between the generated samples and the samples of other labels, which are called side samples. The core of GACN is to use the cooperative discriminator (Discriminator Cooperative) to supervise the training direction of the generator while the generator and the adversarial discriminator (Discriminator Adversarial) are engaged in adversarial training. While the generator gradually fits the generated sample space, it can avoid moving toward the sample space of other labels by adjusting its gradient descent.
At the beginning of GACN, we manually specify some parameters and initialize three deep neural network models: the adversarial discriminator monitors whether the generated samples fit the target category, the cooperative discriminator monitors whether the generated samples are close to the side category, and the generator generates the samples. First, we train the generator and the adversarial discriminator normally for a number of rounds, and then we train the cooperative discriminator. During this process, GACN constantly checks whether the samples produced by the generator incline to the side category. If they do, the generator is rolled back until its samples no longer show side-category characteristics. Specifically, while the cooperative discriminator is used to predict the generated samples, if the output of its last sigmoid function approaches 0.5 and no longer converges significantly (which means that the generator has fallen into the sample space of the side category), the generator rolls back to its previous state and negates the gradient descent direction that led it to generate side-category samples. The new weights of the generator are then calculated by SGD. Through repeated negation and renewal, the direction of gradient descent is affected by both discriminators at the same time, which ensures that the samples produced by the generator are no longer judged as the side category by the cooperative discriminator. To compensate for the multi-model training gap caused by rollback, retraining is performed after rollback so that the three models can work against and cooperate with each other to promote the positive iteration of the generator.
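As shown in Fig. 3, there are three networks in the GACN: the generator, the adversarial discriminator, and the cooperative discriminator. The generator and the adversarial discriminator are adversarial, while the generator and the cooperative discriminator are cooperative. During adversarial training between the generator and the adversarial discriminator, the generated samples gradually become similar to the real samples; meanwhile, under the supervision of the cooperative discriminator, the generated samples are always kept distinct from the side samples. A corresponding term represents the difference between the generated samples and the real samples. In the training process of GACN, the generated samples gradually approach the real sample space while maintaining a distinction from the side samples. Furthermore, as shown in Fig. 4, when only a GAN is used, the samples generated by the generator for the target category partially enter the sample space of the side category. GACN tries to avoid this situation: the cooperative discriminator supervises the samples produced by the generator and separates them from the side category.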
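The rollback-and-negate control flow can be illustrated with a deliberately tiny toy, which is not GACN itself and contains no neural networks. Everything here is an assumption made for illustration: a single scalar `x` stands in for the generator's weights, a quadratic pull toward `target` stands in for the adversarial training signal, and the fixed function `side_score` stands in for the cooperative discriminator's sigmoid output (above 0.5 inside a hypothetical side-category region that overlaps the target).

```python
import math


def side_score(x, side_center=-0.2, radius=0.5, sharpness=10.0):
    """Stand-in cooperative discriminator: score >= 0.5 inside the side-category region."""
    return 1.0 / (1.0 + math.exp(-sharpness * (radius - abs(x - side_center))))


def train(x0, target=0.0, lr=0.2, steps=60, supervise=True, rollback_coef=1.0):
    """Gradient steps toward `target`, optionally supervised by `side_score`."""
    x = x0
    for _ in range(steps):
        backup = x                    # state to return to if the step goes wrong
        grad = x - target             # gradient of 0.5 * (x - target) ** 2
        x = x - lr * grad             # ordinary (adversarial stand-in) update
        if supervise and side_score(x) >= 0.5:
            # Roll back, then move along the negated descent direction instead.
            x = backup + rollback_coef * lr * grad
    return x


plain = train(3.0, supervise=False)      # unsupervised: drifts into the side region
supervised = train(3.0, supervise=True)  # supervised: held outside the side region
```

In this toy the unsupervised run ends at the target, which lies inside the overlapping side region, while the supervised run settles just outside that region. It only shows the bookkeeping (backup, check, rollback, negated step); the retraining of the discriminators after rollback described in the text is not modeled.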
3.3 Two-step Unknown Attack Detection
When there is a small set of labeled samples and a large set of unlabeled samples that includes unknown attacks, we first use GACN to augment the known information, then fuse the known samples with the unknown samples and carry out preliminary detection by clustering, and finally use a deep neural network to reduce the false positive rate of the final detection. Algorithm 3 presents the proposed method of Two-step Unknown Attack Mining.
We use a two-step method to mine, from the unlabeled set, the samples whose labels never appear in the labeled set. In the first step, we augment the labeled set with GACN, merge the augmented set with the unlabeled set to form the total set, and use KMeans to partition the total set into clusters. For each cluster, when the share of unlabeled samples is larger than a certain proportion, all unlabeled samples in it are initially determined to be unknown attack samples. This candidate set contains many false positives, which the second step addresses. We regard all candidate samples as unknown samples of a single class; after mixing the multi-class labeled set with this one-class set to generate a training set, we train a deep network over the combined classes to classify it. We still use the candidate samples in the training set as the verification set; the purpose is to use the trained network to eliminate the false positive (FP) samples among the candidates and obtain a cleaner unknown attack sample set. The process is shown in Fig. 5.
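The cluster-level rule of the first step can be sketched as below. This is a simplified illustration under stated assumptions: the cluster assignments are taken as given (for example, produced by any KMeans implementation), and the function name `flag_unknown_candidates`, the threshold value, and the toy inputs are hypothetical. The second step, the deep classifier that filters false positives, is not shown.

```python
from collections import defaultdict


def flag_unknown_candidates(cluster_ids, is_unlabeled, min_unlabeled_ratio=0.7):
    """Return indices of unlabeled samples in clusters dominated by unlabeled samples."""
    members = defaultdict(list)
    for i, cluster in enumerate(cluster_ids):
        members[cluster].append(i)

    candidates = []
    for indices in members.values():
        unlabeled = [i for i in indices if is_unlabeled[i]]
        # Flag the cluster only when its unlabeled share exceeds the threshold.
        if len(unlabeled) / len(indices) > min_unlabeled_ratio:
            candidates.extend(unlabeled)
    return sorted(candidates)


# Toy input: cluster 0 is mostly labeled, cluster 1 is mostly unlabeled.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]
is_unlabeled = [False, False, True, False, True, True, True, True, False]
candidates = flag_unknown_candidates(cluster_ids, is_unlabeled)
```

Here the single unlabeled sample in cluster 0 is not flagged, because that cluster is dominated by labeled (known) traffic.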
By using the pre-trained matrix to obtain the embedded features under few-shot conditions, SFE can represent the uncoupled features of samples and improve the detection accuracy. Based on SFE, GACN produces augmented samples within each category. The improved two-step method then cooperates with the former two to complete unknown attack detection under few-shot conditions. This section verifies the effectiveness of SFE and GACN, and combines the multi-layer method to evaluate the detection indicators.
4.1 Effectiveness of SFE
We use CICIDS-2017 as the evaluation data set. Friday’s network traffic in the dataset is used to train the embedding model, and the effectiveness of SFE is then evaluated on the traffic of the other dates. In this section, we only evaluate the classification performance of SFE on few-shot traffic; the evaluation of unknown attack detection is conducted in Section 4.3.
In order to use the prior embedding matrix to process the unknown few-shot samples and obtain their positions in the embedded space, we reduce the sample size of the traffic on the other dates and use the pre-trained embedding matrix to embed the few-shot traffic samples into the corresponding embedding features. Finally, a single-layer perceptron is used to train multiple classifiers, and the convergence of the validation loss of each classifier is recorded to evaluate the effectiveness of SFE. The normal traffic of all dates is reduced by a factor of 20 and the attack traffic by a factor of 10, and the convergence is recorded. The experimental results are shown in Fig. 6. The lines of different colors in Fig. 6 represent the traffic data of different days, each of which contains part of the attack traffic.
As shown in Fig. 6, the conventional feature set converges faster, but its final loss is higher, whereas the embedded features converge to a lower value. Therefore, in the case of few-shot samples, using the pre-trained embedding matrix describes the sample characteristics more accurately.
We next discuss how samples are distributed in the embedded space in order to evaluate the effect of SFE on sample coupling. The sample distribution in the embedded space is obtained by Point Walk, which is presented in Algorithm 4. Point Walk starts from a random point and connects all samples in series, each time moving to the nearest point. During this process, we count the number of samples of each category within a sliding window step by step and obtain a set of statistics. For visualization, we take the steps as the X-axis and the counts of the different sample categories in the statistics as the Y-axis.
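A greedy nearest-neighbour walk of this kind can be sketched as follows. This is an illustrative reading of the description, not the paper's Algorithm 4: visiting only unvisited points, the Euclidean metric, the window size, and the name `point_walk` are assumptions.

```python
import math
import random
from collections import Counter, deque


def point_walk(points, labels, window=3, seed=0):
    """Visit every point once, always moving to the nearest unvisited point.

    Returns the visiting order and, for each step, the per-category counts
    inside a sliding window over the most recently visited points.
    """
    rng = random.Random(seed)
    current = rng.randrange(len(points))
    unvisited = [i for i in range(len(points)) if i != current]
    order = [current]
    recent = deque([labels[current]], maxlen=window)
    stats = [Counter(recent)]
    while unvisited:
        here = points[current]
        current = min(unvisited, key=lambda j: math.dist(here, points[j]))
        unvisited.remove(current)
        order.append(current)
        recent.append(labels[current])  # the deque drops the oldest label itself
        stats.append(Counter(recent))
    return order, stats


# Two well-separated toy clusters in a hypothetical 2-D embedded space.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
order, stats = point_walk(points, labels)
label_changes = sum(labels[i] != labels[j] for i, j in zip(order, order[1:]))
```

When categories are well separated in the embedded space, the walk exhausts one category before crossing to the next, so the windowed counts change in regular blocks rather than alternating.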
Using all the data from the IDS2018 dataset, which has richer categories, we retrain the embedding model and obtain the corresponding embedded dataset. Point Walk is applied to the new embedded dataset to obtain the corresponding statistics, and the coordinate map is drawn; the results are shown in Fig. 7.
In Fig. 7, the Y-axis represents the category statistics within the sliding window as the point walks through the embedded space, and the different colors represent attack samples of different categories. Fig. 7 shows that the sample categories encountered during Point Walk are always regular, and that the Euclidean distances between samples of the same category are small. This is similar to the rule of word embedding: the distance between “apple” and “pear” is much smaller than that between “apple” and “Barack Obama”. It indicates that SFE successfully trains the embedded space of the samples, so that the samples no longer rely solely on their traffic environment and the coupling between samples is reduced.
4.2 Effectiveness of GACN
In order to evaluate the effectiveness of GACN in preventing the generator from inclining to side categories, we use the Fashion-MNIST data set, in which samples of different categories are more easily confused. We augment the target category, treating another category as the side category, and at the same time train an evaluator to determine whether the samples generated during the iterations incline to the side category. The results are shown in Fig. 8, and the parameter settings of GACN and the models are shown in Table 1. In order to evaluate the supervising ability of GACN more accurately, the initial random noise is set to a specified value.
During the pre-training of the evaluator, we set the label of the target category to 0 and the label of the side category to 1. Therefore, when the output of the last layer of the evaluator (the score) is less than 0.5, the evaluated sample is determined not to incline to the side category; otherwise, an unwanted outcome has occurred: the generator is moving toward the sample space of the side category.
In Fig. 8, the X-axis represents the training epoch of GACN and of the GAN-only baseline; the upper bound represents the highest score given by the evaluator and the lower bound the lowest score. As the number of iterations increases, the score of GACN quickly converges to below 0.5, while the score of the GAN-only baseline fails to converge for a long time. This means that its generated samples are classified into the side category by the evaluator, resulting in the deviation of the generated samples.
In order to verify the distinction between samples of different categories generated by GACN in the session feature analysis setting, we apply GACN to the CICIDS-2017 data set, then apply t-SNE to reduce the feature dimensions and visualize them. We resample the data so that each category other than BENIGN is generated within its class using GACN. For clarity, we visualize the experimental results for the generated samples and the BENIGN samples, as shown in Fig. 9.
As can be seen from Fig. 9, the boundary between samples generated by the GAN-only baseline is fuzzy in the confusion region, so the classifier may misjudge as the sample size increases. GACN, in contrast, always keeps generated samples away from the samples of other categories, so the boundary between the generated samples of different categories is clear, which improves the performance of the classifier.
Further, in order to evaluate the difference between GACN and the GAN-only baseline in terms of traffic detection indicators, we use the CICIDS-2017 embedded feature set as experimental data. We first randomly sample few-shot samples, keeping the same sampling rate for each category. We then use GACN, GAN only, and no generation to change the sample size, obtaining three data sets, respectively. We train a deep network as the multi-class classifier on each and record the F1 score and the time spent. The experimental results are shown in Table 1. To reduce the error introduced by sampling, the experiment was repeated three times and the results were averaged. An NVIDIA K80 GPU was used as the training accelerator.
The experimental results show that in the case of few-shot samples, GACN increased the f1 score while maintaining a small increase in time overhead.
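For reference, a multi-class F1 score can be computed as in the sketch below. The text does not state which averaging scheme was used, so the macro average here is an assumption, as are the function name `macro_f1` and the toy labels.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In the toy call, class "a" has F1 = 2/3 and class "b" has F1 = 4/5, so the macro average is 11/15.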
4.3 Evaluation with Unknown Attack Detection
We propose a multi-layer solution to the problem of unknown attack detection with few-shot samples. We randomly sample 1/10 of each attack category from IDS2018 as the prior labeled feature set and use it to mine unknown attacks in the remaining samples. The labeled and unlabeled scales of each category are shown in Table 2. In order to reduce the computation cost, attacks with a large data size are reduced to 1/10 of their original size.
We test and evaluate each class of attacks in the unlabeled dataset as the unknown attack in turn. When a category is regarded as the unknown attack, we delete its data from the labeled data set and then detect it among the samples pending detection. RTC is used as the benchmark. The results are shown in Table 2.
The experimental results show that our method improves the TPR to some extent (an increase of 8.38%) and significantly reduces the FPR (a decrease of 12.77%), which shows that the proposed method performs well.
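The two indicators can be computed from binary decisions as in the following sketch, where a positive means "flagged as unknown attack". The function name `tpr_fpr` and the toy vectors are illustrative assumptions and do not reproduce the paper's data.

```python
def tpr_fpr(is_unknown_true, is_unknown_pred):
    """True positive rate and false positive rate for binary decisions."""
    pairs = list(zip(is_unknown_true, is_unknown_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # TPR = TP / (TP + FN)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0   # FPR = FP / (FP + TN)
    return tpr, fpr


truth = [1, 1, 1, 0, 0, 0, 0, 0]
flagged = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(truth, flagged)
```

In the toy vectors, two of three true unknown samples are flagged (TPR = 2/3) and one of five known samples is flagged by mistake (FPR = 0.2).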
5 Conclusions and Future Work
In this paper, we propose SFE-GACN, an unknown attack detection method under few-shot conditions, which fills a research gap in the targeted scenario. It is based on the existing session feature set classification method. SFE-GACN has several advantages:
(1) It can decouple the sessions in the feature set by embedding, and bring the prior information into the few-shot samples to complete the preliminary augmentation of few-shot samples.
(2) When data augmentation is performed, samples of multiple categories are generated within their own categories to prevent confusion between generated samples.
(3) It improves upon the conventional unknown attack detection methods, making them more suitable for detection under few-shot conditions, and the improved method can be combined with SFE and GACN to complete the final detection task.
SFE-GACN is used for the final detection task, where its performance exceeds that of the current state-of-the-art method. However, some points need to be improved and extended in the future.
(1) Optimal hyperparameter setting method in the model. In practical applications, a large number of hyperparameters need to be set manually by practitioners, including the time window and embedded dimensions in SFE, and the rollback judgment epoch, rollback coefficient, and backup cycle in GACN. The optimal selection method for these hyperparameters will be given in the future.
(2) Universal scalability. Even if we refine ID tasks to more targeted scenarios such as unknown detection tasks under few-shot conditions, there are still more detailed application scenarios to deal with. Examples include multi-class versus binary classification and the degree of data insufficiency. We will continue to explore these specific scenarios in the future and expand the universal scalability of SFE-GACN.
This work is partially sponsored by the State Key Development Program of China (No. 2018YFB0804402), National Science Foundation of China (No. U1736115, 61572355).
-  Xu G, Guo B, Su C, et al: Am I Eclipsed? A Smart Detector of Eclipse Attacks for Ethereum. Computers & Security (2020) 10.1016/j.cose.2019.101604
-  Li L, Xu G, Jiao L. A Secure Random Key Distribution Scheme Against Node Replication Attacks in Industrial Wireless Sensor Systems. IEEE Transactions on Industrial Informatics, 16(3), 2091-2101 (2020)
-  Ning J, Cao Z, Dong X, Liang K, Ma H, Wei L. Auditable σ-Time Outsourced Attribute-Based Encryption for Access Control in Cloud Computing. IEEE Transactions on Information Forensics & Security 13(1): 94-105 (2018)
-  Yang Y, Liu J, Liang K, Kim-Kwang R C, Zhou J. Extended Proxy-Assisted Approach: Achieving Revocable Fine-Grained Encryption of Cloud Data. ESORICS (2) 2015: 146-166 (2015).
-  Ning J, Xu J, Liang K, Zhang F, Chang E. Passive Attacks Against Searchable Encryption. IEEE Trans. Information Forensics and Security 14(3): 789-802 (2019)
-  Liang K, Man H, Liu J K, Willy S, Duncan S W, Yang G, Yu Y, Yang A. A secure and efficient Ciphertext-Policy Attribute-Based Proxy Re-Encryption for cloud data sharing. Future Generation Computer Systems. 52: 95-108 (2015)
-  Zhang J, Chen X, Xiang Y, et al. Robust Network Traffic Classification, IEEE/ACM Transactions On Networking. vol. 23, pp. 1257 - 1270 (2016).
-  Wang W, Zheng V, Yu H. A Survey of Zero-Shot Learning: Settings, Methods, and Applications. ACM Transactions on Intelligent Systems and Technology. 10(2), 37 (2019).
-  Zhang R, Che T, Ghahramani Z. MetaGAN: An Adversarial Approach to Few-Shot Learning. Neural Information Processing Systems. (2018)
-  Hu H, Tian T, Qian Y. Generative Adversarial Networks Based Data Augmentation for Noise Robust Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5044-5048 (2018).
-  Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative Adversarial Nets. Neural Information Processing Systems. vol. 2, pp. 2672–2680 (2014).
-  Xian Y, Sharma S, Schiele B. f-VAEGAN-D2: A Feature Generating Framework for Any-Shot learning. IEEE Conference on Computer Vision and Pattern Recognition. pp. 10275-10284 (2019)
-  Schönfeld E, Ebrahimi S, Sinha S. Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders. IEEE Conference on Computer Vision and Pattern Recognition. pp. 8247-8255 (2019).
-  Biswas S, Annadani Y. Preserving Semantic Relations for Zero-Shot Learning. IEEE Conference on Computer Vision and Pattern Recognition. pp. 7603-7612 (2019).
-  Kodirov E, Xiang T, Gong S. Semantic Autoencoder for Zero-Shot Learning. IEEE Conference on Computer Vision and Pattern Recognition. pp. 3174-3183 (2017).
-  Morgado P, Vasconcelos N. Semantically Consistent Regularization for Zero-Shot Recognition. IEEE Conference on Computer Vision and Pattern Recognition. pp. 6060-6069 (2017).
-  Sun X, Dai J, Liu P. Using Bayesian Networks for Probabilistic Identification of Zero-Day Attack Paths. IEEE Transactions on Information Forensics & Security, 13(10), 2506-2521 (2017)
-  Duessel P, Gehl C, Flegel U. Detecting zero-day attacks using context-aware anomaly detection at the application-layer. International Journal of Information Security, 16, 475–490 (2016).
-  Zhang M, Wang L, Jajodia S. Network Diversity: A Security Metric for Evaluating the Resilience of Networks Against Zero-Day Attacks. IEEE Transactions on Information Forensics and Security, 11(5), 1071-1086 (2016)
-  Wang L, Zhang M, Sushil J. Modeling Network Diversity for Evaluating the Robustness of Networks against Zero-Day Attacks. European Symposium on Research in Computer Security, pp 494-511 (2014).
-  Zhang M, Wang L, Jajodia S. Network Attack Surface: Lifting the Concept of Attack Surface to the Network Level for Evaluating Networks’ Resilience against Zero-Day Attacks. IEEE Transactions on Dependable and Secure Computing, (2018). 10.1109/TDSC.2018.2889086
-  Sun X, Dai J, Liu P. Towards probabilistic identification of zero-day attack paths. IEEE Conference on Communications and Network Security, pp. 64-72 (2016).
-  Sharafaldin I, Lashkari A H, Ghorbani A A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. International Conference on Information Systems Security and Privacy, Portugal. (2018).
-  Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747 (2017).
-  Bamler R, Mandt S. Dynamic word embeddings. International Conference on Machine Learning, vol. 70, pp. 380–389 (2017).
-  Markus R, Daniel S, Dieter L. Flow-based network traffic generation using Generative Adversarial Networks. Computers & Security, 82, 156-172 (2019).
-  Nicola P, Julian T, Alexander M. GPGPU Linear Complexity t-SNE Optimization. IEEE Transactions on Visualization and Computer Graphics, 26(1), 1172-1181 (2020)
-  Bamler R, Mandt S. Dynamic word embeddings. International Conference on Machine Learning. pp. 380-389. 2017.
-  Xu G, Zhang Y, Jiao L, et al. DT-CP: A Double-TTPs Based Contract-signing Protocol with Lower Computational Cost, IEEE Access, (2019). 10.1109/ACCESS.2019.2952213
-  Feng Q, He D, Zeadally S, Liang K. BPAS: Blockchain-Assisted Privacy-Preserving Authentication System for Vehicular Ad Hoc Networks. IEEE Trans. Industrial Informatics 16(6): 4146-4155 (2020)
-  Jiang L, Chen L, Thanassis G, Luo B, Liang K, Han J. Toward Practical Privacy-Preserving Processing Over Encrypted Data in IoT: An Assistive Healthcare Use Case. IEEE Internet of Things Journal 6(6): 10177-10190(2019)