Deep Transfer Learning: A Novel Collaborative Learning Model for Cyberattack Detection Systems in IoT Networks

Federated Learning (FL) has recently become an effective approach for cyberattack detection systems, especially in Internet-of-Things (IoT) networks. By distributing the learning process across IoT gateways, FL can improve learning efficiency, reduce communication overheads, and enhance privacy for cyberattack detection systems. Challenges in implementing FL in such systems include the unavailability of labeled data and the dissimilarity of data features across different IoT networks. In this paper, we propose a novel collaborative learning framework that leverages Transfer Learning (TL) to overcome these challenges. Particularly, we develop a novel collaborative learning approach that enables a target network with unlabeled data to effectively and quickly learn knowledge from a source network that possesses abundant labeled data. Importantly, state-of-the-art studies require the datasets of the participating networks to have the same features, which limits the efficiency, flexibility, and scalability of intrusion detection systems. In contrast, our proposed framework addresses these problems by exchanging the learned knowledge among various deep learning models, even when their datasets have different features. Extensive experiments on recent real-world cybersecurity datasets show that the proposed framework can improve accuracy by more than 40% compared with existing approaches.




I Introduction

In recent years, the rapid development of technologies such as 5G/6G, Industry 4.0, and the Internet-of-Things (IoT) has enabled numerous applications to become an integral part of many aspects of our daily lives. However, this fast growth has also led to an unprecedented amount of data and a proliferation of interconnected devices, e.g., sensors, smart cars, and cameras, which raises serious security and privacy concerns. Particularly, the increasing number of emerging applications has also brought forth many new types of cyberattacks. For example, the number of novel (zero-day) cyberattacks increased by 60% from 2018 to 2019 [11]. Besides the dire economic consequences, e.g., ransomware alone cost more than $5 billion globally in 2017 [20], cyberattacks pose serious threats to other areas with highly sensitive information, such as healthcare and public security. Since we do not know how new types of cyberattacks will be perpetrated, it is crucial to detect potential attacks so that mitigation and prevention methods can be employed promptly. Therefore, cyberattack detection methods play a key role in the development and application of current and future technologies.

Cyberattack detection methods can be classified into signature-based and anomaly-based methods. Signature-based methods rely on the prior knowledge (signatures) of a cyberattack to detect its perpetration. Although such methods can achieve a low false-positive rate, they require frequent updates of the signature database to achieve high performance. Moreover, these methods cannot detect some types of attacks, such as zero-day attacks, because their signatures are yet unknown to the database or defense systems. In contrast, anomaly-based methods can detect attacks based on the anomalies in the incoming traffic. Although these methods may cause more false alarms compared to those of the signature-based methods, anomaly-based methods are more effective in detecting new types of cyberattacks because they do not rely on known signatures. Since both signature-based and anomaly-based methods have complementary advantages, it is desirable to have solutions that can leverage advantages of both techniques, but at the same time can effectively overcome current limitations of these techniques.

Recently, thanks to their outstanding classification ability, Machine Learning (ML) techniques, especially deep learning (DL), have been widely applied to cyberattack detection problems. Particularly, DL models can effectively learn the signatures of various cyberattack types. Moreover, for new types of attacks, DL models can learn the normal system behaviors, thereby better identifying abnormal behaviors, i.e., potential cyberattacks [6]. As a result, in addition to effectively detecting various types of known attacks, DL-based methods can also be used to identify new types of cyberattacks without any prior knowledge of the attacks’ signatures. Nevertheless, DL-based cyberattack detection systems also face some practical challenges. Particularly, conventional DL approaches usually require a huge amount of data to achieve high performance. However, in many applications, data are very difficult to collect because they are often stored locally at user devices such as IoT devices, smartphones, and wearable devices. Collecting them poses a threat to user privacy because sensitive data (e.g., location and private information) have to be sent over the network and stored at a centralized server for processing. Besides the privacy concerns, transmitting such a collectively large amount of data also imposes an extra communication burden on the network. Consequently, these limitations have been hindering the effectiveness of DL techniques in cyberattack detection systems.

To address these problems, Federated Learning (FL) has emerged as a highly effective solution. Unlike conventional DL techniques that collect data and train the global model at a central server, FL enables the learning process to be distributed across all devices. Particularly, instead of being sent to a central server, the local data are used to train a copy of the global model locally on each user device. Then, the model weights obtained by each device are periodically sent to a central server for aggregation. Afterward, the aggregated weights are sent back to all devices to update their local models’ weights. Since only the weights are transmitted in FL, both the privacy and communication overhead issues can be mitigated [14].

Despite its effectiveness, FL still faces some challenges. Particularly, FL only performs well if the training data and the prediction data are independent and identically distributed (i.i.d.). Consequently, it is not robust to changes in the system, e.g., changes in network traffic due to the mobility of users, new types of devices joining the network, and so on. Moreover, the performance of FL largely relies on the availability of labeled data. However, acquiring sufficient labeled data can be costly and time-consuming. Even if the data are available, the data of the participating users usually have different structures, e.g., different features. This leads to difficulties or even mistakes when FL aggregates the global model. Consequently, such data may not be suitable for the intensive training process of FL [14] [3].

To address these limitations, transfer learning (TL) has been emerging as a promising solution [23, 26, 40]. Unlike DL and FL techniques that are trained only for a specific problem, TL can utilize “knowledge” from resource-rich data to enhance the training process and performance of ML models. Particularly, by transferring “knowledge” from similar scenarios with a lot of high-quality data, TL can address the lack of labeled data in the target networks. Moreover, TL can exchange “knowledge” even if the data of the target and source networks differ in features. Therefore, TL techniques can address the limitations of FL, especially for problems related to heterogeneous training data.

In this paper, we propose a novel collaborative learning framework that utilizes the strengths of both TL and FL to address the limitations of conventional DL-based cyberattack detection systems. Particularly, we consider a scenario with two different IoT networks. (The case with multiple networks is a straightforward extension, e.g., by scheduling the networks to exchange information in order.) The first network (source network) has an abundant labeled data resource, while the second network (target network) has very little data (most of which is unlabeled). Here, unlike most current works, which assume that the data at these networks have the same features [references], we consider a much more practical and general case in which the data at these two networks may have different features. To address the problem of dissimilar feature spaces of the target and source networks, we propose to transform them into a new joint feature space. In this case, at each learning round of the federated learning process, the trained models of the target and source networks can be exchanged through the joint feature space. Thus, by periodically exchanging and updating the trained model, the target network can eventually obtain a converged deep neural network that can predict attacks with high accuracy (thanks to the useful “knowledge” transferred from the source network). More importantly, unlike FL, where networks try to train a joint global model, our proposed framework enables the participating networks to obtain trained models that are specific to their own networks, i.e., models that better predict attacks for particular networks with different data structures. Extensive experiments on recent real-world datasets, including N-BaIoT [18] [19], KDD [1], NSL-KDD [33], and UNSW [21], show that our proposed framework can achieve an accuracy of up to 99% and an improvement of up to 40% over the unsupervised learning approach. The main contributions of this paper can be summarized as follows:

  • We propose a novel collaborative learning framework that can effectively detect cyberattacks in decentralized IoT systems. By combining the strengths of FL and TL, our proposed framework can improve learning efficiency and the accuracy of cyberattack detection in comparison with the conventional DL-based cyberattack detection systems.

  • We propose an effective transfer learning approach that allows the deep learning model of the data-rich network to transfer useful knowledge to the low-data network even when they have different features.

  • We perform extensive experiments on recent real-world datasets including N-BaIoT, KDD, NSL-KDD and UNSW to evaluate the performance of the proposed collaborative learning framework. The results show that our proposed approach can perform well over all datasets even with different data features.

The rest of this paper is organized as follows. We first discuss related works in Section II. We then provide fundamental background knowledge on essential concepts in Section III. We propose the federated transfer learning model for cyberattack detection in Section IV. After that, we present the settings and results of our experiments in Section V. Finally, we conclude the paper in Section VI.

II Related Work

II-A Deep Learning for Cyberattack Detection

There is a rich literature proposing DL approaches for cyberattack detection. In [34], a deep neural network (DNN) model is developed to detect zero-day attacks based on two types of data, i.e., network activities and local system activities. Extensive experiments on multiple datasets are conducted to find the optimal parameters for the proposed DNN and to compare its performance with other classifiers such as Naive Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). The results show that, for most of the datasets, the proposed DNN can achieve a higher detection accuracy and a lower false-positive rate than the other classifiers. Another DL approach is proposed in [24] to detect cyberattacks in mobile cloud computing environments. The main difference between [34] and [24] is that the approach in [24] includes a feature analysis phase before the learning phase. In the analysis phase, the datasets are analyzed to identify meaningful features, thereby reducing the data dimension and computational complexity. Experiments on the KDD [1], NSL-KDD [33], and UNSW [21] datasets show that the proposed approach can achieve a detection accuracy of up to 97.1%. Similarly, a DL-based model is proposed in [10] for intrusion detection based on network activities. Different from [34] and [24], this approach combines both deep (auto-encoder) and shallow (random forest) learning. Experimental results show that the proposed approach can detect cyberattacks with an accuracy of up to 98% on several datasets such as KDD, NSL-KDD, and UNSW. Although all of these approaches can achieve good performance, i.e., high detection accuracy and low false-positive rate, they still suffer from the inherent limitations of DL, such as privacy concerns and the reliance on data.

II-B Federated Learning for Cyberattack Detection

With the advent of FL, the research focus has recently shifted towards applying this framework to cyberattack detection, especially in environments with numerous devices such as IoT and mobile edge networks. In [2], an FL framework is proposed for cyberattack detection in an edge network. In this network, the data for intrusion detection are stored locally at each edge node. The edge nodes train on their data locally and send their models’ weights to an FL server for aggregation. After aggregation, the FL server sends the weights back to all edge nodes. In this way, each edge node can benefit from the other nodes’ data and training while protecting its privacy and reducing the network’s communication burden. Experiments with the NSL-KDD dataset show that the proposed approach can achieve an accuracy of up to 99.2%. A similar FL framework is proposed for anomaly detection in IoT environments in [25]. However, different from most related works, which train with publicly available datasets, the data used in [25] are collected from 33 real-world IoT devices of 23 different types. Experimental results show that the proposed framework can detect attacks with high accuracy, up to 95.6%. Another FL approach is proposed in [13] for attack detection in industrial cyber-physical systems. In the considered setting, there are multiple cyber-physical systems acting as FL nodes. Each node trains on its data locally and sends the weights to a cloud server, similar to the typical FL process. However, unlike the previous frameworks, the authors propose a novel architecture combining a convolutional neural network (CNN) and a gated recurrent unit for training at each FL node. Moreover, the authors also employ an asymmetric-key cryptographic mechanism for the communications between the FL nodes and the server, thereby preserving the security and privacy of the communications. Experiments with self-collected data show that the proposed approach can outperform other state-of-the-art approaches, e.g., [25, 30, 4], with an accuracy of up to 99.2%. Different from the mentioned frameworks, an FL framework is proposed in [29] that can utilize both labeled and unlabeled data to detect IoT malware. Particularly, the authors develop multi-layer perceptron and autoencoder (AE) models for labeled and unlabeled data, respectively. Experimental results using the N-BaIoT dataset show that the proposed framework can achieve a detection accuracy of up to 99%. However, because of the limitation of FL presented in the previous section, the learning model can only combine data with the same structure, i.e., the same features and labels.

II-C Transfer Learning for Cyberattack Detection

Although FL techniques can effectively address the privacy and communication load concerns of conventional ML for cyberattack detection, they are still facing some challenges. Particularly, FL approaches usually require high-quality and labeled data for training. However, collecting and labeling such data is expensive and time-consuming, especially for large-scale systems. On the other hand, unlabeled data are often abundant in environments such as IoT and mobile edge networks. Thus, a deep TL approach is proposed for IoT intrusion detection in [35] based on network activities, which can utilize both labeled and unlabeled data. In this approach, the authors employ two AEs. The first AE is trained with labeled data, while the second AE is trained with unlabeled data. Then, the knowledge is transferred from the first AE to the second AE by minimizing the Maximum Mean Discrepancy (MMD) distances between their weights. Experiments over nine IoT datasets were conducted to show that the proposed approach can achieve higher Area Under the Curve (AUC) scores compared to those of several other approaches. In [37], another deep TL approach is proposed to detect abnormal segments of the network traffic. Different from the approach in [35] that transfers knowledge from a data-rich domain to a domain with unlabeled data, the approach in [37] transfers knowledge from a pre-trained model to a domain with labeled but scarce data. Particularly, the authors propose to transfer the first 12 layers of a pre-trained CNN to the local model for fine-tuning with local data. Transferring only the first layers is more efficient because these layers usually capture the general features that are common across various networks, whereas the last layers are more network-specific. Experiments on four different datasets show that the proposed TL approach can improve the performance of the CNN by up to 21%.

Besides analyzing network traffic, another approach to detect cyberattacks is to analyze the devices’ fingerprints. Particularly, attackers may try to impersonate a device in the system by copying its signal. For this kind of attack, ML techniques can be used to detect if the signals are coming from the real device or the malicious device. TL approaches such as [31, 39, 32, 5] are proposed to identify cyberattacks based on device fingerprints. Among them, [32] and [5] leverage the environmental effects to classify signals from devices. To improve the classification accuracy and address the lack of data, these approaches transfer the knowledge from nearby devices (since they share similar environmental effects). On the other hand, [31] and [39] leverage the knowledge from previous experiences, i.e., data collected in the past. These past data are then combined with the current data for training, thereby addressing the lack of fingerprint data.

Different from all the abovementioned approaches, the collaborative learning framework proposed in this paper can leverage the strengths of both FL and TL to address the limitations of ML-based intrusion detection systems, e.g., lack of labeled data, privacy, and heterogeneous data feature spaces. Moreover, in our approach, each IoT network has a separate model that is fine-tuned specifically for that network; therefore, the model is more effective for that network’s cyberattack detection than FL frameworks with a single model for all networks. Furthermore, our proposed system model can utilize knowledge from multiple sources in the network instead of only transferring knowledge from a single source as proposed in most of the mentioned TL frameworks [35, 37, 39, 31].

III Fundamental Background

III-A Deep Learning

Deep learning, a sub-field of machine learning, is built on multi-layer Deep Neural Networks (DNNs) that imitate the neural networks in the human brain. In such a neural network, each neuron connects to others through a hypothesis function $h_\theta$ characterized by weights $\theta$ as follows:

$$\hat{y} = h_\theta(x), \tag{1}$$

where $x$ is the input data and $\hat{y}$ is the output of the neural network. In general, a deep learning neural network architecture contains three kinds of layers: the input layer, the hidden layers, and the output layer [9]. The input layer is the gateway that processes and transforms the input data into the space of the hidden layers. The hidden layers are the heart of deep learning; they include multiple neuron layers that process, analyze, and optimize the input data. The output layer is the last layer and represents the result of the whole deep learning process. From a processing perspective, deep learning includes two main processes, namely the training process and the testing process. The training process is the main process, in which the network analyzes the input data and optimizes its parameters. First, forward propagation is used to calculate the predicted value. Then, the predicted value $\hat{y}$ is evaluated by a loss function to calculate the error. Here, the loss function depends on the type of output or mechanism, e.g., the logistic loss function [16] with the predicted value $\hat{y}$ and the label $y$:

$$\ell(\hat{y}, y) = \log\big(1 + \exp(-y\hat{y})\big). \tag{2}$$

The output of the loss function is used by back propagation to update the parameters of the neural network. These steps are repeated during the training process to optimize the output and the neural network. After the training process, the testing process is performed to evaluate the whole learning process. There are three approaches to processing the data, namely supervised, unsupervised, and semi-supervised learning. The main difference between them lies in the data they process. While supervised learning processes labeled data, unsupervised learning approaches focus on unlabeled data, and semi-supervised learning approaches can work with both types of data. To date, each approach continues to attract considerable research in both science and industry.
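The training loop described above (forward propagation, loss evaluation, back propagation) can be illustrated with a minimal sketch for a single linear neuron. The function names, the learning rate, and the label convention $y \in \{-1, +1\}$ are assumptions made for this illustration only; they are not taken from the paper's implementation.

```python
import math

def score(theta, bias, x):
    """Forward propagation for one linear neuron: weighted sum plus bias."""
    return sum(t * xi for t, xi in zip(theta, x)) + bias

def logistic_loss(y_hat, y):
    """Logistic loss for a label y in {-1, +1} and a real-valued score y_hat."""
    return math.log(1.0 + math.exp(-y * y_hat))

def sgd_step(theta, bias, x, y, lr=0.1):
    """Back propagation for this model: dL/dy_hat = -y / (1 + exp(y * y_hat))."""
    y_hat = score(theta, bias, x)
    grad = -y / (1.0 + math.exp(y * y_hat))
    new_theta = [t - lr * grad * xi for t, xi in zip(theta, x)]
    new_bias = bias - lr * grad
    return new_theta, new_bias

# Hypothetical single training sample, repeated for a few iterations.
theta, bias = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
loss_before = logistic_loss(score(theta, bias, x), y)
for _ in range(50):
    theta, bias = sgd_step(theta, bias, x, y)
loss_after = logistic_loss(score(theta, bias, x), y)
```

With zero-initialized weights the score is 0, so the initial loss equals $\log 2$; each update then moves the score toward the correct sign and the loss decreases.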


Fig. 1: Architecture of a basic AE.

Autoencoder (AE) is a type of unsupervised deep learning model that can effectively learn a latent feature representation of the input data for multiple purposes, such as dimensionality reduction and generative modeling [15]. As illustrated in Fig. 1, a basic AE consists of three layers, i.e., one input layer, one hidden layer, and one output layer. Note that, for deep AEs, we can use multiple hidden layers to improve the learning performance. The input and hidden layers together form the encoder, which maps the $n$-dimensional input $x$ into an $m$-dimensional representation by the encode function $f$. Then, the decoder, i.e., the hidden and output layers, reconstructs the input by the decode function $g$ to create the output $\hat{x} = g(f(x))$. During the training process, the AE aims to learn the functions $f$ and $g$ that minimize the reconstruction error, i.e.,

$$\min_{f, g} \; \| x - g(f(x)) \|^2. \tag{3}$$

In practice, AEs are often employed for dimensionality reduction [15]. Particularly, an AE can map the original input data in an $n$-dimensional space to a representation in an $m$-dimensional space. To this end, the input and output layers of the AE are designed with $n$ nodes each, whereas the hidden layer contains $m$ nodes. Then, the AE is trained with the original data to minimize the reconstruction error, thereby allowing the original data to be reconstructed from the hidden layer with minimal error. After sufficient training, the hidden layer becomes an accurate $m$-dimensional projection of the original $n$-dimensional input data [36]. An AE is therefore a very useful mechanism for converting the feature space of the input data as well as for performing unsupervised learning on the unlabeled part of the input dataset. In the latter case, machine learning algorithms such as k-means can be used to cluster the unlabeled data into different groups. As a result, AEs have been widely used for various purposes, especially in cybersecurity for intrusion detection [35] [12].
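As a rough illustration of how an AE learns a lower-dimensional projection, the sketch below trains a tiny linear autoencoder that compresses 2-dimensional inputs into a 1-dimensional code by minimizing the squared reconstruction error. The linear encoder/decoder, the toy dataset, and all hyperparameters are assumptions chosen to keep the example small; practical AEs use nonlinear layers and a deep learning library.

```python
import random

def encode(w, x):
    """Encoder f: maps an n-dimensional input to a 1-dimensional code."""
    return sum(wi * xi for wi, xi in zip(w, x))

def decode(v, h):
    """Decoder g: maps the 1-dimensional code back to n dimensions."""
    return [vi * h for vi in v]

def reconstruction_error(w, v, data):
    """Mean squared reconstruction error ||x - g(f(x))||^2 over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(v, encode(w, x))
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(data)

def train(data, epochs=200, lr=0.01, seed=0):
    """Stochastic gradient descent on the reconstruction error."""
    rng = random.Random(seed)
    n = len(data[0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    for _ in range(epochs):
        for x in data:
            h = encode(w, x)
            err = [xh - xi for xh, xi in zip(decode(v, h), x)]
            back = sum(e * vi for e, vi in zip(err, v))  # error signal at the code
            v = [vi - lr * 2 * e * h for vi, e in zip(v, err)]
            w = [wi - lr * 2 * back * xi for wi, xi in zip(w, x)]
    return w, v

# Toy data lying on the line x2 = 2 * x1, so one dimension suffices.
data = [[t / 4, 2 * t / 4] for t in range(-4, 5)]
w0, v0 = train(data, epochs=0)   # untrained parameters (same seed)
w, v = train(data)
error_before = reconstruction_error(w0, v0, data)
error_after = reconstruction_error(w, v, data)
```

Because the toy data are intrinsically one-dimensional, the trained 1-dimensional code is enough to reconstruct the inputs with very small error, which is the property that makes the hidden layer usable as a compact feature space.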

III-B Federated Learning

Fig. 2: Illustration of the FL process.

FL is an emerging distributed learning framework that allows multiple devices to train a shared model in a distributed manner without collecting and aggregating the data from all the nodes in the network, thereby protecting data privacy and reducing the communication overhead. Typically, an FL system consists of an FL server and $K$ FL users. Each user $k$ in the system possesses a private dataset $D_k$ and a local model $w_k$, whereas the FL server has a global model $w$. As illustrated in Fig. 2, at the beginning of an FL process, the FL server broadcasts the initial global model $w^t$ to every user in the network. Then, each user $k$ trains the received model locally using its private dataset to generate the gradient $g_k$ and the updated local model $w_k^{t+1}$ [17]:

$$g_k = \nabla F_k(w^t), \qquad w_k^{t+1} = w^t - \eta\, g_k, \tag{4}$$

where $F_k$ is the local loss function of user $k$ and $\eta$ is the learning rate. After the training process finishes, the user sends its updated local model to the server. The server then aggregates all the updated local models’ weights into the global model using an aggregation algorithm, e.g., the FedAvg algorithm [17]:

$$w^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k^{t+1}. \tag{5}$$

In (5), $n$ is the total number of samples and $n_k$ is the number of samples of user $k$. The aggregated global model weight is then sent to the users for training in the next iteration. The FL process is repeated until a termination criterion is met, e.g., the global loss function converges. In this way, each user can receive updated learning knowledge from the others without sharing its own private data. However, conventional FL approaches require the datasets of all users to have the same characteristics, such as features or labels. Therefore, they are difficult to deploy widely in practice, especially in highly dynamic and decentralized settings such as cyberattack detection for IoT networks [38, 16].
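The local update and the sample-weighted aggregation of FedAvg can be sketched in a few lines. This is a minimal illustration with model weights stored as plain lists; the user names, gradients, and sample counts are made up for the example.

```python
def local_update(weights, gradient, lr):
    """One local gradient step: w_k = w - lr * g_k."""
    return [w - lr * g for w, g in zip(weights, gradient)]

def fed_avg(local_weights, sample_counts):
    """Sample-weighted average of local models: w = sum_k (n_k / n) * w_k."""
    n = sum(sample_counts)
    dim = len(local_weights[0])
    return [
        sum(n_k / n * w_k[j] for w_k, n_k in zip(local_weights, sample_counts))
        for j in range(dim)
    ]

# One FL round with two hypothetical users.
global_model = [0.0, 0.0]
gradients = {"user_1": [1.0, -2.0], "user_2": [3.0, 0.0]}  # made-up local gradients
counts = {"user_1": 100, "user_2": 300}                    # made-up dataset sizes
local_models = [local_update(global_model, g, lr=0.1) for g in gradients.values()]
new_global = fed_avg(local_models, list(counts.values()))
```

Here the second user holds three times as many samples as the first, so its local model contributes three quarters of the aggregated weights. Note that the averaging is only well defined when all local models share the same shape, which is the structural constraint on conventional FL discussed above.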

III-C Transfer Learning

Conventional deep learning uses a dataset collected from the network to learn and predict the network behaviors. However, when the dataset is small and lacks useful information, the accuracy of the trained model degrades dramatically [26]. Transfer Learning (TL) is a technique that allows a model trained on a small dataset to improve its learned knowledge by inheriting the knowledge learned by other models with much better data quality. This method can also overcome the limitations of conventional FL by allowing datasets with different characteristics (e.g., features or labels) to share and exchange their knowledge. As a result, this technique has a wide range of applications in practice [27] [23].

Fig. 3: Illustration of a system model for cyber attack detection in IoT networks.

TL is generally defined based on two fundamental concepts, i.e., domain and task. A domain $\mathcal{D}$ consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, i.e., $\mathcal{D} = \{\mathcal{X}, P(X)\}$. Particularly, $\mathcal{X}$ contains all the features of a dataset, whereas $P(X)$ is the probability distribution of the dataset. Given a domain $\mathcal{D}$, a task $\mathcal{T}$ is defined by a function $f(\cdot)$ that aims to map the domain $\mathcal{D}$ to a label space $\mathcal{Y}$. Let $\mathcal{D}_S$, $\mathcal{T}_S$, $\mathcal{D}_T$, and $\mathcal{T}_T$ denote the source domain, source task, target domain, and target task, respectively. The main objective of TL is to utilize the knowledge from the source, i.e., $\mathcal{D}_S$ and $\mathcal{T}_S$, to improve the learning process of the target task, i.e., to learn $f_T(\cdot)$. Based on the availability of labeled data and the differences among $\mathcal{D}_S$, $\mathcal{T}_S$, $\mathcal{D}_T$, and $\mathcal{T}_T$, TL can be classified into inductive TL ($\mathcal{T}_S \neq \mathcal{T}_T$, labeled target data), transductive TL ($\mathcal{D}_S \neq \mathcal{D}_T$, $\mathcal{T}_S = \mathcal{T}_T$, labeled source data), and unsupervised TL (unlabeled target and source data) [27, 22, 23]. Thanks to its ability to inherit knowledge from larger sources, TL has been widely applied in various sectors, especially in cybersecurity to detect cyberattacks in IoT networks [35, 23, 7].

IV Proposed Federated Transfer Learning Framework for Cyberattack Detection in IoT Networks

IV-A System Model

The conventional FL model requires a centralized server to maintain and aggregate all the trained models throughout the learning process. However, such a server may be costly to maintain and may not be effective to deploy in IoT networks. Thus, in this work, we propose a federated transfer learning model that allows the learning process to be performed more flexibly and effectively in IoT environments. In particular, we consider a network that has unlabeled data (e.g., Network B as illustrated in Fig. 3) and wants to learn knowledge from other networks that have abundant labeled data. In this case, this network connects with a source network (e.g., Network A as illustrated in Fig. 3) and nominates itself as a centralized node that can train on its own data as well as perform transfer learning to exchange knowledge with the source network. In this way, this network can improve its accuracy in identifying network traffic by learning useful knowledge from other labeled networks.

Each network can be managed by an IoT gateway and possesses its own private dataset. The IoT gateway uses its deep learning model to distinguish normal from abnormal traffic. It is important to note that, unlike conventional FL approaches [12], in this work we consider a practical scenario in which the datasets of the networks may have different features.

IV-B Proposed Federated Transfer Learning Approach for Cyberattack Detection

Algorithm 1 Federated Transfer Learning Algorithm: Training Process
  Input: the learning rate $\eta$, the weight parameters $\gamma, \lambda$, the maximum iteration $m$, the tolerance $t$; Network A and Network B initialize model parameters $\theta^A, \theta^B$
  Output: the trained model parameters $\theta^A, \theta^B$
  $iter = 0$
  while $iter \le m$ do
     Network A performs:
        $u_i^A \leftarrow Net^A(\theta^A, x_i^A)$ for $i \in D_A$
        Send $\{u_i^A\}$ to Network B
     Network B performs:
        $u_i^B \leftarrow Net^B(\theta^B, x_i^B)$ for $i \in D_B$
        Send $\{u_i^B\}$ to Network A
     Network A performs:
        Compute $\partial\mathcal{L}/\partial\theta^A$ and $\mathcal{L}$, then send them to Network B
     Network B performs:
        Compute $\partial\mathcal{L}/\partial\theta^B$ and $\mathcal{L}$, then send them to Network A
     Network A performs:
        Update $\theta^A$
     Network B performs:
        Update $\theta^B$
     if $\mathcal{L}_{prev} - \mathcal{L} \le t$ then
        Send stop signal to Network B
        break
     else
        $\mathcal{L}_{prev} = \mathcal{L}$
        $iter = iter + 1$
        continue
     end if
  end while

In this section, we propose a highly-effective federated transfer learning model that can perform exchange learning between an unlabeled network and multiple networks which may have different features. To better analysis the impact of our proposed approach, we consider a specific scenario in which one labeled network is used as a source network to support an unlabeled network (i.e., target network). The scenario with one unlabeled network and multiple labeled networks can be straightforwardly extended, and we leave it for future study. Fig. 4 describes the training and predicting processes of FTL algorithm that we use in this case. We assume that Network A has a labeled dataset with where is the number of samples of dataset A. In contrast, Network B has an unlabeled dataset with where is the number of samples of dataset B. Each network has it own model parameters called and . As shown in  (1), the outputs of two neural networks are and . We need to find the prediction function to predict the output of Network B. To find a high-quality predict function, we first need to minimize the loss function using the labeled dataset as follows:


where is the overlapping dataset between A and B, and represents the loss function as shown in (2). In addition, datasets A and B may have some small overlapping samples, and thus we can use these samples to optimize the loss function. We denote as the overlapping samples between dataset A and B. we need to minimize the alignment loss function between A and B as follows:


where represents the alignment loss function. Common choices for the alignment loss are based on the modulus or the angle . Lastly, we add the regularization terms and , in which and are the numbers of layers in the neural networks of Network A and Network B, to obtain the final loss function that needs to be minimized:


where and are the weight parameters. The gradients for updating are calculated by the following formula:
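For orientation, the loss terms and the gradient described in this passage can be written in a generic two-party form. The sketch below borrows the notation of the federated transfer learning formulation in [16]; every symbol in it (hidden representations $u^A$, $u^B$, losses $\ell_1$, $\ell_2$, weights $\gamma$, $\lambda$, layer counts $L_A$, $L_B$, and sample counts $N_c$, $N_{AB}$) is an assumption adopted for illustration and may differ from the notation used in this paper.

```latex
% Assumed notation (illustrative only, not taken from this paper).
\begin{align}
u_i^A &= \mathrm{Net}^A(x_i^A), \qquad u_j^B = \mathrm{Net}^B(x_j^B) \\
\mathcal{L}_1 &= \sum_{i=1}^{N_c} \ell_1\big(y_i^A, \varphi(u_i^B)\big) \\
\mathcal{L}_2 &= \sum_{i=1}^{N_{AB}} \ell_2\big(u_i^A, u_i^B\big),
  \qquad \ell_2 = \lVert u_i^A - u_i^B \rVert_F^2
  \ \text{(modulus)} \quad \text{or} \quad \ell_2 = -u_i^A (u_i^B)^{\top} \ \text{(angle)} \\
\mathcal{L}_3^A &= \sum_{l=1}^{L_A} \lVert \theta_l^A \rVert_F^2, \qquad
\mathcal{L}_3^B = \sum_{l=1}^{L_B} \lVert \theta_l^B \rVert_F^2 \\
\mathcal{L} &= \mathcal{L}_1 + \gamma \mathcal{L}_2
  + \frac{\lambda}{2}\big(\mathcal{L}_3^A + \mathcal{L}_3^B\big) \\
\frac{\partial \mathcal{L}}{\partial \theta_l^i} &=
  \frac{\partial \mathcal{L}_1}{\partial \theta_l^i}
  + \gamma \frac{\partial \mathcal{L}_2}{\partial \theta_l^i}
  + \lambda \theta_l^i, \qquad i \in \{A, B\}
\end{align}
```

The last line follows from differentiating the combined loss term by term: the regularization term $\frac{\lambda}{2}\lVert \theta_l^i \rVert_F^2$ contributes $\lambda \theta_l^i$.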


The training process is described in Algorithm 1. First, we initialize and and calculate and from and . Then, Network A sends to Network B to calculate the gradients of . Similarly, Network B sends to Network A to calculate the gradient and the loss function as in (8). After that, Networks A and B send the gradients and losses to each other and update their parameters at each iteration to minimize the loss as in (9).
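To make the objective that both networks minimize more concrete, the following NumPy sketch evaluates a combined loss of the form "prediction loss + weighted alignment loss + weighted regularization". It is only an illustration under stated assumptions, not the implementation used in this paper: labels are assumed to be in {-1, +1}, the prediction function is assumed to be a simple linear form, the prediction loss is assumed to be logistic, the alignment loss is assumed to be a squared distance, and the function name `ftl_loss` and its parameters `gamma` and `lam` are hypothetical.

```python
import numpy as np


def ftl_loss(u_a, y_a, u_a_ov, u_b_ov, y_ov, thetas_a, thetas_b,
             gamma=1.0, lam=0.01):
    """Return L1 + gamma * L2 + (lam / 2) * (L3_A + L3_B) for one batch.

    u_a, y_a        : hidden representations and {-1, +1} labels of Network A
    u_a_ov, u_b_ov  : representations of the overlapping samples in A and B
    y_ov            : labels of the overlapping samples (held by Network A)
    thetas_a/b      : lists of per-layer weight arrays of the two networks
    """
    # Assumed prediction function: phi(u_b) = u_b . mean_i(y_i * u_a_i).
    w = (y_a[:, None] * u_a).mean(axis=0)
    phi = u_b_ov @ w
    # L1: logistic loss on the overlapping samples, log(1 + exp(-y * phi)).
    l1 = np.sum(np.logaddexp(0.0, -y_ov * phi))
    # L2: alignment loss, here the squared distance between representations.
    l2 = np.sum((u_a_ov - u_b_ov) ** 2)
    # L3: squared Frobenius norms of all layer weights (regularization).
    l3 = (sum(np.sum(t ** 2) for t in thetas_a)
          + sum(np.sum(t ** 2) for t in thetas_b))
    return float(l1 + gamma * l2 + 0.5 * lam * l3)
```

In a two-party setting, each network would only compute the parts of this expression that depend on its own data and exchange the intermediate terms, as in the steps of Algorithm 1; the single function above merges them purely for readability.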

When the training completes, the prediction process described in Algorithm 2 is called to predict the final result for the unlabeled dataset . Similar to the training process, the dataset first goes through Network B to calculate . Then, Network B sends to Network A to leverage the knowledge learned by model A. Network A predicts the results and sends them back to Network B to classify the attack and normal behaviors of the network.

(a) FTL training process.
(b) FTL predicting process.
Fig. 4: The proposed FTL algorithm.
1:  Input: The model parameters and dataset ;
2:  Output: The prediction ;
3:  Network B performs:
4:   for
5:  Send to Network A;
6:  Network A performs:
7:  Compute and send it to Network B.
Algorithm 2 Federated Transfer Learning Algorithm: Predicting Process

IV-C Evaluation Methods

As mentioned in [8, 28], the confusion matrix is typically used to evaluate system performance, especially for intrusion detection systems. We denote TP, TN, FP, and FN as “True Positive”, “True Negative”, “False Positive”, and “False Negative”, respectively. The Receiver Operating Characteristic (ROC) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. Then, we use the Area Under the Curve (AUC), i.e., the area under the ROC curve, to evaluate the performance of the algorithm as follows:
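The ROC and AUC computation described above can be sketched as follows. The sketch uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN) and the trapezoidal rule for the area; the function names and the toy labels and scores are hypothetical and serve only as an illustration.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by sweeping a decision threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Start above every score (nothing flagged), then lower the threshold.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    fpr, tpr = [], []
    for t in thresholds:
        flagged = scores >= t
        tp = np.sum(flagged & (y_true == 1))
        fp = np.sum(flagged & (y_true == 0))
        tpr.append(tp / n_pos)  # TPR = TP / (TP + FN)
        fpr.append(fp / n_neg)  # FPR = FP / (FP + TN)
    return np.array(fpr), np.array(tpr)


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy example: 1 = attack, 0 = normal.
fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_trapezoid(fpr, tpr))  # 0.75
```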


In our experiments, we randomly select samples from the original dataset to test the algorithm. In this scenario, the is often used to evaluate the results of random tests. The is calculated by the following formula:


where is the mean and is the standard deviation. The results are summarized by the significant number, given by the following formula:


where is the significant number that represents the results of 30 random runs, and the confidence of this number is calculated by . Typically, the p_value is set to around 0.01 or 0.05, corresponding to a confidence in the significant number of around 99% or 95%, respectively.
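One plausible reading of the significant number, offered here only as an assumption, is the p_value-quantile of a normal distribution fitted to the AUC values of the repeated runs, i.e., the AUC level that a run is expected to exceed with probability 1 - p_value. The sketch below implements that reading with the Python standard library; the function name `significant_number` and the example AUC values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def significant_number(auc_runs, p_value):
    """Assumed definition: p_value-quantile of N(mean, std) of the AUC runs."""
    mu = mean(auc_runs)      # sample mean of the AUC values
    sigma = stdev(auc_runs)  # sample standard deviation (needs >= 2 runs)
    return NormalDist(mu, sigma).inv_cdf(p_value)


# Hypothetical AUC values from repeated random runs.
runs = [0.90, 0.92, 0.94, 0.96, 0.98]
for p in (0.01, 0.03, 0.05):
    print(p, round(significant_number(runs, p), 4))
```

Under this reading, a smaller p_value gives a more conservative (lower) significant number, so the reported value grows as p_value increases.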

V Performance Analysis

V-A Datasets

In this experiment, we use four popular cybersecurity datasets, namely the N-BaIoT [18] [19], KDD [1], NSL-KDD [33], and UNSW [21] datasets, to evaluate the performance of the proposed method. The Network-based Detection of IoT Botnet Attacks (N-BaIoT) dataset [18] [19] contains information collected in a test network under normal and attack conditions. The attacks were launched from servers against nine IoT devices, and the overall network behavior was captured by a sniffer server to build the dataset. This dataset is characterized by 115 features for both normal and attack behaviors. The attack type in this dataset is Distributed Denial of Service (DDoS), implemented by two well-known botnets, namely Mirai and BASHLITE. The BASHLITE botnet performed 5 types of attacks: network scanning (scan), spam data sending (junk), UDP flooding (udp), TCP flooding (tcp), and the combination of sending spam data and opening a port to a specific IP address (combo). The Mirai botnet also performed 5 types of attacks: scan, ACK flooding (ack), SYN flooding (syn), udp, and optimized UDP flooding (udpplain).

In addition to the IoT datasets, we also evaluate our proposed solution on some classical intrusion detection datasets, i.e., the KDD [1], NSL-KDD [33], and UNSW [21] datasets. The KDD dataset [1] includes many different kinds of network attacks simulated in a military network environment. It has 41 features and classifies attacks into 4 groups: Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L). The NSL-KDD dataset [33] inherits the properties of the KDD dataset [1], such as the features and types of attacks, but eliminates the redundant samples in the training dataset and the duplicated samples in the testing dataset. Although both the KDD and NSL-KDD datasets are well known and used in many research works, they were developed a long time ago and thus do not cover some modern attacks. Therefore, a more recent dataset, i.e., the UNSW dataset [21], is also considered in this work. Unlike KDD and NSL-KDD, this dataset has 42 features and covers 9 kinds of attacks, namely DoS, Backdoors, Worms, Fuzzers, Analysis, Reconnaissance, Exploits, Shellcode, and Generic.

Fig. 5: The data of the participating networks used in this experiment.
Dataset Device name Total features
IoT1 Danmini_Doorbell 115
IoT2 Ecobee_Thermostat 115
IoT3 Ennio_Doorbell 115
IoT4 Philips_B120N10_Baby_Monitor 115
IoT5 Provision_PT_737E_Security_Camera 115
IoT6 Provision_PT_838_Security_Camera 115
IoT7 Samsung_SNH_1011_N_Webcam 115
IoT8 SimpleHome_XCS7_1002_WHT_Security_Camera 115
IoT9 SimpleHome_XCS7_1003_WHT_Security_Camera 115
KDD - 41
UNSW - 42
TABLE I: Dataset preparation

V-B Experiment Setup

In this section, we carry out experiments using all the aforementioned datasets to evaluate the performance of the proposed solution. We denote IoT1-9 as the datasets of the nine IoT devices. Table I lists the total number of features and the representative names of the datasets, and Fig. 5 describes how the data in each dataset are separated. The participating data are randomly selected from each dataset. The selected data are then separated into labeled data (data of Network A) and unlabeled data (data of Network B) with different features and samples; the two parts share about 10% of the total dataset samples. We consider two cases: the first with 2000 unlabeled samples and 9577 labeled samples (CASE 1), and the second with 10000 unlabeled samples and 47893 labeled samples (CASE 2).
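The data separation described above can be illustrated with the following NumPy sketch. The exact partitioning procedure is not specified here, so the details are assumptions: the overlap is taken as 10% of the selected samples, the features are split into two disjoint random halves, and the function name `split_networks` and its parameters are hypothetical.

```python
import numpy as np


def split_networks(X, y, n_labeled, n_unlabeled, overlap_frac=0.10, seed=0):
    """Split one dataset into labeled data (Network A) and unlabeled data
    (Network B) with disjoint feature subsets and some shared samples."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_overlap = int(round(overlap_frac * (n_labeled + n_unlabeled)))
    n_needed = n_labeled + n_unlabeled - n_overlap
    if n_overlap > min(n_labeled, n_unlabeled) or n_needed > n_samples:
        raise ValueError("not enough samples for the requested split")

    # Randomly select the participating samples.
    idx = rng.permutation(n_samples)[:n_needed]
    shared = idx[:n_overlap]            # samples seen by both networks
    only_a = idx[n_overlap:n_labeled]   # samples seen only by Network A
    only_b = idx[n_labeled:]            # samples seen only by Network B
    rows_a = np.concatenate([shared, only_a])
    rows_b = np.concatenate([shared, only_b])

    # Give each network a different (disjoint) subset of the features.
    feats = rng.permutation(n_features)
    feats_a, feats_b = feats[: n_features // 2], feats[n_features // 2:]

    net_a = (X[np.ix_(rows_a, feats_a)], y[rows_a])  # labeled data
    net_b = X[np.ix_(rows_b, feats_b)]               # labels withheld
    return net_a, net_b, n_overlap
```

For example, `split_networks(X, y, n_labeled=9577, n_unlabeled=2000)` would mirror the sample counts of CASE 1, provided `X` contains enough rows.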

In this setup, we consider as the baseline the state-of-the-art unsupervised deep learning (UDL) model, which clusters the unlabeled data into normal and attack behaviors based on autoencoder and k-means techniques [9]. In addition, we consider a second solution that uses both the labeled and unlabeled datasets to feed the FTL models. The FTL exchanges knowledge between the supervised learning model and the unsupervised learning model to improve the learning accuracy as well as the precision in identifying attack and normal behaviors in the unlabeled data. Then, we measure the AUC of this process over 30 runs to calculate the significant number of the AUC series for both solutions. Finally, we plot the reconstruction errors to analyze the convergence of the FTL algorithm for all datasets.
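The clustering step of the UDL baseline can be illustrated with a minimal two-cluster k-means (Lloyd's algorithm) sketch. It assumes that the latent codes `z` have already been produced by an autoencoder, which is not shown, and it is not the implementation used in the experiments; the function name `kmeans_two_clusters` and the toy data are hypothetical.

```python
import numpy as np


def kmeans_two_clusters(z, n_iter=50, seed=0):
    """Cluster latent codes z (n_samples x n_dims) into two groups."""
    rng = np.random.default_rng(seed)
    # Initialize the two centers from two distinct samples.
    centers = z[rng.choice(len(z), size=2, replace=False)].astype(float)
    labels = np.zeros(len(z), dtype=int)
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        dist = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its assigned samples.
        new_centers = centers.copy()
        for k in range(2):
            if np.any(labels == k):
                new_centers[k] = z[labels == k].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy latent codes: two well-separated groups along the first dimension.
z = np.array([[0.1 * i, 0.0] for i in range(10)]
             + [[10.0 + 0.1 * i, 0.0] for i in range(10)])
labels, centers = kmeans_two_clusters(z)
```

K-means only returns cluster indices; deciding which cluster corresponds to "attack" and which to "normal" requires an additional rule that is outside this sketch.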

V-C Experimental Results

In this section, we show the results of our experiments with different kinds of cybersecurity datasets.

V-C1 Accuracy Comparison

Dataset FTL UDL
IoT1 85.771 45.753
IoT2 83.795 63.171
IoT3 94.286 80.453
IoT4 79.241 77.885
IoT5 90.605 81.876
IoT6 91.179 82.703
IoT7 90.670 85.183
IoT8 82.960 65.256
IoT9 83.222 73.072
KDD 99.315 80.477
NSLKDD 98.485 83.025
UNSW 97.072 68.449
(a) The results with p_value=1.
Dataset FTL UDL
IoT1 87.398 49.770
IoT2 85.672 65.793
IoT3 94.896 81.070
IoT4 81.672 77.885
IoT5 91.517 82.013
IoT6 92.059 82.703
IoT7 92.030 86.013
IoT8 85.197 68.161
IoT9 85.072 73.078
KDD 99.395 81.304
NSLKDD 98.534 83.450
UNSW 97.141 69.124
(b) The results with p_value=3.
Dataset FTL UDL
IoT1 88.259 51.897
IoT2 86.666 67.181
IoT3 95.220 81.397
IoT4 82.959 77.885
IoT5 92.000 82.085
IoT6 92.525 82.703
IoT7 92.750 86.453
IoT8 86.381 69.700
IoT9 86.052 73.082
KDD 99.438 81.742
NSLKDD 98.561 83.675
UNSW 97.177 69.482
(c) The results with p_value=5.
TABLE II: The results with multiple datasets in CASE 1.
Dataset FTL UDL
IoT1 90.371 49.783
IoT2 68.193 62.591
IoT3 94.525 83.411
IoT4 87.050 77.725
IoT5 86.535 81.954
IoT6 87.214 82.555
IoT7 97.662 79.517
IoT8 84.609 52.702
IoT9 90.095 63.803
KDD 99.535 84.333
NSLKDD 98.858 81.164
UNSW 97.049 66.329
(a) The results with p_value=1.
Dataset FTL UDL
IoT1 91.497 54.079
IoT2 72.573 65.681
IoT3 95.073 83.565
IoT4 88.538 77.781
IoT5 88.150 82.160
IoT6 88.638 82.664
IoT7 97.928 81.400
IoT8 86.691 57.318
IoT9 90.959 65.559
KDD 99.562 84.423
NSLKDD 98.885 81.976
UNSW 97.121 66.901
(b) The results with p_value=3.
Dataset FTL UDL
IoT1 92.093 56.354
IoT2 74.892 67.317
IoT3 95.363 83.647
IoT4 89.326 77.811
IoT5 89.006 82.269
IoT6 89.392 82.721
IoT7 98.069 82.397
IoT8 87.793 59.763
IoT9 91.417 66.489
KDD 99.576 84.471
NSLKDD 98.900 82.406
UNSW 97.159 67.203
(c) The results with p_value=5.
TABLE III: The results with multiple datasets in CASE 2.

In this section, we compare the performance of FTL and the unsupervised deep learning (UDL) method in terms of the significant number of each, as explained in Section IV. Table II and Table III report the significant number of each dataset corresponding to the confidence of .

In general, Table II and Table III show that the significant numbers of all datasets increase as p_value increases. This is because in (12) we calculate the significant number based on a series of 30 consecutive AUC results. When p_value increases, the AUC results increase in all tables. This demonstrates that most of the AUC results in the 30 runs are higher than the significant number in the case when .

Table II(c) shows the significant numbers of the participating datasets with p_value=5 in CASE 1. In this table, the IoT1 and UNSW datasets show significant gaps of about 30% and 40% between FTL and UDL. These results show the difficulty that clustering has in recognizing the groups of samples and the advantage of collaborative learning on these datasets. The other ten datasets have gaps of around 10-20% between the two methods, which demonstrates the stability of our proposed solution across the studied cybersecurity datasets.
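The per-dataset gaps can be recomputed directly from the values reported for CASE 1 with p_value=5. In the short sketch below, the numbers are copied from that sub-table; treating the first value of each row as FTL and the second as UDL is an assumption based on the surrounding discussion.

```python
# Significant numbers for CASE 1 with p_value=5.
# Assumption: the first value in each pair is FTL, the second is UDL.
table_case1_p5 = {
    "IoT1": (88.259, 51.897),
    "IoT2": (86.666, 67.181),
    "IoT3": (95.220, 81.397),
    "IoT4": (82.959, 77.885),
    "IoT5": (92.000, 82.085),
    "IoT6": (92.525, 82.703),
    "IoT7": (92.750, 86.453),
    "IoT8": (86.381, 69.700),
    "IoT9": (86.052, 73.082),
    "KDD": (99.438, 81.742),
    "NSLKDD": (98.561, 83.675),
    "UNSW": (97.177, 69.482),
}

# Gap between the two methods, in percentage points.
gaps = {name: round(ftl - udl, 3) for name, (ftl, udl) in table_case1_p5.items()}

# Print the datasets from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:7s} gap = {gap:6.3f}")
```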

In addition, Table III(c) shows the significant numbers of multiple datasets with p_value=5 in CASE 2. In this table, the significant numbers also show a gap of around 10-40% between the two solutions. The common trend is that the significant numbers increase when the number of samples increases. However, for the IoT2, IoT5, and IoT6 datasets, the significant numbers slightly decrease because of the randomly selected samples from the original dataset. This is also reflected in the high fluctuation of the reconstruction errors of the IoT2, IoT5, and IoT6 datasets in Fig. 7(b) compared with other datasets. Nevertheless, on all studied datasets, our proposed solution still performs much better than the state-of-the-art UDL solution. These results demonstrate that our solution can work efficiently on both IoT and conventional cybersecurity datasets in detecting cyberattacks in the network.

V-C2 Reconstruction Error Analysis

In this section, we discuss the convergence of the FTL algorithm in each dataset. Fig. 6 describes the reconstruction errors of the nine IoT datasets and the conventional datasets like KDD, NSLKDD and UNSW in CASE 1. Fig. 7 describes the reconstruction errors of study datasets in CASE 2.

In Fig. 6(a) and Fig. 7(a), we can see that in the first few epochs, the errors are very high for KDD (up to in CASE 1 and in CASE 2), but this error dramatically reduces to in CASE 1 and in CASE 2 after only epochs. NSLKDD and UNSW show very similar trends, starting at in CASE 1 and in CASE 2 and gradually reducing to in CASE 1 and in CASE 2 after epochs, respectively. After 200 epochs, the algorithm has converged, as shown by the flattening of all the reconstruction error curves.

Fig. 6(b) and Fig. 7(b) show the reconstruction errors of the nine IoT datasets in CASE 1 and CASE 2, respectively. We can observe the same trend over all datasets, i.e., all errors gradually reduce as the number of epochs increases. However, the trend exhibits some fluctuation in comparison with the trends in Fig. 6(a) and Fig. 7(a) because of the heterogeneous distribution of the IoT datasets. The high fluctuation of the reconstruction errors of the IoT2, IoT5, and IoT6 datasets in Fig. 7(b) also explains why their significant numbers decrease when the number of samples increases in CASE 2. Nevertheless, the reconstruction errors of all studied datasets in our proposed solution dramatically reduce and reach stability after running epochs in both cases.

(a) The reconstruction errors of KDD, NSLKDD and UNSW datasets
(b) The reconstruction errors of IoT datasets
Fig. 6: The reconstruction errors in CASE 1.
(a) The reconstruction errors of KDD, NSLKDD and UNSW datasets
(b) The reconstruction errors of IoT datasets
Fig. 7: The reconstruction errors in CASE 2.

VI Conclusion

In this work, we have proposed a novel collaborative learning framework to address the limitations of current ML-based cyberattack detection systems in IoT networks. In particular, by extracting and transferring knowledge from a network with abundant labeled data (the source network), the intrusion detection performance of the target network can be significantly improved (even if the target has very little labeled data). More importantly, unlike most current works in this area, our proposed framework enables the source network to transfer knowledge to the target network even when they have different data structures, e.g., different features. The experimental results show that the prediction accuracy of our proposed framework is significantly improved in comparison with the state-of-the-art unsupervised deep learning model. In addition, the convergence of the proposed collaborative learning model has been analyzed on various cybersecurity datasets. In future work, we will consider using other effective transfer learning techniques to make the transfer learning process more stable and achieve better performance, especially when the amount of mutual information is very limited.

VII Acknowledgement

This work is the output of the ASEAN IVO project “Cyber-Attack Detection and Information Security for Industry 4.0” and is financially supported by NICT.

This work was supported in part by the Joint Technology and Innovation Research Centre – a partnership between the University of Technology Sydney and the University of Engineering and Technology, Vietnam National University, Hanoi.


  • [1] Note: Cited by: §I, §II-A, §V-A, §V-A.
  • [2] A. Abeshu and N. Chilamkurti (2018) Deep learning: the frontier for distributed attack detection in fog-to-things computing. IEEE Communications Magazine 56 (2), pp. 169–175. Cited by: §II-B.
  • [3] M. Aledhari, R. Razzak, R. M. Parizi, and F. Saeed (2020) Federated learning: a survey on enabling technologies, protocols, and applications. IEEE Access 8, pp. 140699–140725. Cited by: §I.
  • [4] Y. Chen, X. Qin, J. Wang, C. Yu, and W. Gao (2020) Fedhealth: a federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems 35 (4), pp. 83–93. Cited by: §II-B.
  • [5] Y. S. Dabbagh and W. Saad (2019-Apr.) Authentication of wireless devices in the internet of things: learning and environmental effects. IEEE Internet of Things Journal 6 (4), pp. 6692–6705. Cited by: §II-C.
  • [6] S. Dua and X. Du (2016) Data mining and machine learning in cybersecurity. Auerbach Publications. Cited by: §I.
  • [7] Y. Fan, Y. Li, M. Zhan, H. Cui, and Y. Zhang (2020-12) Iotdefender: a federated transfer learning intrusion detection framework for 5g iot. In 2020 IEEE 14th International Conference on Big Data Science and Engineering (BigDataSE), pp. 88–95. Cited by: §III-C.
  • [8] T. Fawcett (2006-06) An introduction to ROC analysis. Pattern recognition letters 27 (8), pp. 861–874. Cited by: §IV-C.
  • [9] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §III-A, §III-A, §V-B.
  • [10] A. Javaid, Q. Niyaz, W. Sun, and M. Alam (2016) A deep learning approach for network intrusion detection system. In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), pp. 21–26. Cited by: §II-A.
  • [11] Y. Keshet (2020-03) Half of the malware detected in 2019 was classified as zero-day threats, making it the most common malware to date. External Links: Link Cited by: §I.
  • [12] T. V. Khoa, Y. M. Saputra, D. T. Hoang, N. L. Trung, D. Nguyen, N. V. Ha, and E. Dutkiewicz (2020) Collaborative learning model for cyberattack detection systems in iot industry 4.0. In 2020 IEEE Wireless Communications and Networking Conference (WCNC), Vol. , pp. 1–6. Cited by: §III-A, §IV-A.
  • [13] B. Li, Y. Wu, J. Song, R. Lu, T. Li, and L. Zhao (2020-09) DeepFed: federated deep learning for intrusion detection in industrial cyber-physical systems. IEEE Transactions on Industrial Informatics. Cited by: §II-B.
  • [14] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y. Liang, Q. Yang, D. Niyato, and C. Miao (2020) Federated learning in mobile edge networks: a comprehensive survey. IEEE Communications Surveys & Tutorials 22 (3), pp. 2031–2063. Cited by: §I, §I.
  • [15] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi (2017) A survey of deep neural network architectures and their applications. Neurocomputing 234, pp. 11–26. Cited by: §III-A.
  • [16] Y. Liu, T. Chen, and Q. Yang (2018) Secure federated transfer learning. arXiv preprint arXiv:1812.03337. Cited by: §III-A, §III-B.
  • [17] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §III-B.
  • [18] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici (2018-07) N-BaIoT-Network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Computing 17 (3), pp. 12–22. Cited by: §I, §V-A.
  • [19] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai (2018) Kitsune: an ensemble of autoencoders for online network intrusion detection. arXiv preprint arXiv:1802.09089. Cited by: §I, §V-A.
  • [20] S. Morgan (2021-03) Global ransomware damage costs predicted to hit $11.5 billion by 2019. External Links: Link Cited by: §I.
  • [21] N. Moustafa and J. Slay (2015-11) UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pp. 1–6. External Links: Document Cited by: §I, §II-A, §V-A, §V-A.
  • [22] K. E. Mwangi, S. Masupe, and J. Mandu (2019-Oct.) Transfer learning for internet of things malware analysis. In International Conference on Information, Communication and Computing Technology, pp. 198–208. Cited by: §III-C.
  • [23] C. T. Nguyen, N. Van Huynh, N. H. Chu, Y. M. Saputra, D. T. Hoang, D. N. Nguyen, Q. Pham, D. Niyato, E. Dutkiewicz, and W. Hwang (2021) Transfer learning for future wireless networks: a comprehensive survey. arXiv preprint arXiv:2102.07572. Cited by: §I, §III-C, §III-C.
  • [24] K. K. Nguyen, D. T. Hoang, D. Niyato, P. Wang, D. Nguyen, and E. Dutkiewicz (2018-04) Cyberattack detection in mobile cloud computing: A deep learning approach. In 2018 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6. External Links: Document Cited by: §II-A.
  • [25] T. D. Nguyen, S. Marchal, M. Miettinen, H. Fereidooni, N. Asokan, and A. Sadeghi (2019-07) DÏot: a federated self-learning anomaly detection system for iot. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 756–767. Cited by: §II-B.
  • [26] S. Niu, Y. Liu, J. Wang, and H. Song (2020) A decade survey of transfer learning (2010–2020). IEEE Transactions on Artificial Intelligence 1 (2), pp. 151–166. Cited by: §I, §III-C.
  • [27] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §III-C, §III-C.
  • [28] D. M. Powers (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, pp. 37–63. Cited by: §IV-C.
  • [29] V. Rey, P. M. S. Sánchez, A. H. Celdrán, G. Bovet, and M. Jaggi (2021) Federated learning for malware detection in iot devices. arXiv preprint arXiv:2104.09994. Cited by: §II-B.
  • [30] W. Schneble and G. Thamilarasu (2019) Attack detection using federated learning in medical cyber-physical systems. In 28th International Conference on Computer Communications and Networks (ICCCN), pp. 1–8. Cited by: §II-B.
  • [31] Y. Sharaf-Dabbagh and W. Saad (2015-Aug.) Transfer learning for device fingerprinting with application to cognitive radio networks. In 2015 IEEE 26th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), pp. 2138–2142. Cited by: §II-C, §II-C.
  • [32] Y. Sharaf-Dabbagh and W. Saad (2016-06) On the authentication of devices in the internet of things. In 2016 IEEE 17th International Symposium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), pp. 1–3. Cited by: §II-C.
  • [33] University of New Brunswick. Note: Cited by: §I, §II-A, §V-A, §V-A.
  • [34] R. Vinayakumar, M. Alazab, K. Soman, P. Poornachandran, A. Al-Nemrat, and S. Venkatraman (2019) Deep learning approach for intelligent intrusion detection system. IEEE Access 7, pp. 41525–41550. Cited by: §II-A.
  • [35] L. Vu, Q. U. Nguyen, D. N. Nguyen, D. T. Hoang, and E. Dutkiewicz (2020-06) Deep transfer learning for IoT attack detection. IEEE Access 8, pp. 107335–107344. Cited by: §II-C, §II-C, §III-A, §III-C.
  • [36] Y. Wang, H. Yao, and S. Zhao (2016) Auto-encoder based dimensionality reduction. Neurocomputing 184, pp. 232–242. Cited by: §III-A.
  • [37] T. Wen and R. Keyes (2019-05) Time series anomaly detection using convolutional neural networks and transfer learning. arXiv preprint arXiv:1905.13628. Cited by: §II-C, §II-C.
  • [38] Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19. Cited by: §III-B.
  • [39] C. Zhao, Z. Cai, M. Huang, M. Shi, X. Du, and M. Guizani (2018-Mar.) The identification of secular variation in IoT based on transfer learning. In 2018 International Conference on Computing, Networking and Communications (ICNC), pp. 878–882. Cited by: §II-C, §II-C.
  • [40] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He (2020) A comprehensive survey on transfer learning. Proceedings of the IEEE 109 (1), pp. 43–76. Cited by: §I.