Networked computer systems have increasingly become critical infrastructure supporting a wide range of services in the economy, education, and government sectors. These sectors face severe challenges and threats, such as human error, equipment failure, deliberate attacks, natural disasters, and economic crises. From a time perspective, it is impossible to respond to all unknown threats in advance. From a space perspective, it is difficult to relocate or rebuild network infrastructure before or after a disaster occurs. Therefore, rather than providing absolute security, the network system should be able to resist internal and external threats and sustain normal services and tasks.
The word “resilience” comes from the Latin word “resilio”, which literally means “to bounce back”, referring to a system’s ability to return to normal condition after challenging or destructive events. This broad definition applies to fields as diverse as ecology, materials science, psychology, economics, and engineering. Researchers in multiple disciplines have proposed a variety of definitions of resilience. For example, Pregenzer defined resilience as the “measure of a system’s ability to absorb continuous and unpredictable change and still maintain its vital functions.” Woods described resilience as the system’s ability to create foresight, identify risks, and mitigate them before adverse consequences occur. Haimes defined resilience as “the ability of the system to withstand a major disruption within acceptable degradation parameters and to recover with an acceptable cost and suitable time.” The National Academy of Sciences (NAS) defines resilience as “the ability to plan and prepare for, absorb, recover from disasters and more successfully adapt to adverse events”, and resilience is becoming one of the most widely used attributes in various organizations and governments.
The literature also discusses the confusion between resilience and other system attributes, such as robustness, vulnerability, and reliability. Robustness is usually defined as insensitivity to uncertain disturbances. Uday and Marais added that the purpose of robustness is to immediately minimize performance loss after a disturbance; in contrast, resilience allows for some performance loss in the hope that performance can be restored over time. Vulnerability focuses on susceptibility to known disturbances, which both attacker and defender can identify in advance. Reliability refers to the ability of the system and its components to accomplish required functions within a specified time under stated conditions. Unlike these attributes, resilience places greater emphasis on recovery and on the evolutionary ability to resist unknown future threats.
Diverse network attacks and threats keep emerging, and network security defenses are also moving toward proactive defense, such as the Moving Target Defense. It has therefore become an urgent problem to evaluate and improve network resilience reasonably and effectively in various attack and defense scenarios. In general, research on resilience has been performed from three aspects. First, measurement and evaluation research is the first step in studying network resilience, and includes failure models, measurement indicators, and aggregation models. Second, optimization and improvement strategies for network resilience are developed for specific network scenarios based on entity-related analysis. Third, resilience-based research focuses on the trade-off between network performance and the resources invested in practical networks, such as transportation networks and power supply networks. However, existing research on network resilience lacks general measurement methods and standards suitable for different network scenarios, and most methods evaluate resilience only on time-invariant networks, which cannot reflect the dynamic characteristics of real-world networks. In this paper, a quantitative framework for network resilience evaluation is established using the Dynamic Bayesian Network, based on modeling five core resilience capabilities. The proposed framework is suitable for time-varying networks and can describe the process of network resilience, including preparation, resistance, adaptation, recovery, and evolution.
In this paper, our contributions can be summarized as follows:
1) We propose a general resilience evaluation framework to describe the multi-stage resilience transformation process of a network, including preparation, resistance, adaptation, recovery, and evolution.
2) We define and describe five core capabilities of network resilience based on several basic network measurement indicators and the dynamic abilities of network components.
3) We use the Dynamic Bayesian Network to describe the time-varying resilience of a network and combine it with the multi-stage network resilience transformation process to evaluate network resilience.
The rest of this paper is structured as follows. In section II, we give the definition of network resilience and describe its time-varying process. In section III, we propose the evaluation framework and the core capability indicators. In section IV, the time-varying resilience process is defined and modeled based on the Dynamic Bayesian Network. In section V, we use two groups of experiments to verify the rationality and performance of our proposed method. Finally, we review related work on network resilience evaluation in section VI and give a brief conclusion and directions for future studies in section VII.
II Description of Network Resilience
This section provides the definition of network resilience and the resilience evaluation framework, which describes the multi-stage resilience transformation process and the corresponding resilient abilities. The evaluation framework serves as the methodological background for the measurement of the core capabilities of network resilience in section III and the Dynamic Bayesian Network modeling in section IV.
II-A Network Resilience Definition
There is no universally accepted definition of resilience for networks. However, the literature agrees on several key aspects of network resilience evaluation: the probability of disruption, the impact of those disruptions, and the behavior of recovering from a disruption to the normal state. In this paper, we enrich and complement the definition of network resilience based on the NAS version. A resilient network should have the ability to resist, that is, to inhibit or mitigate internal and external disruptions and network attacks; the ability to adapt to disruption by adjusting network structure and functional elements; the ability to recover from a low-performance state to the normal running state; and the ability to evolve to a more stable state by intelligently reallocating network resources. When facing future adverse network events, networks with these resilient abilities will respond faster and adopt better strategies that minimize network damage.
II-B Time-varying Process of Network Resilience
As illustrated in Fig.1, a complete change process of network resilience goes through several time-varying stages: normal network fluctuation (preparation), resistance, adaptation, recovery, and evolution. A quantifiable and time-dependent system performance function is the basis for evaluating the resilience of the network system. The curve in Fig.1 wavers around a normal value because of normal network fluctuations. The network operates at this performance level until it is disrupted or attacked. The network system’s risk-aware component then distinguishes disruptions from normal performance fluctuations and initiates resistance measures to prevent further degradation of network performance. If the resistance stage is successful, the network returns to the normal operation stage. However, once the performance function declines below a critical level, the network system can no longer provide stable service performance and cannot adjust itself back to the normal operation stage. Then, during the adaptation stage, the network performance drops to its lowest level under the adaptation strategies, which include adjusting the network structure and configuration. After that, the network system launches recovery policies to improve performance until it returns to the normal level at a later time. Once the network performance has returned to the normal-level threshold, the historical information from the destruction and recovery stages can be used to optimize network configuration strategies and improve network performance. As shown in Fig.1, the area between the resilience curve and the dotted line, also called the resilience triangle, is the total loss caused by the disruption.
The network resilience at time t describes the instantaneous resilience performance at that time, normalized by the expected network functionality had the network not been affected by disruption. The cumulative network resilience over a period of time describes the total loss of network resilience over that period. The two quantities are given by Equations (1) and (2), respectively.
| Notation | Description | Notation | Description |
|---|---|---|---|
| RRC | Rapid response capability | | Betweenness of node i |
| SRC | Sustained resistance capability | | Betweenness of edge (i, j) |
| CRC | Continuous running capability | | The bandwidth of edge (i, j) at time t |
| RCC | Rapid convergence capability | | The RTT between edge (i, j) at time t |
| DEC | Dynamic evolution capability | | The degree of node i |
| MAI | The maximum of the total four abilities of each node | FR(G) | Flow robustness of network |
| G(N, E) | Network with a set of nodes N connected by a set of edges E | R(G) | Effective graph resistance |
| T | Standard time interval | | Structure entropy of network |
| | The observation ability of node i at time t | | The ratio of node i degree to total node degree |
| | The control ability of node i at time t | | Network performance of preparing at time t |
| | The decision ability of node i at time t | | Network performance of resisting at time t |
| | The action ability of node i at time t | | Network performance of adapting at time t |
| | Likelihood of disruption on node i | | Network performance of recovering at time t |
| | Likelihood of disruption on edge (i, j) | | Network performance of evolving at time t |
| | Repair rate function of node i | | The time at which destruction occurs |
| | Repair rate function of edge (i, j) | | The number of destroyed nodes |
| | Criticality of node i on the network | | The number of recovered nodes at each recovery step |
| | Criticality of edge (i, j) on the network | | The probability of an effective attack |
| | Network criticality | | The recovery probability of nodes and edges |
Graphically, the cumulative resilience in formula (2) covers the whole resilience process, and its value lies in the range [0, 1]: it is the ratio of the shaded area below the curve to the total shaded rectangular area in Fig.1. The value is influenced by the duration of the resilience process and the minimum performance threshold. Other conditions being equal, a slower descent of the curve and a higher minimum performance threshold result in a larger value, meaning that the network is more resilient.
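The area-ratio reading of formula (2) can be illustrated with a short sketch. It assumes that the performance curve is available as discrete samples, that the normal performance level is known, and that the trapezoid rule is an acceptable approximation of the area; the function names and sample values are hypothetical and are not taken from the paper.

```python
def instantaneous_resilience(performance, normal_level):
    """Performance at one instant, normalized by the undisturbed level."""
    return performance / normal_level


def cumulative_resilience(times, performance, normal_level):
    """Ratio of the area under the sampled performance curve to the area of
    the rectangle normal_level x (t_end - t_start).

    This is an assumed reading of formula (2), not the paper's exact form.
    """
    if len(times) != len(performance) or len(times) < 2:
        raise ValueError("need at least two aligned samples")
    area = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        # Trapezoid rule between two consecutive samples.
        area += 0.5 * (performance[k] + performance[k - 1]) * dt
    rectangle = normal_level * (times[-1] - times[0])
    return area / rectangle


# Hypothetical curve: performance dips to 0.5 at t = 2 and then recovers.
samples_t = [0, 1, 2, 3, 4]
samples_p = [1.0, 1.0, 0.5, 1.0, 1.0]
value = cumulative_resilience(samples_t, samples_p, normal_level=1.0)
```

Under this reading, a deeper or longer dip removes more area from the rectangle and therefore lowers the value.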
II-C Factors of Network Resilience
The computer network is a system in which multiple subsystems with independent functions are connected by communication links; these links are managed by a network operating system and protocol software to realize data communication and network resource sharing. The subsystems can be abstracted as network nodes with different capabilities. For example, an Intrusion Detection System can be regarded as a node with observation capacity, and a controller cluster can be considered a node with control capacity. Moreover, a data center has powerful decision capacity, and massive numbers of network terminal devices have specific action capacity. By analyzing the functional characteristics of actual network units, and inspired by the “OODA” loop, we define four kinds of variable node capacities: observation capacity, control capacity, decision capacity, and action capacity. Additionally, network links have adjustable bandwidth and delay, which also reflects actual network systems. These dynamic properties provide the basic motivation for network resilience evaluation and constitute the internal perspective of network resilience. Following numerous quantitative studies in the literature, this paper considers both the internal and the external perspective of the network system. It is precisely the specific functions of the internal network elements that produce the network’s resilience performance from the external perspective, which includes preparation, resistance, adaptation, recovery, and evolution.
The four node capacities and two edge capacities above have significant effects on the network resilience derived from the proposed framework. Once the network system is attacked, dynamically and intelligently tuning these basic capacities can help achieve network resilience. The starting point of the proposed framework is a computer network system in a cyberattack environment. If researchers want to apply the evaluation model to real-world networks such as smart power grids and transportation networks, they first need to map the capacities appropriately onto the nodes and edges. For example, in a smart power grid, the intrusion detection equipment is a node with observation capacity; the safety isolation device is a node with action capacity; the computing platform is a node with decision capacity; and the monitoring center is a node with control capacity. These capacities can also be adjusted over time according to the real-time requirements of the network system. Applying the framework to actual networks requires further clarification and functional refinement of the network elements.
III Framework and Methodology
This section begins by introducing the network’s basic attributes, including graph spectral metrics and time-varying capacities of network elements. Based on these attributes, we build the measurement of the five core capabilities of the resilience process mentioned in section II. Next, a general modeling framework for resilience evaluation based on the Dynamic Bayesian Network is established to quantify resilience in a time-varying network. The evaluation framework of network resilience is illustrated in Fig.2. The notation table defines the network attributes and the probability parameters of network components used during evaluation.
III-A Basic Networking Attributes
The mathematical model for network resilience evaluation concerns a basic network G(N, E), which comprises a set of nodes N connected by a set of edges or arcs E. The relationship between nodes and edges in a network can be described visually by a graph. Several graph spectral metrics, such as algebraic connectivity, natural connectivity, and flow robustness, are generally employed to measure the robustness and resilience of a network. The graph’s topology can be represented by its adjacency matrix and Laplacian matrix, and the eigenvalues of the Laplacian matrix are used in the metrics below. All the variables defined for the network are normalized to values in the interval [0,1] when calculating the following equations.
Flow robustness, denoted as FR(G), is a graph metric that measures the ratio of the number of available flows to the number of total flows in the network. A flow is considered available if at least one of its paths remains reachable after link or node failures. The number of total flows is the maximum number of flows in the network; for example, a connected network with n nodes has n(n-1)/2 flows between all node pairs. The flow robustness ranges between 0 and 1, where 1 means that the graph is connected, so that every pair of nodes can communicate, and 0 means that no nodes can communicate with each other in the whole network. The set of connected subgraphs of a given network G can be computed with a union-find (disjoint-set) structure in nearly linear time. Union-find is a tree-shaped data structure for merging and searching disjoint sets, and each operation takes nearly constant amortized time. It therefore does not consume many computing resources when calculating the connected subgraphs, even in a large-scale network. The flow robustness can be calculated as Equation (3).
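The union-find computation described above can be sketched as follows. The sketch assumes that Equation (3) counts, for each connected component, the node pairs inside it and divides by the number of all node pairs; the function name and the integer node labels 0..n-1 are illustrative assumptions.

```python
def flow_robustness(num_nodes, edges):
    """Fraction of node pairs that can still reach each other.

    Assumed form: sum over components of |C|(|C|-1)/2, divided by n(n-1)/2.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue
        # Union by size: attach the smaller tree under the larger one.
        if size[root_u] < size[root_v]:
            root_u, root_v = root_v, root_u
        parent[root_v] = root_u
        size[root_u] += size[root_v]

    if num_nodes < 2:
        return 0.0  # no node pairs exist, so no flows can be counted
    roots = {find(x) for x in range(num_nodes)}
    available = sum(size[r] * (size[r] - 1) // 2 for r in roots)
    total = num_nodes * (num_nodes - 1) // 2
    return available / total


# Two separate links among four nodes: 2 of the 6 node pairs are connected.
fr_split = flow_robustness(4, [(0, 1), (2, 3)])
```

Failed nodes can be handled by dropping their incident edges before the call, in which case they remain as isolated components that contribute no flows.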
Effective graph resistance, denoted as R(G), is a graph metric that measures the network’s resistance against the destruction of nodes or edges. The normalized R(G) is calculated as Equation (4) from the non-zero eigenvalues of the given graph’s Laplacian matrix.
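As a rough illustration, the sketch below computes the effective graph resistance in its commonly used form, n times the sum of the reciprocals of the non-zero Laplacian eigenvalues, without an eigenvalue solver: for a connected graph that sum equals the trace of the inverse of (L + J/n) minus 1, where J is the all-ones matrix. The normalization (n - 1) / R(G), which equals 1 for a complete graph, is an assumption and may differ from Equation (4). Function names are hypothetical, and the graph is assumed to be simple and connected.

```python
def effective_graph_resistance(num_nodes, edges):
    """n * sum(1 / lambda_i) over non-zero Laplacian eigenvalues (connected graph)."""
    n = num_nodes
    # Build M = L + J/n; its inverse differs from the Laplacian pseudo-inverse by J/n.
    m = [[1.0 / n] * n for _ in range(n)]
    for u, v in edges:
        m[u][u] += 1.0
        m[v][v] += 1.0
        m[u][v] -= 1.0
        m[v][u] -= 1.0
    # Gauss-Jordan inversion with partial pivoting on the augmented matrix [M | I].
    aug = [row + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("graph must be connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    trace_pinv = sum(aug[i][n + i] for i in range(n)) - 1.0
    return n * trace_pinv


def normalized_graph_resistance(num_nodes, edges):
    """Assumed normalization: (n - 1) / R(G), which is 1 for a complete graph."""
    return (num_nodes - 1) / effective_graph_resistance(num_nodes, edges)


# Path graph on three nodes: Laplacian eigenvalues 0, 1, 3, so R(G) = 3 * (1 + 1/3) = 4.
path_resistance = effective_graph_resistance(3, [(0, 1), (1, 2)])
```

A denser topology has a lower effective graph resistance and therefore a normalized value closer to 1 under this assumption.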
Besides the graph spectral metrics, the time-varying attributes of network elements mentioned in section II-C are also crucial for modeling the core capabilities of network resilience. With the development of Software Defined Network (SDN) and Network Function Virtualization (NFV) technologies, networks have become more dynamically reconfigurable and programmable. Moreover, network delay and bandwidth can be uniformly scheduled by routers with programmable kernels, which makes the network modifiable and controllable. As described in section II-C, different network subsystems and elements have different networking capacities; wide deployment of SDN and NFV makes these capacities easier to manage, so the capacities of network elements can be dynamically adjusted over time according to the needs of application scenarios. To reflect the dynamic characteristics of network elements when measuring network resilience, we simplify and refine the key time-varying capacities of nodes and edges in this paper. Each node in the network has four kinds of capacities: observation capacity, control capacity, decision capacity, and action capacity. They can be adjusted over time, but their sum stays within a certain range, because the resources and computing power of nodes are limited in a real network. We define the maximum of the total of the four capacities of each node as MAI. The capacity of each node is described by Equation (5).
Meanwhile, the edges’ capacities in the network are measured by the maximum bandwidth, real-time bandwidth, and RTT latency, each of which is defined as a matrix built on the adjacency matrix of the network: a maximum bandwidth matrix, a real-time bandwidth matrix, and an RTT latency matrix.
III-B Core Capabilities of Resilience
In our previous definition, the resilient network should be provided with the resisting ability to mitigate or prevent internal and external disruptions and networking attacks, the ability to adapt against disruption by adjusting network structure and functional elements, the ability to recover from low performance state to normal running state, and the ability to evolve to a more stable state by intelligently reallocating network resources.
Meanwhile, a complete network resilience process should go through the following stages: preparing, resisting, adapting, recovering, and evolving. There are subtle differences between these two perspectives. The former treats resilience as an overall network capability, similar to the CIA principles in information security systems. The latter, however, considers resilience as a manifestation or performance of the network during operation. Fundamentally, it is the network’s resilience capability that produces its resilience performance. Therefore, five core capabilities of resilience are defined and modeled below to describe the resilience capability more specifically. It should be noted that network resilience is defined over a time-varying dimension; therefore, the following five capabilities are transient capabilities, whose values can vary over time.
III-B1 Rapid Response Capability
The rapid response capability (RRC) is defined as the system’s response speed and emergency capability against disturbances or cyber-attacks. A system with better rapid response capability can take emergency rescue and recovery measures earlier. This capability is related to the perceptual or observing ability of network components (nodes) and the transmission ability of network links. We combine the abilities of nodes and links with graph theory. In a network of a given scale, a node with larger observing ability and more connected edges will have more observation capability, and a link with more bandwidth and less transmission delay will have more rapid response capability. Therefore, the network nodes’ observing ability is defined as the product of the observation ability and the degree distribution of network nodes. The links’ transmission ability is determined by the ratio of the edges’ betweenness centrality to the RTT delay between node pairs. The formula is given as Equation (6); the calculation considers only the numerical value of each parameter, so RRC has no physical unit.
III-B2 Sustained Resistance Capability
The sustained resistance capability (SRC) is defined as the system’s ability to prevent a rapid decline in network performance, which is related to resource redundancy, the robustness of the network topology, the autonomous intelligent management of regional networks, and the prevention of cascading failure propagation. The SRC directly affects the resistance duration and the lower limit of network performance. The formula of SRC is given as Equation (7). The numerator considers the average effective graph resistance from a graph theory point of view, and the denominator represents the harm caused by the destruction of network elements, to which the criticality and disruption likelihood of nodes and edges contribute together.
Here, the formula uses the effective graph resistance R(G) of network G, the degree of each node i, the disruption likelihoods of node i and edge (i, j), and the criticality of node i and edge (i, j) to the network. The node criticality is calculated as the product of the node’s betweenness and the sum of its observation and action abilities, and the edge criticality is calculated as the product of the edge’s betweenness and the sum of the network criticality of its two end nodes, as given in Equations (8) and (9).
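The two criticality definitions can be written out directly from the description above. The sketch takes betweenness values and node abilities as precomputed inputs normalized to [0, 1]; the function names, dictionary layout, and numbers are hypothetical.

```python
def node_criticality(node_betweenness, observation, action):
    """Node betweenness times the sum of observation and action abilities."""
    return {
        i: node_betweenness[i] * (observation[i] + action[i])
        for i in node_betweenness
    }


def edge_criticality(edge_betweenness, node_crit):
    """Edge betweenness times the summed criticality of the two end nodes."""
    return {
        (i, j): b * (node_crit[i] + node_crit[j])
        for (i, j), b in edge_betweenness.items()
    }


# Hypothetical two-node example with normalized inputs.
node_crit = node_criticality(
    node_betweenness={0: 0.5, 1: 1.0},
    observation={0: 0.25, 1: 0.5},
    action={0: 0.25, 1: 0.25},
)
edge_crit = edge_criticality({(0, 1): 0.5}, node_crit)
```

In this example node 1 is more critical than node 0 because it has both higher betweenness and higher combined ability.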
III-B3 Continuous Running Capability
The continuous running capability (CRC) is defined as the system’s ability to ensure the continuous operation of the network service during the low-efficiency phase. The network can lower its quality of service while still ensuring network security. If some forwarding nodes on the shortest path fail, the transmission task can still be completed by rerouting, at the cost of increased link transmission delay. The CRC is determined by the network’s flow robustness, the real-time bandwidth, and the edges’ criticality. A network with larger flow robustness has more rerouting options, and critical edges should be equipped with larger real-time bandwidth. The formula of CRC is given as Equation (10).
III-B4 Rapid Convergence Capability
The rapid convergence capability (RCC) is defined as the capability to ensure the network’s rapid convergence and restoration in the recovery stage, including network status monitoring and recovery strategy deployment, which involve adjusting node control ability, decision ability, and action ability. The purpose of building rapid convergence capability is to increase the recovery rate and shorten the recovery stage. The RCC is determined by the repair rates of nodes and edges, the nodes’ control, decision, and action abilities, and the edges’ real-time bandwidth and RTT delay. The formula of RCC is given as Equation (11). In the recovery stage, nodes with strong control and decision abilities have a more positive impact on the network through the control path, so the control and decision abilities are weighted by node betweenness. A node’s action ability directly affects its connected nodes, so the action ability is weighted by node degree. By contrast, an edge with larger betweenness, larger real-time bandwidth, and lower transmission delay contributes more to RCC. Note that all variables are normalized to the interval [0,1], so that they can be combined meaningfully in the equation.
III-B5 Dynamic Evolution Capability
The dynamic evolution capability (DEC) is defined as the system’s capability to continue to evolve and regenerate after recovering from damage. It involves deep learning and reasoning over historical destruction and recovery measures. This capability depends directly on the network’s structural entropy and the maximum capability of the network components. A network with high dynamic evolution capability will be more resistant to similar destruction in the future. A network with larger structural entropy has greater structural stability for dynamic evolution. Moreover, the various abilities of a node can be adjusted only within the maximum resources of that node, and the maximum bandwidth of the edges is also a key factor of dynamic evolution capability. In this paper, the DEC is calculated as the product of the network’s structural entropy, the maximum of the nodes’ abilities, and the maximum of the edges’ bandwidth. The formula is given as Equation (12).
Here, the network’s structural entropy is calculated as Equation (13) from the ratio of the degree of each node i to the total node degree. In addition, MAI and MBw represent the maximum of the nodes’ total four capacities and the maximum of the edges’ bandwidth, respectively.
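A minimal sketch of this product is given below. It assumes that the structural entropy takes the Shannon form, minus the sum of I_i ln I_i with I_i the degree ratio of node i, which is a common definition but is not confirmed by the surviving text of Equation (13). No normalization of the entropy to [0, 1] is applied here, and the function names and input values are hypothetical.

```python
import math


def structural_entropy(degrees):
    """Shannon-style entropy of the degree ratios (assumed form of Equation (13))."""
    total = sum(degrees)
    if total == 0:
        return 0.0
    entropy = 0.0
    for d in degrees:
        if d > 0:
            ratio = d / total  # ratio of node degree to total node degree
            entropy -= ratio * math.log(ratio)
    return entropy


def dynamic_evolution_capability(degrees, mai, max_bandwidth):
    """Product of structural entropy, MAI, and MBw, as described in the text."""
    return structural_entropy(degrees) * mai * max_bandwidth


# A ring of four nodes (all degrees equal) versus a star (one hub, three leaves).
ring_entropy = structural_entropy([2, 2, 2, 2])
star_entropy = structural_entropy([3, 1, 1, 1])
dec_ring = dynamic_evolution_capability([2, 2, 2, 2], mai=1.0, max_bandwidth=0.5)
```

Under this assumed form, the homogeneous ring has higher structural entropy than the star, which matches the statement that larger structural entropy indicates greater structural stability.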
IV Resilience Evaluation Model Based on DBN
On the basis of detailed description of the core resilience capabilities in the above section, an evaluation model suitable for the time-varying process on network resilience needs to be established. There are many modeling methods that can visually describe the cause-effect relationship in networks, which have been used to evaluate resilience, as outlined in related works. In particular, the Bayesian Network and the Dynamic Bayesian Network can capture the conditional independence between random variables, which will be more suitable for evaluating network resilience.
The Bayesian Network (BN), also known as the Belief Network or Causal Network, is a directed acyclic graphical model that describes the conditional probability relationships between data variables based on probabilistic inference theory. The nodes in a BN represent random variables, and the links between them represent conditional dependencies of the variables on their parent nodes, which are governed by conditional probability tables (CPT). However, a static BN cannot model a time-varying system. Therefore, the Dynamic Bayesian Network (DBN) was proposed based on the hidden Markov model to handle temporal systems. The DBN is also called a two-time-slice BN, because DBN modeling involves two consecutive time slices. The discrete time step between slices is usually set to 1. By dividing a time duration into a series of time slices, the DBN allows the attribute variable of a node at the current time slice to be conditionally dependent upon its parent nodes at the same time slice, as well as upon its parents and its own state at the previous time slice. The probability function of a node at a given time slice can be described mathematically as Equation (14), which involves the number of parent nodes of that node.
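The role of a conditional probability table can be illustrated with a small marginalization. The sketch makes several simplifying assumptions that are not taken from the paper: every node is binary (a "good" or "degraded" state), the parents are treated as independent when their probabilities are combined, and the CPT values are invented for illustration. It is not a reconstruction of Equation (14).

```python
from itertools import product


def node_marginal(cpt, parent_probs):
    """Probability that a node is in the 'good' state.

    cpt maps a tuple of parent states (True = good) to P(node good | parents);
    parent_probs lists each parent's probability of being good. Parents may
    come from the same time slice or from the previous one.
    """
    total = 0.0
    for states in product([True, False], repeat=len(parent_probs)):
        weight = 1.0
        for good, p in zip(states, parent_probs):
            weight *= p if good else 1.0 - p
        total += weight * cpt[states]
    return total


# Hypothetical CPT with two parents: the node's own state at the previous
# time slice and one parent at the current time slice.
example_cpt = {
    (True, True): 0.9,
    (True, False): 0.6,
    (False, True): 0.5,
    (False, False): 0.1,
}
p_good = node_marginal(example_cpt, [0.5, 0.5])
```

Repeating this update slice by slice, with each slice's output fed in as the "previous slice" input of the next, gives the time-varying behavior that a static BN cannot express.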
To develop a more detailed resilience evaluation model, resilience-related network characteristics rather than actual network nodes are modeled as the nodes of the DBN model. The network performance function is a crucial characteristic when evaluating network resilience, as introduced in section II-B. Existing literature rarely describes this function formally, or simply designates a single system attribute as its quantitative standard. In this paper, we establish a detailed quantitative indicator of network performance, which is determined by five time-dependent network performances: performance of preparing, performance of resisting, performance of adapting, performance of recovering, and performance of evolving. These five performances are directly influenced by the five core resilience capabilities described in section III-B. Meanwhile, the five performances also interact across time slices through conditional probabilities. For example, suppose network A and network B have the same sustained resistance capability (SRC) when suffering the same destruction, but the performance of preparing of network A at the previous time slice is lower than that of network B. It can then be expected that network B has a better resisting performance than network A, due to the effect of the previous moment.
The DBN is adopted to model and evaluate network resilience as follows, and the basic structure of the DBN is shown in Fig.3. First, each row of nodes represents the five resilience performances at the same time slice, where the five attribute nodes are linked by solid arcs. The solid arcs represent the conditional transition probability between a parent node and the node itself. Second, each column of nodes represents the time-varying states of one resilience performance. The dotted arcs between column nodes represent the temporal conditional transition probability of each resilience performance. Third, the diagonal dotted arcs represent the conditional transition probability between the parent node at the previous time slice and the node itself at the current time slice. For example, the performance of resisting, shown as the blue node in Fig.3, can be calculated as Equation (15).
The three conditional transition probabilities in Equation (15) are determined by the states of the three preceding nodes, respectively. Based on the evaluation of the five resilience performances, the network resilience performance can be calculated as a weighted sum of these five performances, as given in Equation (16). Finally, we can determine the network resilience by substituting the network resilience performance into Equations (1) and (2). Based on the modeling above, the procedure for evaluating network resilience is summarized in Algorithm 1.
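The weighted sum of the five performances can be sketched as below. The weights, the stage keys, and the decision to divide by the total weight (so that the result stays in [0, 1] when the inputs do) are assumptions made for illustration; the paper's actual weights are not stated in this section.

```python
STAGES = ("preparing", "resisting", "adapting", "recovering", "evolving")


def resilience_performance(performances, weights):
    """Weighted sum of the five stage performances at one time slice.

    Weights are divided by their total, which is an assumption rather than
    a detail taken from Equation (16).
    """
    if set(performances) != set(STAGES) or set(weights) != set(STAGES):
        raise ValueError("both inputs must cover exactly the five stages")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[s] * performances[s] for s in STAGES)
    return weighted / total_weight


# Hypothetical values for one time slice, with equal weights.
slice_performance = resilience_performance(
    {"preparing": 1.0, "resisting": 0.8, "adapting": 0.6,
     "recovering": 0.4, "evolving": 0.2},
    {stage: 1.0 for stage in STAGES},
)
```

Evaluating this function at every time slice yields the performance curve that is then substituted into Equations (1) and (2).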
V Numerical Experiment and Discussion
In this section, we conduct two experiments on three network topologies: an ER network, a BA network, and the real-world topology Geant2012 from The Internet Topology Zoo. In the first experiment, our proposed evaluation approach is compared with other network resilience measurements from existing research. In the second experiment, parameters are varied to observe the performance of the proposed evaluation model. The simulation experiments are implemented in Python 3.8, and each network model is encapsulated as a class. During a simulation, a network object is generated and its parameters are assigned. As the simulated attack and recovery proceed, the network parameters and the resilience performance of the network are calculated and recorded in CSV files.
To demonstrate the rationality and applicability of the proposed evaluation approach, an ER network, a BA network, and a real-world network are selected as the network topology datasets. In an ER network, each pair of nodes is connected randomly with a given probability, and the degree of any node follows a binomial distribution. The BA network is generated by preferential attachment: each newly added node is connected to existing nodes with a probability proportional to their degree, so high-degree nodes tend to attract more links. This pattern appears in many real-world networks. In our experiments, the Geant2012 topology (with 40 nodes and 61 edges) was selected from the Internet Topology Zoo as an example of a real-world network. The generation parameters of the ER and BA networks are fixed, and the size of each generated network is set to 100 to keep a scale similar to that of Geant2012.
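The two synthetic topologies can be generated with a few lines of standard-library Python. This is a minimal sketch, not the authors' simulation code: the function names are hypothetical, and the connection probability p = 0.05 and the number of links per new node m = 2 are illustrative values, since the paper's parameter settings are not recoverable from this section.

```python
import random


def er_graph(n, p, rng):
    """Erdos-Renyi graph: every node pair is linked independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def ba_graph(n, m, rng):
    """Barabasi-Albert graph: each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    edges = []
    targets = list(range(m))  # the first new node links to the m seed nodes
    degree_pool = []          # each node appears once per incident edge
    for new in range(m, n):
        edges.extend((t, new) for t in targets)
        degree_pool.extend(targets)
        degree_pool.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            # Sampling from the pool is sampling proportionally to degree.
            chosen.add(rng.choice(degree_pool))
        targets = list(chosen)
    return edges


rng = random.Random(0)  # fixed seed so that runs are repeatable
er_edges = er_graph(100, 0.05, rng)
ba_edges = ba_graph(100, 2, rng)
```

With these settings the BA graph always has (n - m) * m = 196 edges, while the number of ER edges varies with the seed.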
V-B Comparison among proposed and other approaches
Considerable research has evaluated network resilience in recent years. Xu et al. used the relative network size as the resilience performance , where is the size of the network's largest connected component during the recovery process. Alenazi et al. employed network flow robustness as the indicator of network resilience, described by formula (4) in Section III-A. To simplify the description and comparison in the following experiment, we name these two evaluation methods compare1 and compare2, respectively. The purpose of this experiment is to demonstrate that the proposed approach evaluates resilience more effectively than these methods. The experiment is performed under the following configurations and assumptions. Network destruction takes the form of node failure: once a node fails, its connected edges fail as well, and the failed nodes and edges are removed from the original network. Two types of simulated attack, random-based and centrality-based, are conducted. The random-based attack randomly deletes a given number of nodes and their connected edges from the original network graph. In contrast, the centrality-based attack removes the nodes with the largest degree together with their connected edges. The relevant experiment parameters are listed in Table I.
|5||The attack occurs at time step 5|
|The number of deleted nodes during destruction|
|The number of recovered nodes in each time step|
|1||The probability of effective attack|
|1||The probability of recovery of nodes and edges|
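The two attack patterns and the connectivity-based measure used by compare1 can be sketched as follows. This is a simplified illustration under stated assumptions: the graph is an adjacency dictionary, the helper names are hypothetical, degree is used as the centrality measure, and the relative size is normalized by the original node count.

```python
import random
from collections import deque


def largest_component_size(adj):
    """Size of the largest connected component of an undirected graph (BFS)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def remove_nodes(adj, victims):
    """Return a copy of the graph without the failed nodes and their edges."""
    victims = set(victims)
    return {u: {v for v in nbrs if v not in victims}
            for u, nbrs in adj.items() if u not in victims}


def random_attack(adj, k, rng):
    """Random-based attack: pick k distinct nodes uniformly at random."""
    return rng.sample(sorted(adj), k)


def centrality_attack(adj, k):
    """Centrality-based attack: pick the k largest-degree nodes (ties by node id)."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))[:k]


def relative_size(adj, original_n):
    """Largest connected component size divided by the original node count (assumed normalization)."""
    return largest_component_size(adj) / original_n


# Toy example: a star with one hub (node 0) and five leaves.
star = {0: set(range(1, 6)), **{i: {0} for i in range(1, 6)}}
hub_hit = centrality_attack(star, 1)
after_hub = remove_nodes(star, hub_hit)
after_random = remove_nodes(star, random_attack(star, 1, random.Random(0)))
```

In this toy graph, removing the hub leaves only isolated leaves, whereas removing a random leaf keeps the remaining nodes connected, which illustrates why degree-targeted attacks hurt hub-dominated topologies more.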
The results are illustrated in Fig.4. The horizontal axis represents the running time step, which records the network's transformation from destruction to recovery, and the vertical axis represents the network resilience performance as described in formula (17). The blue lines represent the proposed resilience evaluation under the two types of attacks in the three networks. The green and black lines represent the compare1 and compare2 evaluation methods, respectively. After the attack occurs at time step 5 in the BA and ER networks, the resilience performance declines due to the removal of nodes and edges. The performance of compare1 and compare2 then remains constant until the network launches its recovery strategies, whereas the performance of the proposed method continues to decline after the destruction. The reason for this difference is that the compare1 indicator depends only on network connectivity: it considers only the number of nodes in the largest connected subgraph and ignores the network links and capacities. Because compare2 (network flow robustness) takes the network's flow into account, it is relatively more accurate than compare1 and shows a smaller performance reduction. Both, however, record only the network's structural properties, which are relatively constant state values. The evaluation results of compare1 and compare2 therefore neglect many network attributes and functions and are far from consistent with the real network situation. In contrast, the proposed method considers both the static network connection states and the dynamic capacities of nodes and edges. Because the time-varying resilience performance evaluation network is constructed on the basis of the DBN, it reflects realistic network resilience in a more detailed and faithful manner.
Besides, as shown in Fig.4, the resilience performance of compare1 and compare2 drops sharply to very low levels under attack, because the real Internet topology of Geant2012 is less robust than the generated networks. The proposed approach, however, takes other key network factors into account and does not reduce the evaluated resilience to such a low level. Compared with the existing approaches, the proposed approach obtains higher resilience performance in Fig.4. In Fig.4(c), the final values of proposed, compare1, and compare2 are 0.95, 0.72, and 0.56, respectively. Compared with compare1 and compare2, the evaluation performance of the proposed method is improved by and , respectively. In Fig.4(f), the final values of proposed, compare1, and compare2 are 1.13, 1.00, and 1.00, respectively. Compared with compare1 and compare2, the evaluation performance of the proposed method is improved by . To illustrate the intermediate variation of the resilience capabilities and basic performances under the proposed evaluation method, we display the detailed capabilities and performances for the case of Fig.4(e) (ER network under random attack). Fig.5(a) shows the variation of the five core capabilities, and Fig.5(b) shows the variation of the five basic resilience performances and their average value, which are calculated by the DBN from the five core capabilities. The average value of the five basic performances in Fig.5(b) corresponds to the blue line in Fig.4(e) before normalization.
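One possible way to compare the reported final values is the relative gain over each baseline. The snippet below is only a sketch under that assumption: the definition (proposed minus baseline, divided by baseline) and the function name are not taken from the text, so its results need not match the improvement figures intended above.

```python
def relative_improvement(proposed, baseline):
    """Relative gain of `proposed` over `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (proposed - baseline) / baseline


# Final values reported in the text for Fig.4(c) and Fig.4(f).
fig4c = {"proposed": 0.95, "compare1": 0.72, "compare2": 0.56}
fig4f = {"proposed": 1.13, "compare1": 1.00, "compare2": 1.00}

for name, values in (("Fig.4(c)", fig4c), ("Fig.4(f)", fig4f)):
    for method in ("compare1", "compare2"):
        gain = relative_improvement(values["proposed"], values[method])
        print(f"{name} vs {method}: {gain:.1%}")
```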
V-C Evaluation under different attack and recovery scenarios
The focus of this paper is the effective evaluation of network resilience during dynamic change, rather than specific defense strategies or recovery algorithms. The network's attacks and recovery strategies are therefore input variables of our evaluation framework. To verify the performance of the proposed evaluation model under different attack and recovery scenarios, we conducted two groups of controlled experiments, covering three levels of attack intensity and three levels of recovery intensity. Each scenario contains three types of networks and two types of network change patterns, namely the random-based and centrality-based attack and recovery patterns. The controlled parameters are listed in Table II. In the varied attack intensity scenarios, the high-level attack is set as and , which means that of the original network nodes are affected by the attack and the part of these nodes whose are compromised by it. Meanwhile, the number of recovered nodes in each time step is set to 2, and the probability of recovery of nodes and edges is set to 1. The purpose of these fixed settings is to keep the recovery stage uniform across runs, so that the differences between the attack intensity scenarios stand out. Similarly, in the varied recovery intensity scenarios, the number of deleted nodes during destruction is set as and , which means that a fixed number of nodes is deleted at attack time. This eliminates the influence of different attacks on the evaluation of the varied network recovery strategies.
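The structure of the two controlled groups can be sketched as a simple enumeration. The labels and the dictionary layout below are illustrative assumptions; the concrete parameter values for each level are those of Table II and are not reproduced here.

```python
from itertools import product

LEVELS = ("low", "middle", "high")          # three intensity levels
NETWORKS = ("BA", "ER", "Geant2012")        # three network types
PATTERNS = ("random", "centrality")         # two change patterns


def build_scenarios(varied):
    """Enumerate one group of controlled runs; `varied` is 'attack' or 'recovery'."""
    if varied not in ("attack", "recovery"):
        raise ValueError("varied must be 'attack' or 'recovery'")
    return [
        {"varied": varied, "network": net, "level": level, "pattern": pattern}
        for net, level, pattern in product(NETWORKS, LEVELS, PATTERNS)
    ]


attack_group = build_scenarios("attack")
recovery_group = build_scenarios("recovery")
```

Each group contains 3 x 3 x 2 = 18 runs, that is, six curves for each of the three networks.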
Fig.6(a) shows the resilience performance under the varied attack intensity scenario in the BA network. As attack intensity increases, the network's resilience performance falls to a lower level in the resistance stage. As the blue solid curve shows, the resilience performance reaches its lowest level after time step 5 under high attack intensity and the centrality-based pattern. Compared with the blue dotted line, the centrality-based destruction pattern preferentially removes nodes with larger degree, which results in a faster decline of resilience performance. This feature is especially obvious in the BA network, where a small number of nodes connect to numerous edges. The centrality-based destruction pattern also requires more time to rebuild nodes and edges during the recovery stage. Under low attack intensity, the network can achieve a higher level of recovery through the evolution stage. The evolution phenomenon in the simulation can be implemented by deep reinforcement learning (DRL) in the networking domain. A DRL algorithm can learn from the network strategies and resilience values of previous stages to make more intelligent decisions, such as routing optimization, flow migration, and redistribution of resources, which results in a higher level of network resilience. As the most significant feature of resilience, evolution ability is likely to attract widespread attention in the future.
In Fig.6(b), unlike Fig.6(a), apart from the minimum level reached and the recovery time, there is no significant difference in the rate of decline or the final recovery level across attack intensities. The reason is that the connections between nodes are randomly generated and the node degree follows a Poisson distribution, so different attack intensities and destruction patterns do not significantly influence the resilience performance. The analysis of Fig.6(c) shows that the general characteristics above also hold in the real-world network Geant2012.
We now turn to the effect of different recovery intensities on network resilience performance. Because the attack intensity parameters are identical at attack time, the network resilience performance exhibits a similar decline during the resistance stage. On the whole, high recovery intensity yields a faster recovery and a higher final recovery level than middle and low intensity. However, the recovery pattern still causes some differences among the networks. In Fig.6(d), the BA network recovers slightly faster and reaches a higher recovery level under the centrality-based recovery pattern, because this pattern suits the BA network, whose topology is itself centrality-based. Similarly, under each recovery intensity, the random-based recovery pattern obtains a relatively higher recovery level in Fig.6(e). The analysis of Fig.6(f) shows that the general characteristics above also hold in the real-world network Geant2012. Under the same recovery pattern, a curve with high recovery intensity reaches a higher final recovery level.
The two experiments above show that the proposed network resilience evaluation algorithm evaluates resilience more effectively than the compared existing works. The complete resilience process, including the preparation, resistance, adaptation, recovery, and evolution stages, is well characterized by the proposed resilience evaluation model. Moreover, the resilience evaluation framework can be applied to various attack and recovery scenarios in different networks, which demonstrates its considerable applicability and universality.
VI Related Works
In general, resilience evaluation approaches fall into two major directions: qualitative and quantitative. The qualitative approach, which includes methods for assessing system resilience in the absence of numerical descriptors, contains two subcategories: (i) conceptual frameworks that provide best practices, and (ii) semi-quantitative indicators that provide expert assessment of different qualitative aspects of resilience. The quantitative approach focuses on general metrics and also includes two subcategories: (i) general resilience methods that provide probabilistic or deterministic metrics to quantify resilience across applications, and (ii) structure-based modeling methods that model domain-specific representations of resilience components. It should be noted that the focus of this paper is on quantitative approaches in the computer network domain.
Several studies have measured or evaluated network resilience in recent years. Ahmadian proposed a method to quantitatively measure network resilience for general physical networked systems, and defined network resilience as a function of criticality, disruption frequency, disruption impact, and recovery capability. Wang established the mapping relationship between the physical network and the logical network in the IoT system to build a cascade failure model, and employed the critical network failure probability and the cascade length as resilience metrics of the IoT system. Zhang measured network resilience by formulating and iterating the transfer matrix of network nodes and edges. However, considering only the effects of nodes and edges is not suitable for actual complex networks, and more impact indicators and larger-scale network simulation experiments should be considered in the future. Alenazi investigated a set of graph spectral robustness metrics and evaluated their accuracy in predicting network resilience against three centrality-based node attacks. Although that work was comprehensive from the perspective of graph spectral metrics, the measurement model did not reflect the time-varying nature of dynamic practical networked systems. Xu studied the impact of recovery resource allocation approaches on the recovery of scale-free networks under the constraint of a fixed amount of total resources. However, the resilience performance was calculated simply as the relative network size, which cannot reflect the true resilience of complex networked systems. Yodo proposed the DBN approach as a readily available modeling tool to quantify resilience performance in engineered resilient systems. The approach can capture the dynamic behavior of system performance, but it considers only reliability and restoration as quantification metrics.
The shortcoming of existing research is the lack of a systematic view of network resilience and the lack of an integrated treatment of comprehensive resilience capabilities, metric indicators, and time-varying resilience processes.
VII Conclusion and Future Work
Evaluating network resilience against random failures and targeted attacks is an important task in network evaluation and defense. In this paper, a comprehensive framework was proposed to quantify the network's time-varying resilience. The framework was developed from the definition of the dynamic capacities of network components and the measurement of five proposed core network resilience capabilities, which suit the multi-stage process of network resilience. The DBN approach was employed to quantify the five fundamental indicators of network resilience performance in a temporal network. Simulation experiments were carried out to validate the effectiveness and universality of the proposed evaluation framework. For future work, we plan to build a resilience evaluation system in a physical network environment. Additionally, SDN and network slicing technology deserve attention in future research for providing resilient network capability.
This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFB1800602 and No. 2017YFB0801703, the CERNET Innovation Project under Grant No. NGIICS20190101 and No. NGII20170406, and the National Natural Science Foundation of China (61602114).
-  (2019) Review of major approaches to analyze vulnerability in power system. Reliability Engineering & System Safety 183, pp. 153–172. Cited by: §I.
-  (2020) A quantitative approach for assessment and improvement of network resilience. Reliability Engineering & System Safety, pp. 106977. Cited by: §II-B, §VI.
-  (2020) A quantitative approach for assessment and improvement of network resilience. Reliability Engineering & System Safety, pp. 106977. Cited by: §II-A.
-  (2015) Comprehensive comparison and accuracy of graph metrics in predicting network resilience. In 2015 11th International Conference on the Design of Reliable Communication Networks (DRCN), pp. 157–164. Cited by: §V-B.
-  (2015) Evaluation and improvement of network resilience against attacks using graph spectral metrics. In 2015 Resilience Week (RWS), pp. 1–6. Cited by: §V-B, §VI.
-  (2013) Resilience-based network component importance measures. Reliability Engineering & System Safety 117, pp. 89–97. Cited by: §II-B.
-  (2008) Homeland security preparedness: balancing protection with resilience in emergent systems. Systems Engineering 11 (4), pp. 287–308. Cited by: §I.
-  (2016) A review of definitions and measures of system resilience. Reliability Engineering & System Safety 145, pp. 47–61. Cited by: §I.
-  (2016) A review of definitions and measures of system resilience. Reliability Engineering & System Safety 145, pp. 47–61. Cited by: §II-A.
-  (2020) Probabilistic framework to evaluate the resilience of engineering systems using bayesian and dynamic bayesian networks. Reliability Engineering & System Safety 198, pp. 106813. Cited by: §IV.
-  (2020) A comprehensive survey of service function chain provisioning approaches in sdn and nfv architecture. Computer Science Review 38, pp. 100298. Cited by: §III-A.
-  (2015) Benchmarking agency and organizational practices in resilience decision making. Environment Systems and Decisions 35 (2), pp. 185–195. Cited by: §I, §II-A.
-  (2019) Analysis framework of network security situational awareness and comparison of implementation methods. EURASIP Journal on Wireless Communications and Networking 2019 (1), pp. 205. Cited by: §II-C.
-  (2015) A survey of techniques for modeling and improving reliability of computing systems. IEEE Transactions on Parallel and Distributed Systems 27 (4), pp. 1226–1238. Cited by: §II-C.
-  (2017) A quantitative method for assessing resilience of interdependent infrastructures. Reliability Engineering & System Safety 157, pp. 35–53. Cited by: §I.
-  (2011) Systems resilience: a new analytical framework for nuclear nonproliferation. Albuquerque, NM: Sandia National Laboratories. Cited by: §I, §VI.
-  (2020) DeepMigration: flow migration for nfv with graph-based deep reinforcement learning. In ICC 2020-2020 IEEE International Conference on Communications (ICC), pp. 1–6. Cited by: §V-C.
-  (2020) The internet topology zoo. Note: http://www.topology-zoo.org/dataset.html Cited by: §V.
-  (2009) Autonomic traffic engineering for network robustness. IEEE journal on selected areas in communications 28 (1), pp. 39–50. Cited by: §III-A.
-  (2015) Designing resilient systems-of-systems: a survey of metrics, methods, and challenges. Systems Engineering 18 (5), pp. 491–510. Cited by: §I.
-  (2019) Resilience of iot systems against edge-induced cascade-of-failures: a networking perspective. IEEE Internet of Things Journal 6 (4), pp. 6952–6963. Cited by: §VI.
-  (2014) Improving robustness of complex networks via the effective graph resistance. The European Physical Journal B 87 (9), pp. 221. Cited by: §III-A, §III-A.
-  (2011) Principles of information security. Cengage Learning. Cited by: §III-B.
-  (2005) Creating foresight: lessons for enhancing resilience from columbia. Organization at the limit: lessons from the Columbia disaster. Cited by: §I.
-  (2015) Four concepts for resilience and the implications for the future of resilience engineering. Reliability Engineering & System Safety 141, pp. 5–9. Cited by: §I.
-  (2020) A deep-reinforcement learning approach for sdn routing optimization. In Proceedings of the 4th International Conference on Computer Science and Application Engineering, pp. 1–5. Cited by: §V-C.
-  (2020) Effect of resource allocation to the recovery of scale-free networks during cascading failures. Physica A: Statistical Mechanics and its Applications 540, pp. 123157. Cited by: §V-B, §VI.
-  (2017) Predictive resilience analysis of complex systems using dynamic bayesian networks. IEEE Transactions on Reliability 66 (3), pp. 761–770. Cited by: §IV, §VI.
-  (2016) Resilience modeling and quantification for engineered systems using bayesian networks. Journal of Mechanical Design 138 (3). Cited by: §IV.
-  (2020) Resilience measure of network systems by node and edge indicators. Reliability Engineering & System Safety, pp. 107035. Cited by: §VI.
-  (2020) BCTCP: a feedback-based congestion control method. China Communications 17 (6), pp. 13–25. Cited by: §III-A.
-  (2019) Resilience of transportation systems: concepts and comprehensive review. IEEE Transactions on Intelligent Transportation Systems 20 (12), pp. 4262–4276. Cited by: §I.
-  (2020) Cost-effective moving target defense against ddos attacks using trilateral game and multi-objective markov decision processes. Computers & Security 97, pp. 101976. Cited by: §I.