Defense Against Advanced Persistent Threats in Dynamic Cloud Storage: A Colonel Blotto Game Approach

01/19/2018 ∙ by Minghui Min, et al. ∙ Xiamen University Rutgers University 0

Advanced Persistent Threat (APT) attackers apply multiple sophisticated methods to continuously and stealthily steal information from the targeted cloud storage systems and can even induce the storage system to apply a specific defense strategy and attack it accordingly. In this paper, the interactions between an APT attacker and a defender allocating their Central Processing Units (CPUs) over multiple storage devices in a cloud storage system are formulated as a Colonel Blotto game. The Nash equilibria (NEs) of the CPU allocation game are derived for both symmetric and asymmetric CPUs between the APT attacker and the defender to evaluate how the limited CPU resources, the date storage size and the number of storage devices impact the expected data protection level and the utility of the cloud storage system. A CPU allocation scheme based on "hotbooting" policy hill-climbing (PHC) that exploits the experiences in similar scenarios to initialize the quality values to accelerate the learning speed is proposed for the defender to achieve the optimal APT defense performance in the dynamic game without being aware of the APT attack model and the data storage model. A hotbooting deep Q-network (DQN)-based CPU allocation scheme further improves the APT detection performance for the case with a large number of CPUs and storage devices. Simulation results show that our proposed reinforcement learning based CPU allocation can improve both the data protection level and the utility of the cloud storage system compared with the Q-learning based CPU allocation against APTs.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 6

page 7

page 8

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Cloud storage and cyber systems are vulnerable to Advanced Persistent Threats (APTs), in which an attacker applies multiple sophisticated methods such as the injection of multiple malwares to continuously and stealthily steal data from the targeted cloud storage system [1],[2] and [3]. APT attacks are difficult to detect and have caused privacy leakage and millions of dollars’ loss [4] and [5]. According to [6], more than 65% of the organizations in a survey in 2014 have experienced more APT attacks in their IT networks than last year.

The FlipIt game proposed in the seminal work [7] formulates the stealthy and continuous APT attacks and designs the scan interval to detect APTs on a given cyber system. The game theoretic study in [8] has provided insights to design the optimal scan intervals of a cyber system against APTs. Prospect theory has been applied in [9]

to investigate the probability distortion of an APT attacker against cloud storage and cumulative prospect theory has been used in

[10] to model the frame effect of an APT attacker to choose the attack interval. Most existing APT games ignore the strict resource constraints in the APT defense, such as the limited number of Central Processing Units (CPUs) of a storage defender and an APT attacker [7] and [11]. However, a cloud storage system with limited number of CPUs cannot scan all the data stored on the storage devices in a given time slot. To this end, encryptions and authentication techniques are applied to protect data privacy for cloud storage systems. On the other hand, an APT attacker with limited CPU resources cannot install malwares to steal all the data on the cloud storage system in a single time slot either [12].

It is challenging for a cloud storage system to optimize the CPU allocation to scan the storage devices under a large number of CPUs and storage devices without being aware of the APT attack strategy. Therefore, we use the Colonel Blotto game (CBG), a two-player zero-sum game with multiple battlefields to model the competition between an APT attacker and a storage defender, each with a limited total number of CPUs over a given number of storage devices. The player who applies more resources on a battlefield in a Colonel Blotto game wins it, and the overall payoff of a player in the game is proportional to the number of the winning battlefields [13]. The Colonel Blotto game has been recently applied to design the spectrum allocation of network service providers [14], the jamming resistance methods for Internet of Things [15] and [16].

Our previous work in [17] assumes that each storage device has the same amount of data and addresses APT attackers that does not change the attack policy. However, the storage devices usually have different amount of the data with different priority levels, and the data size and their priority level also change over time. By allocating more CPUs to scan the storage devices with more data, a storage defender can achieve a higher data protection level. Therefore, this work extends to a dynamic cloud storage system whose data size changes over time and addresses smart APTs, in which an attacker that can learn the defense strategy first chooses the attack strength to induce the storage system to apply a specific defense strategy and then attacks it accordingly.

By applying time sharing (or division), a defender can use a single CPU to scan multiple storage devices as battlefields to detect APTs in a time slot, and an attacker can use a single CPU to attack multiple devices with a single CPU yielding a roughly continuous CBG. According to [13], a pure-strategy Colonel Blotto game rarely achieves Nash equilibria (NEs). Therefore, we focus on the CBG with mixed strategies, in which both players choose their CPU allocation distribution and introduce randomness in their action selection to fool their opponent. The conditions under which the NEs exist in the CPU allocation game are provided to disclose how the number of storage devices, the size of the data stored in each storage device and the total number of CPUs in which the defender observes the impact on the data protection level and the utility of the cloud storage system against APTs.

The CBG-based CPU allocation game provides a framework to understand the strategic behavior of both sides, and the NE strategy relies on the detailed prior knowledge about the APT attack model. In particular, the cloud defender has to know the total number of the attack CPUs and the attack policy over the storage devices, which is challenging to accurately estimate in a dynamic storage system. On the other hand, the repeated interactions between the APT attacker and the defender over multiple time slots can be formulated as a dynamic CPU allocation game, and the defender can choose the security strategy according to the attack history. The APT defense decisions in the dynamic CPU allocation game can be approximately formulated as a Markov decision process (MDP) with finite states, in which the defender observes the state that consists of the previous attack CPU allocation and the current data storage distribution. Therefore, a defender can apply reinforcement learning (RL) techniques such as Q-learning to achieve the optimal CPU allocation over the storage devices to detect APTs in a dynamic game.

The policy hill-climbing (PHC) algorithm as an extension of Q-learning in the mixed-strategy game [18] enables an agent to achieve the optimal strategy without being aware of the underlying system model. For instance, the PHC-based CPU allocation scheme as proposed in our previous work in [17]

enables the defender to protect the storage devices with limited number of CPUs without being aware of the APT attack model. In this work, a “hotbooting” technique as a combination of transfer learning

[19] and RL exploits experiences in similar scenarios to accelerate the initial learning speed. We propose a “hotbooting” PHC-based CPU allocation scheme that chooses the number of the CPUs on each storage device based on the current state and the quality or Q-function that is initialized according to the APT detection experiences to reduce the exploration time at the initial learning stage.

We apply deep Q-network (DQN), a deep reinforcement learning technique recently developed by Google DeepMind in [20] and [21]

to accelerate the learning speed of the defender for the case with a large number of storage devices and defense CPUs. More specifically, the DQN-based CPU allocation exploits the deep convolutional neural network (CNN) to determine the Q-value for each CPU allocation and thus suppress the state space observed by the cloud storage defender. Simulation results demonstrate that this scheme can improve the data protection level, increase the APT attack cost, and enhance the utility of the cloud storage system against APTs.

The main contributions of this paper are summarized as follows:

(1) We formulate a CBG-based CPU allocation game against APTs with time-variant data size and attack policy. The NEs of the game are provided to disclose the impact of the number of storage devices, the amonut of data stored in each device and the total number of CPUs on the data protection level of the cloud storage system.

(2) A hotbooting PHC-based CPU allocation scheme is developed to achieve the optimal CPU allocation over the storage devices with low computational complexity and improve the data protection level compared with the PHC-based scheme as proposed in [17].

(3) A hotbooting DQN-based CPU allocation scheme is proposed to furture accelerate the learning speed and improve the resistance against APTs.

The rest of this paper is organized as follows: We review the related work in section II, and present the system model in section III. We formulate a CBG-based CPU allocation game and derive the NEs of the game in Section IV. A hotbooting PHC-based CPU allocation scheme and a hotbooting DQN-based scheme are developed in Sections V and VI, respectively. Simulation results are provided in Section VII and the conclusions of this work are drawn in Section VIII.

Ii Related work

The seminal work in [7] formulates a stealthy takeover game between an APT attacker and a defender, who compete to control a targeted cloud storage system. The APT scan interval on a single device has been optimized in [22] based on the FlipIt model without considering the constraint of scanning CPUs. The game between an overt defender and a stealthy attacker as investigated in [23] provides the best response of the periodic detection strategy against a non-adaptive attacker. The online learning algorithm as developed in[24] achieves the optimal timing of the security updates in the FlipIt game, and reduces the regret of the upper confidence bound compared with the periodic defense strategy. The APT defense game formulated in [12] extends the FlipIt game in [7] to multi-node systems with limited resources. The game among an APT attacker, a cloud defender and a mobile device as formulated in [25] combines the APT defense game in [7] with the signaling game between the cloud and the mobile device. The evolutionary game can capture the long term continuous behavior of APTs on cloud storage [26]. The information-trading and APT defense game formulated in [27] analyzes the joint APT and insider attacks. The subjective view of APT attackers under uncertainty scanning duration was analyzed in [9] based on prospect theory.

Colonel Blotto game models the competition between two players each with resource constraints. For example, the Colonel Blotto game with mixed-strategy as formulated in [14] studies the spectrum allocation of network service providers, yielding a fictitious play based allocation approach to compute the equilibrium of the game with discrete spectrum resources. The anti-jamming communication game as developed in [28] optimizes the transmit power over multiple channels in cognitive radio networks based on the NE of the CBG. The CBG-based jamming game as formulated in [15] shows that neither the defender nor the attacker can dominate with limited computational resources. The CBG-based jamming game as formulated in [16] shows how the number of subcarriers impacts the anti-jamming performance of Internet of Things with continuous and asymmetric radio power resources. The CBG-based phishing game as formulated in [29] investigates the dynamics of the detect-and-takedown defense against phishing attacks.

Reinforcement learning techniques have been used to improve network security. For instance, the minimax-Q learning based spectrum allocation as developed in [30] increases the spectrum efficiency in cognitive radio networks. The DQN-based anti-jamming communication scheme as designed in [31] applies DQN to choose the transmit channel and node mobility and can increase the signal-to-interference-plus-noise ratio of secondary users against cooperative jamming in cognitive radio networks. The PHC-based CPU allocation scheme as proposed in [17] applies PHC to improve the data protection level of the cloud storage system against APTs. Compared with our previous work in [17], this work improves the game model by incorporating the time-variant data storage model. We also apply both the hotbooting technique and DQN to accelerate the learning speed and thus improve the security performance for the case with a large number of storage devices and CPUs against the smart APT attacks in the dynamic cloud storage system.

Iii System Model

As illustrated in Fig. 1, the cloud storage system consists of storage devices, where device stores data of size at time , with . Let

be the data size vector of the cloud storage system, and

denote the total amount of the data stored in the cloud storage system at time ..

Fig. 1: CPU allocation game, in which a defender with CPUs chooses the CPU allocation strategy to scan the storage devices in the cloud storage system against an APT attacker with CPUs.

In this work, we consider an APT attacker who combines multiple attack methods, tools, and techniques such as [5] to steal data from the targeted cloud storage system over a long time. The attacker aims to steal more data from the storage devices with CPUs without being detected. At time , out of the CPUs are used to attack storage device , with . The attack CPU allocation at time is given by , where the attack action set is given by

(1)

A defender uses CPUs to scan the storage devices in the cloud storage system and aims to detect APTs as early as possible to minimize the total amount of the data stolen by the attacker. At time , out of the CPUs are allocated to scan the device for APTs, with . As each time slot is quite short, the storage defender can not scan all the data stored in the storage devices in a single time slot. The defense CPU allocation vector denoted by is defined as , where the defense action set is given by

(2)

If the attacker uses more CPUs than the defender in the APT defense game, the data stored in the storage device are assumed to be at risk. More specifically, the data stored in storage device are assumed to be safe if the number of the defense CPUs is greater than the number of attack CPUs at that time, i.e., , and the data are at risks if . If , both players have an equal opportunity to control the storage device. Let denote a sign function, with if , if , and 0 otherwise. Therefore, the data protection level of the cloud storage system at time denoted by is defined as the normalized size of the “safe” data that are protected by the defender and is given by

(3)

For ease of reference, our commonly used notations are summarized in Table I. The time index in the superscript is omitted if no confusion occurs.

Number of storage devices
Total number of defense/attack CPUs
Defense/attack CPU allocation vector
Action set of the defender/attacker
Utility of the defender/attacker at time
Total size of the stored data at time
Data size vector of devices at time
Data protection level at time
TABLE I: SUMMARY OF SYMBOLS AND NOTATIONS

Iv CBG-Based CPU Allocation Game

Colonel Blotto game is a powerful tool to study the strategic resource allocation of two agents each with limited resources in a competitive environment. Therefore, the interactions between the APT attacker and the defender of the cloud storage system regarding their CPU allocations can be formulated as a Colonel Blotto game with battlefields. By applying the time sharing (or division) technique, the defender (or attacker) can scan (or attack) multiple storage devices with a single CPU in a time slot, which can be approximately formulated as a continuous CBG. In this game, the defender chooses the defense CPU allocation vector to scan the devices at time , while the APT attacker chooses the attack CPU allocation .

The utility of the defender (or the attacker) at time denoted by (or ) depends on the size of the data stored in the devices, and the data protection level of each device at the time. In the zero-sum game, by (3) the utility of the defender is set as

(4)

The CBG-based CPU allocation game rarely has a pure-strategy NE, because the attack CPU allocation can be chosen according to the defense CPU allocation and to defeat it for a higher utility . Therefore, we study the CPU allocation game with mixed strategies, in which each player randomizes the CPU allocation strategies to fool the opponent.

In the mixed-strategy CPU allocation game, the defense strategy at time denoted by is the probability that the defender allocates CPUs to scan device , i.e., . Let be the -th highest feasible value of . The mixed-strategy defense action set denoted by is given by

(5)

The defense mixed strategy vector denoted by is given by

(6)

Similarly, let denote the probability that CPUs are used to attack device , i.e., , and be the -th highest feasible value of . The action set of the attacker in the mixed-strategy game denoted by is given by

(7)

The attacker chooses the CPU allocation strategy in this game denoted by with

(8)

The expected utility of the defender (or the attacker) averaged over all the feasible defense (or attack ) strategies is denoted by (or ) and given by (IV) as

(9)

The NE of the CBG-based CPU allocation game with mixed strategies denoted by provides the best-response policy, i.e., no player can increase his or her utility by unilaterally changing from the NE strategy. For example, if the defender chooses the CPU allocation strategy , the APT attacker cannot do better than selecting to attack the storage devices. By definition, we have

(10)
(11)

We first consider a CBG-based CPU allocation game with symmetric CPU resources, , i.e., the defender and the attacker have the same amount of computational resources. Let (or ) be an all-1 (or 0) matrix, be the lower floor function, and the normalized defense CPUs .

Theorem 1.

If and , the CPU allocation game has a NE given by

(12)
Proof.

The CPU allocation game can be formulated as a CBG with symmetric players on battlefields. The resource budget of the defender is , the value of the -th battlefield is , and the total value of battlefields is . Let

denote the uniform distribution between

and . By Proposition 1 in [32], the mixed-strategy CBG game has an NE given by , where

is the probability distribution of

M, and each vector coordinate is uniform distribution between 0 and . Therefore, the CPU allocation of the -th device is uniformly distributed on , i.e.,

(13)

Thus this game has an NE given by , where , , each element of x is given by

(14)

which results in (12). ∎

Corollary 1.

At the NE of the symmetric CPU allocation game , the expected data protection level is zero and the utility of the defender is zero.

Proof.

By (3) and (12), the data protection level over all the realizations of the mixed-strategy NE is given by

(15)
(16)

Similarly, by (IV) and (IV) , we have . ∎

Remark: If the APT attacker and the defender have the same number of CPUs and no storage device dominates in the game (i.e., , ), both players choose a number from to attack or scan storage device with probability by (14). The data protection level by definition ranges between and . Therefore, the game makes a tie, yielding zero expected data protection level and zero utility of the defender.

We next consider a CBG-based CPU allocation game with asymmetric players denoted by , in which the attacker and the defender have different number of CPUs and compete over storage devices with an equal data size, i.e., , .

Theorem 2.

If , and , , the NE of the CPU allocation game is given by

(17)
(18)
(19)
Proof.

The CPU allocation game can be formulated as a CBG with asymmetric players on

battlefields, where the defender (or attacker) chooses the probability density functions

x (or y) according to (or ) resource budget, and the resources allocated to the -th battlefield is (or ).

By Theorem 2 in [13], the unique Nash equilibrium for the defender and the attacker with is given by

(20)
(21)

Therefore, the CPU allocation of the -th storage device on NE is uniformly distributed on , i.e.,

(22)

Thus, the NE strategy of the CPU allocation game is given by

(23)

Thus, we have (17).

Similarly, we have

(24)

and thus

(25)

Thus, we have (18). ∎

Corollary 2.

At the NE of the CPU allocation game , the expected data protection level is and

(26)
Proof.

According to (3), (15), (17) and (18), as , we have

(27)
(28)

Similarly, by (IV) and (IV), we have (26). ∎

Remark: The defender has to have more CPU resources than APT attackers, otherwise the cloud storage system is unlikely to protect the data privacy. Therefore, a subset of the storage devices are safe from the attacker who has to match the defender on the other storage devices. In this case, the defender wins the game, and the utility increases with the total data size. The expected data protection level increases with the resource advantage of the defender over the attacker, i.e., .

The APT defense performance of the CPU allocation game at the NE is presented in Fig. 2, in which the storage devices are threatened by an APT attacker with attack CPUs. If the defender uses 1200 CPUs instead of 600 CPUs to scan the 20 devices, the data protection level increases about 10.5% to 93% and the utility of the defender increases by 18.75%, The data protection level of the cloud storage system protected by 1200 CPUs slightly decreases by 2.8%, if the number of the storage devices changes from 20 to 80. The APT defense performance of the CBG game at the NE provides the optimal defense performance with known APT attack model and defense model, and can be used as a guideline to design the CPU allocation scheme.

Fig. 2: APT defense performance of the CBG-based CPU allocation game at the NE with storage devices and defense CPUs against an APT attacker with 150 CPUs.

V Hotbooting PHC-based CPU Allocation

Fig. 3: Illustration of the hotbooting PHC-based defense CPU allocation.

As a defender is usually unaware of the attack policy, we propose a hotbooting PHC-based CPU allocation scheme to scan storage devices in the dynamic APT detection game, as illustrated in Fig. 3. At each time slot, the defender of the cloud storage system observes the amount of the data stored in each storage device that is quantized into levels, . In addition, the defender also evaluates the compromised storage devices that are found to be attacked by APTs in the last time slot, and uses them to estimate the last attack CPU allocation . The defense CPU allocation is chosen according to the current state denoted by , which consists of the current data sizes and the previous attack CPU allocation, i.e., . The resulting defense strategy , where is the defense action set given by (2).

1:  Hotbooting defense process in Algorithm 2
2:  Initialize , , , and
3:  Set ,
4:  for  do
5:     Observe the current data size
6:     
7:     Choose with via (32)
8:     for  do
9:        Allocate CPUs to scan storage device
10:     end for
11:     Observe the compromised storage devices and estimate
12:     Obtain via (IV)
13:     Update via (29)
14:     Update via (30)
15:     Update via (V)
16:  end for
Algorithm 1 CPU allocation with hotbooting PHC

The Q-function for each action-state pair denoted by is the expected discounted long-term reward of the defender, and is updated in each time slot according to the iterative Bellman equation as follows:

(29)

where the learning rate is the weight of the current experience, the discount factor indicates the uncertainty of the defender on the future reward, is the next state if the defender uses at state , and the value function maximizes over the action set given by

(30)

The mixed-strategy table of the defender denoted by provides the distribution of the number of CPUs M over the storage devices under state s and is updated via

(31)

In this way, the probability of the action that maximizes the Q-function increases by , with , and the probability of other actions decrease by . The defender then selects the number of CPUs according to the mixed strategy , i.e.,

(32)
1:  Initialize , , , , , and
2:  Set , ,
3:  for  do
4:     Emulate a similar CPU allocation scenario for the defender to scan storage devices
5:     for  do
6:        Observe the current data size
7:        
8:        Choose via (32)
9:        for  do
10:           Allocate CPUs to scan storage device
11:        end for
12:        Observe the compromised storage devices and estimate
13:        Obtain via (IV)
14:        Update Q and via (29)-(V)
15:     end for
16:  end for
17:  Output ,
Algorithm 2 Hotbooting defense process

We apply the hotbooting technique to initialize both the Q-value and the strategy table with the CPU allocation experiences in similar environments. The hotbooting PHC-based CPU allocation saves random explorations at the beginning stage of the dynamic game and thus accelerates the learning speed. As shown in Algorithm 2, CPU allocation experiences are performed before the game. Each experiment lasts time slots, in which the defender chooses the number of CPUs to scan the storage devices according to the mixed-strategy table . The defender observes the attack CPU distribution and evaluates the utility . Both the Q-function and are updated via (29)-(V) in each time in the experiences.

The Q-values as the output of the hotbootng process based on the experiences denoted by is used to initialize the Q-values in Algorithm 1. Similarly, the mixed-strategy table as the output of Algorithm 2 based on the experiences denoted by is used to initialize in Algorithm 1. The learning time of Algorithm 1 increases with the dimension of the action-state space , which increases with the number of storage devices in the cloud storage system and the number of CPUs, yielding serious performance degradation.

Fig. 4: Hotbooting DQN-based defense CPU allocation.

Vi Hotbooting DQN-Based CPU Allocation

In this section, we propose a hotbooting DQN-based CPU allocation scheme to improve the APT defense performance of the cloud storage system. This scheme applies deep convolutional neural network, a deep reinforcement learning technique, to compress the action-state space and thus accelerate the learning process. As illustrated in Fig. 4, the deep convolution neural network is a nonlinear approximator of the Q-value for each action. The CNN architecture allows a compact storage of the learned information between similar states [33].

The DQN-based CPU allocation as summarized in Algorithm 3 extends the system state as in Algorithm 1 to the experience sequence at time denoted by to accelerate the learning speed and improve the APT resistance. More specifically, the experience sequence consists of the current system state and the previous system state-action pairs, i.e., .

1:  Initialize , , and
2:  Set ,
3:  for  do
4:     Observe the current data size
5:     
6:     if  then
7:        Choose at random
8:     else
9:        
10:        Set as the input of the CNN
11:        Observe the output of the CNN to obtain
12:        Choose via (34)
13:     end if
14:     for  do
15:        Allocate CPUs to scan storage device
16:     end for
17:     Observe the compromised storage devices and estimate
18:     Obtain via (IV)
19:     Observe
20:     
21:     for  do
22:        Select at random
23:        Calculate via (36)
24:     end for
25:     Update via (VI)
26:  end for
Algorithm 3 Hotbooting DQN-based CPU allocation

The experience sequence is reshaped into a matrix and then input into the CNN, as shown in Fig. 4. The CNN consists of two convolutional (Conv) layers and two fully connected (FC) layers, with parameters chosen to achieve a good performance according to the experiment results as listed in Table II. The filter weights of the four layers in the CNN at time are denoted by for simplicity. The first Conv layer includes 20 different filters. Each filter has size

and uses stride 1. The output of the first Conv is 20 different

feature maps that are then passed through a rectified linear function (ReLU) as an activation function. The second Conv layer includes 40 different filters. Each filter has size

and stride 1. The outputs of the 2nd Conv layer are 40 different

feature maps, which are flattened to a 360-dimension vector and then sent to the two FC layers. The first FC layer involves 180 rectified linear units, and the second FC layer provides the Q-value for each CPU allocation policy

at the current system sequence .

Layer Conv1 Conv2 FC1 FC2
Input 360 180
Filter size / /
Stride 1 1 / /
# Filters 20 40 180
Activation ReLU ReLU ReLU ReLU
Output 180
TABLE II: CNN Parameters

The Q-function as the expected long-term reward for the state sequence and the action M, is given by definition as

(33)

where is the next state sequence by choosing defense CPU allocation M at state .

To make a tradeoff between exploitation and exploration, the defense CPU allocation is chosen according to the -greedy policy [34]. More specifically, the CPU allocation that maximizes the Q-function is chosen with a high probability , and other actions are selected with a low probability to avoid staying in the local maximum, i.e.,

(34)

Based on the experience replay as shown in Fig. 4, the CPU allocation experience at time denoted by is given by , and saved in the replay memory pool denoted by , with . An experience is chosen from the memory pool at random, with . The CNN parameters

are updated by the stochastic gradient descent (SGD) algorithm in which the mean-squared error between the network’s output and the target optimal Q-value is minimized with the minibatch updates. The loss function denoted by

in the stochastic gradient descent algorithm is chosen as

(35)

where the target value denoted by approximates the optimal value based on the previous CNN parameters , and is given by

(36)
1:  Initialize , , , , and
2:  for  do
3:     Emulate a similar CPU allocation scenario for the defender to scan storage devices
4:     for  do
5:        Observe the output of the CNN to obtain
6:        Choose via (34)
7:        for  do
8:           Allocate CPUs to scan storage device
9:        end for
10:        Observe the compromised storage devices and estimate
11:        Obtain via (IV)
12:        Observe the resulting state sequence
13:        
14:        Perform minibatch update as steps 19-23 in Algorithm 3 to update
15:     end for
16:  end for
17:  Output
Algorithm 4 Hotbooting process for Algorithm 3

The gradient of the loss function with respect to the weights is given by

(37)

This process repeats times to update in Algorithm 3.

Similar to Algorithm 1, we apply the hotbooting technique to initialize the CNN parameters in the DQN-based CPU allocation rather than initializing them randomly to accelerate the learning speed. As shown in Algorithm 4, the defender stores the emulational experience in the database and the resulting based on experiences are used to set as shown in Algorithm 3.

Vii Simulation Results

Simulations have been performed to evaluate the APT defense performance of the CPU allocation schemes in a cloud storage system, with the CNN parameters as listed in Table II. In the simulations, some APT attackers applied the -greedy algorithm to choose the number of CPUs to attack each of the storage devices based on the defense history and some smarter attackers first induced the defender to use a specific “optimal” defense strategy based on the estimated defense learning algorithm and then attacked the system accordingly. We set , , , , and , if not specified otherwise, to achieve good security performance according to the experiments not presented in this paper.

Fig. 5: APT defense performance of the cloud storage system with 10 storage devices and 10 defense CPUs against an APT attacker with 2 attack CPUs. The size of data stored in each storage device is 1.

In the first simulation, the defender with CPUs resisted the attacker with CPUs over 10 storage devices, each with normalized data size. As shown in Fig. 5, the hotbooting DQN-based CPU allocation scheme achieves the optimal policy in a dynamic APT defense game after convergence, which matches the theoretical results of the NE given by Theorem 2. For example, the data privacy level almost converge to the NE given by (28), and the utility of the defender almost converge to the NE given by (26). The hotbooting DQN-based CPU allocation scheme outperforms the hotbooting PHC with a faster learning speed, a higher data protection level and a higher utility. The latter in turn exceeds both PHC and Q-learning. For instance, the data protection level of the hotbooting DQN-based scheme is 14.92% higher than the PHC-based scheme at time slot 1000, which is 30.51% higher than the Q-learning based scheme. As a result, the hotbooting DQN-based scheme has a 14.92% higher utility than the PHC-based strategy at time slot 1000, which is 30.51% higher than that of the Q-learning based strategy. As the hotbooting DQN-based algorithm, an extension of Q learning, compress the learning state space by using CNN to accelerate the learning process and enhance the security performance of the cloud storage system. If the interaction time is long enough, the hotbooting PHC and Q-learning scheme can also converge to the NE of the theoretical results in Theorem 2. The PHC-based scheme has less computation complexity than DQN. For example, the PHC-based strategy takes less than 4% of the time to choose the CPU allocation in a time slot compared with the DQN-based scheme.

Fig. 6: APT defense performance of the cloud storage system with 3 storage devices and 16 defense CPUs against an APT attacker with 4 attack CPUs. Both the size of data stored on each device and the attack policy change every 1000 time slots.

In the second simulation, the size of the data stored in each of the storage devices of the cloud storage system changed every 1000 time slots. The total data size increases 1.167 times at the 1000-nd time slot and then increases 1.143 times at the 2000-nd time slot. The cloud storage system used CPUs to scan the storage devices and the APT attacker used CPUs to attack them. Besides, the attack policy changed every 1000 time slots. The APT attacker estimated the “optimal” defense CPU allocation due to the learning algorithm and launched an attack specifically against the estimated defense strategy at time slot 1000 and 2000 to steal data from the cloud storage system. As shown in Fig. 6, the hotbooting DQN-based CPU allocation is more robust against smart APTs and the time-variant cloud storage system. For example, the data protection level of the hotbooting DQN-based scheme is 30.98% higher than that of the PHC-based scheme at time slot 1000, which is 97.87% higher than that of the Q-learning based scheme. As a result, the hotbooting DQN-based scheme has a 30.69% higher utility than the PHC-based strategy at time slot 1000, which is 96.97% higher than the Q-learning based strategy.

Fig. 7: APT defense performance of the cloud storage system with defense CPUs, and 3 storage devices that are attacked by 4 attack CPUs, averaged over 3000 time slots. The size of data stored in each storage device changes every 1000 time slots.

As shown in Fig. 7, both the data protection level and the utility increase with the number of defense CPUs. For instance, if the number of the defense CPUs changes from 12 to 16, the data protection level and the utility of the defender with hotbooting DQN-based APT defense increase by 14.20% and 14.03%, respectively. In the dynamic game with , and , the data protection level of the hotbooting DQN-based scheme is 15.85% higher than that of PHC, which is 21.62% higher than Q-learning, and the utility of the hotbooting DQN-based CPU allocation scheme is 15.04% higher than PHC, which is 21.64% higher than Q-learning.

Fig. 8: APT defense performance of the cloud storage system with storage devices and 21 defense CPUs against an APT attacker with 4 attack CPUs, averaged over 3000 time slots. The size of data stored in each storage device changes every 1000 time slots.

As shown in Fig. 8, the APT defense performance slightly decreases with more number of storage devices in a cloud storage system. However, the hotbooting DQN-based scheme still maintains a high data protection level, i.e., , if 4 storage devices are protected by 21 CPUs and attacked by 4 CPUs. In another example, the hotbooting DQN-based scheme can protect up to 80.05% of the data stored in 6 storage devices in the cloud. The defender with less number of CPUs has to distribute its resources among all the storage devices to resist APTs, as at least one CPU has to scan each storage device. The performance gain of the hotbooting DQN-based CPU allocation scheme over the hotbooting PHC-based scheme increases with the number of storage devices in the cloud system.

Viii Conclusion

In this paper, we have formulated a CBG-based CPU allocation game for the APT defense of cloud storage and cyber systems and provided the NEs of the game to show how the number of storage devices, the data sizes in the storage devices and the total number of CPUs impact on the data protection level of the cloud storage system and the defender’s utility. A hotbooting DQN-based CPU allocation strategy has been proposed for the defender to scan the storage devices without being aware of the attack model and the data storage model in the dynamic game. The proposed scheme can improve the data protection level with a faster learning speed and is more robust against smart APT attackers that choose the attack policy based on the estimated defense learning scheme. For instance, the data protection level of the cloud storage system and the utility of the defender increases by 22.29% and 22.4%, respectively, compared with the Q-learning based scheme in the cloud storage system with 4 storage devices and 16 defense CPUs against an APT attacker with 4 CPUs. A hotbooting PHC-based CPU allocation scheme can reduce the computation complexity.

References

  • [1] P. Giura and W. Wang, “Using large scale distributed computing to unveil advanced persistent threats,” Science, vol. 1, no. 3, pp. 93–105, 2013.
  • [2]

    Y. Han, T. Alpcan, J. Chan, C. Leckie, and B. I. Rubinstein, “A game theoretical approach to defend against co-resident attacks in cloud computing: Preventing co-residence using semi-supervised learning,”

    IEEE Trans. Inf. Forensics Security, vol. 11, no. 3, pp. 556–570, Dec. 2016.
  • [3] Q. Wang, C. Wang, K. Ren, W. Lou, and J. Li, “Enabling public auditability and data dynamics for storage security in cloud computing,” IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 5, pp. 847–859, May 2011.
  • [4] J. Vukalović and D. Delija, “Advanced persistent threats-detection and defense,” in Proc. IEEE Inf. and Commun. Technol., Electron. and Microelectronics, pp. 1324–1330, Opatija, Croatia, May 2015.
  • [5] C. Tankard, “Advanced persistent threats and how to monitor and deter them,” Network Security, vol. 8, no. 8, pp. 16–19, Aug. 2011.
  • [6] R. B. Sagi, “The Economic Impact of Advanced Persistent Threats,” IBM Research Intelligence, May 2014, http://www-01.ibm.com.
  • [7] M. Van Dijk, A. Juels, A. Oprea, and R. L. Rivest, “Flipit: The game of ‘stealthy takeover’,” J. Cryptol., vol. 26, no. 4, pp. 655–713, Oct. 2013.
  • [8] M. Zhang, Z. Zheng, and N. B. Shroff, “Stealthy attacks and observable defenses: A game theoretic model under strict resource constraints,” in Proc. IEEE Global Conf. Signal and Inf. Processing (GlobalSIP), pp. 813–817, Atlanta, GA, Dec. 2014.
  • [9] L. Xiao, D. Xu, Y. Li, N. B. Mandayam, and H. V. Poor, “Cloud storage defense against advanced persistent threats: A prospect theoretic study,” IEEE J. Sel. Areas Commun., vol. 35, no. 3, pp. 534–544, Mar. 2017.
  • [10] L. Xiao, D. Xu, N. B. Mandayam, and H. V. Poor, “Cumulative prospect theoretic study of a cloud storage defense game against advanced persistent threats,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), BigSecurity Workshop, pp. 1–6, Atlanta, GA, May 2017.
  • [11]

    M. H. Manshaei, Q. Zhu, T. Alpcan, T. Bacşar, and J.-P. Hubaux, “Game theory meets network security and privacy,”

    ACM Comput. Surv., vol. 45, no. 3, pp. 25:1–25:39, Jul. 2013.
  • [12] M. Zhang, Z. Zheng, and N. B. Shroff, “A game theoretic model for defending against stealthy attacks with limited resources,” in Proc. Int. Conf. Decision and Game Theory for Security, pp. 93–112, London, UK, Nov. 2015.
  • [13] B. Roberson, “The Colonel Blotto game,” Econ. Theory, vol. 29, no. 1, pp. 1–24, Sep. 2006.
  • [14] M. Hajimirsadeghi, G. Sridharan, W. Saad, and N. B. Mandayam, “Inter-network dynamic spectrum allocation via a Colonel Blotto game,” in Proc. IEEE Annu. Conf. Inf. Sci. Syst. (CISS), pp. 252–257, Princeton, NJ, Mar. 2016.
  • [15] M. Labib, S. Ha, W. Saad, and J. H. Reed, “A Colonel Blotto game for anti-jamming in the Internet of Things,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), pp. 1–6, San Diego, CA, Dec. 2015.
  • [16] N. Namvar, W. Saad, N. Bahadori, and B. Kelley, “Jamming in the Internet of Things: A game-theoretic perspective,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), pp. 1–6, Washington, D.C., Dec. 2016.
  • [17] M. Min, L. Xiao, C. Xie, M. Hajimirsadeghi, and N. B. Mandayam, “Defense against advanced persistent threats: A Colonel Blotto game approach,” in Proc. IEEE Int. Conf. Commun. (ICC), pp. 1–6, Paris, France, May 2017.
  • [18] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in Proc. Int. Joint Conf. Artificial Intell., pp. 1021–1026, Seattle, D.C., Aug. 2001.
  • [19] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
  • [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Jan. 2015.
  • [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, et al., “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, Dec. 2013.
  • [22] K. D. Bowers, M. Van Dijk, R. Griffin, A. Juels, A. Oprea, R. L. Rivest, and N. Triandopoulos, “Defending against the unknown enemy: Applying FlipIt to system security,” in GameSec, pp. 248–263, Springer, 2012.
  • [23] A. Laszka, B. Johnson, and J. Grossklags, “Mitigating covert compromises,” in Int. Conf. Web and Internet Econ., pp. 319–332, Cambridge, MA, Dec. 2013.
  • [24] Z. Zheng, N. B. Shroff, and P. Mohapatra, “When to reset your keys: Optimal timing of security updates via learning,” in

    Proc. AAAI Conf. Artificial Intelligence

    , pp. 1–7, San Francisco, CA, Feb. 2017.
  • [25] J. Pawlick, S. Farhang, and Q. Zhu, “Flip the cloud: Cyber-physical signaling games in the presence of advanced persistent threats,” in Proc. Int. Conf. Decision and Game Theory for Security, pp. 289–308, London, UK, Nov. 2015.
  • [26] A. Alabdel Abass, L. Xiao, N. B. Mandayam, and Z. Gajic, “Evolutionary game therotic analysis of advanced persistent threats against cloud storage,” IEEE Access, vol. 5, pp. 8482–8491, Apr. 2017.
  • [27] P. Hu, H. Li, H. Fu, D. Cansever, and P. Mohapatra, “Dynamic defense strategy against advanced persistent threat with insiders,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), pp. 747–755, HK, China, May 2015.
  • [28] Y. Wu, B. Wang, K. R. Liu, and T. C. Clancy, “Anti-jamming games in multi-channel cognitive radio networks,” IEEE J. Sel. Areas Commun., vol. 30, no. 1, pp. 4–15, Jan. 2012.
  • [29] P. Chia and J. Chuang, “Colonel Blotto in the phishing war,” Decision and Game Theory for Security, pp. 201–218, 2011.
  • [30] B. Wang, Y. Wu, K. R. Liu, and T. C. Clancy, “An anti-jamming stochastic game for cognitive radio networks,” IEEE J. Sel. Areas Commun., vol. 29, no. 4, pp. 877–889, Apr. 2011.
  • [31] G. Han, L. Xiao, and H. V. Poor, “Two-dimensional anti-jamming communication based on deep reinforcement learning,” in Proc. IEEE Int. Conf. Acous., Speech, Signal Processing (ICASSP), pp. 1–5, New Orleans, LA, Mar. 2017.
  • [32] C. Thomas, “N-dimensional Blotto game with heterogeneous battlefield values,” Economic Theory, pp. 1–36, Jan. 2017.
  • [33] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
  • [34] E. Rodrigues Gomes and R. Kowalczyk, “Dynamic analysis of multiagent Q-learning with -greedy exploration,” in Proc. ACM Annual Int’l. Conf. Mach. Learn. (ICML), pp. 369–376, Montreal, Canada, Jun. 2009.