I Introduction
Cloud storage and cyber systems are vulnerable to Advanced Persistent Threats (APTs), in which an attacker applies multiple sophisticated methods, such as the injection of multiple malwares, to continuously and stealthily steal data from the targeted cloud storage system [1]-[3]. APT attacks are difficult to detect and have caused privacy leakage and losses of millions of dollars [4], [5]. According to [6], more than 65% of the organizations surveyed in 2014 experienced more APT attacks in their IT networks than in the previous year.
The FlipIt game proposed in the seminal work [7] formulates stealthy and continuous APT attacks and designs the scan interval to detect APTs on a given cyber system. The game theoretic study in [8] has provided insights for designing the optimal scan intervals of a cyber system against APTs. Prospect theory has been applied in [9] to investigate the probability distortion of an APT attacker against cloud storage, and cumulative prospect theory has been used in [10] to model the framing effect of an APT attacker choosing the attack interval. Most existing APT games ignore the strict resource constraints in APT defense, such as the limited number of Central Processing Units (CPUs) of a storage defender and an APT attacker [7], [11]. However, a cloud storage system with a limited number of CPUs cannot scan all the data stored on the storage devices in a given time slot; encryption and authentication techniques are therefore applied to protect data privacy in cloud storage systems. On the other hand, an APT attacker with limited CPU resources cannot install malware to steal all the data on the cloud storage system in a single time slot either [12]. It is challenging for a cloud storage system to optimize the allocation of a large number of CPUs over a large number of storage devices without being aware of the APT attack strategy. Therefore, we use the Colonel Blotto game (CBG), a two-player zero-sum game with multiple battlefields, to model the competition between an APT attacker and a storage defender, each with a limited total number of CPUs, over a given number of storage devices. The player who applies more resources to a battlefield in a Colonel Blotto game wins it, and the overall payoff of a player is proportional to the number of winning battlefields [13]. The Colonel Blotto game has recently been applied to design the spectrum allocation of network service providers [14] and the jamming resistance of the Internet of Things [15], [16].
Our previous work in [17] assumes that each storage device holds the same amount of data and addresses APT attackers that do not change their attack policy. However, the storage devices usually hold different amounts of data with different priority levels, and the data sizes and priority levels also change over time. By allocating more CPUs to scan the storage devices with more data, a storage defender can achieve a higher data protection level. Therefore, this work extends [17] to a dynamic cloud storage system whose data sizes change over time, and addresses smart APTs, in which an attacker that can learn the defense strategy first chooses the attack strength to induce the storage system to apply a specific defense strategy and then attacks it accordingly.
By applying time sharing (or division), a defender can use a single CPU to scan multiple storage devices (as battlefields) to detect APTs in a time slot, and an attacker can likewise use a single CPU to attack multiple devices, yielding an approximately continuous CBG. According to [13], a pure-strategy Colonel Blotto game rarely admits a Nash equilibrium (NE). Therefore, we focus on the CBG with mixed strategies, in which both players choose a distribution over CPU allocations and introduce randomness into their action selection to fool their opponent. We provide the conditions under which NEs of the CPU allocation game exist, disclosing how the number of storage devices, the size of the data stored in each device, and the total number of CPUs impact the data protection level and the utility of the cloud storage system against APTs.
The CBG-based CPU allocation game provides a framework to understand the strategic behavior of both sides, but the NE strategy relies on detailed prior knowledge of the APT attack model. In particular, the cloud defender has to know the total number of attack CPUs and the attack policy over the storage devices, which are challenging to estimate accurately in a dynamic storage system. On the other hand, the repeated interactions between the APT attacker and the defender over multiple time slots can be formulated as a dynamic CPU allocation game, in which the defender can choose the security strategy according to the attack history. The APT defense decisions in the dynamic CPU allocation game can be approximately formulated as a Markov decision process (MDP) with finite states, in which the defender observes the state consisting of the previous attack CPU allocation and the current data storage distribution. Therefore, a defender can apply reinforcement learning (RL) techniques such as Q-learning to achieve the optimal CPU allocation over the storage devices to detect APTs in the dynamic game.
The policy hill-climbing (PHC) algorithm, an extension of Q-learning to mixed-strategy games [18], enables an agent to achieve the optimal strategy without being aware of the underlying system model. For instance, the PHC-based CPU allocation scheme proposed in our previous work [17] enables the defender to protect the storage devices with a limited number of CPUs without being aware of the APT attack model. In this work, a "hotbooting" technique, a combination of transfer learning [19] and RL, exploits experiences in similar scenarios to accelerate the initial learning speed. We propose a hotbooting PHC-based CPU allocation scheme that chooses the number of CPUs on each storage device based on the current state and a quality (or Q-) function that is initialized according to previous APT detection experiences, reducing the exploration time at the initial learning stage. We apply deep Q-network (DQN), a deep reinforcement learning technique recently developed by Google DeepMind [20], [21], to accelerate the learning speed of the defender for the case with a large number of storage devices and defense CPUs. More specifically, the DQN-based CPU allocation exploits a deep convolutional neural network (CNN) to estimate the Q-value of each CPU allocation and thus compress the state space observed by the cloud storage defender. Simulation results demonstrate that this scheme can improve the data protection level, increase the APT attack cost, and enhance the utility of the cloud storage system against APTs.
The main contributions of this paper are summarized as follows:
(1) We formulate a CBG-based CPU allocation game against APTs with time-variant data sizes and attack policy. The NEs of the game are provided to disclose the impact of the number of storage devices, the amount of data stored in each device and the total number of CPUs on the data protection level of the cloud storage system.
(2) A hotbooting PHC-based CPU allocation scheme is developed to achieve the optimal CPU allocation over the storage devices with low computational complexity and improve the data protection level compared with the PHC-based scheme proposed in [17].
(3) A hotbooting DQN-based CPU allocation scheme is proposed to further accelerate the learning speed and improve the resistance against APTs.
The rest of this paper is organized as follows: We review the related work in Section II and present the system model in Section III. We formulate the CBG-based CPU allocation game and derive its NEs in Section IV. A hotbooting PHC-based CPU allocation scheme and a hotbooting DQN-based scheme are developed in Sections V and VI, respectively. Simulation results are provided in Section VII, and conclusions are drawn in Section VIII.
II Related Work
The seminal work in [7] formulates a stealthy takeover game between an APT attacker and a defender, who compete to control a targeted cloud storage system. The APT scan interval on a single device has been optimized in [22] based on the FlipIt model without considering the constraint on scanning CPUs. The game between an overt defender and a stealthy attacker investigated in [23] provides the best response of the periodic detection strategy against a non-adaptive attacker. The online learning algorithm developed in [24] achieves the optimal timing of security updates in the FlipIt game and reduces the regret of the upper confidence bound compared with the periodic defense strategy. The APT defense game formulated in [12] extends the FlipIt game in [7] to multi-node systems with limited resources. The game among an APT attacker, a cloud defender and a mobile device formulated in [25] combines the APT defense game in [7] with a signaling game between the cloud and the mobile device. The evolutionary game can capture the long-term continuous behavior of APTs on cloud storage [26]. The information-trading and APT defense game formulated in [27] analyzes joint APT and insider attacks. The subjective view of APT attackers under uncertain scan durations was analyzed in [9] based on prospect theory.
The Colonel Blotto game models the competition between two players, each with resource constraints. For example, the mixed-strategy Colonel Blotto game formulated in [14] studies the spectrum allocation of network service providers, yielding a fictitious-play based allocation approach to compute the equilibrium of the game with discrete spectrum resources. The anti-jamming communication game developed in [28] optimizes the transmit power over multiple channels in cognitive radio networks based on the NE of the CBG. The CBG-based jamming game formulated in [15] shows that neither the defender nor the attacker can dominate with limited computational resources. The CBG-based jamming game formulated in [16] shows how the number of subcarriers impacts the anti-jamming performance of the Internet of Things with continuous and asymmetric radio power resources. The CBG-based phishing game formulated in [29] investigates the dynamics of the detect-and-takedown defense against phishing attacks.
Reinforcement learning techniques have been used to improve network security. For instance, the minimax-Q learning based spectrum allocation developed in [30] increases the spectrum efficiency in cognitive radio networks. The DQN-based anti-jamming communication scheme designed in [31] applies DQN to choose the transmit channel and node mobility, and can increase the signal-to-interference-plus-noise ratio of secondary users against cooperative jamming in cognitive radio networks. The PHC-based CPU allocation scheme proposed in [17] applies PHC to improve the data protection level of the cloud storage system against APTs. Compared with our previous work in [17], this work improves the game model by incorporating a time-variant data storage model. We also apply both the hotbooting technique and DQN to accelerate the learning speed and thus improve the security performance for the case with a large number of storage devices and CPUs against smart APT attacks in the dynamic cloud storage system.
III System Model
As illustrated in Fig. 1, the cloud storage system consists of $N$ storage devices, where device $i$ stores data of size $D_i^{(k)}$ at time $k$, with $1 \le i \le N$. Let $\mathbf{D}^{(k)} = \big[D_i^{(k)}\big]_{1 \le i \le N}$ be the data size vector of the cloud storage system, and $D^{(k)} = \sum_{i=1}^{N} D_i^{(k)}$ denote the total amount of the data stored in the cloud storage system at time $k$. In this work, we consider an APT attacker who combines multiple attack methods, tools, and techniques such as [5] to steal data from the targeted cloud storage system over a long time. The attacker aims to steal more data from the storage devices with $\hat{C}$ CPUs without being detected. At time $k$, $y_i^{(k)}$ out of the $\hat{C}$ CPUs are used to attack storage device $i$, with $\sum_{i=1}^{N} y_i^{(k)} \le \hat{C}$. The attack CPU allocation at time $k$ is given by $\mathbf{y}^{(k)} = \big[y_i^{(k)}\big]_{1 \le i \le N} \in \mathcal{Y}$, where the attack action set $\mathcal{Y}$ is given by
$\mathcal{Y} = \Big\{ \big[y_i\big]_{1 \le i \le N} : y_i \ge 0, \ \sum_{i=1}^{N} y_i \le \hat{C} \Big\}.$   (1)
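For intuition, if the CPU allocations are restricted to integers (an assumption made here purely for enumeration; the game itself allows time sharing, making allocations continuous), the attack action set can be enumerated directly:

```python
from itertools import product

def action_set(n_devices, n_cpus):
    """All integer CPU allocations over n_devices using at most n_cpus."""
    return [a for a in product(range(n_cpus + 1), repeat=n_devices)
            if sum(a) <= n_cpus]

acts = action_set(3, 2)
print(len(acts))  # 10 feasible allocations of at most 2 CPUs over 3 devices
```

The rapid growth of this set with the number of devices and CPUs is exactly the state-action explosion that motivates the learning-based schemes in Sections V and VI.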
A defender uses $C$ CPUs to scan the storage devices in the cloud storage system and aims to detect APTs as early as possible to minimize the total amount of the data stolen by the attacker. At time $k$, $x_i^{(k)}$ out of the $C$ CPUs are allocated to scan device $i$ for APTs, with $\sum_{i=1}^{N} x_i^{(k)} \le C$. As each time slot is quite short, the storage defender cannot scan all the data stored on the storage devices in a single time slot. The defense CPU allocation vector is defined as $\mathbf{x}^{(k)} = \big[x_i^{(k)}\big]_{1 \le i \le N} \in \mathcal{X}$, where the defense action set $\mathcal{X}$ is given by
$\mathcal{X} = \Big\{ \big[x_i\big]_{1 \le i \le N} : x_i \ge 0, \ \sum_{i=1}^{N} x_i \le C \Big\}.$   (2)
If the attacker uses more CPUs than the defender on a storage device, the data stored on that device are assumed to be at risk. More specifically, the data stored on storage device $i$ are assumed to be safe if the number of defense CPUs exceeds the number of attack CPUs at that time, i.e., $x_i^{(k)} > y_i^{(k)}$, and the data are at risk if $x_i^{(k)} < y_i^{(k)}$. If $x_i^{(k)} = y_i^{(k)}$, both players have an equal opportunity to control the storage device. Let $\mathrm{sgn}(\cdot)$ denote the sign function, with $\mathrm{sgn}(z) = 1$ if $z > 0$, $\mathrm{sgn}(z) = -1$ if $z < 0$, and $0$ otherwise. The data protection level of the cloud storage system at time $k$, denoted by $\rho^{(k)}$, is then defined via the sign function as the normalized size of the "safe" data protected by the defender and is given by
$\rho^{(k)} = \frac{1}{D^{(k)}} \sum_{i=1}^{N} D_i^{(k)} \, \mathrm{sgn}\big(x_i^{(k)} - y_i^{(k)}\big).$   (3)
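A minimal sketch of this protection-level computation, with device data sizes and CPU allocations given as plain lists (the numerical values are hypothetical):

```python
def protection_level(x, y, d):
    """Normalized data protection level: each device contributes +D_i if the
    defender outnumbers the attacker there, -D_i if outnumbered, 0 on a tie."""
    sign = lambda z: (z > 0) - (z < 0)
    return sum(di * sign(xi - yi) for xi, yi, di in zip(x, y, d)) / sum(d)

# Two devices of sizes 60 and 40: defender wins the big one, loses the small.
print(protection_level([3, 1], [1, 2], [60, 40]))  # (60 - 40) / 100 = 0.2
```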
For ease of reference, our commonly used notations are summarized in Table I. The time index in the superscript is omitted if no confusion occurs.
$N$: Number of storage devices
$C$ / $\hat{C}$: Total number of defense/attack CPUs
$\mathbf{x}^{(k)}$ / $\mathbf{y}^{(k)}$: Defense/attack CPU allocation vector
$\mathcal{X}$ / $\mathcal{Y}$: Action set of the defender/attacker
$u^{(k)}$ / $\hat{u}^{(k)}$: Utility of the defender/attacker at time $k$
$D^{(k)}$: Total size of the stored data at time $k$
$\mathbf{D}^{(k)}$: Data size vector of the $N$ devices at time $k$
$\rho^{(k)}$: Data protection level at time $k$
IV CBG-Based CPU Allocation Game
The Colonel Blotto game is a powerful tool to study the strategic resource allocation of two agents, each with limited resources, in a competitive environment. Therefore, the interactions between the APT attacker and the defender of the cloud storage system regarding their CPU allocations can be formulated as a Colonel Blotto game with $N$ battlefields. By applying the time-sharing (or division) technique, the defender (or attacker) can scan (or attack) multiple storage devices with a single CPU in a time slot, so the game can be approximately formulated as a continuous CBG. In this game, the defender chooses the defense CPU allocation vector $\mathbf{x}^{(k)} \in \mathcal{X}$ to scan the $N$ devices at time $k$, while the APT attacker chooses the attack CPU allocation $\mathbf{y}^{(k)} \in \mathcal{Y}$.
The utility of the defender (or the attacker) at time $k$, denoted by $u^{(k)}$ (or $\hat{u}^{(k)}$), depends on the size of the data stored on each device and the protection level of each device at that time. In the zero-sum game, the utility of the attacker is $\hat{u}^{(k)} = -u^{(k)}$ and, by (3), the utility of the defender is set as

$u^{(k)} = \sum_{i=1}^{N} D_i^{(k)} \, \mathrm{sgn}\big(x_i^{(k)} - y_i^{(k)}\big) = D^{(k)} \rho^{(k)}.$   (4)
The CBG-based CPU allocation game rarely has a pure-strategy NE, because the attack CPU allocation $\mathbf{y}$ can be chosen according to the defense CPU allocation $\mathbf{x}$ to defeat it for a higher utility $\hat{u}$. Therefore, we study the CPU allocation game with mixed strategies, in which each player randomizes the CPU allocation to fool the opponent.
In the mixed-strategy CPU allocation game, the defense strategy at time $k$, denoted by $f_{i,j}^{(k)}$, is the probability that the defender allocates $x_{[j]}$ CPUs to scan device $i$, i.e., $f_{i,j}^{(k)} = \Pr\big(x_i^{(k)} = x_{[j]}\big)$, where $x_{[j]}$ is the $j$-th highest feasible value of $x_i$. The mixed-strategy defense action set, denoted by $\Omega_{\mathcal{X}}$, is given by

$\Omega_{\mathcal{X}} = \Big\{ \big[f_{i,j}\big] : f_{i,j} \ge 0, \ \sum_{j} f_{i,j} = 1, \ 1 \le i \le N \Big\}.$   (5)

The defense mixed-strategy vector, denoted by $\mathbf{f}^{(k)}$, is given by

$\mathbf{f}^{(k)} = \big[f_{i,j}^{(k)}\big] \in \Omega_{\mathcal{X}}.$   (6)

Similarly, let $g_{i,j}^{(k)}$ denote the probability that $y_{[j]}$ CPUs are used to attack device $i$, i.e., $g_{i,j}^{(k)} = \Pr\big(y_i^{(k)} = y_{[j]}\big)$, where $y_{[j]}$ is the $j$-th highest feasible value of $y_i$. The action set of the attacker in the mixed-strategy game, denoted by $\Omega_{\mathcal{Y}}$, is given by

$\Omega_{\mathcal{Y}} = \Big\{ \big[g_{i,j}\big] : g_{i,j} \ge 0, \ \sum_{j} g_{i,j} = 1, \ 1 \le i \le N \Big\}.$   (7)

The attacker chooses the CPU allocation strategy in this game, denoted by $\mathbf{g}^{(k)}$, with

$\mathbf{g}^{(k)} = \big[g_{i,j}^{(k)}\big] \in \Omega_{\mathcal{Y}}.$   (8)
The expected utility of the defender (or the attacker), averaged over all the feasible defense strategies $\mathbf{f}$ and attack strategies $\mathbf{g}$, is denoted by $U$ (or $\hat{U} = -U$) and given by

$U(\mathbf{f}, \mathbf{g}) = \mathbb{E}_{\mathbf{x} \sim \mathbf{f}, \, \mathbf{y} \sim \mathbf{g}} \Big[ \sum_{i=1}^{N} D_i \, \mathrm{sgn}(x_i - y_i) \Big].$   (9)
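With discrete supports, this expectation can be evaluated by enumerating the two independent mixed strategies. The dictionaries below map allocation tuples to probabilities and are hypothetical strategies chosen for illustration, not equilibrium ones:

```python
from itertools import product

def expected_utility(fx, fy, d):
    """Expected defender utility when the players draw allocations from
    independent discrete mixed strategies fx, fy: {allocation tuple: prob}."""
    sign = lambda z: (z > 0) - (z < 0)
    return sum(px * py * sum(di * sign(a - b) for a, b, di in zip(x, y, d))
               for (x, px), (y, py) in product(fx.items(), fy.items()))

fx = {(2, 0): 0.5, (0, 2): 0.5}   # defender randomizes over the two devices
fy = {(1, 1): 1.0}                # attacker splits deterministically
print(expected_utility(fx, fy, [1, 1]))  # 0.0 -- each draw wins one, loses one
```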
The NE of the CBG-based CPU allocation game with mixed strategies, denoted by $(\mathbf{f}^*, \mathbf{g}^*)$, provides the best-response policy of each player, i.e., no player can increase his or her utility by unilaterally deviating from the NE strategy. For example, if the defender chooses the CPU allocation strategy $\mathbf{f}^*$, the APT attacker cannot do better than selecting $\mathbf{g}^*$ to attack the storage devices. By definition, we have

$U(\mathbf{f}, \mathbf{g}^*) \le U(\mathbf{f}^*, \mathbf{g}^*), \quad \forall \mathbf{f} \in \Omega_{\mathcal{X}}$   (10)

$\hat{U}(\mathbf{f}^*, \mathbf{g}) \le \hat{U}(\mathbf{f}^*, \mathbf{g}^*), \quad \forall \mathbf{g} \in \Omega_{\mathcal{Y}}.$   (11)
We first consider a CBG-based CPU allocation game with symmetric CPU resources, $\hat{C} = C$, i.e., the defender and the attacker have the same amount of computational resources. Let $\mathbf{1}$ (or $\mathbf{0}$) be an all-1 (or all-0) matrix, $\lfloor \cdot \rfloor$ be the floor function, and $\bar{C} = C / D^{(k)}$ be the normalized number of defense CPUs.
Theorem 1.
If $\hat{C} = C$ and $2 D_i^{(k)} \le D^{(k)}$, $\forall 1 \le i \le N$, the CPU allocation game has an NE under which the number of CPUs that each player allocates to device $i$ is uniformly distributed, i.e.,

$x_i^*, \, y_i^* \sim \mathcal{U}\big[0, \ 2 \bar{C} D_i^{(k)}\big], \quad 1 \le i \le N.$   (12)
Proof.
The CPU allocation game can be formulated as a CBG with symmetric players on $N$ battlefields. The resource budget of the defender is $C$, the value of the $i$-th battlefield is $D_i$, and the total value of the battlefields is $D$. Let $\mathcal{U}[a, b]$ denote the uniform distribution between $a$ and $b$. By Proposition 1 in [32], the mixed-strategy CBG has an NE in which each player's allocation to the $i$-th battlefield is uniformly distributed between $0$ and $2 C D_i / D$, i.e.,

$\Pr\big(x_i^* \le z\big) = \Pr\big(y_i^* \le z\big) = \frac{D z}{2 C D_i}, \quad 0 \le z \le \frac{2 C D_i}{D}.$   (13)

Thus the game has an NE in which each element of the CPU allocation vector of either player has the probability density

$\frac{D}{2 C D_i} \ \text{on} \ \Big[0, \ \frac{2 C D_i}{D}\Big], \quad 1 \le i \le N,$   (14)

which results in (12). ∎
Corollary 1.
At the NE of the symmetric CPU allocation game given by (12), the expected data protection level and the expected utility of the defender are both zero.
Proof.
Remark: If the APT attacker and the defender have the same number of CPUs and no storage device dominates the game (i.e., $2 D_i \le D$, $\forall 1 \le i \le N$), both players choose the number of CPUs for storage device $i$ uniformly at random from $[0, 2 C D_i / D]$ by (14). The data protection level by definition ranges between $-1$ and $1$, and the game ends in a tie on average, yielding a zero expected data protection level and a zero expected utility for the defender.
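This tie can be checked numerically. The sketch below draws each device's allocation independently and uniformly from [0, 2C·D_i/D] for both players (sampling the marginals independently and ignoring the joint budget coupling, which does not affect the per-device win probabilities) and estimates the expected protection level:

```python
import random

def mc_symmetric_value(d, C, trials=5000, seed=1):
    """Monte Carlo estimate of the expected protection level when both players
    draw each device's allocation uniformly from [0, 2*C*D_i/D]."""
    rng = random.Random(seed)
    D = sum(d)
    sign = lambda z: (z > 0) - (z < 0)
    total = 0.0
    for _ in range(trials):
        x = [rng.uniform(0, 2 * C * di / D) for di in d]
        y = [rng.uniform(0, 2 * C * di / D) for di in d]
        total += sum(di * sign(a - b) for a, b, di in zip(x, y, d)) / D
    return total / trials

print(mc_symmetric_value([3, 2, 1], C=6))  # close to the theoretical value of 0
```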
We next consider a CBG-based CPU allocation game with asymmetric players, in which the attacker and the defender have different numbers of CPUs and compete over $N$ storage devices with equal data sizes, i.e., $D_i = D / N$, $\forall 1 \le i \le N$.
Theorem 2.
If $D_i = D / N$, $\forall 1 \le i \le N$, and $2 / N \le \hat{C} / C \le 1$, the NE of the CPU allocation game is given by

$\Pr\big(x_i^* \le z\big) = \frac{N z}{2 C}, \quad 0 \le z \le \frac{2 C}{N}$   (17)

$\Pr\big(y_i^* = 0\big) = 1 - \frac{\hat{C}}{C}$   (18)

$\Pr\big(y_i^* \le z\big) = 1 - \frac{\hat{C}}{C} + \frac{N \hat{C} z}{2 C^2}, \quad 0 < z \le \frac{2 C}{N}.$   (19)
Proof.
The CPU allocation game can be formulated as a CBG with asymmetric players on $N$ battlefields, where the defender (or attacker) chooses the probability distribution of $\mathbf{x}$ (or $\mathbf{y}$) according to the resource budget $C$ (or $\hat{C}$), and the resources allocated to the $i$-th battlefield are $x_i$ (or $y_i$). By Theorem 2 in [13], the unique NE for the defender and the attacker with $2 / N \le \hat{C} / C \le 1$ is given by the per-battlefield marginal distributions

$F_{x_i}(z) = \frac{N z}{2 C}, \quad 0 \le z \le \frac{2 C}{N}$   (20)

$F_{y_i}(z) = 1 - \frac{\hat{C}}{C} + \frac{N \hat{C} z}{2 C^2}, \quad 0 \le z \le \frac{2 C}{N}.$   (21)

Therefore, the defender's CPU allocation to the $i$-th storage device at the NE is uniformly distributed on $[0, 2C/N]$, i.e.,

$x_i^* \sim \mathcal{U}\Big[0, \ \frac{2 C}{N}\Big],$   (22)

while the attacker allocates zero CPUs to the device with probability $1 - \hat{C}/C$ and otherwise draws the allocation uniformly from the same interval, i.e.,

$y_i^* \sim \begin{cases} 0, & \text{with probability } 1 - \hat{C}/C \\ \mathcal{U}\big[0, \ 2C/N\big], & \text{otherwise.} \end{cases}$   (23)

Thus, we have (17)-(19). ∎
Corollary 2.
At the NE of the asymmetric CPU allocation game, the expected data protection level is $1 - \hat{C} / C$, and the expected utility of the defender is

$\mathbb{E}\big[u\big] = D \Big( 1 - \frac{\hat{C}}{C} \Big).$   (26)
Proof.
Remark: The defender has to have more CPU resources than the APT attacker; otherwise the cloud storage system is unlikely to protect the data privacy. With more CPUs, a subset of the storage devices is safe from the attacker, who can at best match the defender on the other storage devices. In this case, the defender wins the game, and the utility increases with the total data size. The expected data protection level increases with the resource advantage of the defender over the attacker, i.e., with $1 - \hat{C}/C$.
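A quick Monte Carlo check of this remark, under assumed per-device marginals for the asymmetric case (defender uniform on [0, 2C/N]; attacker at zero with probability 1 - Ĉ/C, otherwise uniform on the same interval; equal-value devices), recovers an expected protection level of about 1 - Ĉ/C:

```python
import random

def mc_asymmetric_value(n, C, C_hat, trials=5000, seed=7):
    """Expected protection level under the assumed asymmetric marginals:
    defender uniform on [0, 2C/n]; attacker 0 w.p. 1 - C_hat/C, else uniform."""
    rng = random.Random(seed)
    hi = 2 * C / n
    sign = lambda z: (z > 0) - (z < 0)
    total = 0.0
    for _ in range(trials):
        x = [rng.uniform(0, hi) for _ in range(n)]
        y = [0.0 if rng.random() < 1 - C_hat / C else rng.uniform(0, hi)
             for _ in range(n)]
        total += sum(sign(a - b) for a, b in zip(x, y)) / n
    return total / trials

print(mc_asymmetric_value(10, C=16, C_hat=12))  # close to 1 - 12/16 = 0.25
```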
The APT defense performance of the CPU allocation game at the NE is presented in Fig. 2, in which the 20 storage devices are threatened by an APT attacker. If the defender uses 1200 CPUs instead of 600 CPUs to scan the 20 devices, the data protection level increases by about 10.5% to 93%, and the utility of the defender increases by 18.75%. The data protection level of the cloud storage system protected by 1200 CPUs decreases only slightly, by 2.8%, if the number of storage devices changes from 20 to 80. The APT defense performance of the CBG at the NE provides the optimal defense performance under a known APT attack model and defense model, and can be used as a guideline to design the CPU allocation scheme.
V Hotbooting PHC-Based CPU Allocation
As a defender is usually unaware of the attack policy, we propose a hotbooting PHC-based CPU allocation scheme to scan the storage devices in the dynamic APT detection game, as illustrated in Fig. 3. At each time slot, the defender of the cloud storage system observes the amount of data stored on each storage device, which is quantized into a finite number of levels. In addition, the defender evaluates the compromised storage devices that were found to be attacked by APTs in the last time slot, and uses them to estimate the last attack CPU allocation $\mathbf{y}^{(k-1)}$. The defense CPU allocation $\mathbf{M}^{(k)}$ is chosen according to the current state, denoted by $\mathbf{s}^{(k)}$, which consists of the current data sizes and the previous attack CPU allocation, i.e., $\mathbf{s}^{(k)} = \big[\mathbf{D}^{(k)}, \mathbf{y}^{(k-1)}\big]$. The resulting defense strategy satisfies $\mathbf{M}^{(k)} \in \mathcal{X}$, where $\mathcal{X}$ is the defense action set given by (2).
The Q-function for each state-action pair, denoted by $Q(\mathbf{s}, \mathbf{M})$, is the expected discounted long-term reward of the defender and is updated in each time slot according to the iterative Bellman equation as follows:

$Q\big(\mathbf{s}^{(k)}, \mathbf{M}^{(k)}\big) \leftarrow (1 - \alpha) Q\big(\mathbf{s}^{(k)}, \mathbf{M}^{(k)}\big) + \alpha \Big( u^{(k)} + \gamma V\big(\mathbf{s}^{(k+1)}\big) \Big),$   (29)

where the learning rate $\alpha \in (0, 1]$ is the weight of the current experience, the discount factor $\gamma \in [0, 1)$ indicates the uncertainty of the defender regarding the future reward, $\mathbf{s}^{(k+1)}$ is the next state if the defender uses $\mathbf{M}^{(k)}$ at state $\mathbf{s}^{(k)}$, and the value function $V(\mathbf{s})$ maximizes the Q-function over the action set, given by

$V(\mathbf{s}) = \max_{\mathbf{M} \in \mathcal{X}} Q(\mathbf{s}, \mathbf{M}).$   (30)
The mixed-strategy table of the defender, denoted by $\pi(\mathbf{s}, \mathbf{M})$, provides the distribution of the CPU allocation $\mathbf{M}$ over the storage devices under state $\mathbf{s}$ and is updated via

$\pi\big(\mathbf{s}, \mathbf{M}\big) \leftarrow \pi\big(\mathbf{s}, \mathbf{M}\big) + \begin{cases} \delta, & \mathbf{M} = \arg\max_{\mathbf{M}'} Q\big(\mathbf{s}, \mathbf{M}'\big) \\ -\frac{\delta}{|\mathcal{X}| - 1}, & \text{otherwise.} \end{cases}$   (31)

In this way, the probability of the action that maximizes the Q-function increases by $\delta$, with $0 < \delta < 1$, and the probabilities of the other actions decrease by $\delta / (|\mathcal{X}| - 1)$. The defender then selects the number of CPUs according to the mixed strategy $\pi$, i.e.,

$\Pr\big(\mathbf{M}^{(k)} = \mathbf{M}\big) = \pi\big(\mathbf{s}^{(k)}, \mathbf{M}\big).$   (32)
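One PHC step, i.e., a Bellman update of the Q-function followed by a δ-nudge of the mixed strategy toward the greedy action, can be sketched as follows. The states, actions, learning parameters, and the clipping/renormalization step are illustrative choices, not the paper's implementation:

```python
def phc_update(Q, pi, s, a, u, s_next, actions, alpha=0.5, gamma=0.9, delta=0.1):
    """One PHC step: Bellman update of Q, then nudge the mixed strategy pi
    toward the greedy action by delta (others lose delta/(|A|-1))."""
    V_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (u + gamma * V_next)
    greedy = max(actions, key=lambda b: Q[(s, b)])
    for b in actions:
        step = delta if b == greedy else -delta / (len(actions) - 1)
        pi[(s, b)] = min(1.0, max(0.0, pi[(s, b)] + step))
    # renormalize so pi(s, .) stays a distribution after clipping
    z = sum(pi[(s, b)] for b in actions)
    for b in actions:
        pi[(s, b)] /= z

actions = [0, 1]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
pi = {(s, a): 0.5 for s in [0, 1] for a in actions}
phc_update(Q, pi, s=0, a=1, u=1.0, s_next=1, actions=actions)
print(Q[(0, 1)], pi[(0, 1)])  # 0.5 0.6 -- action 1 is reinforced
```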
We apply the hotbooting technique to initialize both the Q-function and the strategy table with CPU allocation experiences gathered in similar environments. The hotbooting PHC-based CPU allocation saves random exploration at the beginning of the dynamic game and thus accelerates the learning speed. As shown in Algorithm 2, a number of CPU allocation experiments are performed before the game. Each experiment lasts a given number of time slots, in which the defender chooses the number of CPUs to scan the storage devices according to the mixed-strategy table $\pi$. The defender observes the attack CPU distribution and evaluates the utility $u$. Both the Q-function and $\pi$ are updated via (29)-(31) at each time slot of the experiments.
The Q-values output by the hotbooting process based on the emulated experiences are used to initialize the Q-function in Algorithm 1. Similarly, the mixed-strategy table output by Algorithm 2 is used to initialize $\pi$ in Algorithm 1. The learning time of Algorithm 1 increases with the dimension of the state-action space, which grows with the number of storage devices in the cloud storage system and the number of CPUs, yielding serious performance degradation.
VI Hotbooting DQN-Based CPU Allocation
In this section, we propose a hotbooting DQN-based CPU allocation scheme to improve the APT defense performance of the cloud storage system. This scheme applies a deep convolutional neural network (CNN), a deep reinforcement learning technique, to compress the state-action space and thus accelerate the learning process. As illustrated in Fig. 4, the CNN is a nonlinear approximator of the Q-value of each action. The CNN architecture allows a compact storage of the learned information across similar states [33].
The DQN-based CPU allocation, as summarized in Algorithm 3, extends the system state $\mathbf{s}^{(k)}$ of Algorithm 1 to the experience sequence at time $k$, denoted by $\varphi^{(k)}$, to accelerate the learning speed and improve the APT resistance. More specifically, the experience sequence consists of the current system state and the $W$ previous state-action pairs, i.e., $\varphi^{(k)} = \big[\mathbf{s}^{(k-W)}, \mathbf{M}^{(k-W)}, \ldots, \mathbf{s}^{(k-1)}, \mathbf{M}^{(k-1)}, \mathbf{s}^{(k)}\big]$.
The experience sequence $\varphi^{(k)}$ is reshaped into a matrix and then input to the CNN, as shown in Fig. 4. The CNN consists of two convolutional (Conv) layers and two fully connected (FC) layers, with parameters chosen to achieve good performance according to the experimental results, as listed in Table II. For simplicity, the filter weights of the four layers at time $k$ are jointly denoted by $\theta^{(k)}$. The first Conv layer includes 20 different filters, each applied with stride 1, and its 20 output feature maps are passed through rectified linear units (ReLUs) as the activation function. The second Conv layer includes 40 different filters, each applied with stride 1. The outputs of the second Conv layer are 40 different feature maps, which are flattened into a 360-dimensional vector and then sent to the two FC layers. The first FC layer involves 180 rectified linear units, and the second FC layer provides the Q-value of each CPU allocation policy at the current experience sequence $\varphi^{(k)}$.

Layer       | Conv 1 | Conv 2 | FC 1 | FC 2
Input       |        |        | 360  | 180
Filter size |        |        | /    | /
Stride      | 1      | 1      | /    | /
# Filters   | 20     | 40     | 180  |
Activation  | ReLU   | ReLU   | ReLU | ReLU
Output      |        |        | 180  |
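As a consistency check on Table II, the flattened 360-dimensional vector implies that each of the 40 second-layer feature maps holds 360/40 = 9 elements, i.e., a 3x3 map. The helper below computes the spatial size of a valid (no-padding) convolution; the input and filter sizes used in the example are assumptions, since they are not given above:

```python
def conv_out(size, filt, stride=1):
    """Spatial output size of a valid (no-padding) convolution."""
    assert (size - filt) % stride == 0, "filter does not tile the input evenly"
    return (size - filt) // stride + 1

# 40 feature maps flattened to 360 values means 360/40 = 9 elements per map,
# i.e. a 3x3 map after the second Conv layer.
print(360 // 40)  # 9
print(conv_out(conv_out(7, 3), 3))  # 3 -- e.g. a 7x7 input with two 3x3 convs
```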
The Q-function, as the expected long-term reward for the state sequence $\varphi^{(k)}$ and the action $\mathbf{M}$, is given by definition as

$Q\big(\varphi^{(k)}, \mathbf{M}\big) = \mathbb{E}\Big[ u^{(k)} + \gamma \max_{\mathbf{M}' \in \mathcal{X}} Q\big(\varphi^{(k+1)}, \mathbf{M}'\big) \Big],$   (33)

where $\varphi^{(k+1)}$ is the next state sequence after choosing the defense CPU allocation $\mathbf{M}$ at $\varphi^{(k)}$.
To make a tradeoff between exploitation and exploration, the defense CPU allocation is chosen according to the $\varepsilon$-greedy policy [34]. More specifically, the CPU allocation that maximizes the Q-function is chosen with a high probability $1 - \varepsilon$, and the other actions are each selected with a low probability to avoid staying in a local maximum, i.e.,

$\Pr\big(\mathbf{M}^{(k)} = \mathbf{M}\big) = \begin{cases} 1 - \varepsilon, & \mathbf{M} = \arg\max_{\mathbf{M}'} Q\big(\varphi^{(k)}, \mathbf{M}'\big) \\ \frac{\varepsilon}{|\mathcal{X}| - 1}, & \text{otherwise.} \end{cases}$   (34)
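The selection rule in (34) can be sketched as follows; the Q-values and the value of ε are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick the argmax with prob 1 - epsilon; otherwise one of the remaining
    actions uniformly at random (prob epsilon/(|A|-1) each)."""
    greedy = max(range(len(q_values)), key=q_values.__getitem__)
    if rng.random() < 1 - epsilon or len(q_values) == 1:
        return greedy
    others = [a for a in range(len(q_values)) if a != greedy]
    return rng.choice(others)

rng = random.Random(0)
picks = [epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.2, rng=rng) for _ in range(1000)]
print(picks.count(1) / 1000)  # close to 1 - epsilon = 0.8
```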
Based on the experience replay shown in Fig. 4, the CPU allocation experience at time $k$, denoted by $e^{(k)}$, is given by $e^{(k)} = \big[\varphi^{(k)}, \mathbf{M}^{(k)}, u^{(k)}, \varphi^{(k+1)}\big]$ and saved in the replay memory pool $\mathcal{E} = \big\{e^{(1)}, \ldots, e^{(k)}\big\}$. An experience $e^{(j)}$ is chosen from the memory pool at random, with $1 \le j \le k$. The CNN parameters $\theta^{(k)}$ are updated by the stochastic gradient descent (SGD) algorithm, in which the mean-squared error between the network's output and the target optimal Q-value is minimized with minibatch updates. The loss function in the SGD algorithm, denoted by $L\big(\theta^{(k)}\big)$, is chosen as

$L\big(\theta^{(k)}\big) = \mathbb{E}\Big[ \big( R - Q\big(\varphi^{(j)}, \mathbf{M}^{(j)}; \theta^{(k)}\big) \big)^2 \Big],$   (35)

where the target value, denoted by $R$, approximates the optimal Q-value based on the previous CNN parameters $\theta^{(k-1)}$ and is given by

$R = u^{(j)} + \gamma \max_{\mathbf{M}'} Q\big(\varphi^{(j+1)}, \mathbf{M}'; \theta^{(k-1)}\big).$   (36)

The gradient of the loss function with respect to the weights $\theta$ is then given by

$\nabla_{\theta} L = \mathbb{E}\Big[ \big( R - Q\big(\varphi^{(j)}, \mathbf{M}^{(j)}; \theta\big) \big) \nabla_{\theta} Q\big(\varphi^{(j)}, \mathbf{M}^{(j)}; \theta\big) \Big].$   (37)
This process repeats a number of times to update $\theta$ in Algorithm 3.
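The target computation and one SGD step can be illustrated with a deliberately tiny linear stand-in for the CNN; the frozen network, the scalar feature, and the learning rate are all hypothetical:

```python
def td_target(u, phi_next, q_frozen, actions, gamma=0.9):
    """Target value R = u + gamma * max_a Q(phi', a) under the frozen network."""
    return u + gamma * max(q_frozen(phi_next, a) for a in actions)

def sgd_step(theta, phi, a, R, lr=0.1):
    """One SGD step on the squared TD error for a linear Q(phi, a) = theta[a]*phi
    (a deliberately tiny stand-in for the CNN)."""
    q = theta[a] * phi
    theta[a] += lr * (R - q) * phi  # descend the gradient of (R - q)^2 / 2
    return theta

q_frozen = lambda phi, a: 0.0       # frozen network: all-zero at the start
R = td_target(u=1.0, phi_next=0.5, q_frozen=q_frozen, actions=[0, 1])
theta = sgd_step([0.0, 0.0], phi=2.0, a=0, R=R)
print(R, theta)  # 1.0 [0.2, 0.0]
```

In the actual scheme, the gradient of (37) flows through all four CNN layers rather than through a single linear weight.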
Similar to Algorithm 1, we apply the hotbooting technique to initialize the CNN parameters of the DQN-based CPU allocation, rather than initializing them randomly, to accelerate the learning speed. As shown in Algorithm 4, the defender stores the emulated experiences in a database, and the resulting CNN parameters are used to initialize $\theta$ in Algorithm 3.
VII Simulation Results
Simulations have been performed to evaluate the APT defense performance of the CPU allocation schemes in a cloud storage system, with the CNN parameters listed in Table II. In the simulations, some APT attackers applied a greedy algorithm to choose the number of CPUs to attack each storage device based on the defense history, and some smarter attackers first induced the defender to use a specific "optimal" defense strategy based on the estimated defense learning algorithm and then attacked the system accordingly. Unless specified otherwise, the learning rate, discount factor, and exploration parameters were set to achieve good security performance according to experiments not presented in this paper.
In the first simulation, the defender resisted the attacker over 10 storage devices, each with normalized data size. As shown in Fig. 5, the hotbooting DQN-based CPU allocation scheme achieves the optimal policy in the dynamic APT defense game after convergence, which matches the theoretical NE given by Theorem 2. For example, both the data protection level and the utility of the defender almost converge to the NE performance given by Corollary 2. The hotbooting DQN-based CPU allocation scheme outperforms hotbooting PHC with a faster learning speed, a higher data protection level and a higher utility, and the latter in turn exceeds both PHC and Q-learning. For instance, the data protection level of the hotbooting DQN-based scheme is 14.92% higher than that of the PHC-based scheme at time slot 1000, and 30.51% higher than that of the Q-learning based scheme. As a result, the hotbooting DQN-based scheme has a 14.92% higher utility than the PHC-based strategy at time slot 1000, and a 30.51% higher utility than the Q-learning based strategy. The hotbooting DQN-based algorithm, an extension of Q-learning, compresses the learning state space with the CNN to accelerate the learning process and enhance the security performance of the cloud storage system. If the interaction time is long enough, the hotbooting PHC and Q-learning schemes also converge to the NE given by Theorem 2. The PHC-based scheme has lower computational complexity than DQN; for example, the PHC-based strategy takes less than 4% of the time of the DQN-based scheme to choose the CPU allocation in a time slot.
In the second simulation, the size of the data stored on each storage device of the cloud storage system changed every 1000 time slots: the total data size increased by a factor of 1.167 at the 1000th time slot and by a further factor of 1.143 at the 2000th time slot. The cloud storage system scanned the storage devices while the APT attacker attacked them, and the attack policy also changed every 1000 time slots. The APT attacker estimated the "optimal" defense CPU allocation induced by the learning algorithm and launched attacks specifically against the estimated defense strategy at time slots 1000 and 2000 to steal data from the cloud storage system. As shown in Fig. 6, the hotbooting DQN-based CPU allocation is more robust against smart APTs in the time-variant cloud storage system. For example, the data protection level of the hotbooting DQN-based scheme is 30.98% higher than that of the PHC-based scheme at time slot 1000, and 97.87% higher than that of the Q-learning based scheme. As a result, the hotbooting DQN-based scheme has a 30.69% higher utility than the PHC-based strategy at time slot 1000, and a 96.97% higher utility than the Q-learning based strategy.
As shown in Fig. 7, both the data protection level and the utility increase with the number of defense CPUs. For instance, if the number of defense CPUs changes from 12 to 16, the data protection level and the utility of the defender with hotbooting DQN-based APT defense increase by 14.20% and 14.03%, respectively. In this dynamic game, the data protection level of the hotbooting DQN-based scheme is 15.85% higher than that of PHC and 21.62% higher than that of Q-learning, and the utility of the hotbooting DQN-based CPU allocation scheme is 15.04% higher than that of PHC and 21.64% higher than that of Q-learning.
As shown in Fig. 8, the APT defense performance slightly decreases with the number of storage devices in the cloud storage system. However, the hotbooting DQN-based scheme still maintains a high data protection level if 4 storage devices are protected by 21 CPUs and attacked by 4 CPUs. In another example, the hotbooting DQN-based scheme protects up to 80.05% of the data stored on 6 storage devices in the cloud. A defender with fewer CPUs has to distribute its resources among all the storage devices to resist APTs, as at least one CPU has to scan each storage device. The performance gain of the hotbooting DQN-based CPU allocation scheme over the hotbooting PHC-based scheme increases with the number of storage devices in the cloud system.
VIII Conclusion
In this paper, we have formulated a CBG-based CPU allocation game for the APT defense of cloud storage and cyber systems and provided the NEs of the game to show how the number of storage devices, the data sizes on the storage devices and the total number of CPUs impact the data protection level of the cloud storage system and the defender's utility. A hotbooting DQN-based CPU allocation strategy has been proposed for the defender to scan the storage devices without being aware of the attack model or the data storage model in the dynamic game. The proposed scheme improves the data protection level with a faster learning speed and is more robust against smart APT attackers that choose the attack policy based on the estimated defense learning scheme. For instance, the data protection level of the cloud storage system and the utility of the defender increase by 22.29% and 22.4%, respectively, compared with the Q-learning based scheme in a cloud storage system with 4 storage devices and 16 defense CPUs against an APT attacker with 4 CPUs. A hotbooting PHC-based CPU allocation scheme is also provided to reduce the computational complexity.
References
 [1] P. Giura and W. Wang, “Using large scale distributed computing to unveil advanced persistent threats,” Science, vol. 1, no. 3, pp. 93–105, 2013.

 [2] Y. Han, T. Alpcan, J. Chan, C. Leckie, and B. I. Rubinstein, “A game theoretical approach to defend against co-resident attacks in cloud computing: Preventing co-residence using semi-supervised learning,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 3, pp. 556–570, Dec. 2016.
 [3] Q. Wang, C. Wang, K. Ren, W. Lou, and J. Li, “Enabling public auditability and data dynamics for storage security in cloud computing,” IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 5, pp. 847–859, May 2011.
 [4] J. Vukalović and D. Delija, “Advanced persistent threats – detection and defense,” in Proc. IEEE Inf. and Commun. Technol., Electron. and Microelectronics, pp. 1324–1330, Opatija, Croatia, May 2015.
 [5] C. Tankard, “Advanced persistent threats and how to monitor and deter them,” Network Security, vol. 8, no. 8, pp. 16–19, Aug. 2011.
 [6] R. B. Sagi, “The Economic Impact of Advanced Persistent Threats,” IBM Research Intelligence, May 2014, http://www01.ibm.com.
 [7] M. Van Dijk, A. Juels, A. Oprea, and R. L. Rivest, “FlipIt: The game of ‘stealthy takeover’,” J. Cryptol., vol. 26, no. 4, pp. 655–713, Oct. 2013.
 [8] M. Zhang, Z. Zheng, and N. B. Shroff, “Stealthy attacks and observable defenses: A game theoretic model under strict resource constraints,” in Proc. IEEE Global Conf. Signal and Inf. Processing (GlobalSIP), pp. 813–817, Atlanta, GA, Dec. 2014.
 [9] L. Xiao, D. Xu, Y. Li, N. B. Mandayam, and H. V. Poor, “Cloud storage defense against advanced persistent threats: A prospect theoretic study,” IEEE J. Sel. Areas Commun., vol. 35, no. 3, pp. 534–544, Mar. 2017.
 [10] L. Xiao, D. Xu, N. B. Mandayam, and H. V. Poor, “Cumulative prospect theoretic study of a cloud storage defense game against advanced persistent threats,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), BigSecurity Workshop, pp. 1–6, Atlanta, GA, May 2017.

 [11] M. H. Manshaei, Q. Zhu, T. Alpcan, T. Başar, and J.-P. Hubaux, “Game theory meets network security and privacy,” ACM Comput. Surv., vol. 45, no. 3, pp. 25:1–25:39, Jul. 2013.
 [12] M. Zhang, Z. Zheng, and N. B. Shroff, “A game theoretic model for defending against stealthy attacks with limited resources,” in Proc. Int. Conf. Decision and Game Theory for Security, pp. 93–112, London, UK, Nov. 2015.
 [13] B. Roberson, “The Colonel Blotto game,” Econ. Theory, vol. 29, no. 1, pp. 1–24, Sep. 2006.
 [14] M. Hajimirsadeghi, G. Sridharan, W. Saad, and N. B. Mandayam, “Inter-network dynamic spectrum allocation via a Colonel Blotto game,” in Proc. IEEE Annu. Conf. Inf. Sci. Syst. (CISS), pp. 252–257, Princeton, NJ, Mar. 2016.
 [15] M. Labib, S. Ha, W. Saad, and J. H. Reed, “A Colonel Blotto game for anti-jamming in the Internet of Things,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), pp. 1–6, San Diego, CA, Dec. 2015.
 [16] N. Namvar, W. Saad, N. Bahadori, and B. Kelley, “Jamming in the Internet of Things: A game-theoretic perspective,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), pp. 1–6, Washington, D.C., Dec. 2016.
 [17] M. Min, L. Xiao, C. Xie, M. Hajimirsadeghi, and N. B. Mandayam, “Defense against advanced persistent threats: A Colonel Blotto game approach,” in Proc. IEEE Int. Conf. Commun. (ICC), pp. 1–6, Paris, France, May 2017.
 [18] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in Proc. Int. Joint Conf. Artificial Intell., pp. 1021–1026, Seattle, WA, Aug. 2001.
 [19] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
 [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Jan. 2015.
 [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, et al., “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, Dec. 2013.
 [22] K. D. Bowers, M. Van Dijk, R. Griffin, A. Juels, A. Oprea, R. L. Rivest, and N. Triandopoulos, “Defending against the unknown enemy: Applying FlipIt to system security,” in GameSec, pp. 248–263, Springer, 2012.
 [23] A. Laszka, B. Johnson, and J. Grossklags, “Mitigating covert compromises,” in Int. Conf. Web and Internet Econ., pp. 319–332, Cambridge, MA, Dec. 2013.

 [24] Z. Zheng, N. B. Shroff, and P. Mohapatra, “When to reset your keys: Optimal timing of security updates via learning,” in Proc. AAAI Conf. Artificial Intelligence, pp. 1–7, San Francisco, CA, Feb. 2017.
 [25] J. Pawlick, S. Farhang, and Q. Zhu, “Flip the cloud: Cyber-physical signaling games in the presence of advanced persistent threats,” in Proc. Int. Conf. Decision and Game Theory for Security, pp. 289–308, London, UK, Nov. 2015.
 [26] A. Alabdel Abass, L. Xiao, N. B. Mandayam, and Z. Gajic, “Evolutionary game theoretic analysis of advanced persistent threats against cloud storage,” IEEE Access, vol. 5, pp. 8482–8491, Apr. 2017.
 [27] P. Hu, H. Li, H. Fu, D. Cansever, and P. Mohapatra, “Dynamic defense strategy against advanced persistent threat with insiders,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), pp. 747–755, Hong Kong, China, May 2015.
 [28] Y. Wu, B. Wang, K. R. Liu, and T. C. Clancy, “Anti-jamming games in multi-channel cognitive radio networks,” IEEE J. Sel. Areas Commun., vol. 30, no. 1, pp. 4–15, Jan. 2012.
 [29] P. Chia and J. Chuang, “Colonel Blotto in the phishing war,” Decision and Game Theory for Security, pp. 201–218, 2011.
 [30] B. Wang, Y. Wu, K. R. Liu, and T. C. Clancy, “An anti-jamming stochastic game for cognitive radio networks,” IEEE J. Sel. Areas Commun., vol. 29, no. 4, pp. 877–889, Apr. 2011.
 [31] G. Han, L. Xiao, and H. V. Poor, “Twodimensional antijamming communication based on deep reinforcement learning,” in Proc. IEEE Int. Conf. Acous., Speech, Signal Processing (ICASSP), pp. 1–5, New Orleans, LA, Mar. 2017.
 [32] C. Thomas, “N-dimensional Blotto game with heterogeneous battlefield values,” Econ. Theory, pp. 1–36, Jan. 2017.
 [33] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
 [34] E. Rodrigues Gomes and R. Kowalczyk, “Dynamic analysis of multiagent Q-learning with ε-greedy exploration,” in Proc. ACM Annual Int’l. Conf. Mach. Learn. (ICML), pp. 369–376, Montreal, Canada, Jun. 2009.