
Using Cyber Terrain in Reinforcement Learning for Penetration Testing

Reinforcement learning (RL) has been applied to attack graphs for penetration testing, however, trained agents do not reflect reality because the attack graphs lack operational nuances typically captured within the intelligence preparation of the battlefield (IPB) that include notions of (cyber) terrain. In particular, current practice constructs attack graphs exclusively using the Common Vulnerability Scoring System (CVSS) and its components. We present methods for constructing attack graphs using notions from IPB on cyber terrain analysis of obstacles, avenues of approach, key terrain, observation and fields of fire, and cover and concealment. We demonstrate our methods on an example where firewalls are treated as obstacles and represented in (1) the reward space and (2) the state dynamics. We show that terrain analysis can be used to bring realism to attack graphs for RL.




I Introduction

Prediction of vulnerabilities and exploits to support cyber defense (i.e., the blue picture) typically involves analyzing ingested data acquired from sensors and agents in various networks. Learning behaviors based on what has been seen, i.e., observed on the network, involves meticulous curation and processing of this data to support model development, training, testing, and validation. While this has provided some degree of results in the past, this work explores an alternative approach to identifying weaknesses within networks. The goal is to leverage the attack graph construct [16] and train machine learning models over attack graphs to predict weaknesses within network topologies.

Under this approach, instead of observing a static, curated data set, machine learning algorithms can learn by interacting with attack graphs directly. Reinforcement learning (RL) for penetration testing has shown this to be feasible given constraints on attack graph representation such as scale and observability. However, existing literature constructs attack graphs either with no vulnerability information [12, 27, 13, 4] or entirely with vulnerability information [36, 7, 15].

Yousefi et al., Chowdary et al., and Hu et al. use the Common Vulnerability Scoring System (CVSS) and its components to construct attack graphs [36, 7, 15], similar to Gallon and Bascou [11]. CVSS scores are an open, industry-standard means of scoring the severity of cybersecurity vulnerabilities, and they provide an empirical, automatic means of constructing attack graphs for RL. However, they do not always correlate with a useful contextual picture for cyber operators. By relying entirely on CVSS abstractions, network representations become biased toward vulnerabilities rather than toward a realistic view of how an adversary plans and executes an attack campaign. As a result, RL methods converge to unrealistic attack campaigns.

While CVSS scores provide a strong foundation for attack graphs, we posit that notions of cyber terrain [8] should be built into attack graph representations to enable RL agents to construct more realistic attack campaigns during penetration testing. In particular, we suggest a focus on the OAKOC terrain analysis framework that consists of obstacles, avenues of approach, key terrain, observation and fields of fire, and cover and concealment [8]. This work makes the following contributions:

  • We contribute methodology for building OAKOC cyber terrain into Markov decision process (MDP) models of attack graphs.

  • We apply our methodology to RL for penetration testing by treating firewalls as cyber terrain obstacles in an example network that is at least an order of magnitude larger than the networks used by previous authors [12, 27, 13, 4, 36, 7, 15].

In doing so we extend the literature on using CVSS scores to construct attack graphs and MDPs as well as the literature on RL for penetration testing.

The paper is structured as follows. First, background is given on terrain analysis and cyber terrain, reinforcement learning, and penetration testing. Second, our methods for constructing terrain-based attack graphs are presented. Then, results are presented before concluding with remarks on future steps.

II Background

II-A Terrain Analysis and Cyber Terrain

Fig. 1: Figure 1A shows a reinforcement learning agent taking actions in an environment and receiving states and rewards in return. Figure 1B shows a supervised learning agent learning from example-label pairs provided by an oracle. Figure 1C shows an environment under the attack tree model. Figure 1D shows an environment under the attack graph model.

Intelligence preparation of the battlefield (IPB) considers terrain a fundamental concept [23]. In the physical domain, terrain refers to land and its features. Conti and Raymond define cyber terrain as, “the systems, devices, protocols, data, software, processes, cyber personas, and other network entities that comprise, supervise, and control cyberspace [8].”

They note that cyber terrain exists at strategic, operational, and tactical levels, including, e.g., transatlantic cables and satellite constellations, telecommunications offices and regional data centers, and wireless spectrum and Local Area Network protocols, respectively. In this paper, we consider what Conti and Raymond refer to as the logical plane of cyber terrain, which consists of data link, network, network transport, session, presentation, and application, i.e., layers 2-7 of the Open Systems Interconnection model [37].

Terrain analysis typically follows the OAKOC framework, consisting of observation and fields of fire (O), avenues of approach (A), key terrain (K), obstacles and movement (O), and cover and concealment (C). These notions from traditional terrain analysis can be applied to cyber terrain [1]. For example, fields of fire may concern all that is network reachable (i.e., line of sight) and avenues of approach may consider network paths inclusive of available bandwidth [8]. In this paper, we use obstacles to demonstrate how our methodology can be used to bring the first part of the OAKOC framework to attack graph construction for reinforcement learning.

II-B Reinforcement Learning

Reinforcement learning is concerned with settings where agents learn from taking actions in and receiving rewards from an environment [31]. It can be contrasted with supervised learning, where agents learn from example-label pairs given by an oracle or labeling function. This contrast is depicted in Figures 1A and 1B. Naturally, reinforcement learning solution methods take a more dynamic formulation.

An agent is considered to interact with an environment over a discrete number of time-steps by selecting, at time-step $t$, an action $a_t$ from the set of actions $A$. In return, the environment returns to the agent a new state $s_{t+1}$ and reward $r_{t+1}$. Thus, the interaction between the agent and environment can be seen as a sequence $s_0, a_0, r_1, s_1, a_1, r_2, \dots$. When the agent reaches a terminal state, the process stops.

Here we consider the case where the environment is a finite MDP. A finite MDP is a tuple $(S, A, \mathcal{A}, P, R)$, where $S$ is a set of states, $A$ is a set of actions, $\mathcal{A} \subseteq S \times A$ is the set of admissible state-action pairs, $P : \mathcal{A} \times S \to [0, 1]$ is the transition probability function, and $R : \mathcal{A} \to \mathbb{R}$ is the expected reward function, where $\mathbb{R}$ is the set of real numbers. $P(s' \mid s, a)$ denotes the transition probability from state $s$ to state $s'$ under action $a$, and $R(s, a)$ denotes the expected reward from taking action $a$ in state $s$.

The goal of learning is to maximize future rewards. Using a discount factor $\gamma \in [0, 1]$, the expected value of the discounted sum of future rewards at time $t$ is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, that is, the sum of discounted rewards from time $t$ onward. The action value function $Q^\pi(s, a)$ is the expected return after taking action $a$ in state $s$ and then following policy $\pi$, where $\pi(a \mid s)$ maps $(s, a)$ to the probability of picking action $a$ in state $s$. The optimal action-value function is $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.
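The discounted return defined above can be sketched directly in a few lines; the reward sequence and discount factor below are illustrative values, not parameters from this paper.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of discounted future rewards: G_t = sum_k gamma^k * r_{t+k+1}."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# With three unit rewards and gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```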

The action value function can be represented by a function approximator. Herein, we use Deep Q-learning (DQN) to approximate $Q^*$ with a neural network $Q(s, a; \theta)$, where $\theta$ are the parameters of the neural network [18, 19]. DQN has seen broad success and is the basis for many deep RL variants [14, 32, 33].

The parameters $\theta$ are learned iteratively by minimizing a sequence of loss functions

$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right].$

This specific formulation is termed one-step Q-learning, because $s'$ is the state that succeeds $s$, but it can be relaxed to $n$-step Q-learning by considering rewards over a sequence of steps. Alternative solution methods to DQN include proximal policy optimization [26] and asynchronous advantage actor-critic (A3C) [17], both of which learn the policy directly.
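The one-step loss can be illustrated with a minimal sketch in which tabular Q-value arrays stand in for the neural network outputs $Q(\cdot;\theta_i)$ and $Q(\cdot;\theta_{i-1})$; all states, actions, and rewards below are illustrative.

```python
import numpy as np

def dqn_loss(q, q_target, s, a, r, s_next, gamma=0.9):
    """Squared one-step TD error: (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2."""
    td_target = r + gamma * np.max(q_target[s_next])
    return (td_target - q[s, a]) ** 2

q = np.zeros((3, 2))         # Q(s, a) for 3 states, 2 actions (current parameters)
q_target = np.zeros((3, 2))  # frozen parameters from the previous iteration
# All Q-values are zero, so the loss is simply r^2 = 1.0
print(dqn_loss(q, q_target, s=0, a=1, r=1.0, s_next=2))
```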

Paper: Network Description(s)
Ghanem and Chen [12]: 100-machine local area network
Schwartz and Kurniawati [27]: 50 machines with unknown services; 18 machines with 50 services
Ghanem and Chen [13]: 100-machine local area network
Chaudhary et al. [4]: Not reported
Yousefi et al. [36]: Attack graph with 44 vertices and 43 edges
Chowdary et al. [7]: Attack graph with 109 vertices (edges unknown) and a 300-host flat network
Hu et al. [15]: Attack graph with 44 vertices and 52 edges
Our network: Attack graph with 955 vertices and 2350 edges
TABLE I: Network Sizes.

II-C Penetration Testing

Penetration testing is defined by Denis et al. as, “a simulation of an attack to verify the security of a system or environment to be analyzed . . . through physical means utilizing hardware, or through social engineering [9].” They continue by emphasizing that penetration testing is not the same as port scanning. Specifically, if port scanning is looking through binoculars at a house to identify entry points, penetration testing is having someone actually break into the house.

Penetration testing is part of broader vulnerability detection and analysis, which typically combines penetration testing with static analysis [3, 28, 6]. Penetration testing models have historically taken the form of either the flaw hypothesis model [21, 34], the attack tree model [24, 25], or the attack graph model [16, 10, 22].

The flaw hypothesis model describes the general process of gathering information about a system, determining a list of hypothetical flaws, e.g., via domain expert brainstorming, sorting that list by priority, testing hypothesized flaws in order, and fixing those that are discovered. As McDermott notes, this model is general enough to describe almost all penetration testing [16]. The attack tree model adds a tree structure to the process of gathering information, generating hypotheses, etc., which allows for a standardization of manual penetration testing, and also gives a basis for automated penetration testing methods. The attack graph model adds a network structure, differing from the attack tree model in the richness of the topology and, accordingly, the amount of information needed to specify the model.

Automated penetration testing has become a part of practice [30], with the attack tree and attack graph models as its basis. In reinforcement learning, these models serve as the environment. They are depicted in Figures 1C and 1D, respectively. Both modeling approaches construct topologies of networks by treating machines (i.e., servers and network devices) as vertices and links between machines as edges between vertices. Variants integrate additional detail regarding sub-networks and services. In the case of attack trees, probabilities must be assigned to the branches between parent and child nodes; in the case of attack graphs, transition probabilities between states must be assigned to each edge. While many of the favorable properties of attack trees persist in attack graphs, it is unclear whether attack graphs can outperform attack trees in largely undocumented systems, i.e., systems with partial observability [16, 29].
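An attack graph environment of this kind can be sketched as an adjacency structure whose edges carry exploit success probabilities; the machine names and probabilities below are hypothetical, chosen only to illustrate the data structure.

```python
import random

# Hypothetical attack graph: vertices are machines, edges carry the
# probability that the corresponding exploit succeeds.
graph = {
    # state: [(next_state, success_probability), ...]
    "web_server": [("file_server", 0.8), ("workstation", 0.5)],
    "file_server": [("domain_controller", 0.3)],
    "workstation": [("domain_controller", 0.5)],
    "domain_controller": [],  # terminal state
}

def step(state, action_idx, rng=None):
    """Attempt one exploit; on failure the agent remains where it is."""
    rng = rng or random.Random()
    next_state, p = graph[state][action_idx]
    return next_state if rng.random() < p else state
```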

II-D Reinforcement Learning for Penetration Testing

Reinforcement learning in penetration testing is promising because it addresses many challenges. A single penetration testing tool has never been enough [2]. RL, however, can be the basis for many tools, such as analysis, bypassing security, and penetration, and can be applied to the various types of penetration testing, i.e., external testing, internal testing, blind testing, and double-blind testing [35]. The automation and generality of RL mean it can be deployed quickly, in the form of many variants with different policies, at many points in a network. And, as Chen et al. note [5], as future networks scale in the Internet of Things age, intelligent payload mutation and intelligent entry-point crawling, the kinds of tasks RL is well-suited for, will be necessary in penetration testing.

Reinforcement learning for penetration testing uses the attack graph model [12, 27, 13, 4, 36, 7, 15]. The environment is treated either as an MDP, mirroring classical planning, where actions are deterministic and the network structure and configuration are known, or as a partially observable Markov decision process (POMDP), where the outcomes of actions are stochastic and the network structure and configuration are uncertain.

While POMDPs are more realistic, they have not been shown to scale to large networks and require modeling many prior probability distributions [29]. Since full observability leads MDPs to underestimate attack cost, their main flaw is finding vulnerabilities that are unlikely to be found or exploited in practice. As such, penetration testing on MDPs gives a worst-case analysis, making it the risk-averse option in the sense that it tends toward false alarms. We use MDPs for our attack graph model because they scale and because our methodology for adding cyber terrain to MDP attack graphs can later be extended to POMDPs.

Unlike most previous work in RL for penetration testing [12, 27, 13, 4], but similar to Yousefi et al., Hu et al., and Chowdary et al. [36, 15, 7], we use vulnerability information to construct the MDP. In particular, we use the Common Vulnerability Scoring System (CVSS) [11]. Unlike those previous authors, however, we extend beyond vulnerability information by folding in notions of cyber terrain. Following the literature, we use DQN as the RL solution method [27, 7, 15]. However, we use a larger network than those in the literature, as reported in Table I.

III Methods

Fig. 2: The network is extracted into an attack graph using MulVal [20]. Terrain can be added via state or reward. To add via state, the attack graph is first modified to include more state information related to OAKOC [8], and the CVSS MDP is then constructed as usual with terrain-adjusted transition probabilities. To add via reward, the CVSS MDP is constructed as usual, followed by the inclusion of terrain-adjusted rewards. Each method leads to a terrain-adjusted CVSS MDP. Note, attack complexity is a component of CVSS.

RL-based penetration testing involves a three-step procedure of (1) extracting the network structure into an attack graph, (2) specifying an MDP (or POMDP) over the extracted attack graph, and (3) deploying RL on the MDP. The outcome of deploying RL can then be studied in various ways to express the penetration testing results.

The attack graph is extracted using MulVal, a framework that conducts multihost, multistage vulnerability analysis on a network representation using a reasoning engine [20]. The states of the MDP are given by the vertices of the attack graph, which can be components of the network, e.g., entries into a specific subnet or an intermediary file server, or can be means of traversal, e.g., the interaction rules between network components. That is, not all states are locations in the network. The actions of the MDP that are available in a particular state are given by the outbound edges from that state.

The transition probabilities and the rewards of the MDP are constructed using CVSS. The transition probabilities are assigned using the attack complexity associated with each exploit, which CVSS ranks as low, medium, or high, and which we translate into corresponding transition probabilities following Hu et al. [15]. The agent remains in its current state if the action fails. The reward for arriving at a new state is constructed as follows. A target node in the network is deemed the terminal state and given a large positive reward. An initial state is defined and given a small reward, and, using a depth-first search, reward is linearly scaled from the initial state to the terminal state. Lastly, a negative reward is assigned to actions which bring the agent to a state from which the terminal state is inaccessible without backtracking, or which otherwise lead to entering a sub-network from which the terminal state is not reachable.
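The two construction steps can be sketched as follows. The probability values attached to the attack complexity classes are placeholders (the paper follows Hu et al. [15] but the exact numbers are not reproduced here), and the reward range is likewise hypothetical.

```python
# Hypothetical mapping from CVSS attack complexity to transition probability.
ATTACK_COMPLEXITY_PROB = {"low": 0.9, "medium": 0.6, "high": 0.3}

def dfs_depths(graph, start):
    """Depth of each node from the initial state via depth-first search."""
    depths, stack = {start: 0}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in depths:
                depths[nxt] = depths[node] + 1
                stack.append(nxt)
    return depths

def linear_rewards(graph, start, terminal, r_max=100.0):
    """Linearly scale reward from the initial state (0) to the terminal state (r_max)."""
    depths = dfs_depths(graph, start)
    return {n: r_max * depths[n] / depths[terminal] for n in depths}
```

On a simple chain `a -> b -> c` with terminal `c`, this assigns rewards 0, 50, and 100, mirroring the linear scaling described above.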

We term this particular MDP the CVSS MDP. The RL agent is trained using DQN in an episodic fashion. Episodes terminate when the terminal state is reached or after taking a number of hops, i.e., actions, in the network. This formulation is similar to those of Yousefi et al., Hu et al., and Chowdary et al. [36, 15, 7]. This terrain-blind approach ignores the typical perspectives of attackers when traversing and navigating enterprise networks.

Our methodology for adding cyber terrain builds on this formulation. We propose to add terrain via state and via reward to resolve the formulation's shortcomings in realism. To add terrain information via state is to modify the state space and the transition probability function. First, additional information from MulVal and other sources must be included in the attack graph originally generated by MulVal. Then, this additional state information can be used to modify the transition probabilities. By using state, we represent terrain as an effect on the dynamics of the MDP. As such, it adds terrain by creating a more realistic model of the environment. To add terrain information via reward is to modify the reward function, i.e., reward engineering. Depending on the OAKOC phenomena, this means incrementing or decrementing the reward. (We explore these in future work.) By using reward, we introduce terrain not by directly bringing realism to the environment, but rather by incentivizing the agent to behave in a more realistic manner. These two processes are depicted in Figure 2. We term these terrain-adjusted CVSS MDPs.

III-A Firewalls as Obstacles

We now consider firewalls as cyber terrain obstacles. Conti and Raymond categorize obstacles as physical or virtual capabilities that filter, disrupt, or block traffic between networks using different methods [8]. For the purposes of this work, we consider firewalls as blocking obstacles and use the presented methodology to incorporate cyber terrain into a CVSS-based attack graph.

III-A1 Adding via Reward

We engineer the reward to incentivize realistic attack campaigns by introducing a penalty term, so that the reward received in a state after taking an action becomes the original reward decremented by that term. The penalty decrements the reward to incentivize avoiding firewalls. Its value depends on the blocked protocol and is scaled by a parameter that tunes the strength of the incentivization. That is, we vary the change in reward based on the security of the communication protocol. Note, when multiple protocols are blocked, their penalty values are averaged together.
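A minimal sketch of this reward adjustment follows. The per-protocol weights and the tuning parameter's value are assumptions for illustration; the paper's exact values are not reproduced here.

```python
# Hypothetical per-protocol penalty weights: less secure protocols
# (e.g., FTP) are penalized more heavily than secure ones (e.g., SSH).
PROTOCOL_PENALTY = {"ftp": 1.0, "telnet": 0.9, "ssh": 0.3}

def firewall_penalty(blocked_protocols, strength=5.0):
    """Average the per-protocol penalties, scaled by the incentivization strength."""
    if not blocked_protocols:
        return 0.0
    weights = [PROTOCOL_PENALTY[p] for p in blocked_protocols]
    return strength * sum(weights) / len(weights)

def adjusted_reward(base_reward, blocked_protocols, strength=5.0):
    """Decrement the reward so the agent learns to route around firewalls."""
    return base_reward - firewall_penalty(blocked_protocols, strength)
```

For example, with FTP and SSH both blocked, the penalty is the average of their weights times the strength parameter, so a base reward of 10 becomes 10 - 5.0 * (1.0 + 0.3) / 2 = 6.75.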

III-A2 Adding via State

Alternatively, we introduce realism by engineering the state transition probabilities. We use two scaling terms to adjust the state transition probabilities: one corresponds to firewall presence and the other to the importance of the firewall.

Recall, the transition probabilities are initialized using the low, medium, and high CVSS attack complexity classes. The presence term introduces an emphasis on avoiding firewalls, and the importance term counterbalances that emphasis for high-value targets. Note, when multiple protocols are blocked, their values are averaged together.
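The state-based adjustment can be sketched as below. The presence discount and the treatment of importance as a multiplicative counterweight are assumptions made for illustration, not the paper's exact formulation.

```python
def adjusted_transition_prob(base_prob, firewall_present, importance=0.0):
    """Scale a CVSS-derived transition probability by firewall terms.

    base_prob: probability initialized from the CVSS attack complexity class.
    firewall_present: whether a firewall sits on this edge.
    importance: counterweight for high-value targets (assumed in [0, 1]).
    """
    if not firewall_present:
        return base_prob
    presence_discount = 0.5  # assumed: a firewall halves the success probability
    p = base_prob * presence_discount * (1.0 + importance)
    return min(p, 1.0)       # keep the result a valid probability
```

With the assumed discount, a low-complexity edge (probability 0.9) behind an unimportant firewall drops to 0.45, while full importance (1.0) restores it to 0.9.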

IV Results

We now compare the performance of DQN across (1) the vanilla, terrain-blind CVSS MDP, (2) the reward-adjusted MDP, enhanced via the reward penalty, and (3) the state-adjusted MDP, enhanced via the adjusted state transition probabilities. We use a 122-host network whose attack graph has 955 vertices and 2350 edges. All presented results use a fixed value of the incentivization parameter. The top-line results are shown in Table II. The introduction of terrain increases the number of hops, as agents must now navigate around firewalls.

We can compare the reward as well. Note the reward functions are identical between the vanilla MDP and the state-adjusted MDP, but differ between the vanilla MDP and the reward-adjusted MDP. The reward-adjustment penalty decreases reward, and so we expect to see a lower total reward. Indeed, we see a decrease in reward despite the agent taking almost 30 more hops, confirming the reward has been decremented.

MDP: "Vanilla" | Via Reward | Via State
Total Number of Hops: 62 | 91 | 85
Total Reward: 221 | 179 | 237
TABLE II: Total number of hops and total reward with all protocols available.

Making a similar comparison between the vanilla MDP and the state-adjusted MDP, we see a greater total reward, as expected given the larger number of hops. Whereas the agent averages ≈3.6 units of reward per hop on the vanilla MDP, the agent averages ≈2.8 units of reward per hop on the state-adjusted MDP. Recalling that reward for approaching the terminal state is linearly scaled using a depth-first search from the initial state to the terminal state, the maintenance of a high average reward suggests the RL agent can still make steady progress toward the terminal state while accounting for obstacles.
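The per-hop averages follow directly from the totals in Table II:

```python
# Reward per hop = total reward / total number of hops (values from Table II).
vanilla_per_hop = 221 / 62  # vanilla MDP
state_per_hop = 237 / 85    # state-adjusted MDP
print(round(vanilla_per_hop, 1), round(state_per_hop, 1))  # 3.6 2.8
```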

A closer look at the results is shown in Figure 3. The plots show the average reward achieved by the DQN agent against the number of training episodes. The total reward was evaluated every 4 episodes and each episode had a maximum length of 2500 steps. The high average reward values achieved after 80 episodes signify that the agents spend a majority of their time close to the terminal state. The vanilla MDP is protocol agnostic. While Table II shows state-adjusted and reward-adjusted total reward when agents can choose between protocols, in Figure 3, the state-adjusted and reward-adjusted plots show average reward when the agent is restricted to a single choice of protocol. The plots show our method was able to represent that FTP is a more significant cyber obstacle than SSH.

Lastly, Figure 4 shows the paths derived from the approximated policies. The top figure shows the vanilla path, the middle figure shows the state-adjusted path, and the bottom figure shows the reward-adjusted path. The paths have been greatly reduced by focusing on key nodes. The red edges highlight differences in the path taken from the initial to the terminal state. At node 681, a firewall led the agents using the state-adjusted and reward-adjusted MDPs to seek an alternate path. Their paths diverge after node 136.

Fig. 3: Average reward plotted against training episode for each MDP. The middle and right plots show special cases where the state-adjusted and reward-adjusted MDPs were restricted to a single communication protocol.
Fig. 4: Visualization of attack campaigns.

V Conclusion

In this paper, we present methods for enhancing "vanilla", CVSS-based attack graphs using concepts of cyber terrain from intelligence preparation of the battlefield. Our method introduces cyber terrain by modifying the state transition probabilities and the reward function. Using an example attack graph with nearly 1000 nodes, we showed how our approach can be used to introduce cyber obstacles, particularly firewalls. We evaluated using DQN, and showed notable differences in total reward, number of hops, average reward, and attack campaigns.

The shift from manually constructed MDPs [12, 27, 13, 4] to CVSS-based MDPs [36, 7, 15] marks an emphasis on scaling the construction of attack-graph-based MDPs. Our methodology maintains an automated, scale-oriented approach to constructing MDPs, while introducing notions of cyber terrain that help ground RL agent behavior to reality.

Future work should consider how more elements of cyber terrain can be folded into MDP construction. In doing so, a primary consideration should be to continue to scale the size of attack graphs, using more hosts at an enterprise scale. This would help further validate the use of cyber-terrain IPB principles in creating realistic contexts for penetration testing. Also, methods should be developed that use multiple initial and terminal states to assist in attack surface cartography. In addition, the current literature considers RL agents that are trained and deployed on the same network. Notions of transfer learning, meta-learning, and lifelong learning are promising paths for generalizing penetration testing agents.


  • [1] S. D. Applegate, C. L. Carpenter, and D. C. West (2017) Searching for digital hilltops. Joint Force Quarterly 84 (1), pp. 18–23. Cited by: §II-A.
  • [2] A. Austin and L. Williams (2011) One technique is not enough: a comparison of vulnerability discovery techniques. In 2011 International Symposium on Empirical Software Engineering and Measurement, pp. 97–106. Cited by: §II-D.
  • [3] A. G. Bacudio, X. Yuan, B. B. Chu, and M. Jones (2011) An overview of penetration testing. International Journal of Network Security & Its Applications 3 (6), pp. 19. Cited by: §II-C.
  • [4] S. Chaudhary, A. O’Brien, and S. Xu (2020) Automated post-breach penetration testing through reinforcement learning. In 2020 IEEE Conference on Communications and Network Security (CNS), pp. 1–2. Cited by: 2nd item, §I, §II-D, §II-D, TABLE I, §V.
  • [5] C. Chen, Z. Zhang, S. Lee, and S. Shieh (2018) Penetration testing in the iot age. computer 51 (4), pp. 82–85. Cited by: §II-D.
  • [6] B. Chess and G. McGraw (2004) Static analysis for security. IEEE security & privacy 2 (6), pp. 76–79. Cited by: §II-C.
  • [7] A. Chowdary, D. Huang, J. S. Mahendran, D. Romo, Y. Deng, and A. Sabur (2020) Autonomous security analysis and penetration testing. Cited by: 2nd item, §I, §I, §II-D, §II-D, TABLE I, §III, §V.
  • [8] G. Conti and D. Raymond (2018) On cyber: towards an operational art for cyber conflict. Kopidion Press. Cited by: §I, §II-A, §II-A, Fig. 2, §III-A.
  • [9] M. Denis, C. Zena, and T. Hayajneh (2016) Penetration testing: concepts, attack methods, and defense strategies. In 2016 IEEE Long Island Systems, Applications and Technology Conference (LISAT), pp. 1–6. Cited by: §II-C.
  • [10] B. Duan, Y. Zhang, and D. Gu (2008) An easy-to-deploy penetration testing platform. In 2008 The 9th International Conference for Young Computer Scientists, pp. 2314–2318. Cited by: §II-C.
  • [11] L. Gallon and J. J. Bascou (2011) Using cvss in attack graphs. In 2011 Sixth International Conference on Availability, Reliability and Security, pp. 59–66. Cited by: §I, §II-D.
  • [12] M. C. Ghanem and T. M. Chen (2018) Reinforcement learning for intelligent penetration testing. In 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pp. 185–192. Cited by: 2nd item, §I, §II-D, §II-D, TABLE I, §V.
  • [13] M. C. Ghanem and T. M. Chen (2020) Reinforcement learning for efficient network penetration testing. Information 11 (1), pp. 6. Cited by: 2nd item, §I, §II-D, §II-D, TABLE I, §V.
  • [14] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. Cited by: §II-B.
  • [15] Z. Hu, R. Beuran, and Y. Tan (2020) Automated penetration testing using deep reinforcement learning. In 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 2–10. Cited by: 2nd item, §I, §I, §II-D, §II-D, TABLE I, §III, §III, §V.
  • [16] J. P. McDermott (2001) Attack net penetration testing. In Proceedings of the 2000 workshop on New security paradigms, pp. 15–21. Cited by: §I, §II-C, §II-C, §II-C.
  • [17] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §II-B.
  • [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §II-B.
  • [19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §II-B.
  • [20] X. Ou, S. Govindavajhala, and A. Appel (2005-07) MulVAL: a logic-based network security analyzer. pp. 8–8. Cited by: Fig. 2, §III.
  • [21] C. P. Pfleeger, S. L. Pfleeger, and M. F. Theofanos (1989) A methodology for penetration testing. Computers & Security 8 (7), pp. 613–620. Cited by: §II-C.
  • [22] H. Polad, R. Puzis, and B. Shapira (2017) Attack graph obfuscation. In International Conference on Cyber Security Cryptography and Machine Learning, pp. 269–287. Cited by: §II-C.
  • [23] T. C. Purcell (1989) Operational level intelligence: intelligence preparation of the battlefield. Technical report ARMY WAR COLL CARLISLE BARRACKS PA. Cited by: §II-A.
  • [24] C. Salter, O. S. Saydjari, B. Schneier, and J. Wallner (1998) Toward a secure system engineering methodolgy. In Proceedings of the 1998 workshop on New security paradigms, pp. 2–10. Cited by: §II-C.
  • [25] B. Schneier (1999) Attack trees. Dr. Dobb’s journal 24 (12), pp. 21–29. Cited by: §II-C.
  • [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §II-B.
  • [27] J. Schwartz and H. Kurniawati (2019) Autonomous penetration testing using reinforcement learning. arXiv preprint arXiv:1905.05965. Cited by: 2nd item, §I, §II-D, §II-D, TABLE I, §V.
  • [28] S. Shah and B. M. Mehtre (2015) An overview of vulnerability assessment and penetration testing techniques. Journal of Computer Virology and Hacking Techniques 11 (1), pp. 27–49. Cited by: §II-C.
  • [29] D. Shmaryahu, G. Shani, J. Hoffmann, and M. Steinmetz (2016) Constructing plan trees for simulated penetration testing. In The 26th international conference on automated planning and scheduling, Vol. 121. Cited by: §II-C, §II-D.
  • [30] Y. Stefinko, A. Piskozub, and R. Banakh (2016) Manual and automated penetration testing. benefits and drawbacks. modern tendency. In 2016 13th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET), pp. 488–491. Cited by: §II-C.
  • [31] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §II-B.
  • [32] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30. Cited by: §II-B.
  • [33] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas (2016) Dueling network architectures for deep reinforcement learning. In International conference on machine learning, pp. 1995–2003. Cited by: §II-B.
  • [34] C. Weissman (1995) Penetration testing. Information security: An integrated collection of essays 6, pp. 269–296. Cited by: §II-C.
  • [35] C. Weissman (1995) Security penetration testing guideline. Naval Research Laboratory, Unisys Government Systems 12010. Cited by: §II-D.
  • [36] M. Yousefi, N. Mtetwa, Y. Zhang, and H. Tianfield (2018) A reinforcement learning approach for attack graph analysis. In 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pp. 212–217. Cited by: 2nd item, §I, §I, §II-D, §II-D, TABLE I, §III, §V.
  • [37] H. Zimmermann (1980) OSI reference model-the iso model of architecture for open systems interconnection. IEEE Transactions on communications 28 (4), pp. 425–432. Cited by: §II-A.