Log In Sign Up

Incorporating Deception into CyberBattleSim for Autonomous Defense

by   Erich Walter, et al.
U.S. Navy

Deceptive elements, including honeypots and decoys, were incorporated into the Microsoft CyberBattleSim experimentation and research platform. The defensive capabilities of the deceptive elements were tested using reinforcement learning based attackers in the provided capture the flag environment. The attacker's progress was found to be dependent on the number and location of the deceptive elements. This is a promising step toward reproducibly testing attack and defense algorithms in a simulated enterprise network with deceptive defensive elements.


page 7

page 8


Secure (S)Hell: Introducing an SSH Deception Proxy Framework

Deceiving an attacker in the network security domain is a well establish...

Network Defense is Not a Game

Research seeks to apply Artificial Intelligence (AI) to scale and extend...

A Game-Theoretic Approach for AI-based Botnet Attack Defence

The new generation of botnets leverages Artificial Intelligent (AI) tech...

Informing Autonomous Deception Systems with Cyber Expert Performance Data

The performance of artificial intelligence (AI) algorithms in practice d...

Too Expensive to Attack: Enlarge the Attack Expense through Joint Defense at the Edge

The distributed denial of service (DDoS) attack is detrimental to busine...

A Model for Perimeter-Defense Problems with Heterogeneous Teams

We develop a model of the multi-agent perimeter-defense game to calculat...

1 Introduction

Cyber defenders are increasingly at an disadvantage against attackers in the unbalanced cyber domain. Skilled defensive operators are limited, their job is difficult and never ending, and it is nearly impossible to prove they have been successful at preventing or ejecting all attackers. Research that advances the ability for cyber defenses to automatically detect, respond, and remediate malicious activity quickly is critical.

Researchers profess

the need for publicly available cybersecurity datasets to further advance machine learning (ML) and artificial intelligence research in the field, but these static datasets are unlikely to help quickly advance autonomous decision making and response. To train an autonomous agent to learn to take optimal automated actions to impede and mitigate cyber attacks, a dynamic training environment that provides appropriate feedback based on the selected action is required. Microsoft has recently released such a simulated environment, called CyberBattleSim 

[30], based on the OpenAI gym interface [7]. CyberBattleSim “provides a high-level abstraction of computer networks and cyber security concepts”, like a physics engine for the cyber domain, allowing researchers to create scenarios and insert custom autonomous (offensive or defensive) agents for training and testing.

Cyber attackers often rely on human errors which cannot always be anticipated or corrected. These errors can be introduced through social engineering, spearphishing, an unenforced policy, or a software bug accidentally included by the software developer. Defenders can also take advantage of the fact that there is often a human behind a sophisticated cyber attack. Cyber deception considers the human aspects of an attacker in order to impede cyber attacks and improve security [17], which can also translate to advantages against automated attackers. Cyber deception aims to understand and influence an attacker even after they have already infiltrated a network, and ultimately to delay, deter, and disrupt their attack. While some ML methods for detection in cybersecurity are still working on improving true-positive/false-positive rates, cyber deception technologies can often naturally act as a high-confidence early warning mechanism. Cyber deception has been gaining traction in the cyber defense community, with many commercial vendors profiting from these techniques [9][25][8][10][5], however they are often omitted from autonomous defense frameworks, taxonomies, and simulations [11]. In this paper we investigate incorporating cyber deception into CyberBattleSim and provide initial results using it to explore several important research questions.

2 Related Work

We are primarily interested in determining the feasibility of incorporating cyber deception components into an autonomous cyber defense environment. Our goal is to quantify the effects of deception on a cyber attacker trained using reinforcement learning to achieve its goal within the autonomous cyber defense environment.

Reinforcement learning (RL) is a promising machine learning approach to support the development of autonomous agents that learn (near) optimal sequential decision making in complex, uncertain environments [29]. Between 2015 and 2020 alone, more than 1000 papers about RL have been published in the top Machine Learning conferences [27]

. The increased availability of open-source learning environments, such as OpenAIGym 

[7] and AI Safety Gridworlds [20] has provided a baseline to improve RL research across different application domains. Advancements in the cyber domain have lagged behind many others, due both to its complexity, and to a lack of available data or learning environments.

Within the cybersecurity domain, there are numerous research papers discussing the application of RL to attacking or defending network environments, such as corporate enterprise networks, Internet-of-Things (IoT) networks, cyber physical systems, and software-defined networks (SDNs) [24]. However, very few researchers incorporate some form of deception into their learning environment that train the RL agents. Often when researchers do incorporate some form of deception, they refer to deception or deception attacks

from an adversarial machine learning (AML) perspective, when an attacker perturbs observations to exploit the RL algorithm itself. Our use of the term cyber deception is different from those commonly used in the AML domain.

There are few, if any, open-source, cybersecurity-based RL experimentation environments that enable researchers to address the real-world challenges encountered when applying RL to the cybersecurity domain and improve the state-of-art in RL cyber applications [23]. With the release of CyberBattleSim environment in April 2021, Microsoft, leveraging the Python-based Open AI Gym interface, has created an initial, abstract simulation-based experimentation research platform to train automated agents using RL [30]. However, CyberBattleSim does not incorporate any cyber deception in its network simulation to potentially impede the autonomous attacker’s progress in achieving its goal of taking ownership of network resources by exploiting their vulnerabilities.

Although there is a lack of open-source RL training environments that include cyber deception, there is still a growing body of cybersecurity research that investigates deception as a defense against cyber attacks. Some of the research applies game theoretic principles to analyze the interactions of attackers and defenders to deceive the attackers with deceptive information and mitigate damage from the attack [3] [6] [16]. Other research also incorporate deception in the form of moving target defenses and adaptive honeypot management to increase attacker cost and to gather threat information. In the latter case, the deception is leveraged to reduce environment uncertainty to improve the RL policies learned by the defender [18]. Other cyber deception research has focused on human subjects experiments [26], sometimes also including also simulations [1], but these simulations are rarely publicly available to other researchers. However, this research does not incorporate deceptive decoys which are lightweight and low-interaction relative to honeypots. Decoys are less expensive for the defender to create and manage than honeypots as well as being less intrusive on normal users and system administrators than moving target defenses [12]. Decoys have been shown to impede and delay attacker forward progress regardless of whether the attacker is aware that deception is being used as a defense [14]. We aim to demonstrate that the advantages of cyber deception are also effective in degrading attacker performance and mitigating damage from cyber attacks within an RL simulation learning environment.

3 Methodology

Many different cyber deception techniques and technologies have been proposed and studied throughout the years [15]. In this paper, we examine three popular types of deception—honeypots, decoys, and honeytokens—to provide breadth to our inquiry. Honeypots tend to be high-fidelity, high-interaction fake systems which are designed to induce attentional tunneling in the attacker. They generally focus on looking especially interesting or vulnerable to lure an attacker into the honeypot and propel further interactions to gain intelligence about attacker tactics, techniques, and procedures (TTPs) during later stages of the Cyber Kill Chain [19]. Decoy Systems tend to be comprised of many low-fidelity, low-interaction fake systems which are designed to mimic the real systems on a network during earlier stages of the Cyber Kill Chain (i.e., reconnaissance). Honeytokens are fictitious tokens (e.g., honeypassword, honeypatch, honeyfile) which can be placed on real or deceptive machines. If stolen they act as an early warning system for defenders, as well as provide additional information that can help with response and mitigation. Decoys and honeytokens both benefit from attacker’s confirmation bias—a tendency to search for or interpret information in a way that only affirms one’s preconceptions. Since most things on computer networks are real, and are not traps, the attackers tend to assume the decoys and honeytokens are real, even when presented with contradictory information.

CyberBattleSim was devised to research autonomous agents using reinforcement learning in computer networks, but did not include any cyber deception concepts. We incorporated decoys, honeypots, and honeytokens (honeycredentials and honeylinks) into the codebase to investigate several research questions critical to cyber deception.

3.1 Research Questions

With this preliminary work we begin to address the following research questions:

  • Which type of cyber deception is best-suited to different attacker goals?

  • What effect does the number of deceptive elements have on the attacker goals?

  • When the number of possible deceptive elements is limited, what difference does the location they are placed make on the attacker goal?

For these research questions we looked at the following attacker goals:

  1. Network Ownership: The attacker has taken over a specific percentage of real nodes on the whole network.

  2. Key Terrain: The attacker has successfully exploited the most valuable asset on the network.

3.2 Simulation Details

The simulation environment was adapted from the toy capture the flag (ToyCTF) example provided in the CyberBattleSim code. The environment is designed to be high-level abstraction of an enterprise computing environment which is represented by a graph where nodes are computing elements and edges are connections between nodes. See Figure  1 for an example. We have attempted to maintain variable names from the code in the descriptions used in this paper, for greater transparency, even though they do not always align with the common terminology from the field.

Figure 1: CyberBattlerSim Example: the state space of ToyCTF with no deception.

3.2.1 Action Space

The action space includes local, remote, and connect and control exploits. Local exploits are executable on a node the attacker controls, remote exploits are executable on a node the attacker can see, and connect and control exploits are executable with a matching credential object on a node visible to the attacker. Local and remote exploits can reveal undiscovered nodes or credentials. Credentials are used in the connect and control action to take over nodes.

3.2.2 State Space and State Transitions

The scenario examined in ToyCTF involves a small network in which flags are located in a subset of machines, files, or programs (all represented as nodes in the graph). A local exploit reveals credentials if there are any on the local node, which can then be used later with a connect and control exploit to gain control or “own” a specific node, for which that credential is matched. It can also “discover” paths to neighboring nodes. A remote exploit is similar to the local exploit, but can be used on neighboring nodes. For the connect and control exploit to be successful, 1) the port needs to match, representing the correct service running, 2) the credentials need to match the target node, and 3) the node needs to have been discovered. This action can be taken from any location. The solved capture the flag example provided in the repository was used to infer a successful path for an attacker through the nine nodes in the state space.

Inferred optimal path from the attacker foothold (Client node): 1) Website, 2) GitHubProject, 3) AzureStorage, 4) Website.Directory, 5) Sharepoint, 6) AzureResourceManager, 7) Website[user=monitor], 8) AzureResourceManager[user=monitor], 9) AzureVM, where [user=monitor] indicates privilege escalation to a user named “monitor”. This inferred path was used to order the decoy and honeypot additions relative to the start (Client node, where the attacker has gained a foothold on the network) and finish: early (nearest the client node), or late (nearest the AzureVM node); results are displayed in Figures  6 and  4. These findings are expected for this scenario and these deceptive elements. The attack path for this scenario entails combining the Full Ownership and Key Terrain attacker goals, since the key terrain is at the end of the attack path, and the attacker is required to gain full ownership to progress to the end. The rewards in the ToyCTF example were altered to accommodate the addition of deceptive elements and our new scenario.

3.2.3 Deceptive Elements

The deceptive elements included honeypots [22], decoys [13], and honeytokens [28] (specifically, honey-credentials and honeylinks). Honey-credentials initially appear to be valid credentials to the attacker but they are invalid and cannot be used to connect to any nodes on the network. Honeylinks reveal potential new nodes, but the nodes are dummy nodes (no penalty for investigating them) and cannot be exploited or taken over. Honeypots appear to be a copy of an existing node which can be exploited and controlled. Honeypots additionally contain a number of honeytokens and honeylinks predefined at the start of the experiment. Honeypots can be exploited by the attacker with the command and control exploit but yield no reward. Decoys appear to be a copy of an existing node, but they can never be exploited or controlled. Credentials for decoys can be discovered in real nodes, leading the attacker to the decoy, but the connection attempt will fail. Any new action with a honeypot or a decoy generates a defined penalty for the attacker detailed in Section 3.2.4. Repeating actions does not fully penalize the attacker again for the same action, but instead incurs a repeated actions penalty of .

We have expanded on the ToyCTF example provided within CyberBattleSim which involves a nine node network of real assets. We modeled the following cyber deception techniques within CyberBattleSim:

  • Decoys: cannot be connected to but look like real assets. Connection attempts provide a reward of , since this will trigger an actionable decoy alert for defenders. No local actions are available on decoys. Any additional connections to the same decoy garners a reward, as a repeated action.

  • Honeypots: can be connected to as normal, look like real assets, but are only filled with fake credentials. The initial successful connection to a honeypot provides a reward of , since this will trigger a high confidence actionable alert for defenders. Any additional connections to the same honeypot garners a reward, as a repeated action.

  • Honeytokens: Fake credentials that do not work on any real machines and links leading to dummy machines are located on honeypots. When the attacker uses a honeytoken, it will provide a reward of , since this will trigger an actionable honeytoken alert for defenders. Repeated use of the same honeytoken garners a reward, as a repeated action. Honeylinks also reside within a honeypot, but provide a reward of to the attacker. They represent components of honeypots designed to waste attacker time and resources.

Honeypots and decoys were generated as clones of existing nodes and connected into the network in parallel to existing nodes with duplicate connections (new actions that target the new node) from the existing node. These new connections were designed to cancel the reward generated for a successful exploit so there is no reward for exploiting a honeypot.

3.2.4 Reward Function

ToyCTF did not include negative rewards, so the code was changed to allow penalties for interacting with deception.

  • Wrong password: -10

  • Repeated Mistake: -1

  • Exploit worked: +50

  • Decoy/honeypot touched: -100

  • Exploit use: -1

  • Control of node: +1000

  • Control of honeypot: 0

  • Win Condition: +5000

We had to modify the specifics of the default win condition to consider deception. Our win conditions are based on Network Ownership and Key Terrain described in Section  3.1.

3.2.5 Learning Algorithms

We utilize the attacker agents provided by CyberBattleSim which include the following algorithms (

, epsilon-min = 0.1, epochs = 250, steps = 5000):

  • Credential Lookup (CRED): A learner that exploits the credential cache, only taking other actions when no unused credentials are in hand.

  • Deep Q-Learning (DQL): A learner that uses a Deep Q-Network (DQN)

    . Because it uses Neural Networks as the Q-value function approximator, it has no theoretical convergence guarantees. It also has Q-value function overestimation bias, like Q-learning, which means the neural network weights/parameters must learn to recover from sub-optimal values. DQL attempts to overcome stability issues in convergence mainly by using experience replay to store and retrieve previous experiences and delayed updates of neural network weights.

  • Tabular Q-Learning (QTAB), (): Theoretically proven to converge to optimal solution under some mild assumptions; however, it is known to overestimate the Q-value function (i.e., state-action value function) because its initial values are randomly chosen at beginning of episodes and it attempts to maximize value at each step. Convergence of the Q-learning algorithm can be severely slow as it encounters and recovers from local/sub optimal values as a consequence of this overestimation bias.

  • Random Policy (RAND): An agent that selects a valid action randomly at each time step; with few available valid actions at each state, this agent performs surprisingly well.

For very large state and action spaces, Deep Q-Learning generally outperforms Q-learning, although both overestimate the Q-value function which causes convergence issues. For smaller states spaces, as in our case, Q-learning tends to converge slowly to a (near) optimal solution, while Deep Q-Learning might not converge at all under certain learning rate conditions. The presence and location of decoys might exacerbate these algorithm traits. For our purposes, there is no active defender agent. This allows us to better control and compare the attacker learned policies.

4 Results

The goal of a RL agent is to learn a policy to maximize expected cumulative reward. For this paper, we used the attacker agents provided in CyberBattleSim. It may not always be feasible to wrap all critical metrics up into the reward function, so in addition to reward, we consider additional metrics critical for evaluating cyber defense. Additional metrics to consider include the number of attacker successes, the amount of resources an attacker wastes due to the deception, and how quickly a defender can detect their presence.

Attacker Wins is the percentage of episodes for which the attacker meets with win criteria for the scenario. Figure 2 demonstrates that different attacker agents using different algorithms can have varying overall success on the cyber attack goal. With an increasing number of decoys or honeypots, when frontloaded (meaning a decoy of each element in the inferred optimal path was added starting at the beginning), the CRED attacker never achieved full ownership, while the agent using QTAB seems to outperform all the others. As expected, honeypots have a greater impact on attacker wins than decoys for each attacker agent, and frontloading deceptive elements outperforms backloading them; (backloaded graphs not shown due to space limitations).

Wasted Resources measures the percent of actions the attacker spent on deceptive elements and delaying their true forward progress. As expected, more attacker resources were wasted with the higher fidelity deceptive element—honeypots—and more resources are wasted as the number of deceptive elements increases. Frontloading also had a greater effect. These same patterns are seen across all the experiments (some are not included due to space limitations). These interactions also act as an early warning system, providing high-confidence alerts to the defender from interactions with deceptive elements. This can allow an active defender to pinpoint the attacker’s location, infer their goal, and apply countermeasures. There were also differences in the type of action wasted most among the various attacker agents, though this also depended on features of the deceptive elements (See Figure  3). For example, QTAB attacker wasted more connect and control exploit actions, less remote exploit actions, and few local exploit actions for various quantities of decoys or honeypots, while the number of actions wasted on real nodes (unsuccessful or duplicative actions; Figure  2(a)) decreased slightly as the number wasted on decoys (Figure  2(b)) or honeypots increases. This differs from other algorithms, such as DQL, in which connect and control exploit actions and remote exploit actions seemed to be wasted in equal amounts for decoys. Interestingly, for the case of honeypots, the DQL attacker wasted more remote exploit actions, and less connect and control exploit actions, and few local exploit actions for various quantities of honeypots on both real (Figure  2(c)) and deceptive nodes (Figure  2(d)). This indicates that just the presence of deceptive elements on the network have potential to alter the attacker’s behavior on the real nodes in the network.

Defender Detections is number of times, and the speed at which, the attacker has triggered a high-confidence alert via deceptive elements, which would allow an active defender to pinpoint the attacker’s location and apply countermeasures. In this case, since both the decoys and honeypots can act as an early warning system, any interaction counts as a detection, so this metric simply depends on the location of the deceptive element in the attack path.

(a) Percentage of attacker wins as the number of decoys increases. DQL: 58.64%; QTAB: 65.04%; CRED: 0%; RAND: 59.08%
(b) Mean number of steps for attacker to win as a function of the number of decoys.
(c) Percentage of attacker wins as number of honeypots increases. DQL: 18.68%; QTAB: 27.36%; CRED: 0%; RAND: 18.92%
(d) Mean number of steps for the attacker to win as a function of the number of honeypots.
Figure 2: Attacker Wins: Measures of attacker attaining full ownership when varying the quantity of frontloaded deceptive elements.
(a) Percentage of mean actions on real nodes as a function of the number of decoys present.
(b) Percentage of mean actions on decoy nodes.
(c) Percentage of mean actions on real nodes as a function of the number of honeypots.
(d) Percentage of mean actions on honeypot nodes.
Figure 3: Wasted Resources Metric: Percentage of mean actions spent on real or deceptive nodes per episode for the DQL attacker when varying number of a specific type of deceptive element.

4.1 Type of Deception Technology

These results investigate how each different type of deception affected attacker behavior by comparing the success of the autonomous attackers for each attacker goal. Comparing several of the figures we can compare the deceptive technologies. The decoys are best at impeding the CRED attacker, since its main TTP (using credentials) is most effected by the design of decoys. This provided agent was designed to “exploit the credential cache to connect to a node not owned yet”. The honeypots seem to have a slightly greater effect on the other attacker algorithms, since the multiple honeytokens increase the valid actions available within a honeypot and thus magnify their impact. See Figure 1(c) illustrating the reduced attacker success as more honeypots are added to the network (frontloaded), compared to a similar increase in decoys in Figure 1(a). However, there are other reasons that a lower fidelity deceptive element (like decoys) may be preferred by defenders, to something more effective (like honeypots). The reason honeypots are more effective is often because of the enhanced realism and fidelity, allowing longer and more complicated interactions with the attacker. However, this comes as an increased set-up and upkeep cost to defenders to ensure these high-fidelity elements are deep, consistent, and fresh.

4.2 Quantity of Deceptive Elements

By comparing the success of the included autonomous attackers, we investigate how the quantity of each of the three different types of deception technologies affects attacker behavior.

Decoys. In Figure 5 we examine how the attacker reward changes as the number of decoys is increased. We compare several default algorithms provided as the number of decoys increases from zero to nine until such there is a decoy copy of every real machine in the network. In the case where there are no decoys, the DQL attacker accumulates the most reward, however, as the number of decoys increases, this attacker’s performance dips. Furthermore, the CRED attacker, who focuses on using credentials to connect to machines, performs much worse as the number of decoys increases. This is because they are trying to connect to the decoy machines with unusable credentials and accumulating negative rewards. Examining Figures  4(e) and  4(f), compared to Figures  4(d) and  4(c), respectively, displays the difference between backloading and frontloading decoys for different automated attackers.
Honeypots. In Figure 4 we examine how the attacker reward changes as the number of honeypots increases. One honeypot seems to have a similar, but magnified effect of one decoy on the DQL attacker. The more honeypots that are deployed increases the negative effect on the attacker. When a honeypot is only encountered at later stages in the attack, this delays the negative effect on attacker reward since no extra barriers were encountered in the earlier part of their campaign.

(a) Honeypots Frontloaded RAND Rewards
(b) Honeypots Frontloaded QTAB Rewards
(c) Honeypots Frontloaded DQL Rewards
Figure 4: Quantity of Honeypots: Cumulative reward for various attacker agents as a function of the quantity of honeypots (maximum 5000 steps).
(a) Decoys Frontloaded RAND Rewards
(b) Decoys Frontloaded CRED Rewards
(c) Decoys Frontloaded QTAB Rewards
(d) Decoys Frontloaded DQL Rewards
(e) Decoys Backloaded DQL Rewards
(f) Decoys Backloaded QTAB Rewards
Figure 5: Quantity of Decoys: Cumulative reward for various attacker agents as a function of the quantity of decoys (maximum 5000 steps).

4.3 Layout of Deceptive Elements

These results illustrate how the deployment layout of two different deception technologies (decoys and honeypots) affects attacker behavior by comparing the success of the CyberBattleSim autonomous attackers. In real-world defensive situations it may not be evident where to deploy deceptive elements (i.e., at a static frontloaded or backloaded network location) since attackers do not advertise their foothold location or goal. Using an autonomous defender, we no longer need to rely on static elements. The defender agent, no longer needing to rely on static elements, can dynamically change the configuration or layout of deceptive elements based on real-time observations and alerts that can provide updated information about the attacker’s locations and intentions.
In Figures 6 and 7 we examine how the attacker reward changes as the location of a single deceptive element changes, assuming the network owner is resource constrained. The AzureStorage decoy (location step 3) and the Azure Resource Manager decoy (location step 6) impacted the QTAB and CRED, while the DQL agent was also affected by the Website decoy (location step 1). This kind of simulation can help resource-limited network owners or autonomous defenders determine the best placement for deceptive elements.

(a) Deep Q-Learning (DQL)
(b) Tabular Q-Learning (QTAB))
(c) Credential Lookup (CRED)
(d) Random Policy (RAND): Agent that selects a valid action randomly at each time step
Figure 6: Location of Decoys: Cumulative reward for various attacker agents as a function of location of one decoy. Location of the decoy is based on inferred optimal path, with Website at the beginning (step 1) and Azure VM at the end (step 9).
(a) Deep Q-Learning (DQL)
(b) Tabular Q-Learning (QTAB)
(c) Credential Lookup (CRED)
(d) Random Policy (RAND): Agent that selects a valid action randomly at each time step
Figure 7: Location of Honeypots: Cumulative reward for various attacker agents as a function of the location of one honeypot. Location of the decoy is based on inferred optimal path, with Website at the beginning (step 1) and Azure VM at the end (step 9).

4.4 Implications for an Autonomous Defender

The results presented in this paper, while not employing a defender agent, provide insight for an autonomous defender. A defender can take a response action immediately when a decoy, honeypot, or honeytoken alert is triggered, including ejecting the attacker from the network. Early placement of deceptive elements can go beyond a defender using this high-confidence early warning detection system to block or eject an attacker. It can more directly restrict an attacker’s ability to progress and achieve their goals. Furthermore, there are many additional actions an autonomous defender can take including deploying additional deceptive elements [2]. Type, quantity, and layout of the deceptive elements are all relevant features that the defender agent can learn to optimize. Our next steps include the addition of a deceptive defensive agent.

5 Conclusions & Future Work

Results presented in this paper demonstrate expected effects of cyber deception on attacker behavior.

The colored shading in the figures shows the standard deviation from each line, which in some cases indicates that neighboring lines have overlapping possible results.

While there are no unexpected findings, the contribution to demonstrate the capability of modeling cyber deception in CyberBattleSim was achieved. These fundamental results provided a necessary sanity check while incorporating deception into CyberBattleSim. With a reasonable model of cyber deception and how it effects cyber attacks built into the simulator, researchers can take further steps towards designing custom autonomous defender agents that can actively take deceptive responses and running more complex deception experiments with different scenarios and techniques. Additionally, more sophisticated automated attacker agents could be integrated within the environment to test cyber deception [4].

Future work also entails informing the models and rewards with insights from real-world attack behavior gathered from human subject experiments [21]. This grounding in reality will help ensure that results are generalizable and applicable to real systems and the humans who attack them.


  • [1] P. Aggarwal, C. Gonzalez, and V. Dutt (2017) Modeling the effects of amount and timing of deception in simulated network scenarios. In 2017 International Conference On Cyber Situational Awareness, Data Analytics And Assessment (Cyber SA), Vol. , pp. 1–7. External Links: Document Cited by: §2.
  • [2] E. Al-Shaer, J. Wei, K. W. Hamlen, and C. X. Wang (2019) Autonomous cyber deception: reasoning, adaptive planning, and evaluation of honeythings. Autonomous Cyber Deception. Cited by: §4.4.
  • [3] A. H. Anwar and C. Kamhoua (2020) Game theory on attack graph for cyber deception. In Decision and Game Theory for Security. (GameSec), Q. Zhu, J.S. Baras, R. Poovendran, and J. Chen (Eds.), Lecture Notes in Computer Science, Vol. 12513. Cited by: §2.
  • [4] A. Applebaum, D. Miller, B. Strom, C. Korban, and R. Wolf (2016) Intelligent, automated red team emulation. In Proceedings of the 32nd Annual Conference on Computer Security Applications, ACSAC ’16, New York, NY, USA, pp. 363?373. External Links: ISBN 9781450347716, Link, Document Cited by: §5.
  • [5] Attivo Networks, Inc. Deception and Denial for Detecting Adversaries. Note: 2021-03-04 Cited by: §1.
  • [6] M. Bilinski, J. DiVita, K.J. Ferguson-Walter, S.J. Fugate, R. Gabrys, J. Mauger, and B.J. Souza (2020) Lie another day: demonstrating bias in a multi-round cyber deception game of questionable veracity. In Decision and Game Theory for Security. (GameSec), Q. Zhu, J.S. Baras, R. Poovendran, and J. Chen (Eds.), Lecture Notes in Computer Science, Vol. 12513. Cited by: §2.
  • [7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. CoRR abs/1606.01540. Note: External Links: Link, 1606.01540 Cited by: §1, §2.
  • [8] CounterCraft CounterCraft Cloud™. Note: 2021-03-04 Cited by: §1.
  • [9] F. Cybersecurity Fidelis Deception. Note: 2021-03-04 Cited by: §1.
  • [10] CYBERTRAP CYBERTRAP: Deception Technology from Austria. CYBERTRAP Software GmbH. Note: 2021-03-04 Cited by: §1.
  • [11] Edited by Jason Romano and Duncan Sparrell (2019-11) Open command and control (OpenC2) language specification version 1.0. Note: OASIS Committee Specification 02 Cited by: §1.
  • [12] K.J. Ferguson-Walter, S.J. Fugate, J. Mauger, and M.M. Major (2019-03) Game theory for adaptive defensive cyber deception. ACM Hot Topics in the Science of Security Symposium (HotSoS). Cited by: §2.
  • [13] K.J. Ferguson-Walter, D.S. LaFon, and T.B. Shade (2017) Friend or faux: deception for cyber defense. Journal of Information Warfare 16 (2), pp. 28–42. Cited by: §3.2.3.
  • [14] K.J. Ferguson-Walter, M.M. Major, C.K. Johnson, and D.H. Muhleman (2021-04) Examining the efficacy of decoy-based and psychological cyber deception. In USENIX Security Symposuim, Cited by: §2.
  • [15] D. Fraunholz, S.D. Antón, C. Lipps, D. Reti, D.Krohmer, F. Pohl, M. Tammen, and H. Schotten (2018) Demystifying deception technology: a survey. ArXiv abs/1804.06196. Cited by: §3.
  • [16] S. Fugate and K. Ferguson-Walter (2019-03) Artificial Intelligence and Game Theory Models for Defending Critical Networks with Cyber Deception. AI Magazine 40 (1), pp. 49–62. External Links: ISSN 2371-9621, 0738-4602 Cited by: §2.
  • [17] K.E. Heckman, F.J. Stech, R.K. Thomas, Be. Schmoker, and A.W. Tsow (2015) Cyber Denial, Deception and Counter Deception: A Framework for Supporting Active Cyber Defense. Advances in Information Security, Springer International Publishing. External Links: ISBN 978-3-319-25131-8 Cited by: §1.
  • [18] Q. Huang L. (2020) Strategic learning for active, adaptive, and autonomous cyber defense. In Adaptive Autonomous Secure Cyber Systems, J. S., C. G., S. V., S. V., and W. C. andWellman M. (Eds.), Cham. Cited by: §2.
  • [19] E.M. Hutchins, M.J. Cloppert, and R.M. Amin (2011) Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Leading Issues in Information Warfare & Security Research 1 (1), pp. 80. Cited by: §3.
  • [20] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg (2017) AI safety gridworlds. CoRR abs/1711.09883. Note: External Links: Link, 1711.09883 Cited by: §2.
  • [21] M.M. Major, B.J. Souza, J. DiVita, and K.J. Ferguson-Walter (2021-08) Informing autonomous deception systems with cyber expert performance data. In International Joint Conference on Artificial Intelligence: Adaptive Cyber Defense Workshop, Cited by: §5.
  • [22] I. Mokube and M. Adams (2007) Honeypots: concepts, approaches, and challenges. In in ACM-SE 45: Proceedings of the 45th Annual Southeast Regional Conference,, pp. 321–326. Cited by: §3.2.3.
  • [23] A. Molina-Markham, R. Winder, and A. Ridley (2021) Network defense is not a game. Note: arXiv External Links: 2104.10262 Cited by: §2.
  • [24] T. T. Nguyen and V. J. Reddi (2020-07) Deep reinforcement learning for cyber security. Note: arXiv External Links: 1906.05799 Cited by: §2.
  • [25] T. Security (2017) Active Defense: TrapX® DeceptionGrid®. TrapX Security, Inc.. Note: 2021-03-04 Cited by: §1.
  • [26] T.B. Shade, A.V. Rogers, K.J. Ferguson-Walter, S.B. Elsen, D. Fayette, and K.E. Heckman (2020-01) The Moonraker Study: An Experimental Evaluation of Host-Based Deception. In Hawaii International Conference on System Sciences (HICSS), Maui, Hawaii. Cited by: §2.
  • [27] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B. (. Hsu, and K. Wang (2015) An overview of microsoft academic service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, pp. 243–246. External Links: ISBN 9781450334730 Cited by: §2.
  • [28] D. Storey (2009) Honey tokens: catching flies with honey tokens. Network Security archive 2009, pp. 15–18. Cited by: §3.2.3.
  • [29] R. S. Sutton and A. G. Barto (2017) Reinforcement Learning: An Introduction. 2nd edition, MIT Press, Cambridge, MA. Cited by: §2.
  • [30] M. D. R. Team. (2021) CyberBattleSim. GitHub. Note: Created by Christian Seifert, Michael Betser, William Blum, James Bono, Kate Farris, Emily Goren, Justin Grana, Kristian Holsheimer, Brandon Marken, Joshua Neil, Nicole Nichols, Jugal Parikh, Haoran Wei. Cited by: Incorporating Deception into CyberBattleSim for Autonomous Defense, §1, §2.