In warfare, the ability to anticipate the consequences of the opponent’s possible actions and counteractions is an essential skill Caffrey Jr (2019). This becomes even more challenging in the future battlefield and multi-domain operations where the speed and complexity of operations are likely to exceed the cognitive abilities of a human command staff in charge of conventional, largely manual, command and control processes. Strategy games such as chess, poker, and Go have abstracted some of command and control (C2) concepts. While wargames are often manual games played with physical maps, markers, and spreadsheets to compute battle outcomes Perla (1990), computer games and modern game engines such as Unreal Engine11endnote: 1Unreal Engine game engine: https://www.unrealengine.com/. and Unity22endnote: 2Unity game engine: https://unity.com/ are capable of automating simulation of complex battle and related physics simulations.
Computer games and game engines are not only suitable to simulate military scenarios but also provide a useful test-bed for developing AI algorithms. Games such as StarCraft IIVinyals et al. (2017); Sun et al. (2018); Vinyals et al. (2019); Waytowich et al. (2019a, b); Wang et al. (2020); Han et al. (2020), Dota 2 OpenAI et al. (2019), AtariMnih et al. (2013, 2015); Hessel et al. (2018), GoSilver et al. (2016, 2017); Schrittwieser et al. (2020), chessCampbell et al. (2002); Hsu (2002), heads-up no-limit pokerBrown and Sandholm (2018) have all been used as platforms for training artificially intelligent agents. Another capability developed by the gaming industry that is relevant to the military domain is virtual reality and visual augmentation systems, which may potentially provide commanders and staff a more intuitive and information-rich display of the battlefield.
Recently, there has been a major effort towards leveraging the capabilities of computer games and simulation engines due to their suitability for integrating AI algorithms, some examples of which we discuss shortly. The work described in this paper focuses on adapting imperfect information real-time strategy games (RTS) and their corresponding simulators, to emulate military assets in battlefield scenarios. Further, this work aims to advance the research and development efforts associated with AI algorithms for command and control. Flexibility in these games and game engines allow for simulations of a rather broad range of terrains or assets. In addition, most games and simulators allow for battlefield scenarios to be played out at a “faster then real-time” speed which is a highly desirable feature for the rapid development and testing of data-driven approaches.
The paper is organized as follows. In the Background and Related Work section, we offer motivation for research on AI-based tools for Command and Control (C2) decision-making support, and describe a number of prior related efforts to develop approaches to such tools. In particular, we offer arguments in support of using machine learning that can leverage data produced using simulation / wargaming runs, to achieve affordable AI solutions. In the following section, we outline briefly the project that yielded the results described in the paper, and explain how critical for such a research is to find or adapt a game or a simulation with an appropriate set of features. Then, we describe in detail two case studies. In the first, we describe how we adapted a fast and popular gaming system to approximate partly-realistic military operation and how we used it in conjunction with a reinforcement learning algorithm to train an “artificial commander” (a trained agent) that commands a blue force to fight an opposing enemy force (red force). In the second case study, we describe a similar exploration using a realistic military simulation system. Then we discuss approaches to overcoming common challenges that are likely to arise in transitioning such future technologies to real-world battlefield, and also describe our key findings.
The key contributions of the research described in the paper are two-fold. First, we explore and describe two case studies on modeling a partly realistic mission scenario and training an artificial commander leveraging existing (with some adaptations) off-the-shelf deep reinforcement learning algorithms. We show that such trained algorithms can significantly outperform both human and doctrine-based baselines without using any pre-coded expert knowledge, and learn effective behaviors entirely from experience. Second, we formulate and empirically confirm a set of features and requirements that a game or simulation engine should meet in order to serve as a viable platform for research and development of reinforcement learning tools for C2. In effect, these are initial recommendations for the researchers who build related experimental and developmental systems.
2 Background and Related Work
We start by defining important terms used throughout this research work. Wargame is defined in this work as a largely manual, strategy game that uses rules, data, and procedures to simulate an armed conflict between two or more opposing sides Gortney (2010); Burns et al. (2015); Caffrey Jr (2019), which is used to train military officers and plan military courses of action. This is different from games, which is used here as a fully automated, recreational computer application with well-defined rules and score systems that uses a simulation of an armed conflict as a form of entertainment. Similarly, simulators are used here as a hybrid form of wargames and games. Simulators are fully automated computer applications that aim to realistically simulate the outcome of military battles and courses of action. They are not designed as an entertainment platform but as a tool to aid military planning.
Concerning command and control (C2), we use the same definition as command and control warfighting function defined in the Army Doctrine Publication No.3 - Operations Army (2019) as the “related tasks and a system that enable commanders to synchronize and converge all elements of combat power”. Similarly, AI-support to C2 are AI systems developed to aid human commanders by providing additional information or recommendations in the command and control process.
The past few decades have seen a number of ideas and corresponding research towards developing automated or semi-automated tools that could support decision-making in planning and executing military operations. DARPA’s JFACC program took place in late 1990’s Kott and others (2004) and developed a number concepts and prototypes for agile management of a joint air battle. Most of the approaches considered at that time involved continuous real-time optimization and re-optimization (as situation continually changes) of routes and activities of various air assets. Also in mid-to-late 1990’s, the Army funded the CADET project Kott et al. (2002) which explored potential utility of classical hierarchical planning, adapted for adversarial environments, for transforming high-level battle sketch into a detailed synchronization matrix – a key product of the doctrinal Military Decision-Making Process (MDMP). In early 2000’s, DARPA initiated the RAID project Ownby and Kott (2006) which explored a number of technologies for anticipating enemy battle plans, as well as dynamically proposing friendly tactical actions. At the time, game-solving algorithms emerged as the most successful among the technological approaches explored.
The role of multiple domains and their extremely complex interactions – beyond the traditional kinetic fights to include political, economic and social effects – were explored in late 2000’s in DARPA’s COMPOEX program Kott and Corpac (2007). This program investigated the use of interconnected simulation sub-models, mostly system-dynamic models, in order to assist senior military and civilian leaders in planning and executing large-scale campaigns in complex operational environments. The importance of non-traditional warfighting domains such as the cyber domain has been recognized and studied in mid-2010’s by a NATO research group Kott et al. (2017) that looked into simulation approaches to assessing mission impacts of cyberattacks and highlighted strong non-linear effects of interactions between cyber, human and traditional physical domains.
All approaches taken in the research efforts mentioned above – and many other similar ones – have major and somewhat common weaknesses. They tend to require a rigid, precise formulation of the problem domain. Once such a formulation is constructed, they tend to produce effective results. However, as soon as a new element needs to be incorporated into the formulation (e.g., a new type of a military asset or a new tactic), a difficult, expensive, manual and long effort is required to “rewire” the problem formulation and to fine-tune the solution mechanism. And the real world endlessly presents a stream of new elements and entities that must be taken into account. In rule-based systems of the 1980’s, a system would become un-maintainable as more and more rules (with often unpredictable interactions) had to be added to represent the real-world intricacies of a domain. Similarly, in optimization-based approaches, an endless number of relations between significant variables, and variety of constraints had to be continually and manually be added (a maintenance nightmare) to represent real-world intricacies of a domain. In game-based approaches, the rules governing legal moves and effects of moves for each piece would gradually become hopelessly convoluted as more and more realities of a domain had to be manually contrived and added to game formulation.
In short, such approaches are costly in their representation-building and maintenance. Ideally, we would like to see a system that learns its problem formulation and solution algorithm directly from its experiences in a real or simulated world, without any (or with little) manual programming. Machine learning, particularly reinforcement learning, offers that promise.
In contrast, applicability of machine learning to problems of real-world C2 remains a matter of debate and exploration. Walsh et al. Walsh et al. ; Walsh et al. investigated how the military can use the capacity of dealing with large volumes of data and greater decision speed of machine learning algorithms for military planning and command and control. Walsh et al. Walsh et al. analyzed and rated the characteristics of ten games, such as Go, Bridge, and StarCraft II, and ten command and control processes, such as intelligence preparation of the battlefield, operational assessment, and troop leading procedures. Example of characteristics are the operational tempo, rate of environment change, problem complexity, data availability, stochasticity of action outcomes, and others. Their main conclusion related to using games as a platform for military applications was that real-world tasks are very different from many of the games and environments used to develop and demonstrate artificial intelligence systems, which is mostly due to them having fixed and well-defined rules that are regularly exploited by AI agents.
Related to these games, StarCraft II (SC2) is a real-time strategy game where players compete for map domination and resources. The players need to manage resources to expand their bases, build more units, upgrade them, and coordinate tens of units at the same time to defeat their opponent. Vinyals et al. Vinyals et al. (2019) presented AlphaStar, the first AI agent to learn how to play the full game of SC2 at the highest competitive level. AlphaStar
played under the same kind of constraints that humans play under professionally approved conditions. The AI agent was able to achieve this success using a combination of self-play via reinforcement learning, multi-agent learning, and imitation learning using hundreds of thousands of expert replays. Sun et al.Sun et al. (2018) proposed the TStarBot AI agent, a combination of reinforcement learning and encoded knowledge of the game, becoming the first learning agent to defeat all levels of the built-in rule-based AI in the StarCraft II game. Han et al. Han et al. (2020) extended that to TStarBot-X with a complete analysis of the learned agent trained under a limited scale of computation resources. Also using limited computational resources, Wang et al. Wang et al. (2020) presented StarCraft Commander, an AI system that also achieves the highest competitive level but using a model with a third of the parameters used by AlphaStar and about one-sixth of the training time.
OpenAI et al. OpenAI et al. (2019) was the first AI system to defeat the world champions at the Dota 2 esport game, a partially-observable five-against-five real-time strategy game with high dimensionality of observation and action spaces. In the Dota 2 game, both teams compete for control of strategic locations of the map and to defend their bases at the corners of the map. Each agent controls a special hero unit with unique abilities that are used when teams engage in conflict against each other and non-player-controlled units. OpenAI was able to solve this complex task by scaling existing reinforcement learning algorithms, in this case, Proximal Policy Optimization (PPO) Schulman et al. (2017)
with Generalized Advantage Estimation (GAE)Schulman et al. (2015)
, to train using thousand of GPUs over ten months with a specialized neural-network architecture to handle the long time horizons of the game.
Boron and Darken Boron and Darken (2020) investigated the use of deep reinforcement algorithms to solve simulated, perfect information, turn-based, multi-agent, small tactical engagement scenarios based in military doctrine. The grid-world environment emulated an offensive scenario where the homogeneous units controlled by the reinforcement learning agent could move and attack and the defender was static and combat was resolved using both deterministic and stochastic Lanchester combat models Lanchester (1916); Washburn (2000); MacKay (2006). One of the main results was how the learned behavior can be controlled to follow different principles of war such as mass or force Corps (2011) based on the discount factor for the rewards.
Asher et al. Asher et al. (2018) proposed the adoption of militarily relevant tactics to a simulated predator-prey pursuit task that allowed for teams of deep reinforcement learning agents to engage in various tactics through capability modifications such as decoys, traps, and camouflage. Further, this research group has introduced methods for measuring coordination Asher et al. (2019); Zaroukian et al. (2019); Asher et al. (2020) towards a framework for integrating AI agents into mixed Soldier-agent teams Barton et al. (2018); Barton and Asher (2018).
With respect to simulators built using existing game engines, Sun et al. Fu et al. (2020) applied deep reinforcement learning to the command and control of air defense operations. Digital battlefield environment based on Unreal Engine was developed for reinforcement learning training process. The training setup consisted of static and random opponent strategies where attack route and formation of units was either fixed or random. For evaluation, win rate, battle damage statistics, and combat details were compared against human experts. Their experimental results showed that deep reinforcement learning agent achieved higher winning rate than the human experts in fixed and random opponent scenarios. Conversely, in this work, the authors are proposing a real-time decision-making and planning system that is designed using reinforcement learning formalism, which can operate in tandem with the commander. Unity game engine was used as a base for a multi-agent robot and sensor simulation for military applicationsKing-Doonan (2020) and as a quadrotor simulator Song et al. (2020), while other simulators, such as Microsoft AirSim Shah et al. (2017), use Unreal Engine to simulate interaction between multiple robotic agents and the environment, being flexible enough to allow the implementation of military-relevant tasks, as for example, an autonomous landing task where a drone landed on top of a combat vehicle Goecks et al. (2019).
With respect to military simulators, the AlphaDogfight Trial was a DARPA sponsored competition 2 where an AI controled a simulated F-16 fighter jet in an aerial combat against an Air Force pilot flying in a virtual reality simulator. The goal of this program was to automate air-to-air combat. First place winner, Heron Systems, designed an F-16 AI, which outperformed seven other participating companies and defeated an experienced Air Force pilot with a score of 5-0 1 in a simulated dogfight. The outcome of this competition demonstrated that an AI can provide precise aircraft maneuvering that may surpass human abilities. In addition, this F-16 AI opened possibilities of human-machine teaming such that human pilots can address high-level strategies while offloading low level, tedious tactical tasks to an AI.
Schwartz et al. Schwartz et al. (2020)
described a wargaming support tool that used a genetic algorithm to recommend modifications to an existing friendly course of action (COA), a sequence of decisions to be taken in a military scenario, framed as task scheduling problem with multiple tasks and their respective start times. The system integrated user input to the optimization process to constrain which tasks can be modified, minimum and maximum start times, all the AI settings, and monitor all the recommendations that were being simulated by the AI. The authors showed that their system was able to generate expert-level recommendation to friendly courses of action and support the military decision-making process (MDMP)Reese (2015).
Given these positive results on wargaming, real-time complex strategy and multi-agent recreational applications, and military simulators, extending these algorithms to military command and control applications as a whole is a promising research avenue that we explore in this research work.
3 Games and Simulations for Deep Reinforcement Learning of C2
In this work, we investigate whether deep reinforcement learning algorithms might support future agile and adaptive command and control (C2) of multi-domain forces that would enable the commander and staff to exploit rapidly and effectively fleeting windows of superiority.
To train reinforcement learning agents for a command and control scenario, a fast running simulator with the appropriate interfaces is needed to enable learning algorithms to run for millions of simulation steps, as is often required by state-of-the-art deep reinforcement learning algorithms Espeholt et al. (2018); Kapturowski et al. (2018). Much of what motivated us on adapting a game to the study of AI in C2 applications was the difficulty of finding a simulator that had all the required features to develop machine learning algorithms for C2. These features include, but are not limited to, being able to:
interact with the simulator, also known as the environment, via an application programming interface (API), which includes the ability to query environment states and send actions computed by the agent;
simulate interactions between agent and environment faster than real-time;
parallelize simulations for distributed training of the machine learning agent, ideally over multiple nodes;
randomize environment conditions during training to learn more robust and generalizable reinforcement learning policies; and
emulate realistic combat characteristics such as terrain, mobility, visibility, communications, weapon range, damage, rate of fire, and other factors, in order to mimic real world military scenarios.
While we were unable to find a simulator that met all these requirements out of the box, game-based simulators, such as the StarCraft II Learning Environment (SC2LE) Vinyals et al. (2017), were able to be adapted with the purpose of exploring use of applying artificial intelligence algorithms in potential military C2 applications. The next sections describe how we adapted SC2LE and the military simulator OpSim, respectively, to be used as our main platform to develop an artificial commander for command and control tasks using reinforcement learning.
3.1 Case Study: StarCraft II for Military Applications
As part of this project, we developed a research prototype command and control (C2) simulation and experimentation capability that included simulated battlespaces using the StarCraft II Learning Environment (SC2LE) Vinyals et al. (2017) with interfaces to deep reinforcement learning algorithms via RLlib Liang et al. (2018), a library that provides scalable software primitives for reinforcement learning on a high performance computing system.
StarCraft II is a complex real-time strategy game in which players balance high-level economic decisions with low-level individual control of potentially hundreds of units in order to overpower and defeat an opponent force. StarCraft II has a number of difficult challenges for artificial intelligence algorithms that make it a suitable simulation environment for wargaming, command and control, and other military applications. For example, the game has complex state and action spaces, can last tens of thousands of time-steps, can have thousands of actions selected in real time, and can capture uncertainty due to the partial observability or “fog-of-war”. Further, the game has heterogeneous assets, an inherent control architecture that in some elements resembles military command and control, embedded objectives that have adversarial nature, and a shallow learning curve for implementation/modification compared to more robust simulations.
DeepMind’s SC2LE framework Vinyals et al. (2017) exposes Blizzard Entertainment’s StarCraft II Machine Learning API as a reinforcement learning environment. This tool provides access to StarCraft II, its associated map editor, and an interface for RL agents to interact with StarCraft II, getting observations and sending actions.
Using the StarCraft II Editor we implemented TigerClaw, a Brigade-scale offensive operation scenario Narayanan et al. (2021) to generate a tactical combat scenario within the StarCraft II simulation environment, as seen in Fig. 1. The game was militarized by re-skining the icons to incorporate MIL-STD-2525C military symbology STD (2011) and unit parameters (weapons, range, scaling) associated with the TigerClaw scenario, as seen in Fig. 2.
In TigerClaw, the Blue Force’s goal is to cross the wadi terrain, neutralize the Red Force, and control certain geographic locations. These objectives are encoded in the game score for use by reinforcement learning agents and serving as a benchmarking baseline for comparison across different neural network architectures and reward driving attributes. The following sections describe the process we used to adapt the map, units, and rewards to this military scenario.
3.1.1 Map Development
We created a new Melee Map for TigerClaw scenario using the StarCraft II Editor. The map size was the largest available, 256 by 256 tiles, using the StarCraft II coordinate system. A wasteland tile set was used as the default surface of the map since it visually resembled a desert region in the area of operations in TigerClaw, as seen in Fig. 3. After the initial setup, we used the Terrain tools to modify the map to loosely approximate the area of operation. The key terrain feature was the impassable wadi with limited crossing points.
Distance scaling was an important factor for the scenario creation. In the initial map, we used the known distance between landmarks to translate StarCraft II distance, using its internal coordinate system, into kilometers and latitude and longitude. This translation is important for adjusting weapons range during unit modification and to ensure compatibility with other internal visualization tools that expect geographic coordinates as input.
3.1.2 Playable Units Modification
To simulate the TigerClaw scenario, we selected StarCraft II “fantastic” units that could be made to approximate the capabilities of realistic military units, even if crudely. The StarCraft II units were first duplicated and their attributes modified in the Editor to support the scenario. First, we modified the appearance of the units and replaced it with an appropriate MIL-STD-2525C symbol, as seen in Table 1.
|TigerClaw Unit||StarCraft II Unit|
|Armor||Siege Tank (tank mode)|
|Artillery||Siege Tank (siege mode)|
Other attributes modified for the scenario included, weapon range, weapon damage, unit speed, and unit health (how much damage it can sustain). Weapon ranges were discerned from open source materials and scaled to the map dimensions. Unit speed was established in theTigerClaw operations order and fixed at that value. The attributes for damage and health were estimated, with the guiding principle of maintaining a conflict that challenges both sides. Each StarCraft II unit usually had only one weapon making it challenging to simulate the variety of armaments available to a company size unit. Additional effort to increase the accuracy of unit modifications will require wargaming subject matter experts.
Additionally, the Blue Force units were modified so that they would not engage offensively or defensively unless specifically commanded by the player or learning agent in control. To control the Red Forces, we used two different strategies. The first strategy was to include a scripted course of action, a set of high-level actions to take, for Red Force movements that is executed in every simulation. The units default aggressiveness attributes controlled how it engaged Blue. The second strategy was to let a StarCraft II bot AI control the Red Force to execute an all-out attack, or suicide as it is termed in the Editor. The built-in StarCraft II bot has several difficulty levels (1–10) which dictate the proficiency of the bot. The bot levels indicate their proficiency where a level 1 is a fairly rudimentary bot that can be easily defeated and level 10 is a very sophisticated bot that uses information not available to players (i.e., a cheating bot). Finally, environmental factors such as fog-of-war were toggled across experiments to investigate their impact. This was done in the following manner:
3.1.3 Game Score and Reward Implementation
Reward function is an important component of reinforcement learning and it controls how the agent reacts to environmental changes by giving them positive or negative reward for each situation. We incorporated the reward function for the TigerClaw scenario in StarCraft II and our implementation overrode the internal game scoring system. The default scoring system in StarCraft II rewarded players for the resource value of their units and structures. Our new scoring system focused on gaining and occupying new territory as well as destroying the enemy.
Our reward function awarded points for the Blue Force crossing the wadi (river) and points for retreating back. In addition, we awarded points for destroying a Red Force unit and points if a Blue Force unit was destroyed. In order to implement the reward function, it was necessary to first use the StarCraft II map editor to define the various regions and objectives of the map. Regions are areas, defined by the user, which are utilized by internal map triggers that compute the game score, as seen in Fig. 4.
Additional reward components can be also integrated based on the the Commander’s Intent in the TigerClaw, or other scenario, warning orders. Ideally, the reward function will attempt to train the agent to create optimal behavior that is perceived as reasonable by a military subject matter expert.
3.1.4 Deep Reinforcement Learning Results
Using the custom TigerClaw map, units, and reward function, we trained a multi-input and multi-output deep reinforcement learning agent adapted from Waytowich et al 2019 Waytowich et al. (2019b). The RL agent was trained using the Asynchronous Advantage Actor Critic (A3C) algorithm Mnih et al. (2016). In this tactical version of the StarCraft II mini-game, as shown in Fig. 5
, the state-space consists of 7 mini-map feature layers of size 64x64 and 13 screen feature layer maps of size 64x64 for a total of 20 64x64 2D images. Additionally, it also consists of 13 non-spatial features containing information such as player resources and build queues. The mini-map and screen features were processed by identical 2-layer convolutional neural networks (top two rows) in order to extract visual feature representations of the global and local states of the map, respectively. The non-spatial features were processed through a fully-connected layer with a non-linear activation. These three outputs were then concatenated to form the full state-space representation for the agent.
The actions in StarCraft II are compound actions in the form of functions that require arguments and specifications about where that action is intended to take place on the screen. For example, an action such as “attack” is represented as a function that would require the attack locations on the screen. The action space consists of the action identifier (i.e., which action to run), and two spatial actions ( and
) that are represented as two vectors of lengthreal-valued entries between and .
The architecture of the A3C agent we use is similar to the Atari-net agent Mnih et al. (2015), which is an A3C agent adapted from Atari to operate on the StarCraft II
state and actions space. We make one slight modification to this agent and add a long short-term memory (LSTM) layerHochreiter and Schmidhuber (1997), which adds memory to the model and improves performance Mnih et al. (2016). The complete architecture of our A3C agent is shown in Fig. 6.
The A3C models were trained with 20 parallel actor-learners using 8,000 simulated battles against a built-in StarCraft II bot operating on hand crafted rules. Each trained model was tested on 100 rollouts of the agent on the TigerClaw scenario. The models were compared against a random baseline with randomized actions as well as a human player playing 10 simulated battles against the StarCraft II bot. Fig.7 show plots of total episode reward and number of Blue Force casualties during the evaluation rollouts. We see that the AI commander has not only achieved comparable performance compared to a human player, but has also performed slightly better at the task, while also reducing Blue Force casualties.
3.2 Case Study: Reinforcement Learning using OpSim: a Military Simulator
The TigerClaw scenario, as described in the previous case study, was also implemented in the OpSim Surdu et al. (1999) simulator. OpSim is a decision support tool developed by Cole Engineering Services Inc. (CESI) that provides planning support, mission execution monitoring, mission rehearsal, embedded training, and mission execution monitoring and re-planning. OpSim integrates with SitaWare C4I command and control, a critical component of Command Post Computing Environment33endnote: 3Command Post Computing Environment (CPCE) webpage: https://peoc3t.army.mil/mc/cpce.php (CPCE) fielded by PEO C3T allowing all levels of command to have shared situational awareness and coordinate operational actions, thus making it an embedded simulation that connects directly to operational mission command.
OpSim is fundamentally constructed as an extensible Service Oriented Architecture (SOA) based simulation and is capable of running faster than current state of the art simulation environments such as One Semi-Automated Forces44endnote: 4One Semi-Automated Forces (OneSAF) webpage: https://www.peostri.army.mil/OneSAF (OneSAF) Wittman Jr and Harrison (2001); Parsons et al. (2005), MAGTF Tactical Warfare Simulation55endnote: 5MAGTF Tactical Warfare Simulation (MTWS) webpage: https://coleengineering.com/capabilities/mtws (MTWS) Blais (1994). OpSim is designed to run much faster that wall clock time, and can run 30 replications of the TigerClaw mission, which would take two hundred and forty hours if run serially in real time. Output of a simulation plan in OpSim include an overall ranking of Blue Force plans based on criteria such as ammunition expenditure, casualties, equipment loss, fuel usage, and others. The OpSim tool however was not originally designed for AI applications and had to be adapted by incorporating interfaces to run reinforcement learning algorithms.
An OpenAI Gym Brockman et al. (2016) interface was developed to expose simulation state and offer simulation control to external agents with the ability to supply actions for select entities within the simulation, as well as the amount of time to simulate before responding back to the interface. The observation space consists of 17 features vector where the observation space is partially observable based on each entities’ equipment sensors. Unlike the StarCraft II environment, our OpSim-based prototype of an artificial C2 agent currently does not use image inputs or spatial features from the screen images. The action space primarily consists of simple movements and engagement attacks, as shown below:
Observation space: damage state, location, location, equipment loss, weapon range, sensor range, fuel consumed, ammunition consumed, ammunition total, equipment category, maximum speed, perceived opposition entities, goal distance, goal direction, fire support, taking fire, engaging targets.
Action space: no operation, move forward, move backward, move right, move left, speed up, slow down, orient to goal, halt, fire weapon, call for fire, react to contact.
Reward function: friendly damaged (), friendly destroyed (), enemy damaged (), enemy destroyed (), from goal destination at every step.
3.2.1 Experimental Results
Two types of “artificial commanders” were developed for the OpSim environment. The first is based on an expert designed rule engine provided as part of OpSim and developed by Military Subject Matter Experts (SMEs) using military doctrinal rules. The second is a reinforcement learning-based long short-term memory (LSTM) Hochreiter and Schmidhuber (1997) deep neural network with a multi-input and multi-output trained with Advantage Actor Critic (A2C) algorithm Mnih et al. (2016). OpSim’s custom Gym interface supports multi-agent training where each force can use an either rule or learning-based commander.
The policy network was trained on a high performance computing facility with 39 parallel workers collecting 212,000 simulated battles in 44 hours. The trained models were evaluated with 100 rollout simulation results using the frozen policy at a checkpoint with the highest mean reward for the Blue Force, in this case, a rolling average of for the Blue Force policy mean reward and for the Red Force policy mean reward. Analysis of 100 rollout, as seen in Figure 8, show that the reinforcement learning-based commander minimizes Blue Force casualties from about 4 to 0.4, on average, when compared to the rule-based commander and increases Red Force casualties from 5.4 to 8.4, on average, in the same comparison. This outcome is reached by employing a strategy to engage using only combat armor companies and fighting infantry companies. The reinforcement learning-based commander has learned a strategy to utilize Blue Force’s (BLUFOR) most lethal units with Abrams and Bradleys vehicles while protecting vulnerable assets from engaging with the Red Force (OPFOR), as seen in a snapshot of the beginning and end of one rollout shown in Figure 9.
4 Challenges and Future Applications with Virtual Reality
Transition to practice is a common and major challenge. An approach that might reduce challenges of transitioning AI-based C2 tools – as we are beginning to explore in the cases studies described above – is the adaptation of games and simulators with built-in AI capabilities to enable course of action (COA) development and wargaming. This path provides the commander and staff with opportunities to rapidly explore in great detail multiple friendly courses of action (COAs) against multiple enemy COAs, capability or asset allocation and employment, terrain or environmental changes, and various visualizations to better understand how an adversary might respond to selected COAs in a simulated battle. Therefore, the inclusion of AI enabled games into current military practices might accelerate, supplement, or otherwise enhance current practices in COA and wargaming development and decisions.
Another challenge is operator interfaces that often fail to gain end-user acceptance. To this end, another capability potentially adapted from the gaming industry are head-mounted displays (HMD) which afford users the ability to interact with content in virtual reality (VR), augmented reality (AR), and mixed reality (MR) settings - collectively XR. Traditionally, these technologies have been used in industry to provide immersive experiences for storytelling, to enable remote real-time collaboration in shared virtual spaces, and to augment natural perception with holographic overlays. For real-world military applications, XR technologies can be used to provide a more intuitive display of multi-domain information, enable distributed collaboration, and to enhance situational awareness and understanding. While the Army has already invested in XR technologies for training, through the Synthetic Training Environment66endnote: 6Synthetic Training Environment (STE) webpage: https://asc.army.mil/web/portfolio-item/synthetic-training-environment-ste/, and to enhance lethality, through the Integrated Visual Augmentation System (IVAS)77endnote: 7Integrated Visual Augmentation System (IVAS) webpage: https://www.peosoldier.army.mil/Program-Offices/Project-Manager-Integrated-Visual-Augmentation-System/, it has relied heavily on advances made in the gaming industry to do so. In fact, many military XR projects are developed using either Unity or Unreal - the two most popular game development engines.
For this project, initial integration has begun to pull live data from the StarCraft II simulation into a networked XR environment Dennison and Trout (2020). The goal is to allow multiple, decentralized users with the ability to not only examine AI-generated courses of action in an immersive platform, but to interact with the AI and other human decision-makers collaboratively in real time. Prior research (Skarbez et al., 2019) has shown that understanding of spatial information, like that depicted in a three-dimensional war-game, can be enhanced by visualization in an HMD. Furthermore, rendering of the tactical information in a game engine like Unity enables researchers with the ability to precisely control all aspects of how the battlefield is visualized and related user interfaces. Interoperability between this immersive XR interface and other military program of record command and control information systems, such as CPCE discussed in the OpSim case study, is also being investigated to allow explicit comparison between immersive and non-immersive tools.
5 Findings and Discussion
Our research highlights several important aspects of developing a training system for AI algorithms in military-relevant games and battlefield simulators for command and control.
First, with respect to both the adapted game and battlefield simulator, we found that an important aspect is simulation speed. Reinforcement learning algorithms are notorious for requiring millions, or even billions Espeholt et al. (2018); Kapturowski et al. (2018), of data samples from the environment to train the best-performing policies, so fast simulation cycles are essential to achieve results in a reasonable time. In our StarCraft II experiments we were able to simulate 8 million training steps, or 12,800 battles, per day with 35 CPU workers, and that was still not fast enough given that our agents required over 40 million training steps to achieve satisfactory results. Given that the simulations run at least for days, another important aspect for a simulator to be considered for this type of research work is scalability. Advances in computing capacity has propelled the field of reinforcement learning in the past decade and thus a simulator for AI research needs to support distributed computing at scale. Yet another aspect is adaptability. The simulator should give the experiment designer the flexibility to model the desired military task, including terrain, assets, and fog-of-war.
Second, with respect to training reinforcement learning agents in simulators, due to the nature of the exploratory behavior of these reward-driven algorithms, sometimes learning agents are able to exploit loopholes in simulator dynamics and learn unrealistic behaviors, usually unintended by the simulator designer, in order to maximize the received reward Amodei et al. (2016). This type of exploitation is often not detected during training, and are discovered only when carefully investigating the trained policy behavior after the training session. For example, in our early experiments, we found that the agent was able to exploit a simulator deficiency allowing armored assets to cross impassable terrain. In this sense, reinforcement learning can be a tool to detect and strengthen simulators.
Third, with respect to the training procedure, it should comprise of diverse training data, for instance terrain, assets, and location, so the learning artificial commander is able to develop a more robust and general policy and be able to address variability that exists in reality. How diverse and how much variability these scenarios should contain is still an open research question. During training, it also helps to evaluate the learning agent periodically, either against a rule or doctrine-based commander or against another learning artificial commander. Once training is complete and the agent is evaluated in test cases, we found that visualization tools are essential to uncover if the artificial commander learned any unintended behavior although more visualization tools are needed to understand each detail of the commander’s actions and plans.
Finally, we found that reinforcement learning has been able, at least in a set of cases we explored, to outperform both human and doctrine-based baselines without access to prior human knowledge of the task, which becomes even more relevant when task complexity increases and human-dictated rules are more difficult to specify. However, there are still open questions on how human-like the strategies learned are, or how human-like they need to be if that is the case. Another concern is that the reinforcement learned policies are willing to sacrifice assets in the battle if that leads to a higher reward value at the end of the scenario, which may not be an acceptable decision for a human commander. We address this issue when handcrafting the reward function by penalizing the agent for assets lost, however, given that the reward function is also composed of additional goals, this solution is non-trivial to balance in practice Amodei et al. (2016).
As artificial intelligence (AI) algorithms have been successfully learning how to solve complex games, their impact on military applications is probably imminent. We envision the need to develop agile and adaptive AI support tools for the future battlefield and multi-domain operation scenarios, under the main assumption that the future flow of information and speed of operation will likely exceed the capabilities of the current human staff if the command and control processes remain largely manual. This includes leveraging AI algorithms to analyze battlefield information from multiple sources, from both red and blue forces, and correctly identify and exploit emerging windows of superiority.
In this research work, we explore two case studies on modeling a partially realistic mission scenario, and training artificial commanders that leverage existing (with some adaptations) off-the-shelf deep reinforcement learning algorithms. We show that such trained algorithms can outperform significantly both human and doctrine-based baselines without using any pre-coded expert knowledge, and learning effective behaviors entirely from experience. Furthermore, we formulate and empirically confirm a set of features and requirements that a game or simulation engine must meet in order to serve as a viable platform for research and development of reinforcement learning tools for C2.
Portions of the Related Work, StarCraft II and OpSim case studies also appear in the DEVCOM Army Research Laboratory report ARL-TR-9192 Narayanan et al. (2021).
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research was sponsored by the Army Research Laboratory and was accomplished partly under Cooperative Agreement [W911NF-20-2-0114]. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
- 1 Unreal Engine game engine: https://www.unrealengine.com/.
- 2 Unity game engine: https://unity.com/
- 3 Command Post Computing Environment (CPCE) webpage: https://peoc3t.army.mil/mc/cpce.php
- 4 One Semi-Automated Forces (OneSAF) webpage: https://www.peostri.army.mil/OneSAF
- 5 MAGTF Tactical Warfare Simulation (MTWS) webpage: https://coleengineering.com/capabilities/mtws
- 6 Synthetic Training Environment (STE) webpage: https://asc.army.mil/web/portfolio-item/synthetic-training-environment-ste/
- 7 Integrated Visual Augmentation System (IVAS) webpage: https://www.peosoldier.army.mil/Program-Offices/Project-Manager-Integrated-Visual-Augmentation-System/
-  AlphaDogfight trials foreshadow future of human-machine symbiosis. Note: https://www.darpa.mil/news-events/2020-08-26Accessed: 2021-05-17 Cited by: §2.
-  AlphaDogfight trials go virtual for final event. Note: https://www.darpa.mil/news-events/2020-08-07Accessed: 2021-05-17 Cited by: §2.
- Concrete problems in ai safety. arXiv preprint arXiv:1606.06565. Cited by: §5, §5.
- Army doctrine publication no. 3-0: operations. Washington, DC: Headquarters, Department of the Army. Cited by: §2.
- Adapting the predator-prey game theoretic environment to army tactical edge scenarios with computational multiagent systems. arXiv preprint arXiv:1807.05806. Cited by: §2.
- Multi-agent collaboration with ergodic spatial distributions. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications II, Vol. 11413, pp. 114131N. Cited by: §2.
- Multi-agent coordination profiles through state space perturbations. In 2019 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 249–252. Cited by: §2.
- Reinforcement learning framework for collaborative agents interacting with soldiers in dynamic military contexts. In Next-Generation Analyst VI, Vol. 10653, pp. 1065303. Cited by: §2.
- Coordination-driven learning in multi-agent problem spaces. arXiv preprint arXiv:1809.04918. Cited by: §2.
- Marine air ground task force (magtf) tactical warfare simulation (mtws). In Proceedings of Winter Simulation Conference, pp. 839–844. Cited by: §3.2.
- Developing combat behavior through reinforcement learning in wargames and simulations. In 2020 IEEE Conference on Games (CoG), pp. 728–731. Cited by: §2.
- Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §3.2.
- Superhuman ai for heads-up no-limit poker: libratus beats top professionals. Science 359 (6374), pp. 418–424. Cited by: §1.
- War gamers handbook: a guide for professional war gamers. Technical report Naval War College Newport United States. Cited by: §2.
- On wargaming. Cited by: §1, §2.
- Deep blue. Artificial intelligence 134 (1-2), pp. 57–83. Cited by: §1.
- Marine corps doctrinal publication 1-0. Marine Corps Operations. Cited by: §2.
- The accelerated user reasoning for operations, research, and analysis (aurora) cross-reality common operating picture. Technical report Combat Capabilities Development Command Adelphi. Cited by: §4.
- Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416. Cited by: §3, §5.
- Alpha C2–an intelligent air defense commander independent of human decision-making. IEEE Access 8 (), pp. 87504–87516. External Links: Cited by: §2.
- Efficiently combining human demonstrations and interventions for safe training of autonomous systems in real-time. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2462–2470. Cited by: §2.
- Department of defense dictionary of military and associated terms. Technical report Joint Chiefs Of Staff Washington. Cited by: §2.
- TStarBot-x: an open-sourced and comprehensive study for efficient league training in starcraft ii full game. arXiv preprint arXiv:2011.13729. Cited by: §1, §2.
- Rainbow: combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.1.4, §3.2.1.
- Behind deep blue: building the computer that defeated the world chess champion. Princeton University Press. Cited by: §1.
- Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, Cited by: §3, §5.
- ARO year in review 2020. Technical report Army Research Office. External Links: Cited by: §2.
- COMPOEX technology to assist leaders in planning and executing campaigns in complex operational environments. Technical report DEFENSE ADVANCED RESEARCH PROJECTS AGENCY ARLINGTON VA. Cited by: §2.
- Toward practical knowledge-based tools for battle planning and scheduling. In AAAI/IAAI, pp. 894–899. Cited by: §2.
- Assessing mission impact of cyberattacks: toward a model-driven paradigm. IEEE Security & Privacy 15 (5), pp. 65–74. Cited by: §2.
- Advanced technology concepts for command and control. Xlibris Corporation. Cited by: §2.
- Aircraft in warfare: the dawn of the fourth arm. Constable limited. Cited by: §2.
- RLlib: abstractions for distributed reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3053–3062. External Links: Cited by: §3.1.
- Lanchester combat models. arXiv preprint math/0606300. Cited by: §2.
- Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §3.1.4, §3.1.4, §3.2.1.
- Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
- Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §1, §3.1.4.
- First-Year Report of ARL Director’s Strategic Initiative (FY20-23): Artificial Intelligence (AI) for Command and Control (C2) of Multi Domain Operations. Technical report Technical Report ARL-TR-9192, Adelphi Laboratory Center (MD): DEVCOM Army Research Laboratory (US). Cited by: §3.1, §6.
- Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1, §2.
- Reading the mind of the enemy: predictive analysis and command effectiveness. Technical report SOLERS INC ARLINGTON VA. Cited by: §2.
- Onesaf: a next generation simulation modeling the contemporary operating environment. In Proceedings of Euro-simulation interoperability workshop, Cited by: §3.2.
- The art of wargaming: a guide for professionals and hobbyists. Naval Institute Press. Cited by: §1.
- Military decisionmaking process: lessons and best practices. Technical report Center for Army Lessons Learned Fort Leavenworth United States. Cited by: §2.
- Mastering atari, go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609. Cited by: §1.
- High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §2.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.
- AI-enabled wargaming in the military decision making process. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications II, Vol. 11413, pp. 114130H. Cited by: §2.
- AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, External Links: Cited by: §2.
- Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484–489. Cited by: §1.
- Mastering the game of go without human knowledge. nature 550 (7676), pp. 354–359. Cited by: §1.
- Immersive analytics: theory and research agenda. Frontiers in Robotics and AI 6, pp. 82. External Links: Cited by: §4.
- Flightmare: a flexible quadrotor simulator. In Conference on Robot Learning, Cited by: §2.
- 2525C. Common Warfighting Symbology 1. Cited by: §3.1.
- Tstarbots: defeating the cheating level builtin ai in starcraft ii in the full game. arXiv preprint arXiv:1809.07193. Cited by: §1, §2.
- OpSim: a purpose-built distributed simulation for the mission operational environment. SIMULATION SERIES 31, pp. 69–74. Cited by: §3.2.
- Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1, §2.
- Starcraft ii: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. Cited by: §1, §3.1, §3.1, §3.
-  Exploring the Feasibility and Utility of Machine Learning-Assisted Command and Control. RAND 1. Cited by: §2.
-  Exploring the Feasibility and Utility of Machine Learning-Assisted Command and Control. RAND 2. Cited by: §2.
- SCC: an efficient deep reinforcement learning agent mastering the game of starcraft ii. arXiv preprint arXiv:2012.13169. Cited by: §1, §2.
- Lanchester systems. Technical report NAVAL POSTGRADUATE SCHOOL MONTEREY CA. Cited by: §2.
- Grounding natural language commands to starcraft ii game states for narration-guided reinforcement learning. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, pp. 110060S. Cited by: §1.
- A narration-based reward shaping approach using grounded natural language commands. International Conference on Machine Learning (ICML), Workshop on Imitation, Intent and Interaction. Cited by: §1, §3.1.4.
- OneSAF: a product line approach to simulation development. Technical report MITRE CORP ORLANDO FL. Cited by: §3.2.
- Algorithmically identifying strategies in multi-agent game-theoretic environments. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, pp. 1100614. Cited by: §2.