Deep reinforcement learning (RL) has yielded many recent successes on complex problems, ranging from video games to self-driving vehicles, robotic manipulators, and various other simulated and real scenarios [12, 1, 8]. RL, however, is a trial-and-error method that is inherently sample inefficient, converges slowly, and often requires large computational and data resources to learn reliable policies. Its trial-and-error nature results in random behavior at the beginning of training, which makes RL poorly suited for tasks that require stronger performance guarantees, such as physically embodied tasks. Sample inefficiency is even more problematic in tasks with large state and action spaces or relatively long time horizons, which can require tens of millions of training samples [2, 11].
One approach for overcoming these limitations is to learn from demonstrations of the desired task provided by expert policies, which can come either from another autonomous agent or directly from a human. Learning from human demonstrations via direct behavior cloning can yield high-performance policies with significantly less computation time, provided that the algorithm has access to high-quality demonstration data covering the scenarios the agent is most likely to encounter during operation. In real-world settings, however, this is often not the case: expert data is limited, and most imitation learning techniques therefore fail when the agent encounters scenarios that never appeared in the training data, a consequence of the distributional drift problem. Since we want our artificial intelligence (AI) agents to learn policies robust enough to handle situations that were not demonstrated by the human expert, relying solely on demonstrations that cover the span of possible scenarios is simply not feasible.
One possible way to improve the learning efficiency of RL agents is to break a complex task down into a series of easier problems that together form a curriculum. Curriculum learning is a relatively new technique designed to speed up reinforcement learning by adjusting the difficulty of the task according to the agent’s current capabilities. It draws inspiration from the curricula that humans use on a daily basis to teach complex tasks and concepts. In reinforcement learning, curriculum learning has shown great promise in improving the speed and effectiveness of RL agents [15, 14, 9]. The main challenge is how to design a proper curriculum for a complex problem. Our solution is to leverage Automatic Curriculum Learning (ACL) from human demonstrations, in which we attempt to extract a reasonable curriculum automatically from expert human demonstrations of the task. Using a human demonstration as an implicit curriculum, we incrementally allow the agent to learn from different portions of the demonstration, starting from a simpler task (i.e., near the end of the demonstration) and gradually building the agent’s ability to perform more and more complex tasks.
In this paper, we extend ACL, previously demonstrated only on simple Atari tasks, by using a single human demonstration to define an automatic curriculum that guides the exploration of an actor-critic RL agent solving complex tasks within the challenging, high-dimensional state-space of StarCraft II, as illustrated in Fig. 1. We train deep reinforcement learning agents that can command multiple heterogeneous actors, with the starting positions and overall difficulty of the task controlled by that single demonstration. Our results show that an agent trained via ACL outperforms state-of-the-art deep reinforcement learning baselines and matches the performance of the human expert in a simulated command and control task in StarCraft II modeled on a real military scenario.
2.1 StarCraft II Environment
StarCraft II poses a number of difficult challenges for AI algorithms, which makes it a suitable simulation environment for experimenting with and developing deep reinforcement learning agents. For example, the game has notoriously complex state and action spaces, can last tens of thousands of time-steps, can involve thousands of actions selected in real time, and captures uncertainty through partial observability, or “fog-of-war”. Further, the game has multiple heterogeneous assets, which makes learning infeasible for most traditional off-the-shelf RL techniques.
In this paper, we use a custom-made StarCraft II map called “TigerClaw” that we developed in previous work [13, 5]. TigerClaw is a tactical command and control scenario in which Blue Force units must engage and eliminate the Red Force units that currently occupy a neutral city. To facilitate training our RL agents, we developed a custom gym wrapper around DeepMind’s StarCraft II Learning Environment (SC2LE). SC2LE exposes Blizzard Entertainment’s StarCraft II Machine Learning API as a reinforcement learning environment; it provides access to StarCraft II, its associated map editor, and an interface through which RL agents receive observations and send actions. To run off-the-shelf deep reinforcement learning algorithms, we use RLlib, a library that provides scalable software primitives and high-quality reinforcement learning algorithms, running on a high-performance computing system.
2.1.1 Task: The TigerClaw Scenario
In TigerClaw, the Blue Force’s goal is to cross a dry riverbed (wadi), neutralize the Red Force, and control certain geographic locations. Fig. 2 shows the overall objective of the TigerClaw scenario. These objectives are encoded in the game score, which reinforcement learning agents use as a baseline for comparison across different neural network architectures and reward-driving attributes.
2.1.2 State and Action spaces
The original StarCraft II state-space consists of approximately 20 images of size 64 x 64 (13 screen feature layers and 7 mini-map feature layers), which include categorical features ranging from unit type to geographical path traversal information. In this tactical version of the StarCraft II mini-game, we use a custom state-space consisting of two representations: (i) an image representation and (ii) a vector representation. The image representation is a single 256 x 256 image of the entire battlefield that includes all of the relevant information from the original feature maps as well as terrain information, as seen in Fig. (c). The vector representation is a set of non-spatial features that include all of the information about the game and units, including unit type, position, health, player resources, and build queues.
The actions in StarCraft II are compound actions in the form of functions that require arguments specifying which unit should act and where on the screen the action should take place. For example, once a unit is selected, an action such as “attack” is represented as a function that requires the attack location on the screen. The action space consists of the action identifier (i.e., which action to run) and two spatial actions (x and y) that represent the pair of screen coordinates where the action should be executed. This results in a very large action space that is impractical to represent in flattened form. To reduce this complexity, we defined cardinal actions that divide the map into a coarse grid. The possible actions for the x and y coordinates are thus represented as two vectors with real-valued entries, corresponding to left, center, and right, and to top, middle, and bottom, respectively.
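The cardinal action encoding can be illustrated with a minimal sketch. It assumes three cells per axis (matching the left/center/right and top/middle/bottom labels) and a 256 x 256 screen borrowed from the image representation; both constants and all function names are assumptions made for illustration, not the authors' implementation. Each spatial vector is decoded by taking its highest-scoring entry and mapping that cell to its center pixel.

```python
from typing import Sequence, Tuple

GRID_SIZE = 3      # assumed: three cells per axis
SCREEN_SIZE = 256  # assumed screen resolution in pixels


def cell_center(cell: int, grid_size: int = GRID_SIZE,
                screen_size: int = SCREEN_SIZE) -> int:
    """Pixel coordinate at the center of a grid cell along one axis."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (2 * cell + 1) * screen_size // (2 * grid_size)


def decode_spatial_action(x_scores: Sequence[float],
                          y_scores: Sequence[float]) -> Tuple[int, int]:
    """Pick the highest-scoring cell on each axis and map it to the screen."""
    x_cell = max(range(len(x_scores)), key=lambda i: x_scores[i])
    y_cell = max(range(len(y_scores)), key=lambda i: y_scores[i])
    return cell_center(x_cell), cell_center(y_cell)


# "center" column, "top" row
print(decode_spatial_action([0.1, 0.7, 0.2], [0.9, 0.05, 0.05]))  # (128, 42)
```

Under these assumptions the spatial part of each action has 3 x 3 = 9 possible targets instead of one per screen pixel, which is the reduction the cardinal actions are meant to achieve.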
2.1.3 Game Score and Reward Implementation
The reward function is an important component of reinforcement learning, as it ultimately controls how the agent reacts to environmental changes by assigning a positive or negative reward to each situation. We developed a custom reward function for the TigerClaw scenario that assigns points for four events: the Blue Force crossing the wadi, the Blue Force retreating back, a Red Force unit being destroyed, and a Blue Force unit being destroyed. The overall goal of this reward function was to incentivize the elimination of the opposing force while preserving Blue Force units from being destroyed.
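A reward of this shape can be sketched as a weighted sum of event counts. The point values below are hypothetical placeholders, since the values used in the paper are not given in this text, and the negative sign on retreating is likewise an assumption; the names `RewardWeights` and `step_reward` are introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # All values are hypothetical placeholders, not the paper's settings.
    wadi_crossed: float = 10.0
    retreated: float = -10.0          # assumed to be a penalty
    red_unit_destroyed: float = 10.0
    blue_unit_destroyed: float = -10.0


def step_reward(crossed: int, retreated: int, red_destroyed: int,
                blue_destroyed: int,
                w: RewardWeights = RewardWeights()) -> float:
    """Sum the event counts observed in one step, each scaled by its weight."""
    return (crossed * w.wadi_crossed
            + retreated * w.retreated
            + red_destroyed * w.red_unit_destroyed
            + blue_destroyed * w.blue_unit_destroyed)


# One crossing, two Red units destroyed, one Blue unit lost
print(step_reward(crossed=1, retreated=0, red_destroyed=2, blue_destroyed=1))  # 20.0
```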
2.2 Reinforcement Learning Agent
Next, we developed and implemented a modern deep reinforcement learning agent into the StarCraft II environment to achieve the goal of the TigerClaw scenario discussed in Section 2.1.1. We designed the action-space of our agent by adapting the traditional control approach of StarCraft II, where the player performs the following steps to control each unit: (1) Select a unit using their mouse pointer, (2) Select an action based on the unit’s possible actions and observed state, and (3) Select the coordinate in which the unit will execute the selected action.
2.2.1 Learning Algorithm
To train our deep reinforcement learning agent, we used the Asynchronous Advantage Actor Critic (A3C) algorithm, a state-of-the-art on-policy RL algorithm that has succeeded on numerous challenging environments. A3C is a distributed extension of the Advantage Actor Critic (A2C) algorithm. Just like A2C, A3C maintains a policy and an estimate of the value function. In A3C, multiple copies of the actor policy are distributed across multiple instances of the environment, so that exploration and training proceed in parallel across all actors and learning speeds up.
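To make the role of the value estimate concrete, the following is a minimal sketch of the advantage computation that actor-critic methods of this family typically perform on each rollout segment. The discount factor and the function names are assumptions chosen for illustration; the paper does not state its hyperparameters in this text.

```python
from typing import List, Sequence


def n_step_returns(rewards: Sequence[float], bootstrap_value: float,
                   gamma: float = 0.99) -> List[float]:
    """Discounted returns for a rollout segment, bootstrapped from the critic's
    value estimate of the state that follows the last reward."""
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards: Sequence[float], values: Sequence[float],
               bootstrap_value: float, gamma: float = 0.99) -> List[float]:
    """Advantage of each step: n-step return minus the critic's value estimate."""
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    return [g - v for g, v in zip(returns, values)]


print(advantages([1.0, 1.0], [2.0, 3.0], bootstrap_value=4.0, gamma=0.5))  # [0.5, 0.0]
```

A positive advantage increases the probability of the action that was taken, while the critic is regressed toward the returns; each parallel actor-learner would compute these quantities on its own rollouts.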
Our A3C model was trained with parallel actor-learners on separate threads for millions of timesteps (thousands of simulated battles) against a built-in StarCraft II bot operating on hand-crafted rules. Each trained model was tested on 100 rollouts of the agent on the TigerClaw scenario. The curriculum learning models are compared against a traditional baseline A3C approach, with details in the Experiments and Results section.
2.2.2 Network Architecture
As shown in Fig. (a), our deep reinforcement learning agent is represented by a multi-input, multi-output neural network adapted from Waytowich et al. (2019) and capable of handling the complex state-space and multi-faceted action-space of StarCraft II.
The state observations, as described in Section 2.1.2, are fed through a feature extraction network, shown in Fig. (b), to identify all of the relevant information available. Next, the resulting state representation is fed into the Baseline network, shown in Fig. (c), which outputs the action and coordinates for the control group. The agent then executes the selected action in the TigerClaw environment and receives a reward and a new state observation. The details of the state-space, feature extraction network, and baseline network are described in the following subsections. As shown in Fig. (b).i, the screen features were processed through three Convolutional Neural Network (CNN) layers with ReLU activation functions in order to extract visual feature representations of the global and local states of the map. The non-spatial features were processed through a fully-connected layer with a non-linear activation, as shown in Fig. (b).ii. These two outputs were then concatenated to form the full state-space representation for the agent.
2.3 Curriculum Learning
Curriculum learning is a method for training machine learning models by gradually increasing the complexity of the task to be solved, as well as of the data samples used during training, in order to improve training efficiency. In reinforcement learning, curriculum learning is a vital component that ensures the agent receives positive rewards even during the early stages of training, when the policy is not fully developed. This technique has enabled reinforcement learning agents to control a quadrupedal robot walking on challenging terrain [6, 14] and completing hiking trails (policies were trained first on flat terrain, then progressively moved to more challenging terrain with slopes and obstacles), to control realistic bipedal robots in simulation walking over stepping stones (the curriculum is used to generate courses of different complexities), and to traverse a 2D simulated obstacle course with different levels. Self-play, or league-based training, is another form of curriculum learning in which the agent learns by playing against previously trained versions of itself. Agents trained under this strategy have had notable success in competitive environments. It was previously used to train an agent to play the full game of StarCraft II at grandmaster level, which translates to ranking above 99.8% of the officially ranked human players.
2.3.1 Automatic Curriculum Generation
In automatic curriculum generation (ACL), the difficulty of the task is controlled automatically based on the current skill level of the agent during training. Self-play can be thought of as a form of ACL, since the agent learns by continuously playing against itself. This results in a self-sustaining loop in which the agent gets better as its opponent (i.e., the agent itself) also gets better. For our task, however, we are unable to use traditional self-play, since the Blue Force and Red Force are heterogeneous and composed of different StarCraft units. Instead, our approach revolves around training an RL model using a curriculum defined by a single human demonstration. We start the agent near the end of the human demonstration and then progressively make the task more difficult by rolling back through the demonstration as the agent improves, as illustrated in Fig. 4.
In detail, the procedure follows this scheme: 1) We record the sequence of states, actions, and rewards (i.e., the trajectory) observed while the human user performs the given task. 2) Given this trajectory, the automatic curriculum generation starts by letting the agent complete only the last few steps of the task, as demonstrated by the human. This is done by first sampling a state from the end of the human demonstration, from which the learning agent executes its current policy. We record the trajectory resulting from this rollout and train the policy by comparing the rewards received by the agent in this trajectory to those observed during the demonstration. We repeat this step until the agent is proficient at this part of the curriculum. We consider this an easier task because the agent only needs to perform a few good actions to complete it. 3) Once the agent has achieved performance comparable to the human’s at the current stage of the curriculum, we increase the difficulty of the task by setting the initial states of the agent further from the end of the task along the trajectory generated by the human. 4) The agent continues following this scheme until it becomes proficient at solving the task from all states demonstrated by the human, completing the curriculum.
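One way to place the agent at a state along the demonstration is to replay the recorded human actions up to a handover point and give the agent control of the remaining steps. The sketch below illustrates only that split; it assumes the demonstration is stored as a flat list of actions, and the helper name `split_demonstration` is hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

A = TypeVar("A")


def split_demonstration(demo_actions: Sequence[A],
                        start_fraction: float) -> Tuple[List[A], int]:
    """Return the demonstration prefix to replay and the number of demonstrated
    steps that remain for the agent.

    start_fraction = 1.0 replays the whole demonstration (easiest case);
    start_fraction = 0.0 hands control to the agent from the first step.
    """
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must lie in [0, 1]")
    handover = int(round(start_fraction * len(demo_actions)))
    prefix = list(demo_actions[:handover])
    return prefix, len(demo_actions) - handover


demo = list(range(20))  # stand-in for 20 recorded human actions
prefix, remaining = split_demonstration(demo, 0.95)
print(len(prefix), remaining)  # 19 1
```

Early in the curriculum almost all of the demonstration is replayed and the agent only has to finish the task; as the start fraction decreases, the replayed prefix shrinks and the agent must solve more of the task on its own.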
For our ACL agent, the point in the human-demonstrated trajectory where the curriculum starts is represented by a Gaussian distribution whose mean ranges from 0 to 1 (0 to 100% of the trajectory) and whose standard deviation sets the spread around that mean. During each rollout of the task, the starting point of the agent is sampled from this distribution and clipped to between 0 and 100%. This helps the agent experience diverse starting locations instead of overfitting to a single point. In addition, ACL acts as a way to guide the exploration of an RL agent for efficient learning.
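The sampling step can be sketched in a few lines. The standard deviation of 0.05 used in the example is a hypothetical value chosen for illustration, and `sample_start_fraction` is a name introduced here.

```python
import random


def sample_start_fraction(mean: float, std: float, rng: random.Random) -> float:
    """Draw a starting point from a Gaussian and clip it to [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(mean, std)))


rng = random.Random(0)  # seeded only to make the sketch reproducible
starts = [sample_start_fraction(0.95, 0.05, rng) for _ in range(5)]
print(all(0.0 <= s <= 1.0 for s in starts))  # True
```

Clipping matters most at the two ends of the curriculum, where a sizeable share of the Gaussian mass would otherwise fall outside the trajectory.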
3 EXPERIMENTS and RESULTS
3.1 Experimental Conditions
In order to understand the benefits of ACL, we conducted the following four experiments: (i) Automatic Curriculum Learning with Image Representation, (ii) Automatic Curriculum Learning with Vector Representation, (iii) No Automatic Curriculum Learning (Traditional RL) with Image Representation, and (iv) No Automatic Curriculum Learning (Traditional RL) with Vector Representation. The first and third experiments follow the network architecture shown in Fig. (c), while the second and fourth experiments remove the image features from the input. Thus, we aim to understand (a) whether ACL achieves more reward than traditional reinforcement learning and (b) whether extending the observation state space to include terrain information in image format improves learning.
For each experimental condition, we trained our agent on the DoD High Performance Computing system with 35 parallel actor-learners for millions of timesteps (thousands of simulated battles) against a built-in StarCraft II bot operating on hand-crafted rules. The traditional RL algorithm resets each episode so that the agent starts at the beginning of the game, while the ACL experiments sample the start of each episode from a Gaussian distribution initially centered at 95% of the human-demonstrated trajectory. That is, at the start of each episode the battle is played out according to the sampled percentage of the human demonstration’s actions, after which the ACL agent takes over. Once the ACL agent finishes at least 50 battles with a score similar to that of the human demonstration, the mean of the Gaussian distribution is rolled back through the curriculum in 20% increments, giving the ACL agent progressively more difficult tasks.
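The progression rule described above can be sketched as a small scheduler. The initial mean of 0.95, the step of 0.20, and the 50-battle requirement come from the text; the tolerance that defines a "similar" score, the choice to count qualifying battles cumulatively rather than consecutively, and the class name are assumptions. The score test also assumes a positive human score.

```python
class CurriculumSchedule:
    """Rolls the mean starting point back once enough battles match the human score."""

    def __init__(self, human_score: float, start_mean: float = 0.95,
                 step: float = 0.20, required_battles: int = 50,
                 tolerance: float = 0.10):
        self.human_score = human_score
        self.mean = start_mean
        self.step = step
        self.required_battles = required_battles
        self.tolerance = tolerance  # hypothetical: within 10% counts as "similar"
        self.successes = 0

    def record_battle(self, score: float) -> float:
        """Count a qualifying battle, advance the curriculum when ready,
        and return the current mean starting point."""
        if score >= (1.0 - self.tolerance) * self.human_score:
            self.successes += 1
        if self.successes >= self.required_battles and self.mean > 0.0:
            self.mean = max(0.0, self.mean - self.step)
            self.successes = 0
        return self.mean


schedule = CurriculumSchedule(human_score=100.0)
for _ in range(50):
    mean = schedule.record_battle(95.0)
print(round(mean, 2))  # 0.75
```

With these settings the mean moves through 0.95, 0.75, 0.55, 0.35, 0.15, and finally 0.0, at which point the agent starts from the beginning of the task.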
Our first result compares agents that were trained using images (terrain information) within their observed states. Fig. (a) presents the average reward achieved by the agent for experiments (i) and (iii), Automatic Curriculum Learning with Image Representation and No Automatic Curriculum Learning (Traditional RL) with Image Representation, respectively. Initially, the reward achieved using ACL is much lower than without ACL because the agent starts at the end of the curriculum, as defined by the human-demonstrated trajectory. In this case, the agent only needs to eliminate the last couple of units of the opposing force, which does not yield a large reward. Later in training, however, the ACL agent has moved further along the curriculum and has learned to eliminate the opposing force, resulting in a larger reward than without ACL. Furthermore, the reward for the agent without ACL (Traditional RL) increases only slowly, indicating that learning a complex task from the beginning, that is, searching over the full state space of the task, results in an agent that learns more slowly than one guided through a curriculum.
[Table 1: Blue Force Casualties and Red Force Casualties for the Human Demonstration, Autocurriculum RL (Image Rep. and Vector Rep.), and Traditional RL (Image Rep. and Vector Rep.).]
We also evaluated the average number of casualties for the Blue Force and the Red Force at the end of training. As seen in Table 1, learning with ACL results in significantly fewer Blue Force casualties and more Red Force casualties than traditional RL. This also explains why the ACL agent achieves a larger reward, since Blue Force casualties contribute a large negative reward. Additionally, ACL with image representation almost matches the casualty rates of the human demonstration, showing that the curriculum helped guide the agent toward a similar policy.
Next, we compared experiments (ii) and (iv), Automatic Curriculum Learning with Vector Representation and No Automatic Curriculum Learning (Traditional RL) with Vector Representation, respectively, where the agent uses only vector representations for the observed state. As seen in Fig. (b), and similar to the behavior observed in Fig. (a), the reward with ACL surpasses the reward without ACL after a few million training steps. However, the gap between the two experiments is not as large as for the agents trained with image representations. When the agents are trained with only the vector representation of the environment, both end up with similar Blue Force and Red Force casualties, with ACL slightly outperforming traditional RL.
When we compare experiments (i) and (iii) against (ii) and (iv), we note that agents with image representations achieve a larger reward than agents with vector representations, because they suffer fewer Blue Force casualties while inflicting a similar number of Red Force casualties. This indicates that including additional information, i.e., terrain information, allows the agent to learn a better policy. However, this result comes at the cost of additional training time due to the complexity of the observational state space.
To further understand how the ACL policies outperformed traditional RL and how they differed from the human demonstration, we collected data on how the agents commanded each coalition, as represented by distance travelled (Table 2) and health percentage remaining at the end of the battle (Table 3). Among the policies that use terrain information (image representation as input), ACL policies maneuvered their units more than both traditional RL and the human demonstration. For example, aviation units (which are faster and able to attack both ground and air units, but have lower total health capacity) travelled more than five times as far under ACL as under traditional RL, and three times as far as in the human demonstration. All other coalitions travelled at least twice as far under ACL. This could have led to higher fuel usage, but the units were able to eliminate more Red Forces and remain alive. To illustrate this point, Table 3 shows that mortar units commanded by ACL received no damage. On average, all coalitions commanded by ACL finished the battles with a higher percentage of health remaining than in any other condition.
[Table 2: Distance travelled by each coalition for the Human Demonstration, Autocurriculum RL (Image Rep. and Vector Rep.), and Traditional RL (Image Rep. and Vector Rep.).]
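[Table 3: Health percentage remaining at the end of the battle for each coalition, for the Human Demonstration, Autocurriculum RL (Image Rep. and Vector Rep.), and Traditional RL (Image Rep. and Vector Rep.).]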
Curriculum learning is a technique designed to make learning complex problems easier by arranging simple tasks and concepts into a curriculum that improves the speed and efficiency of learning. It has been used in reinforcement learning to improve the sample efficiency of RL training; however, manually designing a curriculum is often challenging. In this paper, we use human demonstrations to generate a curriculum automatically (i.e., automatic curriculum learning (ACL)) and then use that curriculum to speed up reinforcement learning in the complex environment of StarCraft II. We show that our curriculum learning agent significantly improves training speed and performance compared to traditional RL agents in StarCraft II.
In reinforcement learning, the state-space made available to the agent is often a critical factor in determining the complexity of the learning problem. In this paper, we analyzed two different state-space representations for our learning agents: a high-dimensional, image-based representation and a low-dimensional, vector-based representation. The image representation consists of game-screen images of the StarCraft II environment that contain unit information and terrain information. The vector representation is much more compact and contains the entire state of the game in a vectorized format (i.e., unit position, health, etc.); however, it does not contain any terrain information. Generally, there is a trade-off between the higher-complexity image representation and the lower-dimensional vector representation in terms of both training time and overall policy performance. Our results, as seen in Figs. (a) and (b), show that despite the larger state-space, both traditional RL and ACL agents achieve a higher overall reward using the image representation than the vector representation-based agents. Interestingly, as shown in Table 1, the Autocurriculum RL agent trained with images was able to dramatically reduce the number of Blue Force casualties to only 3, compared to the 10-13 casualties that the vectorized representation produced. The results in Table 2 show that the image-based ACL agent uses the aviation units more, as indicated by the distance travelled. These are powerful units, since they are fast and can attack enemy air and ground forces. Although more analysis needs to be performed, we speculate that access to terrain information allows the ACL agent to better follow the human-guided curriculum, which relies on heavy use of the aviation units and ultimately leads to fewer Blue Force casualties.
4.1 Limitations and Future Work
Although we have shown that the ACL agent achieves a higher overall reward and better performance than traditional RL, the current work has several limitations that we highlight here as topics for future studies. Firstly, as with almost all deep reinforcement learning work, training RL agents on these tasks is incredibly difficult and requires extensive hyper-parameter tuning to achieve proper convergence. We found that our approach of using curriculum learning to train RL agents is not immune to this problem.
With curriculum learning, one of the difficulties is deciding when and how the RL agent should progress through the curriculum. One limitation in the current work is that we utilized a rather naive and straightforward curriculum step policy of just setting a fixed performance threshold before allowing the agent to progress to the next step in the curriculum. This leads to problems of the agent potentially getting stuck at a certain part of the curriculum and never progressing because the threshold is too high. This ultimately slows down learning to an extent where curriculum learning is no longer useful or practical. Future work could involve developing more intelligent strategies for traversing the curriculum.
Acknowledgements. This work was supported in part by high-performance computer time and resources from the DoD High Performance Computing Modernization Program. This work was also sponsored by the Army Research Laboratory and was accomplished partly under Cooperative Agreement Number W911NF-20-2-0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
[1] (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
[2] (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
[3] (2019) Causal confusion in imitation learning. CoRR abs/1905.11979.
[4] (2020) The accelerated user reasoning for operations, research, and analysis (AURORA) cross-reality common operating picture. Technical report, Combat Capabilities Development Command, Adelphi.
[5] (2022) On games and simulators as a platform for development of artificial intelligence for command and control. The Journal of Defense Modeling and Simulation, pp. 15485129221083278.
[6] (2020) Learning quadrupedal locomotion over challenging terrain. Science Robotics 5 (47), pp. eabc5986.
[7] (2018) RLlib: abstractions for distributed reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3053–3062.
[8] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[9] (2022) Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics 7 (62), pp. eabk2822.
[10] (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
[11] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[12] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[13] (2021) First-Year Report of ARL Director’s Strategic Initiative (FY20-23): Artificial Intelligence (AI) for Command and Control (C2) of Multi Domain Operations. Technical Report ARL-TR-9192, Adelphi Laboratory Center (MD): DEVCOM Army Research Laboratory (US).
[14] (2021) Learning to walk in minutes using massively parallel deep reinforcement learning. CoRR abs/2109.11978.
[15] (2018) Learning Montezuma’s Revenge from a single demonstration. CoRR abs/1812.03381.
[16] (2021) Curriculum learning: a survey. CoRR abs/2101.10382.
[17] (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354.
[18] (2017) StarCraft II: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.
[19] (2017) StarCraft II: a new challenge for reinforcement learning. CoRR abs/1708.04782.
[20] (2019) Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753.
[21] (2019) A narration-based reward shaping approach using grounded natural language commands. International Conference on Machine Learning (ICML), Workshop on Imitation, Intent and Interaction.
[22] (2020) ALLSTEPS: curriculum-driven learning of stepping stone skills. In Computer Graphics Forum, Vol. 39, pp. 213–224.