ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning

02/15/2019 · by Andrew Silva, et al.

Deep reinforcement learning has seen great success across a breadth of tasks such as in game playing and robotic manipulation. However, the modern practice of attempting to learn tabula rasa disregards the logical structure of many domains and the wealth of readily-available human domain experts' knowledge that could help "warm start" the learning process. Further, learning from demonstration techniques are not yet sufficient to infer this knowledge through sampling-based mechanisms in large state and action spaces, or require immense amounts of data. We present a new reinforcement learning architecture that can encode expert knowledge, in the form of propositional logic, directly into a neural, tree-like structure of fuzzy propositions that are amenable to gradient descent. We show that our novel architecture is able to outperform reinforcement and imitation learning techniques across an array of canonical challenge problems for artificial intelligence.


1 Introduction

Reinforcement learning (RL) has seen great success in many domains, most prominently in game playing (Espeholt et al., 2018) and robotics (Andrychowicz et al., 2018). However, as research examines increasingly complex domains, such as real-time strategy games (Synnaeve & Bessiere, 2016), it is clear that current approaches are not designed to quickly reason over such enormous state and action spaces. Recent success in advancing the state of the art requires incredible amounts of computation, data, and time, which are not readily available for most researchers and practitioners. We aim to reduce the need for voluminous data and compute power by better blending humans and machines.

Human domain experts develop strategies or heuristics that allow them to out-compete even the most advanced computational approaches. Furthermore, standard human factors practices have been developed to solicit this domain knowledge in the form of a condensed set of heuristics or rules. What is missing in current deep RL agents is a mechanism to encode this domain expertise into a sophisticated learning architecture, combining the best of both humans and machines. Yet, apart from using humans as oracles to label training or replay data, which Amershi et al. (2014) note is unreasonable, no such mechanism exists. Furthermore, many real-world domains do not have large, crowd-sourced datasets to leverage, and may have only a handful of human experts. New approaches are needed for intelligent initialization of RL agents that enable early success and efficient exploration, rather than wasting effort on unfruitful exploration or requiring human experts to label large datasets.

Figure 1: A visual example of a ProLoNet. A traditional decision tree is on the left, and the ProLoNet version is on the right. The final output of a ProLoNet actor is a softmax over the sum of the leaf outputs.

We propose a new approach to initialization of RL agents, which we call Propositional Logic Nets (ProLoNets). By incorporating a set of logical propositions for initialization of neural network weights, an RL agent can immediately begin learning helpful actions or strategies, rather than expending hundreds or thousands of CPU hours in a difficult domain without the ability to explore meaningfully. This approach leverages readily available domain knowledge, obtained via standard knowledge solicitation techniques in human factors (Crandall et al., 2006; Newell et al., 1983), while still retaining the ability to learn and improve over time, eventually outperforming the expertise with which it was initialized. By exploiting structure and logical rules that are inherent to nearly all tasks that RL agents must solve, we can bypass early random exploration and expedite an agent's learning in a new domain. We demonstrate that our approach outperforms both a randomly initialized agent and a baseline agent trained by imitation learning with the same rules we used to initialize a ProLoNet, demonstrating the value of intelligent initialization over learning heuristics through imitation learning, even when the imitation learning agents have access to the same heuristics for labeling data.

We make three primary contributions in this work. The first is a novel approach to initializing deep RL agents using a new architecture that we call ProLoNets. This architecture allows for domain expertise to be translated into a deep RL agent's policy before the agent even begins exploring a new domain. The second contribution is the ProLoNet as an architecture for learning, even without intelligent initialization, which proves to be more robust to the need for hyperparameter tuning than a fully-connected baseline. In conjunction with the architecture itself, we present a new loss function for off-policy updates. The third contribution is a new approach to dynamic network growth, which allows our ProLoNet architecture to expand its capacity dynamically and outgrow its initialization. We also include our implementation and experiment code at https://github.com/andrewsilva9/ProLoNets.

As noted in (Gabel & Riedmiller, 2006), many RL works have demonstrated an agent’s ability to quickly learn an impressive or optimal policy, but continued updates or experience will result in significant policy degradation. We demonstrate that our dynamic ProLoNet growth helps to prevent policy degradation, and helps improve performance. We evaluate our architecture in two basic RL domains from the OpenAI Gym (Brockman et al., 2016), namely cart pole (Barto et al., 1983) and lunar lander, and two complex domains, a StarCraft II (SC2) mini-game from the SC2 Learning Environment (SC2LE) (Vinyals et al., 2017), and a full game of SC2 against different levels of the in-game AI, where ProLoNets achieve up to a 100% win rate against the Easy AI with fewer than 100 games of experience.

2 Related Work

Our work is related to several active areas of research including warm starting, efficient exploration for RL agents, imitation learning, and dynamic network growth. We summarize each area briefly below.

2.1 Warm Starting

There has been an increase in researchers investigating ways to improve the initialization of deep neural networks, particularly in the RL domain, where agents can spend a tremendous amount of time without learning anything meaningful before getting fortunate enough to start accruing useful reward signals. Warm starts have been used for RL in healthcare (Zhu & Liao, 2017), as well as in supervised learning for NLP (Wang et al., 2017) and other classification tasks (Hu et al., 2016). While these works have provided interesting insight into the efficacy of warm starts in various domains, they either involve large labeled datasets or require RL agents to solve the same domain repeatedly. In domains where an RL agent will struggle to ever find a solution without a warm start, this is not a practical assumption, nor is it always possible to acquire a large labeled dataset for new domains.

Most closely aligned with our work is the deep jointly-informed neural networks (DJINN) approach (Humbird et al., 2018), which uses a decision tree to initialize a deep network for classification while preserving the decision tree rules for immediately accurate prediction from the network. While DJINN uses a decision tree trained on a target dataset for network initialization, our approach instead seeks to translate an expert policy into a network. This means that our approach does not require a supervised training set in order to construct a decision tree for initialization; instead, we can convert a set of propositional logic rules into a set of neural network weights. The ability to begin with a set of arbitrary rules rather than a pretrained decision tree is important, particularly in RL, because a large dataset of state-action pairs may be unavailable, unreliable, or misleading given covariate shift.

2.2 Efficient Exploration

Research in RL has revealed the value of efficient exploration (Thrun, 1992). Recent work has even found that exploration for its own sake can yield improved results in maximizing extrinsic reward (Burda et al., 2018). However, much of the advancement in RL has involved building out new approaches that scale well with millions of samples (Silver et al., 2017; Espeholt et al., 2018), as advances in compute capability have provided researchers with the ability to cheaply gather hundreds of years’ worth of samples in mere days (OpenAI, 2018). We introduce a method that seeks to reduce the number of samples needed to begin effective learning in complex domains, and that does not require supervision. Rather than spending hundreds of CPU years of experience taking random actions and losing without making positive steps toward reward, our approach can begin with a plausible policy and explore from initial success, allowing ProLoNet agents to meaningfully learn from the first episode.

2.3 Imitation Learning

Outside of RL, imitation learning problems do not always require large datasets for behavior cloning. Gombolay et al. (2016) demonstrate the success of eliciting and mimicking human expert behavior in multi-agent task assignment problems, and Mohseni-Kabir et al. (2015) implement a learning from demonstration system using a hierarchical task network from just a single expert demonstration.

Imitation learning in the RL domain, however, often requires large batch datasets on which to train. Methods such as DAgger (Ross et al., 2011) and ILPO (Edwards et al., 2018) require large labeled datasets, and approaches that combine imitating and exploring, such as LOKI (Cheng et al., 2018), require a pre-trained policy or heuristic to act as an oracle. Even with a pre-trained policy, LOKI still requires extensive domain experience before beginning the reinforcement learning stage. A human can also act as an oracle for imitation learning, but it is not reasonable to expect a human to patiently label replay data for the entirety of an imitation-learning agent’s life (Amershi et al., 2014). While there are many methods for extracting policies or general “rules of thumb” from humans (Newell et al., 1983; Crandall et al., 2006; Annett, 2003), these heuristics or rules must be translated into oracles which can be used to provide labels for imitation learning systems, and then these oracles must be run over large amounts of data. Our approach can leverage the same human factors research for extracting policies from humans, though we translate them directly into an RL agent’s policy and begin RL immediately, sidestepping the imitation learning phase.

2.4 Dynamic Network Growth

Adaptive neural network architectures and dynamic network growth have been areas of interest within the lifelong learning and continual learning communities for many years. Recent work in lifelong learning for multiple tasks (Lee et al., 2017) uses a dynamically expanding network to learn new tasks without catastrophically forgetting (Kirkpatrick et al., 2017) old tasks, adding neurons throughout the network if old-task performance ever suffers while learning new tasks or features.

Xiao et al. (2014) used a hierarchy of trained models arranged in a tree-like structure for incremental learning of new classes over small batches of data. While most related work uses some classification loss or reconstruction loss (Zhou et al., 2012) to determine when a network should be grown, Susan & Dwivedi (2014) use a measure of entropy on new data to determine whether or not input data would be considered anomalous by the network and therefore require an increase in network capacity. We implement an approach to dynamic growth with ProLoNets that is closely related to this work, as we use a measure of leaf-node entropy to estimate when a model is unable to decide on a single action, which acts as our signal to add capacity to the network. The key difference is that our approach to deepening happens offline and without input data, and our entropy metric is calculated directly from the weights of the ProLoNet.

3 Approach

We provide a visual overview of our architecture in Figure 1. First, an expert provides a policy that can be followed via a series of simple checks against the input data. These policies can be solicited through well-established human factors techniques, such as cognitive and hierarchical task analyses (Schraagen et al., 2000; Klein, 2000; Stanton, 2006). This policy is then translated into a set of weights $\vec{w}_n$ and comparator values $c_n$, which represent each rule. Each weight $\vec{w}_n$ determines which input features to consider and how to weight them, and the comparator $c_n$ is used as a threshold for the weighted features. For example, if an expert stated that an important decision criterion was whether or not the first feature of an input vector was greater than 5, we would set $\vec{w}_n = [1, 0, \dots, 0]$ and $c_n = 5$. Each decision node throughout the network is represented as

$$d_n = \sigma\left[\alpha\left(\vec{w}_n^{\,T}\vec{X} - c_n\right)\right] \tag{1}$$

where $\vec{X}$ is the input data, $\sigma$ is the sigmoid function, and $\alpha$ controls how fuzzy the decisions are: the "fuzzier" a decision is, the less deterministic it is. A fuzzy tree allows for more uncertainty in decision making (Yuan & Shaw, 1995), and therefore allows for more exploration, even from an expert initialization. High values of $\alpha$ emphasize the difference between the comparator and the weighted input, pushing the tree to be stricter, while lower values of $\alpha$ encourage a fuzzier tree, with $\alpha = 0$ producing random decisions. We allow $\alpha$ to be a learned parameter that we initialize to a fixed value in our experiments.

After all decision nodes are processed, the output $d_n$ of each node represents the likelihood of that node's condition being met, while $(1 - d_n)$ represents the likelihood of the condition not being met. With these likelihoods, the network then multiplies out the probabilities along the different paths to all of the leaf nodes. Every leaf $\ell$ contains a path $z_\ell$, a set of decision nodes that should be met or not met in order to reach $\ell$, as well as a prior set of weights $\vec{a}_\ell$ over the output actions. For example, a leaf reached by satisfying node $d_1$ and failing node $d_2$ has path probability $z_\ell = d_1 (1 - d_2)$. The likelihood of each action within leaf $\ell$ is then determined by multiplying the probability of reaching leaf $\ell$ by the prior weight of the outputs within leaf $\ell$:

$$\vec{o}_\ell = z_\ell \, \vec{a}_\ell \tag{2}$$

While the ProLoNet paths always yield valid probability distributions, the leaf weights are trainable parameters and therefore do not necessarily maintain output values constrained to valid probability distributions. Therefore, after calculating the outputs for every leaf, the leaves are summed and passed through a softmax function to provide the final probability distribution over all output actions.

In order to illustrate this more practically, we consider the simplest case of a cart pole (Section 4.2.1) ProLoNet with a single decision node. Assume we have solicited the following from a domain expert: "If the cart's position is right of center, move left; otherwise, move right," and that they have told us that cart position is the first input feature and that the center is at 0. We therefore initialize our primary node with $\vec{w} = [1, 0, 0, 0]$ and $c = 0$. We would then specify $\ell_1$ to be a new leaf with a prior of $\vec{a}_1 = [1, 0]$ over the two available actions (favoring "move left"), and $\ell_2$ to be a new leaf with the prior $\vec{a}_2 = [0, 1]$. Finally, we set the path to $\ell_1$ to be $z_1 = d_1$, and the path to $\ell_2$ to be $z_2 = (1 - d_1)$. Then, for each state, the probability distribution over the agent's two actions is a softmax over $z_1\vec{a}_1 + z_2\vec{a}_2$.
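To make the computation concrete, below is a minimal PyTorch sketch of the single-node cart pole ProLoNet from the example above. The class and variable names are our own illustrative choices rather than the authors' released implementation; the sketch simply mirrors Equations 1 and 2 under the assumptions stated in the example (cart position is the first input feature, center at 0).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleNodeProLoNet(nn.Module):
    """Minimal sketch: one fuzzy decision node and two leaves (Eqs. 1 and 2)."""

    def __init__(self):
        super().__init__()
        # Decision node: the weight selects cart position (first feature); comparator c = 0.
        self.w = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 0.0]))
        self.c = nn.Parameter(torch.tensor(0.0))
        self.alpha = nn.Parameter(torch.tensor(1.0))  # fuzziness; treated as a learned parameter
        # Leaf priors over the two actions {left, right}.
        self.leaf_true = nn.Parameter(torch.tensor([1.0, 0.0]))   # condition met: move left
        self.leaf_false = nn.Parameter(torch.tensor([0.0, 1.0]))  # condition not met: move right

    def forward(self, x):
        # Eq. 1: d = sigmoid(alpha * (w^T x - c)), the likelihood the condition is met.
        d = torch.sigmoid(self.alpha * (x @ self.w - self.c)).unsqueeze(-1)
        # Eq. 2: weight each leaf's action prior by the probability of reaching it,
        # sum the leaves, and apply a softmax for the final action distribution.
        leaf_sum = d * self.leaf_true + (1.0 - d) * self.leaf_false
        return F.softmax(leaf_sum, dim=-1)

# A cart slightly right of center should prefer "move left" (index 0).
policy = SingleNodeProLoNet()
action_probs = policy(torch.tensor([[0.3, 0.0, 0.0, 0.0]]))
```

Because every operation above is differentiable, the weights, comparators, fuzziness, and leaf priors can all be refined by gradient descent after initialization.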

Our architecture allows for intelligent initialization of a policy, and we use a clone of that policy to initialize an agent’s value network. For every incoming state, the output of our critic is a distribution over actions, which represents the critic’s value for each action given the state. Our critic network is instantiated with the same parameters for weights, comparators, and leaves as the actor, with the only difference being that we do not run the final critic distribution through a softmax function, as we do not need to sample from a probability distribution for value predictions.
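A short sketch of the actor-to-critic cloning described here, assuming the actor is a PyTorch module: the critic is a deep copy with identical weights, comparators, and leaves, and the only difference is skipping the final softmax. The apply_softmax flag is a hypothetical attribute used purely for illustration.

```python
import copy

def clone_actor_into_critic(actor_prolonet):
    """Sketch: the critic shares the actor's initialization (weights, comparators,
    and leaf priors) but returns the raw summed leaf outputs as per-action value
    estimates instead of a softmax distribution."""
    critic = copy.deepcopy(actor_prolonet)
    critic.apply_softmax = False  # hypothetical flag: forward() would skip the softmax
    return critic
```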

Our network is updated with a modified version of PPO (Schulman et al., 2017). The standard PPO loss function is:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right] \tag{3}$$

where $r_t(\theta)$ is the ratio between the action probabilities under the new policy parameters and the old policy parameters, $A_t$ is the advantage, and $\epsilon$ is a hyperparameter to clip the gradients. We modify this loss by multiplying it by the log probability of selecting an action under the current policy:

$$L(\theta) = \mathbb{E}_t\left[\log \pi_\theta(a_t \mid s_t)\,\min\left(r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right] \tag{4}$$

For our more complex domains, we introduce a novel loss function which combines standard policy gradient with KL divergence as we found this to yield superior results for more challenging problems, though we leave a more thorough evaluation to future work:

$$L(\theta) = \mathbb{E}_t\left[A_t \log \pi_\theta(a_t \mid s_t) + D_{KL}\!\left(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_{\theta_{old}}(\cdot \mid s_t)\right)\right] \tag{5}$$

where $A_t$ is the advantage obtained for a given action, defined as $A_t = r_t - V(s_t, a_t)$, where $r_t$ is the reward for taking action $a_t$ in state $s_t$, and $V(s_t, a_t)$ is the value predicted by the critic for action $a_t$ in state $s_t$. $\pi_\theta(\cdot \mid s_t)$ is the probability distribution over all actions given the agent's current policy, and $\pi_{\theta_{old}}(\cdot \mid s_t)$ is the probability distribution over all actions given the agent's policy when it took action $a_t$.
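The sketch below shows one plausible PyTorch rendering of the three losses above for a batch of transitions. Tensor names, the mean reduction, the sign conventions, and the KL direction are our assumptions; they are not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ppo_clip_loss(ratio, adv, eps=0.2):
    # Eq. 3: the standard clipped PPO objective, written as a loss (hence the negation).
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(ratio * adv, clipped).mean()

def modified_ppo_loss(ratio, adv, logp, eps=0.2):
    # Eq. 4: the clipped objective additionally multiplied by the log probability
    # of the selected action under the current policy.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -(logp * torch.min(ratio * adv, clipped)).mean()

def policy_gradient_kl_loss(logp, adv, new_probs, old_probs):
    # Eq. 5 (as we read it): a policy-gradient term plus a KL-divergence term between
    # the current action distribution and the distribution that generated the action.
    pg_term = -(adv * logp).mean()
    kl_term = F.kl_div(new_probs.log(), old_probs, reduction="batchmean")
    return pg_term + kl_term
```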

The critic's loss function is the mean-squared error between the output of the critic and the reward from the state-action pair. All approaches are trained with the RMSProp optimizer (Tieleman & Hinton, 2012). We set our reward discount factor to 0.99, and learning rates and batch sizes are given in Section 4.

Figure 2: An example of deepening on a simplified ProLoNet visualization, where decision nodes are in blue, leaves are in green, and the deeper tree is represented with dashed lines and paler colors. When the deeper tree's leaves have sufficiently lower entropy than the shallower leaf, the tree deepens from the version on the left to the version on the right.

3.1 Deepening

While our initialized ProLoNets are able to follow expert strategies immediately, they may lack expressive capacity to learn more optimal policies once they are deployed into a domain. For instance, if an expert policy only involves a small number of decisions, the network will only have a small number of weight vectors and comparators to use for its entire existence. This lower expressive capacity is fine for short-lived agents or simple domains, but agents that must persist in the world for a long time or that must perform in highly complex domains will likely require greater expressive capacity than they may have upon initialization.

In order to address this desire for greater network capacity, we introduce dynamic deepening to ProLoNets. The process is outlined in Algorithm 1, where $H(\cdot)$ is an entropy function and $\epsilon$ is a threshold measuring how confident we must be in order to deepen. In our experiments, we set $\epsilon$ to a fixed value. A visual example of this process can be seen in Figure 2.

1:  Input: ProLoNet $P$, deeper ProLoNet $P_d$
2:  for each leaf $\ell$ in $P$ do
3:     Calculate $H_\ell$, the entropy of leaf $\ell$
4:     Calculate $H_{d_1}, H_{d_2}$, the entropies of the two leaves under $\ell$ in $P_d$
5:     if the entropy of the leaves under $\ell$ in $P_d$ is lower than $H_\ell$ by at least $\epsilon$ then
6:        Deepen $P$ at $\ell$ using the corresponding node and leaves from $P_d$
7:        Deepen $P_d$ at those new leaves randomly
8:     end if
9:  end for
Algorithm 1 Dynamic Deepening
(a) A comparison across different architectures for the cart pole problem. Reward averaged over 5 runs.
(b) A comparison between deepening and non-deepening networks. We see that the ability to deepen helps ameliorate policy degradation.
Figure 3: Results from the cart pole experiments. Best viewed in color.

Upon initialization, a ProLoNet agent maintains two copies of its actor: the shallower initialized version as-is, and a deeper version, in which each leaf is transformed into a randomly-initialized decision node with two new randomly-initialized leaves (line 1 of Algorithm 1). As the agent interacts with its environment, it relies on the shallower networks to generate actions and value predictions, and to gather experience. After each episode, our off-policy update is run over both the shallower and deeper networks. Finally, after the off-policy updates, the agent compares the entropy of the shallower actor's leaves (line 3) to the entropy of the deeper actor's leaves (line 4), and selectively deepens when the leaves of the deeper actor are less uniform than those of the shallower actor (lines 5-7).
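As a concrete illustration of lines 3-7 of Algorithm 1, the sketch below compares leaf entropies computed directly from the leaf weight vectors, as the text describes. The softmax over leaf weights, the averaging of the two deeper leaves, and the exact threshold comparison are our reading of the procedure, not the authors' precise criterion.

```python
import torch
import torch.nn.functional as F

def leaf_entropy(leaf_weights):
    """Entropy of a leaf's action prior, computed directly from its weight vector."""
    probs = F.softmax(leaf_weights, dim=-1)
    return -(probs * torch.log(probs + 1e-8)).sum()

def should_deepen(shallow_leaf, deeper_leaf_a, deeper_leaf_b, epsilon):
    """Deepen when the deeper actor's leaves are less uniform (lower entropy) than
    the corresponding shallower leaf by at least the confidence margin epsilon."""
    h_shallow = leaf_entropy(shallow_leaf)
    h_deeper = 0.5 * (leaf_entropy(deeper_leaf_a) + leaf_entropy(deeper_leaf_b))
    return bool(h_deeper < h_shallow - epsilon)
```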

4 Experiments

Over the course of our experiments, we want to evaluate the ProLoNet architecture against other standard RL architectures, but we also want to examine the importance of intelligent initialization and deepening. As such, for our cart pole, lunar lander, and SC2LE domains, we run two sets of experiments. The first set of experiments compares a full ProLoNet as described above to a series of baselines. The second set of experiments is an ablation study comparing a full ProLoNet to a ProLoNet with N-mistake initialization, where an expert has given a mistaken or sub-optimal policy, a ProLoNet with random initialization, and a ProLoNet without the ability to dynamically deepen. All ablation study results are aggregated in Table 1. We provide all heuristics for ProLoNet initialization in the appendix.

4.1 Agents

We compare different agents across our experimental domains including baseline linear and recurrent models, an imitation learning model, and a ProLoNet.

  • ProLoNet: a ProLoNet agent as described above

  • Random: a ProLoNet matching the architecture of the ProLoNet agent, but with random initialization

  • FC: a fully-connected-layer agent with ReLU activations.

  • LSTM: an LSTM agent with ReLU activations.

  • LOKI: a fully-connected layer agent as above, but which is trained via imitation learning for the first N episodes (Cheng et al., 2018), where N is a tuned hyperparameter. We use the same heuristic to train this agent that we use to initialize our ProLoNet agents.

For the FC and LSTM agents, we experiment with the number and size of linear layers and report results with the best-performing agent in each experiment. Architectures are provided in the appendix.

4.2 Environments

We consider four environments to evaluate ProLoNets: cart pole, lunar lander, the FindAndDefeatZerglings minigame from the SC2LE (Vinyals et al., 2017), and a full game of SC2 against the in-game AI. We provide a brief overview of each domain and present results from each below.

4.2.1 Cart Pole

Cart pole is a classic RL domain (Barto et al., 1983), where the object is to balance an inverted pendulum on a cart that moves left or right. The state space is a 4D vector representing {cart position, cart velocity, pole angle, pole velocity}. The action space is {left, right}. We use the cart pole domain from the OpenAI Gym (Brockman et al., 2016).

For the cart pole domain, we set all agents' learning rates to 0.01, the batch size is set to grow dynamically as more replay experience becomes available, and each agent trains on all data gathered after each episode, then empties its replay buffer. All agents train on 3 simulations concurrently, pooling replay experience after each episode and updating their policy parameters. For the LOKI agent, we set N = 200. All agents are updated according to the modified PPO loss function in Equation 4. We selected all parameters empirically to produce the best results for each method. We found that ProLoNets could always solve this domain when initialized with anywhere from 1 to 10 decision nodes and 2 to 11 leaves, whereas the FC and LSTM agents were very sensitive to such hyperparameter tuning.

Results are averaged across five runs (Henderson et al., 2018) of 1000 episodes each and are depicted in Figure 3(a). In order to evaluate the impact of deepening and of intelligent initialization, our ablation results are shown in Table 1. For each N-mistake agent, weights, comparators, and leaves are randomly negated with probability N, up to a maximum of 2N of the parameters in each category. For example, the "N-Mistake 0.05" agent has at most 10% of its weights, comparators, and leaves negated, with each negated with probability 0.05.
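For clarity, here is a small sketch of the N-mistake corruption as we understand it: each weight, comparator, and leaf entry is negated with probability N, capped so that no more than 2N of the entries in a category are flipped. The cap logic and function name are illustrative assumptions.

```python
import random

def negate_with_mistakes(values, p):
    """Negate entries of `values` with probability `p`, flipping at most 2*p of them."""
    budget = int(2 * p * len(values))  # cap: at most 2N of the category is negated
    corrupted, flips = [], 0
    for v in values:
        if flips < budget and random.random() < p:
            corrupted.append(-v)
            flips += 1
        else:
            corrupted.append(v)
    return corrupted

# Example: corrupt a list of comparator values with the "N-Mistake 0.05" setting.
comparators = [5.0, 0.0, -1.2, 3.3] * 10  # 40 comparator values
noisy_comparators = negate_with_mistakes(comparators, p=0.05)
```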

In order to evaluate the effect of deepening in a lifelong learning scenario, we compare a full ProLoNet to a non-deepening ProLoNet on the cart pole problem run for 5000 episodes. We select the cart pole problem for its simplicity and low computational overhead, which allows us to evaluate how well the agent handles continued updates to an already-optimal policy. We present a comparison between a full ProLoNet and a non-deepening ProLoNet in Figure 3(b).

Figure 4: A comparison across different architectures for the lunar lander problem. Reward averaged over 5 runs of 1500 episodes

4.2.2 Lunar Lander

Lunar lander is the second domain we use from the OpenAI Gym (Brockman et al., 2016), and is based on the classic Atari game of the same name. Lunar lander is a game where the player attempts to land a small ship (the lander) safely on the ground, keeping the lander upright and touching down slowly. The state is an 8D vector consisting of the lander’s {x, y} position and velocity, the lander’s angle and angular velocity, and two binary flags which are true when the left or right legs have touched down.

We use the discrete lunar lander domain, and so the action space is 4D, where the actions are {do nothing, left engine, main engine, right engine}. For the lunar lander domain, we set most hyperparameters to the same values as in the cart pole domain. The two exceptions are the number of concurrent processes, which we set to 5, and the LOKI agent's N, which is set to 300. All agents use the modified PPO loss function in Equation 4. We found that ProLoNets always managed to solve this domain using anywhere from 10 to 14 decision nodes and 10 to 15 leaves, and that, again, the FC and LSTM agents were much more sensitive to hyperparameter tuning.

Results are averaged across five runs for 1500 episodes each and are depicted in Figure 4.

Domain          ProLoNet       Random-Init.   Non-Deepening   N-Mistake      N-Mistake      N-Mistake
                               ProLoNet       ProLoNet        0.05           0.1            0.15
Cart Pole       449k ± 15k     401k ± 26k     386k ± 35k      426k ± 30k     369k ± 28k     424k ± 29k
Lunar Lander    86k ± 33k      55k ± 19k      49k ± 20k       50k ± 22k      45k ± 22k      45k ± 22k
SC2 Mini-game   15.8k ± 3.2k   1.4k ± 1.7k    14.4k ± 3k      4.1k ± 2.1k    –              –
Table 1: Ablation study on ProLoNets showing average cumulative reward (± standard deviation) across domains

4.2.3 FindAndDefeatZerglings

FindAndDefeatZerglings is a minigame from the SC2LE designed to challenge RL agents to learn how to effectively micromanage their individual attacking units in SC2. The agent controls three attacking units on a small but not fully-observable map, and must explore the map while killing enemy units. The agent receives +1 reward for each enemy unit that is killed, and -1 for each allied unit that is killed. Enemy units respawn in random locations, and so the best agents are ones that continuously explore and kill enemy units until the three minute timer has elapsed.

We leverage the SC2 API (https://github.com/Blizzard/s2client-api) to manufacture a 32D state which contains {x_position, y_position, health, weapon_cooldown} for three allied units and the five nearest visible enemy units. Missing information is filled with -1. Our action space is 10D, containing move commands for north, east, south, and west, attack commands for each of the five nearest visible enemies, and a "do nothing" command. For this problem, we assign an agent to each individual allied unit, and each agent generates actions only for its own unit. Experience from each agent stops accumulating when its unit dies. All experience is pooled for policy updates after each episode, and parameters are shared between agents.
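A sketch of how the 32D state described above could be assembled, with missing units padded by -1. The helper name `unit_features` and the attribute names are assumptions for illustration, not the SC2 API's actual field names.

```python
def unit_features(unit):
    # Assumed accessor: each unit contributes {x_position, y_position, health, weapon_cooldown}.
    return [unit.x, unit.y, unit.health, unit.weapon_cooldown]

def build_minigame_state(allied_units, nearest_enemies):
    """32D state: 3 allied units plus the 5 nearest visible enemies, 4 features each;
    slots for missing units are filled with -1, as described above."""
    state = []
    for i in range(3):
        state += unit_features(allied_units[i]) if i < len(allied_units) else [-1.0] * 4
    for i in range(5):
        state += unit_features(nearest_enemies[i]) if i < len(nearest_enemies) else [-1.0] * 4
    return state  # length 32
```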

For the SC2LE minigame, we set all agents' learning rates to 0.001 and the batch size to 4. Each agent trains on replay data for 8 update iterations per episode and pools experience from 2 concurrent processes. The LOKI agent's N is set to 500, and the LOKI agent is permitted to train for 500 extra episodes in this more complex domain. The agents in this domain update according to the KL loss function in Equation 5. ProLoNet agents for this domain were more sensitive to their initial policies, and we found the best-performing policy to be a compromise between specificity and simplicity, at 10 decision nodes and 11 leaves.

Results are averaged across five runs of 2000 episodes and are depicted in Figure 5. For this domain, we compare to only one mistakenly-initialized agent, as even a low mistake probability N resulted in markedly worse performance.

Figure 5: A comparison across different architectures for the FindAndDefeatZerglings problem. Reward averaged over 5 runs

4.2.4 SC2 Full Game

The complexity and state space of StarCraft II is exceptionally large, includes various levels of detail, and is never fully observable. Rather than attempt to solve the vision aspect of SC2, we again use the API to extract exact unit types and counts and game state information. The state contains:

  • Allied Unit Counts: A 36x1 vector in which each index corresponds to a type of allied unit, and the value corresponds to how many of those units exist.

  • Pending Unit Counts: As above, but for units that are currently in production and do not exist yet.

  • Enemy Unit Counts: A 112x1 vector in which each index corresponds to a type of unit, and the value corresponds to how many of those types are visible.

  • Player State: A 9x1 vector of specific player state information, including minerals, vespene gas, supply, etc.

The disparity between allied unit counts and enemy unit counts is due to the fact that we only play as the Protoss race, but we can play against any of the three races.
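The state described by the bullets above is a concatenation of four vectors; the minimal sketch below shows that concatenation. The component sizes come from the text, while the function and argument names are our own.

```python
import numpy as np

def build_full_game_state(allied_counts, pending_counts, enemy_counts, player_state):
    """Concatenate the four components listed above: 36 allied unit counts,
    36 pending unit counts, 112 enemy unit counts, and 9 player-state values
    (minerals, vespene gas, supply, etc.)."""
    assert len(allied_counts) == 36 and len(pending_counts) == 36
    assert len(enemy_counts) == 112 and len(player_state) == 9
    return np.concatenate([allied_counts, pending_counts, enemy_counts, player_state])
```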

The number of actions in SC2 can be well into the thousands if one considers every individual unit's abilities. As we seek to encode a high-level strategy rather than rules for moving every individual unit, we restrict the action space for our agent. As in the work of Sun et al. (2018) developing TStarBot, we construct a set of heuristics to simplify the specification of our expert policy. Rather than using exact mouse and camera commands for individual units, we abstract actions out to high-level commands such as "Build Pylon." As such, our agents have 44 available actions, including 35 building and unit production commands, 4 research commands, and 5 commands for attack, defend, harvest resources, scout, and do nothing.

AI Difficulty   ProLoNet   ProLoNet at Initialization   FC    LSTM   LOKI   Random-Init. ProLoNet
VeryEasy        100%       14.1%                        0%    0%     0%     0%
Easy            100%       10.9%                        0%    0%     0%     0%
Medium          82.2%      11.3%                        0%    0%     0%     0%
Hard            26%        10.7%                        0%    0%     0%     0%
Table 2: Win Rates for Various Agents Against the SC2 In-Game AI
Figure 6: A comparison between our proposed loss function and the modified PPO loss function on the FindAndDefeatZerglings task.

For the full SC2 game, we set all agents' learning rates to 0.0001, the batch size to 4, and updates per episode to 8. We run 4 episodes between updates and set the LOKI agent's N to 1000. Agents train for as long as necessary to achieve a satisfactory win rate against the easiest AI, then move up to successive levels of difficulty as they achieve satisfactory win rates. The agents in this domain update according to the loss function in Equation 5. ProLoNet agents often succeeded in this domain with initial policies ranging from 10 to 16 nodes and 11 to 17 leaves. We present results from an agent that succeeds with one of our simplest policies.

In such a difficult and complex domain, making the right parameter update is a significant challenge for RL agents. As such, we validate each agent’s network updates by freezing the actor and critic parameters and playing out several games with the new policy. If the agent’s chance of a victory is at least as good as it was before the update, then the parameters are unfrozen and the agent continues its learning. However, if the agent’s probability of success is lower after the update, then the parameters are rolled back and the agent gathers experience for a new update. Probability of success is simply the number of victories under a new policy divided by the number of games under a new policy, and we determine when the result is significant enough to base a decision on by performing a Bernoulli test after every 5 games.
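The sketch below illustrates the update-validation loop described above: apply an update, evaluate the frozen policy in blocks of five games, keep the update if the empirical win rate has not dropped, and roll back otherwise. The helper callables and the PyTorch-style state_dict calls are assumptions; the authors' exact Bernoulli significance test is not reproduced here.

```python
import copy

def validated_update(agent, apply_update, play_game, prev_win_rate,
                     games_per_check=5, max_games=20):
    """Apply a policy update, then keep it only if the frozen policy's win rate
    is at least as good as before the update; otherwise roll the parameters back."""
    backup = copy.deepcopy(agent.state_dict())  # assumes a PyTorch-style module
    apply_update(agent)                         # e.g., the PPO/KL update from Equation 5

    wins, games = 0, 0
    while games < max_games:
        for _ in range(games_per_check):        # evaluate in blocks of 5 games
            wins += int(play_game(agent))       # play_game returns True on a victory
            games += 1
        if wins / games >= prev_win_rate:       # at least as good as before: keep the update
            return True
    agent.load_state_dict(backup)               # worse than before: roll back and retry later
    return False
```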

We find that after 5000 episodes, neither the FC, the LSTM, nor the randomly initialized ProLoNet agents are able to win a single game against the Very Easy in-game AI. Even the LOKI agent, which has access to the same heuristics used by the ProLoNet, was unable to win even one game. The intelligently initialized ProLoNet, on the other hand, is able to progress all the way to the Hard in-game AI, achieving 100% win rates against easier opponents along the way. We present all agents' win rates, in addition to those of a ProLoNet initialization without any training, in Table 2.

5 Discussion

Our results indicate that it is possible to encode heuristic policies into deep networks for warm-starting RL agents, and that such warm-starting can help agents begin to immediately explore and learn superior policies without wasting time taking random actions. As we can see in Figures 3(a), 4, and 5, ProLoNets outperform both the baseline architectures and the imitation learning baseline. Examining the more complex domain in Figure 5, we can see that the imitation learning agent performs on par with its initial policy but fails to generalize or improve beyond the heuristics that it has available for labeling data. ProLoNets, on the other hand, are able to start learning at the heuristic level and improve from that initial success. This is particularly evident in our full-game experiment, where even the imitation learning baseline is unable to win a single game against the in-game AI, while the ProLoNet agent can win immediately and improve from there. We speculate that this can be attributed to the failed exploration of baseline approaches. Highly complex domains can require a very narrow and specific set of actions to achieve victory, and only a specific subset of actions may even be possible at any given time. Agents that take random actions will spend thousands of episodes doing nothing at all, as the network continues to output actions that are not possible to perform (e.g., "Train Zealot" before a Gateway has been constructed).

We also observe that, at least for simple domains, our architecture is relatively robust to errant initialization, as is shown in Table 1. Furthermore, we note that ProLoNets were able to perform well with a wide range of decision nodes and leaves, while baseline methods were very sensitive to the number and size of hidden layers, as is also noted in (Henderson et al., 2018). For instance, an FC agent with twice as many parameters was unable to solve even the simple cart pole problem, while all ProLoNet approaches with initial policies ranging from 1 to 10 decision nodes were able to solve the problem.

For our two simpler domains, cart pole and lunar lander, we observe that our approach is as fast as or faster than baseline methods at learning an optimal policy, and that it is more stable and robust to policy degradation than baseline methods. We also see the impact of dynamic network growth in Figure 3(b). Our dynamic deepening is crucial to long-term success and to remaining optimal. Though the shallower ProLoNet is able to quickly find an optimal policy, the agent's continued experience and updates to the policy result in degradation and deterioration of the agent's performance. Even with a dynamically growing batch size, which has been shown to ameliorate overfitting (Smith et al., 2017), the shallower ProLoNet policy deteriorates. We can see that the deeper ProLoNet begins to experience the same deterioration, but is able to recover and re-solve the problem due to its increased network capacity.

In our more complex domains, we can see the importance of an intelligent initialization and initial policy. While the imitation learning baseline is able to perform well in the FindAndDefeatZerglings minigame, it is unable to improve on the imitated policy and degrades quickly once it enters the reinforcement learning phase. Similarly, noisily-initialized ProLoNets perform quite poorly, even with the exact same architecture as their successful counterparts. Both the shallower and deeper ProLoNet agents, however, are able to begin successfully, improve upon their initial policies, and perform even better in the complex domains. We see less of a disparity between the shallow and deep ProLoNets in this domain, and we speculate that this is because the network is already large upon initialization.

6 Conclusion and Future Work

We have presented a new architecture for deep RL agents, ProLoNets, which permit intelligent initialization of agents, and grant agents the ability to grow their network capacity as necessary. We have shown that ProLoNets are robust to the number of parameters in the network, to policy degradation, and even to sub-optimal expert initialization in simpler domains. We demonstrate that our approach is superior to imitation and reinforcement learning on traditional architectures, and that intelligent initialization allows deep RL agents to explore and learn in environments that are too complex for randomly initialized agents.

We believe that there are many opportunities for future work in this area. First, the translation of an expert policy into a set of rules and weights is tedious, time-consuming, and requires computing expertise. Incorporation of cognitive architectures or human factors research to automate this process could prove to be an efficient and easy way to open up RL agent initialization to any domain expert. Second, we have not outlined ways that we can incorporate unstructured information, such as images or audio, into a ProLoNet framework (Edwards et al., 2016). We envision extensions along the same lines as our dynamic deepening approach, but we have yet to carry out further work in this area. Finally, our architecture does not currently employ any kind of recurrence or memory, presenting a key area for further development in domains with long time horizons, or in lifelong learning scenarios.

References

Appendix A Cart Pole Heuristics

We use a simple set of heuristics for the cart pole problem, visualized in Figure 7. If the cart is close enough to the center, we move in the direction opposite to the lean of the pole, as long as that motion will not push us too far from the center. If the cart is close to an edge, the agent attempts to account for the cart’s velocity and recenter the cart, though this is often an unrecoverable situation for the heuristic. The longest run we saw for a ProLoNet with no training was about 80 timesteps.

Appendix B Lunar Lander Heuristics

For the lunar lander problem, the heuristic rules are split into two primary phases. The first phase is engaged at the beginning of an episode, while the lander is still high above the surface. In this phase, the agent focuses on keeping the lander's angle as close to 0 as possible. Phase two occurs when the lander gets closer to the surface, and the agent then focuses on keeping the y_velocity lower than 0.2. As is depicted in Figure 8, there are many checks for both lander legs being down. We found that both the baseline agents and ProLoNets were prone to landing successfully but then continuing to fire their left or right boosters. In an attempt to ameliorate this problem, we added the extra "legs down" checks.

Appendix C FindAndDefeatZerglings Heuristics

For the SC2LE minigame, the overall strategy of our heuristic is to stay grouped up and fight or explore as a group. As such, the first four checks are all in place to ensure that the marines are all close to each other. After they pass the proximity checks, they attack whatever is nearest. If nothing is nearby, they will move in a counter-clockwise sweep around the periphery of the map, searching for more zerglings. Our heuristic is shown in Figure 9.

Appendix D SC2 Full Game Heuristics

The SC2 full game heuristic first checks for important actions that should always be high priority, such as attacking, defending, harvesting resources, and scouting. Once these initial checks are passed, the heuristic descends into the build order, where it simply uses building or unit count checks to determine when certain units should be built or trained. After enough attacking units are trained, the heuristic indicates that it is time to attack. The SC2 full game heuristic is depicted in Figure 10.

Appendix E Agent Architectures Per Experiment

In this section we briefly overview the FC and LSTM layer information. The LOKI agent maintained the same architecture as the FC agent.

E.1 Cart Pole

The cart pole FC network is a 3-layer network following the sequence: 4x4 – 4x4 – 4x2.

We experimented with layer sizes ranging from 4 to 64 and numbers of hidden layers from 1 to 10, and found that the small network performed the best.

The LSTM network for cart pole is the same as the FC network, though with an LSTM unit inserted between the first and second layers. The LSTM unit's hidden size is 4, so the final sequence is: 4x4 – LSTM(4x4) – 4x4 – 4x2.

We experimented with hidden sizes for the LSTM unit from 4 to 64, though none were overwhelmingly successful, and we varied the number of layers after the LSTM unit from 1 to 10.

The ProLoNet agent for this task used 9 decision nodes and 11 leaves. For the deepening experiment, we tested an agent with only a single node and 2 leaves, and found that it still solved the task very quickly. We tested randomly initialized architectures from 1 to 9 nodes and from 2 to 11 leaves, and we found that all combinations successfully solved the task.

E.2 Lunar Lander

The lunar lander FC network is a 4-layer network following the sequence: 8x8 – 8x64 – 64x8 – 8x4.

We again experimented with layer sizes from 8 to 64 and numbers of hidden layers from 2 to 11.

The LSTM network for lunar lander mimics the LSTM architecture from cart pole. The LSTM unit's hidden size is 8, so the final sequence is: 8x8 – LSTM(8x8) – 8x8 – 8x4.

We experimented with hidden-sizes for the LSTM unit from 8 to 64, and again we varied the number of layers succeeding the LSTM unit from 1 to 10.

The ProLoNet agent for this task featured 14 decision nodes and 15 leaves. We experimented with intelligent-initialization architectures ranging from 10 to 14 nodes and from 10 to 15 leaves, and found little difference between their performances. The additional nodes were an attempt to encourage the agent to "do nothing" once it had successfully landed, though this was only moderately successful.

E.3 FindAndDefeatZerglings

We failed to find an FC architecture that succeeded in this task, and so we chose one that compromised between the depth of the ProLoNet and the simplicity that FC agents seemed to prefer in the toy domains. The final network is a 7-layer network with the following sequence: 32x32 – 32x32 – 32x32 – 32x32 – 32x32 – 32x32 – 32x10.

We chose to keep the layer size at 32 after testing sizes of 32 and 64, deciding that staying as close as possible to the ProLoNet architecture was the best option.

The LSTM network for FindAndDefeatZerglings features more hidden layers than the LSTM networks for lunar lander and cart pole. The hidden size is set to 32, and the LSTM unit is followed by 5 layers. The final sequence is: 32x32 – LSTM(32x32) – 32x32 – 32x32 – 32x32 – 32x32 – 32x10.

We experimented with hidden sizes for the LSTM unit from 32 to 64 and varied the number of successive layers from 4 to 10.

The ProLoNet agent for FindAndDefeatZerglings featured 10 nodes and 11 leaves. We tested architectures from 6 to 15 nodes and from 7 to 13 leaves, and found that the initialized policy and architecture had more of an immediate impact for this task. The 7 node policy allowed agents to spread out too much, and they died quickly, whereas the 15 node policy had agents moving more than shooting, and they would walk around while being overrun.

E.4 SC2 Full Game

We again failed to find an FC architecture that succeeded in this task, and so used an architecture similar to that of the FindAndDefeatZerglings task. The 7-layer network follows the sequence: 194x194 – 194x194 – 194x194 – 194x194 – 194x194 – 194x194 – 194x44.

We again experimented with a variety of shapes and number of layers, though none succeeded.

Again, the LSTM network shadows the FC network for this task. As in the FindAndDefeatZerglings task, we experimented with a variety of LSTM hidden unit sizes, hidden layer sizes, and hidden layer counts. The final architecture reflects the FindAndDefeatZerglings sequence: 194x194 – LSTM(194x194) – 194x194 – 194x194 – 194x194 – 194x194 – 194x44.

The ProLoNet agent for the SC2 full game featured 10 nodes and 11 leaves. We tested architectures from 10 to 16 nodes and from 11 to 17 leaves, and found that the initialized policy and architecture were not as important for this task as they were for the FindAndDefeatZerglings task. As long as we included a basic build order and the "attack" command, the agent would manage to defeat the VeryEasy in-game AI at least 10% of the time. We found that constraining the policy to fewer nodes and leaves produced less noise as updates progressed, and kept the policy close to its initialization while also providing improvements. An initialization with too many parameters often seemed to degrade quickly, presumably because small changes over many parameters have a larger impact than small changes over few parameters.

Figure 7: Visualization of the heuristics used to initialize the cart pole ProLoNet, and to train the LOKI agent
Figure 8: Visualization of the heuristics used to initialize the lunar lander ProLoNet, and to train the LOKI agent
Figure 9: Visualization of the heuristics used to initialize the FindAndDefeatZerglings ProLoNet, and to train the LOKI agent
Figure 10: Visualization of the heuristics used to initialize the SC2 full game ProLoNet, and to train the LOKI agent