Generalization to unseen situations is an important capability for autonomous agents. Especially in real-world decision making and control applications such as autonomous driving, robotics, process control, health care, and finance, agents routinely need to adapt safely to unseen situations. A common practice is to train such agents, mostly deep neural networks, with data collected from a limited number of hand-designed scenarios. However, the tasks are often too complex to anticipate every possible scenario, so this approach does not scale. Moreover, these models can be brittle when they are exposed to even small variations or noise.
One popular approach to address this problem is few-shot learning, in particular metalearning, either by utilizing gradients (Schmidhuber, 1987; Thrun and Pratt, 1998; Finn et al., 2017) or evolutionary procedures (Fernando et al., 2018; Grbic and Risi, 2019). In metalearning, systems are trained by exposing them to a large number of tasks, and then tested for their ability to learn new, related but unseen tasks. There are also a number of approaches, mostly for the supervised learning setting, where new labels need to be predicted based on a limited amount of training data. However, applications in control and decision making, including reinforcement learning problems, are very limited (Kansky et al., 2017).
The approach in this paper is motivated by prior work on opponent modeling in poker (Li and Miikkulainen, 2017, 2018). In that domain, an effective approach was to evolve one neural network, the game module, to decide what move to make, and another, the opponent module, to monitor the opponent and modulate those decisions by taking the opponent's playing style into account. When trained with only a small number of very simple but different opponents, the approach was able to generalize and play well against a wide array of opponents, including some that were much better than anything seen during training.
In a sense, the opponent forms a context for the decision making in poker. Each decision needs to take into account how the opponent is likely to respond, and select the right action accordingly. The player can thus adapt to many different game playing situations immediately, even those that have not been encountered before. In this paper, this approach is generalized and applied to control and decision making more broadly. In more general terms, a skill network reacts to the current situation, and a context network integrates observations over a longer time period. A third, controller, network combines the outputs of both networks, thus modulating decision making through context. Such a Context+Skill system can thus generalize to more situations than any of its components alone.
The Context+Skill approach is evaluated in this paper on an extended version of the popular Flappy Bird game. This version includes more actions and physical effects (i.e., forward flap and drag in addition to flap up and gravity). Such an extension allows generating a range of unseen scenarios, both by extending the range of effects of those actions and by combining them. The approach generalizes remarkably well to new situations, and does so much better than its components alone. The Context+Skill approach is thus a promising way to build robust autonomous agents in real-world domains.
The remaining sections of this paper are organized as follows: Section 2 presents the experimental set up and the test domain, the architecture of the neural networks, and the multiobjective evolution procedure for constructing the system. Section 3 presents learning and generalization results, demonstrating that the Context+Skill approach indeed performs better than its parts. The behaviors of these networks are contrasted in Section 4, finding that Context+Skill can anticipate results of its actions more accurately, making it possible to adapt to unseen situations.
This section introduces the experimental setup, the neural networks used as the control policies for the agent, and the evolutionary training methodology.
2.1. The Flappy Ball Domain
Flappy Ball is an extension of the popular Flappy Bird computer game (6). Implemented in PyGame, it has less detailed visual effects but more complex physical dynamics, and is mainly developed to test the generalization behavior of an agent in a more challenging and controlled environment (Fig. 1). The agent, controlled by a neural network, aims to navigate through the openings between pipes without hitting them for a certain length of time. The agent can control two actions, i.e., flapping forward and upward; both actions can be applied simultaneously. If they are not applied, gravity will pull the agent down and drag will slow it down. The agent gets a reward of +1 every time it passes a pipe successfully, and various penalties depending on how badly it crashes into the pipes, ceiling, or ground. Each time step spent in a collision incurs a penalty of -1, and in hitting the ceiling or the ground, of -5. This way, attempting to fly through the pipes but failing is penalized less than flying through a pipe or not trying.
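The reward scheme described above can be sketched as follows. This is a minimal illustration rather than the actual Flappy Ball implementation: the function name `step_reward` and its boolean flags are hypothetical, and the sketch assumes that penalties are simply added per time step.

```python
PIPE_PASS_REWARD = 1        # +1 for each pipe passed successfully
PIPE_HIT_PENALTY = -1       # per time step spent colliding with a pipe
BOUNDARY_HIT_PENALTY = -5   # per time step spent hitting the ceiling or the ground


def step_reward(passed_pipe: bool, hit_pipe: bool, hit_boundary: bool) -> int:
    """Reward for one time step (hypothetical helper, not the authors' code)."""
    reward = 0
    if passed_pipe:
        reward += PIPE_PASS_REWARD
    if hit_pipe:
        reward += PIPE_HIT_PENALTY
    if hit_boundary:
        reward += BOUNDARY_HIT_PENALTY
    return reward


# Three time steps: one clean pass, one pipe collision, one ceiling hit.
episode = [(True, False, False), (False, True, False), (False, False, True)]
total = sum(step_reward(*flags) for flags in episode)
```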
At every time step, the agent receives sensory information as a vector of six numerical values as indicated in Fig. 1: the vertical position of the agent, its horizontal and vertical velocities, the horizontal distance of the agent to the right edge of the closest pipe, and the heights of the top and bottom pipes. These values are normalized to the range [0,1]. In an environment with known physical effects, this setup is a Markov Decision Process (MDP), since all the state information necessary to decide on the right action is provided to the agent. However, the effects of the agent's actions, i.e., flap up or forward, as well as the physical forces acting upon the agent, i.e., gravity and drag, can change between episodes unbeknownst to the agent, establishing a new task for the agent. Therefore, in order to perform well in new tasks, the agent has to infer such variations from its interactions with the environment over time, which makes the problem partially observable. Since acceleration and velocity are linearly correlated, such learning is possible.
The Flappy Ball domain can be seen as a proxy for control and decision making problems where the changes in the environment require immediate adaptation, such as operating a vehicle under different weather conditions, configuration changes, wear and tear, or sensor malfunctions. The challenge is to adapt the existing policies to the new conditions immediately without further training, i.e. to generalize the known behavior to unseen situations.
2.2. Evolutionary Multi-objective Optimization (EMO)
The original Flappy Bird game is usually treated as a single-objective optimization problem, where the number of pipes passed until one is hit is maximized. To provide for more varied behaviors, Flappy Ball is formulated as a multi-objective optimization problem instead. The number of successfully passed pipes is maximized, whereas the number of collisions of any type (hits) is minimized.
The non-dominated sorting genetic algorithm NSGA-II (Deb et al., 2002), as implemented in DEAP (Fortin et al., 2012), was used as the EMO method for Flappy Ball. Although finding the safest solution is the ultimate goal, as in the single-objective case, the diversity resulting from the multiobjective search speeds up training and helps discover well-performing solutions (Knowles et al., 2001). EMO algorithms use Pareto dominance to sort the solutions into sets of equally preferable solutions (or Pareto fronts). The front containing the non-dominated solutions is called the Pareto-optimal set (Deb, 2001); it is up to the user to select one of its members based on his or her needs. In the experiments in this paper, a solution that is perfectly safe or close to it is usually selected.
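As an illustration of Pareto dominance for the two objectives used here (pipes passed, maximized; hits, minimized), the following minimal sketch extracts the non-dominated set from a list of scores. The function names and the example scores are made up for illustration; the experiments themselves rely on the NSGA-II implementation in DEAP.

```python
from typing import List, Tuple

Score = Tuple[float, float]  # (pipes passed, hits)


def dominates(a: Score, b: Score) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(scores: List[Score]) -> List[Score]:
    """Return the scores that no other score dominates (the first front)."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Illustrative scores only, not results from the paper.
scores = [(22.0, 0.0), (25.0, 0.4), (20.0, 0.0), (25.0, 1.0), (18.0, 0.2)]
front = pareto_front(scores)
print(front)  # [(22.0, 0.0), (25.0, 0.4)]
```

Here (22.0, 0.0) is the perfectly safe member of the front, while (25.0, 0.4) trades a few collisions for more pipes; neither dominates the other.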
2.3. Neural Networks
The Context+Skill Network consists of three components: the Skill and the Context modules and the Controller (Figure 2). The first two modules receive sensory information from the environment as numerical values, as described in Section 2.1. They send their output to the Controller, a fully connected feedforward neural network that makes the decisions on which actions to take.
The Skill module is also a fully connected feedforward network. Together with the Controller, it forms the Skill-only Network, S (Fig. 2(c)). The Skill module used in this study has 10 hidden and five output nodes, and the Controller has 20 hidden nodes and two outputs, i.e., one for each action. S is used as the baseline model throughout the study. In principle it has all the information for navigating through the pipes, but it does not have the benefit of an explicit representation of context.
The other main component in the Context-Skill framework is the Context module. It is composed of a vanilla Long Short Term Memory (LSTM) cell (Hochreiter and Schmidhuber, 1997; Fig. 3). There are three gates in this recurrent neural network architecture: input, forget, and output. The gates are responsible for learning what to store, what to throw away, and what to read out from the long-term memory of the cell. Thus, the cell can learn to retain information from the past, update it, and output it at an appropriate time, thereby making it possible to learn sequential behavior (Greff et al., 2017; Géron, 2017).
The C-module used in this study consists of an LSTM cell of size 10. The memory of the C-module (h and c) is reset at the beginning of each new task, and accumulated (transferred) across episodes within each task. It can therefore form a representation of how actions affect the environment. The output of the LSTM (h) is sent to the Controller as the context. Together, the C-module and the Controller form the Context-only network, C (Fig. 2(b)). It serves as a second baseline, allowing integration of observations over time, but without a specific Skill network to map them directly to action recommendations.
The complete Context-Skill Network, CS (Fig. 2(a)) consists of both the Context and Skill modules as well as the Controller network of the same size as in C and S. The motivation behind the CS architecture, i.e. of integrating the Context module into S, is to make it possible for the system to learn to use an explicit context representation to modulate its actions appropriately. The method for discovering these behaviors is discussed next.
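The data flow through the CS network can be sketched in plain Python as below. The layer sizes follow the text (six inputs, a Skill module with 10 hidden and five output nodes, an LSTM of size 10, a Controller with 20 hidden nodes and two outputs). Everything else is an assumption made for illustration: the tanh and sigmoid activations, the random weight initialization, the 0.5 threshold on the action outputs, the single bias vector per LSTM gate, and all function names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, act):
    """Fully connected layer: act(W x + b), with W given as a list of rows."""
    return [act(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def lstm_step(x, h, c, weights, biases):
    """One step of a vanilla LSTM cell; weights map [x; h] to four gate blocks."""
    n = len(h)
    z = dense(x + h, weights, biases, lambda v: v)
    i = [sigmoid(v) for v in z[0:n]]             # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]         # forget gate
    g = [math.tanh(v) for v in z[2 * n:3 * n]]   # candidate memory
    o = [sigmoid(v) for v in z[3 * n:4 * n]]     # output gate
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases (placeholder for evolved weights)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
skill_1, skill_2 = init_layer(10, 6, rng), init_layer(5, 10, rng)
lstm = init_layer(4 * 10, 6 + 10, rng)
ctrl_1, ctrl_2 = init_layer(20, 5 + 10, rng), init_layer(2, 20, rng)


def cs_forward(obs, h, c):
    """Skill output and context (h) are concatenated and fed to the Controller."""
    skill_out = dense(dense(obs, *skill_1, math.tanh), *skill_2, math.tanh)
    h, c = lstm_step(obs, h, c, *lstm)
    hidden = dense(skill_out + h, *ctrl_1, math.tanh)
    actions = dense(hidden, *ctrl_2, sigmoid)
    return actions, h, c


h, c = [0.0] * 10, [0.0] * 10            # memory reset at the start of a task
obs = [0.5, 0.1, 0.0, 0.8, 0.3, 0.4]     # six normalized sensor values
for _ in range(3):                       # memory carries over between steps
    actions, h, c = cs_forward(obs, h, c)
flap_up, flap_forward = (a > 0.5 for a in actions)
```

Dropping the Skill branch (feeding only `h` to the Controller) gives the C baseline, and dropping the LSTM branch gives the S baseline.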
All three neural network models described in Section 2.3 are evolved using NSGA-II (Deb et al., 2002). The overall procedure is shown in Algorithm 3. The network architectures remain fixed while their weights are evolved. The goal is to maximize their average score across multiple tasks, where each task is based on different physical parameters of the Flappy Ball environment. The base values for the four parameters are chosen as Flap=-12.0 (the negative value is due to the coordinate system), Gravity=1.0, Forward=5.0, and Drag=1.0. In each task during evolution, only one parameter is subject to change, while the rest are fixed at their base values. There are four tasks used in evolution, defined as:
Task-1: The effect of the Flap action varies within ±20% of its base value, i.e., [Flap_min, Flap_max] = [-14.4, -9.6];
Task-2: The effect of the Gravity force varies within ±20% of its base value, i.e., [Gravity_min, Gravity_max] = [0.8, 1.2];
Task-3: The effect of the Forward action varies within ±20% of its base value, i.e., [Forward_min, Forward_max] = [4.0, 6.0]; and
Task-4: The effect of the Drag force varies within ±20% of its base value, i.e., [Drag_min, Drag_max] = [0.8, 1.2].
Each task, and therefore each parameter, is uniformly sampled n=5 times within the limits specified above. The fitness of every individual in the population is evaluated in parallel on the same task distribution for a fair comparison. Each episode length is fixed to 500 time steps. The seed number for the random number generator (Line-3 in Algorithm 1) is included in the task parameters so that the distribution of the pipes can be repeated.
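The task preparation described above can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: the function name `sample_tasks`, the dictionary layout, and the `pipe_seed` key are hypothetical, and the sketch assumes that each sample is drawn independently and uniformly within 20% of the base value.

```python
import random

BASE = {"flap": -12.0, "gravity": 1.0, "forward": 5.0, "drag": 1.0}
PERTURB = 0.2  # each varied parameter stays within 20% of its base value


def sample_tasks(n_samples=5, seed=0):
    """Return one list of episode parameters per task; each task varies one parameter."""
    rng = random.Random(seed)
    tasks = []
    for name, base in BASE.items():
        # sorted() handles the negative Flap value, where the bounds swap order.
        low, high = sorted((base * (1 - PERTURB), base * (1 + PERTURB)))
        episodes = []
        for _ in range(n_samples):
            params = dict(BASE)                       # all others stay at base
            params[name] = rng.uniform(low, high)
            params["pipe_seed"] = rng.randrange(10**6)  # makes the pipe layout repeatable
            episodes.append(params)
        tasks.append(episodes)
    return tasks


tasks = sample_tasks()
print(len(tasks), len(tasks[0]))  # 4 5
```

Because the same `seed` reproduces the same tasks, every individual in a population can be evaluated on an identical task distribution.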
After the task parameters are prepared, fitness evaluation follows (Algorithm 2). The parameters of a network are stored as an array in the individual candidate and converted to the corresponding neural network representation (Line 3) before the fitness evaluation (Line 9). The memory of the C-module in CS is reset at the beginning of each task (Lines 5–7), and transferred from episode to episode otherwise. The average numbers of successfully passed pipes and collisions per episode are returned as the two objective values to be maximized and minimized, respectively. There are a total of 20 episodes, since there are four tasks with five episodes in each.
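The structure of this evaluation loop can be sketched as below. It is not Algorithm 2 itself: the `reset_memory` method, the `run_episode` callable, and the stand-in classes used to exercise the loop are all hypothetical.

```python
def evaluate(policy, tasks, run_episode):
    """Average pipes passed and hits over all episodes of all tasks.

    policy      -- object with a reset_memory() method (hypothetical interface)
    tasks       -- list of tasks, each a list of per-episode parameter sets
    run_episode -- callable(policy, params) -> (pipes_passed, hits) (hypothetical)
    """
    total_pipes = total_hits = n_episodes = 0
    for episodes in tasks:
        policy.reset_memory()        # new task: clear the LSTM state (h, c)
        for params in episodes:      # memory carries over between these episodes
            pipes, hits = run_episode(policy, params)
            total_pipes += pipes
            total_hits += hits
            n_episodes += 1
    return total_pipes / n_episodes, total_hits / n_episodes


class DummyPolicy:
    """Stand-in policy that only counts memory resets."""

    def __init__(self):
        self.resets = 0

    def reset_memory(self):
        self.resets += 1


def dummy_episode(policy, params):
    return 20, 1  # pretend every episode passes 20 pipes with one hit


policy = DummyPolicy()
tasks = [[{} for _ in range(5)] for _ in range(4)]  # four tasks, five episodes each
objectives = evaluate(policy, tasks, dummy_episode)
print(objectives, policy.resets)  # (20.0, 1.0) 4
```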
The overall procedure, i.e., NSGA-II applied to evolving agents in the Flappy Ball domain, is shown in Algorithm 3. It receives the following inputs: the number of tasks (4), the number of samples per task (5), perturb=0.2 (i.e., 20%), the base values Flap = -12.0, Gravity = 1.0, Forward = 5.0, and Drag = 1.0, a population size of 96, p = 0.9, and n = 2,500. The population size is chosen as a multiple of 24 since the fitness evaluations are distributed among 24 threads on a cluster (i.e., Dell PowerEdge M710, 2x Xeon X5675, 6 core @ 3.06GHz). The details about the genetic operators such as SBX (Simulated Binary Crossover), Polynomial Mutation, and Tournament Selection Based on Dominance can be found in the literature (Deb et al., 2002). NSGA-II uses a (μ + λ) elitist selection strategy with a bias toward individuals in lower fronts, where the Pareto-optimal front is the first front. If the individuals are located in the same front, the ones that are more distant from the others in objective space are selected to maintain a diverse set of trade-off solutions within the population.
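The preference for individuals that are distant from others in the same front corresponds to the crowding distance of NSGA-II (Deb et al., 2002). The following is a minimal stand-alone sketch of that measure for illustration; it is not the DEAP implementation used in the experiments, and the example front is made up.

```python
def crowding_distance(front):
    """Crowding distance for each point of one front (list of objective tuples)."""
    n = len(front)
    distance = [0.0] * n
    if n == 0:
        return distance
    for m in range(len(front[0])):                 # one pass per objective
        order = sorted(range(n), key=lambda i: front[i][m])
        # Boundary points are always kept, so they get infinite distance.
        distance[order[0]] = distance[order[-1]] = float("inf")
        span = front[order[-1]][m] - front[order[0]][m]
        if span == 0:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            distance[order[k]] += gap / span       # normalized neighbor gap
    return distance


# Illustrative (pipes passed, hits) values, not results from the paper.
front = [(20.0, 0.0), (22.0, 0.1), (23.0, 0.2), (26.0, 0.6)]
distances = crowding_distance(front)
```

Within a front, individuals with larger crowding distance are preferred, which keeps the surviving trade-off solutions spread out.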
Evolution as given in Algorithm 3 was run separately for CS, C, and S until an individual was found that passed at least 22.0 pipes on average with no more than 0.01 collisions. Although the final Pareto-optimal set in each run contained individuals with higher numbers of passed pipes, the minimum requirement meant that only relatively safe solutions were accepted. The generalization ability of these solutions was then evaluated.
The evolution of S takes the fewest generations since it has the smallest number of model parameters to optimize, i.e., 287, compared with 982 for C and 1207 for CS. To make sure the number of parameters was not a factor, another S with a larger Skill module, with the same number of parameters as CS, was also evolved to the same target level. However, it performed poorly compared to the smaller S in the generalization studies, apparently because it overfit more easily. Thus, it was excluded from the comparisons that follow.
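The reported parameter counts can be reproduced from the layer sizes given in Section 2.3, as the short calculation below shows. It assumes six inputs per module, one bias per dense unit, and two bias vectors per LSTM gate (the convention used, for example, by PyTorch); the last assumption is not stated in the text but is the one that yields the reported totals.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


def lstm_params(n_in, n_hidden, n_bias_vectors=2):
    """Four gate blocks, each with input weights, recurrent weights, and biases."""
    return 4 * (n_hidden * (n_in + n_hidden) + n_bias_vectors * n_hidden)


def controller_params(n_in):
    """Controller: 20 hidden nodes and two action outputs."""
    return dense_params(n_in, 20) + dense_params(20, 2)


OBS = 6
skill = dense_params(OBS, 10) + dense_params(10, 5)  # 125
context = lstm_params(OBS, 10)                        # 720

counts = {
    "S": skill + controller_params(5),                # Controller sees 5 skill outputs
    "C": context + controller_params(10),             # Controller sees 10 LSTM outputs
    "CS": skill + context + controller_params(15),    # Controller sees both
}
print(counts)  # {'S': 287, 'C': 982, 'CS': 1207}
```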
3.2. Generalization Behavior
To evaluate the generalization performance of the best performing networks, the task parameters (i.e., flap, gravity, forward, and drag) were changed in the following two ways while keeping the networks fixed:
The range of variation in the task parameters was increased from 20% to 75%; and
All four parameters were varied simultaneously as opposed to one at a time.
The task parameters were varied in a four-dimensional structured grid ranging from 25% to 175% of each parameter's base value. Thus, with the updated limits, the effect of
the flap action varied between [-21.0, -3.0] (previously [-14.4, -9.6]);
the gravity force varied between [0.25, 1.75] (previously [0.8, 1.2]);
the forward action varied between [1.25, 8.75] (previously [4.0, 6.0]); and
the drag force varied between [0.25, 1.75] (previously [0.8, 1.2]).
Each parameter axis was divided into 10 equal steps, and each set of task parameters was sampled three times (with varying pipe distributions) and the results were averaged. Therefore, each of the three networks was tested for 10 × 10 × 10 × 10 × 3 = 30,000 episodes. To compare the generalization performance of the networks pairwise, the differences in the number of successfully passed pipes and in the number of collisions are presented in the histograms of Figure 4. The horizontal axis shows the difference in either the number of passed pipes or the number of hits, whereas the vertical axis shows the frequency of these results. For each network, a distribution skewed to the right of zero is better in the left histogram (i.e., pipes passed), whereas the opposite is better in the right histogram (i.e., hits).
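The evaluation grid can be sketched as follows, under the assumption that "10 equal steps" means 10 evenly spaced values per axis, endpoints included. That reading is consistent with the setting Flap=-7.0, Gravity=0.6, Fwd=8.8, Drag=0.6 examined in Section 4, which falls on such a grid after rounding. The names used here are illustrative.

```python
from itertools import product

BASE = {"flap": -12.0, "gravity": 1.0, "forward": 5.0, "drag": 1.0}
LOW, HIGH = 0.25, 1.75     # 25% and 175% of the base value
POINTS, REPEATS = 10, 3    # assumed values per axis; pipe layouts per setting


def axis(base):
    """Evenly spaced values from LOW*base to HIGH*base, endpoints included."""
    step = (HIGH - LOW) / (POINTS - 1)
    return [base * (LOW + k * step) for k in range(POINTS)]


# Every combination of the four parameter axes.
grid = list(product(*(axis(base) for base in BASE.values())))
print(len(grid), len(grid) * REPEATS)  # 10000 30000
```

Unlike in training, all four parameters vary at once here, so most grid points combine effects that were never seen together during evolution.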
The histograms show that CS performs better than both C and S by a large margin (Fig. 4(a) and (b)). Interestingly, C and S generalize similarly even though they have very different architectures (Fig. 4(c)). These results are also evident in the summary boxplot of Fig. 5. Therefore, even though neither C nor S performs well alone, when combined into CS, they work well together and allow generalization to a wide range of new situations.
4. Behavior Analysis
To understand how the CS architecture outperforms its individual components C and S, a set of task parameters [Flap=-7.0, Gravity=0.6, Fwd=8.8, Drag=0.6], which was included in the generalization tests in Section 3, was evaluated further. This setting has previously unseen exaggerated effects for flap and forward, and previously unseen diminished effects for gravity and drag. Thus, actions tend to push up and speed up the agents more than expected, and it is difficult for them to slow down and come down. Generalization requires both extrapolation beyond the task parameter limits and an understanding of previously unseen interactions between the parameters. All three networks were tested in the same environment and their behavior tracked in detail.
The C network was able to pass 15 pipes successfully and collided with six pipes, whereas S performed slightly better by passing 16 pipes with five collisions. In contrast, CS managed to pass all 21 pipes without hitting any of them. Fig. 6 illustrates how different the strategy of CS is from those of C and S. Both C and S use all four actions (flap, forward, simultaneously flap and forward, or do nothing, i.e., glide), but CS never uses flap alone. That action simply lifts the agent up, which is rarely an optimal action in this environment, where it takes such a long time to come down. If it is necessary to go up, that is because the opening is high, and in that case it is more efficient to move forward as well.
As an illustration, Fig. 7 shows a situation at the 4th and 5th pipes. Both C and S make a similar mistake by flapping up and forward. They end up too high too fast, do not have enough time to come back down, and crash into the 5th pipe. In contrast, as soon as the 5th pipe becomes visible, CS refrains from both actions while there is still enough time for the weaker gravity and drag to slow the agent and pull it down, and it reaches the opening in the 5th pipe just fine.
5. Discussion and Future Work
The proposed Context+Skill approach adapts to unseen situations by representing context explicitly. Compared to its components, it has a remarkable ability to generalize to unseen situations. In this proof-of-concept study, the neural network model has a fixed topology, which constrains the model's functionality. Evolving the network topology together with its weights (Stanley and Miikkulainen, 2002; Schrum and Miikkulainen, 2014) is a natural extension of this work.
Besides the architecture, the choice of the tasks used for training plays an important role in the generalization capability of the model. Therefore, one direction for future work is to investigate methodologies that can automatically design a curriculum, i.e., a set of new training tasks and a better order to learn them (Narvekar and Stone, 2018; Wang et al., 2019; Schmidhuber, 2011; Justesen and Risi, 2018; Risi and Togelius, 2019).
Another direction for future work is to look into the hidden layer patterns to see if any evidence can be found for the observed generalization capabilities (Zhang et al., 2016) or representational capacity (Arpit et al., 2017). There is plenty of work on learned hierarchical representations in applications such as computer vision (Yosinski et al., 2015) and natural language understanding (24); however, such work is still limited in reinforcement learning tasks.
Lifelong machine learning tries to mimic how humans and animals learn by accumulating the knowledge gained from past experience and using it to incrementally adapt to new situations (Parisi et al., 2018). The generalization ability presented in this work can serve as a foundation for continual learning. It can provide an initial rapid adaptation to new situations upon which further learning can be based. How to convert generalization into a permanent ability in this manner is an interesting direction of future research.
Perhaps the main challenge in deploying artificial agents in the real world is that they are brittle: they can only perform well in situations for which they were trained. This paper demonstrates an alternative approach based on separating contexts from the actual skills. Context can then be used to modulate the actions in a systematic manner, significantly extending the range of unseen situations that can be handled. This principle was evaluated in a challenging version of the Flappy Bird game, and was shown to perform better than traditional training and general memory-based training. The Context+Skill approach should therefore be useful in many control and decision making tasks in the real world.
This research was supported in part by DARPA L2M Award DBI-0939454.
- Arpit et al. (2017). A closer look at memorization in deep networks.
- Deb et al. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), pp. 182–197.
- Deb (2001). Multi-objective optimization using evolutionary algorithms. John Wiley & Sons, Inc., USA.
- Fernando et al. (2018). Meta-learning by the Baldwin effect. Proceedings of the Genetic and Evolutionary Computation Conference Companion.
- Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks.
- (6) Flappy Bird. https://en.wikipedia.org/wiki/Flappy_Bird (Online; accessed 3-February-2020).
- Fortin et al. (2012). DEAP: evolutionary algorithms made easy. Journal of Machine Learning Research 13, pp. 2171–2175.
- Géron (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. 1st edition, O'Reilly Media, Inc.
- Grbic and Risi (2019). Towards continual reinforcement learning through evolutionary meta-learning. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '19, pp. 119–120.
- Greff et al. (2017). LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28 (10), pp. 2222–2232.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- Justesen and Risi (2018). Automated curriculum learning by rewarding temporally rare events.
- Kansky et al. (2017). Schema networks: zero-shot transfer with a generative causal model of intuitive physics.
- Knowles et al. (2001). Reducing local optima in single-objective problems by multi-objectivization. In Evolutionary Multi-Criterion Optimization, E. Zitzler, L. Thiele, K. Deb, C. A. Coello Coello, and D. Corne (Eds.), pp. 269–283.
- Li and Miikkulainen (2017). Evolving adaptive poker players for effective opponent exploitation. In AAAI-17 Workshop on Computer Poker and Imperfect Information Games.
- Li and Miikkulainen (2018). Opponent modeling and exploitation in poker using evolved recurrent neural networks. In Proceedings of The Genetic and Evolutionary Computation Conference (GECCO 2018), Kyoto, Japan.
- Narvekar and Stone (2018). Learning curriculum policies for reinforcement learning.
- Parisi et al. (2018). Continual lifelong learning with neural networks: a review.
- Risi and Togelius (2019). Procedural content generation: from automatically generating game levels to increasing generality in machine learning.
- Schmidhuber (1987). Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. Ph.D. Thesis, Institut für Informatik, Technische Universität München.
- Schmidhuber (2011). POWERPLAY: training an increasingly general problem solver by continually searching for the simplest still unsolvable problem.
- Schrum and Miikkulainen (2014). Evolving multimodal behavior with modular neural networks in Ms. Pac-Man. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '14, pp. 325–332.
- Stanley and Miikkulainen (2002). Evolving neural networks through augmenting topologies. Evol. Comput. 10 (2), pp. 99–127.
- (24) The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (Online; accessed 3-February-2020).
- Thrun and Pratt (1998). Learning to learn: introduction and overview. In Learning to Learn, pp. 3–17.
- Wang et al. (2019). Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions.
- Yosinski et al. (2015). Understanding neural networks through deep visualization.
- Zhang et al. (2016). Understanding deep learning requires rethinking generalization.