Hypothesis-Driven Skill Discovery for Hierarchical Deep Reinforcement Learning

05/27/2019 ∙ by Caleb Chuck, et al. ∙ The University of Texas at Austin

Deep reinforcement learning encompasses many versatile tools for designing learning agents that can perform well on a variety of high-dimensional visual tasks, ranging from video games to robotic manipulation. However, these methods typically suffer from poor sample efficiency, partially because they strive to be largely problem-agnostic. In this work, we demonstrate the utility of a different approach that is extremely sample efficient, but limited to object-centric tasks that (approximately) obey basic physical laws. Specifically, we propose the Hypothesis Proposal and Evaluation (HyPE) algorithm, which utilizes a small set of intuitive assumptions about the behavior of objects in the physical world (or in games that mimic physics) to automatically define and learn hierarchical skills in a highly efficient manner. HyPE does this by discovering objects from raw pixel data, generating hypotheses about the controllability of observed changes in object state, and learning a hierarchy of skills that can test these hypotheses and control increasingly complex interactions with objects. We demonstrate that HyPE can dramatically improve sample efficiency when learning a high-quality pixels-to-actions policy; in the popular benchmark task, Breakout, HyPE learns an order of magnitude faster than common baseline reinforcement learning and evolutionary strategies for policy learning.




1 Introduction

While recent advances in deep reinforcement learning (RL) have produced exciting results on a variety of high-dimensional visual tasks, ranging from video games to robotic manipulation, these algorithms often require large amounts of data to achieve good performance. In many real-world domains, such as robotics, this data is difficult and expensive to collect in large quantities. One of several causes of this high sample complexity is that deep RL algorithms typically strive to be as problem-agnostic as possible. Generality, while a laudable goal, often ignores powerful inductive biases, some of which may work well across large collections of problems of interest. In this work, we focus on object-centric tasks that (approximately) obey a set of common-sense physical laws; for example, we assume that objects have a consistent visual appearance and have properties (for instance, position or velocity) that do not change unless acted upon by another object. In exchange for this limiting set of assumptions, we demonstrate that it is possible to realize dramatic gains in sample efficiency for problems in which these assumptions hold. Furthermore, we argue that this set of assumptions is quite reasonable for a wide range of tasks, ranging from pseudo-physical video games to robotic manipulation.

Figure 1: The HyPE (Hypothesis proposal and evaluation) loop. Each iteration of the HyPE loop tries to learn a new object interaction. Section 3 expands on each step in detail.

To leverage the problem structure provided by objects in physical reinforcement learning tasks, we introduce the Hypothesis Proposal and Evaluation (HyPE) algorithm. As an example of the type of learning process we wish to capture, consider a human learning to play the Atari game Breakout, in which the player must control a paddle to bounce a ball, which in turn, can destroy bricks at the top of the screen. By experimenting with random inputs, the player observes the ball and paddle moving, quickly identifying them as objects that obey pseudo-physical laws. However, only the paddle motion appears to be directly correlated with the controller inputs, so the player attempts to learn to control it first. As they do, they observe that the ball bounces off the paddle and destroys the bricks, which provides reward. Recognizing that they cannot affect the bricks through the paddle directly, the player learns to control the ball via bouncing with the paddle. Finally, once the player has learned to control the ball, they can learn strategies to aim and destroy bricks quickly, completing the game.

HyPE formalizes the intuitions behind such a learning process in a 3-step learning loop:

  1. State abstraction via object discovery: The first step of the HyPE loop discovers objects from raw pixel inputs by learning convolutional filters that meet certain physics-guided criteria. Reasoning about objects provides a factorization of state that can circumvent the need to experience dense samples from the full distribution of possible states. Instead, HyPE learns to control simpler object-object interactions that each only rely on a small subset of state information. For example, changing the directional velocity of the ball with the paddle is only dependent upon the paddle and ball positions, and not the wall or bricks.

  2. Skill proposal via hypothesis generation: The second step of the HyPE loop proposes hypotheses about what caused the changes it observed in object properties. Namely, it hypothesizes about whether the change is controllable and, if so, which object-object interactions can control it. HyPE then converts each hypothesis into a goal for a corresponding skill/option sutton , such that the control hypothesis becomes testable. For example, hypothesizing that the paddle can be used to change the directional velocity of the ball produces a goal of bouncing the ball.

  3. Hierarchical skill learning via hypothesis evaluation: The third step of the HyPE loop uses RL to try to learn the options proposed in the previous step, thereby testing the generated hypotheses. Successful learning of an option confirms the corresponding interaction hypothesis. However, for learning to be tractable, a given option often uses other previously learned options as primitive actions, leading to a hierarchy of skills. For example, an option for destroying a brick can use the ball-control options as primitive actions, which in turn use paddle-control options as primitive actions, which in turn use raw controller inputs.

We demonstrate that HyPE improves the sample efficiency of policy learning on the classic arcade game of Breakout from raw pixels by an order of magnitude, as compared to baseline deep RL algorithms. In addition, we show that HyPE automatically constructs a control structure which describes and characterizes several intuitive components of the game, providing evidence that HyPE contains the right set of inductive biases to serve as a foundation for scaling RL to real-world manipulation tasks.

2 Assumptions

We frame our problem as a Markov Decision Process (MDP). At each time step $t$, the agent observes state $s_t$, with starting state distribution $\rho_0$, and takes action $a_t$. The next state is determined by $P(s_{t+1} \mid s_t, a_t)$, which is the probability of experiencing a subsequent state $s_{t+1}$ given current state $s_t$ and current action $a_t$. A policy is defined as $\pi(a \mid s)$, or the probability of an action $a$ given state $s$. The agent receives external reward $r(s_t, a_t)$ as a function of the current state $s_t$ and action $a_t$. From this, we can define the return as the discounted future reward: $R = \sum_{t} \gamma^{t} r(s_t, a_t)$, where $\gamma$ is a given discount factor. The reinforcement learning problem is to find the policy that maximizes this total expected return.
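As a small illustration of the return defined above, the following sketch computes a discounted sum of rewards for a finite episode. The function name `discounted_return` and the example reward list are illustrative assumptions, not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite list of rewards."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Example: three unit rewards with a discount factor of 0.5.
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```

Accumulating backwards avoids computing explicit powers of the discount factor.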

To solve this MDP efficiently, our proposed algorithm generates and tests hypotheses about relationships between actions and objects in the world. Thus, our primary assumptions guide where interesting events will occur (locations), when to look for such events (salient times), and how to check for controllability (option policy learning).

In our instantiation of HyPE, we define $f_i$ to be a function that finds the location of object $o_i$ in an input image, represented as $x_i^{(t)}$, the x,y coordinates of that object (time indexed by $t$). $\Delta x_i$ represents a change in position. In general, $x_i$ and $\Delta x_i$ can encompass a broader set of object properties, rather than only position. The following subsections outline additional assumptions made by our instantiation of HyPE, but note that many of these assumptions can similarly be generalized beyond properties such as position.

2.1 Object Structure

Our first assumption is that the world is composed of objects with consistent properties and relationships, which we define for HyPE. We use $s$ to denote the input state, an image.

Consistent Properties: There exists a function $f_i$, which maps from the raw state $s$ to the location $x_i$ of object $o_i$. This implies that visual cues can be used to determine the location of an object in every state.

Consistent Relationships: A consistent relationship is one where control of one object $o_i$ can exert some control of another object $o_j$. To specify this formally, we define some change in the state of $o_j$, denoted as $\Delta x_j$, two policies, the control policy $\pi_c$ and the base policy $\pi_b$, and a time horizon $T$. Then the control policy has the property:

$$P(\Delta x_j \text{ occurs within } T \text{ steps} \mid \pi_c) > P(\Delta x_j \text{ occurs within } T \text{ steps} \mid \pi_b) \qquad (1)$$

The set of possible $\Delta x_j$ defines ways to manipulate $o_j$ within a set time horizon, which, if learned, we can use as an action space on object $o_j$. In HyPE, $o_i$ and $o_j$ are only related if changes in $x_j$ can be induced by $\Delta x_i$. Since $x_j$ is a position, $\Delta x_j$ is some displacement.

We can also treat the actions taken on the base MDP and the base MDP reward as special types of objects. The state of these objects is their value at time $t$ (the action taken and the reward received, respectively). We call these “abstract objects”, since they do not appear in the input image.

2.2 Proximity

We will make use of a saliency function, which determines timesteps where one object is likely to effect change in another object. While in general, any two objects might interact at any time, we use spatial proximity and the quasi-static property to limit the search for object interactions.

Spatial proximity is based on the definition of object locations from Section 2.1. Thus, for objects $o_i$ and $o_j$ with locations $x_i$ and $x_j$, we define the objects to be proximal if $\lVert x_i - x_j \rVert_p < \epsilon$, that is, the $p$-norm of their separation is less than some epsilon. In HyPE, we use the l2-norm.

Note that we implicitly assume temporal proximity in defining saliency, since we say that certain timesteps are salient. This implies that object effects (changes) and salient events co-occur within a short time window.
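The proximity-based saliency described in this section can be sketched as follows. This is a minimal illustration assuming that object locations are already available as (x, y) tuples; the names `is_proximal` and `salient_timesteps` and the threshold value in the example are hypothetical.

```python
import math


def is_proximal(pos_a, pos_b, epsilon):
    """True when the l2 distance between two (x, y) positions is below epsilon."""
    return math.dist(pos_a, pos_b) < epsilon


def salient_timesteps(traj_a, traj_b, epsilon):
    """Indices of timesteps at which the two object trajectories are proximal."""
    return [t for t, (a, b) in enumerate(zip(traj_a, traj_b))
            if is_proximal(a, b, epsilon)]


# Example: a stationary paddle and a ball approaching it.
paddle = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
ball = [(10.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
salient = salient_timesteps(paddle, ball, epsilon=6.0)
```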

2.3 Quasi-static Property

We additionally assume that some properties of objects are quasi-static: they do not change unless acted upon by another object. To define changepoints, we use the formulation from Changepoint detection using Approximate Model Parameters (CHAMP) niekum . This formulation finds changepoints in a trajectory, $x^{(1)}, \dots, x^{(N)}$, where changepoints are chosen so that within each segment, a model $m$ has the property $m(x^{(t)}) \approx x^{(t+1)}$, along with regularization and constraints on the number of models. In other words, in segments between changepoints, a simple model can predict the next state. Our model choice is $m(x^{(t)}) = x^{(t)} + d$, a fixed vector displacement model.

In this work, we use the quasi-static property in two ways. First, the quasi-static property assumes that changepoints are caused by object-object interactions. Second, we use the quasi-static property to assert that object changepoints are salient times to search for changes in other objects.
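To make the fixed-displacement idea concrete, the sketch below flags a changepoint wherever the frame-to-frame displacement of an object changes. This is a deliberately simplified stand-in for the Bayesian changepoint formulation cited above, not an implementation of it; the function name and the tolerance parameter are assumptions.

```python
import math


def displacement_changepoints(positions, tol=1e-6):
    """Indices into `positions` at which the per-step displacement changes.

    Under a fixed-displacement model, consecutive displacements inside a
    segment are (nearly) identical, so a jump between them marks a changepoint.
    """
    disps = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(positions, positions[1:])]
    changepoints = []
    for t in range(1, len(disps)):
        if math.dist(disps[t], disps[t - 1]) > tol:
            changepoints.append(t)  # the displacement changes at position t
    return changepoints


# Example: a ball moving diagonally that bounces at its third position.
ball_path = [(0, 0), (1, 1), (2, 2), (3, 1), (4, 0)]
bounce_indices = displacement_changepoints(ball_path)
```

A real detector must also cope with noisy positions, which is one reason the model-based formulation is used rather than a hard threshold.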

2.4 Contingency

While there may be many objects in the world, we can limit the ones we are interested in by searching only for objects that are contingent. We define contingent objects recursively: a contingent object can either be controlled directly by the actions available to the agent, $A$, or through control of a different contingent object. In the Breakout example, the ball is contingent because it can be controlled through the paddle (which is controlled directly by $A$), but the walls are not because they cannot be controlled by any contingent object. We can formalize this by defining the object interaction graph $G$. A node $n_i$ of $G$ corresponds to an object $o_i$. A directed edge from node $n_i$ to node $n_j$ only exists if a control policy as defined in Equation 1 has been learned. An object $o_i$ is contingent if there exists a path from the abstract node corresponding to the raw actions (call this $n_A$) to $n_i$. HyPE only attempts to learn edges from contingent nodes.
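The recursive definition of contingency amounts to reachability from the raw-action node in the object interaction graph. The following sketch illustrates this with a breadth-first search; the adjacency-dictionary representation and the node labels are assumptions made for the example.

```python
from collections import deque


def contingent_objects(edges, action_node="actions"):
    """Objects reachable from the raw-action node along learned control edges.

    `edges` maps each node to the nodes it has a learned control policy over.
    """
    seen = {action_node}
    queue = deque([action_node])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(action_node)  # the abstract action node is not itself an object
    return seen


# Example mirroring Breakout: actions control the paddle, the paddle controls
# the ball, and nothing controls the wall.
breakout_edges = {"actions": ["paddle"], "paddle": ["ball"], "wall": []}
contingent = contingent_objects(breakout_edges)
```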

2.5 Option-based Causal Discovery

Finally, in HyPE we say that if we can learn a control policy $\pi_c$ from a contingent node $n_i$, then we have control over the target object $o_j$ (adding an edge from $n_i$ to $n_j$ in the graph defined above). Recall that $\pi_c$ produces some change $\Delta x_j$ with probability higher than a baseline policy (typically the random policy). This means that through displacements of $o_i$, this learned policy can exert nontrivial control over $o_j$, which defines an action space over the target object $o_j$.

3 Proposed System

Our system involves a three-step hypothesis proposal and evaluation loop (HyPE loop). We define a hypothesis as a boolean statement about a relationship between two objects . At each full iteration of the loop, the system adds edges to the graph as described in Section 4. The HyPE loop’s three steps are:

  1. Object discovery, which locates a new object by learning a new .

  2. Hypothesis proposal, which uses past data (possibly collected using prior iterations of HyPE) to define a hypothesis about where and how two objects interact.

  3. Hypothesis verification, where the agent learns to reliably reproduce the interaction between two objects.

Over multiple iterations of the HyPE loop, an object interaction graph as defined in section 2.4 is generated, one edge at a time. We start with a single node, , and use it to discover filters to locate new objects in the frame, and learn how those objects can be controlled.

3.1 Object Discovery

This step of the algorithm seeks to visually identify objects that are controllable, either directly or via another object. This corresponds to searching for visual features whose movements can be explained by either direct action inputs or interactions with contingent objects. We make this into an optimization problem, in which we try to learn an object function $f_j$, a filter on the input image that identifies the location of an object that meets the aforementioned controllability criteria. We optimize this over a record of historical data, either from a baseline policy or from previous iterations of the HyPE loop.

Given a premise object $o_i$, which we can identify with $f_i$ (an object function we have already learned), we search for another object $o_j$ which $o_i$ can affect. This effect is characterized by a changepoint (using the quasi-static property) in $x_j^{(t)}$, the location of $o_j$ at time $t$, which is the output of $f_j$. Recall that we expect one object to affect another in salient regions (based on proximity or changepoints). Thus, if we learn a new object, we want it to have changepoints more often at the salient regions than elsewhere. For example, in Breakout, if we were trying to learn to locate the ball from the paddle, we would expect the trajectory of our learned ball to change more often near the paddle than on some random frame.

In order to make this into an objective for our optimization, we start with a saliency function which operates on (the location of our premise object at time ) and (from which we are trying to learn). For example, some timestep might be salient if the positions were in close proximity. We define our saliency functions formally in the supplementary material.

Next, we determine if changepoints tend to happen to the target object at times when it is salient with the premise object. Define $n_{\text{sal} \wedge \text{cp}}$ as the number of times where the two positions are both salient and there is a changepoint in the target (we abuse notation, applying the saliency and changepoint functions to the entire dataset to get full dataset counts). Compare this with the total number of changepoints in the target, $n_{\text{cp}}$, and the total number of salient points between the premise and the target, $n_{\text{sal}}$. Our loss function should optimize over both getting a good number of changepoints and total number of salient points. If the learned filter can achieve good loss simply by raising saliency without caring about changepoints, then it would simply track the premise object, or some highly correlated component. On the other hand, if the optimizer can simply maximize the changepoints, it will learn a filter that constantly jumps between the premise object and other, random points. Thus, we optimize the F1 score ($F_1$), a common statistical measure of significance (see supplementary material for precise definition).

Maximizing the F1 score maximizes the number of desired events (in this case, changepoints in the target that occur at salient times), relative to both the total number of significant events (changepoints in the target) and the total number of interesting points (salient times between the premise and the target). Since changepoints define significant changes in the target object, we prioritize learning objects where a significant number of changepoints can be explained by saliency. To mitigate the effect of re-learning old objects, we subtract the mean representation of already-learned objects from the object location.
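A minimal sketch of this count-based F1 computation is given below, assuming that per-timestep boolean saliency and changepoint indicators have already been computed. The function name is hypothetical, and the assignment of the two ratios to "precision" and "recall" is an assumption; because F1 is symmetric in the two, the resulting score does not depend on that choice.

```python
def f1_from_indicators(salient, changepoint):
    """F1 score of 'salient and changepoint' events over a whole dataset.

    `salient[t]` and `changepoint[t]` are booleans for timestep t.
    """
    both = sum(1 for s, c in zip(salient, changepoint) if s and c)
    n_salient = sum(1 for s in salient if s)
    n_changepoint = sum(1 for c in changepoint if c)
    if both == 0:
        return 0.0  # no desired events; also avoids division by zero
    precision = both / n_salient      # assumed labeling of the two ratios
    recall = both / n_changepoint
    return 2 * precision * recall / (precision + recall)


# Example: 3 salient timesteps, 3 changepoints, 2 of which coincide.
score = f1_from_indicators([True, True, False, False, True],
                           [True, False, False, True, True])
```

A filter that merely tracks the premise object raises the salient count without raising the coincident count, and a filter that jumps around raises the changepoint count in the same way, so both degenerate solutions lower the score.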

As an additional regularization on filter activations to keep it from jumping around wildly, we regularize displacement of (controlled by ). Since the F1 score is computed over counts on the whole dataset (length ), we use to denote the full dataset of object positions. Our full objective for object discovery:


We optimize this objective with covariance matrix adaptation evolution strategy (CMA-ES, hansen ). In addition, we smooth the outputs to clean up relatively rare cases where the filter fails, an operation that has been shown to improve performance on downstream deep learning policy tasks chuck . We represent the filter with a two layer convolutional network with and filters.

Figure 2: Images from the object discovery step of HyPE, where the cross-hairs show the location of the object. The bottom row shows the heatmap of the filter. The left shows tracking of the paddle location, and the right shows tracking of the ball.

3.2 Hypothesis Proposal

A hypothesis in HyPE is a boolean function that operates between two objects, a premise object $o_i$ and a target object $o_j$, which evaluates to true when some interesting change (a changepoint or control) has occurred in $o_j$ during a salient time (a function of the locations of $o_i$ and $o_j$). This hypothesis defines the reward in hypothesis evaluation. Before proposing a hypothesis that HyPE will then try to reproduce, we check that it is supported by existing data.

The process of verifying a hypothesis mirrors the F1 criteria defined in the last section, since the objective is closely related: we want to see if our premise affects our target in a statistically significant way. However, since we are not trying to optimize the inputs, we simply test:


This checks if at salient times, changepoints occur significantly more often than not. If this check passes, then hypothesis proposal proposes the boolean function:


This checks if, when the two objects are salient, there is a changepoint in the target object. Maximizing this frequency induces changepoints in the target through the premise. For example, a ball bouncing hypothesis in Breakout using proximity as salience would attempt to observe as many ball displacement changepoints near the paddle as possible.

However, we might want to do more than just induce an arbitrary changepoint. In order to induce a particular behavior in our target object, we can extend the hypothesis to induce a changepoint where the subsequent segment has the desired behavior. Recall that segments are approximately modeled by , which is a fixed displacement in our instantiation. Then, if we expect there to be some characteristic displacements (or range of displacements) after a changepoint, we can hypothesize that, while the premise object is salient with the target, it causes the desired displacement. For example, using proximity as salience on raw actions (which are always “proximal”) we might hypothesize that actions control displacement(s) in the paddle.

To formalize this, for our historical data changepoints where the model is salient, , we take the corresponding models and collect the displacements . Thus, we extend the definition of hypothesis to (using as the states at times and ):


This hypothesis is true when the delta state matches , and the locations are salient. Notice that this hypothesis dropped changepoints on , which assumes that the changepoint hypothesis has already been checked (or in practice, has sufficiently high evidence as defined by Equation 3).

However, the model of motion after a changepoint might be noisy due to distractor objects or faulty vision. In the HyPE system, we mitigate this by clustering the model parameters. We use Dirichlet Process Gaussian Mixture Models rasmussen for this clustering.

3.3 Hypothesis Evaluation

Now, we want to learn to reproduce the hypothesis. This is done by simply incorporating a boolean hypothesis function $h$ into the reward function, which outputs 1 if the hypothesis is true and 0 if false:

$$r_h = \begin{cases} 1 & \text{if } h \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$

We define our actions as the learned options, , over the premise object (not the raw actions ). This allows us to gain sample efficiency through action abstraction and hierarchy.
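The conversion of a boolean hypothesis into a reward signal can be sketched as follows. The two hypothesis functions are simplified illustrations of the changepoint and control hypotheses of Section 3.2; their names, signatures, and the matching tolerance are assumptions rather than the exact definitions used by HyPE.

```python
import math


def changepoint_hypothesis(salient, changepoint):
    """True when the target has a changepoint at a salient timestep."""
    return bool(salient and changepoint)


def control_hypothesis(salient, delta, target_delta, tol):
    """True when, at a salient timestep, the observed target displacement
    matches a desired displacement to within `tol` (l2 distance)."""
    return bool(salient) and math.dist(delta, target_delta) <= tol


def hypothesis_reward(hypothesis_value):
    """Reward of 1 when the hypothesis holds and 0 otherwise."""
    return 1.0 if hypothesis_value else 0.0


# Example: the paddle moved by (2, 0) while salient, matching the desired
# displacement, so the option being learned receives reward 1.
reward = hypothesis_reward(
    control_hypothesis(True, (2.0, 0.0), (2.0, 0.0), tol=0.5))
```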

To gain sample efficiency through state abstraction, we use as input the locations of the premise and target objects. Though this assumption is quite limiting (if there is some other, unknown object that affects the target, the policy cannot account for this object), we use this abstraction because the performance benefits appear to outweigh this cost. Even with other objects that are unaccounted for, we still expect to induce a change more often than random, allowing future iterations of HyPE to learn about the unknown object(s).

In order to maximize the advantage of our state abstraction, we add several simple transformations of the inputs to our state space, namely velocity and relative position of the objects: and . Our neural net architecture computes the following:


Where are our input operations. This architecture converts each input state into a length 128 vector, where is a matrix of weights (all input properties have dimension 2), and is some activation. It then takes the mean of all feature vectors, and feeds these forward to the outputs.
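The architecture described above (a per-input linear map to a length-128 feature vector, an activation, a mean over feature vectors, and an output layer) can be sketched in plain Python as follows. Several details are assumptions made for illustration only: a separate weight matrix per input property, `tanh` as the activation, Gaussian weight initialization, and three output actions.

```python
import math
import random

HIDDEN = 128  # length of each per-input feature vector


def build_inputs(paddle, ball, prev_ball):
    """Object positions plus velocity and relative position, each 2-dimensional."""
    velocity = (ball[0] - prev_ball[0], ball[1] - prev_ball[1])
    relative = (paddle[0] - ball[0], paddle[1] - ball[1])
    return [paddle, ball, velocity, relative]


def make_params(num_inputs, num_actions, seed=0):
    """Random weights: one HIDDEN x 2 matrix per input, one output matrix."""
    rng = random.Random(seed)
    w_in = [[[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(HIDDEN)]
            for _ in range(num_inputs)]
    w_out = [[rng.gauss(0.0, 0.1) for _ in range(HIDDEN)]
             for _ in range(num_actions)]
    return w_in, w_out


def forward(inputs, params):
    """Map each 2-d input to a feature vector, average them, apply the output layer."""
    w_in, w_out = params
    feats = [[math.tanh(row[0] * x[0] + row[1] * x[1]) for row in w]
             for x, w in zip(inputs, w_in)]
    mean = [sum(f[k] for f in feats) / len(feats) for k in range(HIDDEN)]
    return [sum(wk * mk for wk, mk in zip(row, mean)) for row in w_out]


inputs = build_inputs(paddle=(5.0, 0.0), ball=(3.0, 4.0), prev_ball=(2.0, 3.0))
outputs = forward(inputs, make_params(num_inputs=4, num_actions=3))
```

Averaging the feature vectors keeps the output layer's size independent of how many input properties are supplied.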

Figure 3: Performance comparison between PPO, A2C (orange), Rainbow (blue) and HyPE (maroon), evolutionary strategies salimans , CMA-ES and a baseline (10 trials each) of learning on the true paddle and ball locations (Base). The y-axis is average episode return, and the x-axis is number of frames experienced, on a log scale. HyPE outperforms Rainbow, the best performing baseline, by 7x in training and 18x in test. This difference is because HyPE uses CMA-ES as an optimizer, which tends to have significantly lower training performance than test, which is not as true for Q-learning methods like Rainbow. HyPE testing performance at 55k frames matches Rainbow training performance at 1m time steps (see Table 1).
Algorithm Base HyPE Rainbow A2C & PPO
Timesteps 52,000 55,500
Table 1: Training time to find an evaluation policy with 244 blocks hit, the average test score of HyPE after 55,500 frames of training (standard error 27, 20 trials). “Base” is a CMA-ES algorithm run on the relative positions of the paddle and the ball, ball velocity, and ball and paddle positions, from the true underlying game state.

4 Results

We demonstrate that HyPE can learn to achieve high performance on the classic game Breakout after two iterations of the algorithm loop. Both the visual and behavior components have intuitive interpretations, so we can observe the performance of the HyPE system as it progresses. Step by step: since the HyPE loop has no initial information, it learns from 1k frames of data with a random policy, and then uses that information with the loss defined in Equation 2, and a prior object of , which is the only node in the graph so far. Since the vision loss is above the threshold F1 score, , it checks and proposes a hypothesis, which results in three characteristic behaviors as defined in Equation 5 after clustering (the HyPE loop automatically chooses a control hypothesis when it has control clusters with more than 10 assignments). Having finished step 2, the HyPE loop performs the learning of three hypothesis-generated options—moving the paddle or pixels, respectively. These options use paddle position and velocity as input, and learning converges in 2.5k timesteps. Though this first step of the HyPE loop learns an intuitive first object (the paddle), this is not encoded explicitly anywhere in the algorithm, but emerges from physical priors and controllability. Because learning succeeds, the HyPE loop adds a new node, connected to , which we call .

Using the cumulative data, the HyPE loop then applies the vision loss to all nodes. , with the paddle mean removed from the image, and does not meet the threshold to propose any new objects. However, discovers a new object—the ball (the results of the vision step are shown in Figure 2). In this case, the ball discovery is a consequence of the changepoints it exhibits near the paddle. The subsequent hypothesis check for changepoints near the paddle passes and is used to generate a reward function for learning an option to bounce the ball off of the paddle. Using this learned option continuously is sufficient to achieve high extrinsic reward in a small number of frames.

In Figure 3, we show that HyPE has roughly an order of magnitude improvement in sample efficiency compared to baseline RL methods. HyPE, at train time, achieves average reward per episode of in 55k frames, while Rainbow hessel takes 400k timesteps, and Proximal Policy Optimization schulman and A2C mnih take roughly 1.4M timesteps to achieve the same performance. However, CMA-ES, used to learn the HyPE bouncing policy, typically has higher test performance than train. Thus, we also show that the evaluation policy learned by HyPE after 55k frames achieves average reward per episode performance, which is more than an order of magnitude better than Rainbow, the best performing baseline.

Finally, in order to better understand the performance gains of HyPE, we demonstrate that a learner using the actual positions of the ball and the paddle (and relative positions) achieves performance similar to that learned by HyPE, as shown by the “Base” in Figure 3 and Table 1. This implies that the majority of the performance improvement comes from using the object relative input states. However, HyPE provides a principled method for learning and basis for using those object-relative input states, as well as the ability to perform targeted, hierarchical exploration.

5 Related Work

Existing work has improved sample efficiency through state abstraction and skill learning. Some of this work is done in the context of exploration, such as by learning hash functions to search for novel states ostrovski ; Tang ; burda , or replicating actions to return to partially explored regions savinov ; Ecoffet . Other methods try to learn sub-goals from hindsight andrychowicz , using a learned controller vezhnevets or bottleneck regions bacon .

Broad physical assumptions have been incorporated into work in probing intuitive dynamics Piloto , learning physical dynamics or relations Chang ; Zambaldi , object representations Greff ; burgess , and physics belbute . However, our method incorporates these relationships into hierarchical learning similar to Zhou .

Several works inspired our use of hierarchy and model information. In particular, contingency has been used to focus on particular regions of the input space bellemare . Perception and control can be separated to learn policies with few parameters cuccu . Alternatively, learning hierarchies of control with options has been studied in detail sutton ; bacon , and can be used to define a system for learning skills and state spaces konidaris . Our work also carries similarities to much work in model based reinforcement learning kaiser . Despite similarities to these methods, the HyPE loop uniquely exploits the combination of broad physical assumptions and hierarchical reinforcement learning to achieve state and action abstraction.

The control graph structure and hypothesis verification components of HyPE draw upon ideas related to causality pearl . These components of HyPE relate to work in which graphs Shanmugam and policies Buesing are learned from interactions with the environment. Schema Networks Kansky combine intuitive physics with a process for learning causal networks to gain sample efficiency on model transfer. These networks achieve results related to object interactions and useful control over them, which parallels the learned object-interaction graph from the HyPE loop. However, Schema Networks do not learn objects from raw inputs or provide a curriculum of learning over different objects, preventing them from having the same sample-efficiency benefits on non-transfer problems as HyPE.

6 Conclusion

We introduced the HyPE algorithm, which incorporates general purpose priors about the world, such as proximity, object factorization, and quasi-static assumptions, in order to efficiently learn to hierarchically explore and control its environment. Though this system requires several limiting assumptions, making it less application agnostic than classic general-purpose RL algorithms, these assumptions are reasonable for physical domains and lead to sample efficiency that is roughly an order of magnitude better than baseline RL methods. Future work can aim to address the practical issues required to extend HyPE to successfully work in physical real-world domains, such as robotic manipulation. Furthermore, the causal graph structure generated by HyPE may have implications for both explainable AI as well as transfer learning that can be explored.


The authors are supported by the ONR through the National Defense Science and Engineering Graduate Fellowship (NDSEG) program. This work has taken place in the Personal Autonomous Robotics Lab (PeARL) at The University of Texas at Austin. PeARL research is supported in part by the NSF (IIS-1724157, IIS-1638107, IIS-1617639, IIS-1749204) and ONR (N00014-18-2243).


7 Breakout

Figure 4: The Breakout domain. The paddle has been colored green, the ball has been colored red, the walls have been colored orange, and the blocks have been colored purple. The ball moves at a constant velocity, unless it comes into contact with the blocks, wall or paddle. The ball has an instantaneous change in velocity on contact. The paddle can move left and right or stay stationary at each timestep. When a block is hit by the ball, it disappears and extrinsic reward is given. If the ball hits the bottom wall 5 times, the episode ends. However, if all blocks are hit, the loss counter resets, which inflates scores past 100 points. When used to train HyPE, the full image is converted to grayscale and binarized (all values are 0 or 1).

8 Atari 2600 Breakout

While our version of Breakout has all the same intuition as the Atari 2600 version, in this work we do not use the traditional Atari 2600 version of the game, because it violates three of our guiding assumptions: consistent properties, consistent relationships and the quasi-static property. First, in the case of visual structure, we assume that the visual scene is the same as the world state. However, in Atari 2600 objects can become occluded and disappear from the scene. Second, we assume that control is constant in terms of displacement, but this is not true in two ways: paddle control has a momentum value such that the first action or change of action produces partial changes to the paddle shape and position, and the ball changes velocity and appearance depending on the number of bounces that have been made overall in the episode. Third, this non-stationarity also affects the quasi-static property, since the properties of the ball change not as a result of interaction with an object, but because of an internal counter. While we do not expect these problems to be insurmountable, solving them requires significant extensions to the existing system.

  Input: Raw actions , Environment
  Initialize: Dataset , Graph
  Add to
  Collect data from random policy:
     Choose node from randomly
     object discovery Optimize (Equation 2)
     if  then
        Hypothesis Proposal
        if  (Equation 3) then
           Define (Section 2.2)
           Define all (if multiple control hypotheses) and input state (where relevant, that is, only include terms where not abstract)
           Perform RL to maximize all
        end if
     end if
  until Sufficient performance
Algorithm 1 HyPE Loop

9 Object Discovery F1 Score

The F1 score is a common measure of classification performance, defined as the harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. Given an event A and its complement, a classifier that assigns each sample to one of the two, and counts of the true occurrences of each and of the classifications to each, the precision is the fraction of classifications to A that are true occurrences of A, and the recall is the fraction of true occurrences of A that are classified as A. In our case, event is , which is interpreted as a timestep where are salient (in this work, we use proximity for all object-object interactions, such as the Paddle and Ball, and the quasi-static assumption for abstract object-object interactions, such as the Actions and Paddle), and there is a changepoint in . Given counting function , and operating on a full dataset of input

This precision and recall define the desired F1 metric.
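The precision, recall and F1 computation can be illustrated with a short sketch. It uses the standard definitions only. The function name, the toy data, and the choice of which sequence plays the role of the event and which the classifier are assumptions made for illustration, not the counting function used in this work.

```python
def f1_score(truth, predicted):
    """Return (precision, recall, F1) for two equal-length boolean sequences.

    truth[t] marks timesteps where the event truly occurs;
    predicted[t] marks timesteps the classifier assigns to the event.
    """
    if len(truth) != len(predicted):
        raise ValueError("sequences must have the same length")
    true_pos = sum(1 for t, p in zip(truth, predicted) if t and p)
    n_predicted = sum(1 for p in predicted if p)
    n_true = sum(1 for t in truth if t)
    # Guard against empty denominators (no predictions or no true events).
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_true if n_true else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: one flag per timestep.
event = [True, False, True, True, False, False]
detected = [True, True, True, False, False, False]
precision, recall, f1 = f1_score(event, detected)
print(precision, recall, f1)
```

Because F1 is a harmonic mean, it is high only when both precision and recall are high, so a detector that fires on every timestep does not score well.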

10 Object Discovery Smoothing

The optimization yields a candidate object hypothesis which fails on complicated frames due to the limited number of parameters. We correct the model by training a larger CNN to match the target heatmaps generated by inserting a low-variance normal curve centered at the coordinate of each frame. The last filter is penalized with a lasso regularization, resulting in Equation 10. Lastly, the object recognition is composed by .
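A target heatmap of this kind can be sketched as follows. The function name, the isotropic and unnormalized Gaussian, and the value of `sigma` are assumptions made for illustration; the text only states that a low-variance normal curve is centered at the detected coordinate.

```python
import math


def gaussian_heatmap(height, width, center_row, center_col, sigma=1.0):
    """Build a height x width heatmap with a Gaussian bump at the given center.

    The bump is unnormalized, so the peak value is 1.0 (an assumption).
    A small sigma gives the "low-variance" curve described in the text.
    """
    heatmap = []
    for r in range(height):
        row = []
        for c in range(width):
            dist_sq = (r - center_row) ** 2 + (c - center_col) ** 2
            row.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap


# Hypothetical example: a 5x7 frame with the object detected at row 2, column 3.
target = gaussian_heatmap(5, 7, center_row=2, center_col=3, sigma=1.0)
for row in target:
    print(" ".join(f"{v:.2f}" for v in row))
```

A smooth target of this form gives the larger CNN a non-zero training signal near the object, rather than a single-pixel label.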


11 Object Discovery Hyperparameters

The initial population weights are sampled from a normal distribution . The hyperparameters are set as follows: the proximity threshold , regularization coefficient , a population size of , and run for epochs. An optimal filter can typically be found in this number of epochs, and the algorithm is not particularly sensitive to or population size (though a bad choice of will result in a stationary policy). The threshold for a passable F1 score for vision is . Values below this threshold can generally be generated by randomly jittering the frame to match some of the desired properties.

12 Saliency Functions

We use two saliency functions. The first is a proximity indicator: the salience function returns true or false for a time step to define a salient time, based on the evaluation:


when are objects with locations. For abstract objects (such as the raw actions, which are always proximal), or simply as an alternative, we use a quasi-static saliency function that checks if there is a changepoint in the trajectory of . That is, given a set of changepoints , the salience is:


To choose which salience function to use, we can simply apply the hypothesis testing or object detection with every salience function until one succeeds (though abstract objects do not use proximity except as a dummy salience). Finally, we can extend either salience function to include a window around salience, which can account for noise and some delayed reaction (i.e. calling the union of sets salient). For control hypotheses (hypotheses involving ), since we assume that objects in contact are interacting if proximity has already been shown to produce changepoint interactions, we use proximity for salience to give a dense reward whenever the object has the desired motion while proximal. This is useful for abstract objects, where we expect the abstract object to control or be controlled directly by some property of the premise object.

In practice, we use a distance of approximately 6 pixels to define proximity, and do not use a window around changepoints.
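The two saliency functions can be sketched in a few lines. The function names, the use of Euclidean distance, and the inclusive comparison are assumptions made for illustration; only the threshold of roughly 6 pixels and the absence of a window come from the text.

```python
import math

PROXIMITY_PIXELS = 6.0  # "approximately 6 pixels" in the text


def proximity_salient(loc_a, loc_b, threshold=PROXIMITY_PIXELS):
    """True when two (x, y) object locations are within `threshold` pixels.

    Euclidean distance is an assumption; the source does not name the norm.
    """
    dx = loc_a[0] - loc_b[0]
    dy = loc_a[1] - loc_b[1]
    return math.hypot(dx, dy) <= threshold


def changepoint_salient(timestep, changepoints, window=0):
    """True when `timestep` lies within `window` steps of any changepoint.

    window=0 (no window) matches the setting reported in the text.
    """
    return any(abs(timestep - cp) <= window for cp in changepoints)


# Hypothetical example: paddle and ball locations, and ball changepoints.
paddle, ball = (40.0, 70.0), (43.0, 66.0)
print(proximity_salient(paddle, ball))
print(changepoint_salient(12, [5, 12, 30]))
print(changepoint_salient(13, [5, 12, 30], window=1))
```

Setting `window` above zero corresponds to the windowed extension described above, which tolerates noise and delayed reactions at the cost of marking more timesteps as salient.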

13 Network Ablations

Figure 5: Network architecture, where different input operations on the locations of and are mapped to fixed-length vectors, and the mean of the output is fed forward to get the probability distribution over actions.

We tried several other network architectures. In particular, we used a fully connected network, which struggled to interpret the inputs, if included, but otherwise performed comparably (though with different, often simpler, output policies). We also tried a transformer-style network, which computed keys, queries and values for each of the inputs. However, the size of our network was limited by using CMA-ES, and performance on comparably sized networks was strictly worse. Our network is a simplification of this design, in which the mean operation allows the network to balance the inputs without fixating on any one of them. We tried different map sizes (other than 128), to which the network is fairly insensitive. Even a size map can learn useful behavior that achieves decent reward (). The maximum size is around 512, limited by the number of parameters in the network that was viable for CMA-ES optimization.
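The mean-pooling structure can be sketched as below. This is an illustration under several assumptions, not the network used in this work: the three input operations (each location and their difference), the purely linear maps, the random weights, and the action count of 3 are all hypothetical. Only the map size of 128 and the mean over the mapped inputs come from the text.

```python
import math
import random

MAP_SIZE = 128   # width of each fixed-length vector, as in the text
N_ACTIONS = 3    # hypothetical action count


def make_linear(n_in, n_out, rng):
    """Random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(loc_a, loc_b, embed_weights, head_weights):
    """Map each input operation to a vector, average them, then output actions."""
    inputs = [
        list(loc_a),                            # location of first object
        list(loc_b),                            # location of second object
        [a - b for a, b in zip(loc_a, loc_b)],  # relative location (assumed)
    ]
    embedded = [apply_linear(w, x) for w, x in zip(embed_weights, inputs)]
    # Mean over the input operations, elementwise across the embeddings.
    pooled = [sum(column) / len(embedded) for column in zip(*embedded)]
    return softmax(apply_linear(head_weights, pooled))


rng = random.Random(0)
embed_weights = [make_linear(2, MAP_SIZE, rng) for _ in range(3)]
head_weights = make_linear(MAP_SIZE, N_ACTIONS, rng)
probs = policy((40.0, 70.0), (42.0, 30.0), embed_weights, head_weights)
print(probs)
```

In this sketch the parameter count grows linearly with the map size, which is relevant because the parameter budget is constrained by CMA-ES.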

14 Argument for CMA-ES

We tried a variety of policy gradient and deep Q-learning methods to train on some function of the input parameters, without any success, even when using the true object locations to remove possible noise from the object detection system. This suggests that learning policies on object locations in a reasonable number of frames is difficult for generic deep RL algorithms, or at least that choosing a good set of hyperparameters is. We attempted A2C, PPO, Rainbow, DQN, SARSA, SARSA with functional basis, and tabular Q-learning. CMA-ES performs well on this reduced input space. However, because it requires matrix inversion, an $O(n^3)$ operation in the number of parameters $n$, the parameter count of the network was capped to k parameters, which was probably not sufficient to perform well if learning from raw images.

15 Baseline architecture

Our base architecture for the baseline networks (A2C, PPO) consisted of an 8x8, 32-map filter with stride 4, followed by a 5x5, 64-map filter with stride 2, and a 3x3, 64-map filter with stride 1, followed by a linear layer from the output of the last convolutional layer to size 512. This layer is fully connected to the actor and critic components. This architecture has been used in many Atari deep RL environments, and is the default Atari network for the Google Dopamine framework.
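The size of the linear layer follows from the convolutional stack, which can be checked with a short calculation. The 84x84 input resolution and the absence of padding are assumptions made for illustration; the kernel sizes, strides and map counts are those listed above.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


INPUT_SIZE = 84  # assumed input resolution; not stated in the text
# (kernel, stride, maps) for each convolutional layer, from the text.
LAYERS = [(8, 4, 32), (5, 2, 64), (3, 1, 64)]
HIDDEN = 512

size = INPUT_SIZE
for kernel, stride, maps in LAYERS:
    size = conv_out(size, kernel, stride)
    print(f"{kernel}x{kernel} stride {stride}: {size}x{size}x{maps}")

flat_features = size * size * LAYERS[-1][2]
linear_weights = flat_features * HIDDEN
print(flat_features, linear_weights)
```

Under these assumptions most of the weights sit in the linear layer, which illustrates why a network operating on raw images is far larger than one operating on object locations.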


16 Learning Parameters for CHAMP and DP-GMM

The system is fairly insensitive to the parameters for CHAMP and DP-GMM, though a bad choice can produce bad behavior. CHAMP requires 5 parameters, and the displacement model requires an additional parameter. The CHAMP parameters are: a guess at the mean length of a segment, the variance, the minimum segment length, the maximum number of particles, and the resampled particles per time step. For the mean and variance, we chose reasonable values (10 and 10), but the method is fairly insensitive to these. For minimum segment length, we used 1; using a different value is damaging. The maximum number of particles was 100, with 100 resampled per time step. Any reasonably large choice will perform fine. For model variance (the penalty for modeling errors), we used , which is agnostic to around , where segments become agnostic to the input, and , which exhibits over-segmenting (segments whenever the model does not perfectly predict, regardless of noise).

For the DP-GMMs, we use 10 means (the model chooses the number that it needs), zero initial mean, and an initial covariance of 1e-10. This initial covariance is low because otherwise all the means collapse to a single point. We use the scikit-learn implementation, https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html#sklearn.mixture.BayesianGaussianMixture, with otherwise default parameters.

17 Learning Parameters for CMA-ES in Paddle Bouncing

For CMA-ES, we use a sample length of 100 frames, which increases after 12 epochs by 100 frames. We train for 30 epochs total. This value does not really change learning, but it optimizes sample efficiency. We use a population size of 10 and an initial variance of 1 (the initial mean is the initialization of the network, which uses small uniform random values). We use a gamma of and evaluate fitness based on return, though the choice of gamma does not strongly affect learning.
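The sample schedule and the return-based fitness can be sketched as follows. Two readings are assumed here for illustration: that the sample length grows by 100 frames every 12 epochs, and that each population member is evaluated on one sample per epoch. The discount value in the example is hypothetical, since the text does not preserve the value of gamma.

```python
POPULATION_SIZE = 10   # from the text
TOTAL_EPOCHS = 30      # from the text
BASE_LENGTH = 100      # frames per sample at the start, from the text
LENGTH_STEP = 100      # frames added at each increase, from the text
EPOCHS_PER_STEP = 12   # assumed: the increase repeats every 12 epochs


def sample_length(epoch):
    """Frames per evaluated sample at a given (0-indexed) epoch."""
    return BASE_LENGTH + LENGTH_STEP * (epoch // EPOCHS_PER_STEP)


def total_frames():
    """Frames consumed over training, one sample per member per epoch (assumed)."""
    return sum(POPULATION_SIZE * sample_length(e) for e in range(TOTAL_EPOCHS))


def discounted_return(rewards, gamma):
    """Discounted sum of rewards, used here as the fitness of one sample."""
    ret = 0.0
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


print([sample_length(e) for e in (0, 11, 12, 24)])
print(total_frames())
print(discounted_return([0.0, 1.0, 0.0, 1.0], gamma=0.9))  # gamma is hypothetical
```

Starting with short samples keeps early, mostly uninformative evaluations cheap, which is consistent with the stated goal of improving sample efficiency.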

18 Details for the HyPE loop

The HyPE loop randomly chooses to run vision learning on every node in the existing graph. It then tests whether the vision F1 score is above the threshold. Next, it performs the hypothesis test, testing whether the numbers are above the threshold. The "significantly greater" criterion is only relevant when there are more than 10 occurrences in the total categories; for example, the ball must have bounced off the paddle at least 10 times. The threshold is then such that the ratio must be greater than . The check for a hypothesis about is always run afterwards, which involves taking the models after a salient changepoint and clustering on the model displacements. Any cluster with more than 20 occurrences is kept. For example, with the ball bouncing off the paddle, there are four clusters after the changepoint, corresponding to the four angles at which the ball can bounce. Learning of multiple options involves switching between the different options when relevant (either after a changepoint with proximity-based saliency, or after a fixed number of timesteps with action-based saliency).
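The sequence of checks can be sketched as a pair of small functions. The function names and the threshold values passed in the example are hypothetical, since the text does not preserve the F1 and ratio thresholds; the occurrence count of 10 and the cluster size of 20 come from the text. The text gives both "more than 10" and "at least 10" for the occurrence count, and the sketch assumes the latter.

```python
from collections import Counter

MIN_OCCURRENCES = 10    # from the text; "at least 10" is the reading assumed here
MIN_CLUSTER_SIZE = 20   # clusters need more than 20 occurrences, from the text


def hypothesis_accepted(vision_f1, f1_threshold, n_occurrences, ratio, ratio_threshold):
    """Apply the vision check, then the occurrence check, then the ratio test."""
    if vision_f1 < f1_threshold:
        return False  # object discovery was not reliable enough
    if n_occurrences < MIN_OCCURRENCES:
        return False  # too few salient events for the test to be meaningful
    return ratio > ratio_threshold


def frequent_clusters(cluster_labels):
    """Keep the labels of clusters with more than MIN_CLUSTER_SIZE members."""
    counts = Counter(cluster_labels)
    return sorted(label for label, n in counts.items() if n > MIN_CLUSTER_SIZE)


# Hypothetical numbers: thresholds and cluster assignments are made up.
print(hypothesis_accepted(0.8, 0.5, n_occurrences=14, ratio=3.0, ratio_threshold=2.0))
labels = [0] * 25 + [1] * 21 + [2] * 20 + [3] * 5
print(frequent_clusters(labels))
```

Each surviving cluster then corresponds to one candidate option, such as one of the ball angles in the paddle example.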

19 Details for Code

The commands to run the two iterations of the HyPE loop, as well as the requirements, are detailed in the README.md file in the code repository. Code can be found at https://github.com/CalCharles/contingency-options

20 Extension of base assumptions

While our base assumptions might seem restrictive, we believe that these assumptions still generalize to many problems, and especially to real-world domains. In general, the world can be reduced into objects whose properties do not spontaneously change: this assumption is the basis for tool use. The quasi-static property is generally true, since it is a formalization of the term “inanimate object”, which is a common class of objects to manipulate. Proximity in time and space is another widely used assumption, which, while not guaranteed to be true, often holds. Finally, while formally proving a relationship is difficult, learning a policy to produce a particular change loosely mirrors experimentation in the scientific method, without the same precision of design.

21 Extension of the HyPE algorithm on Breakout

Notice that the HyPE agent, while maximizing a reward derived from causing a ball changepoint near the paddle, incidentally maximizes the true reward. However, this does not prevent us from continuing to apply the HyPE loop, and eventually even learning that block objects have temporal proximity to the true reward. By observing the interactions between the ball and the paddle, the HyPE agent can propose a hypothesis to define the different angles at which the ball can come off the paddle. In fact, running the HyPE loop on this data results in learning 4 different ball angles (described in supplementary material). Additionally, HyPE can learn a relationship between the ball and the blocks, and between the blocks and true reward. Using the ball angles as options, then, one can optimize the removal of blocks directly. However, limitations with the vision algorithm and with learning options to hit the ball at different angles prevent us from demonstrating this functionality. We also posit that the relationships can be learned bi-directionally, backward from the node corresponding to true reward.

22 Details on Hypotheses and Paddle policies

The action-paddle hypotheses use the control hypotheses as defined in the paper, with a proximity saliency function, and control behaviors learned from DP-GMM on the displacement models in the segments after paddle changepoints. The values turn out to be: , where the right and left commands in true space are . For the ball, we also perform de-noising by DP-GMM, by taking the mean location when a changepoint occurs. This turns out to be , which is approximately above and to the left of the paddle. This is because the filters only look at changepoints, and are not necessarily centered. The difference approximately mirrors the ball offset relative to the paddle offset (the ball in learning is more to the left).