We consider an agent in a Markovian environment that is partially specified by a set of human-defined attributes. The attributes provide a useful language for communicating tasks to the agent.
In (zhang2018composable, ) it was shown that using (state, attribute) pairs as supervision, it is possible for an agent to explore the environment and accomplish tasks at test time that are described by those attributes. However, in that work, the attributes were binary functions, and the worst-case complexity of exploring the attributes scaled exponentially in the number of attributes.
Many environments of interest have geometric and/or arithmetic structure. In this work we show that equipping these attributes with the appropriate geometric and arithmetic structure brings substantial gains in sample complexity. We demonstrate our model on 2D grid-world environments.
2 Problem Setup
We start with a Markovian Environment (ME), given by a state space, an action space and a transition kernel specified by the probability to transition from one state to another by taking a given action. Model-based approaches attempt to estimate the transition kernel in order to perform planning. In this context, it is crucial to exploit regularity priors: in many practical scenarios the ME is highly structured, in the sense that the transition kernel varies smoothly with respect to specific transformations in the state/action spaces. For example, applying a force to an object at one location will likely produce the same effect as applying the same force to the same object at a different location.
For that purpose, the ME is augmented with a structured attribute space and a deterministic mapping from states to attributes, encoding the attributes associated with each state. This mapping may either be given by the user or be regressed from a dataset of labeled (state, attribute) pairs, resulting in an estimate of the mapping. Unless otherwise specified, in the following we refer to the ground-truth state-attribute mapping. In order to leverage the regularity of the environment, we equip the attribute space with predefined algebraic and geometric structure. In this work, we consider attribute spaces built as direct products of elementary groups and monoids (a monoid is a semigroup with an identity element; a semigroup is a set with an associative binary operation), such as real numbers, integers, counts and modular arithmetic. Our model-based approach thus amounts to estimating the transition kernel induced in the attribute space. At test time, the agent is given attribute goals, and its objective is to take an appropriate sequence of actions in the original environment to reach them.
3 Related work
This work builds upon the unstructured attribute planning model from (zhang2018composable, ). In that work, a Markovian state space was augmented with a set of binary attributes, which were used as a means of organizing exploration and communicating target states. The agent was built from three components: (i) a neural-net based attribute detector, which maps states to a set of attributes; (ii) a neural-net based policy, which takes a pair of inputs, the current state and the attributes of an (intermediate) goal state, and outputs a distribution over actions; and (iii) a transition table that records the empirical probability that the policy can succeed at transitioning from one set of attributes to another in a small number of steps.
In this work, in addition to allowing binary attributes, we consider attributes with more algebraic structure. In addition, we augment the transition table with a parametric edge detector that takes into account the structure of the attributes. We refer the reader to the references in (zhang2018composable, ) for a more complete review of the literature this work is built upon, and here briefly highlight a few especially relevant works. Because we add further structure to the attributes, this work moves the unstructured attribute planner closer to (Hernandez-GardiolK03, ; Otterlo_Relational_RL, ; Diuk_object, ; AbelHBBOMT15, ), which discuss MDPs that can be written in terms of objects and relations between those objects. However, the current work still focuses on the interface between the symbolic description of the underlying Markovian space and the actual space, and the symbolic description in terms of attributes with algebraic structure is an approximation.
4 The Structured Attribute Model
In this section we describe our Structured Attribute Model, depicted in Figure 1. It contains several modules that interact with each other. We describe each of these modules in detail and their interactions.
We consider attribute spaces built as direct products of group building blocks. In this work, we consider natural arithmetic attributes, modular arithmetic attributes, and real-valued attributes. We note however that our methodology can be easily extended to more exotic algebraic and geometric structures, such as modular real-valued attributes, rigid motions or dihedral groups.
For the purposes of this work we need a notion of difference vector in the attribute space. We consider the set of admissible transitions in attribute space, each expressed as a difference between two attribute values, where the group operation is taken coordinate-wise.
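The coordinate-wise difference can be illustrated with a minimal sketch. It assumes attribute values are stored as tuples and that each coordinate is either an ordinary integer/real coordinate or a modular one. The names `Coord`, `difference` and `apply_delta` are hypothetical and do not come from the original model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Coord:
    """One attribute coordinate (hypothetical helper).

    modulus is None for integer or real coordinates, and n for arithmetic mod n.
    """
    modulus: Optional[int] = None


def difference(spec: Sequence[Coord], z_from: Tuple, z_to: Tuple) -> Tuple:
    """Coordinate-wise difference z_to - z_from, reduced mod n where needed."""
    out = []
    for coord, a, b in zip(spec, z_from, z_to):
        d = b - a
        if coord.modulus is not None:
            d %= coord.modulus
        out.append(d)
    return tuple(out)


def apply_delta(spec: Sequence[Coord], z: Tuple, delta: Tuple) -> Tuple:
    """Apply a difference vector to an attribute value, coordinate-wise."""
    out = []
    for coord, a, d in zip(spec, z, delta):
        v = a + d
        if coord.modulus is not None:
            v %= coord.modulus
        out.append(v)
    return tuple(out)


# Example: a count, a switch with three positions, and a real-valued attribute.
spec = (Coord(), Coord(modulus=3), Coord())
delta = difference(spec, (2, 2, 0.5), (3, 0, 1.5))
```

With this representation, the same difference vector can be applied at any attribute value, which is what lets a transition observed in one region be proposed in another.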
4.1 Edge Detector
The core component of our model is a module that, given a pair describing a candidate transition in attribute space, evaluates whether that transition is feasible in the environment using at most a fixed number of actions.
This network receives the pair as input, using the corresponding group attribute parametrisations, and outputs the estimated probability that the transition is feasible. The detector is trained in a supervised fashion by receiving both positive and negative samples. The positive samples are fed by the exploration policy, whereas the negative samples are produced by the execution policy.
Our model contains two policies, detailed next: the exploration policy and the execution policy. In each case, we record empirical counts of the attributes they have visited. Since the size of the attribute space grows exponentially with the number of attributes, we consider only the marginal counts: for each attribute dimension, we keep empirical marginal counts over the values of that attribute. If some attributes are continuous, we quantize them using a predefined number of bins in order to produce the empirical counts. Finally, in order to keep track of the admissible transitions in the full attribute space, we maintain a memory buffer that contains every observed transition, this time without marginalization.
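The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `MarginalCounts`, the use of explicit bin edges for continuous coordinates, and the choice to count the attribute value reached by each transition are all hypothetical.

```python
import bisect
from collections import Counter


class MarginalCounts:
    """Per-coordinate visit counts plus a buffer of full transitions (hypothetical)."""

    def __init__(self, bin_edges):
        # bin_edges[i] is None for a discrete coordinate, or a sorted list of
        # bin edges used to quantize a continuous coordinate.
        self.bin_edges = bin_edges
        self.counts = [Counter() for _ in bin_edges]
        self.buffer = set()  # observed transitions, without marginalization

    def _key(self, i, value):
        edges = self.bin_edges[i]
        if edges is None:
            return value
        return bisect.bisect_right(edges, value)  # index of the bin

    def record(self, z_from, z_to):
        """Count the reached attribute value per coordinate and store the transition."""
        for i, value in enumerate(z_to):
            self.counts[i][self._key(i, value)] += 1
        self.buffer.add((tuple(z_from), tuple(z_to)))

    def count(self, i, value):
        """Number of recorded visits to `value` along coordinate i."""
        return self.counts[i][self._key(i, value)]


mc = MarginalCounts([None, [1.0, 2.0]])  # one discrete, one continuous coordinate
mc.record((0, 0.5), (1, 1.5))
mc.record((1, 1.5), (1, 1.7))
```

Memory for the counts grows with the sum of the per-coordinate value ranges rather than their product, while the buffer only grows with the transitions actually observed.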
4.3 Exploration Policy
The estimation of the transition kernel starts with an exploration policy that scans for transitions in attribute space. This policy is parametrised by a neural network that takes as input the current state-attribute pair and outputs a distribution over actions. Its rewards are determined from the Edge Detector and from the empirical marginal counts as follows. Given the current estimated probability that a transition is feasible, the policy is rewarded when it finds an actual transition with low estimated probability within the allowed number of steps. Additionally, we reward the exploration policy for uncovering unseen attribute values, through a bonus that is inversely proportional to the number of times the exploration has previously seen this transition, measured according to each coordinate. The weighting parameters of the reward terms are adjustable to each environment and are reported in the experimental section. The episodes start where the last episode ends, and the game restarts when the agent is unable to perform more transitions. A game is always played by only one policy, either the exploration or the execution one, and the training phase consists of alternating games of both.
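One possible shape for such a reward is sketched below. The exact formula is not reproduced in this text, so the functional form is an assumption: a surprise term `1 - p` for transitions the Edge Detector considered unlikely, plus a per-coordinate bonus `1 / (1 + count)`. The function name and the weights `alpha` and `beta` are hypothetical.

```python
def exploration_reward(p_edge, marginal_visits, alpha=1.0, beta=1.0):
    """Illustrative exploration reward for one discovered transition.

    p_edge: Edge Detector probability assigned to the transition before it
        was observed (a low value means a surprising discovery).
    marginal_visits: per-coordinate counts of how often the reached
        attribute values were seen before by the exploration policy.
    alpha, beta: assumed weights of the two terms, tuned per environment.
    """
    surprise = 1.0 - p_edge
    novelty = sum(1.0 / (1.0 + count) for count in marginal_visits)
    return alpha * surprise + beta * novelty


# A transition predicted with probability 0.25 that reaches one unseen
# coordinate value and one value seen once before.
r = exploration_reward(0.25, [0, 1], alpha=2.0, beta=1.0)
```

Both terms decay as training progresses: the first as the Edge Detector improves, the second as counts accumulate.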
4.4 Execution Policy
The execution policy takes as input the current state-attribute pair as well as a target transition (which may or may not be admissible) provided by the Transition Proposal module, and outputs a distribution over actions. This policy is trained with reinforcement learning, via a positive reward whenever it reaches a state whose attributes realize the target transition. If after a certain number of steps it has failed to reach the desired attribute transition, it sends the sample back to the Edge Detector with a negative label.
4.5 Transition Proposals
Finally, we describe the module that proposes which transitions the execution policy should be trained on. We want to accomplish two objectives: (i) enforce that the coverage of the execution policy matches that of the exploration policy, and (ii) provide the Edge Detector with negative samples, i.e. transitions that are not admissible in the environment. We propose to sample the target transitions from the current buffer of recorded transitions as follows. First, we filter out the transitions that are considered unlikely to exist according to the current edge detector, using a threshold that is kept fixed in all experiments. Then we consider a mixture that, with a fixed probability, samples uniformly at random among the remaining transitions and otherwise samples according to a distribution based on marginal counts. In words, we look at the differences in marginal counts between the exploration and execution policies across the recorded transitions in the buffer, and sample more often those where exploration outpaces execution.
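The two-stage proposal (filter, then mixture sampling) can be sketched as below. The precise sampling distribution and the threshold value are not reproduced in this text, so the weights used here (the positive part of a count gap) and the default values of `threshold` and `eps` are assumptions, as are the names `propose_transition`, `edge_prob` and `gap`.

```python
import random


def propose_transition(buffer, edge_prob, gap, threshold=0.5, eps=0.1, rng=random):
    """Pick a target transition for the execution policy (illustrative).

    buffer: iterable of recorded transitions.
    edge_prob: callable giving the Edge Detector probability of a transition.
    gap: callable giving how far exploration counts exceed execution counts
        for a transition (assumed to be derived from marginal counts).
    threshold, eps: assumed values, not taken from the original text.
    """
    # Stage 1: drop transitions the Edge Detector currently deems unlikely.
    kept = [t for t in buffer if edge_prob(t) >= threshold]
    if not kept:
        return None
    # Stage 2: mixture of uniform sampling and gap-weighted sampling.
    weights = [max(gap(t), 0.0) for t in kept]
    if rng.random() < eps or sum(weights) == 0:
        return rng.choice(kept)
    return rng.choices(kept, weights=weights, k=1)[0]


probs = {"a": 0.9, "b": 0.1, "c": 0.8}
gaps = {"a": 5.0, "b": 9.0, "c": 0.0}
rng = random.Random(0)
picks = [propose_transition(["a", "b", "c"], probs.get, gaps.get, eps=0.0, rng=rng)
         for _ in range(20)]
```

In this toy example transition "b" has the largest gap but is filtered out by the Edge Detector, and "c" has no gap, so with `eps=0.0` every pick is "a". The uniform component keeps every surviving transition reachable even when its gap is zero.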
Our inference strategy at test time is analogous to the unstructured attribute work (zhang2018composable, ), except that our estimated probabilities to realize each transition are given by the Edge Detector. Specifically, we look for the path from the start to our goal that minimizes the distance defined by the Edge Detector. To do this we use Dijkstra’s algorithm on the graph that starts at the point where the agent is and extends to other points in the attribute space by applying the transitions in the buffer, with each edge cost derived from the Edge Detector's estimate.
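A minimal planning sketch follows. The edge cost used in the paper is not reproduced in this text; the code assumes the common choice of the negative log of the estimated success probability, so that the shortest path maximizes the product of probabilities. The function `plan` and its arguments are hypothetical names.

```python
import heapq
import math


def plan(start, goal, deltas, edge_prob, apply_delta, max_nodes=10000):
    """Dijkstra over attribute space, built lazily from recorded differences.

    deltas: difference vectors taken from the transition buffer.
    edge_prob(z, delta): Edge Detector estimate that delta is feasible at z.
    apply_delta(z, delta): attribute value reached from z.
    Edge cost is -log(probability), an assumed choice.
    """
    dist = {start: 0.0}
    parent = {}
    heap = [(0.0, 0, start)]
    tie = 1  # tie-breaker so the heap never compares attribute values
    expanded = 0
    while heap and expanded < max_nodes:
        d, _, z = heapq.heappop(heap)
        if d > dist.get(z, math.inf):
            continue  # stale heap entry
        if z == goal:
            path = [z]
            while z in parent:
                z = parent[z]
                path.append(z)
            return path[::-1]
        expanded += 1
        for delta in deltas:
            p = edge_prob(z, delta)
            if p <= 0.0:
                continue
            nxt = apply_delta(z, delta)
            nd = d - math.log(p)
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                parent[nxt] = z
                heapq.heappush(heap, (nd, tie, nxt))
                tie += 1
    return None


def add(z, delta):
    return tuple(a + b for a, b in zip(z, delta))


def toy_prob(z, delta):
    # Reliable unit steps, an unreliable double step, nothing beyond 2.
    if z[0] + delta[0] > 2:
        return 0.0
    return 0.9 if delta == (1,) else 0.01


path = plan((0,), (2,), [(1,), (2,)], toy_prob, add)
```

In the toy example the planner prefers two reliable unit steps over one unreliable double step, which is the behaviour a probability-based cost is meant to produce.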
We report preliminary experiments on Grid-World games using the Mazebase environments (sukhbaatar2015mazebase, ). They consist of 2D maps that vary dynamically at each episode, with a size that varies within a fixed range in each dimension, and a single agent that interacts with the environment. In all scenarios, we train our structured model for a prespecified number of episodes, without access to the test-time tasks. After training, given a target task, we perform planning using Dijkstra's algorithm as explained in Section 4.6. For simplicity, we assume the state-to-attribute mapping is known in our reported experiments. We consider two baselines: (i) a reinforcement learning agent parametrised with the same neural network architecture as our execution policy, taking the state and goal attribute as inputs, trained using a curriculum that starts from nearby tasks and extends them to the evaluation goals, and (ii) the Unstructured Attribute Planner from (zhang2018composable, ). For the latter, we treat attributes as a set. When the environment contains continuous attributes, we round them to the nearest integer and use the resulting discrete space.
| Model | Modular Switches | Exchangeable Attr. | Constrained Attr. |
|---|---|---|---|
| Unstructured Attribute Model | 9.6% | 88% | 20.8% |
| Structured Attribute Model | 89.3% | 93% | 81.6% |
5.1 Modular Switches
The first environment consists of 2D mazes containing a variety of different objects. Depending on the state of a switch, the agent is allowed to pick objects of a specific kind, as illustrated in Figure 2. For each object type, we consider two attributes: how many objects are still available in the map, and how many objects the agent has already collected. The attribute space is thus modeled as a product of such components, one set per object type, with the number of object types fixed in our experiment. This environment is highly structured, and its admissible transitions take a single restricted form. The evaluation tasks consist of requesting a specific number of items of each category. The RL version is trained with a curriculum that grows the distance (induced by the transitions in the attribute space) between the start and the end of the task, from 1 to 15. In this environment, Table 1 shows that neither the curriculum RL agent nor the Unstructured Attribute Planner is able to successfully complete the target goals. The number of transitions is large relative to the number of steps. The RL agent is trained for 50M steps, while our model and the Unstructured Attribute Planner are each trained for 25M steps of exploration and 25M steps of policy training. Figure 3 displays the positive (resp. negative) transitions discovered by the exploration (resp. execution) policy as training progresses. Our model is able to quickly generalize the transition kernel of the environment to unseen regions by leveraging its rich arithmetic structure.
5.2 Exchangeable Attributes
This environment contains objects of several types, and for each type we consider two attributes: how many objects are still in the map, and how many objects the agent has already collected. At any time, the agent can trade objects using pre-specified exchange rates, as shown in Figure 2. The attributes of this environment can be modeled with a similar product space, but this time the admissible transitions create interactions between attributes: the exchange rates determine transitions that trade items of one type for items of another. Inspired by real markets, we impose a condition on the exchange rates for all pairs of types. The evaluation tasks consist of obtaining a predefined number of items of each type. We consider the same number of object types as before. The RL agent with curriculum is trained for 40M steps, and our model is trained for 20M steps of exploration and 20M steps of policy training. As before, the RL curriculum is implemented by growing the distance (induced by the transitions in the attribute space) between the start and the end of the task, from 1 to 30. In this case, Table 1 shows that, while our structured attribute model still outperforms the two baselines, the difference is less dramatic than in the other environments. We attribute this to the fact that the transition kernel is more homogeneous and faster mixing than before. Therefore, although exploration with unstructured attributes misses many transitions, the planning phase manages to cover the attribute space more efficiently than in the modular switch environment.
5.3 Constrained Attributes
Finally, we consider an environment with a continuous attribute component. Here, an agent is deployed in a terrain, collecting minerals with a single-use hammer. Each time the hammer is used, the agent receives a random amount of mineral drawn from a fixed distribution. The agent can go to the ‘hardware store’ and obtain a new hammer in order to keep mining. It can also go to the dump yard and throw away a fixed amount of mineral. The attribute space therefore includes a real-valued component, and the admissible transitions correspond to these actions. These transitions are only admissible as long as the attributes remain inside a constraint set, which is considered fixed. Figure 4 illustrates the setup with the constraint set. Since the environment is now stochastic, we cannot define the same distance to a target goal in order to build the RL curriculum. Consequently, the first stage of the curriculum consists of goals with high probability of being just one transition away from the start, and the next stages are defined by the Euclidean distance in the attribute space between the start and the end of the task. As with the other environments, at test time the goals are to reach a certain point in attribute space. Since being close to the boundary reduces the probability of executing a transition, planning in this environment essentially consists of traversing the attribute map while trying to stay away from the boundary of the constraint set. Table 1 shows that our structured attribute model is able to leverage the geometric structure of the environment to significantly outperform both the RL and unstructured attribute baselines.
This work is a first step towards mitigating the scalability issues of the unstructured attribute model from (zhang2018composable, ). However, there are still important limitations that we are planning to address in current/future work. First, we are currently extending the model to operate on larger, multiagent environments such as StarCraft, as well as 3d environments. Next, currently each environment is given the correct arithmetic/geometric blocks, which is an unrealistic assumption in many real-life scenarios. The objective is to provide the agent with a large ‘dictionary’ of such blocks (e.g. numerals, modulars, discrete symmetry groups, etc.), and learn how to select the appropriate structure for each attribute.
- (1) David Abel, D. Ellis Hershkowitz, Gabriel Barth-Maron, Stephen Brawner, Kevin O’Farrell, James MacGlashan, and Stefanie Tellex. Goal-based action priors. In ICAPS, pages 306–314. AAAI Press, 2015.
- (2) Carlos Diuk, Andre Cohen, and Michael L. Littman. An object-oriented representation for efficient reinforcement learning. In ICML, volume 307 of ACM International Conference Proceeding Series, pages 240–247. ACM, 2008.
- (3) Natalia Hernandez-Gardiol and Leslie Pack Kaelbling. Envelope-based planning in relational MDPs. In NIPS, pages 783–790. MIT Press, 2003.
- (4) Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, and Rob Fergus. Mazebase: A sandbox for learning from games. arXiv preprint arXiv:1511.07401, 2015.
- (5) M. van Otterlo. A Survey of Reinforcement Learning in Relational Domains. Number 31 in TR-CTIT-05, ISSN 1381-3625. Centre for Telematics and Information Technology University of Twente, 2005.
- (6) Amy Zhang, Adam Lerer, Sainbayar Sukhbaatar, Rob Fergus, and Arthur Szlam. Composable planning with attributes. arXiv preprint arXiv:1803.00512, 2018.
Appendix A Further Experimental Setup
A.1 State-Attribute Regressor and Parametrization
In the case where the state-attribute mapping is not given by the user, one can train an estimator from labeled (state, attribute) pairs, using a neural network trained with a mean-squared loss that reflects the geometry of each target attribute coordinate. Given the target attributes and the output of the neural net regressor, we consider the following metric on each coordinate:
If , then , and .
If , then , and .
If , then , , and .
The loss is then aggregated over all attribute coordinates.
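The per-coordinate metrics are not reproduced in this text, so the sketch below only illustrates one common way to make a squared loss respect modular geometry, and is not necessarily the parametrisation used here. It assumes that a modular coordinate is regressed as a point in the plane and compared against the target's position on the unit circle, while integer and real coordinates use a plain squared difference. The names `circle_embedding`, `coordinate_loss` and `total_loss` are hypothetical.

```python
import math


def circle_embedding(k, n):
    """Place the residue k mod n on the unit circle (assumed parametrisation)."""
    angle = 2.0 * math.pi * (k % n) / n
    return (math.cos(angle), math.sin(angle))


def coordinate_loss(pred, target, modulus=None):
    """Squared error for one attribute coordinate.

    For integer or real coordinates, pred is a scalar. For a modular
    coordinate, pred is a 2D point compared with the embedded target.
    """
    if modulus is None:
        return (pred - target) ** 2
    px, py = pred
    tx, ty = circle_embedding(target, modulus)
    return (px - tx) ** 2 + (py - ty) ** 2


def total_loss(preds, targets, moduli):
    """Sum of the per-coordinate losses."""
    return sum(coordinate_loss(p, t, m) for p, t, m in zip(preds, targets, moduli))


# With modulus 10, residue 9 is closer to 0 than residue 5 is.
near = coordinate_loss(circle_embedding(0, 10), 9, modulus=10)
far = coordinate_loss(circle_embedding(0, 10), 5, modulus=10)
```

The point of such an embedding is that residues which are adjacent modulo n incur a small loss, which a plain squared difference on the integer labels would not give.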
A.2 Modular Switches
For our model and all baselines we have used a network with two fully connected layers and 128 hidden units per layer. The batch size was 5000 steps for the policies. The weighting parameters of the exploration reward were set specifically for this environment.
A.3 Exchangeable Attributes
We have used a network with two fully connected layers and 128 hidden units per layer. The batch size was 5000 steps for the policies. The weighting parameters of the exploration reward were set specifically for this environment. In this experiment we have made the rewards of the exploration continuous in time, in the sense that on each step we give reward not just for the transition that finishes the episode, but also for all the transitions that come after that episode until the end of the game. This way we encourage the explorer to look for trajectories that lead to late unseen attributes. This did not work in the other experiments, because it stimulates the policy to perform as many transitions as possible, and the policy got stuck in places where one could execute a number of transitions in a row (mostly the switch, but also the hammer store, the dump, etc.).
A.4 Constrained Attributes
We have used a network with two fully connected layers and 128 hidden units per layer. The batch size was 5000 steps for the policies. The weighting parameters of the exploration reward were set specifically for this environment.
In this game the transitions are only admissible as long as the attributes remain within a fixed constraint set. The support used in our experiments is illustrated in Figure 4 as the blue zone. In the figure, the black strips show the area where the agent starts the game.