Learning sparse relational transition models

by Victoria Xia, et al.

We present a representation for describing transition models in complex uncertain domains using relational rules. For any action, a rule selects a set of relevant objects and computes a distribution over properties of just those objects in the resulting state given their properties in the previous state. An iterative greedy algorithm is used to construct a set of deictic references that determine which objects are relevant in any given state. Feed-forward neural networks are used to learn the transition distribution on the relevant objects' properties. This strategy is demonstrated to be both more versatile and more sample efficient than learning a monolithic transition model in a simulated domain in which a robot pushes stacks of objects on a cluttered table.




1 Introduction

Many complex domains are appropriately described in terms of sets of objects, properties of those objects, and relations among them. We are interested in the problem of taking actions to change the state of such complex systems, in order to achieve some objective. To do this, we require a transition model, which describes the system state that results from taking a particular action, given the previous system state. In many important domains, ranging from interacting with physical objects to managing the operations of an airline, actions have localized effects: they may change the state of the object(s) being directly operated on, as well as some objects that are related to those objects in important ways, but will generally not affect the vast majority of other objects.

In this paper, we present a strategy for learning state-transition models that embodies these assumptions. We structure our model in terms of rules, each of which only depends on and affects the properties and relations among a small number of objects in the domain, and only very few of which may apply for characterizing the effects of any given action. Our primary focus is on learning the kernel of a rule: that is, the set of objects that it depends on and affects. At a moderate level of abstraction, most actions taken by an intentional system are inherently directly parametrized by at least one object that is being operated on: a robot pushes a block, an airport management system reschedules a flight, an automated assistant commits to a venue for a meeting. It is clear that properties of these “direct” objects are likely to be relevant to predicting the action’s effects and that some properties of these objects will be changed. But how can we characterize which other objects, out of all the objects in a household or airline network, are relevant for prediction or likely to be affected?

To do so, we make use of the notion of a deictic reference. In linguistics, a deictic (literally meaning “pointing”) reference is a way of naming an object in terms of its relationship to the current situation rather than in global terms. So, “the object I am pushing,” “all the objects on the table nearest me,” and “the object on top of the object I am pushing” are all deictic references. This style of reference was introduced as a representation strategy for AI systems by Agre & Chapman (1987), under the name indexical-functional representations, for the purpose of compactly describing policies for a video-game agent, and has been in occasional use since then.

We will learn a set of deictic references, for each rule, that characterize, relative to the object(s) being operated on, which other objects are relevant. Given this set of relevant objects, the problem of describing the transition model on a large, variable-size domain reduces to describing a transition model on fixed-length vectors characterizing the relevant objects and their properties and relations, which we represent and learn using standard feed-forward neural networks.

In the following sections, we briefly survey related work, describe the problem more formally, and then provide an algorithm for learning both the structure, in terms of deictic references, and parameters, in terms of neural networks, of a sparse relational transition model. We go on to demonstrate this algorithm in a simulated robot-manipulation domain in which the robot pushes objects on a cluttered table.

2 Related work

Rule learning has a long history in artificial intelligence. The novelty in our approach is the combination of learning discrete structures with flexible parametrized models in the form of neural networks.

Rule learning We are inspired by very early work on rule learning by Drescher (1991), which sought to find predictive rules in simple noisy domains, using Boolean combinations of binary input features to predict the effects of actions. This approach has a modern re-interpretation in the form of schema networks (Kansky et al., 2017). The rules we learn are lifted, in the sense that they can be applied to objects, generally, and are not tied to specific bits or objects in the input representation, and probabilistic, in the sense that they make a distributional prediction about the outcome. In these senses, this work is similar to that of Pasula et al. (2007) and methods that build on it (Mourão et al., 2012; Mourão, 2014; Lang & Toussaint, 2010). In addition, the approach of learning to use deictic expressions was inspired by Pasula et al. and used also by Benson (1997). Our representation and learning algorithm improves on the Pasula et al. strategy by using the power of feed-forward neural networks as a local transition model, which allows us to address domains with real-valued properties and much more complex dependencies. In addition, our EM-based learning algorithm presents a much smoother space in which to optimize, making the overall learning faster and more robust. We do not, however, construct new functional terms during learning; that would be an avenue for future work for us.

Graph network models There has recently been a great deal of work on learning graph-structured (neural) network models (Battaglia et al. (2018) provide a good survey). There is a way in which our rule-based structure could be interpreted as a kind of graph network, although it is fairly non-standard. We can understand each object as being a node in the network, and the deictic functions as being labeled directed hyper-edges (between sets of objects). Unlike the graph network models we are aware of, we do not condition on a fixed set of neighboring nodes and edges to compute the next value of a node; in fact, a focus of our learning method is to determine which neighbors (and neighbors of neighbors, etc.) to condition on, depending on the current state of the edge labels. This means that the relevant neighborhood structure of any node changes dynamically over time, as the state of the system changes. This style of graph network is not inherently better or worse than others: it makes a different set of assumptions (including a strong default that most objects do not change state on any given step and the dynamic nature of the neighborhoods) which are particularly appropriate for modeling an agent’s interactions with a complex environment using actions that have relatively local effects.

3 Problem formulation

We assume we are working on a class of problems in which the domain is appropriately described in terms of objects. This method might not be appropriate for a single high-dimensional system in which the transition model is not sparse or factorable, or can be factored along different lines (such as a spatial decomposition) rather than along the lines of objects and properties. We also assume a set of primitive actions defined in terms of control programs that can be executed to make actual changes in the world state and then return. These might be robot motor skills (grasping or pushing an object) or virtual actions (placing an order or announcing a meeting). In this section, we formalize this class of problems, define a new rule structure for specifying probabilistic transition models for these problems, and articulate an objective function for estimating these models from data.

3.1 Relational domain

A problem domain is given by tuple where is a countably infinite universe of possible objects, is a finite set of properties , and is a finite set of deictic reference functions where denotes the powerset of . Each function maps from an ordered list of objects to a set of objects, and we define it as

where the relation is defined in terms of the object properties in . For example, if we have a location property and , we can define so that the function associated with maps from one object to the set of objects that are within distance of its center; here is an indicator function. Finally, is a set of action templates , where is the space of executable control programs. Each action template is a function parameterized by continuous parameters and a tuple of objects that the action operates on. In this work, we assume that and are given.

Footnote 1: There is a direct extension of this formulation in which we encode relations among the objects as well. Doing so complicates notation and adds no new conceptual ideas. In our example domain it suffices to compute spatial relations from object properties, so there is no need to store relational information explicitly, and we omit it from our treatment.
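As a rough illustration of the distance-based example, the sketch below lifts an indicator relation on a location property into a deictic reference function. The helper names (`within_distance`, `make_reference`), the dictionary encoding of objects, the threshold value, and the choice to exclude the anchor object from its own neighborhood are all assumptions made for this example, not details taken from the paper.

```python
import math

def within_distance(threshold):
    """Build an indicator relation f(o, o1): True when o lies within
    `threshold` of o1's center. (Hypothetical helper.)"""
    def relation(o, o1):
        return math.dist(o["loc"], o1["loc"]) < threshold
    return relation

def make_reference(relation):
    """Lift an indicator relation into a deictic reference function that
    maps one object to the set of names of objects satisfying the relation."""
    def reference(universe, o1):
        # Assumption: the anchor object itself is excluded from the result.
        return {name for name, o in universe.items()
                if o is not o1 and relation(o, o1)}
    return reference

# A toy universe with a single location property per object.
universe = {
    "a": {"loc": (0.0, 0.0, 0.0)},
    "b": {"loc": (0.3, 0.0, 0.0)},
    "c": {"loc": (2.0, 0.0, 0.0)},
}
near = make_reference(within_distance(0.5))
print(near(universe, universe["a"]))  # names of objects close to "a"
```

Because the reference is computed from property values in the current state, the same function can return different sets as objects move.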

A problem instance is characterized by , where is a domain defined above and is a finite universe of objects with . For simplicity, we assume that, for a particular instance, the universe of objects remains constant over time. In the problem instance , we characterize a state in terms of the concrete values of all properties in on all objects in ; that is, . A problem instance induces the definition of its action space , constructed by applying every action template to all tuples of elements in and all assignments to the continuous parameters; namely, .

3.2 Sparse relational transition models

In many domains, there is substantial uncertainty, and the key to robust behavior is the ability to model this uncertainty and make plans that respect it. A sparse relational transition model (spare) for a domain, when applied to a problem instance for that domain, defines a probability density function on the state resulting from taking a given action in a given state. Our objective is to specify this function in terms of the domain's properties, deictic reference functions, and action templates in such a way that it will apply to any problem instance, independent of the number and properties of the objects in its universe. We achieve this by defining the transition model in terms of a set of transition rules and a score function. The score function takes as input a state and a rule, and outputs a non-negative integer. If the output is 0, the rule does not apply; otherwise, the rule can predict the distribution of the next state. The final prediction of spare is determined by the rules that apply; when no rule applies, the state is predicted not to change, up to a default predicted covariance that keeps our problem well-formed in the presence of noise in the input. This default covariance is built from an identity matrix and a square diagonal matrix whose main diagonal holds the default variance for each property if no rule applies. In the rest of this section, we formalize the definition of transition rules and the score function.

A transition rule is characterized by an action template, two ordered lists of deictic references (an input list and an output list), a predictor, and the default variances for each property under this rule. The action template is defined as operating on a tuple of object variables. A reference list uses the deictic reference functions to designate a list of additional objects or sets of objects, by making deictic references based on previously designated objects. In particular, the input reference list generates a list of objects whose properties affect the prediction made by the transition rule, while the output reference list generates a list of objects whose properties are affected after taking an action specified by the action template.

We begin with the simple case in which every function returns a single object, then extend our definition to the case of sets. Concretely, each element of the input reference list names a deictic reference function in the domain together with the indices of previously designated objects, as many as the arity of that function, to which the function is applied. The result is appended to the object list, so we get a new, longer list of objects. Thus the first reference can only refer to the objects that are named in the action, and determines one new object. The second reference can refer to objects named in the action or to the object determined by the first reference, and so on.

When a function in the reference list returns a set of objects rather than a single object, this process of adding more objects remains almost the same, except that the entries of the object list may denote sets of objects, and the functions that are applied to them must be able to operate on sets. In the case that a function returns a set, it must also specify an aggregator that can return a single value for each property, aggregated over the set. Examples of aggregators include the mean or maximum values or possibly simply the cardinality of the set.

Figure 1: A robot gripper is pushing a stack of 4 blocks on a table.

For example, consider the case of pushing the bottom block of a stack of 4 blocks, depicted in Figure 1. Suppose the deictic reference is above, which takes one object and returns the set of objects immediately on top of the input object. Then, by applying above repeatedly, starting from the set containing only the bottom block, we get an ordered list of sets of objects, each containing the block(s) immediately above those in the previous set.
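The stack example can be sketched in a few lines. This is a minimal illustration, assuming a state encoded as a map from each block to the block that supports it; the block names `A` to `D`, the `chain` helper, and the set-valued signature of `above` are hypothetical.

```python
def above(state, objects):
    """Return the set of objects resting directly on any object in `objects`.
    `state` maps each object name to the name of its support (None = table)."""
    return {o for o, support in state.items() if support in objects}

def chain(reference, state, start, max_steps):
    """Apply `reference` repeatedly, collecting an ordered list of object sets.
    Stops early when the reference returns the empty set."""
    levels = [set(start)]
    for _ in range(max_steps):
        nxt = reference(state, levels[-1])
        if not nxt:
            break
        levels.append(nxt)
    return levels

# A stack of four blocks: A on the table, B on A, C on B, D on C.
supports = {"A": None, "B": "A", "C": "B", "D": "C"}
levels = chain(above, supports, {"A"}, max_steps=3)
print(levels)
```

Stopping at the first empty set mirrors the convention that a rule with an empty reference does not apply; a shorter stack simply yields a shorter list.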

Figure 2: Instead of directly mapping from current state to next state , our prediction model uses deictic references to find subsets of objects for prediction. In the left most graph, we illustrate what relations are used to construct the input objects with two rules for the same action template, and , where the reference list applied a deictic reference to the target object and added input features computed by an aggregator on to the inputs of the predictor of rule . Similarly for , the first deictic reference selected and then is applied on to get . The predictors and are neural networks that map the fixed-length input to a fixed-length output, which is applied to a set of objects computed from a relational graph on all the objects, derived from the reference list and , to compute the whole next state . Because and the is only predicting a single property, we use a “de-aggregator” function to assign its prediction to both objects .

Returning to the definition of a transition rule, we now can see informally that if the parameters of the action template are instantiated to actual objects in a problem instance, then the input and output reference lists can be used to determine lists of input and output objects (or sets of objects). We can use these lists, finally, to construct input and output vectors. The input vector consists of the continuous action parameters and the values of all properties of the objects selected by the input reference list, in arbitrary but fixed order. When a selected element is a set of size greater than one, the aggregator associated with the function that computed the reference is used to compute a single value per property. Similarly, for the desired output construction, we use the references in the output list, start from the objects named in the action, and gradually add more objects to construct the output set of objects. The output vector consists of the property values of these objects, where, if an element is a set of objects, we apply a mean aggregator on the properties of all the objects in the set.

The predictor is some functional form (such as a feed-forward neural network) with parameters (weights) that takes the input vector and predicts a distribution for the output vector. It is difficult to represent arbitrarily complex distributions over output values. In this work, we restrict ourselves to representing a Gaussian distribution on all property values in the output vector, encoded with a mean and independent variance for each dimension.

Now, we describe how a transition rule can be used to map a state and action into a distribution over the new state. A transition rule applies to a particular state-action pair if the action is an instance of the rule's action template and if none of the elements of the input or output object lists is empty. To construct the input (and output) list, we begin by assigning the actual objects to the object variables in the action instance, and then successively compute references based on the previously selected objects, applying the definition of the deictic reference function in each reference to the actual values of the properties as specified in the state. If, at any point, a reference in either list returns an empty set, then the transition rule does not apply. If the rule does apply, and successfully selects input and output object lists, then the values of the input vector can be extracted from the state, and predictions are made on the mean and variance values of the output vector.
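To make the vector construction concrete, here is a minimal sketch that concatenates action parameters with one value per selected element and property, aggregating over sets. The function name `build_input_vector`, the property names, and the use of the mean as the default aggregator are assumptions for illustration.

```python
from statistics import mean

def build_input_vector(action_params, object_list, state, properties, aggregator=mean):
    """Concatenate action parameters with one value per (element, property).
    Each element of `object_list` is a set of object names; sets with more
    than one member are reduced to a single value by `aggregator`."""
    vector = list(action_params)
    for element in object_list:
        for prop in properties:
            # Sort names so the order inside a set is fixed and reproducible.
            values = [state[o][prop] for o in sorted(element)]
            vector.append(values[0] if len(values) == 1 else aggregator(values))
    return vector

state = {
    "A": {"x": 0.0, "height": 1.0},
    "B": {"x": 0.25, "height": 0.5},
    "C": {"x": 0.75, "height": 1.5},
}
# One action parameter, then block A alone, then the set {B, C} aggregated.
x = build_input_vector([0.3], [{"A"}, {"B", "C"}], state, ["x", "height"])
print(x)
```

The result has a fixed length regardless of how many objects fall inside each set, which is what allows a fixed-size predictor to be applied.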

Each entry of the predictor's output corresponds to the predicted Gaussian parameters of one property of one output object set. The predicted distribution of the resulting state is then assembled property by property: a property of an object in the output list follows the Gaussian predicted for it, while a property of any other object follows a Gaussian centered at its current value in the state, with the rule's default variance for that property. There are two important points to note. First, it is possible for the same object to appear in the object list more than once, and therefore for more than one predicted distribution to appear for its properties in the output vector. In this case, we use the mixture of all the predicted distributions with uniform weights. Second, when an element of the output object list is a set, then we treat this as predicting the same single property distribution for all elements of that set. This strategy has sufficed for our current set of examples, but an alternative choice would be to make the predicted values be changes to the current property value, rather than new absolute values. Then, for example, moving all of the objects on top of a tray could easily specify a change to each of their poses. We illustrate how we can use transition rules to build a spare in Fig. 2.

For each transition rule and state, we assign the score function value to be 0 if the rule does not apply to the state. Otherwise, we assign the total number of deictic references plus one as the score. The more references there are in a rule that is applicable to the state, the more detailed the match is between the rule's conditions and the state, and the more specific the predictions we expect it to be able to make.
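The scoring scheme can be sketched as follows. The dictionary encoding of a rule, the `applies` callable, and in particular the policy of keeping only the highest-scoring applicable rules are assumptions made for this example, motivated by the remark that rules with more references are expected to be more specific.

```python
def score(rule, state, action):
    """Return 0 if the rule does not apply, else (number of references + 1).
    `rule` is a dict with an `applies` callable and a list of `references`."""
    if not rule["applies"](state, action):
        return 0
    return len(rule["references"]) + 1

def select_rules(rules, state, action):
    """Return the applicable rules with the highest score (possibly empty).
    Assumption: ties are kept, and the caller decides how to combine them."""
    scored = [(score(r, state, action), r) for r in rules]
    best = max((s for s, _ in scored), default=0)
    return [r for s, r in scored if s == best and s > 0]

# Two hypothetical rules for the same action template.
single = {"name": "push_single", "references": [],
          "applies": lambda s, a: True}
stack = {"name": "push_stack", "references": ["above", "above"],
         "applies": lambda s, a: s["stacked"]}

chosen = select_rules([single, stack], {"stacked": True}, "push")
print([r["name"] for r in chosen])
```

An empty result corresponds to the case in which no rule applies and the default prediction is used.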

3.3 Learning spares from data

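We frame the problem of learning a transition model from data in terms of conditional likelihood. The learning problem is, given a problem domain description and a set of experience tuples, each consisting of a state, an action, and the resulting next state, to find a spare that minimizes the loss function of Eq. (2), the negative conditional data likelihood of the observed next states.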

Note that we require all of the tuples in the experience set to belong to the same domain, and require for any tuple that its state and next state belong to the same problem instance, but individual tuples may be drawn from different problem instances (with, for example, different numbers and types of objects). In fact, to get good generalization performance, it will be important to vary these aspects across training instances.

4 Learning algorithm

We describe our learning algorithm in three parts. First, we introduce our strategy for learning the predictor, which predicts a Gaussian distribution on the output vector, given the input vector. Then, we describe our algorithm for learning the input and output reference lists for a single transition rule, which enable the extraction of the input and output vectors from the experience samples. Finally, we present an EM method for learning multiple rules.

4.1 Distributional prediction

For a particular transition rule with an associated action template, once the input and output reference lists have been specified, we can extract input and output features from a given set of experience samples. From these features, we would like to learn the transition rule’s predictor to minimize Eq. (2). Our predictor takes the form of a Gaussian with a predicted mean and a predicted diagonal covariance. We use the method of Chua et al. (2017), where two neural-network models are learned as the function approximators, one for the mean prediction and the other for the diagonal variance prediction, each with its own parameters. We optimize the negative data-likelihood loss function by alternately optimizing the two sets of parameters. That is, we alternate between first optimizing the mean network with fixed covariance, and then optimizing the variance network with fixed mean.
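For reference, the loss being alternated over can be written as the standard negative log-likelihood of an independent (diagonal) Gaussian. The sketch below assumes that form and uses a hypothetical function name; it does not reproduce the networks or the optimizer.

```python
import math

def gaussian_nll(targets, means, variances):
    """Average negative log-likelihood of `targets` under independent
    Gaussians with the given per-dimension means and variances."""
    total = 0.0
    for y, mu, var in zip(targets, means, variances):
        total += 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)
    return total / len(targets)

# With the variance held fixed, only the squared-error term depends on the
# mean, so a mean closer to the target gives a lower loss.
exact = gaussian_nll([1.0, 2.0], [1.0, 2.0], [1.0, 1.0])
off = gaussian_nll([1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
print(exact, off)
```

In an alternating scheme of the kind described, one round would minimize this quantity with respect to the mean parameters while the variances are frozen, then with respect to the variance parameters while the means are frozen.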

Consider the set of experience tuples to which the rule applies. Once the predictor parameters have been learned, we can optimize the default variance of the rule by minimizing the same negative data-likelihood loss over this set. It can be shown that these loss-minimizing values for the default predicted variances are the empirical averages of the squared deviations for all unpredicted objects (i.e., those for which the rule does not explicitly make predictions), where averages are computed separately for each object property.

We use LearnDist to refer to this learning and optimization procedure for the predictor parameters and default variance.
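The claim about the default variances follows from a standard one-dimensional Gaussian calculation. As a sketch, with notation assumed here rather than taken from the paper: let $d_1, \dots, d_n$ be the observed changes in one property across all unpredicted objects, and let $\sigma^2$ be the default variance for that property. Minimizing the negative log-likelihood over $\sigma^2$ gives

```latex
\begin{aligned}
\ell(\sigma^2) &= \sum_{k=1}^{n}\left[\tfrac{1}{2}\log\!\left(2\pi\sigma^2\right) + \frac{d_k^2}{2\sigma^2}\right],\\[4pt]
\frac{\partial \ell}{\partial \sigma^2} &= \frac{n}{2\sigma^2} - \frac{1}{2\sigma^4}\sum_{k=1}^{n} d_k^2 = 0
\quad\Longrightarrow\quad
\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n} d_k^2 .
\end{aligned}
```

That is, the minimizer is the empirical average of the squared deviations, computed separately for each property.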

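procedure GreedySelect(training set, validation set, candidate references, maximum list length)
    initialize the reference list to be empty
    train model using the current reference list, save its validation loss as the best loss
    while the reference list is shorter than the maximum list length do
        best candidate ← None; best candidate loss ← ∞
        for all candidate references do
            train model using the reference list extended by the candidate, compute its validation loss
            if this loss is lower than the best candidate loss then record the candidate and its loss
        if the best candidate loss is lower than the best loss then append the best candidate; update the best loss
        else break
Algorithm 1: Greedy procedure for constructing a list of deictic references.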

4.2 Single rule

In the simple setting where only one transition rule exists in our domain, we show how to construct the input and output reference lists that determine the input and output vectors. Suppose for now that the action template and the output reference list are fixed, and we wish to learn the input reference list. Our approach is to incrementally build up this list by adding reference tuples one at a time via a greedy selection procedure. Specifically, we take the universe of possible references, split the experience samples into a training set and a validation set, and initialize the list to be empty. For each candidate reference, we compute the loss of Eq. (2) for a spare with a single transition rule whose input list is the current list extended by the candidate, where the predictor parameters and default variances are computed using the LearnDist procedure described in Section 4.1. If the lowest such loss is less than the loss of the current list, then we extend the list with the corresponding candidate and continue. Otherwise, we terminate the greedy selection process with the current list, since further growing the list of deictic references hurts the loss. We also terminate the process when the list length exceeds some predetermined maximum allowed number of input deictic references. Pseudocode for this algorithm is provided in Algorithm 1.

Footnote 2: When the rule does not apply to a training sample, we use for its loss the loss that results from having empty reference lists in the rule. Alternatively, we can compute the default variance to be the empirical variances on all training samples that cannot use the rule.

In our experiments we set the output reference list equal to the input reference list and construct the lists of deictic references using a single pass of the greedy algorithm described above. This simplification is reasonable, as the set of objects that are relevant to predicting the transition outcome often overlaps substantially with the objects that are affected by the action. Alternatively, we could learn the output reference list via an analogous greedy procedure nested around or, as a more efficient approach, interleaved with, the one for learning the input reference list.
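The greedy procedure can be sketched as a short loop around a black-box evaluation function. In this illustration, `evaluate` stands in for training a predictor with a given reference list and returning its validation loss; the toy loss, the reference names, and the restriction that each candidate is used at most once are assumptions made to keep the example self-contained.

```python
def greedy_select(candidates, evaluate, max_refs):
    """Greedily grow a reference list while the validation loss improves.
    `evaluate(refs)` is assumed to train a model that uses `refs` and
    return its validation loss (lower is better)."""
    selected = []
    best_loss = evaluate(selected)
    while len(selected) < max_refs:
        trial = [(evaluate(selected + [c]), c)
                 for c in candidates if c not in selected]
        if not trial:
            break
        loss, choice = min(trial, key=lambda t: t[0])
        if loss >= best_loss:
            break  # growing the list no longer helps
        selected.append(choice)
        best_loss = loss
    return selected, best_loss

# Toy stand-in for training: useful references lower the loss by 1,
# extraneous ones raise it slightly.
USEFUL = {"above", "above_above"}

def toy_loss(refs):
    useful = sum(r in USEFUL for r in refs)
    return 3.0 - useful + 0.1 * (len(refs) - useful)

selected, best = greedy_select(["above", "nearest", "above_above"], toy_loss, max_refs=3)
print(selected, best)
```

Each outer iteration retrains one model per remaining candidate, so the cost grows with the number of candidates times the maximum list length.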

4.3 Multiple rules

Our training data in robotic manipulation tasks are likely to be best described by many rules instead of a single one, since different combinations of relations among objects could be present in different states. For example, we may have one rule for pushing a single object and another rule for pushing a stack of objects. We now address the case where we wish to learn several rules from a single experience set. We do so via initial clustering to separate experience samples into clusters, one for each rule to be learned, followed by an EM-like approach to further separate samples and simultaneously learn rule parameters.

To facilitate the learning of our model, we will additionally learn membership probabilities, where each entry represents the probability that a given experience sample is assigned to a given transition rule, and the probabilities for each sample sum to one over the rules. We initialize membership probabilities via clustering, then refine them through EM.

Because the experience samples may come from different problem instances and involve different numbers of objects, we cannot directly run a clustering algorithm such as k-means on the samples themselves. Instead we first learn a single transition rule from the experience set using the algorithm in Section 4.2, use the resulting input and output reference lists to transform the samples into input and output vectors, and then run k-means clustering on the concatenation of the input vectors, the output vectors, and the values of the loss function when this rule is used to predict each of the samples. For each experience sample, the squared distance from the sample to each of the cluster centers is computed, and membership probabilities for the sample to each of the transition rules to be learned are initialized to be proportional to the (multiplicative) inverses of these squared distances.

Before introducing the EM-like algorithm that simultaneously improves the assignment of experience samples to transition rules and learns details of the rules themselves, we make a minor modification to transition rules to obtain mixture rules. Whereas a probabilistic transition rule has been defined with fixed input and output reference lists, a mixture rule replaces each list with a distribution over all possible lists of input references (and similarly for output references), of which there are a finite number, since the set of available reference functions is finite and there is an upper bound on the number of references a list may contain. For simplicity of terminology, we refer to each possible list of references as a shell, so each such distribution is a distribution over possible shells. Finally, a mixture rule contains a collection of transition rules (i.e., predictors, each with an associated input shell, output shell, and default variances). To make predictions for a sample using a mixture rule, predictions from each of the mixture rule’s transition rules are combined according to the probabilities that the shell distributions assign to each transition rule’s input and output shells. Rather than having our EM approach learn transition rules, we instead learn mixture rules, as the shell distributions allow for smoother sorting of experience samples into clusters corresponding to the different rules, in contrast to the discrete reference lists of regular transition rules.

As before, we focus on the case where, for each mixture rule, the output reference lists equal the input reference lists, and the distributions over input and output shells coincide as well. Our EM-like algorithm is then as follows:
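The initialization from squared distances amounts to a normalized inverse. A minimal sketch, in which the function name and the small `eps` guard against division by zero are assumptions:

```python
def init_memberships(sq_distances, eps=1e-12):
    """Turn one sample's squared distances to the cluster centers into
    membership probabilities proportional to their inverses.
    `eps` (an assumption) avoids division by zero for a sample that
    coincides with a center."""
    inv = [1.0 / (d + eps) for d in sq_distances]
    total = sum(inv)
    return [v / total for v in inv]

# A sample at squared distance 1 from the first center and 4 from the second.
probs = init_memberships([1.0, 4.0])
print(probs)
```

A sample four times farther (in squared distance) from one center than another thus starts with one quarter of the relative weight on the farther cluster.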

  1. For each mixture rule, initialize its shell distribution as follows. First, use the algorithm in Section 4.2 to learn a transition rule on the weighted experience samples, with weights equal to the membership probabilities. In the process of greedily assembling reference lists, data likelihood loss function values are computed for multiple explored shells, in addition to the shell that was ultimately selected. Initialize the shell distribution to distribute weight proportionally, according to data likelihood, among these explored shells, with a normalization factor, computed over all explored shells, so that the explored shells together receive a fixed total weight. The remaining probability weight is distributed uniformly across unexplored shells.

  2. For each mixture rule (dropping the rule subscript for notational simplicity):

    a. For each of the top-weighted shells, train a predictor using the procedure in Section 4.2 on the weighted experience samples, where the shell used for the k-th predictor is the list of references with the k-th highest weight according to the shell distribution.

    b. Update the shell distribution by redistributing weight among the top shells according to a voting procedure where each training sample “votes” for the shell whose predictor minimizes the validation loss for that sample. Then, shell weights are assigned to be proportional to the sum of the sample weights (i.e., membership probability of belonging to this rule) of samples that voted for each particular shell. The weights of the top shells are updated in proportion to the votes they received, with a normalization factor to ensure that the shell distribution remains a valid probability distribution.

    c. Repeat Step 2a, in case the shells with highest values have changed, in preparation for using the mixture rule to make predictions in the next step.

  3. Update membership probabilities by scaling by data likelihoods from using each of the rules to make predictions: each sample's membership probability for a rule is multiplied by the data likelihood from using that mixture rule to make predictions for the sample, followed by a normalization so that each sample's membership probabilities again sum to one.

  4. Repeat Steps 2 and 3 some fixed number of times, or until convergence.
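Two of the bookkeeping updates in the steps above, the shell voting in Step 2b and the membership update in Step 3, can be sketched as follows. The function names and list-of-lists encoding are assumptions, the voting sketch normalizes over the voted shells only, and both functions assume at least one nonzero weight so that the normalization is defined.

```python
def vote_shell_weights(sample_weights, per_shell_losses):
    """Each sample votes for the shell with the lowest loss; a shell's weight
    is the normalized sum of the membership weights of its voters."""
    votes = [0.0] * len(per_shell_losses[0])
    for w, losses in zip(sample_weights, per_shell_losses):
        votes[losses.index(min(losses))] += w
    total = sum(votes)
    return [v / total for v in votes]

def update_memberships(memberships, likelihoods):
    """Scale each sample's memberships by the per-rule data likelihoods,
    then renormalize so that each row sums to one."""
    updated = []
    for z_row, l_row in zip(memberships, likelihoods):
        scaled = [z * l for z, l in zip(z_row, l_row)]
        total = sum(scaled)
        updated.append([s / total for s in scaled])
    return updated

# Three samples, two candidate shells for one rule.
shell_weights = vote_shell_weights([0.5, 0.25, 0.25],
                                   [[1.0, 2.0], [3.0, 1.0], [0.5, 4.0]])
# One sample, two rules: the first rule explains the sample much better.
new_z = update_memberships([[0.5, 0.5]], [[0.9, 0.1]])
print(shell_weights, new_z)
```

Weighting votes by membership probability means that samples only loosely assigned to a rule have little influence on which shells that rule favors.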

5 Experiments

Figure 3: Representative problem instances sampled from the domain.

We apply our approach, spare, to the challenging problem of predicting the effects of pushing stacks of blocks on a cluttered tabletop. In this section, we describe our domain and the baseline that we compare to, and report our results.

5.1 Object manipulation domain

In our domain, the object universe is composed of blocks of different sizes and weights, and the property set includes the shapes of the blocks (width, length, height) and the position of each block (its location relative to the table). We have one action template, push, which pushes toward a target object with continuous parameters specifying the 3D position of the gripper before the push starts and the distance of the push. The orientation of the gripper and the direction of the push are computed from the gripper location and the target object location. We simulate this 3D domain using the physically realistic PyBullet (Coumans & Bai, 2016–2018) simulator. In real-world scenarios, an action cannot be executed with the exact action parameters because of inaccuracy in the motors; hence, in our simulation, we add Gaussian noise to the action parameters during execution to imitate this effect.

We consider the following deictic references in the reference collection: (1) above (O), which takes in an object O and returns the object immediately above O; (2) above* (O), which takes in an object O and returns all of the objects that are above O; (3) below (O), which takes in an object O and returns the object immediately below O; (4) nearest (O), which takes in an object O and returns the object that is closest to O.

5.2 Baseline methods

The baseline method we compare to is a neural network function approximator that takes as input the current state and action parameters, and outputs the next state. The list of objects that appear in each state is ordered: the target objects appear first and the remaining objects are sorted by their poses (first by one coordinate, then by the next, and so on).
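The object ordering used for the baseline input can be sketched as below. The function name, the `pos` key, and the specific coordinate order (first, second, then third component of the position tuple) are assumptions for illustration.

```python
def order_objects(state, targets):
    """Order object names for a monolithic input vector: targets first (in
    the given order), then the rest sorted by position, comparing the first
    coordinate, then the second, then the third."""
    rest = sorted((o for o in state if o not in targets),
                  key=lambda o: state[o]["pos"])
    return list(targets) + rest

def flatten_positions(state, order):
    """Concatenate positions in the given object order into one flat list."""
    return [c for o in order for c in state[o]["pos"]]

state = {
    "t": {"pos": (5.0, 0.0, 0.0)},
    "a": {"pos": (1.0, 2.0, 0.0)},
    "b": {"pos": (1.0, 1.0, 0.0)},
    "c": {"pos": (0.0, 9.0, 9.0)},
}
order = order_objects(state, ["t"])
print(order, flatten_positions(state, order))
```

Because the flattened vector grows with the number of objects, a monolithic model of this kind must be sized for a fixed object count, which is one reason additional objects affect it.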

5.3 Results

Figure 4: (a) In a simple 3-block pushing problem instance, data likelihood and learned default standard deviation both improve as more deictic references are added. (b) Comparing performance as a function of number of distractors with a fixed amount of training data. (c) Comparing sample efficiency of spare to the baseline. Shaded regions represent confidence intervals.

Effects of deictic rules

As a sanity check, we start from a simple problem where a gripper pushes a stack of three blocks on an empty table. We randomly sampled 1250 problem instances by drawing block shapes and locations from a uniform distribution within a range, subject to the condition that the stack of blocks is stable. In each problem instance, we sample the action parameters uniformly at random and obtain the training data, a collection of (state, action, next state) tuples, where the target object of the push action is always the one at the bottom of the stack. We held out a portion of the training data as the validation set. We found that our approach is able to reliably select the correct combinations of references, namely those that select all the blocks in the problem instance, to construct inputs and outputs. In Fig. 4(a), we show how the performance varies as deictic references are added during a typical run of this experiment. The solid blue curve shows training performance, as measured by data likelihood on the validation set, while the dashed blue curve shows performance on a held-out test set with 250 unseen problem instances. As expected, performance improves noticeably from the addition of the first two deictic references selected by the greedy selection procedure, but not from the fourth, as we only have 3 blocks and the fourth selected object is extraneous. The red curve shows the learned default standard deviations, used to compute data likelihoods for features of objects not explicitly predicted by the rule. As expected, the learned default standard deviation drops as deictic references are added, until it levels off after the third reference is added, since at that point the set of references captures all moving objects in the scene.
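The role of the default standard deviation can be illustrated with a small sketch. It assumes that every feature is scored under an independent Gaussian, and that features of objects without an explicit prediction are scored by a Gaussian centered on their previous value with the shared default standard deviation. The function names and the dictionary-based state layout are hypothetical and not taken from the paper.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_data_likelihood(prev_state, next_state, predictions, default_std):
    """Sum the log likelihood over every feature of every object.

    prev_state, next_state: dict mapping object id -> list of feature values.
    predictions: dict mapping object id -> list of (mean, std) pairs, given
        only for the objects that the rule explicitly predicts.
    default_std: shared standard deviation for all unpredicted features,
        which are assumed here to stay at their previous values.
    """
    total = 0.0
    for obj, observed in next_state.items():
        if obj in predictions:
            params = predictions[obj]
        else:
            params = [(value, default_std) for value in prev_state[obj]]
        total += sum(gaussian_log_pdf(x, m, s) for x, (m, s) in zip(observed, params))
    return total
```

Under these assumptions, once every moving object is covered by an explicit prediction, the remaining objects stay where they were, and a smaller default standard deviation yields a higher likelihood.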

Sensitivity analysis on the number of objects

We compare our approach to the baseline in terms of how sensitive the performance is to the number of objects in the problem instance. We continue the simple example where a stack of three blocks lies on a table, now with extra blocks that may affect the prediction of the next state. Figure 4(b) shows the performance, as measured by the log data likelihood, as a function of the number of extra blocks. For each number of extra blocks, we used 1250 training problem instances (with 15% held out as the validation set) and 250 testing problem instances. When there are no extra blocks, spare learns a single rule whose inputs and outputs contain the same information as the inputs and outputs of the baseline method. As more objects are added to the table, baseline performance drops, as the presence of these additional objects appears to complicate the scene and the baseline is forced to consider more objects when making its predictions. However, performance of the spare approach remains unchanged, as deictic references are used to select just the three blocks in the stack as input in all cases, regardless of the number of extra blocks on the table.

Note that in the interest of fairness of comparison, the plotted data likelihoods are averages over only the three blocks in the stack. This is because the spare approach assumes that objects for which no explicit predictions are made do not move, which happens always to be true in these examples, and would thus give spare an unfair advantage if the data likelihood were averaged over all blocks in the scene. We show the complete results in the appendix. Note also that, performance aside, the baseline is limited to problems for which the number of objects in the scenes is fixed, as it requires a fixed-size input vector containing information about all objects. Our spare approach does not have this limitation, and could have been trained on a single, large dataset that is the combination of the datasets with varying numbers of extra objects, benefiting from a larger training set. However, we did not do this in our experiments for the sake of providing a fairer comparison against the baseline.

Sample efficiency

We evaluate our approach on more challenging problem instances in which the robot gripper pushes blocks on a cluttered table top, where additional blocks on the table neither interfere with nor are affected by the pushing action. Fig. 4(c) plots the data likelihood as a function of the number of training samples. We vary the number of training samples and, in each setting, evaluate on a held-out test dataset. Both our approach and the baseline benefit from having more training samples, but our approach is much more sample efficient and achieves good performance with only a thousand training samples, while the baseline, even when trained on many more samples, does not match what our approach achieves with far fewer.

Figure 5: (a) Shell weights per iteration of our EM-like algorithm. (b) Membership probabilities of training samples per iteration.

Learning multiple transition rules

Now we put our approach in a more general setting where multiple transition rules need to be learned to predict the next state. Our approach adopts an EM-like procedure that assigns each training sample a distribution over the transition rules and learns each transition rule from the re-weighted training samples. First, we construct a training dataset in which a portion of the samples involves pushing a stack of a particular height. Our EM approach is able to concentrate on the 4-block case, as shown in Fig. 5(a).

Fig. 5(b) tracks the assignment of samples to rules over the same five runs of our EM procedure. The three curves correspond to the three stack heights in the original dataset, and each shows the average weight assigned to the “target” rule among samples of that stack height, where the target rule is the one that starts with a high concentration of samples of that particular height. At iteration 0, we see that the rules were initialized such that samples were assigned 70% probability of belonging to specific rules, based on stack height. As the algorithm progresses, the samples separate further, suggesting that the algorithm is able to separate samples into the correct groups.
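One common way to realize such an EM-like assignment is the mixture-model update sketched below: each sample's membership probabilities are proportional to the rule weight times the sample's likelihood under that rule, and rule weights are then re-estimated as average memberships. This is a generic sketch under that assumption, with hypothetical function names, and is not necessarily the exact update used in the paper.

```python
import math


def membership_probabilities(log_liks, rule_weights):
    """E-step-style update for a single sample.

    log_liks[k]: log likelihood of the sample under rule k.
    rule_weights[k]: current weight of rule k (assumed strictly positive).
    Returns a distribution over rules for this sample.
    """
    scores = [math.log(w) + ll for w, ll in zip(rule_weights, log_liks)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_rule_weights(memberships):
    """Re-estimate rule weights as the average membership across samples."""
    n = len(memberships)
    k = len(memberships[0])
    return [sum(row[j] for row in memberships) / n for j in range(k)]
```

In a full loop, each rule's predictor would then be retrained with samples weighted by their membership probabilities before the next assignment step.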


These results demonstrate the power of combining relational abstraction with neural networks, to learn probabilistic state transition models for an important class of domains from very little training data. In addition, the structural nature of the learned models will allow them to be used in factored search-based planning methods that can take advantage of sparsity of effects to plan efficiently.


  • Agre & Chapman (1987) Philip E Agre and David Chapman. Pengi: An implementation of a theory of activity. In AAAI, volume 87, pp. 268–272, 1987.
  • Battaglia et al. (2018) P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu. Relational inductive biases, deep learning, and graph networks. ArXiv e-prints, June 2018.
  • Benson (1997) Scott S Benson. Learning action models for reactive autonomous agents. Technical report, Stanford University, Stanford, CA, USA, 1997.
  • Chua et al. (2017) Kurtland Chua, Roberto Calandra, and Sergey Levine. On the importance of uncertainty for control with deep dynamics models. In NIPS Workshop on Acting and Interacting in the Real World, 2017.
  • Coumans & Bai (2016–2018) Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2018.
  • Drescher (1991) Gary L Drescher. Made-up minds: a constructivist approach to artificial intelligence. MIT press, 1991.
  • Kansky et al. (2017) Ken Kansky, Tom Silver, David A. Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, D. Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In Proceedings of the 34th International Conference on Machine Learning, pp. 1809–1818, 2017.
  • Lang & Toussaint (2010) Tobias Lang and Marc Toussaint. Planning with noisy probabilistic relational rules. Journal of Artificial Intelligence Research, 39:1–49, 2010.
  • Mourão (2014) Kira Mourão. Learning probabilistic planning operators from noisy observations. In Proceedings of the Workshop of the UK Planning and Scheduling Special Interest Group, 2014.
  • Mourão et al. (2012) Kira Mourão, Luke S. Zettlemoyer, Ronald P. A. Petrick, and Mark Steedman. Learning STRIPS operators from noisy and incomplete observations. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 614–623, 2012.
  • Pasula et al. (2007) Hanna M Pasula, Luke S Zettlemoyer, and Leslie Pack Kaelbling. Learning symbolic models of stochastic domains. Journal of Artificial Intelligence Research, 29:309–352, 2007.

Appendix A Discussions and future work

In this work, we made an attempt to use deictic references to construct transition rules that can generalize over different problem instances with varying numbers of objects. Our approach can be viewed as constructing “convolutional” layers for a neural network that operates on objects. Just as a convolutional layer can operate on images of different sizes, we use deictic rules to find relations among objects and construct features for learning, without a constraint on the scale of the problem or the number of objects. Our approach is to find structure in existing data and learn the construction of deictic rules, each of which uses a specific strategy to select the objects that are important to an action.

Many possible improvements and extensions can be made to this work, including experimentation with different template structures, more powerful deictic references, more refined feature selection, and integration with planning.

This work focused on using deictic references to select input and output objects for a predictor. A natural extension is to select not just objects, but also features of objects to obtain more refined templates and even smaller predictors, especially in cases where objects have many different properties, only some of which are relevant to the prediction problem at hand. There is also room for experimentation with different types of aggregators for combining features of objects when deictic references select multiple objects in a set.

Finally, the end purpose of obtaining a transition model for robotic actions is to enable planning to achieve high-level goals. Due to time constraints, this work did not assess the effectiveness of learned template-based models in planning. This is a promising area for future work, because the assumption that features of objects for which a template makes no explicit prediction do not change meshes well with STRIPS-style planners, which make similar assumptions.

Appendix B More empirical results

Here we provide details on our experiments and additional results.

Experimental details

All experiments in this paper used predictors consisting of two neural networks, one for making mean predictions and one for making variance predictions, as described in Section 4.1. Each network of the predictor was implemented in Keras as a fully-connected network with two hidden layers of 150 nodes each, with ReLU activations between layers, and was trained with the Adam optimizer with default parameters. Predictors for the templates approach were trained for 100 epochs each, alternating between the mean and variance predictor for 25 epochs at a time, for a total of four times, followed by training the mean predictor for another 25 epochs at the end. The baseline predictor was implemented in exactly the same way, but trained for a total of 300 epochs, rather than 100.

States were parameterized by the pose of each object in the scene, ordered such that the target object of the action always appeared first, and other objects appeared in random order (except for the baseline). Action parameters included the starting pose of the robotic gripper, as well as a “push distance” parameter that controls how long the push action lasts. Actions were made stochastic by adding some amount of noise to the target location of each push, reflecting potential inaccuracies in robotic control.

Sensitivity analysis on the number of objects

In the main paper, we analyzed the sensitivity of the performance to the number of objects in the problem instance and showed the log data likelihood only on objects that moved (the stack), for a fair comparison to the baseline. In Figure 6, we show the log data likelihood on all objects in the problem instances. As the number of extra objects increases, the baseline approach performs much worse than spare.

Figure 6: Comparing the data likelihood on all objects.

Initialization of Membership Probabilities

We use the clustering-based approaches for initializing membership probabilities presented in Section 4.3. In this section, we show how well our clustering approach performs.

Table 1 shows the sample separation achieved by the discrete clustering approach, where samples are assigned solely to their associated clusters found by k-means, on the push dataset for stacks of varying height. Each column corresponds to the one-third of samples that involve stacks of a particular height. Entries in the table show the proportion of samples of that stack height that have been assigned to each of the three clusters, where in generating these data the clusters were ordered so that the predominantly 2-block sample cluster came first, followed by the predominantly 3-block cluster, then the 4-block cluster. Values in parentheses are standard deviations across three runs of the experiment. As seen in the table, separation of samples into clusters is quite good, though 22.7% of 3-block samples were assigned to the predominantly 4-block cluster, and 11.5% of 2-block samples were assigned to the predominantly 3-block cluster.

                 Num objects in sample
Cluster index    2                3                4
1                0.829 (0.042)    0.022 (0.030)    0.089 (0.126)
2                0.115 (0.081)    0.751 (0.028)    0.038 (0.025)
3                0.056 (0.039)    0.227 (0.012)    0.872 (0.102)
Table 1: Sample separation from discrete clustering approach. Standard deviations are reported in parentheses.
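The entries of such a separation table can be computed as sketched below: for each stack height, count how the samples of that height are distributed across clusters and normalize per height, so that each column sums to one. The function name and the toy inputs are hypothetical.

```python
from collections import Counter


def separation_table(heights, clusters):
    """Proportion of samples of each stack height assigned to each cluster.

    heights[i]: stack height of sample i.
    clusters[i]: cluster index assigned to sample i.
    Returns table[cluster][height], where each height column sums to 1.
    """
    per_height = Counter(heights)
    joint = Counter(zip(clusters, heights))
    return {
        c: {h: joint[(c, h)] / per_height[h] for h in sorted(per_height)}
        for c in sorted(set(clusters))
    }
```

Averaging such tables over repeated runs, and taking the per-entry standard deviation, gives values of the kind reported in parentheses.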

The sample separation evidenced in Table 1 is sufficient for templates trained on the three clusters of samples to reliably select deictic references that consider the correct number of blocks, i.e., the 2-block cluster reliably learns a template which considers the target object and the object above the target, and similarly for the other clusters and their respective stack heights. However, initializing samples to belong solely to a single cluster, rather than initializing membership probabilities, is unlikely to be robust in general, so we turn to the proposed clustering-based approaches for initializing membership probabilities instead.

Table 2 is analogous to Table 1 in structure, but shows sample separation results for sample membership probabilities initialized to be proportional to the inverse distance from the sample to each of the cluster centers found by k-means. Table 3 is the same, except with membership probabilities initialized to be proportional to the square of the inverse distance to cluster centers.
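Both initializations can be written as one small function, sketched here under the assumption that the distances from a sample to each cluster center are already available. The function name, the `power` switch, and the small `eps` guard against division by zero are choices made for this example.

```python
def init_membership(distances, power=1, eps=1e-12):
    """Initialize membership probabilities from distances to cluster centers.

    power=1 weights each cluster by the inverse distance;
    power=2 weights each cluster by the squared inverse distance.
    eps is a hypothetical guard for a sample that sits exactly on a center.
    """
    inv = [1.0 / (d + eps) ** power for d in distances]
    total = sum(inv)
    return [w / total for w in inv]
```

For distances 1, 2 and 4, the inverse-distance weights normalize to roughly 0.57, 0.29 and 0.14, while the squared version gives roughly 0.76, 0.19 and 0.05, so squaring concentrates more probability on the closest cluster.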

                 Num objects in sample
Cluster index    2                3                4
1                0.595 (0.010)    0.144 (0.004)    0.111 (0.003)
2                0.259 (0.005)    0.551 (0.007)    0.286 (0.005)
3                0.146 (0.006)    0.305 (0.003)    0.602 (0.008)
Table 2: Sample separation from clustering-based initialization of membership probabilities, where probabilities are assigned to be proportional to inverse distance to cluster centers. Standard deviations are reported in parentheses.

                 Num objects in sample
Cluster index    2                3                4
1                0.730 (0.009)    0.065 (0.025)    0.118 (0.125)
2                0.149 (0.079)    0.665 (0.029)    0.171 (0.056)
3                0.121 (0.076)    0.270 (0.005)    0.716 (0.069)
Table 3: Sample separation from clustering-based initialization of membership probabilities, where probabilities are assigned to be proportional to squared inverse distance to cluster centers. Standard deviations are reported in parentheses.

Sample separation is better in the case of squared distances than non-squared distances, but it is unclear whether this result generalizes to other datasets. For our specific problem instance, the log data likelihood feature turns out to be very important for the success of these clustering-based initialization approaches. For example, Table 4 shows results analogous to those in Table 3, where the only difference is that all log data likelihoods were multiplied by five before being passed as input to the k-means clustering algorithm. A comparison of the two tables shows that scaling the data likelihood so that it becomes relatively more important as a feature results in better data separation. This suggests that the relative importance between log likelihood and the other input features is a parameter of these clustering approaches that should be tuned.
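The effect of this scaling on a distance-based clustering method can be seen in a short sketch. It assumes each sample is a flat feature list in which one column holds the log data likelihood; the column index, function names, and toy values are hypothetical.

```python
def scale_feature(samples, index, factor):
    """Return a copy of samples with one feature column multiplied by factor."""
    return [[v * factor if j == index else v for j, v in enumerate(row)]
            for row in samples]


def squared_distance(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Multiplying a feature by five multiplies its contribution to the squared Euclidean distance by twenty-five, so differences in that feature dominate the cluster assignment more strongly.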

                 Num objects in sample
Cluster index    2                3                4
1                0.779 (0.017)    0.020 (0.001)    0.016 (0.001)
2                0.174 (0.014)    0.744 (0.006)    0.118 (0.014)
3                0.047 (0.005)    0.236 (0.005)    0.866 (0.015)
Table 4: Sample separation from clustering-based initialization of membership probabilities, where probabilities are assigned to be proportional to squared inverse distance to cluster centers, and the log data likelihood feature used as part of k-means clustering has been multiplied by a factor of five. Standard deviations are reported in parentheses.

Effect of object ordering on baseline performance

The single-predictor baseline used in our experiments receives all objects in the scene as input, but this leaves open the question of the order in which these objects should be presented. Because the templates approach has the target object of the action specified, in the interest of fairness this information is also provided to the baseline by having the target object always appear first in the ordering. As there is in general no clear ordering for the remainder of the objects, we could present them in a random order, but sorting the objects according to position (one coordinate at a time) could perhaps result in better predictions than a completely random ordering. (As this suspicion turns out to be true, the baseline experiments order blocks in sorted order by position.)

To analyze the effect of object ordering on baseline performance, we run the same experiment where a push action is applied to the bottom-most of a stack of three blocks, and there exist some number of additional objects on the table that do not interfere with the action in any way. Figure 7 shows our results. We test three object orderings: random (“none”), sorted according to object position (“xtheny”), and an ideal ordering where the first three objects in the ordering are exactly the three objects in the stack ordered from bottom up (“stack”). As expected, in all cases, predicted log likelihood drops as more extra blocks are added to the scene, and the random ordering performs worst while the ideal ordering performs best.

Figure 7: Effect of object ordering on baseline performance, on task of pushing a stack of three blocks on a table top, where there are extra blocks on the table that do not interfere with the push.