1 Introduction
Many complex domains are appropriately described in terms of sets of objects, properties of those objects, and relations among them. We are interested in the problem of taking actions to change the state of such complex systems, in order to achieve some objective. To do this, we require a transition model, which describes the system state that results from taking a particular action, given the previous system state. In many important domains, ranging from interacting with physical objects to managing the operations of an airline, actions have localized effects: they may change the state of the object(s) being directly operated on, as well as some objects that are related to those objects in important ways, but will generally not affect the vast majority of other objects.
In this paper, we present a strategy for learning state-transition models that embodies these assumptions. We structure our model in terms of rules, each of which only depends on and affects the properties and relations among a small number of objects in the domain, and only very few of which may apply for characterizing the effects of any given action. Our primary focus is on learning the kernel of a rule: that is, the set of objects that it depends on and affects. At a moderate level of abstraction, most actions taken by an intentional system are inherently directly parametrized by at least one object that is being operated on: a robot pushes a block, an airport management system reschedules a flight, an automated assistant commits to a venue for a meeting. It is clear that properties of these “direct” objects are likely to be relevant to predicting the action’s effects and that some properties of these objects will be changed. But how can we characterize which other objects, out of all the objects in a household or airline network, are relevant for prediction or likely to be affected?
To do so, we make use of the notion of a deictic reference. In linguistics, a deictic (literally meaning “pointing”) reference is a way of naming an object in terms of its relationship to the current situation rather than in global terms. So, “the object I am pushing,” “all the objects on the table nearest me,” and “the object on top of the object I am pushing” are all deictic references. This style of reference was introduced as a representation strategy for AI systems by Agre & Chapman (1987), under the name indexical-functional representations, for the purpose of compactly describing policies for a video-game agent, and has been in occasional use since then.
We will learn a set of deictic references, for each rule, that characterize, relative to the object(s) being operated on, which other objects are relevant. Given this set of relevant objects, the problem of describing the transition model on a large, variable-size domain reduces to describing a transition model on fixed-length vectors characterizing the relevant objects and their properties and relations, which we represent and learn using standard feedforward neural networks.
In the following sections, we briefly survey related work, describe the problem more formally, and then provide an algorithm for learning both the structure, in terms of deictic references, and parameters, in terms of neural networks, of a sparse relational transition model. We go on to demonstrate this algorithm in a simulated robot-manipulation domain in which the robot pushes objects on a cluttered table.
2 Related work
Rule learning has a long history in artificial intelligence. The novelty in our approach is the combination of learning discrete structures with flexible parametrized models in the form of neural networks.
Rule learning. We are inspired by very early work on rule learning by Drescher (1991), which sought to find predictive rules in simple noisy domains, using Boolean combinations of binary input features to predict the effects of actions. This approach has a modern reinterpretation in the form of schema networks (Kansky et al., 2017). The rules we learn are lifted, in the sense that they can be applied to objects, generally, and are not tied to specific bits or objects in the input representation, and probabilistic, in the sense that they make a distributional prediction about the outcome. In these senses, this work is similar to that of Pasula et al. (2007) and methods that build on it (Mourão et al., 2012; Mourão, 2014; Lang & Toussaint, 2010). In addition, the approach of learning to use deictic expressions was inspired by Pasula et al. and used also by Benson (1997). Our representation and learning algorithm improves on the Pasula et al. strategy by using the power of feedforward neural networks as a local transition model, which allows us to address domains with real-valued properties and much more complex dependencies. In addition, our EM-based learning algorithm presents a much smoother space in which to optimize, making the overall learning faster and more robust. We do not, however, construct new functional terms during learning; that would be an avenue for future work for us.
Graph network models. There has recently been a great deal of work on learning graph-structured (neural) network models (Battaglia et al. (2018) provide a good survey). There is a way in which our rule-based structure could be interpreted as a kind of graph network, although it is fairly nonstandard. We can understand each object as being a node in the network, and the deictic functions as being labeled directed hyperedges (between sets of objects). Unlike the graph network models we are aware of, we do not condition on a fixed set of neighboring nodes and edges to compute the next value of a node; in fact, a focus of our learning method is to determine which neighbors (and neighbors of neighbors, etc.) to condition on, depending on the current state of the edge labels. This means that the relevant neighborhood structure of any node changes dynamically over time, as the state of the system changes. This style of graph network is not inherently better or worse than others: it makes a different set of assumptions (including a strong default that most objects do not change state on any given step and the dynamic nature of the neighborhoods) which are particularly appropriate for modeling an agent’s interactions with a complex environment using actions that have relatively local effects.
3 Problem formulation
We assume we are working on a class of problems in which the domain is appropriately described in terms of objects. This method might not be appropriate for a single high-dimensional system in which the transition model is not sparse or factorable, or can be factored along different lines (such as a spatial decomposition) rather than along the lines of objects and properties. We also assume a set of primitive actions defined in terms of control programs that can be executed to make actual changes in the world state and then return. These might be robot motor skills (grasping or pushing an object) or virtual actions (placing an order or announcing a meeting). In this section, we formalize this class of problems, define a new rule structure for specifying probabilistic transition models for these problems, and articulate an objective function for estimating these models from data.
3.1 Relational domain
A problem domain is given by tuple where is a countably infinite universe of possible objects, is a finite set of properties , and is a finite set of deictic reference functions where denotes the powerset of . Each function maps from an ordered list of objects to a set of objects, and we define it as
where the relation is defined in terms of the object properties in . For example, if we have a location property and , we can define so that the function associated with maps from one object to the set of objects that are within distance of its center; here is an indicator function. Finally, is a set of action templates , where is the space of executable control programs. Each action template is a function parameterized by continuous parameters and a tuple of objects that the action operates on. In this work, we assume that and are given. (Footnote 1: There is a direct extension of this formulation in which we encode relations among the objects as well. Doing so complicates notation and adds no new conceptual ideas, and in our example domain it suffices to compute spatial relations from object properties so there is no need to store relational information explicitly, so we omit it from our treatment.)
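The distance-based deictic reference described above can be sketched in a few lines. This is only an illustration under stated assumptions: the state layout (a dictionary from object identifier to a 3D location), the function name `within_distance`, and the choice to exclude the query object from its own neighborhood are all hypothetical and not taken from the paper.

```python
import math
from typing import Dict, Set, Tuple

# Hypothetical state layout: object id -> (x, y, z) location property.
State = Dict[str, Tuple[float, float, float]]


def within_distance(state: State, obj: str, radius: float) -> Set[str]:
    """Deictic reference: the set of other objects whose centers lie within `radius` of `obj`.

    Excluding `obj` itself is an assumption made for this sketch.
    """
    center = state[obj]
    return {
        other
        for other, location in state.items()
        if other != obj and math.dist(center, location) <= radius
    }


# Toy instance with made-up objects and locations.
state = {"cup": (0.0, 0.0, 0.0), "plate": (0.1, 0.0, 0.0), "lamp": (2.0, 0.0, 0.0)}
nearby = within_distance(state, "cup", radius=0.5)
```

Because the function returns a set whose size depends on the state, the same reference can select zero, one, or many objects in different problem instances.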
A problem instance is characterized by , where is a domain defined above and is a finite universe of objects with . For simplicity, we assume that, for a particular instance, the universe of objects remains constant over time. In the problem instance , we characterize a state in terms of the concrete values of all properties in on all objects in ; that is, . A problem instance induces the definition of its action space , constructed by applying every action template to all tuples of elements in and all assignments to the continuous parameters; namely, .
3.2 Sparse relational transition models
In many domains, there is substantial uncertainty, and the key to robust behavior is the ability to model this uncertainty and make plans that respect it. A sparse relational transition model (spare) for a domain , when applied to a problem instance for that domain, defines a probability density function on the state that results from taking action in state . Our objective is to specify this function in terms of domain elements , , and in such a way that it will apply to any problem instance, independent of the number and properties of the objects in its universe. We achieve this by defining the transition model in terms of a set of transition rules, and a score function . The score function takes in as input a state and a rule , and outputs a nonnegative integer. If the output is 0, the rule does not apply; otherwise, the rule can predict the distribution of the next state to be . The final prediction of spare is given by Eq. (1), where and the matrix is the default predicted covariance for any state that is not predicted to change, so that our problem is well-formed in the presence of noise in the input. Here is an identity matrix of size , and represents a square diagonal matrix with on the main diagonal, denoting the default variance for property if no rule applies. In the rest of this section, we formalize the definition of transition rules and the score function.
A transition rule is characterized by an action template , two ordered lists of deictic references and of size and , a predictor and the default variances for each property under this rule. The action template is defined as operating on a tuple of object variables, which we will refer to as . A reference list uses functions to designate a list of additional objects or sets of objects, by making deictic references based on previously designated objects. In particular, generates a list of objects whose properties affect the prediction made by the transition rule, while generates a list of objects whose properties are affected after taking an action specified by the action template .
We begin with the simple case in which every function returns a single object, then extend our definition to the case of sets. Concretely, for the th element in (), where is a deictic reference function in the domain, is the arity of that function, and integer specifies that object in the object list can be determined by applying function to objects . Thus, we get a new list of objects, . So, reference can only refer to the objects that are named in the action, and determines an object . Then, reference can refer to objects named in the action or those that were determined by reference , and so on.
When the function in returns a set of objects rather than a single object, this process of adding more objects remains almost the same, except that the may denote sets of objects, and the functions that are applied to them must be able to operate on sets. In the case that a function returns a set, it must also specify an aggregator, , that can return a single value for each property , aggregated over the set. Examples of aggregators include the mean or maximum values or possibly simply the cardinality of the set.
For example, consider the case of pushing the bottom (block ) of a stack of 4 blocks, depicted in Figure 1. Suppose the deictic reference is above, which takes one object and returns a set of objects immediately on top of the input object. Then, by applying above starting from the initial set , we get an ordered list of sets of objects where .
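The process of growing an object list by chaining deictic references can be sketched as follows. The sketch makes several simplifying assumptions that are not in the paper: every reference function is unary, the state is a hypothetical `support` dictionary mapping each block to whatever it rests on, and a reference is written as a pair (function, index of a previously designated entry). It also uses the rule, stated later in the text, that a rule does not apply when any reference returns an empty set.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

ObjSet = FrozenSet[str]
# Hypothetical relational state: block -> the thing it rests on.
Support = Dict[str, str]
# A reference is (function, index of a previously designated entry); unary functions only.
Reference = Tuple[Callable[[Support, ObjSet], ObjSet], int]


def above(support: Support, objs: ObjSet) -> ObjSet:
    """Return the set of blocks resting immediately on top of any block in `objs`."""
    return frozenset(block for block, base in support.items() if base in objs)


def resolve(support: Support, action_objs: List[ObjSet],
            refs: List[Reference]) -> Optional[List[ObjSet]]:
    """Extend the action's objects with one new entry per reference, in order.

    Returns None when some reference selects nothing (the rule does not apply).
    """
    designated = list(action_objs)
    for fn, index in refs:
        selected = fn(support, designated[index])
        if not selected:
            return None
        designated.append(selected)
    return designated


# A stack of four blocks, A at the bottom; pushing A.
support = {"A": "table", "B": "A", "C": "B", "D": "C"}
stack_refs = [(above, 0), (above, 1), (above, 2)]
object_list = resolve(support, [frozenset({"A"})], stack_refs)
too_many = resolve(support, [frozenset({"A"})], stack_refs + [(above, 3)])
```

Here each reference points at the entry produced by the previous one, so the list walks up the stack; a fourth application of `above` finds nothing on top of D and the resolution fails.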
Returning to the definition of a transition rule, we now can see informally that if the parameters of action template are instantiated to actual objects in a problem instance, then and can be used to determine lists of input and output objects (or sets of objects). We can use these lists, finally, to construct input and output vectors. The input vector consists of the continuous action parameters of action and property for all properties and objects that are selected by in arbitrary but fixed order. In the case that is a set of size greater than one, the aggregator associated with the function that computed the reference is used to compute . Similarly, for the desired output construction, we use the references in the list , initialize , and gradually add more objects to construct the output set of objects . The output vector is , where, if is a set of objects, we apply a mean aggregator on the properties of all the objects in .
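A minimal sketch of how a fixed-length input vector might be assembled from an object list with aggregators. The data layout (a dictionary from object identifier to a NumPy property vector), the function name `build_input_vector`, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np


def mean_agg(values: np.ndarray) -> np.ndarray:
    """Aggregate a (set_size, n_props) array to one value per property."""
    return values.mean(axis=0)


def max_agg(values: np.ndarray) -> np.ndarray:
    return values.max(axis=0)


def build_input_vector(action_params, object_list, properties, aggregators):
    """Concatenate action parameters with one aggregated property vector per list entry.

    object_list: list of sets of object ids (singletons for single objects).
    properties:  dict mapping object id -> 1D array of property values (hypothetical layout).
    aggregators: one aggregator per entry of object_list.
    """
    parts = [np.asarray(action_params, dtype=float)]
    for objs, aggregate in zip(object_list, aggregators):
        stacked = np.stack([properties[o] for o in sorted(objs)])
        parts.append(aggregate(stacked))
    return np.concatenate(parts)


properties = {
    "A": np.array([1.0, 2.0]),
    "B": np.array([0.0, 4.0]),
    "C": np.array([2.0, 0.0]),
}
x_small = build_input_vector([0.5], [{"A"}, {"B"}], properties, [mean_agg, mean_agg])
x_large = build_input_vector([0.5], [{"A"}, {"B", "C"}], properties, [mean_agg, mean_agg])
```

The point of the aggregator is visible in the two calls: the second entry contains one object in the first call and two in the second, yet both vectors have the same length.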
The predictor is some functional form (such as a feedforward neural network) with parameters (weights) that will take values as input and predict a distribution for the output vector . It is difficult to represent arbitrarily complex distributions over output values. In this work, we restrict ourselves to representing Gaussian distributions on all property values in , encoded with a mean and independent variance for each dimension.
Now, we describe how a transition rule can be used to map a state and action into a distribution over the new state. A transition rule applies to a particular state-action pair if is an instance of and if none of the elements of the input or output object lists is empty. To construct the input (and output) list, we begin by assigning the actual objects to the object variables in action instance , and then successively computing references based on the previously selected objects, applying the definition of the deictic reference in each to the actual values of the properties as specified in the state . If, at any point, a or returns an empty set, then the transition rule does not apply. If the rule does apply, and successfully selects input and output object lists, then the values of the input vector can be extracted from , and predictions are made on the mean and variance values .
Let be the vector entry corresponding to the predicted Gaussian parameters of property of th output object set and denote as the property of object in state , for all . The predicted distribution of the resulting state is computed as follows:
where is the default variance of property in rule . There are two important points to note. First, it is possible for the same object to appear in the object list more than once, and therefore for more than one predicted distribution to appear for its properties in the output vector. In this case, we use the mixture of all the predicted distributions with uniform weights. Second, when an element of the output object list is a set, then we treat this as predicting the same single property distribution for all elements of that set. This strategy has sufficed for our current set of examples, but an alternative choice would be to make the predicted values be changes to the current property value, rather than new absolute values. Then, for example, moving all of the objects on top of a tray could easily specify a change to each of their poses. We illustrate how we can use transition rules to build a spare in Fig. 2.
For each transition rule and state , we assign the score function value to be 0 if does not apply to state . Otherwise, we assign the total number of deictic references plus one, , as the score. The more references there are in a rule that is applicable to the state, the more detailed the match is between the rule’s conditions and the state, and the more specific the predictions we expect it to be able to make.
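The score function is simple enough to state directly in code. In the sketch below, the `Rule` container and the `most_specific` helper are hypothetical. Counting the references of both lists is one reading of "total number of deictic references", and selecting the top-scoring applicable rules is an assumption about how the scores are used; how Eq. (1) combines tied rules is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical container: only the reference counts matter for the score."""
    name: str
    n_input_refs: int
    n_output_refs: int


def score(rule: Rule, applies: bool) -> int:
    """0 if the rule does not apply; otherwise the total number of references plus one."""
    if not applies:
        return 0
    return rule.n_input_refs + rule.n_output_refs + 1


def most_specific(rules: List[Rule], applies: List[bool]) -> List[Rule]:
    """Return the applicable rules with the highest score (empty list: fall back to the default)."""
    scores = [score(rule, flag) for rule, flag in zip(rules, applies)]
    best = max(scores, default=0)
    if best == 0:
        return []
    return [rule for rule, s in zip(rules, scores) if s == best]


single = Rule("push-single", n_input_refs=0, n_output_refs=0)
stack = Rule("push-stack", n_input_refs=2, n_output_refs=2)
```

Adding one to the reference count keeps an applicable rule with empty reference lists distinguishable from a rule that does not apply at all.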
3.3 Learning spares from data
We frame the problem of learning a transition model from data in terms of conditional likelihood. The learning problem is, given a problem domain description and a set of experience tuples, , find a spare that minimizes the loss function of Eq. (2).
Note that we require all of the tuples in to belong to the same domain , and require for any that and belong to the same problem instance, but individual tuples may be drawn from different problem instances (with, for example, different numbers and types of objects). In fact, to get good generalization performance, it will be important to vary these aspects across training instances.
4 Learning algorithm
We describe our learning algorithm in three parts. First, we introduce our strategy for learning , which predicts a Gaussian distribution on , given . Then, we describe our algorithm for learning reference lists and for a single transition rule, which enable the extraction of and from . Finally, we present an EM method for learning multiple rules.
4.1 Distributional prediction
For a particular transition rule with associated action template , once and have been specified, we can extract input and output features and from a given set of experience samples . From and , we would like to learn the transition rule’s predictor to minimize Eq. (2). Our predictor takes the form . We use the method of Chua et al. (2017), where two neural-network models are learned as the function approximators, one for the mean prediction parameterized by and the other for the diagonal variance prediction , parameterized by . We optimize the negative data-likelihood loss function by alternately optimizing and . That is, we alternate between first optimizing with fixed covariance, and then optimizing with fixed mean.
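The alternating scheme can be illustrated on a deliberately small stand-in. The sketch below is not the method of Chua et al. (2017): it swaps the two neural networks for a linear mean model and a constant per-dimension variance so that each half-step has a closed form. The function names and the synthetic data are assumptions.

```python
import numpy as np


def gaussian_nll(y, mean, var):
    """Average negative log-likelihood of y under independent Gaussians N(mean, var)."""
    return float(np.mean(0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)))


def fit_alternating(x, y, n_rounds=3):
    """Alternate between fitting the mean (variance fixed) and the variance (mean fixed)."""
    design = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    var = np.ones(y.shape[1])                      # initial per-dimension variance
    for _ in range(n_rounds):
        # Mean step: with a constant variance per output dimension, minimizing the
        # Gaussian NLL over a linear mean reduces to ordinary least squares.
        weights, *_ = np.linalg.lstsq(design, y, rcond=None)
        mean = design @ weights
        # Variance step: with the mean fixed, the NLL-minimizing constant variance
        # is the mean squared residual (floored for numerical safety).
        var = np.maximum(np.mean((y - mean) ** 2, axis=0), 1e-8)
    return weights, var


# Synthetic data: y = 2x + 1 plus Gaussian noise with standard deviation 0.1.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(2000, 1))
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=(2000, 1))
weights, var = fit_alternating(x, y)
fitted_mean = np.hstack([x, np.ones((len(x), 1))]) @ weights
fitted_nll = gaussian_nll(y, fitted_mean, var)
unit_var_nll = gaussian_nll(y, fitted_mean, np.ones(1))
```

In this toy case the mean step does not depend on the variance, so the loop converges after one round; with an input-dependent variance network, as in the text, the two steps genuinely interact and several rounds are needed.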
Let be the set of experience tuples to which rule applies. Then once we have , we can optimize the default variance of the rule. It can be shown that these loss-minimizing values for the default predicted variances are the empirical averages of the squared deviations for all unpredicted objects (i.e., those for which does not explicitly make predictions), where averages are computed separately for each object property.
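The closed form for the default variance can be sketched as below. The array layout and the function name are assumptions, and the deviation is taken between the next and the current property values, on the reading that the default prediction for an unpredicted object is "no change".

```python
import numpy as np


def default_variance(state, next_state, predicted_mask):
    """Per-property average squared deviation over objects the rule does not predict.

    state, next_state: arrays of shape (n_objects, n_props).
    predicted_mask:    boolean array of shape (n_objects,), True where the rule predicts.
    """
    unpredicted = ~np.asarray(predicted_mask, dtype=bool)
    if not unpredicted.any():
        raise ValueError("no unpredicted objects to average over")
    squared_dev = (next_state[unpredicted] - state[unpredicted]) ** 2
    return squared_dev.mean(axis=0)


# Three objects with two properties each; the rule explicitly predicts only object 1.
state = np.zeros((3, 2))
next_state = np.array([[0.1, 0.0], [5.0, 5.0], [-0.1, 0.2]])
mask = np.array([False, True, False])
v_default = default_variance(state, next_state, mask)
```

For a whole experience set, the same average would be taken over the unpredicted objects pooled across all tuples to which the rule applies.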
We use LearnDist to refer to this learning and optimization procedure for the predictor parameters and default variance.
4.2 Single rule
In the simple setting where only one transition rule exists in our domain , we show how to construct the input and output reference lists and that will determine the vectors and . Suppose for now that and are fixed, and we wish to learn . Our approach is to incrementally build up by adding tuples one at a time via a greedy selection procedure. Specifically, let be the universe of possible , split the experience samples into a training set and a validation set , and initialize the list to be . For each , compute , where in Eq. (2) evaluates a spare with a single transition rule , where and are computed using the LearnDist described in Section 4.1. (Footnote 2: When the rule does not apply to a training sample, we use for its loss the loss that results from having empty reference lists in the rule. Alternatively, we can compute the default variance to be the empirical variances on all training samples that cannot use rule .) If the value of the loss function is less than the value of , then we let and continue. Else, we terminate the greedy selection process with , since further growing the list of deictic references hurts the loss. We also terminate the process when exceeds some predetermined maximum allowed number of input deictic references, . Pseudocode for this algorithm is provided in Algorithm 1.
In our experiments we set and construct the lists of deictic references using a single pass of the greedy algorithm described above. This simplification is reasonable, as the set of objects that are relevant to predicting the transition outcome often overlaps substantially with the objects that are affected by the action. Alternatively, we could learn via an analogous greedy procedure nested around or, as a more efficient approach, interleaved with, the one for learning .
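The greedy selection loop can be sketched generically. In this sketch the callback `loss_fn` stands in for "train the predictor with LearnDist on the training set and evaluate the loss on the validation set", which is an assumption about how the comparison is made. The candidate set is held fixed for simplicity, whereas in the text the available references grow as new objects are designated. The toy loss and reference names are invented.

```python
from typing import Callable, List, Sequence


def greedy_select(candidates: Sequence[str],
                  loss_fn: Callable[[List[str]], float],
                  max_refs: int) -> List[str]:
    """Grow a reference list one element at a time while the loss strictly improves."""
    selected: List[str] = []
    best_loss = loss_fn(selected)
    while len(selected) < max_refs and candidates:
        # Evaluate every one-step extension of the current list.
        trials = [(loss_fn(selected + [ref]), ref) for ref in candidates]
        new_loss, new_ref = min(trials, key=lambda pair: pair[0])
        if new_loss >= best_loss:
            break  # no extension helps; stop with the current list
        selected.append(new_ref)
        best_loss = new_loss
    return selected


# Toy loss: two references are useful, one is a distractor that hurts slightly.
GAIN = {"above(o1)": 1.0, "above(o2)": 0.5, "nearest(o1)": -0.1}


def toy_loss(refs: List[str]) -> float:
    return 2.0 - sum(GAIN[ref] for ref in set(refs))


chosen = greedy_select(list(GAIN), toy_loss, max_refs=5)
capped = greedy_select(list(GAIN), toy_loss, max_refs=1)
```

Each iteration costs one training run per candidate, so the maximum list length bounds the total number of predictors that have to be trained.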
4.3 Multiple rules
Our training data in robotic manipulation tasks are likely to be best described by many rules instead of a single one, since different combinations of relations among objects could be present in different states. For example, we may have one rule for pushing a single object and another rule for pushing a stack of objects. We now address the case where we wish to learn rules from a single experience set , for . We do so via initial clustering to separate experience samples into clusters, one for each rule to be learned, followed by an EM-like approach to further separate samples and simultaneously learn rule parameters.
To facilitate the learning of our model, we will additionally learn membership probabilities , where represents the probability that the th experience sample is assigned to transition rule , and for all . We initialize membership probabilities via clustering, then refine them through EM.
Because the experience samples may come from different problem instances and involve different numbers of objects, we cannot directly run a clustering algorithm such as k-means on the samples themselves. Instead we first learn a single transition rule from using the algorithm in Section 4.2, use the resulting and to transform into and , and then run k-means clustering on the concatenation of , , and values of the loss function when is used to predict each of the samples. For each experience sample, the squared distance from the sample to each of the cluster centers is computed, and membership probabilities for the sample to each of the transition rules to be learned are initialized to be proportional to the (multiplicative) inverses of these squared distances.
Before introducing the EM-like algorithm that simultaneously improves the assignment of experience samples to transition rules and learns details of the rules themselves, we make a minor modification to transition rules to obtain mixture rules. Whereas a probabilistic transition rule has been defined as , a mixture rule is , where represents a distribution over all possible lists of input references (and similarly for and ), of which there are a finite number, since the set of available reference functions is finite, and there is an upper bound on the maximum number of references may contain. For simplicity of terminology, we refer to each possible list of references as a shell, so is a distribution over possible shells. Finally, is a collection of transition rules (i.e., predictors , each with an associated , , and ). To make predictions for a sample using a mixture rule, predictions from each of the mixture rule’s transition rules are combined according to the probabilities that and assign to each transition rule’s and . Rather than having our EM approach learn transition rules, we instead learn mixture rules, as the distributions and allow for smoother sorting of experience samples into clusters corresponding to the different rules, in contrast to the discrete and of regular transition rules.
As before, we focus on the case where for each mixture rule, , , and as well. Our EM-like algorithm is then as follows:
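The initialization of membership probabilities from cluster distances can be sketched as follows. The function name is hypothetical, the cluster centers are assumed to come from whatever clustering routine was run on the transformed features, and the small `eps` term is an addition of this sketch to guard against a sample coinciding with a center.

```python
import numpy as np


def init_memberships(features, centers, eps=1e-12):
    """Membership probabilities proportional to inverse squared distance to each center.

    features: (n_samples, dim) array; centers: (n_rules, dim) array.
    Returns an (n_samples, n_rules) array whose rows sum to one.
    """
    diffs = features[:, None, :] - centers[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    inverse = 1.0 / (sq_dist + eps)  # eps avoids division by zero (sketch-only safeguard)
    return inverse / inverse.sum(axis=1, keepdims=True)


features = np.array([[0.0, 0.0], [3.0, 0.0]])
centers = np.array([[1.0, 0.0], [-1.0, 0.0]])
memberships = init_memberships(features, centers)
```

The first sample is equidistant from both centers and is split evenly; the second has squared distances 4 and 16, giving probabilities 0.8 and 0.2.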

1. For each , initialize distributions for mixture rule as follows. First, use the algorithm in Section 4.2 to learn a transition rule on the weighted experience samples with weights equal to the membership probabilities . In the process of greedily assembling reference lists , data likelihood loss function values are computed for multiple explored shells, in addition to the shell that was ultimately selected. Initialize to distribute weight proportionally, according to data likelihood, for these explored shells: , where is the spare model with a single transition rule , and , with the summation taken over all explored shells , is a normalization factor so that the total weight assigned by to explored shells is . The remaining probability weight is distributed uniformly across unexplored shells.
2. For each , let , where we have dropped subscripting according to for notational simplicity:
   (a) For , train predictor using the procedure in Section 4.2 on the weighted experience samples , where we choose to be the list of references with th highest weight according to .
   (b) Update by redistributing weight among the top shells according to a voting procedure where each training sample “votes” for the shell whose predictor minimizes the validation loss for that sample. In other words, the th experience sample votes for mixture rule for . Then, shell weights are assigned to be proportional to the sum of the sample weights (i.e., membership probability of belonging to this rule) of samples that voted for each particular shell: the number of votes received by the th shell is , for indicator function and . Then, , the current th highest value of , is updated to become , where is a normalization factor to ensure that remains a valid probability distribution.
   (c) Repeat Step 2a, in case the shells with highest values have changed, in preparation for using the mixture rule to make predictions in the next step.
3. Update membership probabilities by scaling by data likelihoods from using each of the rules to make predictions: , where is the data likelihood from using mixture rule to make predictions for the th experience sample , and is a normalization factor to maintain .
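Two of the steps above, the weighted shell vote and the membership update, are short enough to sketch with arrays. The function names and array layouts are assumptions. In particular, the vote is normalized here so that the top shells keep the total probability mass they held before the update, which is one way to keep the shell distribution valid and is not confirmed by the text.

```python
import numpy as np


def vote_shell_weights(per_sample_loss, memberships, old_top_weights):
    """Redistribute weight among the top shells by membership-weighted voting.

    per_sample_loss: (n_samples, n_top_shells) loss of each shell's predictor per sample.
    memberships:     (n_samples,) probability that each sample belongs to this rule.
    old_top_weights: (n_top_shells,) current weights of the top shells.
    """
    winners = per_sample_loss.argmin(axis=1)  # each sample votes for its best shell
    votes = np.bincount(winners, weights=memberships, minlength=per_sample_loss.shape[1])
    # Assumption: the top shells keep their previous total mass.
    return old_top_weights.sum() * votes / votes.sum()


def update_memberships(memberships, likelihoods):
    """Scale memberships by per-rule data likelihoods and renormalize each sample's row.

    memberships, likelihoods: (n_samples, n_rules).
    """
    scaled = memberships * likelihoods
    return scaled / scaled.sum(axis=1, keepdims=True)


losses = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
rule_memberships = np.array([0.5, 1.0, 0.5])
new_weights = vote_shell_weights(losses, rule_memberships, np.array([0.6, 0.3]))
new_z = update_memberships(np.array([[0.5, 0.5]]), np.array([[0.9, 0.1]]))
```

Both functions divide by a sum that is assumed to be positive; a full implementation would need to handle the case in which every membership weight or likelihood is zero.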
5 Experiments
We apply our approach, spare, to the challenging problem of predicting the outcome of pushing stacks of blocks on a cluttered tabletop. In this section, we describe our domain and the baseline that we compare to, and report our results.
5.1 Object manipulation domain
In our domain , the object universe is composed of blocks of different sizes and weights, and the property set includes the shapes of the blocks (width, length, height) and the position of the block ( location relative to the table). We have one action template, push, which pushes toward a target object with parameters , where is the 3D position of the gripper before the push starts and is the distance of the push. The orientation of the gripper and the direction of the push are computed from the gripper location and the target object location. We simulate this 3D domain using the physically realistic PyBullet (Coumans & Bai, 2016–2018) simulator. In real-world scenarios, an action cannot be executed with the exact action parameters due to the inaccuracy in the motor, and hence in our simulation we add Gaussian noise on the action parameters during execution to imitate this effect.
We consider the following deictic references in the reference collection : (1) above (O), which takes in an object O and returns the object immediately above O; (2) above* (O), which takes in an object O and returns all of the objects that are above O; (3) below (O), which takes in an object O and returns the object immediately below O; (4) nearest (O), which takes in an object O and returns the object that is closest to O.
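The four references can be sketched over a simplified scene description. The sketch assumes a hypothetical `on` dictionary that records which block each block rests on directly, with at most one block resting on any other, plus a dictionary of 3D positions for `nearest`. In the paper these relations are computed from object properties, so this layout is an illustration only.

```python
import math
from typing import Dict, Optional, Set, Tuple

On = Dict[str, Optional[str]]                # block -> block directly beneath it (None: table)
Positions = Dict[str, Tuple[float, float, float]]


def above(on: On, obj: str) -> Optional[str]:
    """The block resting immediately on `obj`, if any."""
    return next((block for block, base in on.items() if base == obj), None)


def below(on: On, obj: str) -> Optional[str]:
    """The block immediately beneath `obj`, if any."""
    return on.get(obj)


def above_star(on: On, obj: str) -> Set[str]:
    """All blocks above `obj`, found by following `above` repeatedly."""
    result: Set[str] = set()
    current = above(on, obj)
    while current is not None:
        result.add(current)
        current = above(on, current)
    return result


def nearest(positions: Positions, obj: str) -> Optional[str]:
    """The other object whose center is closest to that of `obj`."""
    others = [o for o in positions if o != obj]
    if not others:
        return None
    return min(others, key=lambda o: math.dist(positions[o], positions[obj]))


# A stack A-B-C (A at the bottom) and a separate block D on the table.
on = {"A": None, "B": "A", "C": "B", "D": None}
positions = {"A": (0, 0, 0), "B": (0, 0, 1), "C": (0, 0, 2), "D": (5, 0, 0)}
```

Returning `None` (or an empty set) when nothing is selected matches the convention that a rule using such a reference does not apply in that state.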
5.2 Baseline methods
The baseline method we compare to is a neural network function approximator that takes in as input the current state and action parameter , and outputs the next state . The list of objects that appear in each state is ordered: the target objects appear first and the remaining objects are sorted by their poses (first sort by coordinate, then , then ).
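The ordering used to build the baseline's fixed-size input can be sketched as below. This is an illustration under assumptions: only poses are included (the baseline in the text takes the full state), the function name is hypothetical, and the sort compares the first, second, and third pose coordinates in that order, which may not match the coordinate order used by the authors.

```python
import numpy as np


def baseline_input(poses, target_ids, action_params):
    """Order objects (targets first, the rest sorted by pose) and flatten into one vector.

    poses: dict mapping object id -> (c1, c2, c3) pose tuple (hypothetical layout).
    """
    # Tuples compare lexicographically: first coordinate, then second, then third.
    rest = sorted((o for o in poses if o not in target_ids), key=lambda o: poses[o])
    order = list(target_ids) + rest
    pieces = [np.asarray(poses[o], dtype=float) for o in order]
    pieces.append(np.asarray(action_params, dtype=float))
    return order, np.concatenate(pieces)


poses = {"t": (1.0, 1.0, 0.0), "a": (0.0, 2.0, 0.0), "b": (0.0, 1.0, 0.0), "c": (-1.0, 5.0, 0.0)}
order, vector = baseline_input(poses, ["t"], [0.1, 0.2])
```

The length of the resulting vector grows with the number of objects, which is why such a baseline has to be trained separately for each scene size.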
5.3 Results
Figure 4: (a) In a simple 3-block pushing problem instance, data likelihood and learned default standard deviation both improve as more deictic references are added. (b) Comparing performance as a function of number of distractors with a fixed amount of training data. (c) Comparing sample efficiency of spare to the baseline. Shaded regions represent confidence interval.
Effects of deictic rules
As a sanity check, we start from a simple problem where a gripper is pushing a stack of three blocks on an empty table. We randomly sampled 1250 problem instances by drawing random block shapes and locations from a uniform distribution within a range while satisfying the condition that the stack of blocks is stable. In each problem instance, we uniformly randomly sample the action parameters and obtain the training data, a collection of tuples of state, action and next state, where the target object of the push action is always the one at the bottom of the stack. We held out of the training data as the validation set. We found that our approach is able to reliably select the correct combinations of the references that select all the blocks in the problem instance to construct inputs and outputs. In Fig. 4(a), we show how the performance varies as deictic references are added during a typical run of this experiment. The solid blue curves show training performance, as measured by data likelihood on the validation set, while the dashed blue curve shows performance on a held-out test set with 250 unseen problem instances. As expected, performance improves noticeably from the addition of the first two deictic references selected by the greedy selection procedure, but not from the fourth as we only have 3 blocks and the fourth selected object is extraneous. The red curve shows the learned default standard deviations, used to compute data likelihoods for features of objects not explicitly predicted by the rule. As expected, the learned default standard deviation drops as deictic references are added, until it levels off after the third reference is added since at that point the set of references captures all moving objects in the scene.
Sensitivity analysis on the number of objects
We compare our approach to the baseline in terms of how sensitive the performance is to the number of objects that exist in the problem instance. We continue the simple example where a stack of three blocks lies on a table, with extra blocks that may affect the prediction of the next state. Figure 4(b) shows the performance, as measured by the log data likelihood, as a function of the number of extra blocks. For each number of extra blocks, we used 1250 training problem instances (15% as the validation set) and 250 testing problem instances. When there are no extra blocks, spare learns a single rule whose and contain the same information as the inputs and outputs for the baseline method. As more objects are added to the table, baseline performance drops, as the presence of these additional objects appears to complicate the scene and the baseline is forced to consider more objects when making its predictions. However, performance of the spare approach remains unchanged, as deictic references are used to select just the three blocks in the stack as input in all cases, regardless of the number of extra blocks on the table.
Note that in the interest of fairness of comparison, the plotted data likelihoods are averages over only the three blocks in the stack. This is because the spare approach assumes that objects for which no explicit predictions are made do not move, which happens always to be true in these examples, thus providing spare an unfair advantage if the data likelihood is averaged over all blocks in the scene. We show the complete results in the appendix. Note also that, performance aside, the baseline is limited to problems for which the number of objects in the scenes is fixed, as it requires a fixed-size input vector containing information about all objects. Our spare approach does not have this limitation, and could have been trained on a single, large dataset that is the combination of the datasets with varying numbers of extra objects, benefiting from a larger training set. However, we did not do this in our experiments for the sake of providing a fairer comparison against the baseline.
Sample efficiency
We evaluate our approach on more challenging problem instances in which the robot gripper pushes blocks on a cluttered table top, and there are additional blocks on the table that neither interfere with nor are affected by the pushing action. Fig. 4(c) plots the data likelihood as a function of the number of training samples. We evaluate with training samples varying from to and in each setting, the test dataset has samples. Both our approach and the baseline benefit from having more training samples, but our approach is much more sample efficient and achieves good performance within only a thousand training samples, while even training the baseline with training samples does not match what our approach gets with only training samples.
Learning multiple transition rules
Now we put our approach in a more general setting where multiple transition rules need to be learned to predict the next state. Our approach adopts an EM-like procedure that assigns each training sample a distribution over the transition rules and learns each transition rule from reweighted training samples. First, we construct a training dataset and of it is on pushing block stack. Our EM approach is able to concentrate on the 4-block case as shown in Fig. 5(a).
Fig. 5(b) tracks the assignment of samples to rules over the same five runs of our EM procedure. The three curves correspond to the three stack heights in the original dataset, and each shows the average weight assigned to the “target” rule among samples of that stack height, where the target rule is the one that starts with a high concentration of samples of that particular height. At iteration 0, we see that the rules were initialized such that samples were assigned 70% probability of belonging to specific rules, based on stack height. As the algorithm progresses, the samples separate further, suggesting that the algorithm is able to separate samples into the correct groups.
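The assignment step of such an EM-like procedure can be sketched as follows. This is a generic mixture-model E-step written under stated assumptions, not necessarily the exact update used in the paper: the function names and the mixing weights `log_priors` are hypothetical, and the per-rule log likelihoods are assumed to come from each rule's predictor.

```python
import math

def e_step(log_likelihoods, log_priors):
    """Return per-sample membership probabilities over rules.

    log_likelihoods[i][k]: log likelihood of sample i under rule k.
    log_priors[k]: log mixing weight of rule k (an assumption of this sketch).
    """
    memberships = []
    for row in log_likelihoods:
        scores = [ll + lp for ll, lp in zip(row, log_priors)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        memberships.append([w / norm for w in weights])
    return memberships

def rule_sample_weights(memberships, rule_index):
    """Per-sample weights for retraining one rule's predictor."""
    return [row[rule_index] for row in memberships]

# Two samples, two rules: each sample is explained much better by one rule.
log_likelihoods = [[-1.0, -5.0], [-4.0, -0.5]]
log_priors = [math.log(0.5), math.log(0.5)]

memberships = e_step(log_likelihoods, log_priors)
weights_for_rule_0 = rule_sample_weights(memberships, 0)
```

Each rule would then be retrained on all samples, with each sample's loss scaled by its weight for that rule, and the two steps would be repeated until the assignments stabilize.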
Conclusion
These results demonstrate the power of combining relational abstraction with neural networks, to learn probabilistic state transition models for an important class of domains from very little training data. In addition, the structural nature of the learned models will allow them to be used in factored search-based planning methods that can take advantage of sparsity of effects to plan efficiently.
References
 Agre & Chapman (1987) Philip E Agre and David Chapman. Pengi: An implementation of a theory of activity. In AAAI, volume 87, pp. 268–272, 1987.

 Battaglia et al. (2018) P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu. Relational inductive biases, deep learning, and graph networks. ArXiv e-prints, June 2018.
 Benson (1997) Scott S Benson. Learning action models for reactive autonomous agents. Technical report, Stanford University, Stanford, CA, USA, 1997.
 Chua et al. (2017) Kurtland Chua, Roberto Calandra, and Sergey Levine. On the importance of uncertainty for control with deep dynamics models. In NIPS Workshop on Acting and Interacting in the Real World, 2017.

 Coumans & Bai (2016–2018) Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2018.
 Drescher (1991) Gary L Drescher. Made-up minds: a constructivist approach to artificial intelligence. MIT press, 1991.
 Kansky et al. (2017) Ken Kansky, Tom Silver, David A. Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, D. Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In Proceedings of the 34th International Conference on Machine Learning, pp. 1809–1818, 2017.
 Lang & Toussaint (2010) Tobias Lang and Marc Toussaint. Planning with noisy probabilistic relational rules. Journal of Artificial Intelligence Research, 39:1–49, 2010.
 Mourão (2014) Kira Mourão. Learning probabilistic planning operators from noisy observations. In Proceedings of the Workshop of the UK Planning and Scheduling Special Interest Group, 2014.
 Mourão et al. (2012) Kira Mourão, Luke S. Zettlemoyer, Ronald P. A. Petrick, and Mark Steedman. Learning STRIPS operators from noisy and incomplete observations. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 614–623, 2012.
 Pasula et al. (2007) Hanna M Pasula, Luke S Zettlemoyer, and Leslie Pack Kaelbling. Learning symbolic models of stochastic domains. Journal of Artificial Intelligence Research, 29:309–352, 2007.
Appendix A Discussions and future work
In this work, we made an attempt to use deictic references to construct transition rules that can generalize over different problem instances with varying numbers of objects. Our approach can be viewed as constructing “convolutional” layers for a neural network that operates on objects. Just as a convolutional layer can operate on images of different sizes, we use deictic rules to find relations among objects and construct features for learning without a constraint on the scale of our problem or how many objects there are. Our approach is to find structures in existing data and learn the construction of deictic rules, each of which uses a specific strategy to select the objects that are relevant to an action.
Many possible improvements and extensions can be made to this work, including experimentation with different template structures, more powerful deictic references, more refined feature selection, and integration with planning.
This work focused on using deictic references to select input and output objects for a predictor. A natural extension is to select not just objects, but also features of objects to obtain more refined templates and even smaller predictors, especially in cases where objects have many different properties, only some of which are relevant to the prediction problem at hand. There is also room for experimentation with different types of aggregators for combining features of objects when deictic references select multiple objects in a set.
Finally, the end purpose of obtaining a transition model for robotic actions is to enable planning to achieve high-level goals. Due to time constraints, this work did not assess the effectiveness of learned template-based models in planning. This is a promising area of future work, however: the assumption that features of objects for which a template makes no explicit prediction do not change meshes well with STRIPS-style planners, which make similar assumptions.
Appendix B More empirical results
Here we provide details on our experiments, along with additional results.
Experimental details
All experiments in this paper used predictors consisting of two neural networks, one for making mean predictions and one for making variance predictions, as described in Section 4.1. Each network of the predictor was implemented in Keras as a fully connected network with two hidden layers of 150 nodes each, with ReLU activations between layers, and was trained using the Adam optimizer with default parameters. Predictors for the templates approach were trained for 100 epochs each, alternating between the mean and variance predictors for 25 epochs at a time, for a total of four times, followed by training the mean predictor for another 25 epochs. The baseline predictor was implemented in exactly the same way, but trained for a total of 300 epochs, rather than 100.
States were parameterized by the pose of each object in the scene, ordered such that the target object of the action always appeared first, and other objects appeared in random order (except for the baseline). Action parameters included the starting pose of the robotic gripper, as well as a “push distance” parameter that controls how long the push action lasts. Actions were implemented to be stochastic by adding some amount of noise to the target location of each push, potentially reflecting inaccuracies in robotic control.
Sensitivity analysis on the number of objects
In the main paper, we analyzed the sensitivity of the performance to the number of objects in the problem instance and showed the log data likelihood only on objects that moved (the stack). This was done for a fair comparison to the baseline. In Figure 6, we show the log data likelihood on all objects in the problem instances. As the number of extra objects increases, the baseline approach performs much worse than spare.
Initialization of Membership Probabilities
We use the clustering-based approaches for initializing membership probabilities presented in Section 4.3. In this section, we evaluate how well our clustering approach performs.
Table 1 shows the sample separation achieved by the discrete clustering approach, where samples are assigned solely to their associated clusters found by k-means, on the push dataset for stacks of varying height. Each column corresponds to the one-third of samples that involve stacks of a particular height. Entries in the table show the proportion of samples of that stack height that have been assigned to each of the three clusters. In generating these data, the clusters were ordered so that the predominantly 2-block sample cluster came first, followed by the predominantly 3-block cluster, then the 4-block cluster. Values in parentheses are standard deviations across three runs of the experiment. As seen in the table, separation of samples into clusters is quite good, though 22.7% of 3-block samples were assigned to the predominantly 4-block cluster, and 11.5% of 2-block samples were assigned to the predominantly 3-block cluster.
Table 1

| Cluster index | 2 objects in sample | 3 objects in sample | 4 objects in sample |
|---|---|---|---|
| 1 | 0.829 (0.042) | 0.022 (0.030) | 0.089 (0.126) |
| 2 | 0.115 (0.081) | 0.751 (0.028) | 0.038 (0.025) |
| 3 | 0.056 (0.039) | 0.227 (0.012) | 0.872 (0.102) |
The sample separation evidenced in Table 1 is sufficient for templates trained on the three clusters of samples to reliably select deictic references that consider the correct number of blocks, i.e., the 2-block cluster reliably learns a template which considers the target object and the object above the target, and similarly for the other clusters and their respective stack heights. However, initializing samples to belong solely to a single cluster, rather than initializing membership probabilities, is unlikely to be robust in general, so we turn to the proposed clustering-based approaches for initializing membership probabilities instead.
Table 2 is analogous to Table 1 in structure, but shows sample separation results for sample membership probabilities initialized to be proportional to the inverse distance from the sample to each of the cluster centers found by k-means. Table 3 is the same, except with membership probabilities initialized to be proportional to the square of the inverse distance to cluster centers.
Table 2

| Cluster index | 2 objects in sample | 3 objects in sample | 4 objects in sample |
|---|---|---|---|
| 1 | 0.595 (0.010) | 0.144 (0.004) | 0.111 (0.003) |
| 2 | 0.259 (0.005) | 0.551 (0.007) | 0.286 (0.005) |
| 3 | 0.146 (0.006) | 0.305 (0.003) | 0.602 (0.008) |
Table 3

| Cluster index | 2 objects in sample | 3 objects in sample | 4 objects in sample |
|---|---|---|---|
| 1 | 0.730 (0.009) | 0.065 (0.025) | 0.118 (0.125) |
| 2 | 0.149 (0.079) | 0.665 (0.029) | 0.171 (0.056) |
| 3 | 0.121 (0.076) | 0.270 (0.005) | 0.716 (0.069) |
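The two soft initialization schemes compared in Tables 2 and 3 can be sketched as a single function with an exponent. This is a minimal illustration, assuming Euclidean distance in the clustering feature space; the function name, the `eps` guard against division by zero, and the toy coordinates are hypothetical.

```python
import math

def inverse_distance_memberships(sample, centers, power=1, eps=1e-12):
    """Membership probabilities proportional to (1 / distance) ** power."""
    weights = []
    for center in centers:
        distance = math.dist(sample, center)
        # eps avoids division by zero when a sample sits on a center.
        weights.append(1.0 / (distance + eps) ** power)
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: one sample at distances 1, 2, and 4 from three cluster centers.
sample = [0.0, 0.0]
centers = [[1.0, 0.0], [2.0, 0.0], [4.0, 0.0]]

linear = inverse_distance_memberships(sample, centers, power=1)
squared = inverse_distance_memberships(sample, centers, power=2)
```

With `power=2` the nearest cluster receives a larger share of the probability mass than with `power=1`, which matches the sharper separation the squared variant produces.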
Sample separation is better with squared distances than with non-squared distances, but it is unclear whether this result generalizes to other datasets. For our specific problem instance, the log data likelihood feature turns out to be very important for the success of these clustering-based initialization approaches. For example, Table 4 shows results analogous to those in Table 3, where the only difference is that all log data likelihoods were multiplied by five before being passed as input to the k-means clustering algorithm. Comparing the two tables, scaling the data likelihood so that it becomes relatively more important as a feature results in better sample separation. This suggests that the relative importance of the log likelihood versus the other input features is a parameter of these clustering approaches that should be tuned.
Table 4

| Cluster index | 2 objects in sample | 3 objects in sample | 4 objects in sample |
|---|---|---|---|
| 1 | 0.779 (0.017) | 0.020 (0.001) | 0.016 (0.001) |
| 2 | 0.174 (0.014) | 0.744 (0.006) | 0.118 (0.014) |
| 3 | 0.047 (0.005) | 0.236 (0.005) | 0.866 (0.015) |
Effect of object ordering on baseline performance
The single-predictor baseline used in our experiments receives all objects in the scene as input, but this leaves open the question of the order in which these objects should be presented. Because the templates approach has the target object of the action specified, in the interest of fairness this information is also provided to the baseline by having the target object always appear first in the ordering. As there is in general no clear ordering for the remainder of the objects, we could present them in a random order, but perhaps sorting the objects according to position (first by x coordinate, then y, then z) could result in better predictions than if objects are completely randomly ordered. (Footnote 3: As this suspicion turns out to be true, the baseline experiments order blocks in sorted order by position.)
To analyze the effect of object ordering on baseline performance, we run the same experiment where a push action is applied to the bottommost of a stack of three blocks, and there exist some number of additional objects on the table that do not interfere with the action in any way. Figure 7 shows our results. We test three object orderings: random (“none”), sorted according to object position (“xtheny”), and an ideal ordering where the first three objects in the ordering are exactly the three objects in the stack ordered from bottom up (“stack”). As expected, in all cases, predicted log likelihood drops as more extra blocks are added to the scene, and the random ordering performs worst while the ideal ordering performs best.
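The three orderings can be sketched as follows. This is an illustrative reconstruction rather than the experiment code: the function name, the dictionary of positions, and the block names are hypothetical, and placing the non-stack objects in sorted order under the “stack” ordering is an assumption of this sketch.

```python
import random

def order_objects(positions, target, mode, stack=None, seed=0):
    """Return object ids with the target first, followed by the rest.

    positions: dict mapping object id to an (x, y, z) tuple.
    mode: "none" (random), "xtheny" (sorted by position), or "stack".
    stack: ids of the stacked objects from bottom up (used by "stack").
    """
    others = [obj for obj in positions if obj != target]
    if mode == "none":
        random.Random(seed).shuffle(others)
    elif mode == "xtheny":
        # Tuples compare lexicographically: x first, then y, then z.
        others.sort(key=lambda obj: positions[obj])
    elif mode == "stack":
        in_stack = [obj for obj in (stack or []) if obj != target]
        rest = sorted(
            (obj for obj in others if obj not in in_stack),
            key=lambda obj: positions[obj],
        )
        others = in_stack + rest
    else:
        raise ValueError(f"unknown ordering mode: {mode}")
    return [target] + others

positions = {
    "a": (0.0, 0.0, 0.0),   # bottom of the stack, target of the push
    "b": (0.0, 0.0, 0.1),
    "c": (0.0, 0.0, 0.2),
    "d": (-0.3, 0.2, 0.0),  # extra block
    "e": (0.4, -0.1, 0.0),  # extra block
}

by_position = order_objects(positions, "a", "xtheny")
by_stack = order_objects(positions, "a", "stack", stack=["a", "b", "c"])
shuffled = order_objects(positions, "a", "none")
```

The baseline's fixed-size input vector would then be built by concatenating the poses in the returned order, so the ordering determines which input slots carry the stack blocks.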