Disentangled Relational Representations for Explaining and Learning from Demonstration

by Yordan Hristov, et al.

Learning from demonstration is an effective method for human users to instruct desired robot behaviour. However, for most non-trivial tasks of practical interest, efficient learning from demonstration depends crucially on inductive bias in the chosen structure for rewards/costs and policies. We address the case where this inductive bias comes from an exchange with a human user. We propose a method in which a learning agent utilizes the information bottleneck layer of a high-parameter variational neural model, with auxiliary loss terms, in order to ground abstract concepts such as spatial relations. The concepts are referred to in natural language instructions and are manifested in the high-dimensional sensory input stream the agent receives from the world. We evaluate the properties of the latent space of the learned model in a photorealistic synthetic environment and particularly focus on examining its usability for downstream tasks. Additionally, through a series of controlled table-top manipulation experiments, we demonstrate that the learned manifold can be used to ground demonstrations as symbolic plans, which can then be executed on a PR2 robot.








1 Introduction

As an increasing number of robots become deployed in field applications, where they must interact in customized ways with human co-workers, there is a need for these robots to represent and reason about their tasks in ways that accord with corresponding human concepts. Ideally, the human’s and robot’s conceptualizations of the working environment should align so that the robot can adapt to the specific needs of the user. For example, in a table-top manipulation scenario, in order for the agent to correctly respond to instructions regarding stacking or clustering a set of objects, it should be able to comprehend concepts like an object being close to or on another one (see Figure 1).

This motivates the need for a robot to be able to acquire and tune a domain model via interactions with the human user. Moreover, people who are not robotics experts find it easier to provide the necessary inductive bias in the form of demonstrations of the task rather than explicit specifications of the same task. It is well understood that reward specification is not only hard, but prone to exploitation by the agent [2]. We can therefore use a Learning from Demonstration (LfD) [3] method, together with providing high-level guidance using language. This guidance is necessarily more abstract than the level of the robot’s sensor stream or native action representation. So, we need to induce alternate latent representations from the low-level sensory data, that allow for subsequent tasks to be grounded in this abstracted space.

Figure 1: Example data used from a (a) photo-realistic blocks world, and (b) table-top object manipulation while teleoperating a 7 DoF arm of a PR2 robot.

Forming a series of hierarchical abstractions about the world that we share with each other (e.g. the notions of color, shape, size, direction, objects’ relative position) is essential for humans to communicate with one another. We would like our robots to also use these human-interpretable concepts as representations that underpin LfD. To achieve this, we work in the setting of interactive task learning [22], starting with the question of how best to align a learning agent’s representations (in this paper, regarding inter-object relationships) with corresponding human labels. A specific aspect of this problem is the issue of physical symbol grounding [15, 31], i.e., how a learning agent should make inferences about the relationship between symbolic labels and their manifestation in the richer sensory feed of the robot.

In this paper, we propose a framework which allows human operators to teach a PR2 robot about spatial relations and inter-object arrangements on a table top. Our main contributions are:

  • A disentangled representation learning method in which inter-object relationships, manifested in a high-dimensional sensory input, can be grounded in a learned low-dimensional latent manifold. We explicitly optimize for the latent manifold to align with human ‘common sense’ notions, e.g. left and right are mutually exclusive and independent from front and behind which are also mutually exclusive.

  • Evaluating the learned representations in an ‘Explain-n-Repeat’ setup—see Figure 2 (b)—in which discrete symbolic specifications, grounded in the learned manifolds, can be derived from the latent projections of user demonstrations. The demonstrations are third person observations of object manipulation in a table-top environment. We show that we can infer both what is moved after what and how each object is manipulated from this set of demonstrations. We further demonstrate that end effector poses can be predicted from the steps of such inferred plans, and associated sensory data, see Figure 2 (c).

Figure 2: Overall setup: (a) during training, the agent receives observations from the environment and weak annotation from the human expert as to how different objects relate to each other, at each time step. (b) At test time, the agent uses the learned representations in order to explain how the objects in the environment relate to each other, through time, with the explanation being structured in the form of a plan; (c) each instruction from the plan can then be mapped to end-effector actions.

2 Related Work

Prior work in psycholinguistics has empirically shown that humans communicate more efficiently and effectively with each other by aligning language and its use at all levels of linguistic processing (e.g., [13, 25]). One aspect of the problem is learning how to physically ground symbols in visual input. The INGRESS framework [29] uses a multi-step process to learn a representation of objects within the scene, including when objects are referred to within dialogue with a human. Doing this in a 0-shot, 1-shot [31, 15], or meta-learning framework [30] requires minimizing the number of examples needed for generalization. An extension of the above work is the ability to combine the learned symbols in a compositional manner [23] such that novel instructions could be formulated.

Learning relationships between objects from raw sensory input can be achieved through the use of high-capacity models like neural networks [28, 26] or with SVMs [27]. However, this can often require large quantities of fully-labelled data and computational resources (e.g., the CLEVR data set [18]) and the learned models are often treated as black boxes.

Splitting the factors of variation in an unsupervised way is well studied in the representation learning literature as a way of making the learned models more interpretable. This has been demonstrated using both generative models (InfoGAN [9], which can be unstable in training and needs specification of the distribution over the latent representation) and variational models of images (β-VAE [16], β-TCVAE [8], oi-VAE [1]) or of video [11]. As these models are trained in an unsupervised way, the resulting embeddings for the factors of variation within the data set do not necessarily map to the variation that is necessary for the discrimination of the task at hand. In [17] the authors employ a β-VAE representation for grounding of symbols in a semi-supervised way and achieve alignment between the defined semantic concept groups and the orthogonal latent vector space representing them. Our work follows this weakly-supervised method of aligning the representations, but differs in that we use the representations to help solve more complex downstream tasks. Moreover, we deal with the segmentation problem when multiple objects are present in the scene. MONet [7] and IODINE [14] present methods for performing iterative multi-object scene decomposition using deep variational inference models. Both approaches choose to solve the scene segmentation and representation learning problems compositionally in an end-to-end fashion, only using unlabelled data. However, it is not clear what these representations could be used for.

Lázaro-Gredilla et al. present the Visual Cognitive Computer (VCC) [14], which does show how representations that align with human notions and concepts can be learned and then used for a robotic manipulation task. However, the authors assume they have access to a model of the environment and its dynamics, together with a deterministic mapping from sensory inputs to discrete symbols and full plans for each interaction.

On the topic of bridging neural networks and logical plans, Asai et al. [5, 4] present FOSAE, a method for learning how to extract first-order logic predicates and plans from raw sensory observations which can later be composed in a sequential plan. However, the authors claim that the method sacrifices the interpretability of the learned representations for the potential benefit of greater autonomy in the system. For us this is an orthogonal goal, our primary focus being on richer forms of human-robot interaction to help robots acquire customized skills.

3 Problem Formulation

3.1 Representation Learning Step

We work with user descriptions which come as natural language sentences of the [target relations referent] form, where target is the object that is manipulated, referent is the object that acts as a reference point and relations describes the configuration which the target should satisfy with respect to referent.

Our aim is to efficiently learn how to compress a pair of high-dimensional inputs , to a low-dimensional vector space , where , by optimizing a set of functions , and .

The weak labelling over an observed scene consists of a set of conceptual groups = that aim to describe different notions that are represented in the environment, e.g. alignment along the spatial X/Y/Z axes, containment, support, etc. Each group is a set of mutually exclusive discrete labels: (e.g. the conceptual group of alignment along Y can have the labels left and right, etc.) Additionally, we have a set of object-centered conceptual groups which represent notions like color, shape, size, etc., and are extracted from the target and referent part of the given instructions. Such labels associated with either the target or reference object are designated as and respectively. Let be a set of observations. , ; of the observations are given at least one relational label while the rest are passively gathered as unknown. We do not treat the unknown value as a label class during training. Each corresponds to a (target, referent) image pair and corresponds to a relations term from the semantically parsed descriptions above. For example, a scene with 3 objects would result in 6 possible bi-object configurations and 6 pairs respectively. Again, note that we expect a proportion of the labels to be unknown (unlabelled), due to ambiguity in the scenes, e.g. in Figure 1 (second image) the green cylinder is neither left nor right of the blue cylinder. For more details on how linguistic instructions are parsed to labels and how input images are semantically segmented, consult Appendix A.

We explicitly optimize the vectors in to preserve specific semantic concepts expressed over the tuples (, ) and whose meaning is commonly agreed-upon, e.g. relative spatial positions. The latter is achieved by using the vectors in to predict the set of labels in each group . Additionally, a subset of the dimensions of each object-centered latent vector is forced to predict the values in and respectively.
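The pair count mentioned above (3 objects giving 6 bi-object configurations) is the number of ordered (target, referent) pairs of distinct objects. A minimal sketch of that enumeration follows; the function name and the object names are hypothetical and only serve the illustration.

```python
from itertools import permutations


def ordered_pairs(objects):
    """Return every ordered (target, referent) pair of distinct objects."""
    return list(permutations(objects, 2))


# Hypothetical scene with three segmented objects.
scene = ["red cube", "green cylinder", "blue cylinder"]
for target, referent in ordered_pairs(scene):
    print(f"target={target!r}, referent={referent!r}")

print(len(ordered_pairs(scene)))  # 3 * 2 ordered pairs
```

Under the same reading, a scene with 4 objects yields 12 ordered pairs, which is consistent with 72,000 inter-object relationships over the 6,000 synthetic scenes described in Section 5.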

3.2 The ‘Explain-n-Repeat’ Step

At test time the agent receives a demonstration in the form of a sequence of observations . We aim to find a corresponding sequence of instructions that is expressed through the symbols that we have learned how to ground in .

To close the loop, when performing the demonstrations on the robot, apart from recording pairs, we also record the 6 DoF pose for the end effector of the arm that is performing the object manipulation. We can thus learn how to regress from an initial image of the scene and a relational specification vector , describing the end state of the two objects, to a valid pose which satisfies of . The predicted pose is fed to a MoveIt! motion planner [10]. We do not address the grasping problem - we assume the robot is already holding the object to be moved.

4 Methodology

4.1 Learning Disentangled Relational Embeddings

The overall architecture is inspired by the MONet model [7], augmenting the reconstruction loss term in order to achieve better disentanglement in . We do not learn the segmentation process but use already segmented masks. Similar to Hristov et al. [17], we explore the effects of adding auxiliary classification losses to a Siamese Neural Network [21] which uses a β-VAE [16, 6, 20] as a base architecture. It consists of a convolutional encoder network , parametrized by , which takes an input and produces a vector (red and green object embeddings in Figure 3 (a)). Each is fed into a spatial broadcast decoder network [32], parametrized by (Figure 3 (b)). A set of variational operators , parametrized by , take the concatenation of the vectors and produce a single vector (yellow relationship embedding in Figure 3 (b)). The resultant vector, , is fed into a set of linear classifiers , one per label group , each with a softmax activation function predicting a set of labels. Additionally, each is fed into a set of linear classifiers , one per latent dimension.

The rationale behind the combination of all of these losses—reconstruction , variational and multiple classification terms —is that they utilize different parts of the dataset in order to achieve the overall goal of learning representations that are factorized and aligned with abstract human notions. The latter is mostly enforced by the Softmax cross-entropy classification terms since they force the latent vectors along each axis to be useful for predicting the labels for a particular concept group. At the same time, the reconstruction loss makes use of all data points, labelled and unlabelled, forcing the same latent vectors to be also useful for recreating the original inputs. As shown in [7], masking forces the encoder network to produce which are more factorized.

This, combined with optimizing the Kullback-Leibler divergence between the distribution of latent values and a prior isotropic normal distribution, incentivises the object-centric and relational embeddings to be smoother [6] and for similar data pairs to be projected to the same regions of the manifold.

Additional parameters ( for the reconstruction term, for the Kullback-Leibler divergence term, for the cross-entropy terms) are used to scale the terms in the overall loss; see Equation (1).
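A minimal sketch of how such a weighted objective could be assembled is given below. It is an illustration under stated assumptions, not the authors' implementation: the weight names `w_rec`, `w_kl`, `w_cls`, the use of squared error for the masked reconstruction term, and the convention of passing `None` for an unknown label are all hypothetical choices.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def masked_squared_error(x, x_hat, mask):
    """Reconstruction error counted only where the object mask is 1 (assumed form)."""
    return sum(m * (a - b) ** 2 for a, b, m in zip(x, x_hat, mask))


def cross_entropy(logits, label):
    """Softmax cross-entropy; an unknown label (None) contributes nothing."""
    if label is None:
        return 0.0
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[label]


def total_loss(x, x_hat, mask, mu, log_var, group_logits, group_labels,
               w_rec=1.0, w_kl=1.0, w_cls=1.0):
    """Weighted sum of reconstruction, KL and per-group classification terms."""
    rec = masked_squared_error(x, x_hat, mask)
    kl = kl_to_standard_normal(mu, log_var)
    cls = sum(cross_entropy(lg, lb) for lg, lb in zip(group_logits, group_labels))
    return w_rec * rec + w_kl * kl + w_cls * cls
```

Skipping the classification term for unknown labels reflects the statement that the unknown value is not treated as a label class, while the reconstruction and KL terms still use every data point.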


In order to evaluate the architecture we perform an ablation study consisting of disabling parts of the full model—e.g. disable the classification part of the network for predicting the object labels and only train the rest. The set of models used in experiments is as follows:

  • No , No : ()

  • No , With : ()

  • With , No : ()

  • With , With : ()

Figure 3: (a) Overall architecture - two object-centric embeddings are produced for each masked RGBD input. From their concatenation a relationship-centric embedding is produced. Parts of all embeddings are fed through a set of linear classifiers in order to predict a set of discrete labels - one group of labels per latent axis. Additionally the object-centric embeddings are used to reconstruct the original input. (b) VAE with a spatial broadcast decoder and masked reconstruction loss, similar to the Component VAE in [7]. (c) Fully connected operator for each relational concept group producing a 1D space in which the symbols from the particular group are grounded.

4.2 Inferring Symbolic Plans from Demonstration

Given the continuous manifold in which inter-object relational discrete labels can be grounded, we look into whether that feature space can be used in an LfD context. In particular, we investigate whether the learned manifold allows us to segment the latent projections of user demonstrations for moving objects. Inferring the symbolic plans that underpin these demonstrations is the focus of this section.

Plan Segmentation - Algorithm 1 outlines the steps necessary to segment a projection of a demonstration into a movement prescription sequence which designates when an object is manipulated and when not. Using the methods described in Section 4.1 we identify the different target (moved) objects (green and blue in Figure 4 (a)). Then for each pair of target and a given reference object (the red cube) we extract the corresponding traces of relational embeddings. Checking whether the particular target object moves with respect to the reference object at each timestep consists of performing a likelihood ratio test with two candidate normal distributions, parametrized by and (see lines 9 to 12 in Algorithm 1). However, in a given set of demonstrations we are not only interested in identifying when one object stops moving and another starts. We are also interested in how the relationships between them change over time. More specifically, we are interested in being able to identify an invariant symbolic plan that underlies a set of demonstrations, all of which are meant to demonstrate the same task.

Input: Sequence of observations
Input: Referent object labels of
Input: Encoder network ,
Output: Movement prescription sequence
1 ;
2 segment();
3 preproc();
4 for each object pair in  do
5       & );
6       ;
7       ;
8       for each ( in zip() do
9             if  then
10                   Append to ;
12            else
13                   Append to ;
15      Append to ;
17return ;
Algorithm 1 Movement Prescription Seq Identification
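The per-timestep likelihood ratio test at the core of Algorithm 1 can be sketched as follows. This is a simplified illustration and not the algorithm itself: it assumes the test is applied to the absolute step-to-step change of a single 1D relational embedding, and the parameters of the two candidate normal distributions (`static`, `moving`) are made-up values.

```python
import math


def normal_logpdf(x, mean, std):
    """Log density of a 1D normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)


def movement_sequence(trace, static=(0.0, 0.05), moving=(0.5, 0.2)):
    """Label each step of a 1D latent trace as moving (1) or static (0).

    A step is 'moving' when the change in the latent value is more likely
    under the 'moving' normal than under the 'static' one.
    """
    labels = []
    for prev, curr in zip(trace, trace[1:]):
        delta = abs(curr - prev)
        log_ratio = normal_logpdf(delta, *moving) - normal_logpdf(delta, *static)
        labels.append(1 if log_ratio > 0.0 else 0)
    return labels


# Hypothetical trace: the target rests, moves across the axis, then rests again.
trace = [0.0, 0.01, 0.02, 0.5, 1.0, 1.01, 1.0]
print(movement_sequence(trace))  # [0, 0, 1, 1, 0, 0]
```

Running the same test on the trace of every (target, referent) pair gives one binary sequence per target object, from which it can be read off when one object stops moving and another starts.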

Task Essence Extraction - Task essence identification from a set of demonstrations is performed in a similar fashion to the plan segmentation step described in Algorithm 1. However, in this case we are working with the embedding trace for a single pair and are using a set K of per-label estimated 1D normal distributions, one for each label in each conceptual group, in order to perform label-oriented likelihood ratio tests (as compared to the moving ones in the previous paragraph). As a result we can go over the latent trace for each target object and only add steps to an eventual symbolic plan , for each target object , if they are part of the identified task essence. For more details refer to the supplementary materials (https://sites.google.com/view/explain-n-repeat/) or to Appendix B.
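The label-oriented tests can be illustrated with a short sketch. It only covers fitting one 1D normal per label and turning a latent trace into a sequence of symbols; it does not reproduce the task essence procedure of Appendix B. The `margin` threshold, the rule of leaving ambiguous steps unlabelled, and the sample values are assumptions made for the example.

```python
import math
from statistics import mean, stdev


def normal_logpdf(x, mu, sigma):
    """Log density of a 1D normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2.0 * sigma * sigma)


def fit_normals(samples_by_label):
    """Estimate (mean, std) for each label from labelled latent values."""
    return {label: (mean(v), stdev(v)) for label, v in samples_by_label.items()}


def label_trace(trace, normals, margin=1.0):
    """Assign each latent value its most likely label, or None when the
    log-likelihood ratio against the runner-up label is below `margin`."""
    labels = []
    for z in trace:
        scored = sorted(((normal_logpdf(z, *p), name) for name, p in normals.items()),
                        reverse=True)
        (best, name), (second, _) = scored[0], scored[1]
        labels.append(name if best - second > margin else None)
    return labels


def collapse(labels):
    """Drop unknown steps and consecutive repeats to obtain symbolic steps."""
    plan = []
    for name in labels:
        if name is not None and (not plan or plan[-1] != name):
            plan.append(name)
    return plan


# Hypothetical labelled latent values for one concept group.
normals = fit_normals({"left": [-1.1, -0.9, -1.0], "right": [0.9, 1.1, 1.0]})
steps = collapse(label_trace([-1.0, -0.95, 0.0, 0.9, 1.0], normals))
print(steps)  # ['left', 'right']
```

The midpoint value 0.0 is equally likely under both labels, so it is left unlabelled rather than forced into either class.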

From symbolic plans to end effector poses - Predicting end effector poses of the robotic arm is treated as a fully-supervised problem. From an observation of the environment—an image showing a grasped object () and a static object () on a table top—we extract the object-centric embedding corresponding to the target object—. Additionally, given a relational vector , arising from , describing the desired eventual state of the two objects, we sample a relational embedding , by using the fitted parametric distributions K (see previous paragraph). Given the concatenation of and , we use an MLP with two hidden layers in order to regress to a pose vector .
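The data flow of this regression step can be sketched in plain Python. Everything concrete here is a hypothetical choice for illustration: the latent sizes, the hidden width of 32, the ReLU activations, the parameters of the fitted normals, and the randomly initialised (untrained) weights. Only the overall shape follows the description above: sample a relational embedding from the fitted per-label distributions, concatenate it with the object-centric embedding, and pass the result through an MLP with two hidden layers to obtain a 6-dimensional pose.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for shape checking only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sample_relation(spec, normals, rng):
    """Draw one latent value per concept group from the normal of the requested label."""
    return [rng.gauss(*normals[group][label]) for group, label in spec.items()]


def predict_pose(z_obj, z_rel, layers):
    """Concatenate the embeddings and run them through the MLP."""
    h = list(z_obj) + list(z_rel)
    for weights, bias in layers[:-1]:
        h = relu(linear(h, weights, bias))
    weights, bias = layers[-1]
    return linear(h, weights, bias)


rng = random.Random(0)
normals = {"off/on": {"off": (-1.0, 0.1), "on": (1.0, 0.1)}}  # assumed fitted values
z_rel = sample_relation({"off/on": "on"}, normals, rng)
z_obj = [0.2, -0.4, 0.1, 0.0]                                 # assumed object embedding
sizes = [len(z_obj) + len(z_rel), 32, 32, 6]                  # two hidden layers
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
pose = predict_pose(z_obj, z_rel, layers)                     # x, y, z, roll, pitch, yaw
```

In a trained system the weights would be fitted on the recorded (image, relational specification, end effector pose) triples; the sketch only checks that the pieces compose.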

5 Experiments, Evaluation and Results

For learning the relational embeddings a set of standard objects is used, as shown in Figure 1. The set of spatial prepositions and their semantic grouping that are given in the user-scene descriptions during the demonstrations are outlined in Table 1.

Figure 4: Example testing data for (a) repetitive motion along a single concept group, e.g. left to right (row 1), and (b) chained motion along different concept groups, e.g. performing a C-shape by moving sequentially from front to behind to right to front (row 1).

Photorealistic BlocksWorld - This synthetic dataset consists of 6,000 scenes, each containing 4 objects in a random configuration. The objects’ attributes are the defaults from the original CLEVR dataset [18], together with an additional gray tray object. Given the 6 concept groups—Table 1 (top)—this results in 72,000 possible inter-object relationships, 40% of which are unlabelled.

Photorealistic BlocksWorld concept groups (top):
left, right
front, behind
above, below
close, far
on, off
out of, in

PR2 robot concept groups (bottom):
off, on
not facing, facing
out, in

Table 1: User-defined spatial relations

It is worth noting that the different concept groups have a different split between labelled and unlabelled data points as an artefact of resolving the inherent ambiguity of some of the prepositions when procedurally generating the different scenes. For example, if an object is slightly above a tray, the pair is labelled as unknown along the in/out concept group. The proportion of unlabelled data points across the 6 concept groups is 28%, 31%, 41%, 36%, 32%, 90% respectively.

For evaluating the efficacy of plan segmentation using the learned relation embeddings, two types of moving scenes are generated - 6 repetitive behaviours of multiple target objects sequentially moving along a specific concept group (5 demos per type) and 3 chained behaviours of the same target object moving along different concept groups (8 demos per type)—see Figure 4. Accuracy is reported for each identified and edit distance is reported for each symbolic plan—see Equations 2 and 3.
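The two metrics can be sketched as follows, assuming that accuracy is the fraction of time steps whose moving/static label matches the ground truth and that edit distance is the standard Levenshtein distance computed over plan steps rather than characters. Both readings are assumptions; the function names are hypothetical.

```python
def accuracy(predicted, truth):
    """Fraction of time steps where the predicted label equals the ground truth."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def edit_distance(plan_a, plan_b):
    """Levenshtein distance between two plans, counted in whole plan steps."""
    prev = list(range(len(plan_b) + 1))
    for i, step_a in enumerate(plan_a, 1):
        curr = [i]
        for j, step_b in enumerate(plan_b, 1):
            curr.append(min(prev[j] + 1,                        # delete step_a
                            curr[j - 1] + 1,                    # insert step_b
                            prev[j - 1] + (step_a != step_b)))  # substitute
        prev = curr
    return prev[-1]


inferred = ["front", "behind", "right", "front"]
reference = ["front", "right", "front"]
print(edit_distance(inferred, reference))  # 1: one spurious step
```

An edit distance of 0 means the inferred plan matches the reference plan step for step, and each extra, missing, or wrong step adds 1.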


PR2 Robot Experiment - 3 tasks are demonstrated by teleoperating a PR2 robot with an HTC Vive controller—putting a red cube on a purple cup, making two cups face each other (as an example of a necessary pre-pouring step), placing a yellow cube in a purple bowl—see Figure 1 (b). The spatial inter-object prepositions that were learned from each of the 3 tasks are summarized in Table 1 (bottom). Each demonstration is temporally aligned such that one of the labels for the corresponding concept group is satisfied at the beginning of the demonstration and the other at the end. Everything in between is labelled as an unknown relationship. For each task there are 20 demonstrations performed, with variations in the position of the reference object in the scene and initial end effector poses. In total this results in 2,400 labelled and 6,000 unlabelled object pairs.

For evaluating how well we can predict an end effector pose from a given input image and a relational spec vector, we record 10 additional demonstrations for each task. The mean absolute error along each of the 6 axes of the end effector is reported between the inferred set of poses and the ground truth ones, measured in meters for X/Y/Z and radians for Roll/Pitch/Yaw.
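A minimal sketch of the per-axis mean absolute error is shown below, assuming each pose is a 6-element sequence ordered as X, Y, Z, Roll, Pitch, Yaw. The ordering, the function name, and the sample values are assumptions, and the sketch does not wrap angular differences around ±π.

```python
AXES = ("X", "Y", "Z", "Roll", "Pitch", "Yaw")


def per_axis_mae(predicted, truth):
    """Mean absolute error per pose axis over a set of demonstrations."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("pose lists must be non-empty and of equal length")
    n = len(truth)
    return [sum(abs(p[k] - t[k]) for p, t in zip(predicted, truth)) / n
            for k in range(len(AXES))]


# Two hypothetical held-out demonstrations (meters for X/Y/Z, radians for angles).
predicted = [[0.5, 0.0, 0.75, 0.0, 0.0, 0.25],
             [0.5, 0.0, 0.25, 0.0, 0.0, 0.75]]
truth = [[0.25, 0.0, 0.5, 0.0, 0.0, 0.0],
         [0.75, 0.0, 0.5, 0.0, 0.0, 1.0]]
for axis, err in zip(AXES, per_axis_mae(predicted, truth)):
    print(f"{axis}: {err:.2f}")
```

Reporting the error separately per axis, rather than as one aggregate number, is what allows translational and rotational accuracy to be compared across tasks.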

Model          left-right  front-behind  below-above  far-close  off-on  out-in
No , No        0.50        0.64          0.54         0.56       0.49    0.66
No , With      0.53        0.68          0.68         0.63       0.65    0.62
With , No      0.70        0.73          0.69         0.68       0.64    0.78
With , With    0.80        0.88          0.91         0.86       0.76    0.56

Model          C-shape  off-on-off  jump over
All models     1        0.74        1

Table 2: Plan segmentation accuracy (what moves when) for (top) repetitive and (bottom) chained demos.

The performed experiments demonstrate that the learned feature space can be reliably used by the agent in order to produce symbolic plans, using the dictionary of symbols it has been taught. Table 2 (top) shows that the model which incorporates both and performs best at identifying the movement prescription sequences in the repetitive demonstrations, with the exception of the out-in group. This supports our hypothesis that by enforcing object label classification and by utilizing the full dataset through the reconstruction loss, we learn smoother and more factorized vectors and . This in turn allows for the task segmentation process to be more robust. Further analysis is provided in Appendix C. As far as the chained movement demonstrations are concerned, all models perform equally well, which is expected, since these sequences only involve a single object moving.

Figure 5: (top): edit distance statistics as a function of how many demonstrations the agent has seen. (bottom) plan length statistics for the inferred plans as a function of how many demonstrations the agent has seen for all three chained behaviours—(a) C-shape, (b) off-on-off and (c) jump over;

The best performing model from Table 2 (top) is used on the symbolic plan inference task over the demonstrated chained behaviour (where the underlying plan is over a single target object and is a multi-step one). Figure 5 reports the average edit distance for the inferred plan over all demonstrations for a given task (top row). Additionally, the average plan lengths are also reported (bottom row). Both quantities are plotted as a function of the number of demonstrations used to infer the task essence, which in turn is used to infer the step-by-step plan for each demonstration. As expected, the more demonstrations we see per task, the closer the inferred plans get to the ground truth ones. The reason why some of the plots do not converge to the ground-truth numbers (red line across all plots in the figure) can be attributed to the fact that some demonstrations contain object occlusions, making it hard to reliably infer the true plan without noise.

Figure 6: Mean Absolute Error between inferred poses and commanded poses during teleoperation for (a) placing on, (b) facing cups, (c) placing in. The reported error values are across 10 demonstrations (X-axis) not seen during training.

Lastly, we demonstrate that using the learned latent grounding of the taught linguistic symbols we can regress end effector poses which capture the meaning behind the symbol (and its associated task). Figure 6 reports the mean absolute error between inferred poses and the true demonstrated ones for all three teleoperated tasks. The plots reflect that for certain tasks the model learns to predict more reliably only along the end effector axes that matter for the success of the task (in the way it has been demonstrated), e.g. for placing on and in we get lower error across X/Y/Z as compared to when making the cups face each other. Correspondingly, the facing task puts more weight on the Roll and Pitch axes (which matter for the cups to have the right orientation) and less weight on the Yaw or on the translational X/Y/Z axes of the end effector.

6 Conclusion

Effective human-robot collaboration requires shared task representations that are both interpretable and suitable for task completion. We present a framework which allows human demonstrators to teach how to ground high-level spatial concepts in their sensory input. We show that while interpretable to the human, due to the disentanglement we explicitly optimize for, the learned latent space is also useful to tasks downstream. In particular, using photorealistic synthetic data we show how such a feature space can be used by an agent to derive explanations for a set of demonstrations, using the symbols it has been taught a priori. We also show how such discrete symbolic representations can be used as a building block for primitive action policies in the context of a robotic agent performing a table-top manipulation task.

This work is partly supported by funding from the Turing Institute, as part of the Safe AI for surgical assistance project.


  • [1] S. K. Ainsworth, N. J. Foti, A. K. C. Lee, and E. B. Fox (2018) Oi-VAE: output interpretable VAEs for nonlinear group factor analysis. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 119–128. Cited by: §2.
  • [2] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in ai safety. arXiv preprint arXiv:1606.06565. Cited by: §1.
  • [3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §1.
  • [4] M. Asai and A. Fukunaga (2018) Classical planning in deep latent space: bridging the subsymbolic-symbolic boundary. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [5] M. Asai (2019) Unsupervised grounding of plannable first-order logic representation from images. arXiv preprint arXiv:1902.08093. Cited by: §2.
  • [6] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599. Cited by: §4.1, §4.1.
  • [7] C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019) MONet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §2, Figure 3, §4.1, §4.1.
  • [8] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.
  • [9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §2.
  • [10] S. Chitta, I. Sucan, and S. Cousins (2012) Moveit![ros topics]. IEEE Robotics & Automation Magazine 19 (1), pp. 18–19. Cited by: §3.2.
  • [11] E. L. Denton et al. (2017) Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pp. 4417–4426. Cited by: §2.
  • [12] D. Flickinger, E. M. Bender, and S. Oepen (2014) ERG semantic documentation. Note: Accessed on 2017-06-15. Cited by: Appendix A.
  • [13] S. Garrod and A. Anderson (1987) Saying what you mean in dialogue: a study in conceptual and semantic coordination. Cognition 27, pp. 181–218. Cited by: §2.
  • [14] K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450. Cited by: §2, §2.
  • [15] S. Harnad (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42 (1-3), pp. 335–346. Cited by: §1, §2.
  • [16] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6. Cited by: §2, §4.1.
  • [17] Y. Hristov, A. Lascarides, and S. Ramamoorthy (2018) Interpretable latent spaces for learning from demonstration. In Conference on Robot Learning, pp. 957–968. Cited by: §2, §4.1.
  • [18] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §2, §5.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
  • [20] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.1.
  • [21] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2. Cited by: §4.1.
  • [22] J. E. Laird, K. Gluck, J. Anderson, K. D. Forbus, O. C. Jenkins, C. Lebiere, D. Salvucci, M. Scheutz, A. Thomaz, G. Trafton, et al. (2017) Interactive task learning. IEEE Intelligent Systems 32 (4), pp. 6–21. Cited by: §1.
  • [23] P. Liang and C. Potts (2015) Bringing machine learning and compositional semantics together. Annu. Rev. Linguist. 1 (1), pp. 355–376. Cited by: §2.
  • [24] S. Oepen, D. Flickinger, K. Toutanova, and C. D. Manning (2004) Research on Language and Computation 2 (4), pp. 575–596. Cited by: Appendix A.
  • [25] M. J. Pickering and S. Garrod (2004) Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences 27, pp. 169–225. Cited by: §2.
  • [26] D. Raposo, A. Santoro, D. Barrett, R. Pascanu, T. Lillicrap, and P. Battaglia (2017) Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068. Cited by: §2.
  • [27] B. Rosman and S. Ramamoorthy (2011) Learning spatial relationships between objects. The International Journal of Robotics Research 30 (11), pp. 1328–1342. Cited by: §2.
  • [28] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §2.
  • [29] M. Shridhar and D. Hsu (2014) Interactive visual grounding of referring expressions for human-robot interaction. In Robotics: Science and systems, Cited by: §2.
  • [30] S. Thrun and L. Pratt (2012) Learning to learn. Springer Science & Business Media. Cited by: §2.
  • [31] P. Vogt (2002) The physical symbol grounding problem. Cognitive Systems Research 3 (3), pp. 429–457. Cited by: §1, §2.
  • [32] N. Watters, L. Matthey, C. P. Burgess, and A. Lerchner (2019) Spatial broadcast decoder: a simple architecture for learning disentangled representations in vaes. arXiv preprint arXiv:1901.07017. Cited by: Table 3, §4.1.

Appendix A Data Processing and Network Architecture

Preprocessing the gathered data consists of extracting, from the raw RGBD pixel-level channels of information, the semantic mask corresponding to each object in the scene, together with all object and relational labels associated with each pair of objects in that scene. As issues of semantic segmentation are not the focus of our work, we start with a system that provides us with the semantic masks for each object present in the scene from raw observation. In our robot experiments, the RGB part of the input is fed to a pre-trained Mask R-CNN model, which dictates the partial labelling afterwards. For the BlocksWorld we can extract the masks deterministically, since we have access to the full state of the scene.

Elementary Dependency Structures (EDS) [24] and the wide-coverage English Resource Grammar [12] are used to parse the natural language descriptions. The resultant [target relations referent] tuples are used to perform weak labelling over the sequence of observations that comprise the demonstration.

For example, if we have [yellow_cube, {left, front} , blue_cube] as a parsed description and the semantic segmentation model detects a yellow_cube and a blue_cube present in the input image, this results in a single labelled data point () being added to , where and , , . Any segmented pair whose labels do not appear in the description is added to as an unlabelled data point.
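
The weak-labelling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `weak_label`, the tuple layout, and the choice to enumerate ordered (target, referent) pairs are all assumptions made for the example.

```python
from itertools import permutations


def weak_label(parsed, detected):
    """Split object pairs into labelled and unlabelled data points.

    parsed:   list of (target, relations, referent) tuples from the parser
    detected: object labels returned by the segmentation model
    Assumption: pairs are ordered, i.e. (a, b) and (b, a) are distinct.
    """
    described = {(t, r): set(rels) for t, rels, r in parsed}
    labelled, unlabelled = [], []
    for target, referent in permutations(detected, 2):
        rels = described.get((target, referent))
        if rels is not None:
            # Both objects are mentioned in the description: keep the relations.
            labelled.append((target, referent, rels))
        else:
            # The pair is visible but not described: no relational labels.
            unlabelled.append((target, referent))
    return labelled, unlabelled


parsed = [("yellow_cube", {"left", "front"}, "blue_cube")]
detected = ["yellow_cube", "blue_cube", "red_cube"]
labelled, unlabelled = weak_label(parsed, detected)
```

With the description from the example and a hypothetical third object (`red_cube`) in the scene, only the yellow-to-blue pair receives relational labels; every other pair is kept as an unlabelled data point.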

The model architecture is implemented in the Chainer framework (https://docs.chainer.org/en/stable/). The encoder network takes as input a set of RGBD 128x128 pixel images, a 128x128 binary segmentation mask, and a set of object and relational labels. It tries to reconstruct the same set of RGBD 128x128 pixel images, masked with the corresponding binary segmentation mask, and to predict all labels which are not unknown.

(a) Encoder
  • FC (2 x 8), output LogNormal
  • FC (256)
  • Conv (k=3, s=2, p=1, c=64)
  • Conv (k=3, s=2, p=1, c=64)
  • Conv (k=3, s=2, p=1, c=64)
  • Conv (k=3, s=2, p=1, c=32)
  • Conv (k=3, s=2, p=1, c=32)
  • Input image [128 x 128 x C]

(b) Decoder
  • Output logits
  • Conv (k=3, s=2, p=1, c=C)
  • Conv (k=3, s=2, p=1, c=64)
  • Conv (k=3, s=2, p=1, c=64)
  • Conv (k=3, s=2, p=1, c=64)
  • Append coord channels
  • Tile (128, 128, 8)
  • Input vector [8]

(c) Operator
  • FC (2 x 6), output LogNormal
  • FC (64)
  • FC (256)
  • Input vector [2 x 8]

Table 3: Network architectures used for the reported models. (a) and (c) are standard convolutional and fully-connected MLP networks; (b) is a spatial broadcast decoder, as described in [32]. Layers in each panel are listed from output (first) to input (last).
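
As a sanity check on the encoder in Table 3 (a), the spatial size after each strided convolution can be worked out with the standard output-size formula, floor((n + 2p - k) / s) + 1. The sketch below assumes that the table lists layers with the input at the bottom, so the channel order from input to output is 32, 32, 64, 64, 64; the helper name `conv_out` is illustrative.

```python
def conv_out(size, k=3, s=2, p=1):
    """Spatial output size of a convolution with kernel k, stride s, padding p."""
    return (size + 2 * p - k) // s + 1


# Channel widths from input to output (assumed reading order of Table 3 (a)).
channels = [32, 32, 64, 64, 64]

size = 128  # input images are 128 x 128
shapes = []
for c in channels:
    size = conv_out(size)
    shapes.append((size, size, c))

# Number of features entering the first fully-connected layer, FC (256).
flat_features = size * size * channels[-1]
```

Under these assumptions each convolution halves the resolution (128, 64, 32, 16, 8, 4), so the FC (256) layer would receive a flattened 4 x 4 x 64 feature map.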

Across all experiments, training is performed for a fixed number of 50 epochs using a batch size of 32. The dimensionality of the object latent space is 8 across all experiments. The dimensionality of the relational latent space (the operator output in Table 3 (c)) is 6 for the BlocksWorld experiments and 3 for the robot teleoperation experiments. The Adam optimizer [19] is used throughout the learning process with the following values for its parameters: ()

For all experiments, the values (unless set to 0) for the three coefficients from Equation 1 are:

The values are chosen empirically in a manner such that all the loss terms have similar magnitude and thus none of them overwhelms the gradient updates while training the full model.

Appendix B Plan Segmentation Elaboration

We use the trained model to convert the sequence of observations (images in Figure 7) into a trace of relational embeddings (colored blocks in Figure 7). In order to detect whether the two objects move with respect to each other, a likelihood ratio test with two normal distributions is performed on every two sequential embeddings. For the purpose of the experiments, both covariance matrices are diagonal, with values of 1 and 0.1 respectively. Additionally, for each part of the trace where the objects are moving with respect to each other, we can use the parametrised distributions for each cluster in each concept group (including the ones for unlabelled relationships) in an additional likelihood ratio test to decide how the objects move; see Figure 7 (b). The latter is equivalent to checking when an object changes membership along each concept group with respect to the reference object in the scene. This allows us to go from a sequence of observations to what is essentially a symbolic plan.

However, such an approach might capture noisy steps that do not represent the intent of the demonstrator. For example, we may move an object from being left to being right with respect to another object by going behind it in the intermediate states. The procedure described above would infer that the moved object being behind the static one is a valid substep, when that is not actually part of the user's intent. Thus, in the presence of more demonstrations, we filter out steps from the plan that are not identified in all demonstrations, in order to produce the essence of the demonstrated task. The goal is to identify the most invariant plan that best explains a set of demonstrations that have the same underlying goal. For instance, if we have two demonstrations where an object is moved from left to right with respect to another static object, we aim to identify an explanation that ignores the fact that the object passes once in front of and once behind that static object.
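
The movement test on consecutive embeddings can be illustrated with a short sketch. It rests on assumptions that the text does not confirm: both normal distributions are centred on the previous embedding, the values 1 and 0.1 are treated as per-dimension standard deviations, and the objects are declared to be moving when the wider distribution explains the next embedding better. All function names are illustrative.

```python
import math


def diag_gauss_logpdf(x, mean, sigma):
    """Log-density of a Gaussian with diagonal covariance sigma^2 * I."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * sigma ** 2) + sq_dist / sigma ** 2)


def is_moving(z_prev, z_next, sigma_move=1.0, sigma_static=0.1):
    """Likelihood ratio test between a wide and a narrow Gaussian (assumed setup)."""
    log_ratio = (diag_gauss_logpdf(z_next, z_prev, sigma_move)
                 - diag_gauss_logpdf(z_next, z_prev, sigma_static))
    return log_ratio > 0.0


def segment(trace):
    """Flag each transition in a trace of relational embeddings as moving or not."""
    return [is_moving(a, b) for a, b in zip(trace, trace[1:])]


# Toy 2-D trace: small jitter, one large jump, then no change.
trace = [[0.0, 0.0], [0.01, 0.0], [0.5, 0.5], [0.5, 0.5]]
flags = segment(trace)
```

In this toy trace only the middle transition is flagged as movement. The same comparison of log-likelihoods could in principle be repeated against the per-cluster distributions of each concept group to decide how the object moved.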

Once the task essence is extracted, we match it against all demonstrations for that task and obtain a corresponding plan for each of them. The edit distance between each inferred plan and the underlying ground-truth plan is reported.
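
A minimal sketch of the filtering and scoring steps follows, assuming that a plan is encoded as a list of symbolic step labels (a hypothetical encoding chosen for illustration) and that the edit distance is the standard Levenshtein distance over those labels.

```python
def task_essence(plans):
    """Keep only the steps that occur in every demonstrated plan.

    The order of the first plan is preserved (an assumption of this sketch).
    """
    common = set(plans[0]).intersection(*map(set, plans[1:]))
    return [step for step in plans[0] if step in common]


def edit_distance(a, b):
    """Levenshtein distance between two sequences of plan steps."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Two demonstrations of the same task: one passes in front, one behind.
demo_front = ["left", "front", "right"]
demo_behind = ["left", "behind", "right"]

essence = task_essence([demo_front, demo_behind])
distance_to_truth = edit_distance(essence, ["left", "right"])
distance_raw = edit_distance(demo_front, ["left", "right"])
```

Here the intermediate `front` and `behind` steps are dropped because they do not appear in both demonstrations, so the filtered plan matches the hypothetical ground truth, while the unfiltered plan is one edit away from it.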

Figure 7: Plan segmentation pipeline. (a) Infer a movement prescription sequence (what is moved after what) and (b) infer how each object is moved when it is manipulated. In this example the red, green and blue objects are sequentially stacked on top of the yellow one. A change in the color shade corresponds to (a) an object being moved or (b) an object changing the way it relates to the reference object in the scene along one or more concept groups.

Appendix C Disentanglement Analysis of the Information Bottleneck

In order to bring additional clarity to the properties of the learned latent relational space, we provide violin plots for the distributions of data points from each concept group (X axis on each plot). We can observe that models which do not utilise object label information in training the object embeddings (Figure 8 and Figure 10) tend to learn relational embeddings which fall in a tighter region, centered around 0, due to the influence of the KL objective. We hypothesise that this is one of the reasons these models underperform in inferring the movement prescription sequence for the given demonstrations. With the latent clusters being projected closer together, tuning the parameters of the distributions used in the movement likelihood ratio test might be required.

Figure 8: Evaluation of the degree of disentanglement in the latent space for each concept group across the different baseline models used in the ablation study — No , No
Figure 9: Evaluation of the degree of disentanglement in the latent space for each concept group across the different baseline models used in the ablation study — No , With
Figure 10: Evaluation of the degree of disentanglement in the latent space for each concept group across the different baseline models used in the ablation study — With , No
Figure 11: Evaluation of the degree of disentanglement in the latent space for each concept group across the different baseline models used in the ablation study — With , With