
Action sequencing using visual permutations

by   Michael Burke, et al.

Humans can easily reason about the sequence of high level actions needed to complete tasks, but it is particularly difficult to instil this ability in robots trained from relatively few examples. This work considers the task of neural action sequencing conditioned on a single reference visual state. This task is extremely challenging as it is not only subject to the significant combinatorial complexity that arises from large action sets, but also requires a model that can perform some form of symbol grounding, mapping high dimensional input data to actions, while reasoning about action relationships. Drawing on human cognitive abilities to rearrange objects in scenes to create new configurations, we take a permutation perspective and argue that action sequencing benefits from the ability to reason about both permutations and ordering concepts. Empirical analysis shows that neural models trained with latent permutations outperform standard neural architectures in constrained action sequencing tasks. Results also show that action sequencing using visual permutations is an effective mechanism to initialise and speed up traditional planning techniques and successfully scales to far greater action set sizes than models considered previously.





1 Introduction

Humans possess a remarkable ability to plan and select actions to rearrange scenes and form infinitely many newly imagined constructs, relying on visual information to guide the process. Taking inspiration from this ability, this paper considers the challenge of learning to sequence a set of high level actions to solve a task depicted in a single reference input image, using as few demonstrations as possible.

Figure 1: This paper considers the task of learning to predict a target sequence of actions, given a single reference input image, from as few demonstrations as possible. Looking at the towers above, it is easy to see that by rearranging the blocks on the left we can build the tower on the right. We apply this permutation perspective to the problem of neural action sequencing.

Specifically, we are interested in constrained action sequencing tasks where multiple actions can be selected to solve a task, but each action can only be chosen once. Constraints like these are particularly common in robotics, and are typically present in assembly tasks. For example, assembling a stool may require that a seat is attached to four identical legs, using four identical screws. The symmetry present in this task means that we have multiple ways of performing this assembly, resulting in ambiguity in action sequencing. This work shows that existing neural action sequencing approaches fail in this setting, and introduces a model that, by construction, copes with this ambiguity.

Traditionally, action sequencing of this form has been the domain of planning and reasoning [10], relying on pre-trained perception modules and known transition dynamics, with clearly defined symbolic rules and constraints. However, more recently, our community has focused on data driven methods relying on neural network universal function approximators to build policies trained using a range of mechanisms, from behaviour cloning [35, 39] to deep reinforcement learning [30, 26]. Much of our focus here has been on training or learning mechanisms, and there has been arguably less emphasis in robotics on the effect of the architectures we use and the inductive biases therein.

However, the choice of neural architecture strongly dictates the solutions we may find. For example, consider the tower building task in Figure 1, where our robot is required to select and place blocks to build the tower depicted in an input image. One approach to solving this task may be to consider tower building as a process of arbitrarily selecting blocks from an existing set (a fully connected neural classification model). Alternatively, we could frame tower building as a sequential process, where blocks selected are conditioned on previous block selections (a neural classification model with a temporal output layer). Unfortunately, neither of these approaches enforces a particularly important constraint that is common to classical planning systems [6], but absent from modern neural models: once a block has been placed, it can no longer be used in future.

This common-sense constraint is obvious to humans, and it is reasonably clear that we do not build towers by classification, but rather by rearranging or reordering an existing set. This paper explores this permutation perspective of action sequencing, introducing a neural architecture with latent permutations that allows for constrained, variable-length action sequencing from high-dimensional pixel inputs.

We investigate this model using a series of experiments conducted in a behaviour cloning setting, and show that while augmenting existing neural classification models with post-hoc symbolic constraints is reasonably effective at dealing with action re-use constraints in small scale settings, we gain significant improvements by directly embedding these into neural models using latent permutations. A particularly important finding of this paper is that action sequencing using latent permutations scales to significantly larger action set sizes than standard neural models, and copes well with combinatorially complex settings. In summary, the contributions of this paper are:

  1. the formulation of constrained action sequencing as a problem of learning to permute a discrete set (Sec. 3.1);

  2. a latent permutation modelling approach that outperforms standard neural models on vision-based action sequencing tasks (Sec. 4, 5.1 and 5.5);

  3. the application of the latent permutation model to generalise to new concepts using previously encountered action subsets (Sec. 5.2 and 5.3);

  4. a demonstration of the potential performance gains that can be obtained by using vision-based action sequencing to initialise optimisation-based planning algorithms (Sec. 5.4).

2 Related Work

Robotics has traditionally made a distinction between higher-level symbolic or task-level planning, and control at a behavioural level [20]. The former typically assumes that a domain specific language (DSL) defining objects, predicates, actions or operations and goals is available. For example, in robotic assembly [25], the domain in which our work has most relevance, DSLs may include placement actions defined in a given object frame, with contact constraints. Similarly, in carpentry planning this may include materials and parts, along with associated tools and operations that can be applied to these [41]. Here, planning is typically formulated as a constrained search or optimisation problem, which can quickly become computationally expensive in more complex task settings. Early approaches dealing with this search relied on linear programming-like possibility trees or knowledge graphs [36, 37], which try to prune the search space over actions by taking constraints into account. More recently, Lázaro-Gredilla et al. [23] learn concepts as cognitive programs by searching for algorithms that could generate a demonstrated scene, but this approach is limited to very simple scenes due to the need for dedicated scene parsers. In more complex settings it can be particularly challenging and time-consuming to develop a DSL, and inferring states from partially-observable uncertain environments is non-trivial.

In contrast, data-driven approaches like behaviour cloning [35] or deep reinforcement learning [30] try to avoid the need to specify a DSL or carefully program a robot, generally relying on universal neural network function approximators and substantial amounts of training data [22] to produce suitable policies that act directly on high dimensional observations. These connectionist models fail to incorporate many of the symbolic or logical constraints that are typically present in robot task planning settings. Existing attempts to extend neural models to handle these types of constraints are often made in a post-hoc fashion (e.g. action clipping [11], elaboration using auxiliary losses [17]). In this paper, we incorporate symbolic planning ideas around symmetries and permutations, which are often exploited in constraint programming to speed up search [9, 13, 34], into neural models through the use of latent permutations. In so doing, we gain generalisability and an improved ability to handle ambiguities in action selection at scale.

Connectionist policies and model-based symbolic planning systems are by no means incompatible. As a practical middle ground, there has been increasing interest in pruning the search space to speed up planning by using neural networks as universal function approximators. For example, DreamCoder [8] relies on a neural model to propose suitable program structures to speed up search in a program induction setting. A similar technique has been used to interpret transition system dynamics, iteratively refining a priority queue of candidate solutions [33]. Neural surrogate modelling has also been broadly applied to warm start general purpose optimisation procedures [2, 40]. Along these lines, and closest to this work, Driess et al. [7] propose an image conditioned recurrent model to predict sequences of up to 6 actions, which are then used to speed up symbolic robot planning. Our approach is similar, but, as shown in this work, action sequencing with latent permutations explicitly allows for learning in the presence of action constraints, and significantly outperforms temporal models in settings where action ambiguity may exist and when action set sizes are scaled.

Deep learning using latent permutations is a recent approach that has proven useful in differentiable sorting and ranking applications [5, 3, 29, 32], and in computer vision for a range of applications including semi-supervised learning [29], captioning [4] and point cloud segmentation [38]. However, to the best of our knowledge, this work is the first to explore their use for sequence modelling.

3 Preliminaries

3.1 Problem formulation

This work considers a behaviour cloning setting, where we are required to learn an open-loop, image-conditioned action sequencing policy from demonstrations. More formally, assume that a robot is required to correctly order $N$ action primitives, $a_1, \dots, a_N$, to accomplish some task described by an image $I$, depicting a reference state associated with the task. Our goal is to use behaviour cloning to learn to predict an action sequence that will reproduce (or deconstruct) the scene depicted in a query image, using prior training examples comprising action sequences and reference images, $(\mathbf{a}^{(j)}, I^{(j)})$, $j = 1, \dots, M$.

3.2 Baselines: Action sequencing using behaviour cloning and Hungarian assignment

A naive approach to addressing the problem above would be to train a multi-class, multi-label feed-forward convolutional neural network with parameters $\theta$ to predict action sequences directly, using a cross-entropy classification loss,

$$\mathcal{L}_{ce}(\theta) = -\sum_{k=1}^{N} \sum_{i=1}^{N} y_{i,k} \log \frac{\exp(\hat{y}_{i,k})}{\sum_{j=1}^{N} \exp(\hat{y}_{j,k})}.$$

Here, $y_{i,k}$ is a binary label indicating the use of action $i$ in the $k$-th step of the demonstrated action sequence, while $\hat{y}_{i,k}$ is a logit predicted by the neural network.

Since action sequencing is crucial to obtain a desired goal state, failure to predict even a single action correctly will result in failure to complete the task. This makes action sequencing using the behaviour cloning approach described above particularly challenging. Moreover, this naive approach to behaviour cloning is limited as there are no constraints on the model preventing action re-use. This poses problems if action ambiguity is present. For example, in Fig. 1 there are two blocks of each colour, so it is possible that a model lacking the ability to reason about objects or actions already performed would attempt to call actions to pick and place the same object twice when attempting to assemble the tower.
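The action re-use problem can be made concrete with a minimal sketch. It assumes that the loss is a per-step softmax cross-entropy and that actions are decoded by an independent argmax at each step; the logits and the helpers `softmax`, `cross_entropy` and `greedy_sequence` are hypothetical illustrations, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, demo):
    """Sum over steps of -log p(demonstrated action at that step).

    logits[k][i] scores action i at step k (assumed layout);
    demo[k] is the index of the action demonstrated at step k.
    """
    return -sum(math.log(softmax(row)[a]) for row, a in zip(logits, demo))

def greedy_sequence(logits):
    """Independent per-step argmax: nothing prevents re-using an action."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Illustrative logits for 3 steps over 3 actions. Actions 0 and 1 are
# visually ambiguous (e.g. two blocks of the same colour).
logits = [[2.0, 1.9, 0.1],
          [2.0, 1.9, 0.1],
          [0.1, 0.2, 3.0]]
demo = [0, 1, 2]

predicted = greedy_sequence(logits)
loss = cross_entropy(logits, demo)
```

With these illustrative logits the decoded sequence is `[0, 0, 2]`: the ambiguous action 0 is chosen twice and action 1 is never used, even though the loss itself is finite and well defined.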

Figure 2: Sequence modelling can capture temporal ordering information in action sequences, but does not explicitly prevent action re-use.

A standard approach to incorporating temporal information like this is to rely on sequence modelling, using recurrency [16, 14] or temporal convolutions [1, 24] in the output sequence prediction, as illustrated in Fig. 2. Models like these have recently been proposed for vision-based action sequencing [7]. However, these models do not explicitly incorporate constraints on action re-use, which are common in robotics.

The ability to reason about permutations is valuable across a wide range of tasks in robotics, and often studied as a balanced linear assignment problem, with the goal of identifying a permutation or assignment matrix $P$ that remaps some standard ordering $\tilde{\mathbf{a}}$,

$$\mathbf{a} = P\,\tilde{\mathbf{a}},$$

so as to minimise some assignment cost. The Hungarian algorithm [21, 31] is a well known technique to solve problems of this form in polynomial time. By using the logits of the network to produce an assignment cost, we can apply the Hungarian algorithm to order actions, and avoid issues around action re-use. However, if the classifier is overconfident in prediction (for example, predicts the same action twice with high probability), this assignment operation could introduce additional errors.
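Post-hoc assignment can be sketched as follows. The cost of using action i at step k is assumed here to be the negated logit, and a brute-force search over permutations stands in for the Hungarian algorithm: it returns the same optimum but is only practical for small action sets. The function name `assign_actions` and the logits are hypothetical.

```python
from itertools import permutations

def assign_actions(logits):
    """Return the one-action-per-step ordering with the lowest total cost.

    The cost of using action i at step k is taken to be -logits[k][i]
    (an assumption). Brute force is used for clarity; the Hungarian
    algorithm solves the same problem in polynomial time.
    """
    n = len(logits)
    best = min(permutations(range(n)),
               key=lambda perm: sum(-logits[k][perm[k]] for k in range(n)))
    return list(best)

# Illustrative logits: action 0 scores highest at both of the first two steps.
logits = [[2.0, 1.9, 0.1],
          [2.0, 1.0, 0.1],
          [0.1, 0.2, 3.0]]
ordered = assign_actions(logits)
```

For these logits an independent per-step argmax would give `[0, 0, 2]`, whereas the assignment returns `[1, 0, 2]`, which uses each action exactly once.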

Instead of applying the Hungarian algorithm as a post-hoc assignment stage, it is natural to consider the possibility of learning with embedded inductive biases for action assignment. Recent approaches to differentiable sorting and ranking [5, 3, 29, 32] provide a useful mechanism to learn about permutations in this manner.

4 Action sequencing using Sinkhorn networks

Differentiable sorting networks typically rely on the Sinkhorn operator $S(X)$ [29], acting on a square matrix $X$,

$$S^{0}(X) = \exp(X), \qquad S^{l}(X) = T_c\left(T_r\left(S^{l-1}(X)\right)\right), \qquad S(X) = \lim_{l \to \infty} S^{l}(X).$$

Here, $T_c$ and $T_r$ denote column and row normalisation operations respectively. Mena et al. [29] show that a differentiable approximation to the permutation $P$ can be obtained using the Sinkhorn operator,

$$P = \lim_{\tau \to 0^{+}} S(X / \tau),$$

with $X$ a square matrix predicted using a suitable feed-forward neural network. Intuitively, this soft assignment operation can be thought of as the permutation analogue of a softmax operation.
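The Sinkhorn operator can be illustrated with a short sketch that alternates row and column normalisation of the exponentiated matrix for a fixed number of iterations. The temperature `tau`, the iteration count `n_iters` and the example matrix are illustrative assumptions, and the sketch omits the Gumbel noise used during training.

```python
import math

def sinkhorn(X, tau=1.0, n_iters=50):
    """Alternately normalise the rows and columns of exp(X / tau)."""
    n = len(X)
    # Subtracting the global maximum avoids overflow in exp(); the constant
    # factor this introduces is removed by the first normalisation.
    top = max(max(row) for row in X)
    S = [[math.exp((v - top) / tau) for v in row] for row in X]
    for _ in range(n_iters):
        # Row normalisation.
        S = [[v / sum(row) for v in row] for row in S]
        # Column normalisation.
        col = [sum(S[i][j] for i in range(n)) for j in range(n)]
        S = [[S[i][j] / col[j] for j in range(n)] for i in range(n)]
    return S

X = [[2.0, 1.0, 0.1],
     [1.9, 2.0, 0.1],
     [0.1, 0.2, 3.0]]
soft = sinkhorn(X, tau=1.0)     # smooth, approximately doubly stochastic
sharp = sinkhorn(X, tau=0.05)   # sharper, approaching a hard permutation
```

Lower temperatures give sharper matrices, but typically need more normalisation iterations before the rows and columns both sum to one.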

4.1 Image conditioned action sequencing

We make use of Sinkhorn networks to sequence robot actions by training a feedforward convolutional neural network to predict the matrix $X$, using the Sinkhorn operator (with Gumbel-Matching [29]) to determine the permutation $P$. This network can be trained to minimise a mean squared error loss between sequenced actions,

$$\mathcal{L}_{mse}(\theta) = \frac{1}{N} \left\| P\,\tilde{\mathbf{a}} - \mathbf{a} \right\|_2^2. \tag{7}$$

Here, $\tilde{\mathbf{a}}$ denotes a one-hot encoded base action sequence order, and $\mathbf{a}$ a one-hot encoded demonstrated action sequence sampled from the training set. After training, action sequencing occurs by predicting a permutation matrix, and using the Hungarian algorithm for hard assignment.

By construction, a permutation is unable to re-use an action, which forces the network to learn to deal with action ambiguities. Our hypothesis is that behaviour cloning models trained with explicit inductive biases towards permutations will be better suited to constrained action sequencing than feedforward and temporal convolutional neural networks.
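A small worked example may help to show how a mean squared error between sequenced actions behaves. The sketch below assumes that row k of the permutation matrix selects the base action used at step k, and that the loss is averaged over all matrix entries; both conventions, along with the helper names and the matrices, are illustrative assumptions.

```python
def one_hot(sequence, n):
    """One row per step, with a 1 in the column of the chosen action."""
    return [[1.0 if j == a else 0.0 for j in range(n)] for a in sequence]

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def mse(A, B):
    """Mean squared difference over all matrix entries."""
    diffs = [(a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

n = 3
base = one_hot([0, 1, 2], n)   # base action order (identity matrix)
demo = one_hot([2, 0, 1], n)   # demonstrated action order

# A hard permutation matching the demonstration, and a relaxed version of it.
hard = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
soft = [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]

loss_hard = mse(matmul(hard, base), demo)
loss_soft = mse(matmul(soft, base), demo)
```

The loss is zero for the exact permutation and grows as the soft permutation spreads mass over other actions. Because every row and column of the relaxed matrix must still sum to one, mass placed on one action at one step is unavailable at other steps.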

4.2 Coping with action subsets

None of the approaches described above are able to deal with restricted subsets of actions. For example, building a tower using 3 blocks does not require all actions be used, but the models above all assume that a fixed number of actions are required to complete tasks.

Figure 3: Framework for action sequencing using visual permutations.

We extend the model above to handle action subsets using an auxiliary stopping network, with its own set of parameters, that predicts the number of actions required to complete a task for a given image. This network is trained using a standard cross-entropy classification loss.

The extension to subsets requires that we modify the loss in (7) to allow for variable action sequence lengths. We accomplish this by masking the predicted and ground truth sequences in the respective loss functions. Fig. 3 provides an overview of the proposed action sequencing model. A Sinkhorn network is used to predict permutations over action sequences conditioned on a reference scene, and a masking network restricts the sequence of actions to only the subset required to complete the referenced task.
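The masking step can be sketched as follows. The convention that class c of the stopping network means c + 1 actions, the helper names, and the toy values are all assumptions made for illustration.

```python
def predicted_length(stop_logits):
    """Assumed convention: class c of the stopping network means c + 1 actions."""
    return 1 + max(range(len(stop_logits)), key=stop_logits.__getitem__)

def masked_mse(pred, target, length):
    """Mean squared error over the first `length` steps; later steps are ignored."""
    diffs = [(p - t) ** 2
             for row_p, row_t in zip(pred[:length], target[:length])
             for p, t in zip(row_p, row_t)]
    return sum(diffs) / len(diffs)

# At test time: truncate the predicted full ordering to the predicted length.
full_order = [3, 1, 0, 2, 5, 4]
stop_logits = [0.1, 0.2, 2.5, 0.3, 0.0, 0.1]
n_actions = predicted_length(stop_logits)
plan = full_order[:n_actions]

# At training time: steps beyond the demonstrated length do not contribute.
pred = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.3, 0.3, 0.4]]
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
loss_masked = masked_mse(pred, target, length=2)
loss_unmasked = masked_mse(pred, target, length=3)
```

In this toy case the third step is outside the demonstrated sequence, so the masked loss is zero while the unmasked loss would penalise the network for an ordering that does not matter.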

5 Experimental Results

We start by investigating the effects of including inductive biases for permutations in the neural network used for behaviour cloning for our running tower stacking example.

5.1 Fixed length action sequencing

Figure 4: Mean average precision (correct block colour) distributions show that BC+Sinkhorn substantially outperforms behaviour cloning models that do not explicitly account for permutations.

Here, robot actions consist of picking up a block from a known location and placing it on the previously placed block. Once a block has been placed, this action should no longer be called, as doing so would demolish the tower. We collect 300 tower stacking demonstrations with randomly ordered action sequences of length 6 using CoppeliaSim and PyRep [18], and save the corresponding image of the completed tower (see Fig. 1).

We train on 200 demonstrations and evaluate on 100 held out demonstrations, using the mean average precision (number of times a block of the correct colour was correctly selected for placement) between predicted and ground truth action sequences. This experiment is repeated 20 times for models trained using different random seeds.
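One way to read the evaluation measures is sketched below for a single sequence: the fraction of steps at which the predicted block has the demonstrated colour, and the number of repeated actions. The exact definition used for the reported mean average precision is not given here, so the function names, the colour mapping and this per-sequence formulation are assumptions.

```python
def colour_precision(predicted, demonstrated, colour_of):
    """Fraction of steps where the predicted block has the demonstrated colour."""
    hits = sum(colour_of[p] == colour_of[d]
               for p, d in zip(predicted, demonstrated))
    return hits / len(demonstrated)

def repeated_actions(predicted):
    """Number of steps that re-use an action already selected earlier."""
    return len(predicted) - len(set(predicted))

# Hypothetical mapping: six pick-and-place actions, two blocks per colour.
colour_of = {0: "red", 1: "red", 2: "green", 3: "green", 4: "blue", 5: "blue"}

demo = [0, 2, 4, 1, 3, 5]   # red, green, blue, red, green, blue
pred = [1, 2, 5, 0, 4, 3]   # red, green, blue, red, blue, green

score = colour_precision(pred, demo, colour_of)
repeats = repeated_actions(pred)
```

Here the prediction picks different blocks from the demonstration at several steps but is only wrong in colour at the last two, giving a precision of 4/6 with no repeated actions. This shows why colour precision, rather than exact action match, is the natural measure when blocks of the same colour are interchangeable.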

Models compared include direct behaviour cloning (BC) with a fully connected output layer, the same behaviour cloning network with post-hoc assignment using the Hungarian algorithm (BC+Hungarian), and action sequencing using Sinkhorn networks (BC+Sinkhorn). In addition, we also investigate behaviour cloning using a temporal convolutional neural network decoder (BC+TCN and BC+TCN+Hungarian), a state-of-the-art sequence modelling approach commonly used for action recognition [24], which enforces temporal structure in the predicted output sequence.

Figure 4 shows kernel density estimates over the mean average precisions in block predictions over multiple seeds, while Table 1 shows the average number of times a block selection action was re-used (the tower collapses upon action re-use). It is clear that the inductive bias towards permutation prediction introduced using the Sinkhorn networks improves action sequencing.

Model               Colour precision      Action repetitions
                    (mean, std. dev.)     (tower collapses)
BC                                        46 %
BC+Hungarian                              0 %
BC+TCN                                    53 %
BC+TCN+Hungarian                          0 %
BC+Sinkhorn                               0 %
Table 1: Block colour precision and tower building success

While post-hoc Hungarian algorithm assignment remedies problems where the same action is selected multiple times, and provides substantial improvement over direct behaviour cloning, both BC and BC+Hungarian still perform poorly at image conditioned tower building. Explicitly modelling temporal action sequence behaviours using temporal convolutional neural networks improves performance, but BC+TCN is still outperformed by BC+Sinkhorn.

5.2 Generalisation to unseen configurations

Figure 5: Both BC+Sinkhorn and BC+TCN are able to generalise to previously unseen configurations. Plots are overlaid on a histogram of average precision scores (BC+Sinkhorn) obtained for increasing numbers of demonstrations.

In order to investigate the reasons for performance differences between BC+TCN and BC+Sinkhorn, we explore the generalisation capabilities of the action sequencing models using a modified tower building experiment, with 6 uniquely coloured blocks (avoiding the potential for action ambiguity). We generate a single demonstration pair for each of the 720 possible tower permutations, and train models on increasing numbers of demonstrations. We then test on all 720 demonstrations.

If the behaviour cloning models are capable of generalisation to unseen tower permutations, we would expect the precision of predicted action sequences to be better than a baseline approach of random ordering for unseen permutations and perfect ordering for previously seen permutations.

Fig. 5 shows these results. The dashed line shows the hypothetical precision for action sequences that would be obtained by an approach that memorises previously seen action sequences and randomly guesses orders for unseen sequences. Both BC+Sinkhorn and BC+TCN networks are able to generalise to previously unseen action sequence configurations, and perform similarly in this setting.
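The memorisation baseline can be written down explicitly under two assumptions: that a random guess is a uniformly random ordering of the 6 distinct actions, and that precision is the fraction of positions that match. A uniformly random ordering matches the target in one position on average, so its expected precision is 1/6. The function names below are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def random_order_precision(n):
    """Exact expected fraction of correct positions for a random ordering."""
    target = tuple(range(n))
    perms = list(permutations(range(n)))
    hits = sum(sum(p[k] == target[k] for k in range(n)) for p in perms)
    return Fraction(hits, n * len(perms))

def memorise_baseline(n_seen, n_total=720, n_actions=6):
    """Perfect on seen permutations, random ordering on unseen ones."""
    seen = Fraction(n_seen, n_total)
    return seen + (1 - seen) * random_order_precision(n_actions)

baseline_none = memorise_baseline(0)     # nothing memorised
baseline_half = memorise_baseline(360)   # half of the permutations seen
baseline_all = memorise_baseline(720)    # everything memorised
```

Under these assumptions the baseline rises linearly from 1/6 with no demonstrations to 1 when all 720 permutations have been seen, so any model that sits above this line for a given number of demonstrations is generalising rather than memorising.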

This contrasts with the previous set of experiments and indicates that the primary advantage BC+Sinkhorn has over BC+TCN is in dealing with symmetries that arise due to action ambiguities, where more than one action can reproduce a tower, and it becomes significantly more important to reason about prior actions taken in a sequence. When there is no ambiguity in actions, which is rarely the case in robotics applications, TCNs perform similarly to Sinkhorn networks.

5.3 Variable length action sequencing

We investigate variable length action sequencing using a third and final tower building experiment (with actions as in Figure 1, but with variable length action sequences – tower heights ranging from 2 to 6 blocks). As before, all possible permutations of demonstrations are generated (1950), and models are trained on increasing numbers of demonstrations. Since variable length sequences are required, we make use of the stopping mask extension of Section 4.2 for the BC+Sinkhorn networks. We also compare against BC+TCN+Hungarian, extended to deal with variable length sequences through the inclusion of a stopping action class.
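The figure of 1950 demonstrations is consistent with counting every ordered selection of 2 to 6 actions from the 6 available pick-and-place actions, which the short check below enumerates directly. Treating the 6 actions as distinct, even though blocks share colours, is an assumption of this sketch.

```python
from itertools import permutations

N_ACTIONS = 6

# Number of ordered selections of k distinct actions, for each tower height k.
counts = {k: sum(1 for _ in permutations(range(N_ACTIONS), k))
          for k in range(2, N_ACTIONS + 1)}
total = sum(counts.values())
```

The per-height counts are 30, 120, 360, 720 and 720 for heights 2 to 6, which sum to 1950.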

Figure 6: Action sequencing using BC+Sinkhorn shows impressive generalisation properties when action subsets are considered. Plots are overlaid on a histogram (BC+Sinkhorn) of block colour precision scores for sequences in the test set.

Fig. 6 shows these results. As before, the dashed line shows the hypothetical precision for action sequences that would be obtained by an approach that memorises previously seen action sequences and randomly guesses orders for unseen sequences. Interestingly, when training using subsets, we obtain more rapid generalisation than in the seemingly simpler case investigated earlier. This occurs because it becomes increasingly likely that action subsets have been seen within demonstrations as more training data is used.

The task of classifying the number of actions required for a subset is relatively simple for this tower building task, and the stopping mask prediction network successfully identifies the number of actions required to build a tower after approximately 200 demonstrations have been seen. As in Section 5.1, BC+Sinkhorn networks perform better than BC+TCN+Hungarian networks, since they allow for action ambiguity and explicitly take prior action use into consideration.

5.4 Soma puzzle: initialising plans with sequence predictions

Figure 7: Soma puzzle disassembly. Here, the task is to predict the block removal sequence in order to disassemble the Soma cube, given a set of four input images of the cube, captured from different sides. Failure to correctly predict the removal sequence will result in the cube collapsing, placing the environment in a state where pre-scripted action sequences can no longer be used.

The ability to reason about action sequence permutations is particularly important in assembly or disassembly tasks. We investigate the ability of behaviour cloning to solve more complex tasks, using a Soma puzzle [12]. As illustrated in Fig. 7, Soma puzzles consist of 7 distinctly shaped blocks, which can be assembled into arbitrarily shaped objects. Here, we consider the task of disassembling a 3x3x3 Soma cube, which can be constructed in 240 distinct ways (ignoring reflections and rotations).

The Soma puzzle has a long history in robotics and robot learning [28, 27], and has been the subject of extensive research due to the complexity of shapes that can be constructed using it. The geometry of the puzzle parts means that disassembling the puzzle using pre-scripted actions requires that parts be removed in a precise order. Failure to do so will result in the puzzle collapsing, placing the environment in a state where pre-scripted actions can no longer be used. Correctly predicting the order of part extraction from images of the cube is challenging as it requires a model that can reason about how parts interlock and their relative positioning.

This problem is also a challenge for traditional planning algorithms, requiring a careful, and non-trivial, specification of relationships between components and problem constraints and a backtracking search over numerous possible action sequences. To investigate this setting, models were trained using a dataset comprising images of the 240 possible initial puzzle configurations, and a manually defined extraction order for each puzzle. The dataset was randomly split into 120 training and validation examples and 120 test examples, and this split was repeated 100 times with different seeds.

Initialisation     Initial collapses (%)     Planning iterations
                                             Mean       Std. Dev.
Random             -
Table 2: Soma cube results

As shown in Table 2, despite outperforming the temporal convolutional architecture (BC+TCN+Hungarian), Sinkhorn behaviour cloning (BC+Sinkhorn) is still only successful on about half of the test cases when predicted action sequences are directly applied. However, when the predicted action sequences are used to initialise a suitable planning algorithm (a backtracking search using the simulator in the loop to test for failures), there are substantial gains in planning time, with a clear reduction in the number of iterations used to search for a suitable planning order.
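The effect of initialising a planner with a predicted sequence can be sketched with a toy backtracking search that tries candidate actions in the predicted order first. The `is_valid` check below is a hypothetical stand-in for the simulator-in-the-loop failure test, using simple precedence constraints, and counting one iteration per validity check is an assumption rather than the measure used in the paper.

```python
def backtracking_plan(candidate_order, is_valid):
    """Depth-first search for a valid full ordering.

    Actions are tried in `candidate_order`, so a good initial guess is
    explored first. Returns (plan, number of validity checks performed).
    """
    iterations = 0

    def extend(prefix, remaining):
        nonlocal iterations
        if not remaining:
            return prefix
        for action in remaining:
            iterations += 1
            if is_valid(prefix, action):
                rest = [a for a in remaining if a != action]
                found = extend(prefix + [action], rest)
                if found is not None:
                    return found
        return None

    return extend([], list(candidate_order)), iterations

# Hypothetical constraints: each part can only be removed after its blockers.
BLOCKED_BY = {0: set(), 1: {0}, 2: {1}, 3: {2}}

def is_valid(prefix, action):
    """Stand-in for a simulator check that removing `action` causes no collapse."""
    return BLOCKED_BY[action] <= set(prefix)

good_plan, good_iters = backtracking_plan([0, 1, 2, 3], is_valid)
bad_plan, bad_iters = backtracking_plan([3, 2, 1, 0], is_valid)
```

Both calls recover the same valid plan, but the search seeded with the correct order needs 4 checks while the one seeded with the reversed order needs 10. With a simulator in the loop each check is expensive, which is why a good initial ordering translates into planning time.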

In this case, it seems that there is limited difference between the TCN and Sinkhorn models, as both provide good initial guesses that speed up planning. However, as will be shown next, results on larger action sets indicate that Sinkhorn networks scale far better than TCNs, and these performance differences become substantially more pronounced as more actions are considered.

5.5 Scrabble: scaling to larger action sets

Figure 8: Performance as action set size increases.

A simplified Scrabble setting is used to evaluate the ability of BC+Sinkhorn to scale to larger action sets. Here, a standard English Scrabble tile set is used to generate images (10000) of random letter combinations with lengths 3 to 6, sampled from an increasingly large subset of the full tile set. We test on a set of randomly generated test words (5000).
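The sampling of letter combinations can be sketched as below, treating each physical tile as one action that can be used at most once. The standard English set has 98 letter tiles once the two blanks are excluded. How the growing subset of tiles is chosen is not specified in the text, so taking the first `n_actions` tiles in alphabetical order is an assumption, as are the function name and seed.

```python
import random

# Standard English Scrabble letter distribution (blank tiles excluded).
TILE_COUNTS = {
    "A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2, "I": 9,
    "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2, "Q": 1, "R": 6,
    "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1, "Y": 2, "Z": 1,
}
# One entry per physical tile; the index of a tile is its action id.
TILES = [letter for letter, count in TILE_COUNTS.items() for _ in range(count)]

def sample_example(n_actions, rng):
    """Draw 3 to 6 distinct tiles from the first `n_actions` tiles."""
    length = rng.randint(3, 6)
    actions = rng.sample(range(n_actions), length)   # no tile is re-used
    word = "".join(TILES[a] for a in actions)
    return actions, word

rng = random.Random(0)
examples = [sample_example(26, rng) for _ in range(5)]
```

Because several tiles carry the same letter, many different action sequences spell the same letter combination, which is the kind of action ambiguity that grows as the tile subset is enlarged.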

Fig. 8 shows the decrease in performance (mean average spelling precision) as the number of actions used to generate training and test data is increased. BC+Sinkhorn shows extremely impressive scalability, with only small decreases in performance as the action set size increases. This could be remedied by additional training data, although training becomes time consuming with larger action sets: as the number of actions increased, we observed that we needed to train for substantially longer before reaching convergence, with our largest model (98 actions) requiring approximately 5000 epochs to converge. In contrast, BC+TCN+Hungarian networks become increasingly unreliable as the action set size is increased, with significant performance drops. This happens because symmetries arising from action ambiguities become increasingly likely as the action sets are scaled, and classification-based approaches are unable to deal with these ambiguities.

6 Conclusion

This paper introduces a permutation prediction approach to vision-based neural action sequencing. Action sequencing using latent permutations predicted by Sinkhorn networks is most effective in tasks where there are potentially multiple actions leading to a desired state, and where there are constraints on the number of times an action can be used. Results show that neural action sequencing provides valuable improvements in planning speed when used to initialise planning algorithms, and that impressive generalisation can be obtained using these networks. Importantly, Sinkhorn networks are able to scale to far greater action set sizes than temporal convolution networks. Temporal convolution and Sinkhorn networks are similar capacity models, and there are no major computational differences in the forward pass, which means that latent permutations are a promising and useful approach to sequence modelling in robotics.

This research was supported by the Alan Turing Institute, as part of the Safe AI for surgical assistance project. We are grateful to Yordan Hristov and members of the Robust Autonomy and Decisions Group for discussions and feedback.


  • [1] S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §3.2.
  • [2] S. Banerjee, T. Lew, R. Bonalli, A. Alfaadhel, I. A. Alomar, H. M. Shageer, and M. Pavone (2020) Learning-based warm-starting for fast sequential convex programming and trajectory optimization. Cited by: §2.
  • [3] M. Blondel, O. Teboul, Q. Berthet, and J. Djolonga (2020) Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871. Cited by: §2, §3.2.
  • [4] M. Cornia, L. Baraldi, and R. Cucchiara (2019-06) Show, control and tell: a framework for generating controllable and grounded captions. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Cited by: §2.
  • [5] M. Cuturi, O. Teboul, and J. Vert (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pp. 6858–6868. Cited by: §2, §3.2.
  • [6] N. T. Dantam, Z. K. Kingston, S. Chaudhuri, and L. E. Kavraki (2016-06) Incremental task and motion planning: a constraint-based approach. In Proceedings of Robotics: Science and Systems, AnnArbor, Michigan. Cited by: §1.
  • [7] D. Driess, J. Ha, and M. Toussaint (2020-07) Deep Visual Reasoning: Learning to Predict Action Sequences for Task and Motion Planning from an Initial Scene Image. In Proceedings of Robotics: Science and Systems, Corvalis, Oregon, USA. Cited by: §2, §3.2.
  • [8] K. Ellis, C. Wong, M. Nye, M. Sablé-Meyer, L. Cary, L. Morales, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum (2020)

    DreamCoder: growing generalizable, interpretable knowledge with wake-sleep bayesian program learning

    ArXiv abs/2006.08381. Cited by: §2.
  • [9] M. Fox and D. Long (2002) Extending the exploitation of symmetries in planning.. In AIPS, pp. 83–91. Cited by: §2.
  • [10] M. Fox and D. Long (2003) PDDL2.1: an extension to PDDL for expressing temporal planning domains.

    Journal of Artificial Intelligence Research

    20, pp. 61–124.
    Cited by: §1.
  • [11] Y. Fujita and S. Maeda (2018) Clipped action policy gradient. In

    Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018

    , J. G. Dy and A. Krause (Eds.),
    Vol. 80, pp. 1592–1601. Cited by: §2.
  • [12] M. Gardner (1972) Mathematical games. Scientific American 227 (3), pp. 176–184. Cited by: §5.4.
  • [13] I. P. Gent, K. E. Petrie, and J. Puget (2006) Chapter 10 - Symmetry in constraint programming. In Handbook of Constraint Programming, F. Rossi, P. van Beek, and T. Walsh (Eds.), Foundations of Artificial Intelligence, Vol. 2, pp. 329 – 376. Cited by: §2.
  • [14] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §3.2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Scrabble.
  • [16] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
  • [17] C. Innes and S. Ramamoorthy (2020-07) Elaborating on Learned Demonstrations with Temporal Logic Specifications. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA. Cited by: §2.
  • [18] S. James, M. Freese, and A. J. Davison (2019) PyRep: bringing v-rep to deep robot learning. arXiv preprint arXiv:1906.11176. Cited by: Soma puzzle, §5.1.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Tower building.
  • [20] G. Konidaris, L. P. Kaelbling, and T. Lozano-Perez (2018) From skills to symbols: learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research 61, pp. 215–289. Cited by: §2.
  • [21] H. W. Kuhn (1955) The Hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97. Cited by: §3.2.
  • [22] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman (2017) Building machines that learn and think like people. Behavioral and brain sciences 40. Cited by: §2.
  • [23] M. Lázaro-Gredilla, D. Lin, J. S. Guntupalli, and D. George (2019) Beyond imitation: zero-shot task transfer on robots by learning concepts as cognitive programs. Science Robotics 4 (26). Cited by: §2.
  • [24] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager (2017) Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165. Cited by: §3.2, §5.1.
  • [25] L. I. Lieberman and M. A. Wesley (1977) AUTOPASS: an automatic programming system for computer controlled mechanical assembly. IBM Journal of Research and Development 21 (4), pp. 321–333. Cited by: §2.
  • [26] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • [27] C. M. Malcolm (1995) The SOMASS system: a hybrid symbolic and behaviour-based system to plan and execute assemblies by robot. Hybrid Problems, Hybrid Solutions, pp. 157–168. Cited by: §5.4.
  • [28] C. Malcolm and T. Smithers (1990) Symbol grounding via a hybrid architecture in an autonomous assembly system. Robotics and Autonomous Systems 6 (1), pp. 123 – 144. Note: Designing Autonomous Agents Cited by: §5.4.
  • [29] G. Mena, D. Belanger, S. Linderman, and J. Snoek (2018) Learning latent permutations with gumbel-sinkhorn networks. In International Conference on Learning Representations, Cited by: §2, §3.2, §4.1, §4.
  • [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1, §2.
  • [31] J. Munkres (1957) Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5 (1), pp. 32–38. Cited by: §3.2.
  • [32] M. B. Paulus, D. Choi, D. Tarlow, A. Krause, and C. J. Maddison (2020) Gradient estimation with stochastic softmax tricks. arXiv preprint arXiv:2006.08063. Cited by: §2, §3.2.
  • [33] S. Penkov and S. Ramamoorthy (2017) Using program induction to interpret transition system dynamics. arXiv preprint arXiv:1708.00376. Cited by: §2.
  • [34] N. Pochter, A. Zohar, and J. S. Rosenschein (2011) Exploiting problem symmetries in state-based planners. In Twenty-Fifth AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [35] D. A. Pomerleau (1989) Alvinn: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305–313. Cited by: §1, §2.
  • [36] R. J. Popplestone, A. P. Ambler, and I. Bellos (1980) An interpreter for a language for describing assemblies. Artificial Intelligence 14 (1), pp. 79–107. Cited by: §2.
  • [37] R. J. Popplestone, Y. Liu, and R. Weiss (1990-Mar.) A group theoretic approach to assembly planning. AI Magazine 11 (1), pp. 82. Cited by: §2.
  • [38] R. A. Rosu, P. Schütt, J. Quenzel, and S. Behnke (2020-07) LatticeNet: Fast Point Cloud Segmentation Using Permutohedral Lattices. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA. Cited by: §2.
  • [39] F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4950–4957. Cited by: §1.
  • [40] M. I. Velazco, A. R. Oliveira, and C. Lyra (2002) Neural networks give a warm start to linear optimization problems. In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No. 02CH37290), Vol. 2, pp. 1871–1876. Cited by: §2.
  • [41] C. Wu, H. Zhao, C. Nandi, J. I. Lipton, Z. Tatlock, and A. Schulz (2019) Carpentry compiler. ACM Transactions on Graphics 38 (6), Article No. 195. Note: presented at SIGGRAPH Asia 2019. Cited by: §2.

Experimental settings

See below for a summary of experimental settings. All code and experimental data generation scripts will be released after peer review. The accompanying video illustrates failure modes and example use cases.

Tower building

Figure 9: Training data for tower building experiments consists of pick and place action sequences and the corresponding images of completed towers. Action sequences are coloured in accordance with the block they correspond to moving for visual clarity. Our goal is to predict the correct sequence of actions required to reproduce an input image scene.
CNN Encoder

  • 32 5x5 kernels, ReLU
  • 64 5x5 kernels, ReLU
  • 2x2 MaxPool2D
  • 128 5x5 kernels, ReLU
  • 2x2 MaxPool2D
  • 256 5x5 kernels, ReLU
  • 2x2 MaxPool2D
  • Dropout (p=0.5)
  • 128 Neuron FC, ReLU
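As a rough size check on the encoder listed above, the following sketch counts the weights and biases of the four convolution layers. The layer widths and kernel sizes come from the list; the 3-channel (RGB) input is an assumption, since the input format is not stated here. The fully connected layer is left out because its size depends on the image resolution and padding, which are also not given.

```python
# Convolution layers of the encoder as (output_channels, kernel_size) pairs,
# taken from the layer list above.
CONV_LAYERS = [(32, 5), (64, 5), (128, 5), (256, 5)]


def conv_param_counts(in_channels, layers):
    """Return weights plus biases for each 2D convolution layer in turn."""
    counts = []
    for out_channels, k in layers:
        # Each output channel has in_channels * k * k weights and one bias.
        counts.append(out_channels * (in_channels * k * k + 1))
        in_channels = out_channels
    return counts


# Assumption: 3-channel (RGB) input images.
counts = conv_param_counts(3, CONV_LAYERS)
total = sum(counts)
print(counts, total)
```

Under these assumptions the convolution stack holds roughly one million parameters, most of them in the final 256-channel layer.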

Six blocks (2 blue, 2 yellow, 2 red) were used for tower building (baseline and subsets experiments) in CoppeliaSim. The primitive actions pick up a block from a pre-defined start position and place it above the last placed block (no action is taken to place the first block in the sequence). Six uniquely coloured blocks were used for generalisation experiments.

Both Sinkhorn networks and TCNs used a CNN encoder with the parameters listed above. Behaviour cloning and Sinkhorn networks were trained with batch sizes of 16 for 10000 epochs, while the TCNs were trained for 2000 epochs, in both cases using Adam [19] with a learning rate of 3e-4. TCNs used 6 layers of length 6, the maximum action set size.

Soma puzzle

Figure 10: Soma puzzle pieces.

Soma puzzles consist of 7 distinctly shaped blocks, which can be assembled into arbitrarily shaped objects. Here, we consider the task of disassembling a 3x3x3 Soma cube, which can be constructed in 240 distinct ways (ignoring reflections and rotations). Soma solutions were obtained using Polyform Puzzler, a set of solvers for polyforms.

For Soma puzzle extraction planning, a backtracking search was used to find collapse-free extraction orders. Here, actions that trigger collapse were randomly swapped with later actions, using the simulator in the loop to identify collapses. Experiments compared planning speed when search was initialised using a random starting sequence and using sequence predictions from the neural networks.
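The search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `plan_extraction`, `toy_collapses` and the `RESTS_ON` support table are hypothetical names, and the toy support table stands in for the physics simulator that the experiments query. The sketch keeps the first action of the initial sequence whenever it is safe, otherwise swaps in a randomly chosen later action, and backtracks if a choice leads to a dead end.

```python
import random

# Toy stand-in for the simulator (assumption): piece -> pieces directly beneath it.
RESTS_ON = {1: {0}, 2: {0}, 3: {1, 2}, 4: {2}, 6: {5}}


def toy_collapses(removed, piece):
    """Return True if a piece still in the scene rests on `piece`."""
    return any(piece in below and above not in removed
               for above, below in RESTS_ON.items())


def plan_extraction(order, collapses, rng, i=0, stats=None):
    """Backtracking search for a collapse-free order, starting from `order`.

    Returns (plan or None, stats), where stats["queries"] counts collapse checks.
    """
    stats = stats if stats is not None else {"queries": 0}
    if i == len(order):
        return list(order), stats
    later = list(range(i + 1, len(order)))
    rng.shuffle(later)
    # Try the proposed action first, then random swaps with later actions.
    for j in [i] + later:
        stats["queries"] += 1
        if collapses(order[:i], order[j]):
            continue
        trial = list(order)
        trial[i], trial[j] = trial[j], trial[i]
        result, _ = plan_extraction(trial, collapses, rng, i + 1, stats)
        if result is not None:
            return result, stats
    return None, stats  # dead end: the caller backtracks


rng = random.Random(0)
pieces = list(range(7))
random_start = rng.sample(pieces, len(pieces))
plan_a, stats_a = plan_extraction(random_start, toy_collapses, rng)
good_start = [3, 4, 1, 2, 0, 6, 5]  # already collapse-free in the toy model
plan_b, stats_b = plan_extraction(good_start, toy_collapses, rng)
print(stats_a["queries"], stats_b["queries"])
```

An initial sequence that is already collapse-free needs exactly one check per piece, which is why a good prediction can reduce the number of simulator queries relative to a random start.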

Soma puzzle parts were modelled in PyRep [18] using independent cubes, and collapse detection was implemented by checking if any of the cubes had moved after a part had been deleted from the scene.

The same CNN encoder architecture used for tower building was used here, but models were trained with batch sizes of 32, using Adam and a learning rate of 3e-4.


Scrabble

Figure 11: Tile set used for image generation demonstration in CoppeliaSim.

A standard English Scrabble tile set without blanks is used to generate images of random letter combinations. It comprises 98 tiles with letter frequencies a=9, b=2, c=2, d=4, e=12, f=2, g=3, h=2, i=9, j=1, k=1, l=4, m=2, n=6, o=8, p=2, q=1, r=6, s=4, t=6, u=4, v=2, w=2, x=1, y=2, z=1. Since letters can be used multiple times, this introduces additional complexity in the grounding of image components to actions.

Both BC+TCN and BC+Sinkhorn used a ResNet18 encoder [15] for input images and were trained using Adam with a learning rate of 1e-4. Models were trained with a batch size of 64.
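The tile set can be written down as a multiset, which makes the tile count easy to check and gives one plausible way to draw random letter combinations. In this sketch the letter frequencies are those listed above, while `sample_tiles`, drawing without replacement, and the draw size of 7 are assumptions made for illustration; the data generation procedure is not specified here.

```python
import random
from collections import Counter

# Letter frequencies of the tile set (no blanks), as listed above.
TILE_COUNTS = Counter({
    "a": 9, "b": 2, "c": 2, "d": 4, "e": 12, "f": 2, "g": 3, "h": 2, "i": 9,
    "j": 1, "k": 1, "l": 4, "m": 2, "n": 6, "o": 8, "p": 2, "q": 1, "r": 6,
    "s": 4, "t": 6, "u": 4, "v": 2, "w": 2, "x": 1, "y": 2, "z": 1,
})


def sample_tiles(n, rng):
    """Draw n tiles without replacement from the full bag (hypothetical helper)."""
    bag = list(TILE_COUNTS.elements())  # one list entry per physical tile
    return rng.sample(bag, n)


rng = random.Random(0)
draw = sample_tiles(7, rng)  # assumption: 7 tiles per generated image
print(sum(TILE_COUNTS.values()), "".join(draw))
```

Because the bag holds several copies of most letters, a single draw can contain repeated letters, which is the source of the grounding ambiguity noted above.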

TCNs used 6 temporal convolution layers of length 7, the maximum number of actions in an action sequence. Sinkhorn networks used a fixed size latent bottleneck state of 128 dimensions, while TCNs used 7 latent states of 16 dimensions each. Both models were trained for 5000 epochs.

Robot demonstrations were conducted in CoppeliaSim using PyRep, using pre-scripted actions that pick a given tile and place it at a given position, where the position is an offset corresponding to the index of the action in the sequence.