Humans possess a remarkable ability to plan and select actions to rearrange scenes and form infinitely many newly imagined constructs, relying on visual information to guide the process. Taking inspiration from this ability, this paper considers the challenge of learning to sequence a set of high-level actions to solve a task depicted in a single reference input image, using as few demonstrations as possible.
Specifically, we are interested in constrained action sequencing tasks where multiple actions can be selected to solve a task, but each action can only be chosen once. Constraints like these are particularly common in robotics, and are present in virtually all assembly tasks. For example, assembling a stool may require that a seat is attached to four identical legs, using four identical screws. The symmetry present in this task means that there are multiple ways of performing this assembly, resulting in ambiguity in action sequencing. This work shows that existing neural action sequencing approaches fail in this setting, and introduces a model that, by construction, copes with this ambiguity.
Traditionally, action sequencing of this form has been the domain of planning and reasoning, relying on pre-trained perception modules and known transition dynamics, with clearly defined symbolic rules and constraints. More recently, however, our community has focused on data-driven methods that rely on neural-network universal function approximators to build policies trained using a range of mechanisms, from behaviour cloning [35, 39] to deep reinforcement learning [30, 26]. Much of this focus has been on training and learning mechanisms; arguably, there has been less emphasis in robotics on the effect of the architectures we use and the inductive biases therein.
However, the choice of neural architecture strongly dictates the solutions we may find. For example, consider the tower building task in Figure 1, where our robot is required to select and place blocks to build the tower depicted in an input image. One approach to solving this task is to consider tower building as a process of arbitrarily selecting blocks from an existing set (a fully connected neural classification model). Alternatively, we could frame tower building as a sequential process, where block selections are conditioned on previous selections (a neural classification model with a temporal output layer). Unfortunately, neither of these approaches enforces a particularly important constraint that is common to classical planning systems but absent from modern neural models – once a block has been placed, it can no longer be used in the future.
This common-sense constraint is obvious to humans, and it is reasonably clear that we do not build towers by classification, but rather by rearranging or reordering an existing set. This paper explores this permutation perspective of action sequencing, introducing a neural architecture with latent permutations that allows for constrained, variable-length action sequencing from high-dimensional pixel inputs.
We investigate this model using a series of experiments conducted in a behaviour cloning setting, and show that while augmenting existing neural classification models with post-hoc symbolic constraints is reasonably effective at dealing with action re-use constraints in small-scale settings, we gain significant improvements by directly embedding these constraints into neural models using latent permutations. A particularly important finding of this paper is that action sequencing using latent permutations scales to significantly larger action set sizes than standard neural models, and copes well with combinatorially complex settings. In summary, this paper contributes a neural action sequencing architecture based on latent permutations, together with an empirical study showing that it handles action re-use constraints and ambiguity, generalises to unseen configurations, and scales to substantially larger action sets than standard classification or temporal models.
2 Related Work
Robotics has traditionally made a distinction between higher-level symbolic or task-level planning, and control at a behavioural level. The former typically assumes that a domain specific language (DSL) defining objects, predicates, actions or operations, and goals is available. For example, in robotic assembly, the domain in which our work has most relevance, DSLs may include placement actions defined in a given object frame, with contact constraints. Similarly, in carpentry planning this may include materials and parts, along with associated tools and operations that can be applied to these. Here, planning is typically formulated as a constrained search or optimisation problem, which can quickly become computationally expensive in more complex task settings. Early approaches dealing with this search relied on linear programming-like possibility trees or knowledge graphs [36, 37], which try to prune the search space over actions by taking constraints into account. More recently, Lázaro-Gredilla et al. learn concepts as cognitive programs by searching for algorithms that could generate a demonstrated scene, but this approach is limited to very simple scenes due to the need for dedicated scene parsers. In more complex settings it can be particularly challenging and time-consuming to develop a DSL, and inferring states from partially-observable, uncertain environments is non-trivial.
In contrast, data-driven approaches like behaviour cloning or deep reinforcement learning try to avoid the need to specify a DSL or carefully program a robot, generally relying on universal neural network function approximators and substantial amounts of training data to produce suitable policies that act directly on high-dimensional observations. These connectionist models fail to incorporate many of the symbolic or logical constraints that are typically present in robot task planning settings. Existing attempts to extend neural models to handle these types of constraints are often made in a post-hoc fashion (e.g. action clipping, elaboration using auxiliary losses). In this paper, we incorporate symbolic planning ideas around symmetries and permutations, which are often exploited in constraint programming to speed up search [9, 13, 34], into neural models through the use of latent permutations. In so doing, we gain generalisability and an improved ability to handle ambiguities in action selection at scale.
Connectionist policies and model-based symbolic planning systems are by no means incompatible. As a practical middle ground, there has been increasing interest in pruning the search space to speed up planning by using neural networks as universal function approximators. For example, DreamCoder relies on a neural model to propose suitable program structures to speed up search in a program induction setting. A similar technique has been used to interpret transition system dynamics, iteratively refining a priority queue of candidate solutions. Neural surrogate modelling has also been broadly applied to warm start general purpose optimisation procedures [2, 40]. Along these lines, and closest to this work, Driess et al. propose an image-conditioned recurrent model to predict sequences of up to 6 actions, which are then used to speed up symbolic robot planning. Our approach is similar, but, as shown in this work, action sequencing with latent permutations explicitly allows for learning in the presence of action constraints, and significantly outperforms temporal models in settings where action ambiguity may exist and when action set sizes are scaled.
3.1 Problem formulation
This work considers a behaviour cloning setting, where we are required to learn an open-loop, image-conditioned action sequencing policy from demonstrations. More formally, assume that a robot is required to correctly order N action primitives to accomplish some task described by an image depicting a reference state associated with the task. Our goal is to use behaviour cloning to learn to predict an action sequence that will reproduce (or deconstruct) the scene depicted in a query image, using prior training examples comprising action sequences and reference images.
3.2 Baselines: Action sequencing using behaviour cloning and Hungarian assignment
A naive approach to the problem above would be to train a multi-class, multi-label feed-forward convolutional neural network to predict action sequences directly, using a cross-entropy classification loss. Here, each binary label indicates whether a given action is used at a particular step of the demonstrated action sequence, and the corresponding logit is predicted by the neural network.
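As a minimal sketch of this baseline loss (NumPy; the function name, shapes, and example values here are illustrative assumptions, not taken from the paper), treating the network output as one row of logits per sequence step:

```python
import numpy as np

def step_cross_entropy(logits, labels):
    """Mean cross-entropy between per-step action logits and one-hot labels.

    logits: (T, N) array, one row of action scores per sequence step.
    labels: (T, N) binary array; labels[t, a] = 1 iff action a is
            demonstrated at step t.
    """
    # Numerically stable log-softmax over the action dimension of each step.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Cross-entropy: negative log-probability of each demonstrated action.
    return -(labels * log_probs).sum(axis=1).mean()

# Demonstrated sequence: action 2, then action 0, then action 1.
labels = np.eye(3)[[2, 0, 1]]
loss_good = step_cross_entropy(10.0 * labels, labels)     # confident, correct
loss_flat = step_cross_entropy(np.zeros((3, 3)), labels)  # uninformative
```

Note that nothing in this loss prevents the same action from receiving the highest logit at two different steps, which is precisely the failure mode discussed next.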
Since action sequencing is crucial to obtain a desired goal state, failure to predict even a single action correctly will result in failure to complete the task. This makes action sequencing using the behaviour cloning approach described above particularly challenging. Moreover, this naive approach to behaviour cloning is limited as there are no constraints on the model preventing action re-use. This poses problems if action ambiguity is present. For example, in Fig. 1 there are two blocks of each colour, so it is possible that a model lacking the ability to reason about objects or actions already performed would attempt to call actions to pick and place the same object twice when attempting to assemble the tower.
A standard approach to incorporating temporal information like this is to rely on sequence modelling, using recurrence [16, 14] or temporal convolutions [1, 24] in the output sequence prediction, as illustrated in Fig. 2. Models like these have recently been proposed for vision-based action sequencing. However, they do not explicitly incorporate constraints on action re-use, which are common in robotics.
The ability to reason about permutations is valuable across a wide range of tasks in robotics, and is often studied as a balanced linear assignment problem, with the goal of identifying a permutation or assignment matrix that remaps some standard ordering so as to minimise an assignment cost. The Hungarian algorithm [21, 31] is a well-known technique for solving problems of this form in polynomial time. By using the logits of the network to produce an assignment cost, we can apply the Hungarian algorithm to order actions and avoid issues around action re-use. However, if the classifier is overconfident in its predictions (for example, predicting the same action twice with high probability), this assignment operation can introduce additional errors.
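To make this post-hoc step concrete, the sketch below (our own illustrative code, with a brute-force search over permutations standing in for the polynomial-time Hungarian algorithm, so only sensible for small N) assigns each sequence step to a distinct action by minimising a cost built from negated logits:

```python
import itertools
import numpy as np

def assign_actions(logits):
    """Map each sequence step to a distinct action.

    logits: (N, N) array of action scores, one row per step. Negated logits
    act as the assignment cost; we return the permutation with minimal total
    cost. Brute force over permutations stands in for the Hungarian
    algorithm here, so this is only practical for small N.
    """
    cost = -np.asarray(logits)
    n = cost.shape[0]
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)

# An overconfident classifier scores action 0 highest at the first two steps;
# the assignment nevertheless forces every step onto a distinct action.
logits = np.array([[5.0, 1.0, 0.0],
                   [4.0, 0.0, 3.0],
                   [0.0, 2.0, 1.0]])
order = assign_actions(logits)  # each action used exactly once
```

In practice an O(N^3) solver such as SciPy's `linear_sum_assignment` would replace the brute-force search.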
Instead of applying the Hungarian algorithm as a post-hoc assignment stage, it is natural to consider the possibility of learning with embedded inductive biases for action assignment. Recent approaches to differentiable sorting and ranking [5, 3, 29, 32] provide a useful mechanism to learn about permutations in this manner.
4 Action sequencing using Sinkhorn networks
Differentiable sorting networks typically rely on the Sinkhorn operator S, acting on a square matrix X:

S^0(X) = exp(X),
S^l(X) = T_c(T_r(S^{l-1}(X))),
S(X) = lim_{l→∞} S^l(X).

Here, T_c and T_r denote column and row normalisation operations respectively. Mena et al. show that a differentiable approximation to a permutation matrix can be obtained by applying the Sinkhorn operator with a temperature parameter τ, P = S(X/τ), to a square matrix X predicted using a suitable feed-forward neural network. Intuitively, this soft assignment operation can be thought of as the permutation analogue of a softmax operation.
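A minimal NumPy sketch of the Sinkhorn operator (truncated to a fixed number of iterations, as is typical in practice; the function name and temperature handling are our own):

```python
import numpy as np

def sinkhorn(X, n_iters=50):
    """Sinkhorn operator: exponentiate a square matrix, then alternately
    normalise rows and columns. The result approaches a doubly-stochastic
    matrix -- a soft relaxation of a permutation matrix."""
    S = np.exp(X)
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)  # row normalisation T_r
        S = S / S.sum(axis=0, keepdims=True)  # column normalisation T_c
    return S

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
P = sinkhorn(X)              # soft assignment: rows and columns sum to 1
P_sharp = sinkhorn(X / 0.1)  # lower temperature sharpens the assignment
```

Dividing by a temperature before applying the operator pushes the soft assignment towards a hard permutation, mirroring the behaviour of a low-temperature softmax.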
4.1 Image conditioned action sequencing
We make use of Sinkhorn networks to sequence robot actions by training a feed-forward convolutional neural network to predict a square matrix, using the Sinkhorn operator (with Gumbel-Matching) to determine a permutation. This network can be trained to minimise a mean squared error loss between sequenced actions: the permutation is applied to a one-hot encoded base action sequence order, and compared against a one-hot encoded demonstrated action sequence sampled from the training set. After training, action sequencing occurs by predicting a permutation matrix and using the Hungarian algorithm for hard assignment.
By construction, a permutation is unable to re-use an action, which forces the network to learn to deal with action ambiguities. Our hypothesis is that behaviour cloning models trained with explicit inductive biases towards permutations will be better suited to constrained action sequencing than feedforward and temporal convolutional neural networks.
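The training objective described above can be sketched as follows (NumPy; matrix names and toy values are illustrative assumptions). The soft permutation is applied to a fixed one-hot base ordering and compared against the demonstrated sequence with a mean squared error:

```python
import numpy as np

def sequencing_loss(P_soft, demo_seq, base_order):
    """MSE between a one-hot demonstrated action sequence and the (soft)
    permutation applied to a fixed base ordering. All arguments are (N, N)
    matrices: P_soft is doubly stochastic, the others one-hot per row."""
    return np.mean((demo_seq - P_soft @ base_order) ** 2)

base = np.eye(3)               # canonical action order: 0, 1, 2
demo = np.eye(3)[[2, 0, 1]]    # demonstration: actions 2, 0, 1
P_true = np.eye(3)[[2, 0, 1]]  # the permutation reproducing the demonstration

loss_correct = sequencing_loss(P_true, demo, base)   # zero
loss_wrong = sequencing_loss(np.eye(3), demo, base)  # identity ordering fails
```

Because any hard permutation uses each action exactly once, minimising this loss never rewards action re-use, in contrast to the classification losses above.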
4.2 Coping with action subsets
None of the approaches described above can deal with restricted subsets of actions. For example, building a tower using 3 blocks does not require that all actions be used, but the models above all assume that a fixed number of actions is required to complete a task.
We extend the model above to handle action subsets using an auxiliary stopping network that predicts the number of actions required to complete the task depicted in a given image. This network is trained using a standard cross-entropy classification loss.
The extension to subsets requires that we modify the loss in (7) to allow for variable-length action sequences. We accomplish this by masking the predicted and ground truth sequences in the respective loss functions. Fig. 3 provides an overview of the proposed action sequencing model. A Sinkhorn network is used to predict permutations over action sequences conditioned on a reference scene, and a masking network restricts the sequence of actions to only the subset required to complete the referenced task.
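A sketch of this masking (NumPy; function and variable names are ours), where a predicted sequence length from the stopping network zeroes out the contribution of later steps:

```python
import numpy as np

def masked_mse(pred_seq, demo_seq, length):
    """MSE over only the first `length` steps of the predicted and
    demonstrated one-hot sequences; steps beyond `length` are ignored."""
    mask = (np.arange(pred_seq.shape[0]) < length)[:, None]
    return np.sum(mask * (pred_seq - demo_seq) ** 2) / mask.sum()

demo = np.eye(4)[[1, 3, 0, 2]]  # full-length demonstration
pred = np.eye(4)[[1, 3, 2, 0]]  # agrees only on the first two steps

loss_short = masked_mse(pred, demo, length=2)  # later errors masked out
loss_full = masked_mse(pred, demo, length=4)
```

With a predicted length of 2, the disagreement at later steps contributes nothing to the loss, so a short tower is not penalised for unused actions.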
5 Experimental Results
We start by investigating the effects of including inductive biases for permutations in the neural network used for behaviour cloning for our running tower stacking example.
5.1 Fixed length action sequencing
Here, robot actions consist of picking up a block from a known location and placing it on the previously placed block. Once a block has been placed, this action should no longer be called, as doing so would demolish the tower. We collect 300 tower stacking demonstrations with randomly ordered action sequences of length 6 using CoppeliaSim and PyRep, and save the corresponding image of the completed tower (see Fig. 1).
We train on 200 demonstrations and evaluate on 100 held out demonstrations, using the mean average precision (number of times a block of the correct colour was correctly selected for placement) between predicted and ground truth action sequences. This experiment is repeated 20 times for models trained using different random seeds.
Models compared include direct behaviour cloning (BC) with a fully connected output layer, the same behaviour cloning network with post-hoc assignment using the Hungarian algorithm (BC+Hungarian), and action sequencing using Sinkhorn networks (BC+Sinkhorn). In addition, we also investigate behaviour cloning using a temporal convolutional neural network decoder (BC+TCN / BC+TCN+Hungarian), a state-of-the-art sequence modelling approach commonly used for action recognition, which enforces temporal structure in the predicted output sequence.
Fig. 4 shows kernel density estimates over the mean average precisions in block predictions over multiple seeds, while Table 1 shows the average number of times a block selection action was re-used (the tower collapses upon action re-use). It is clear that the inductive bias towards permutation prediction introduced using Sinkhorn networks improves action sequencing.
Table 1: Colour precision (mean, std. dev.) and action repetitions (tower collapses) for each model.
While post-hoc Hungarian algorithm assignment remedies problems where the same action is selected multiple times, and provides a substantial improvement over direct behaviour cloning, both approaches fail dismally at image-conditioned tower building. Explicitly modelling temporal action sequence behaviours using temporal convolutional neural networks improves performance, but is still outperformed by BC+Sinkhorn.
5.2 Generalisation to unseen configurations
In order to investigate the reasons for performance differences between BC+TCN and BC+Sinkhorn, we explore the generalisation capabilities of the action sequencing models using a modified tower building experiment, with 6 uniquely coloured blocks (avoiding the potential for action ambiguity). We generate a single demonstration pair for each of the 720 possible tower permutations, and train models on increasing numbers of demonstrations. We then test on all 720 demonstrations.
If the behaviour cloning models are capable of generalisation to unseen tower permutations, we would expect the precision of predicted action sequences to be better than a baseline approach of random ordering for unseen permutations and perfect ordering for previously seen permutations.
Fig. 5 shows these results. The dashed line shows the hypothetical precision for action sequences that would be obtained by an approach that memorises previously seen action sequences and randomly guesses orders for unseen sequences. Both BC+Sinkhorn and BC+TCN networks are able to generalise to previously unseen action sequence configurations, and perform similarly in this setting.
This contrasts with the previous set of experiments and indicates that the primary advantage BC+Sinkhorn has over BC+TCN is in dealing with symmetries that arise due to action ambiguities, where more than one action can reproduce a tower, and it becomes significantly more important to reason about prior actions taken in a sequence. When there is no ambiguity in actions, which is rarely the case in robotics applications, TCNs perform similarly to Sinkhorn networks.
5.3 Variable length action sequencing
We investigate variable length action sequencing using a third and final tower building experiment (with actions as in Figure 1, but with variable length action sequences – tower heights ranging from 2 to 6 blocks). As before, all possible permutations of demonstrations are generated (1950), and models are trained on increasing numbers of demonstrations. Since variable length sequences are required, we make use of the stopping mask extension of Section 4.2 for the BC+Sinkhorn networks. We also compare against BC+TCN+Hungarian, extended to deal with variable length sequences through the inclusion of a stopping action class.
Fig. 6 shows these results. As before, the dashed line shows the hypothetical precision for action sequences that would be obtained by an approach that memorises previously seen action sequences and randomly guesses orders for unseen sequences. Interestingly, when training using subsets, we obtain more rapid generalisation than in the seemingly simpler case investigated earlier. This occurs because it becomes increasingly more likely that action subsets have been seen within demonstrations as more training data is used.
The task of classifying the number of actions required for a subset is relatively simple for this tower building task, and the stopping mask prediction network successfully identifies the number of actions required to build a tower after approximately 200 demonstrations have been seen. As in Section 5.1, BC+Sinkhorn networks perform better than BC+TCN+Hungarian networks, since they allow for action ambiguity and explicitly take prior action use into consideration.
5.4 Soma puzzle: initialising plans with sequence predictions
The ability to reason about action sequence permutations is particularly important in assembly or disassembly tasks. We investigate the ability of behaviour cloning to solve more complex tasks using a Soma puzzle. As illustrated in Fig. 7, Soma puzzles consist of 7 distinctly shaped blocks, which can be assembled into arbitrarily shaped objects. Here, we consider the task of disassembling a 3×3×3 Soma cube, which can be constructed in 240 distinct ways (ignoring reflections and rotations).
The Soma puzzle has a long history in robotics and robot learning [28, 27], and has been the subject of extensive research due to the complexity of shapes that can be constructed using it. The geometry of the puzzle parts means that disassembling the puzzle using pre-scripted actions requires that parts be removed in a precise order. Failure to do so will result in the puzzle collapsing, placing the environment in a state where pre-scripted actions can no longer be used. Correctly predicting the order of part extraction from images of the cube is challenging, as it requires a model that can reason about how parts interlock and their relative positioning.
This problem is also a challenge for traditional planning algorithms, requiring a careful, and non-trivial, specification of relationships between components and problem constraints and a backtracking search over numerous possible action sequences. To investigate this setting, models were trained using a dataset comprising images of the 240 possible initial puzzle configurations, and a manually defined extraction order for each puzzle, randomly (repeated 100 times with different seeds) split into 120 training and validation examples, and 120 test examples.
Table 2: Collapse percentage and planning iterations (mean, std. dev.) for each model.
As shown in Table 2, despite outperforming the temporal convolutional architecture (BC+TCN+Hungarian), Sinkhorn behaviour cloning (BC+Sinkhorn) is still only successful on about half of the test cases when predicted action sequences are directly applied. However, when the predicted action sequences are used to initialise a suitable planning algorithm (a backtracking search using the simulator in the loop to test for failures), there are substantial gains in planning time, with a clear reduction in the number of iterations used to search for a suitable planning order.
In this case, it seems that there is limited difference between the TCN and Sinkhorn models, as both provide good initial guesses that speed up planning. However, as will be shown next, results on larger action sets indicate that Sinkhorn networks scale far better than TCNs, and these performance differences become substantially more pronounced as more actions are considered.
5.5 Scrabble: scaling to larger action sets
A simplified Scrabble setting is used to evaluate the ability of BC+Sinkhorn to scale to larger action sets. Here, a standard English Scrabble tile set is used to generate 10,000 images of random letter combinations of lengths 3 to 6, sampled from an increasingly large subset of the full tile set. We test on a set of 5,000 randomly generated words.
Fig. 8 shows the decrease in performance (mean average spelling precision) as the number of actions used to generate training and test data is increased. BC+Sinkhorn shows extremely impressive scalability, with only small decreases in performance as the action set size increases. This could be remedied by additional training data, although training becomes time-consuming with larger action sets (as the number of actions increased, we observed that we needed to train for substantially longer before reaching convergence, with our largest model, using 98 actions, requiring approximately 5000 epochs to converge). In contrast, BC+TCN+Hungarian networks become increasingly unreliable as the action set size is increased, with significant performance drops. This happens because symmetries arising from action ambiguities become increasingly likely as action sets are scaled, and classification-based approaches are unable to deal with these ambiguities.
6 Conclusion
This paper introduces a permutation prediction approach to vision-based neural action sequencing. Action sequencing using latent permutations predicted by Sinkhorn networks is most effective in tasks where multiple actions can potentially lead to a desired state, and where there are constraints on the number of times an action can be used. Results show that neural action sequencing provides valuable improvements in planning speed when used to initialise planning algorithms, and experiments show that impressive generalisation can be obtained using these networks. Importantly, Sinkhorn networks are able to scale to far greater action set sizes than temporal convolution networks. Temporal convolution and Sinkhorn networks are similar-capacity models, and there are no major computational differences in the forward pass, which makes latent permutations a promising and useful approach to sequence modelling in robotics.
This research was supported by the Alan Turing Institute, as part of the Safe AI for surgical assistance project. We are grateful to Yordan Hristov and members of the Robust Autonomy and Decisions Group for discussions and feedback.
- (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
- (2020) Learning-based warm-starting for fast sequential convex programming and trajectory optimization.
- (2020) Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871.
- Show, control and tell: a framework for generating controllable and grounded captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pp. 6858–6868.
- (2016) Incremental task and motion planning: a constraint-based approach. In Proceedings of Robotics: Science and Systems, Ann Arbor, Michigan.
- (2020) Deep visual reasoning: learning to predict action sequences for task and motion planning from an initial scene image. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA.
- DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. arXiv abs/2006.08381.
- (2002) Extending the exploitation of symmetries in planning. In AIPS, pp. 83–91.
- PDDL2.1: an extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research 20, pp. 61–124.
- Clipped action policy gradient. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, Vol. 80, pp. 1592–1601.
- (1972) Mathematical games. Scientific American 227 (3), pp. 176–184.
- (2006) Chapter 10 – Symmetry in constraint programming. In Handbook of Constraint Programming, Foundations of Artificial Intelligence, Vol. 2, pp. 329–376.
- Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2020) Elaborating on learned demonstrations with temporal logic specifications. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA.
- (2019) PyRep: bringing V-REP to deep robot learning. arXiv preprint arXiv:1906.11176.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2018) From skills to symbols: learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research 61, pp. 215–289.
- (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1-2), pp. 83–97.
- (2017) Building machines that learn and think like people. Behavioral and Brain Sciences 40.
- (2019) Beyond imitation: zero-shot task transfer on robots by learning concepts as cognitive programs. Science Robotics 4 (26).
- (2017) Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165.
- (1977) AUTOPASS: an automatic programming system for computer controlled mechanical assembly. IBM Journal of Research and Development 21 (4), pp. 321–333.
- (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- (1995) The SOMASS system: a hybrid symbolic and behaviour-based system to plan and execute assemblies by robot. Hybrid Problems, Hybrid Solutions, pp. 157–168.
- (1990) Symbol grounding via a hybrid architecture in an autonomous assembly system. Robotics and Autonomous Systems 6 (1), pp. 123–144.
- (2018) Learning latent permutations with Gumbel-Sinkhorn networks. In International Conference on Learning Representations.
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- (1957) Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5 (1), pp. 32–38.
- (2020) Gradient estimation with stochastic softmax tricks. arXiv preprint arXiv:2006.08063.
- (2017) Using program induction to interpret transition system dynamics. arXiv preprint arXiv:1708.00376.
- (2011) Exploiting problem symmetries in state-based planners. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
- (1989) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pp. 305–313.
- (1980) An interpreter for a language for describing assemblies. Artificial Intelligence 14 (1), pp. 79–107.
- (1990) A group theoretic approach to assembly planning. AI Magazine 11 (1), pp. 82.
- (2020) LatticeNet: fast point cloud segmentation using permutohedral lattices. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA.
- (2018) Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4950–4957.
- (2002) Neural networks give a warm start to linear optimization problems. In Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), Vol. 2, pp. 1871–1876.
- (2019) Carpentry compiler. ACM Transactions on Graphics 38 (6), Article 195.
See below for a summary of experimental settings. All code and experimental data generation scripts will be released after peer review. The accompanying video illustrates failure modes and example use cases.
CNN encoder architecture:
- 32 5×5 kernels, ReLU
- 64 5×5 kernels, ReLU
- 128 5×5 kernels, ReLU
- 256 5×5 kernels, ReLU
- 128-neuron fully connected layer, ReLU
A set of 6 blocks (2 blue, 2 yellow, 2 red) were used for tower building (baseline and subsets experiments) in CoppeliaSim, with primitive actions to pick up a block from a pre-defined start position and place it above the last placed block (no action is taken to place the first block in the sequence). 6 uniquely coloured blocks were used for generalisation experiments.
Both Sinkhorn and TCN models used the CNN encoder with the parameters listed above. Behaviour cloning and Sinkhorn networks were trained with batch sizes of 16 for 10000 epochs, while the TCNs were trained for 2000 epochs, both using Adam and a learning rate of 3e-4. TCNs used 6 layers of length 6, the maximum action set size.
Soma puzzles consist of 7 distinctly shaped blocks, which can be assembled into arbitrarily shaped objects. Here, we consider the task of disassembling a 3×3×3 Soma cube, which can be constructed in 240 distinct ways (ignoring reflections and rotations). Soma solutions were obtained using Polyform Puzzler (http://puzzler.sourceforge.net), a set of solvers for polyforms.
For Soma puzzle extraction planning, a backtracking search was used to find collapse free extraction orders. Here, actions that trigger collapse were randomly swapped with later actions, using the simulator in the loop to identify collapses. Experiments compared planning speed when search was initialised using a random starting sequence and using sequence predictions from the neural networks.
Soma puzzle parts were modelled in PyRep  using independent cubes, and collapse detection was implemented by checking if any of the cubes had moved after a part had been deleted from the scene.
The same CNN encoder architecture used for tower building was used here, but models were trained with batch sizes of 32, using Adam and a learning rate of 3e-4.
A standard English Scrabble tile set (without blanks) – 98 tiles comprising alphabet letters with frequencies a=9, b=2, c=2, d=4, e=12, f=2, g=3, h=2, i=9, j=1, k=1, l=4, m=2, n=6, o=8, p=2, q=1, r=6, s=4, t=6, u=4, v=2, w=2, x=1, y=2, z=1 – is used to generate images of random letter combinations. Since letters can be used multiple times, this introduces additional complexity in the grounding of image components to actions.
Both BC+TCN and BC+Sinkhorn used a Resnet18 encoder  for input images and were trained using Adam with a learning rate of 1e-4. Models were trained with a batch size of 64.
TCNs used 6 temporal convolution layers of length 7, the maximum number of actions in an action sequence. Sinkhorn networks used a fixed size latent bottleneck state of 128 dimensions, while TCNs used 7 latent states of 16 dimensions each. Both models were trained for 5000 epochs.
Robot demonstrations were conducted in CoppeliaSim using PyRep, with pre-scripted actions (pick a tile, then place it at a position offset corresponding to its step in the action sequence).