Modeling Long-horizon Tasks as Sequential Interaction Landscapes

by Sören Pirk, et al.

Complex object manipulation tasks often span long sequences of operations. Task planning over long time horizons is a challenging and open problem in robotics, and its complexity grows exponentially with the number of subtasks. In this paper we present a deep learning network that learns dependencies and transitions across subtasks solely from a set of demonstration videos. We represent each subtask as an action symbol (e.g. move cup), and show that these symbols can be learned and predicted directly from image observations. Learning from demonstrations and visual observations are the two main pillars of our approach. The former makes learning tractable, as it provides the network with information about the most frequent transitions and the relevant dependencies between subtasks (instead of exploring all possible combinations), while the latter allows the network to continuously monitor the task progress and thus to interactively adapt to changes in the environment. We evaluate our framework on two long-horizon tasks: (1) block stacking of puzzle pieces executed by humans, and (2) a robot manipulation task involving pick-and-place of objects and sliding a cabinet door, executed on a 7-DoF robot arm. We show that complex plans can be carried out when executing the robotic task, and that the robot can interactively adapt to changes in the environment and recover from failure cases.




I Introduction

Enabled by advances in sensing and control, robots are becoming more capable of performing intricate tasks in a robust and reliable manner. In recent years, learned policies for control in robotics have shown impressive results [13]. However, learning a single black-box function mapping from pixels to controls is quite challenging. In particular, complex manipulation tasks have to deal with a diverse set of objects, their locations, and how they are manipulated. Simultaneously reasoning about both the ‘what’ (e.g. which object) and the ‘how’ (e.g. how to grasp it) is a challenging problem. Additionally, due to the long time horizon of many tasks, the model can only observe a small portion of the full task at any given time. This partial observability increases with longer tasks and higher complexity.

Current learning-based planning approaches either focus on object representations [22], on learning sequences of symbols without rooting the plans in the actual environment [57], or generate plans based on explicit geometric representations of the environment [7, 49]. Formulating plans without feedback from the environment does not easily generalize to new scenes and is inevitably limited to static object arrangements. Generating plans based on image data, e.g. by predicting future images, limits the planning horizon to only a few steps [13]. More recently, reinforcement learning approaches have shown initial success in solving robot manipulation tasks [25, 45]; however, the end-to-end learning of long-horizon, sequential tasks remains challenging.

Fig. 1: Robot performing a long-horizon manipulation task of re-arranging objects inside a cabinet. We decompose these tasks into a sequence of abstract actions, e.g. ‘close door’, ‘move cup’, etc. Thus, a task spanning hundreds of image frames can be compactly summarized as a sequence of abstract symbols, which (1) can be accurately predicted at execution time; and (2) can be precisely executed by a robot.

In this paper, we propose a two-layer representation of complex tasks, in which a set of abstract actions, or sub-tasks, serves as an intermediate representation (see Fig. 1). Each action is represented by a symbol that describes, in an abstract manner, what needs to happen to complete a sub-task (e.g. move cup). This discretization allows us to reason about the structure of tasks without being confronted with the intricacies of real environments and the related physics (e.g. object pose).

Each symbol is then used to select an individual policy that describes how an object, or the agent itself, needs to be manipulated toward the higher-level goal. When executing an action we can then consider the complexity imposed by the real scene, such as finding an object or identifying its pose to grasp it. Our goal is to execute complex and long-horizon tasks by learning the sequential dependencies between task-relevant actions. To learn sequences of sub-tasks while respecting changes in the scene, we employ a sequence-to-sequence model, commonly used in natural language processing, to translate sequences of image embeddings to action symbols [54, 8].

We test the capabilities of sequence prediction by evaluating our framework in two environments. First, we use a robot arm to manipulate objects in an office environment, where the goal is to find objects in a cabinet, perform operations on them, and move them back into the cabinet. In the environment shown in Fig. 1 the task is to find a cup, put a ball in the cup, and move both objects together back into the cabinet. Different sequences of sub-tasks can lead to a successful completion of the task. For example, while the robot has to first open the cabinet door, it can then either move the cup or the ball outside the cabinet, to eventually put the ball in the cup and both objects back into the cabinet. For the second dataset, we have a human perform a stacking task that requires moving blocks from an initially random configuration into three stacks of blocks.

We evaluate and discuss the success of these experiments and demonstrate that using action symbols allows us to organize tasks as different sub-tasks. We empirically evaluate our model, both in an offline and an online fashion, on two manipulation tasks. In summary, our contributions are:

  • we propose a deep learning network that learns dependencies and transitions across subtasks as action symbols solely from a set of demonstration videos;

  • we introduce a framework that integrates our proposed approach into existing state-of-the-art robotics work on motion primitives [51];

  • we evaluate the learned sequence model on two long-horizon tasks, showing that sequences of action symbols can be predicted directly from image observations and be executed in a closed-loop setting by a robot.

II Related Work

The topic of long-term planning has received a considerable amount of attention in the past. We provide an overview of related work in the field, with a focus on planning and learning-based approaches.

Motion and Manipulation Planning: traditionally, approaches for planning have focused on computing trajectories for robotic motion to, for example, arrange objects in specified configurations [36]. Many approaches jointly solve for task planning and motion to enable more informed robotic behavior. Examples include explicitly modeling geometric constraints as part of the task planning [32], leveraging physics-based heuristics [2, 55], hierarchical planning [24, 36], probabilistic models [31, 17], and integrating action parameters as part of learning forward models [20]. More recently, various approaches have started to explicitly focus on planning with neural networks. As an example, Zhang et al. [60] propose to use user-defined attributes in environments to then learn policies that enable transitioning between these features of interest.

Hierarchical and Symbolic Planning: it has been recognized that abstract symbols can serve as a meaningful representation to structure tasks [7, 49] and to organize their often hierarchical properties [24]. In the work of Ortehy et al. [46], motion primitives – represented as symbols – are optimized on a geometric level to validate predictions. Garrett et al. [15] combine symbolic planning with heuristic search to efficiently perform task and motion planning, while Muxfeldt et al. [40] introduce a hierarchical decomposition of assembly operations into human-understandable states that formally describe problems during the assembly process. More recently, Xu et al. [59] propose to hierarchically decompose tasks into sub-task specifications and then learn task-conditioned policies to interact with objects in an environment.

Learning Skills: it has been shown that a single task can be represented by hand-crafted state machines, where motion primitives allow transitioning between individual states [6, 12, 52]. While these approaches provide a powerful means to represent an agent’s behavior, their specification needs to be adapted manually for individual tasks, which prevents their use at scale. To address this issue, there has been a number of hierarchical imitation learning approaches [16, 53, 14, 29] that focus on segmenting long-horizon tasks into subcomponents. As an alternative, reinforcement learning approaches aim at learning policies in an end-to-end manner by obtaining task-relevant features from example data [62, 38, 33]. Combined with deep neural networks, this has shown to be a promising direction for learning object assembly [21] or more advanced motion models [3]. Another direction is to learn policies for robot-object interaction from demonstrations, with the goal of reducing uncertainty with expert demonstrations [64, 30, 44, 42, 43]. Despite these promising efforts, learning policies from demonstrations that generalize to new tasks is still an open research problem.

Understanding agent-object interactions: understanding object motion and object-agent interactions enables reliably learning agent behavior to manipulate objects. A number of approaches focus on building physically-plausible representations of scenes and agents, also considering inputs from multiple sensors. Multi-modal models have been explored for predicting the success of liquid pouring [58] and assembly tasks [11, 23]. Luo et al. [37] go even further and propose an RFID-based localization framework that allows objects to be tracked accurately. Object-centric representations, learned from visual data, can serve as a powerful means to understand physical agent and object interactions [63, 10]. As a recent example, Janner et al. [22] propose an object-oriented prediction and planning approach to model physical object interactions for stacking tasks. However, obtaining meaningful signals from visual data is often difficult in real-world settings due to cluttered scenes and occlusions.

Action Recognition: a large body of work focuses on capturing motion and actions from sensor data, ranging from manually defined approaches [56] to video classification [41, 26] and activity recognition [5] with Convolutional Neural Networks. Other lines of work focus on learning similarity metrics [35, 61, 9], attention [47], or even on identifying grammars for describing actions in videos [39, 48, 50]. More recently, Lea et al. [34] introduce a unified approach for action recognition based on hierarchical relationships at different time-scales, while Ahsan et al. [1] propose a self-supervised method to identify the spatio-temporal context in videos.

Unlike existing work that mostly focuses on learning action symbols implicitly – e.g. as latent variables – we represent actions explicitly, which in turn provides more semantics of a task. Furthermore, we learn the action symbols directly from sequences of images. This makes it possible to infer the correct order of actions necessary to complete a task, while our method also allows us to respond to changes in the environment. Each individual action is then executed with an individual policy.

III Method

Our main goal is to learn the sequential structure of tasks by factorizing them into task-relevant actions. This is motivated by the observation that many tasks are combinatorial as well as continuous. They are combinatorial in that an agent has to select among a discrete set of objects to perform a task; for example, a stacking task requires arranging a number of objects. However, an agent also has to operate in a physical environment that requires interacting with objects in continuous ways.

Optimizing for both of the aforementioned factors to perform long-term planning is challenging due to the uncertainty imposed by the actual scene. Therefore, to perform long-term planning, we first factorize long-horizon tasks into a discrete set of actions. These actions represent what needs to happen to complete a sub-task, but at a high level of abstraction and without any notion of how an agent has to perform the action. For example, an action might just be ‘move cup’. Second, once a task is structured into task-relevant actions, we use expert policies obtained from learned demonstrations to perform the individual actions.

We propose to use a set of action symbols as an abstract representation of sub-tasks. These symbols represent basic actions, such as ‘open door’, ‘move cup’, ‘put ball’, etc., and are manually defined for different tasks (see Table II). Sequences of symbols are intended to provide an abstraction of the task that can be learned, predicted, and then executed on a robot. We denote the set of symbols as A.

Action symbols are used in two ways: first, we train a single-frame action classifier that allows us to generate embeddings of images. Second, we train an encoder-decoder sequence-to-sequence model that translates sequences of image embeddings to sequences of action symbols. Together, both models allow us to predict the next action based on the current state of the scene as well as on which sub-tasks have already been completed. In the following we describe both models.
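The composition of the two models can be sketched as follows. This is a minimal illustration only: the stand-in functions below are hypothetical placeholders (the actual models are a ResNet50 classifier and an LSTM sequence-to-sequence network), and only the sliding-window composition reflects the described pipeline.

```python
from collections import deque

def embed_frame(frame):
    # Stand-in for the action classifier's 32-d embedding layer
    # (here: a single hand-crafted feature, purely for illustration).
    return [sum(frame) / max(len(frame), 1)]

def translate(window):
    # Stand-in for the seq2seq model: maps a window of embeddings to the
    # next action symbol (here: a trivial threshold rule).
    return "A" if window[-1][0] > 0.5 else "_"

SL = 10  # sequence length of the embedding window

def predict_actions(frame_stream):
    # Maintain a sliding window of embeddings and emit one symbol per frame.
    window = deque(maxlen=SL)
    predicted = []
    for frame in frame_stream:
        window.append(embed_frame(frame))
        predicted.append(translate(list(window)))
    return predicted
```

In the real system, `translate` carries recurrent state across the window, which is what lets it smooth over single-frame ambiguities of the classifier.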

III-A Action Recognition

To obtain a representation of the scene as well as of ongoing actions, we train a convolutional neural network as an action recognition model. Specifically, we use a ResNet50 [19] backbone with one extra dense layer (32 dimensions) to extract image features, and another dense layer followed by a softmax to finetune the network on action symbols as labels. We train this model as a single-image action predictor on images of sequences, where each image is labeled with an action symbol. Action recognition based on a single frame is a challenging problem, as an action shown in a single image can be ambiguous; e.g. reaching toward a cup looks the same as moving away from it. However, our goal is not to use the resulting classification of this model, but instead to use the resulting embedding as input to our sequence-to-sequence model. The sequence-to-sequence model then translates the produced embeddings to action symbols. Furthermore, as the sequence-to-sequence model maintains an internal state, it can resolve ambiguities introduced by wrongly predicted action symbols of the action classifier. Fig. 3 provides an overview of how the action classifier and the sequence-to-sequence model are connected.

Fig. 2: We train an action classifier based on ground truth action symbols to obtain image embeddings. The embeddings are then used as input to the sequence-to-sequence model that translates the embeddings to action symbols. The sequence-to-sequence model is trained on sub-sequences of fixed sequence length (SL) and so as to predict the next (future) action based on a sequence of previous embeddings.
Fig. 3: Encoder-Decoder architecture of a sequence-to-sequence model: image embeddings serve as input for the encoder. The encoder converts the embeddings to a state vector (S). The decoder uses the state vector and a symbol sequence to predict the output sequence one symbol ahead of the current time step.


III-B Action-centric Representation

Fig. 4: Pipeline of our framework: we use an action classifier to obtain image embeddings of an incoming sequence of frames. A sequence of embeddings is used as the input of a sequence-to-sequence model that translates the embeddings to a sequence of action symbols. The sequence-to-sequence model is trained to predict the next action symbol based on a sequence of previous image embeddings. The predicted next action symbol is then passed to a low level controller that selects a corresponding policy to perform the action. Policies are parameterized by the pose of individual objects that we identify through object detection.

We use sequence models [54, 8] to predict future action symbols given a history of image embeddings. Given a sequence of image embeddings e_1, …, e_t up to the current time t, we predict the next k action symbols a_{t+1}, …, a_{t+k}:

p(a_{t+1}, …, a_{t+k} | e_1, …, e_t)

We cast the above formulation as a ‘translation’ of image embeddings to an action symbol sequence. Therefore, we employ a sequence-to-sequence model [54], an established neural translation formulation, in which we map the embedding sequence to an action sequence. In more detail, the sequence-to-sequence model consists of an encoder and a decoder LSTM. The encoder consumes the input sequence of image embeddings and encodes it into a single vector, which is subsequently decoded into an action symbol sequence by a second LSTM (see Fig. 3). Using image embeddings as high-dimensional continuous inputs is one of the major differences to the original translation application of the above model.

Fig. 5: Two example sequences of a block stacking task: we associate action symbols with every frame of the sequences (a-e) and (f-j). An action symbol represents the action in an abstract way (e.g. move red block). For the frames (a-e) and (f-j) we show where in the sequence this action was performed. The order of actions can vary, but sequential dependencies exist.

Learning the sequential structure of tasks based on image embeddings and action symbols makes it possible to perform tasks in varying combinations of sub-tasks, depending on a given scene configuration. For example, the stacking task shown in Fig. 5 requires stacking colored blocks in a specific configuration. Two blocks (red, yellow) need to be in place before other blocks (pink, green) can be stacked on top of them. Given this task description, the task can be performed in different orders. For example, the blue block can be placed independently of the other blocks, while the green and pink blocks depend on the yellow and red blocks, respectively.
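These precedence constraints amount to a partial order over the symbols, which can be sketched as a small validity check over compact symbol sequences. The dependency table below simply restates the constraints from the text using the block symbols of Table II; it is an illustration, not part of the learned model.

```python
# Green (G) requires yellow (Y) to be placed first;
# pink (P) requires red (R) first. Blue (B) is unconstrained.
PRECEDES = {"G": "Y", "P": "R"}  # key depends on value

def plan_is_valid(symbols):
    # Check that every symbol's dependency appears earlier in the plan.
    seen = set()
    for s in symbols:
        dep = PRECEDES.get(s)
        if dep is not None and dep not in seen:
            return False  # dependency not yet placed
        seen.add(s)
    return True
```

Several orderings satisfy the constraints (e.g. `"BYRGP"` or `"_Y_B_G_R_P_"`), which is exactly why the sequence model must learn dependencies rather than a single fixed ordering.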

III-C Performing Actions

To perform actions, we model each action symbol as a motion primitive [51]. A motion primitive is a parameterized policy for performing an atomic action, such as grasping, placing, etc. Primitives can be used as building blocks that can be composed, for example by a state machine [27], to enable more advanced robot behavior. Considering the task of putting an object into a cabinet, the required motion primitives are: grasping, opening/closing the cabinet, and placing. The state machine is used for sequencing the primitives based on the world state. Initially it triggers the cabinet-opening primitive. Upon its success, it switches to the grasping primitive and conditions it on the particular object that needs to be grasped. It then proceeds with the placing primitive, followed by the cabinet-closing primitive. In case of a failure, the state machine switches the primitive to recover from the error. Note that the use of a state machine implicitly requires access to a success detection module in order to properly transition from one primitive to another.

The idea of using a state machine together with motion primitives fits well with the approach proposed in this paper. Our symbol prediction network replaces the state machine and the success detection module. Each of our action symbols corresponds to a motion primitive; hence we have separate primitives to grasp a cup, grasp a ball, move a cup, move a ball, slide the door, and so on. Note that, without loss of generality, we decided to use different grasping/moving primitives for each object to simplify the run-time execution. Alternatively, all grasping primitives could be unified into one grasping policy for multiple objects, e.g. cup and ball.
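The symbol-to-primitive mapping can be sketched as a simple dispatch table. The primitive functions here are illustrative stubs under our own naming; in the actual system each symbol maps to a learned DSP parameterized by a detected object pose.

```python
# Illustrative primitive stubs; the real primitives are learned DSPs.
def grasp(pose):
    return ("grasp", pose)

def place(pose):
    return ("place", pose)

def slide_door(pose):
    return ("slide", pose)

# Symbols as defined for the manipulation dataset (Table II).
PRIMITIVES = {
    "A": grasp,       # move cup
    "B": grasp,       # move ball
    "C": place,       # move ball into cup
    "E": slide_door,  # open door
    "F": slide_door,  # close door
}

def execute(symbol, object_pose):
    # Look up and run the primitive for the predicted action symbol.
    return PRIMITIVES[symbol](object_pose)
```

A dispatch table like this is what makes the unification mentioned above cheap: pointing several symbols at one shared grasping policy only changes table entries, not the prediction network.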

We model each of our motion primitives as a dynamical systems policy (DSP) [28], which can be trained from a few demonstrations. Given a target pose, i.e. the object pose, the DSP drives the robot arm from its initial pose to the target pose while exhibiting a behavior similar to the demonstrations. In our setup we train each primitive on five kinesthetic demonstrations. The input to each DSP primitive is the current object and arm end-effector pose, and the output is the next end-effector pose. Our robot is equipped with a perception system that performs object detection and classification [18] and provides the Cartesian pose of each object with respect to the robot frame, which is passed to the DSP primitives. We would like to note that, in this paper, we chose the DSP representation since it allows us to quickly model each primitive with a handful of demonstrations, however at the cost of depending on a perception system. Alternatively, one could use other methods, such as an end-to-end deep network policy, to represent each primitive and avoid this dependency.

Fig. 4 illustrates the overall architecture of our system. Once the sequential model determines the next action, the corresponding primitive is called with the poses of the relevant objects and the robot starts executing the motion. Note that there are two loops in our system: 1) the DSP control loop, which runs at 20 Hz and is in charge of moving the arm to the target location, and 2) the symbolic switching loop, which runs at 2 Hz and determines the next primitive that needs to be executed solely based on the stream of images.
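The nesting of the two loops can be sketched by step counting: at 20 Hz control and 2 Hz switching, one symbol prediction covers ten control ticks. This is a simulated-time illustration only; the real system runs both loops concurrently on a live image stream.

```python
CONTROL_HZ, SYMBOL_HZ = 20, 2
STEPS_PER_SYMBOL = CONTROL_HZ // SYMBOL_HZ  # 10 control ticks per symbol

def run(control_steps, next_symbol):
    # next_symbol() stands in for the 2 Hz symbolic switching loop
    # (classifier + seq2seq prediction on the image stream).
    log = []
    symbol = next_symbol()  # initial primitive
    for step in range(control_steps):
        if step > 0 and step % STEPS_PER_SYMBOL == 0:
            symbol = next_symbol()  # 2 Hz: possibly switch primitives
        log.append(symbol)          # 20 Hz: one DSP control tick
    return log
```
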

III-D Network Architecture and Training

We train the action classifier on single pairs of images and action symbols, randomly selected from all sequences of our training data. Furthermore, we train the action classification model separately for each dataset until it converges, which usually happened after no more than 200 epochs for our datasets.

The sequence-to-sequence network is trained on sequences of image embeddings and action symbols. Instead of training on the full sequences, we train the network on sub-sequences of a specified sequence length (SL). Specifically, we experimented with sequence lengths of 10, 20, and 30 (see Tab. III). The sub-sequences are generated as ‘sliding windows’ over an entire sequence. We train the model to translate sequences of image embeddings into a sequence of action symbols. However, the sequence of predicted action symbols is offset by k steps, where k represents the number of steps we want to predict into the future. For our experiments we mostly relied on setting k = 1, which means that we only predict the action one step ahead in the future (Fig. 3).
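The sliding-window construction can be sketched as follows (a minimal sketch; the function and variable names are ours, not from the paper): each training pair consists of a window of SL consecutive embeddings and the symbol sequence shifted k steps into the future.

```python
def make_windows(embeddings, symbols, sl, k=1):
    # Build (input, target) training pairs: windows of sl embeddings,
    # paired with the symbol sequence offset by k steps.
    pairs = []
    for i in range(len(embeddings) - sl - k + 1):
        x = embeddings[i : i + sl]
        y = symbols[i + k : i + sl + k]  # same length, shifted by k
        pairs.append((x, y))
    return pairs
```

With k = 1 the last target symbol of each window is exactly one step ahead of the last observed frame, which is the prediction used at execution time.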

The neural network architecture of our approach is illustrated in Fig. 3. The encoder takes the input frame embeddings and generates a state embedding vector from its last recurrent layer, which encodes the information of all input elements. The decoder then takes this state embedding and converts it back into action symbol sequences. We train both networks individually for each task. The sequence-to-sequence model is trained with a latent dimension of 256 and usually converges after 50 epochs. Furthermore, we did not specifically finetune the hyperparameters of either model.

IV Datasets

To validate the usefulness of a symbol-based action prediction model for manipulation tasks, we defined two different datasets. The details of each are summarized in Tab. I.

Manipulation. For this dataset we defined object manipulation tasks where the goal is to put a ball in a cup. However, the cup and the ball can either be hidden in the cabinet or located somewhere in front of it. Depending on the object configuration, the task then becomes to first open the cabinet, grasp the cup and the ball, move them outside the cabinet, drop the ball into the cup, and move the cup with the ball back into the cabinet. Finally, the cabinet door needs to be closed. The cabinet door has a handle and can be opened by a sliding mechanism.

Fig. 6: For the block stacking task we randomly place the five blocks in the scene. The goal is to then move them to a stacked configuration. From left to right: three initial configurations of objects and the final configuration.
Fig. 7: Human operator performing a task of grasping a ball, putting it into a cup, and closing a cabinet door, performed with a teleop system.
Dataset       #Tasks  #Sequences  #Symbols  #Frames  Seq Length
Manipulation       4         791         7     228K      90-600
Blocks             1         287         6      97K     136-445
TABLE I: Details of our two datasets.

Dataset         Action Integer  Action Symbol  Meaning
Manipulation                 0              A  Move cup
                             1              B  Move ball
                             2              C  Move ball into cup
                             3              D  Move ball and cup
                             4              E  Open door
                             5              F  Close door
                             6              G  Approach cup
                             7              H  Approach ball
                             8              I  Approach to open
                             9              J  Approach to close
                            10              _  No action
                            11              #  Terminal/Done
                Compact Example: EBACDF_
Block Stacking               0              B  Move Blue
                             1              R  Move Red
                             2              Y  Move Yellow
                             3              G  Move Green
                             4              P  Move Pink
                             5              _  No action
                Compact Example: _Y_B_G_R_P_
TABLE II: Action symbols and their meaning for both datasets.
Fig. 8: Manipulation of objects over long time horizons: the same task is performed in a different order and with different subsets of actions. Which actions are necessary to achieve the goal (cup in cabinet, closed cabinet) depends on the initial scene configuration. For the sequence (a)-(g) cup and ball are initially outside the cabinet. Plausible actions for this scene are to open the cabinet door or to move the ball into the cup and the system selects to move the ball into the cup first. For the sequence shown in (h)-(n) both objects (cup, ball) are inside the cabinet and the only possible action is to first open the cabinet. Finally, for the sequence (o)-(u) the cabinet door is initially open. Many tasks can be performed with a different sequential structure of the actions. The bars underneath the images illustrate the structure of the task-relevant actions; black bars indicate where in the sequence the shown frames were taken from.

Given this setting we define four different tasks: the easiest task is to just move a ball into a cup (Manipulation C). Here the model only needs to predict the correct order of two symbols (G: approach cup, C: move ball into cup). We then make this task gradually more complex by adding action symbols: first moving ball and cup out of the cabinet before putting the ball in the cup (Manipulation ABC), moving the cup with the ball back into the cabinet after the other actions have been performed (Manipulation ABCD), and finally, also opening and closing the door of the cabinet (Manipulation ABCDEF). Please note that for the tasks (ABC, ABCD, and ABCDEF) the order of the action symbols can vary. For example, it is possible to first move either the cup or the ball outside the cabinet. However, some actions need to be executed before others. For example, the door of the cabinet has to be open before the cup with the ball can be moved into the cabinet.

A human operator places the objects into the scene and then performs one of the tasks with the robot arm, controlled by a tele-operation system (Fig. 7). Each sequence consists of 130-890 frames, and in total we captured 839 sequences. Across the sequences we define tasks with different levels of complexity, going from just moving the ball into the cup to the full range of actions described above. We used an 80-10-10 split for training, validation, and test data. Possible actions are ‘move cup’, ‘move ball’, ‘move ball into cup’, ‘move ball and cup’, ‘open door’, ‘close door’, and the corresponding approach motions, e.g. ‘approach cup’, ‘approach to open’, etc. (Tab. II). Frames of the captured sequences are manually labeled with the respective action symbols.

Block Stacking. For the second dataset we define a stacking task with five uniquely shaped blocks (similar to Tetris blocks) of different colors. The goal is to generate an object arrangement of two stacks and one single block. For each run the blocks are placed randomly on a table. An operator then takes the blocks and stacks them into the specified configuration. The order in which the objects are stacked is not defined. However, to generate the target configuration only certain action sequences are plausible. While the blue block can be placed at any time during the task, the green and pink blocks require the yellow and red blocks, respectively, to be placed first. Fig. 6 shows three initial configurations and the target configuration of objects. We captured 289 sequences (150-450 frames) of this stacking task and used an 80-10-10 split for training, validation, and test data. The possible actions for this task are to move any of the available blocks (blue, red, yellow, pink, green) and ‘no action’.

                         Symbol                  Structure                Edit Distance
Method                 SL10   SL20   SL30     SL10   SL20   SL30      SL10   SL20   SL30

Seq2Seq
Manipulation - ABCDEF  7.84%  5.98%  6.23%   15.38% 11.53% 15.38%     4.79%  4.86%  5.12%
Manipulation - ABCD    7.33%  6.62%  6.71%   16.01%  5.03%  5.08%     5.85%  5.82%  5.74%
Manipulation - ABC     3.23%  5.38%  8.23%   20.02% 18.02% 10.14%     6.52%  5.21%  6.62%
Manipulation - C       5.46%  4.01%  4.57%   21.42% 14.26% 28.14%     5.46%  4.02%  5.26%
Blocks                 8.82%  7.65%  8.15%   29.28% 13.03% 26.54%     7.15%  8.46%  9.03%

LSTM
Manipulation - ABCDEF  8.80%  7.82%  7.69%   57.69% 44.76% 34.61%     7.27%  6.90%  6.27%
Manipulation - ABCD   10.03%  9.41%  7.90%   35.01% 45.23% 27.19%     9.05%  8.29%  7.37%
Manipulation - ABC    11.06%  8.70%  7.85%   47.26% 40.58% 28.98%     9.38%  7.81%  6.99%
Manipulation - C      10.56%  7.39%  6.92%   25.92% 29.62% 25.93%    10.48%  6.81%  5.92%
Blocks                12.44% 10.31% 10.85%   78.57% 51.14% 44.28%    10.48%  9.03%  9.37%

TABLE III: Sequence Prediction Errors (Symbol, Structure, Edit Distance) for Seq2Seq and LSTM.

V Experiments and Results

In this section we evaluate the performance of our framework on sequence prediction for manipulation tasks. The goal is to predict sequences of action symbols that describe the sequential structure of a task and thereby allow an agent to execute a task in the correct order. The sequence of action symbols is predicted based on a sequence of frame embeddings. This allows us to reliably predict the next action based on the current state of the scene.

V-A Sequence Translation and Prediction

Our model translates sequences of image embeddings into sequences of action symbols. To evaluate the quality of the predictions, we use our sequence-to-sequence model to predict the next action symbol based on a sequence of image embeddings and compare the complete predicted sequence against the ground truth sequence. We then measure three different errors over these sequences.

In Tab. III we report results for all metrics and for sequence lengths (SL) of 10, 20, and 30. First, the symbol-to-symbol error measures the overall accuracy of the predicted sequences: each symbol is compared to its corresponding symbol in the ground truth sequence. This metric captures overall accuracy, but it does not account for the impact a single wrongly predicted symbol can have on executing a sequence; even if only one symbol is predicted wrongly, executing the task may fail.
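As a concrete illustration, the symbol-to-symbol error described above could be computed as follows; this is a minimal sketch with function and variable names of our own choosing, not taken from the paper's implementation:

```python
def symbol_error(pred, truth):
    """Fraction of frame positions where the predicted action symbol
    differs from the ground truth symbol (one symbol per frame)."""
    assert len(pred) == len(truth), "sequences must align frame-by-frame"
    wrong = sum(p != t for p, t in zip(pred, truth))
    return wrong / len(truth)
```

For example, a sequence of four symbols with one mismatch yields an error of 25%, regardless of where in the sequence the mismatch occurs.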

Therefore, we additionally compute the error of predicting the correct sequential structure of actions. For this we again predict an action symbol for each frame in a sequence and then shorten the sequence of symbols to its compact representation; i.e. when the same symbol is predicted repeatedly for consecutive frames we keep it only once (similar to run-length encoding). We then compare the predicted sequence with the ground truth sequence in its compact encoding and count an error whenever the symbol patterns differ. The results for the sequences of our datasets are shown in Tab. III (Structure). An example of shortened sequences is shown in Tab. II (Compact Example). Computing the structure error on the compact representation accounts for irregularities in the sequential structure: a single changed symbol creates a different compact encoding, and the entire sequence is counted as wrongly predicted.
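The compaction step and the all-or-nothing structure comparison described above can be sketched as follows; the function names are ours and serve only to illustrate the metric:

```python
from itertools import groupby

def compact(symbols):
    """Collapse runs of repeated symbols into a single occurrence,
    yielding the compact (run-length style) encoding of a sequence."""
    return [s for s, _ in groupby(symbols)]

def structure_error(pred, truth):
    """1 if the compact encodings differ in any way, else 0; the
    structure metric treats any pattern mismatch as a full error."""
    return int(compact(pred) != compact(truth))
```

Note that `compact` discards how long each action lasts and keeps only the order of distinct actions, which is exactly the sequential structure the metric targets.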

Finally, we use the Levenshtein distance as a more common way to compare symbol sequences (Tab. III, Edit Dist). Here the error is measured as the number of edit operations necessary to convert the predicted sequence into the ground truth sequence, normalized by the length of the ground truth sequence.
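The normalized edit distance could be computed with the standard dynamic-programming recurrence; this sketch is ours and makes the deletion/insertion/substitution costs explicit:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_error(pred, truth):
    """Edit distance normalized by the ground truth sequence length."""
    return levenshtein(pred, truth) / len(truth)
```

Unlike the structure metric, this error degrades gracefully: a single wrong symbol in a long sequence contributes only a small normalized cost.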

Additionally, we compare the sequence-to-sequence model with a many-to-many LSTM using soft-attention-weighted annotations [4] (Tab. III). The LSTM consists of a single layer with a latent dimension of 256. Compared to the sequence-to-sequence model, the LSTM performs well on the symbol and edit distance metrics, which means that the overall distance between ground truth and predicted sequences is small. However, it is significantly less accurate at predicting the structure of tasks.

To obtain frame embeddings we train a convolutional neural network as an action classifier against ground truth action labels provided for each frame (Sec. III-D). The performance of this network is reported in Tab. IV.

Method Classification Accuracy
Manipulation ABCDEF 93.7%
Manipulation ABCD 96.9%
Manipulation ABC 91.5%
Manipulation C 93.6%

TABLE IV: Action Classification Accuracy for each task.

V-B Robotic Manipulation

To test the framework’s performance in a realistic setting, we use a real robot to perform the manipulation task based on predicted motion primitives. Our robot is a 7-DoF robotic arm equipped with an on-board camera.

We set up a scene with randomly placed objects (cup, ball, open/closed door). Our classification network generates embeddings for the incoming camera frames. Once a sequence length (SL) worth of frames (e.g. 10, 20, or 30) has been collected, the sequence-to-sequence model starts predicting the next action. With additional incoming frames the model keeps predicting the same symbol as long as no changes in the scene occur. The predicted symbol is then passed to the low-level controller and the robot is set in motion. If the scene changes, e.g. if the robot starts moving from its default position toward the cup, the model predicts new action symbols for every frame. A newly predicted action symbol is pushed to a queue only if it differs from the previous symbol in the queue. The robot takes the next action symbol from the queue, performs object detection to obtain the object poses, and runs the motion primitive corresponding to the selected symbol. While the robot is performing the action, the sequence prediction network continues to predict action symbols that are pushed to the queue. When the robot finishes an action, it proceeds to the next action symbol and continues the task. Once all actions are completed successfully, the sequence model predicts the terminal symbol (#) and the robot returns to its default state.
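The deduplicating queue described above can be sketched in a few lines; the class and method names are our own illustration of the mechanism, not the paper's implementation:

```python
from collections import deque

class ActionQueue:
    """Queue of predicted action symbols. A newly predicted symbol is
    enqueued only if it differs from the most recently enqueued one,
    so per-frame repeats of the same prediction collapse into a single
    queued action. '#' is assumed to be the terminal symbol."""

    def __init__(self):
        self._queue = deque()
        self._last = None  # last symbol pushed to the queue

    def push(self, symbol):
        if symbol != self._last:
            self._queue.append(symbol)
            self._last = symbol

    def pop(self):
        """Next action for the robot to execute, or None if idle."""
        return self._queue.popleft() if self._queue else None
```

This decouples the per-frame prediction rate from the much slower execution rate of the motion primitives: the network may emit the same symbol for hundreds of frames while the robot works through the queue one action at a time.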

To evaluate how well the robot is able to perform the tasks, we measure how often it successfully reaches the goal state of a given task. We ran every task 20 times and counted the number of successes and failures. Depending on the scene setup, some predicted symbol sequences are implausible and their execution fails; we count these as failures. However, as our model relies on image embeddings to predict the next action, it can recover from some of these sequences: the model may predict a wrong action symbol, but then recovers and eventually predicts the correct sequence of actions to arrive at the goal state; we count these sequences as successes. Successful, recovered, and failed task completions are reported in Tab. V.
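The accuracy column in Tab. V follows from counting both clean successes and recovered runs as completed tasks; a one-line sketch (our own helper, named for illustration):

```python
def task_accuracy(n_success, n_recovered, n_failure):
    """Accuracy over all runs, counting recovered runs as completions."""
    total = n_success + n_recovered + n_failure
    return (n_success + n_recovered) / total
```

For the ABCDEF task this gives (8 + 8) / 20 = 80%, matching the reported value.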

Method #Success #Recovered #Failure Accuracy
Manipulation ABCDEF 8 8 4 80.0%
Manipulation ABCD 13 3 4 80.0%
Manipulation ABC 12 5 3 85.0%
Manipulation C 17 1 2 90.0%
TABLE V: Robot Task Execution Accuracy.

V-C Closed-loop Response

As shown in Fig. 4, our model operates in a closed loop and can thus interactively adapt to changes in the environment to successfully finish a task. In Fig. 9 we show the results of dynamic scene changes. While the robot is working on a task, we interfere with the scene and move objects around. In the example shown in Fig. 9 (d), the user puts the ball back into the cabinet while the robot is placing the cup in front of the cabinet. The robot visually detects this change and retrieves the ball after placing the cup. Here the adaptation happens at two levels: first, the sequence model detects the ball misplacement through the image embeddings and again triggers the ‘grasp ball’ primitive; then, the primitive receives the new ball position through the perception system and adapts its motion in order to grasp the ball.

In Fig. 10 we show how our pipeline recovers from wrongly predicted actions. For the scene setup shown in Fig. 10 (a), the robot first predicts moving the cup (Fig. 10 (b) & (c)), although the plausible next action would be to put the ball in the cup. In Fig. 10 (d) the robot accidentally moves the cup, which changes the scene. In both cases our system is able to recover because we rely on image embeddings, while the sequence-to-sequence model keeps track of past actions. This way the system eventually predicts the correct actions to complete the task (Fig. 10 (e), (f)).

In the accompanying video we show several live-action captures of our system to showcase the capabilities of our framework.

Fig. 9: Dynamic interaction: the robot is executing the task of putting a ball in a cup. From an initial object arrangement (a) it first fetches a ball (b), then continues to get the cup (c). A human operator then puts the ball back inside the cabinet (d) and the robot dynamically adapts by changing the plan to again fetch the ball (e). Finally, it completes the task by putting the ball in the cup and both objects back into the cabinet (f).
Fig. 10: Recovering from wrong predictions: depending on the scene setup, predicted symbols can be implausible or their execution may fail. For the scene shown in (a) a plausible action would be to put the ball in the cup, but the system predicts to first move the cup (b) & (c). In (d) the gripper accidentally moves the cup while the arm reaches for the ball. In both cases our system can recover from these failure states as we rely on image embeddings for predicting action symbols. Eventually the robot is able to complete the task (e), (f).

VI Conclusion

We have introduced a framework for translating image embeddings into sequences of action symbols. Each symbol represents a task-relevant action that is necessary to accomplish a subtask. We have shown that symbols serve as a lightweight and abstract representation that not only enables the use of sequential models – known from natural language processing – for efficiently learning task structure, but also helps organize the execution of tasks with real robots. Learning to translate image embeddings into action symbols allows us to execute tasks in closed-loop settings, which enables robots to adapt to changing object configurations and scenes. We have demonstrated the usefulness of our framework on two different datasets and evaluated our approach with two model architectures.

One limitation of our current setup is that we rely on ground truth action labels for the observed sequences. Automatically obtaining these action labels is not trivial and is an interesting avenue for future research. Furthermore, for many tasks it would be important to understand the sequential structure over even longer time horizons; here it seems promising to further explore other model architectures for sequence translation from image embeddings.


  • [1] Cited by: §II.
  • [2] A. Akbari, Muhayyuddin, and J. Rosell (2016) Task planning using physics-based heuristics on manipulation actions. In ETFA, Vol. , pp. 1–8. Cited by: §II.
  • [3] A. Amiranashvili, A. Dosovitskiy, V. Koltun, and T. Brox (2018) Motion perception in reinforcement learning with dynamic objects. In CoRL, Cited by: §II.
  • [4] D. Bahdanau, K. C., and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §V-A.
  • [5] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould (2016) Dynamic image networks for action recognition. In CVPR, Vol. , pp. 3034–3042. Cited by: §II.
  • [6] R. Brooks (1986) A robust layered control system for a mobile robot. IEEE J Robot Autom 2 (1), pp. 14–23. Cited by: §II.
  • [7] S. Cambon, R. Alami, and F. Gravot (2009) A hybrid approach to intricate motion, manipulation and task planning. Int. J. Robotics Res. 28 (1), pp. 104–126. Cited by: §I, §II.
  • [8] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. ArXiv abs/1406.1078. Cited by: §I, §III-B.
  • [9] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, Vol. 1, pp. 539–546 vol. 1. Cited by: §II.
  • [10] C. Devin, P. Abbeel, T. Darrell, and S. Levine (2017) Deep object-centric representations for generalizable robot learning. ICRA, pp. 7111–7118. Cited by: §II.
  • [11] N. Fazeli, M. Oller, J. Wu, Z. Wu, J. B. Tenenbaum, and A. Rodriguez (2019) See, feel, act: hierarchical learning for complex manipulation skills with multisensory fusion. Science Robotics 4 (26). Cited by: §II.
  • [12] R. E. Fikes, P. E. Hart, and N. J. Nilsson (1972) Learning and executing generalized robot plans. Artificial Intelligence 3, pp. 251 – 288. Cited by: §II.
  • [13] C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In ICRA, pp. 2786–2793. Cited by: §I, §I.
  • [14] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg (2017) Multi-level discovery of deep options. CoRR abs/1703.08294. Cited by: §II.
  • [15] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling (2018) FFRob: leveraging symbolic planning for efficient task and motion planning. Int. J. Robotics Res. 37 (1), pp. 104–136. Cited by: §II.
  • [16] K. Hausman, Y. Chebotar, S. Schaal, G. S. Sukhatme, and J. J. Lim Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In NeurIPS 2017, Cited by: §II.
  • [17] K. Hausman, S. Niekum, S. Osentoski, and G. S. Sukhatme (2015) Active articulation model estimation through interactive perception. In ICRA, pp. 3305–3312. Cited by: §II.
  • [18] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Vol. , pp. 2980–2988. Cited by: §III-C.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CVPR, pp. 770–778. Cited by: §III-A.
  • [20] S. Höfer and O. Brock (2016) Coupled learning of action parameters and forward models for manipulation. In IROS, Vol. , pp. 3893–3899. Cited by: §II.
  • [21] T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana (2017-Sep.) Deep reinforcement learning for high precision assembly tasks. In IROS, Vol. , pp. 819–825. Cited by: §II.
  • [22] M. Janner, S. Levine, W. T. Freeman, J. B. Tenenbaum, C. Finn, and J. Wu (2019) Reasoning about physical interactions with object-centric models. In ICLR, Cited by: §I, §II.
  • [23] J. Jones, G. D. Hager, and S. Khudanpur (2019) Toward computer vision systems that understand real-world assembly processes. In WACV, pp. 426–434. Cited by: §II.
  • [24] L. P. Kaelbling and T. Lozano-Pérez (2011) Hierarchical task and motion planning in the now. In ICRA, Vol. , pp. 1470–1477. Cited by: §II, §II.
  • [25] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine QT-opt: scalable deep reinforcement learning for vision-based robotic manipulation. CoRL 2018. Cited by: §I.
  • [26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In CVPR, Vol. , pp. 1725–1732. Cited by: §II.
  • [27] M. Khansari and A. Billard (2011) Learning stable non-linear dynamical systems with gaussian mixture models. IEEE Transactions on Robotics. Cited by: §III-C.
  • [28] S.M. Khansari-Zadeh, E. Klingbeil, and O. Khatib (2016) Adaptive human-inspired compliant contact primitives to perform surface-surface contact under uncertainty. Int. J. Robotics Res.. Cited by: §III-C.
  • [29] T. Kipf, Y. Li, H. Dai, V. F. Zambaldi, A. Sanchez-Gonzalez, E. Grefenstette, P. Kohli, and P. Battaglia CompILE: compositional imitation learning and execution. In ICML 2019, Cited by: §II.
  • [30] S. Krishnan, R. Fox, I. Stoica, and K. Goldberg (2017) DDCO: discovery of deep continuous options for robot learning from demonstrations. Cited by: §II.
  • [31] O. Kroemer, C. Daniel, G. Neumann, H. van Hoof, and J. Peters (2015) Towards learning hierarchical skills for multi-phase manipulation tasks. In ICRA, Vol. , pp. 1503–1510. Cited by: §II.
  • [32] F. Lagriffoul, D. Dimitrov, J. Bidot, A. Saffiotti, and L. Karlsson (2014) Efficiently combining task and motion planning using geometric constraints. Int. J. Rob. Res. 33 (14), pp. 1726–1747. Cited by: §II.
  • [33] A. S. Lakshminarayanan, R. Krishnamurthy, P. Kumar, and B. Ravindran (2016) Option discovery in hierarchical reinforcement learning using spatio-temporal clustering. Cited by: §II.
  • [34] C. Lea, R. Vidal, A. Reiter, and G. D. Hager (2016) Temporal convolutional networks: a unified approach to action segmentation. In Computer Vision – ECCV 2016 Workshops, pp. 47–54. Cited by: §II.
  • [35] J. Liu, B. Kuipers, and S. Savarese (2011) Recognizing human actions by attributes. CVPR 2011, pp. 3337–3344. Cited by: §II.
  • [36] T. Lozano-Perez, J. Jones, E. Mazer, P. O’Donnell, W. Grimson, P. Tournassoud, and A. Lanusse (1987) Handey: a robot system that recognizes, plans, and manipulates. In ICRA, Vol. 4, pp. 843–849. Cited by: §II.
  • [37] Z. Luo, Q. Zhang, Y. Ma, M. Singh, and F. Adib (2019) 3D backscatter localization for fine-grained robotics. In USENIX, NSDI’19, pp. 765–781. Cited by: §II.
  • [38] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §II.
  • [39] D. Moore and I. Essa (2002) Recognizing multitasked activities from video using stochastic context-free grammar. In AAAI, pp. 770–776. Cited by: §II.
  • [40] A. Muxfeldt and D. Kubus (2016) Hierarchical decomposition of industrial assembly tasks. In ETFA, Vol. , pp. 1–8. Cited by: §II.
  • [41] J. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici (2015-06) Beyond short snippets: deep networks for video classification. pp. 4694–4702. Cited by: §II.
  • [42] S. Niekum, S. Chitta, A. G. Barto, B. Marthi, and S. Osentoski (2013) Incremental semantically grounded learning from demonstration. In Robotics: Science and Systems, Cited by: §II.
  • [43] S. Niekum, S. Osentoski, G. Konidaris, and A. G. Barto (2012) Learning and generalization of complex tasks from unstructured demonstrations. In IROS, Vol. , pp. 5239–5246. Cited by: §II.
  • [44] S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, B. Marthi, and A. G. Barto (2014-01) Learning grounded finite-state representations from unstructured demonstrations. Int. J. Robotics Res. 34, pp. 131–157. Cited by: §II.
  • [45] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba (2018) Learning dexterous in-hand manipulation.. CoRR abs/1808.00177. Cited by: §I.
  • [46] A. Orthey, M. Toussaint, and N. Jetchev (2013) Optimizing motion primitives to make symbolic models more predictive. In ICRA, Vol. , pp. 2868–2873. Cited by: §II.
  • [47] A. J. Piergiovanni, C. Fan, and M. S. Ryoo (2017) Learning latent subevents in activity videos using temporal attention filters. In AAAI, pp. 4247–4254. Cited by: §II.
  • [48] H. Pirsiavash and D. Ramanan (2014) Parsing videos of actions with segmental grammars. In CVPR, Vol. , pp. 612–619. Cited by: §II.
  • [49] E. Plaku and G. D. Hager (2010) Sampling-based motion and symbolic action planning with geometric and differential constraints. In ICRA, Vol. , pp. 5002–5008. Cited by: §I, §II.
  • [50] M.S. Ryoo and J.K. Aggarwal (2006-01) Semantic understanding of continued and recursive human activities. pp. 379 – 378. Cited by: §II.
  • [51] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert (2005) Learning movement primitives. Springer Tracts in Advanced Robotics 15, pp. 3337–3344. Cited by: 2nd item, §III-C.
  • [52] S. Sen, A. Garg, D. V. Gealy, S. McKinley, Y. Jen, and K. Y. Goldberg (2016) Automating multi-throw multilateral surgical suturing with a mechanical needle guide and sequential convex optimization. ICRA, pp. 4178–4185. Cited by: §II.
  • [53] A. Sharma, M. Sharma, N. Rhinehart, and K. M. Kitani (2018) Directed-info GAIL: learning hierarchical policies from unsegmented demonstrations using directed information. CoRR abs/1810.01266. External Links: 1810.01266 Cited by: §II.
  • [54] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112. Cited by: §I, §III-B, §III-B.
  • [55] M. Toussaint, K. R. Allen, K. A. Smith, and J. B. Tenenbaum (2019) Differentiable physics and stable modes for tool-use and manipulation planning - extended abtract. pp. 6231–6235. Cited by: §II.
  • [56] H. Wang, A. Kläser, C. Schmid, and C. Liu (2011) Action recognition by dense trajectories. In CVPR 2011, Vol. , pp. 3169–3176. Cited by: §II.
  • [57] H. Wang, S. Pirk, E. Yumer, V. G. Kim, O. Sener, S. Sridhar, and L. J. Guibas (2019) Learning a generative model for multi-step human-object interactions from videos. CGF 38 (2), pp. 367–378. Cited by: §I.
  • [58] T.-Y. Wu, J.-T. Lin, T.-H. Wang, C.-W. Hu, J. C. Niebles, and M. Sun (2018) Liquid pouring monitoring via rich sensory inputs. In ECCV, pp. 352–369. Cited by: §II.
  • [59] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, F. F. Li, and S. Savarese (2018-05) Neural task programming: learning to generalize across hierarchical tasks. pp. . Cited by: §II.
  • [60] A. Zhang, A. Lerer, S. Sukhbaatar, R. Fergus, and A. Szlam (2018) Composable planning with attributes. Cited by: §II.
  • [61] Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In ICCV, Vol. , pp. 4166–4174. Cited by: §II.
  • [62] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, Vol. , pp. 3357–3364. Cited by: §II.
  • [63] Y. Zhu, Y. Zhao, and S. Zhu (2015) Understanding tools: task-oriented object modeling, learning and recognition. In CVPR, Vol. , pp. 2855–2864. Cited by: §II.
  • [64] Z. Zhu and H. Hu (2018) Robot learning from demonstration in robotic assembly: a survey. Robotics 7 (2). Cited by: §II.