Learning to Sequence and Blend Robot Skills via Differentiable Optimization

by Noémie Jaquier, et al.

In contrast to humans and animals who naturally execute seamless motions, learning and smoothly executing sequences of actions remains a challenge in robotics. This paper introduces a novel skill-agnostic framework that learns to sequence and blend skills based on differentiable optimization. Our approach encodes sequences of previously-defined skills as quadratic programs (QP), whose parameters determine the relative importance of skills along the task. Seamless skill sequences are then learned from demonstrations by exploiting differentiable optimization layers and a tailored loss formulated from the QP optimality conditions. Via the use of differentiable optimization, our work offers novel perspectives on multitask control. We validate our approach in a pick-and-place scenario with planar robots, a pouring experiment with a real humanoid robot, and a bimanual sweeping task with a human model.


I Introduction

Humans and animals generally achieve seamless sequences of actions, featuring smooth and natural transitions. Indeed, there is biological evidence that motor actions are composed of fundamental building blocks, which are then smoothly sequenced and combined to realize complex motions [23, 9]. This particularly applies to manipulation tasks, which can be broken down into several smoothly-linked action phases for which the brain selects and executes appropriate controllers [11]. In contrast, learning and executing seamless sequences of actions is still a challenge in robotics. Indeed, skills are usually learned for a specific task and are thus difficult to re-use in a different sequence of actions. Moreover, robot motions are characterized by obvious jerky transitions, which are so typical that people imitate robots by introducing abrupt pauses between subsequent movements.

In this paper, we propose a novel skill-agnostic approach to sequence and blend skills. To do so, we encode sequences of skills as quadratic programs (QP) [24] and leverage differentiable optimization (Optnet) layers [2, 1] to determine the relative importance of each skill throughout the task (see § III for background). Our approach is skill-agnostic in that it acts on a set of control values, thus considering skills as black-box solutions given a priori. Given a set of previously-defined (i.e., learned or programmed) skills and a few demonstrations of a task, our formulation not only learns a suitable sequence of possibly-concurrent skills, but also blends transitions "for free", i.e., requiring no additional operations (see § IV).

The contributions of this paper are: (i) We propose a novel QP-based approach to learn seamless sequences of skills from demonstrations; (ii) We formulate a tailored loss function from the optimality of the QP; (iii) We present two types of QP parameters to encode the importance of skills; (iv) We bring a novel perspective on multitask control via the use of differentiable optimization. We showcase our approach in various experiments with simulated and real robots (§ V).

II Related Work

Given a set of individual robotic skills, the challenge is to order and combine them to successfully execute complex manipulation tasks. Sequencing approaches presented in the literature are mainly based on learning from demonstrations (LfD) [19, 20, 27, 13] or on reinforcement learning (RL) [13, 30]. Manschitz et al. [19] learn both a sequence graph of skills from demonstrations, and a classifier to select the transitions. The authors extend their approach to handle concurrent skill activations [20]. As opposed to our work, the transitions between skills are explicitly labeled for the demonstrations. Rozo et al. [27] introduce an object-centered skill sequencing formulation, which builds a complete model of the task by cascading several skill models, and adapting their task parameters. In contrast to our approach, the desired skill sequence is assumed to be given. In [13], demonstrated trajectories are segmented into sequences of skills, where skill policies are represented by linear value function approximations. Sequences from several demonstrations are then combined into skill trees. Stulp et al. [30] extend the PI$^2$ algorithm to optimize sequences of dynamical movement primitives (DMP) by simultaneously learning their shape and goal parameters. Overall, the aforementioned approaches are specifically tailored to a single skill type, e.g., dynamical systems [19, 20], task-parametrized Gaussian mixture model (TP-GMM) [27], or DMP [30]. Moreover, transitions are usually handled by matching the end- and start-points of subsequent skills, and are thus characterized by obvious pauses. In contrast, our approach is skill-agnostic and learns sequences featuring seamless and natural transitions.

Other works focus on designing smooth transitions between skills. For instance, several approaches were presented in [29] to blend DMPs, and probabilistic movement primitives (ProMP) can naturally be blended [25]. However, these methods require a known sequence of specific skills and manual tuning of transition parameters. In [17], motions are generated from a hierarchy of motion primitives, which are activated based on neural-like dynamics. Sequencing and blending are therefore achieved by choosing suitable weights and connections. This approach was then combined with optimal control for continuous motion adaptation [22]. Although it generates seamless motions, its applicability is limited by the need to define the model by hand.

Sequencing and blending of tasks has also been explored in the context of robot multitask control. Salini et al. [28] combine different controllers in a QP formulation by defining a soft hierarchy of tasks. This corresponds to defining a sequence of skills with concurrent activations. Smooth transitions are achieved by smoothly varying the relative importance of skills (priorities) with manually-tuned weights. In [7], the skill priorities are instead optimized using covariance matrix adaptation evolution strategy (CMA-ES) in order to superpose several controllers for motion generation. Modugno et al. [21] extended this idea to learn time-varying skill priorities given as a weighted sum of basis functions equally spaced in time. The corresponding weights can then be optimized using black-box optimization techniques such as CMA-ES [21] or Bayesian optimization (BO) [31, 16]. Our work differs in that we directly learn the relative importance of skills along the task by differentiating through the optimization problem. In contrast to [21, 31, 16], we leverage LfD to learn sequences of previously-defined skills with seamless transitions. Therefore, our approach requires only a few initial demonstrations and no additional trials during the learning phase, thus improving data efficiency and training cost compared to black-box optimization techniques.

III Background

III-A Multitask control with quadratic programming

Quadratic programs (QPs) [24, Chap. 16] are extensively used to formulate multitask control of humanoid robots as a constrained optimization problem. Indeed, QPs can be solved very efficiently, while explicitly incorporating a wide variety of objectives and accounting for diverse constraints (see, e.g., [5, 6]). A QP solves a problem of the form

$$\min_{x} \; \frac{1}{2} x^\top Q x + p^\top x \quad \text{s.t.} \quad A x = b, \quad G x \le h, \tag{1}$$

where $x \in \mathbb{R}^{n}$ is the optimization variable, $Q \in \mathcal{S}_{+}^{n}$, $p \in \mathbb{R}^{n}$ are the parameters of the quadratic cost function with $\mathcal{S}_{+}^{n}$ denoting the manifold of positive-semidefinite (PSD) matrices, and $A$, $b$, $G$, $h$ are the constraints parameters. For robot multitask control, QPs are typically used to minimize the weighted sum of a set of tasks, i.e., $\sum_{i} w_i \|\hat{y}_i - y_i\|^2$, where $\hat{y}_i$ and $y_i$ are the desired and current value of the task $i$, respectively, and $w_i$ is a weight setting the relative importance of the task $i$ with respect to the other tasks. Moreover, the constraints typically include the equations of motion (kinematics or dynamics), the technological limits of the system (e.g., joint limits), and interaction constraints (e.g., grasp or frictional contacts). In this paper, we use a QP to encode a sequence of skills, along which the weights scaling the importance of each skill vary, leading to smooth trajectories and transitions.

III-B Karush-Kuhn-Tucker conditions

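The Karush-Kuhn-Tucker (KKT) conditions [15] are first-order necessary conditions for $x^*$ to be a local solution of a constrained optimization problem. In particular, the KKT conditions corresponding to the QP (1) are (i) $\nabla_{x} \mathcal{L}(x^*, \lambda^*, \nu^*) = 0$ with $\mathcal{L}(x, \lambda, \nu) = \frac{1}{2} x^\top Q x + p^\top x + \lambda^\top (A x - b) + \nu^\top (G x - h)$ the Lagrangian function of the problem (1), and $\lambda$, $\nu$ the Lagrangian multipliers corresponding to its equality and inequality constraints, respectively, (ii) $A x^* = b$, (iii) $G x^* \le h$, (iv) $\nu^* \ge 0$, and (v) $\nu^{*\top} (G x^* - h) = 0$.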
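To make these conditions concrete, the sketch below solves a small equality-constrained QP by stacking stationarity (i) and primal feasibility (ii) into one linear system. It is a minimal illustration in plain Python, not the solver used in the paper; the function names `solve_linear` and `solve_eq_qp` and the symbols `Q`, `p`, `A`, `b` (standard QP notation) are assumptions of this example, and inequality constraints are deliberately left out.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def solve_linear(M: Matrix, rhs: Vector) -> Vector:
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular KKT matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - known) / aug[i][i]
    return z


def solve_eq_qp(Q: Matrix, p: Vector, A: Matrix, b: Vector) -> Tuple[Vector, Vector]:
    """Minimise 0.5 x^T Q x + p^T x subject to A x = b.

    Stationarity (Q x + A^T lam = -p) and feasibility (A x = b) form the
    linear system [[Q, A^T], [A, 0]] [x; lam] = [-p; b].
    """
    n, m = len(Q), len(A)
    kkt = [Q[i][:] + [A[k][i] for k in range(m)] for i in range(n)]
    kkt += [A[k][:] + [0.0] * m for k in range(m)]
    sol = solve_linear(kkt, [-v for v in p] + list(b))
    return sol[:n], sol[n:]
```

For instance, `solve_eq_qp([[2.0, 0.0], [0.0, 2.0]], [-2.0, -4.0], [[1.0, 1.0]], [1.0])` returns the primal solution together with the multiplier of the single equality constraint.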

In addition to being used throughout the solving process of constrained optimization problems, the KKT conditions were exploited in inverse optimal control (IOC). In IOC, trajectories are viewed as the solution of an optimization problem, which aims at minimizing an unknown (parametric) cost. In this context, Englert et al. [8] used the fact that demonstrations of such trajectories, under the assumption that they are optimal, fulfill the KKT conditions, to determine the optimal parameters of the underlying cost. (Similar ideas have also been explored in the context of inverse reinforcement learning (IRL), where the parameters of a reward function were selected by minimizing the norm of the expert's policy gradient [26].) We follow a similar reasoning and leverage the QP KKT conditions to define the loss of our sequencing approach.

III-C Differentiable optimization layers

Recent works [2, 1] proposed to integrate optimization layers into neural architectures by differentiating through the corresponding optimization problems. In particular, Amos and Kolter [2] introduced Optnet, a neural architecture embedding QPs as individual layers. Namely, Optnet defines the output $z_{i+1}$ of the current layer as the solution of a QP whose parameters depend on the output $z_i$ of the previous layer, i.e.,

$$z_{i+1} = \arg\min_{z} \; \frac{1}{2} z^\top Q(z_i)\, z + p(z_i)^\top z \quad \text{s.t.} \quad A(z_i)\, z = b(z_i), \quad G(z_i)\, z \le h(z_i). \tag{2}$$

In order to train Optnet using backpropagation, the layer (2) must be differentiable, i.e., the derivatives of the solution of the QP with respect to its input parameters must be computed. This is achieved by differentiating the KKT conditions of the problem at a given solution (see [2]). In this paper, we leverage Optnet to learn the importance of individual skills throughout the task.
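The idea of differentiating the KKT conditions can be illustrated on a one-dimensional, unconstrained QP, for which the only KKT condition is stationarity. The sketch below is a toy illustration and not the Optnet implementation; the function names and the scalar parameters `q` and `p` are assumptions made for this example.

```python
from typing import Tuple


def qp_solution(q: float, p: float) -> float:
    """Minimiser of 0.5 * q * x**2 + p * x for q > 0, from q * x + p = 0."""
    if q <= 0.0:
        raise ValueError("q must be positive")
    return -p / q


def qp_gradients(q: float, p: float) -> Tuple[float, float]:
    """Derivatives of the minimiser, obtained by differentiating q * x + p = 0.

    With respect to q:  x + q * dx/dq = 0   ->  dx/dq = -x / q
    With respect to p:  q * dx/dp + 1 = 0   ->  dx/dp = -1 / q
    """
    x = qp_solution(q, p)
    return -x / q, -1.0 / q
```

The analytic derivatives returned by `qp_gradients` can be compared against finite differences of `qp_solution`, which is a convenient sanity check when experimenting with larger problems.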

IV Learning to Sequence and Blend Skills

Fig. 1: Pick-and-place task with planar robots. (a) DS skills (top and bottom). (b)-(f) Demonstration and reproduction with the 4-DoF robot using (b) diagonal, (c) full, and (d) baseline weights; (e) generalization with the -DoF robot; (f) evolution of the weights.

In this section, we present our approach to sequence and blend manipulation skills. In the following, we assume a set of $K$ previously-defined individual robot skills (e.g., a skill library). The skills are considered as given black-box solutions, implying that their representations are unknown and may differ across the skills. At each instant, each skill $i$ outputs a desired control value $\hat{u}_i$, depending on a current state $s_i$, to be given to the robot in order to execute the skill. For example, dynamical-systems-based skills [10] provide a desired end-effector velocity depending on the current end-effector position, and time-dependent skills such as [32] may output, e.g., a time-varying desired joint or end-effector position. The control values are specific to and may differ across skills. We then consider a manipulation task consisting of an unknown sequence of (some of) the aforementioned skills, possibly concurrently activated. We observe one or several optimal demonstrations of the task consisting of the observed control values, i.e., $\{u^{(n)}(\phi)\}_{n=1}^{N}$, where the phase variable $\phi \in [0, 1]$ encodes the task progress. In other words, $\phi = 0$ and $\phi = 1$ represent the beginning and the end of the task. (In the remainder, we drop dependencies on $\phi$ to simplify the notation.)

IV-A Illustrative example: pick-and-place with planar robots

For the sake of clarity, the different concepts underlying our approach are introduced generally before being illustrated for a pick-and-place task executed by planar robots with grippers. In this example, we observe a single manually-designed demonstration provided by a -DoF teacher robot that picks an object, transports it, and places it at a given location (see Fig. 1). The demonstration steps were achieved with proportional controllers activated using the weights of Fig. 1(f). We then consider a set of four skills , where the / and / skills control the arm and gripper motion, respectively. Although we next disclose the skill types, remember that they are considered as given black-box solutions in our approach. Indeed, each skill only provides a desired control value depending on the state at each task instant.

The arm skills are encoded as dynamical systems (DS) [10] trained with the control Lyapunov function scheme of [12]. The obtained DS, illustrated by Fig. 1(a), can then be adapted to new situations via translations and rotations. The desired control values of the DS-based skills correspond to the end-effector velocity and depend on the current end-effector position , such that and . The desired control values of the gripper skills correspond to the velocity of the gripper joints . The velocities and are zero when the gripper is completely opened or closed, and constant otherwise, i.e., and . In this example, the phase variable is defined as $\phi = t/T$ with $t$ the elapsed time, and $T$ the total duration of the task.
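From the point of view of the sequencing approach, a skill is simply a function mapping a state to a desired control value. The sketch below mimics the two gripper skills and the phase variable described above; the joint range, the speed, the sign convention (a positive velocity opens the gripper), the clamping of the phase, and all names are hypothetical choices made for illustration.

```python
Q_OPEN = 1.0    # assumed joint value of a fully opened gripper
Q_CLOSED = 0.0  # assumed joint value of a fully closed gripper
SPEED = 0.5     # assumed constant joint speed


def open_skill(q: float) -> float:
    """Desired gripper velocity: constant until fully opened, then zero."""
    return SPEED if q < Q_OPEN else 0.0


def close_skill(q: float) -> float:
    """Desired gripper velocity: constant until fully closed, then zero."""
    return -SPEED if q > Q_CLOSED else 0.0


def phase(t: float, duration: float) -> float:
    """Task progress t / duration, clamped to [0, 1] for safety."""
    return min(max(t / duration, 0.0), 1.0)
```

Because each skill only exposes such a function, the same interface could wrap a dynamical system, a movement primitive, or a hand-written controller.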

IV-B Sequencing and blending of skills with QPs

Similarly to multitask control, we propose to encode sequences of skills as QPs. Namely, given the desired control values $\hat{u}_i$ output by the individual skills and the current control values $u_i$, a sequence of skills can be generated by solving the following optimization problem

$$\min_{u} \; (\hat{u} - u)^\top W(\phi)\, (\hat{u} - u) \tag{3}$$

at each $\phi$, where $u$ and $\hat{u}$ stack the current and desired control values of all skills, and $W(\phi)$ is a varying weight matrix setting the relative importance of the skills throughout the sequence as a function of the phase variable $\phi$ encoding the task progress. The problem (3) is usually augmented with linear constraints related to the robotic system (see § III-A). In our case, we also include equality constraints for control values of the same type, i.e., $u_i = u_j$ if the skills $i$ and $j$ have the same type of outputs (e.g., both return end-effector pose values). For instance, following (3), the optimization problem of our illustrative example is formulated as

The constraints come from the shared control values across skills, i.e., the end-effector velocity , and the gripper joints velocity for the and skills, respectively. These constraints can directly be integrated into the optimization problem, which is equivalently written as


Note that (3) can be equivalently formulated as (1) with the optimization variable , and cost parameters , with . Importantly, the skill ordering in (3) is arbitrary. Indeed, the sequence is defined by the weight matrix, which is learned from demonstrations, as explained next. Skills can be added by extending with their control values and expanding accordingly.
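A special case helps to see how the weights produce blending. Assume that all skills share a single control value, so that the equality constraints collapse the stacked variable to one vector `u`, and that the weight matrix is block-diagonal with one non-negative scalar weight `w_i` per skill. The cost then reduces to the sum of `w_i * ||desired_i - u||^2`, whose minimizer is the weighted average of the desired control values. The sketch below implements this closed form; the function name and the example values are illustrative assumptions, and the general constrained case still requires a QP solver.

```python
from typing import List, Sequence


def blend_shared_control(
    desired: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Minimiser of sum_i w_i * ||desired_i - u||^2 over a shared vector u.

    Setting the gradient to zero gives u = (sum_i w_i * desired_i) / sum_i w_i.
    """
    if any(w < 0.0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    dim = len(desired[0])
    return [
        sum(w * d[k] for w, d in zip(weights, desired)) / total
        for k in range(dim)
    ]
```

With weights `[1.0, 0.0]` the first skill is executed alone, whereas intermediate weights such as `[0.75, 0.25]` yield a command lying between the two desired values, which is what makes a gradual hand-over between skills possible.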

Fig. 2: Illustration of the proposed learning approach. The relative importance of the skills is encoded by the weight matrix $W$ as a function of the phase variable $\phi$. An Optnet layer, solving a QP whose parameters depend on $W(\phi)$, is then used to determine the control command. $W$ is either a block-diagonal (top), or a full (bottom) matrix. The dashed arrows are only activated in the latter to learn the off-diagonal elements.

Given one or several demonstrations of a manipulation task, we aim at learning the skill weight function $W(\phi)$, so that the reproduction, i.e., the sequence of skills obtained by solving (3) for $\phi \in [0, 1]$, replicates the demonstrated task. This corresponds to minimizing a loss function measuring the quality of the reproduction. To do so, we need to solve a nested optimization: for each time instance of the task, we solve (3), and the whole set of solutions is then used to minimize the loss. To solve this problem, we leverage Optnet [2] to integrate the QP (3) into a neural network. Optnet allows us (i) to represent the QP parameters as functions, and (ii) to differentiate with respect to the QP parameters to solve the outer optimization of our nested problem using gradient-based approaches. In other words, Optnet backpropagates the loss to optimize both the phase-dependent skill weights and the control outputs. Thus, we can learn the relative importance of the skills throughout the task execution via the matrix $W(\phi)$. Our proposed neural network takes the phase variable $\phi$ as input, and consists of (i) a fully-connected layer coupled with a softmax activation function, whose outputs are the QP parameters (see § IV-D for details), and (ii) an Optnet layer (2), whose output is the control command transmitted to the robot to execute the task. Our approach is illustrated by Fig. 2.

It is important to emphasize that the proposed formulation not only learns sequences of skills, but also blends the transition between individual skills "for free". Indeed, the coupling of the fully-connected layer with a softmax activation induces smooth non-binary weight functions $W(\phi)$, therefore leading to smooth variations of the relative importance of the skills, i.e., to smooth transitions. This allows our neural architecture to learn and reproduce seamless transitions, as usually observed in human demonstrations. This also implies that skills are not necessarily executed in a strict sequence, but may be activated concurrently if required by the task.
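The smoothness induced by the softmax can be seen on a two-skill toy case. In the sketch below, each skill receives a logit that is an affine function of the phase, which stands in for a single fully-connected layer; the slopes and biases are hypothetical values chosen by hand (in the approach described above they would be learned), and the function names are assumptions of this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: positive outputs that sum to one."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def skill_weights(
    phase: float, slopes: Sequence[float], biases: Sequence[float]
) -> List[float]:
    """Scalar skill weights from an affine layer of the phase and a softmax."""
    return softmax([a * phase + b for a, b in zip(slopes, biases)])


if __name__ == "__main__":
    # The first skill dominates early, the second one late, with a
    # gradual hand-over around the middle of the task.
    for phi in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(phi, skill_weights(phi, [-10.0, 10.0], [5.0, -5.0]))
```

Steeper slopes sharpen the transition, while shallower ones spread it over a longer part of the task.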

The individual skill outputs may be defined either in task space (e.g., end-effector pose or velocity), or in joint space (e.g., joint position or velocity). In the former case, it may be desirable to directly solve the optimization (3) with respect to joint variables when executing the reproduction on the robot. To do so, the current control values can be expressed as a function of the joint values by exploiting the kinematic or dynamic relationship between the task- and joint-space variables. In our illustrative example, this corresponds to solving, during the reproduction,


where the arm skill outputs are expressed as $J \dot{q}_a$ with $\dot{q}_a$ and $\dot{q}_g$ the arm and gripper joint velocities, respectively, and $J$ the manipulator Jacobian. Finally, note that nonlinear relationships must be linearized for the QP formulation.
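For a planar arm with two revolute joints, the linear relationship between joint and end-effector velocities is given by the Jacobian of the forward kinematics. The sketch below is a hypothetical two-link example (the link lengths are assumed values and the function names are illustrative), unrelated to the specific robots used in the experiments.

```python
import math
from typing import List, Sequence, Tuple

LINK_1 = 1.0  # assumed length of the first link
LINK_2 = 1.0  # assumed length of the second link


def forward_kinematics(q: Sequence[float]) -> Tuple[float, float]:
    """End-effector position of a planar two-link arm."""
    x = LINK_1 * math.cos(q[0]) + LINK_2 * math.cos(q[0] + q[1])
    y = LINK_1 * math.sin(q[0]) + LINK_2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q: Sequence[float]) -> List[List[float]]:
    """Partial derivatives of the end-effector position w.r.t. the joints."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [
        [-LINK_1 * s1 - LINK_2 * s12, -LINK_2 * s12],
        [LINK_1 * c1 + LINK_2 * c12, LINK_2 * c12],
    ]


def end_effector_velocity(q: Sequence[float], dq: Sequence[float]) -> Tuple[float, float]:
    """Task-space velocity J(q) * dq for joint velocities dq."""
    J = jacobian(q)
    return (
        J[0][0] * dq[0] + J[0][1] * dq[1],
        J[1][0] * dq[0] + J[1][1] * dq[1],
    )
```

Since the mapping is linear in the joint velocities for a fixed configuration, substituting it into a quadratic cost on end-effector velocities keeps the problem a QP in the joint velocities.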

IV-C Definition of the loss function

In this section, we take inspiration from the IOC approach of [8] to define the loss function used to train the neural network previously introduced. Namely, we assume that the demonstrations are optimal, i.e., they are optimal solutions to the QP problem (3) and thus satisfy its KKT conditions. As the QP constraints are satisfied during optimal demonstrations, the KKT conditions (ii)-(v) are automatically fulfilled. Therefore, determining the optimal parameters of our neural network can be understood as searching for the parameters fulfilling the first KKT condition for all the demonstrations. This corresponds to minimizing the loss


where we sum over the demonstrations and the progress of the task via the phase variable . The Lagrangian of the problem (3) and its derivative for the -th demonstration are

where , is the vector of demonstrated skill outputs, and are the stacked constraints parameters, and is the vector of Lagrangian multipliers. Moreover, we can express as a function of for each demonstration by minimizing the loss subject to the KKT complementary condition, i.e., . Therefore, by setting the optimization variable to the output of our network, the loss of each demonstration is (Equivalently, for constant .)

The loss (6) inherently includes the task specifications via the demonstrations and the QP KKT conditions, and does not require additional task-specific design. To avoid the singular solution $W = 0$, we leverage the softmax activation function, as explained next. Thus, at least one skill is given a high relative importance at each instant of the task.
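In the simplest setting, where the skills share one control value and no constraint is active, the Lagrangian reduces to the cost itself and the first KKT condition is just a zero gradient. The sketch below evaluates the corresponding squared gradient norm at a demonstrated control value, assuming a cost equal to half the sum of `w_i * ||desired_i - u||^2` with scalar weights; it is not the general loss with Lagrangian multipliers, and the names and numbers are hypothetical.

```python
from typing import Sequence


def stationarity_loss(
    desired: Sequence[Sequence[float]],
    weights: Sequence[float],
    demo: Sequence[float],
) -> float:
    """Squared norm of the cost gradient evaluated at the demonstration.

    The gradient of 0.5 * sum_i w_i * ||desired_i - u||^2 with respect to u
    is -sum_i w_i * (desired_i - u); it vanishes when the weights explain
    the demonstrated control value.
    """
    grad = [
        -sum(w * (d[k] - demo[k]) for w, d in zip(weights, desired))
        for k in range(len(demo))
    ]
    return sum(g * g for g in grad)
```

Weights that reproduce the demonstration give a zero loss, but so do all-zero weights, whatever the demonstration. This degenerate minimum is the singular solution that a softmax output, whose weights sum to one, rules out.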

IV-D Skills weights as positive-semidefinite matrices

As mentioned previously, the QP parameters are determined by the first part of our neural network. Specifically, the cost parameters are , where the weight matrix is learned by the network. The constraints parameters relate to skill outputs and to the physical characteristics of the robot. To obtain valid QPs, or equivalently to prevent skills from having negative relative importance weights, the weight matrices must be PSD, i.e., $W \succeq 0$. We here describe two approaches to learn PSD weight matrices.

Diagonal weight matrices

In this case, we define

$$W = \mathrm{blockdiag}\left(w_1 I, \ldots, w_K I\right), \tag{8}$$

with $K$ the number of skills, where each block $w_i I$ weights the output of the $i$-th skill, and the scalars $w_i$ are obtained from the fully-connected layer followed by a softmax activation function. The latter ensures that the scalar weights are positive and sum to $1$, thus guaranteeing that $W$ is PSD, and that at least one skill is activated at any instant of the task. Notice that we defined the different blocks as proportional to identity matrices to avoid altering the outputs of individual skills.
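The structure of such a diagonal weight matrix can be sketched as follows, with one scalar weight per skill repeated along the diagonal as many times as the skill has output dimensions. The function name and the example dimensions are assumptions made for illustration.

```python
from typing import List, Sequence


def block_diagonal_weights(
    weights: Sequence[float], dims: Sequence[int]
) -> List[List[float]]:
    """Build blockdiag(w_1 * I_{d_1}, ..., w_K * I_{d_K}) as nested lists.

    A diagonal matrix with non-negative entries is positive semidefinite,
    so non-negative scalar weights are enough to obtain a valid QP cost.
    """
    if len(weights) != len(dims):
        raise ValueError("one weight per skill is required")
    size = sum(dims)
    W = [[0.0] * size for _ in range(size)]
    offset = 0
    for w, d in zip(weights, dims):
        for k in range(d):
            W[offset + k][offset + k] = w
        offset += d
    return W
```

For example, a two-dimensional arm skill weighted by 0.7 and a one-dimensional gripper skill weighted by 0.3 give a 3 x 3 matrix with diagonal (0.7, 0.7, 0.3).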

Full weight matrices

Such matrices allow us to express correlations between different skills, i.e., between their control values , throughout the task. This naturally occurs in various tasks. For example, when approaching and grasping an object, the hand closure is correlated with the velocity at which the object is approached. We learn matrices


where the off-diagonal blocks encode the correlations between the outputs of the skills and . To guarantee the positive semidefiniteness of the matrices , we propose to learn the diagonal and off-diagonal blocks separately. Firstly, the scalar terms are obtained as described in the previous paragraph. Secondly, the off-diagonal matrices are obtained by leveraging the properties of matrices with positive block-diagonal elements [4], namely


where is a contraction matrix, i.e., . Therefore, we use a second fully-connected layer to learn the contraction matrices as , with a tanh and a sigmoid activation function applied to and , respectively, so that . The off-diagonal elements are then computed recursively using the right-hand side of (9). For instance, in the case of a matrix composed of 3 skills, we first compute with and , and then with and . Note that, to facilitate the training of full weight matrices, we initialize the parameters of the scalar terms with a previously-trained diagonal model.
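The role of the contraction can be checked on the smallest possible case, two skills with scalar outputs, where the weight matrix is the symmetric 2 x 2 matrix [[w_i, x], [x, w_j]]. A standard result on block matrices with positive semidefinite diagonal blocks states that such a matrix is PSD exactly when x = sqrt(w_i) * k * sqrt(w_j) for some |k| <= 1. The sketch below squashes an unconstrained parameter with tanh to obtain k; this scalar parametrization and the function names are assumptions of the example and not necessarily how the contraction matrices are parametrized in the paper.

```python
import math


def off_diagonal_weight(w_i: float, w_j: float, raw: float) -> float:
    """Scalar off-diagonal term sqrt(w_i) * k * sqrt(w_j), k = tanh(raw)."""
    return math.sqrt(w_i) * math.tanh(raw) * math.sqrt(w_j)


def min_eigenvalue_2x2(a: float, x: float, b: float) -> float:
    """Smallest eigenvalue of the symmetric matrix [[a, x], [x, b]]."""
    mean = 0.5 * (a + b)
    radius = math.sqrt((0.5 * (a - b)) ** 2 + x * x)
    return mean - radius
```

Whatever the value of the unconstrained parameter, the smallest eigenvalue stays non-negative up to rounding, whereas an off-diagonal term larger than sqrt(w_i * w_j) would make the matrix indefinite.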

V Experiments

In this section, we evaluate our approach with different robotic platforms and manipulation tasks. All computations were performed on a laptop with GHz CPU and GiB RAM. A video of the experiments accompanies the paper (https://youtu.be/00NXvTpL-YU), and the source code is available at https://github.com/NoemieJaquier/sequencing-blending/.

V-A Illustrative example: pick-and-place with planar robots

We first consider the pick-and-place task introduced in § IV-A and train our approach using diagonal and full weight matrices on the provided single manually-designed demonstration. In order to guarantee that one arm and one hand skill are activated at each instant of the task, we use one softmax activation function for each of the arm and gripper pairs of skills, namely and . The task is then reproduced by the 4-DoF robot. As a baseline, we consider the case where the QP (4) with diagonal weights does not require additional constraints, so that its solution is , . In this case, as the QP solution is readily available, we do not need to solve a nested optimization to minimize a given loss. Instead, the loss (6) can be minimized independently for each value of with classical optimization methods. Finally, a -DoF student robot is requested to reproduce the learned sequence of skills with different pick and place positions. To do so, the and DS skills are adapted to the new target points. For all reproductions, the QP is solved with respect to the arm and gripper joint velocities using (5).

Fig. 1(b) depicts the demonstrated trajectory, as well as the reproduction of the task by the 4-DoF robot with a diagonal weight matrix. Our approach successfully sequences the available skills and reproduces the task by picking and placing the object at the required locations. The differences in trajectory between the demonstration and the reproduction are due to the fact that the DS arm skills naturally follow a different trajectory than the demonstration between the target points (remember that the demonstration was generated independently from the given skills). For the same reason, the learned weights slightly differ from the manually-designed ones used to generate the demonstration (Fig. 1(f)). The differences in trajectory are attenuated when using a full weight matrix (see Fig. 1(c)), where correlations between skills are exploited to better match the demonstration. Note that only the diagonal weights are represented in Fig. 1(f). As expected, the baseline looks similar to our approach with a diagonal weight matrix (see Fig. 1(d)). Slight differences may be due to the different optimizations and to local minima in the loss. However, notice that the baseline applies only to very simple QPs, which are unrealistic for most applications (including the experiments of § V-B and § V-C). Also, in contrast to our approach, the baseline does not learn the weight matrix as a parametric function of the phase variable. Fig. 1(e) depicts the reproduction of the learned sequence by the -DoF student robot using a diagonal weight matrix, showing that our approach successfully generalizes to different pick and place locations. As the full weight matrix naturally overfits a single demonstration, it is not well suited to generalize in this case.

V-B Pouring task with a humanoid robot

Fig. 3: Pouring task with a humanoid robot. The top row shows snapshots of the robot in the resting position (1) and executing the (2), (3), (4) and (5) skills during the task. The bottom-left graphs depict the demonstrated () and reproduced hand position, orientation, and closure trajectories. Reproductions are obtained with our approach using diagonal () and full () weight matrices. A generalized motion obtained with diagonal weight matrices (), as well as a baseline where skills are manually sequenced without blending (), are also displayed. The right column depicts the learned diagonal and full weight matrices at different task instants.

Here, we apply our approach in a real-world scenario to learn a complex sequence of skills on the humanoid robot ARMAR-6 [3]. The robot is positioned in front of a table, on which are placed an empty glass and a -liter plastic bottle partially filled with orange juice. The scenario consists of a pouring task, where the robot grasps the bottle, pours juice into the glass, and places the bottle back on the table. The positions of the objects are assumed a priori known by the robot, but could equally be inferred by a perception system.

As for the previous experiment, a set of skills is provided as black-box solutions. Specifically, four skills are defined for the arm, namely the bottle, , the bottle back, and the arm. Moreover, two joint-velocity-based skills are provided for the five-fingered hand, namely and in a power cylindrical grasp. The four arm skills are defined by DS with radial vector fields pointing toward a fixed point attractor. Their desired control values correspond to the end-effector linear and angular velocities and , which depend on the current end-effector position and orientation , i.e., . The fixed point attractors of the four arm skills are the robot hand grasp pose on the bottle for the skill, a tilted hand pose above the glass for the skill, the hand pose at the position of the bottle on the table for the skill, and the hand resting pose for the skill. The hand skills are defined similarly to the gripper skills of the pick-and-place example, and thus open and close all finger joints by controlling their velocity. We train our approach on seven manually-designed demonstrations for which an operator defined the arm and hand trajectories. The bottle and glass positions were varied by and cm along the and axes, respectively. As previously, we use two softmax activation functions for the arm and hand skills, and the phase variable is .

After the learning phase, the robot successfully reproduced the pouring task using both diagonal and full weight matrices (see Fig. 3 (top-left)). Moreover, our approach not only succeeded at learning the desired sequence of skills, but also resulted in seamless transitions as indicated by the absence of pauses and by the smoothness of the trajectories depicted in Fig. 3 (bottom-left). The learned weight matrices are represented in Fig. 3 (right) for the diagonal and full cases. Although the resulting trajectories look similar, the matrices still differ in the relative importance attributed to each skill. Notably, the model with full weight matrices exploits the correlation between the skills to shape the reproduced trajectory, thus featuring lower diagonal values than the diagonal model. Therefore, full weight matrices have better representation capabilities than their diagonal counterpart. However, this comes at the expense of generalization abilities. Indeed, as shown in Fig. 3, the diagonal model was able to generalize to bottle and glass locations that were outside the demonstrated range (here, the bottle and glass positions were swapped along the axis), which the full model could only achieve for locations close to the demonstrations. Finally, we compared our approach to a baseline obtained by manually sequencing the given skills without any learning or blending. As shown in Fig. 3 (bottom-left), the baseline trajectory is characterized by obvious jerky transitions. The resulting timing would cause the robot to overfill the glass, thus failing the reproduction. Importantly, our approach is well-suited for learning and executing the sequence of skills on a real robot. Indeed, the pouring task training lasted a couple of minutes, and the testing time was - ms per timestamp, which allowed us to execute our approach at a control frequency of Hz.

V-C Bimanual sweeping task learned from human data

Fig. 4: Bimanual sweeping task with a human model. (a) and VMP skills. (b) Position (left arm) and (c) orientation (right arm): demonstrations and reproductions of the task using diagonal and full weight matrices. (d) Diagonal and (e) full weight matrices: snapshots of the reproduction at (top) and (bottom).

We aim at evaluating our approach to sequence and blend skills based on human demonstrations, i.e., on data for which no ground truth is easily available. To do so, we consider a bimanual sweeping task from the KIT motion database [18, 14], in which a human transfers cucumber slices from a cutting board to a bowl. At the beginning of the demonstrations, a subject stands in front of a table. A cutting board, on which cucumber slices are placed, is positioned along the edge of the table in front of the subject. The subject first grasps a plastic bowl with the left hand and a knife with the right hand using cylindrical power grasps. Then, the subject holds the bowl below the table next to the cutting board, and pushes the cucumber slices into the bowl with the knife. Finally, the subject places both knife and bowl back.

For the bimanual sweeping task, we consider the motion of each arm separately. Moreover, we use demonstrations of the aforementioned sweeping task performed by two different subjects. First, three naturally-varying demonstrations of the first subject are used to obtain a skill library. Here, we consider a set of four low-level skills per arm. Each human demonstration is manually segmented into four parts, each corresponding to one of these skills. In this experiment, we use via-points movement primitives (VMP) [32], which offer powerful skill representations that are easily adaptable to new starts, goals and via-points after training. Each skill is therefore represented by a time-dependent VMP trained on the corresponding segments of the demonstrations. The desired control values are the end-effector position and unit-quaternion-based orientation given by the mean trajectory retrieved by the VMPs, and they depend on time. All VMPs are executed with the start and goal poses defined by the desired task. The timing of the VMP skills is defined by the duration of the entire task. Within our model, every skill trajectory is then evaluated at the evolving time given by the overall phase variable. The resulting skills are illustrated in Fig. 4(a). As for the previous experiments, these skills are considered as black-box solutions, meaning that their representation is not directly known by our model. We then use three demonstrations provided by a second subject to train two models of our approach (left and right arm separately) with diagonal and full weight matrices. Note that these demonstrations include variations, as human motions naturally vary across executions of the same task.
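The time-driven evaluation of black-box skills described above can be sketched as follows. This is only an illustration, not the VMP implementation of [32]: the linear phase, the `make_toy_skill` stand-in (a minimum-jerk interpolation between start and goal), the function names and the numerical values are all assumptions introduced here, and only positions are handled (orientations are omitted).

```python
import numpy as np


def phase(t, task_duration):
    """Time-driven phase in [0, 1]; a linear phase is an assumption of this sketch."""
    return float(np.clip(t / task_duration, 0.0, 1.0))


def make_toy_skill(start, goal):
    """Hypothetical black-box skill: maps a phase value to a desired position.

    A real skill would be a trained VMP; a minimum-jerk interpolation
    between start and goal is used here only as a placeholder.
    """
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)

    def skill(phi):
        s = 10 * phi**3 - 15 * phi**4 + 6 * phi**5  # smooth profile from 0 to 1
        return start + s * (goal - start)

    return skill


def query_skills(skills, t, task_duration):
    """Evaluate every skill at the same task-level phase and stack the results."""
    phi = phase(t, task_duration)
    return np.stack([skill(phi) for skill in skills])


# Two toy skills that both span the duration of the whole task (illustrative values).
skills = [
    make_toy_skill([0.0, 0.0, 0.0], [0.2, 0.0, 0.1]),
    make_toy_skill([0.2, 0.0, 0.1], [0.2, 0.3, 0.1]),
]
desired = query_skills(skills, t=2.5, task_duration=5.0)  # shape (2, 3)
```

In this sketch the sequencing model would only receive the stacked desired values in `desired`, which mirrors the black-box treatment of the skills: nothing about their internal representation is exposed.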

A simulated kinematic human model, as well as models of the bowl, knife and table, are used for the reproduction phase. In this case, the model with diagonal weight matrices could not reproduce the task as it was not able to closely fit the demonstrations (see Fig. 4(b)-(e)). This is due to the significant differences between the low-level skill trajectories (trained on the first subject) and the demonstrations (provided by the second subject). Notice that such differences also appeared in the pick-and-place experiment. However, as opposed to the sweeping task, the arm trajectories between the pick and place locations did not influence the task success, allowing both diagonal and full weight matrices to be used. For the bimanual sweeping task, only full weight matrices led to a successful reproduction by learning correlations between skills. Notice that, although two separate models were trained for the left and right arms, the learned full weight functions conserved the timing of the motions, allowing both arms to be synchronized during the reproduction. Also, the training and testing times were similar to the pouring task.
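To illustrate why off-diagonal weights give the model more freedom to fit demonstrations, the following sketch blends two skills with a diagonal and with a full weight matrix. It assumes a simplified, unconstrained weighted least-squares problem in which all skills provide the same control variable; the QP used in the paper may include constraints and a different structure. The function `blend_skills` and all numerical values are hypothetical.

```python
import numpy as np


def blend_skills(desired, W):
    """Closed-form minimizer of (A u - d)^T W (A u - d).

    desired: array of shape (K, n), one desired control value per skill.
    W: symmetric positive-definite weight matrix of shape (K*n, K*n).
    Assumes an unconstrained problem (a simplification for this sketch).
    """
    desired = np.asarray(desired, dtype=float)
    K, n = desired.shape
    A = np.tile(np.eye(n), (K, 1))  # K stacked identity blocks, shape (K*n, n)
    d = desired.reshape(K * n)      # stacked desired values of all skills
    H = A.T @ W @ A                 # quadratic term
    g = A.T @ W @ d                 # linear term
    return np.linalg.solve(H, g)


# Two toy skills with 2-D desired values (illustrative numbers).
desired = np.array([[0.0, 0.0],
                    [1.0, 2.0]])

# Diagonal weights: the result is a per-dimension weighted average of the skills.
W_diag = np.diag([3.0, 3.0, 1.0, 1.0])
u_diag = blend_skills(desired, W_diag)

# Full weights: an off-diagonal entry couples dimension 0 of skill 1
# with dimension 1 of skill 2, which changes the blended output.
W_full = W_diag.copy()
W_full[0, 3] = W_full[3, 0] = 0.5
u_full = blend_skills(desired, W_full)
```

With the diagonal matrix the output stays on the per-dimension weighted average of the two skills, whereas the coupled entry in `W_full` moves it away from that average. In this simplified setting, a full matrix can therefore reach outputs that no diagonal weighting of the same skills produces.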

VI Conclusion

We proposed a skill-agnostic formulation to learn to sequence and blend skills using QP-based differentiable optimization layers. This allows us to represent the relative importance of skills as a function of the task progress and to optimize it for a given loss with gradient-based approaches. Our experiments showed that, provided a set of black-box skills and one or a few demonstrations of a task, our approach not only learns unknown sequences composed of various types of skills, but also generates smooth motions with seamless, blended transitions. Overall, our diagonal model is advantageous for generalization, while full weight matrices are beneficial when demonstrations must be closely followed.

It is worth noting that the considered pouring and sweeping tasks are generally difficult to learn with a single model. Instead, our approach decomposes a task by combining several skills, which are easy to train and potentially re-usable across tasks. Moreover, it requires only one or a few demonstrations of the complete task, making it less cumbersome to train than trial-and-error-based models. This is a major advantage compared to black-box optimization techniques used in multitask control, although detailed performance comparisons are deferred to future work. Finally, in contrast to end-to-end methods, our formulation is modular, fast to train, and interpretable, as the relative importance of skills is directly embedded in the weight matrices.

Importantly, the performance of our approach highly depends on the capabilities of the given individual skills. Namely, a given task can be reproduced only if the provided skill library contains a set of skills that can be sequenced and combined to do so. Also, our approach generalizes to new object locations under the condition that the corresponding skills successfully adapt to these locations. The dependency of the model parameters on a time-driven phase variable also limits the generalization. This can be overcome by defining the phase variable as a time-independent, perception-based measure of task progress, which we will explore in the future.

One drawback of our approach is that the dimensionality of the optimization variable increases rapidly with the number of different types of skills, i.e., skills that provide different control variables. To be applied to cases featuring a complex library with many different types of skills, we will extend our approach to handle hierarchies of skills. For instance, high-level skills, e.g., sweeping cucumber slices into a bowl, may first be learned with our approach as sequences of low-level skills, and then combined in a complex task, e.g., preparing a salad, with an additional QP-based formulation. We will then evaluate our approach in more complex scenarios including, e.g., hierarchies and soft prioritization of skills.


  • [1] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter (2019) Differentiable convex optimization layers. In NeurIPS. Cited by: §I, §III-C.
  • [2] B. Amos and J. Z. Kolter (2017) OptNet: differentiable optimization as a layer in neural networks. In ICML. Cited by: §I, §III-C, §IV-B.
  • [3] T. Asfour, M. Wächter, L. Kaul, S. Rader, P. Weiner, S. Ottenhaus, R. Grimm, Y. Zhou, M. Grotz, and F. Paus (2019) ARMAR-6: a high-performance humanoid for human-robot collaboration in real world scenarios. IEEE RAM 26 (4), pp. 108–121. Cited by: §V-B.
  • [4] R. Bhatia (2007) Positive definite matrices. Princeton University Press. Cited by: §IV-D.
  • [5] K. Bouyarmane and A. Kheddar (2016) On weight-prioritized multitask control of humanoid robots. IEEE Trans. Autom. Control 63 (6), pp. 1542–1557. Cited by: §III-A.
  • [6] C. Collette, A. Micaelli, C. Andriot, and P. Lemerle (2008) Robust balance optimization control of humanoid robots with multiple non coplanar grasps and frictional contacts. In IEEE ICRA, pp. 3187–3193. Cited by: §III-A.
  • [7] N. Dehio, R. F. Reinhart, and J. J. Steil (2015) Multiple task optimization with a mixture of controllers for motion generation. In IEEE/RSJ IROS, pp. 6416–6421. Cited by: §II.
  • [8] P. Englert, N. A. Vien, and M. Toussaint (2017) Inverse KKT: learning cost functions of manipulation tasks from demonstrations. IJRR 36 (13-14), pp. 1474–1488. Cited by: §III-B, §IV-C.
  • [9] T. Flash and B. Hochner (2005) Motor primitives in vertebrates and invertebrates. Curr. Opin. Neurobiol. 15 (6), pp. 660–666. Cited by: §I.
  • [10] E. Gribovskaya, S. M. Khansari-Zadeh, and A. Billard (2011) Learning non-linear multivariate dynamics of motion in robotic manipulators. IJRR 30 (1), pp. 80–117. Cited by: §IV-A, §IV.
  • [11] R. S. Johansson and J. R. Flanagan (2009) Coding and use of tactile signals from the fingertips in object manipulation tasks. Nat. Rev. Neurosci. 10 (5), pp. 345–359. Cited by: §I.
  • [12] S. M. Khansari-Zadeh and A. Billard (2014) Learning control Lyapunov function to ensure stability of dynamical system-based robot reaching motions. Rob. Auton. Syst. 62 (6), pp. 752–765. Cited by: §IV-A.
  • [13] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto (2012) Robot learning from demonstration by constructing skill trees. IJRR 31 (3), pp. 360–375. Cited by: §II.
  • [14] F. Krebs, A. Meixner, I. Patzer, and T. Asfour (2020-2021) The KIT bimanual manipulation dataset. In IEEE/RAS Humanoids. Cited by: §V-C.
  • [15] H. W. Kuhn and A. W. Tucker (1951) Nonlinear programming. In Berkeley Symp. on Mathematical Statistics and Probability, pp. 481–492. Cited by: §III-B.
  • [16] J. Li, Y. Zhu, L. Huo, and Y. Chen (2020) Sample-efficient learning of soft priorities for safe control with constrained Bayesian optimization. In IEEE IRC, pp. 406–407. Cited by: §II.
  • [17] T. Luksch, M. Gienger, M. Mühlig, and T. Yoshiike (2012) Adaptive movement sequences and predictive decisions based on hierarchical dynamical systems. In IEEE/RSJ IROS, pp. 2082–2088. Cited by: §II.
  • [18] C. Mandery, Ö. Terlemez, M. Do, N. Vahrenkamp, and T. Asfour (2016) Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE T-RO 32 (4), pp. 796–809. Cited by: §V-C.
  • [19] S. Manschitz, J. Kober, M. Gienger, and J. Peters (2015) Learning movement primitive attractor goals and sequential skills from kinesthetic demonstrations. Rob. Auton. Syst. 74, pp. 97–107. Cited by: §II.
  • [20] S. Manschitz, J. Kober, M. Gienger, and J. Peters (2015) Probabilistic progress prediction and sequencing of concurrent movement primitives. In IEEE/RSJ IROS, pp. 449–455. Cited by: §II.
  • [21] V. Modugno, U. Chervet, G. Oriolo, and S. Ivaldi (2016) Learning soft task priorities for safe control of humanoid robots with constrained stochastic optimization. In IEEE/RAS Humanoids, pp. 101–108. Cited by: §II.
  • [22] M. Mühlig, A. Hayashi, M. Gienger, S. Iba, and T. Yoshiike (2014) Receding horizon optimization of robot motions generated by hierarchical movement primitives. In IEEE/RSJ IROS, pp. 129–135. Cited by: §II.
  • [23] F. A. Mussa-Ivaldi and E. Bizzi (2000) Motor learning through the combination of primitives. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 355, pp. 1755–1769. Cited by: §I.
  • [24] J. Nocedal and S. J. Wright (2006) Numerical optimization. Second edition, Springer. Cited by: §I, §III-A.
  • [25] A. Paraschos, C. Daniel, J. Peters, and G. Neumann (2018) Using probabilistic movement primitives in robotics. Auton. Robot. 42 (3), pp. 529–551. Cited by: §II.
  • [26] M. Pirotta and M. Restelli (2016) Inverse reinforcement learning through policy gradient minimization. In AAAI, pp. 1993–1999. Cited by: footnote 1.
  • [27] L. Rozo, M. Guo, A. G. Kupcsik, M. Todescato, P. Schillinger, M. Giftthaler, M. Ochs, M. Spies, N. Waniek, P. Kesper, and M. Bürger (2020) Learning and sequencing of object-centric manipulation skills for industrial tasks. In IEEE/RSJ IROS, pp. 9072–9079. Cited by: §II.
  • [28] J. Salini, V. Padois, and P. Bidaud (2011) Synthesis of complex humanoid whole-body behavior: a focus on sequencing and tasks transitions. In IEEE ICRA, pp. 1283–1290. Cited by: §II.
  • [29] M. Saveriano, F. Franzel, and D. Lee (2019) Merging position and orientation motion primitives. In IEEE ICRA, pp. 7041–7047. Cited by: §II.
  • [30] F. Stulp, E. A. Theodorou, and S. Schaal (2012) Reinforcement Learning with Sequences of Motion Primitives for Robust Manipulation. IEEE T-RO 28 (6), pp. 1360–1370. Cited by: §II.
  • [31] Y. Su, Y. Wang, and A. Kheddar (2018) Sample-efficient learning of soft task priorities through Bayesian optimization. In IEEE/RAS Humanoids, pp. 1–6. Cited by: §II.
  • [32] Y. Zhou, J. Gao, and T. Asfour (2019) Learning via-point movement primitives with inter- and extrapolation capabilities. In IEEE/RSJ IROS, pp. 4301–4308. Cited by: §IV, §V-C.