I Introduction
Humans and animals generally achieve seamless sequences of actions, featuring smooth and natural transitions. Indeed, there is biological evidence that motor actions are composed of fundamental building blocks, which are then smoothly sequenced and combined to realize complex motions [23, 9]. This particularly applies to manipulation tasks, which can be broken down into several smoothly-linked action phases for which the brain selects and executes appropriate controllers [11]. In contrast, learning and executing seamless sequences of actions is still a challenge in robotics. Indeed, skills are usually learned for a specific task and are thus difficult to reuse in a different sequence of actions. Moreover, robot motions are characterized by obvious jerky transitions, which are so typical that people imitate robots by introducing abrupt pauses between subsequent movements.
In this paper, we propose a novel skill-agnostic approach to sequence and blend skills. To do so, we encode sequences of skills as quadratic programs (QP) [24] and leverage differentiable optimization (OptNet) layers [2, 1] to determine the relative importance of each skill throughout the task (see § III for a background). Our approach is skill-agnostic by acting on a set of control values, thus considering skills as a priori given black-box solutions. Given a set of previously-defined (i.e., learned or programmed) skills and a few demonstrations of a task, our formulation not only learns a suitable sequence of possibly-concurrent skills, but also blends transitions "for free", i.e., requiring no additional operations (see § IV).
The contributions of this paper are: (i) we propose a novel QP-based approach to learn seamless sequences of skills from demonstrations; (ii) we formulate a tailored loss function from the optimality conditions of the QP; (iii) we present two types of QP parameters to encode the importance of skills; and (iv) we bring a novel perspective on multi-task control via the use of differentiable optimization. We showcase our approach in various experiments with simulated and real robots (§ V).

II Related Work
Given a set of individual robotic skills, the challenge is to order and combine them to successfully execute complex manipulation tasks. Sequencing approaches presented in the literature are mainly based on learning from demonstrations (LfD) [19, 20, 27, 13] or on reinforcement learning (RL) [13, 30]. Manschitz et al. [19] learn both a sequence graph of skills from demonstrations and a classifier to select the transitions. The authors extend their approach to handle concurrent skill activations in [20]. As opposed to our work, the transitions between skills are explicitly labeled in the demonstrations. Rozo et al. [27] introduce an object-centered skill sequencing formulation, which builds a complete model of the task by cascading several skill models and adapting their task parameters. In contrast to our approach, the desired skill sequence is assumed to be given. In [13], demonstrated trajectories are segmented into sequences of skills, where skill policies are represented by linear value function approximations. Sequences from several demonstrations are then combined into skill trees. Stulp et al. [30] extend the PI² algorithm to optimize sequences of dynamical movement primitives (DMP) by simultaneously learning their shape and goal parameters. Overall, the aforementioned approaches are specifically tailored to a single skill type, e.g., dynamical systems [19, 20], task-parametrized Gaussian mixture models (TP-GMM) [27], or DMPs [30]. Moreover, transitions are usually handled by matching the end- and start-points of subsequent skills, and are thus characterized by obvious pauses. In contrast, our approach is skill-agnostic and learns sequences featuring seamless and natural transitions.

Other works focus on designing smooth transitions between skills. For instance, several approaches were presented in [29] to blend DMPs, and probabilistic movement primitives (ProMP) can naturally be blended [25]. However, these methods require a known sequence of specific skills and manual tuning of transition parameters. In [17], motions are generated from a hierarchy of motion primitives, which are activated based on neural-like dynamics. Therefore, sequencing and blending are achieved by choosing suitable weights and connections. This approach was later combined with optimal control for continuous motion adaptation [22]. Although it generates seamless motions, its applicability is limited by the necessity of defining the model by hand.
Sequencing and blending of tasks has also been explored in the context of robot multi-task control. Salini et al. [28] combine different controllers in a QP formulation by defining a soft hierarchy of tasks. This corresponds to defining a sequence of skills with concurrent activations. Smooth transitions are achieved by smoothly varying the relative importance of skills (priorities) with manually-tuned weights. In [7], the skill priorities are instead optimized using the covariance matrix adaptation evolution strategy (CMA-ES) in order to superpose several controllers for motion generation. Modugno et al. [21] extended this idea to learn time-varying skill priorities given as a weighted sum of basis functions equally spaced in time. The corresponding weights can then be optimized using black-box optimization techniques such as CMA-ES [21] or Bayesian optimization (BO) [31, 16]. Our work is distinguished in that we directly learn the relative importance of skills along the task by differentiating through the optimization problem. In contrast to [21, 31, 16], we leverage LfD to learn sequences of previously-defined skills with seamless transitions. Therefore, our approach requires only a few initial demonstrations and no additional trials during the learning phase, thus improving on data efficiency and training cost compared to black-box optimization techniques.
III Background
III-A Multi-task control with quadratic programming
Quadratic programs (QP) [24, Chap. 16] are extensively used to formulate multi-task control of humanoid robots as a constrained optimization problem. Indeed, QPs can be solved very efficiently, while explicitly incorporating a wide variety of objectives and accounting for diverse constraints (see, e.g., [5, 6]). A QP solves a problem of the form

$$\min_{z} \; \tfrac{1}{2} z^\top Q z + q^\top z \quad \text{s.t.} \quad A z = b, \;\; G z \leq h, \qquad (1)$$

where $z \in \mathbb{R}^n$ is the optimization variable, $Q \in \mathcal{S}^n_{+}$ and $q \in \mathbb{R}^n$ are the parameters of the quadratic cost function with $\mathcal{S}^n_{+}$ denoting the manifold of positive-semidefinite (PSD) matrices, and $A$, $b$, $G$, $h$ are the constraints parameters. For robot multi-task control, QPs are typically used to minimize the weighted sum of a set of tasks, i.e., $\sum_k w_k \|\hat{\xi}_k - \xi_k\|^2$, where $\hat{\xi}_k$ and $\xi_k$ are the desired and current value of the task $k$, respectively, and $w_k$ is a weight setting the relative importance of the task $k$ with respect to the other tasks. Moreover, the constraints typically include the equations of motion (kinematics or dynamics), the technological limits of the system (e.g., joint limits), and interaction constraints (e.g., grasp or frictional contacts). In this paper, we use a QP to encode a sequence of skills, along which the weights scaling the importance of each skill vary, leading to smooth trajectories and transitions.
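To make the weighted multi-task QP concrete, the following sketch solves a small instance with equality constraints only by assembling the KKT linear system directly. The function name, targets, and constraint are illustrative, not taken from the paper.

```python
import numpy as np

def weighted_task_qp(targets, weights, E, d):
    """Solve min_x sum_k w_k ||x - xhat_k||^2  s.t.  E x = d
    by solving the KKT linear system (equality constraints only)."""
    n = targets[0].shape[0]
    W_total = sum(weights)                               # scalar total weight
    rhs_top = 2.0 * sum(w * t for w, t in zip(weights, targets))
    m = E.shape[0]
    K = np.block([[2.0 * W_total * np.eye(n), E.T],
                  [E, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([rhs_top, d]))
    return sol[:n]

# Two conflicting task targets; the constraint pins the first coordinate.
xhat1 = np.array([1.0, 0.0])
xhat2 = np.array([0.0, 1.0])
E = np.array([[1.0, 0.0]])
d = np.array([0.5])
x = weighted_task_qp([xhat1, xhat2], [3.0, 1.0], E, d)
# x[0] is forced to 0.5; x[1] is the weighted mean (3*0 + 1*1)/4 = 0.25
```

With only equality constraints, the QP reduces to one linear solve; inequality constraints (joint limits, contacts) would require an active-set or interior-point solver.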
III-B Karush-Kuhn-Tucker conditions
The Karush-Kuhn-Tucker (KKT) conditions [15] are first-order necessary conditions for a point $z^\ast$ to be a local solution of a constrained optimization problem. In particular, the KKT conditions corresponding to the QP (1) are (i) $\nabla_z L(z^\ast, \nu^\ast, \lambda^\ast) = 0$, with $L(z, \nu, \lambda) = \tfrac{1}{2} z^\top Q z + q^\top z + \nu^\top (A z - b) + \lambda^\top (G z - h)$ the Lagrangian function of the problem (1), and $\nu^\ast$, $\lambda^\ast$ the Lagrangian multipliers corresponding to its equality and inequality constraints, respectively; (ii) $A z^\ast = b$; (iii) $G z^\ast \leq h$; (iv) $\lambda^\ast \geq 0$; and (v) $\lambda^\ast \odot (G z^\ast - h) = 0$.
In addition to being used throughout the solving process of constrained optimization problems, the KKT conditions have been exploited in inverse optimal control (IOC). In IOC, trajectories are viewed as the solution of an optimization problem, which aims at minimizing an unknown (parametric) cost. In this context, Englert et al. [8] used the fact that demonstrations of such trajectories, under the assumption that they are optimal, fulfill the KKT conditions, to determine the optimal parameters of the underlying cost. (Similar ideas have also been explored in the context of inverse reinforcement learning (IRL), where the parameters of a reward function were selected by minimizing the norm of the expert's policy gradient [26].) We follow a similar reasoning and leverage the QP KKT conditions to define the loss of our sequencing approach.
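As a minimal numerical illustration, the five KKT conditions can be checked on a small QP whose solution and multiplier are derived by hand; the instance below is hypothetical.

```python
import numpy as np

# QP: min 1/2 z^T Q z + q^T z  s.t.  G z <= h  (no equality constraints)
Q = np.eye(2)
q = np.array([-1.0, 0.0])
G = np.array([[1.0, 0.0]])   # constraint: z_1 <= 0.5
h = np.array([0.5])

# Hand-derived solution for this instance: the constraint is active,
# so z* = (0.5, 0) with multiplier lambda = 0.5.
z_star = np.array([0.5, 0.0])
lam = np.array([0.5])

# (i) stationarity: grad_z L = Q z* + q + G^T lambda = 0
grad_L = Q @ z_star + q + G.T @ lam
assert np.allclose(grad_L, 0.0)
# (iii) primal feasibility, (iv) dual feasibility, (v) complementarity
assert np.all(G @ z_star <= h + 1e-12)
assert np.all(lam >= 0)
assert np.allclose(lam * (G @ z_star - h), 0.0)
```

Condition (ii) is vacuous here since the instance has no equality constraints.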
III-C Differentiable optimization layers
Recent works [2, 1] proposed to integrate optimization layers into neural architectures by differentiating through the corresponding optimization problems. In particular, Amos and Kolter [2] introduced OptNet, a neural architecture embedding QPs as individual layers. Namely, OptNet defines the output $z_{i+1}$ of the current layer as the solution of a QP whose parameters depend on the output $z_i$ of the previous layer, i.e.,

$$z_{i+1} = \arg\min_{z} \; \tfrac{1}{2} z^\top Q(z_i) z + q(z_i)^\top z \quad \text{s.t.} \quad A(z_i) z = b(z_i), \;\; G(z_i) z \leq h(z_i). \qquad (2)$$

In order to train OptNet using backpropagation, the layer (2) must be differentiable, i.e., the derivatives of the solution of the QP with respect to its input parameters must be computed. This is achieved by differentiating the KKT conditions of the problem at a given solution (see [2]). In this paper, we leverage OptNet to learn the importance of individual skills throughout the task.

IV Learning to Sequence and Blend Skills


In this section, we present our approach to sequence and blend manipulation skills. In the following, we assume a set of $S$ previously-defined individual robot skills (e.g., a skill library). The skills are considered as given black-box solutions, implying that their representations are unknown and may differ across the skills. At each instant, each skill $s$ outputs a desired control value $\hat{\xi}_s$, depending on a current state, to be given to the robot in order to execute the skill. For example, dynamical-systems-based skills [10] provide a desired end-effector velocity depending on the current end-effector position, and time-dependent skills such as [32] may output, e.g., a time-varying desired joint or end-effector position. The control values are specific to and may differ across skills. We then consider a manipulation task consisting of an unknown sequence of (some of) the aforementioned skills, possibly concurrently activated. We observe one or several optimal demonstrations of the task consisting of the observed control values, i.e., $\{\xi(t)\}_{t=0}^{1}$, where the phase variable $t \in [0, 1]$ encodes the task progress (in the remainder, we drop some dependencies to simplify the notation). In other words, $t = 0$ and $t = 1$ represent the beginning and the end of the task.
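Under the black-box assumption above, a skill reduces to a callable from state to desired control value, and the phase variable to a normalized clock. The sketch below fixes this interface; the names are illustrative, not from the paper.

```python
from typing import Callable
import numpy as np

# Black-box skill interface: a callable mapping the current state to a
# desired control value; internal representations may differ across skills.
Skill = Callable[[np.ndarray], np.ndarray]

def phase(elapsed: float, duration: float) -> float:
    """Time-driven phase variable t in [0, 1] encoding task progress:
    t = 0 at the beginning of the task and t = 1 at its end."""
    return min(max(elapsed / duration, 0.0), 1.0)

assert phase(0.0, 10.0) == 0.0
assert phase(2.5, 10.0) == 0.25
assert phase(12.0, 10.0) == 1.0   # clamped once the task duration is reached
```

Any representation (dynamical system, movement primitive, hand-coded controller) that exposes this call signature can be plugged into the formulation that follows.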
IV-A Illustrative example: pick-and-place with planar robots
For the sake of clarity, the different concepts underlying our approach are introduced generally before being illustrated for a pick-and-place task executed by planar robots with grippers. In this example, we observe a single manually-designed demonstration provided by a planar teacher robot that picks an object, transports it, and places it at a given location (see Fig. 1). The demonstration steps were achieved with proportional controllers activated using the weights of Fig. 1(f). We then consider a set of four skills, where two skills control the arm motion and two control the gripper. Although we next disclose the skill types, remember that they are considered as given black-box solutions in our approach. Indeed, each skill only provides a desired control value depending on the state at each task instant.
The arm skills are encoded as dynamical systems (DS) [10] trained with the control Lyapunov function scheme of [12]. The obtained DS, illustrated in Fig. 1(a), can then be adapted to new situations via translations and rotations. The desired control values of the DS-based skills correspond to the end-effector velocity and depend on the current end-effector position. The desired control values of the gripper skills correspond to the velocity of the gripper joints. These velocities are zero when the gripper is completely opened or closed, respectively, and constant otherwise. In this example, the phase variable is defined as $t = \tau / T$, with $\tau$ the elapsed time and $T$ the total duration of the task.
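A minimal sketch of such skills, assuming a linear point-attractor DS for the arm and a constant-velocity rule for the gripper. The gains and joint limits are illustrative, not the trained DS of [10, 12].

```python
import numpy as np

def ds_skill(goal, gain=2.0):
    """Point-attractor dynamical system: desired end-effector velocity
    pointing from the current position toward the goal."""
    goal = np.asarray(goal)
    def skill(x):
        return -gain * (np.asarray(x) - goal)
    return skill

def gripper_skill(direction, speed=1.0, lo=0.0, hi=1.0):
    """Constant joint velocity until the gripper is fully open/closed,
    then zero (as described for the open/close gripper skills)."""
    def skill(q):
        q = float(q)
        if (direction > 0 and q >= hi) or (direction < 0 and q <= lo):
            return 0.0
        return direction * speed
    return skill

reach = ds_skill(goal=[1.0, 0.5])
close = gripper_skill(direction=-1.0)
# the DS velocity points toward the goal and scales with the distance
assert np.allclose(reach([0.0, 0.5]), [2.0, 0.0])
assert close(0.0) == 0.0 and close(0.7) == -1.0
```

Both factories return plain callables, matching the black-box skill interface assumed by our formulation.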
IV-B Sequencing and blending of skills with QPs
Similarly to multi-task control, we propose to encode sequences of skills as QPs. Namely, given the desired control values $\hat{\xi}_s$ output by the individual skills and the current control values $\xi_s$, a sequence of skills can be generated by solving the following optimization problem

$$\min_{\xi} \; (\xi - \hat{\xi})^\top W(t) \, (\xi - \hat{\xi}) \qquad (3)$$

at each $t$, where $\xi = (\xi_1, \ldots, \xi_S)$ and $\hat{\xi} = (\hat{\xi}_1, \ldots, \hat{\xi}_S)$ stack the control values of the $S$ skills, and $W(t)$ is a varying weight matrix setting the relative importance of the skills throughout the sequence as a function of the phase variable $t$ encoding the task progress. The problem (3) is usually augmented with linear constraints related to the robotic system (see § III-A). In our case, we also include equality constraints $\xi_i = \xi_j$ for control values of the same type, i.e., if the skills $i$ and $j$ have the same type of outputs (e.g., both return end-effector pose values). For instance, following (3), the optimization problem of our illustrative example is formulated (here with diagonal weights for readability) as

$$\min_{\xi_1, \ldots, \xi_4} \; \sum_{s=1}^{4} w_s(t) \, \|\xi_s - \hat{\xi}_s\|^2 \quad \text{s.t.} \quad \xi_1 = \xi_2, \;\; \xi_3 = \xi_4.$$

The constraints come from the shared control values across skills, i.e., the end-effector velocity $\dot{x}$ for the two arm skills and the gripper joint velocity $\dot{q}_g$ for the two gripper skills. These constraints can directly be integrated into the optimization problem, which is equivalently written as

$$\min_{\dot{x}, \dot{q}_g} \; w_1(t) \|\dot{x} - \hat{\xi}_1\|^2 + w_2(t) \|\dot{x} - \hat{\xi}_2\|^2 + w_3(t) \|\dot{q}_g - \hat{\xi}_3\|^2 + w_4(t) \|\dot{q}_g - \hat{\xi}_4\|^2. \qquad (4)$$

Note that (3) can be equivalently formulated as (1) with the optimization variable $z = \xi$ and cost parameters $Q = 2 W(t)$ and $q = -2 W(t) \hat{\xi}$, with $W(t) \in \mathcal{S}_{+}$. Importantly, the skill ordering in (3) is arbitrary. Indeed, the sequence is defined by the weight matrix, which is learned from demonstrations, as explained next. Skills can be added by extending $\hat{\xi}$ with their control values and expanding $W(t)$ accordingly.
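When the weight blocks are diagonal and the only constraints tie skills sharing an output type, the QP solution is available in closed form as a weight-normalized average of the skill outputs. The following sketch illustrates this; the skill outputs and weights are illustrative.

```python
import numpy as np

def blend_shared_output(xhat_list, w_list):
    """Closed-form solution of the reduced QP (after substituting the
    equality constraints tying skills with the same output type): the
    blended control value is the weight-normalized average."""
    num = sum(w * np.asarray(x) for w, x in zip(w_list, xhat_list))
    return num / sum(w_list)

# Two arm skills proposing different end-effector velocities
xhat_reach = np.array([0.4, 0.0])
xhat_transport = np.array([0.0, 0.4])

# Early in the task the first skill dominates; mid-transition both blend
v_early = blend_shared_output([xhat_reach, xhat_transport], [0.9, 0.1])
v_mid = blend_shared_output([xhat_reach, xhat_transport], [0.5, 0.5])
assert np.allclose(v_early, [0.36, 0.04])
assert np.allclose(v_mid, [0.2, 0.2])
```

As the weights vary smoothly with the phase variable, the blended control value interpolates smoothly between the skill outputs, which is precisely the blending effect described above.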
Given one or several demonstrations of a manipulation task, we aim at learning the skill weight function $W(t)$, so that the reproduction, i.e., the sequence of skills obtained by solving (3) for $t \in [0, 1]$, replicates the demonstrated task. This corresponds to minimizing a loss function measuring the quality of the reproduction. To do so, we need to solve a nested optimization: for each time instance of the task, we solve (3), and the whole set of solutions is then used to minimize the loss. To solve this problem, we leverage OptNet [2] to integrate the QP (3) into a neural network. OptNet allows us (i) to represent the QP parameters as functions, and (ii) to differentiate with respect to the QP parameters to solve the outer optimization of our nested problem using gradient-based approaches. In other words, OptNet backpropagates the loss to optimize both the phase-dependent skill weights and the control outputs. Thus, we can learn the relative importance of the skills throughout the task execution via the matrix $W(t)$. Our proposed neural network takes the phase variable $t$ as input and consists of (i) a fully-connected layer coupled with a softmax activation function, whose outputs are the QP parameters (see § IV-D for details), and (ii) an OptNet layer (2), whose solution is the control command transmitted to the robot to execute the task. Our approach is illustrated in Fig. 2.

It is important to emphasize that the proposed formulation not only learns sequences of skills, but also blends the transitions between individual skills "for free". Indeed, the coupling of the fully-connected layer with a softmax activation induces smooth, non-binary weight functions, therefore leading to smooth variations of the relative importance of the skills, i.e., to smooth transitions. This allows our neural architecture to learn and reproduce seamless transitions, as usually observed in human demonstrations. This also implies that skills are not necessarily executed in a strict sequence, but may be activated concurrently if required by the task.
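The effect of the softmax coupling can be sketched with a hand-picked linear layer on the phase variable; the parameters below are illustrative, not learned.

```python
import numpy as np

def skill_weights(t, theta_W, theta_b):
    """Phase-dependent skill weights: a linear layer on the phase variable
    followed by a softmax, yielding smooth, non-binary activations that
    are positive and sum to one."""
    logits = theta_W * t + theta_b          # one logit per skill
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Hypothetical parameters making skill 0 dominate early and skill 1 late
theta_W = np.array([-8.0, 8.0])
theta_b = np.array([4.0, -4.0])
w_start, w_mid, w_end = (skill_weights(t, theta_W, theta_b)
                         for t in (0.0, 0.5, 1.0))
assert w_start[0] > 0.9 and w_end[1] > 0.9   # near-exclusive activation
assert np.isclose(w_mid[0], 0.5)             # smooth handover mid-task
```

Because the weights vary continuously with the phase, the handover between skills happens gradually, which is what produces the blended, pause-free transitions.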
The individual skill outputs may be defined either in task space (e.g., end-effector pose or velocity) or in joint space (e.g., joint position or velocity). In the former case, it may be desirable to directly solve the optimization (3) with respect to joint variables when executing the reproduction on the robot. To do so, the current control values can be expressed as functions of the joint values by exploiting the kinematic or dynamic relationship between the task- and joint-space variables. In our illustrative example, this corresponds to solving, during the reproduction,

$$\min_{\dot{q}, \dot{q}_g} \; w_1(t) \|J(q)\dot{q} - \hat{\xi}_1\|^2 + w_2(t) \|J(q)\dot{q} - \hat{\xi}_2\|^2 + w_3(t) \|\dot{q}_g - \hat{\xi}_3\|^2 + w_4(t) \|\dot{q}_g - \hat{\xi}_4\|^2, \qquad (5)$$

where the arm skill outputs are expressed as $\dot{x} = J(q)\dot{q}$, with $\dot{q}$ and $\dot{q}_g$ the arm and gripper joint velocities, respectively, and $J(q)$ the manipulator Jacobian. Finally, note that nonlinear relationships must be linearized for the QP formulation.
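A sketch of this joint-space resolution for a planar 2-DoF arm, assuming the standard two-link Jacobian and an invertible weighted normal system. Link lengths and states are illustrative.

```python
import numpy as np

def planar_2link_jacobian(q, l1=1.0, l2=1.0):
    """End-effector Jacobian of a planar two-link arm."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def joint_velocities(q, v_des, W):
    """Weighted least-squares joint velocities tracking v_des = J(q) qdot,
    i.e., the unconstrained minimizer of ||J qdot - v_des||^2_W."""
    J = planar_2link_jacobian(q)
    return np.linalg.solve(J.T @ W @ J, J.T @ W @ v_des)

q = np.array([0.3, 0.8])          # non-singular configuration
v_des = np.array([0.1, -0.2])     # blended desired end-effector velocity
qdot = joint_velocities(q, v_des, W=np.eye(2))
# away from singularities, the joint velocities reproduce v_des exactly
assert np.allclose(planar_2link_jacobian(q) @ qdot, v_des)
```

Near singular configurations $J^\top W J$ becomes ill-conditioned; a damped (regularized) solve would then be the usual remedy.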
IV-C Definition of the loss function
In this section, we take inspiration from the IOC approach of [8] to define the loss function used to train the neural network previously introduced. Namely, we assume that the demonstrations are optimal, i.e., they are optimal solutions of the QP problem (3) and thus satisfy its KKT conditions. As the QP constraints are satisfied during optimal demonstrations, the KKT conditions (ii)–(v) are automatically fulfilled. Therefore, determining the optimal parameters of our neural network can be understood as searching for the parameters fulfilling the first KKT condition for all the demonstrations. This corresponds to minimizing the loss

$$\mathcal{L} = \sum_{m} \sum_{t} \big\| \nabla_{\xi} L\big(\xi_m(t), \nu, \lambda\big) \big\|^2, \qquad (6)$$

where we sum over the demonstrations $m$ and over the progress of the task via the phase variable $t$. The Lagrangian of the problem (3) and its derivative for the $m$-th demonstration are

$$L = (\xi_m - \hat{\xi})^\top W(t) (\xi_m - \hat{\xi}) + \nu^\top (A \xi_m - b) + \lambda^\top (G \xi_m - h), \qquad \nabla_{\xi} L = 2 W(t) (\xi_m - \hat{\xi}) + A^\top \nu + G^\top \lambda,$$

where $\hat{\xi}$ is the vector of demonstrated skill outputs, $A$, $b$, $G$, $h$ are the stacked constraints parameters, and $\nu$, $\lambda$ are the vectors of Lagrangian multipliers. Moreover, we can express the multipliers as functions of $W(t)$ for each demonstration by minimizing the loss subject to the KKT complementarity condition, i.e., $\lambda \odot (G \xi_m - h) = 0$. Therefore, by setting the optimization variable to the output of our network, the loss of each demonstration is obtained by evaluating the stationarity residual at the demonstrated control values. The loss (6) inherently includes the task specifications via the demonstrations and the QP KKT conditions, and does not require additional task-specific design. To avoid the singular solution $W(t) = 0$, we leverage the softmax activation function, as explained next. Thus, at least one skill is given a high relative importance at each instant of the task.
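Ignoring the constraint terms for readability, the per-demonstration loss can be sketched as the summed squared norm of the Lagrangian gradient evaluated at the demonstrated control values. All values below are illustrative.

```python
import numpy as np

def kkt_loss(W_fn, demo, skill_outputs, phases):
    """Sum over a demonstration of the squared stationarity residual
    ||2 W(t) (xi_demo - xi_hat)||^2 (constraint terms omitted in this
    unconstrained sketch)."""
    loss = 0.0
    for t, xi_demo, xi_hat in zip(phases, demo, skill_outputs):
        grad = 2.0 * W_fn(t) @ (xi_demo - xi_hat)
        loss += float(grad @ grad)
    return loss

# Toy check: if the skill outputs coincide with the demonstration, the
# stationarity residual, and hence the loss, vanishes for any PSD W(t).
phases = [0.0, 0.5, 1.0]
demo = [np.array([1.0, 0.0])] * 3
skill_outputs = [np.array([1.0, 0.0])] * 3
assert kkt_loss(lambda t: np.eye(2), demo, skill_outputs, phases) == 0.0
```

The trivial minimizer $W(t) = 0$ would also zero this loss for any demonstration, which is exactly why the softmax normalization of the weights is needed.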
IV-D Skill weights as positive-semidefinite matrices
As mentioned previously, the QP parameters are determined by the first part of our neural network. Specifically, the cost parameters are $Q = 2 W(t)$ and $q = -2 W(t) \hat{\xi}$, where the weight matrix $W(t)$ is learned by the network. The constraints parameters relate to the skill outputs and to the robot's physical characteristics. To obtain valid QPs, or equivalently to prevent skills from having negative relative importance weights, the weight matrices must be PSD, i.e., $W(t) \in \mathcal{S}_{+}$. We here describe two approaches to learn PSD weight matrices.
Diagonal weight matrices
In this case, we define

$$W(t) = \mathrm{blkdiag}\big(w_1(t) I, \ldots, w_S(t) I\big), \qquad (7)$$

where each block $w_s(t) I$ weights the output of the $s$-th skill, and the scalars $w_s(t)$ are obtained from the fully-connected layer followed by a softmax activation function. The latter ensures that the scalar weights are positive and sum to $1$, thus guaranteeing that $W(t)$ is PSD and that at least one skill is activated at any instant of the task. Notice that we defined the different blocks as proportional to identity matrices to avoid altering the outputs of individual skills.
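A sketch of this construction, assuming one scalar weight per skill obtained from a softmax; the logits and block dimensions are illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def diagonal_weight_matrix(logits, dims):
    """Block-diagonal W(t): skill s gets w_s * I of its output dimension.
    The softmax makes the weights positive and sum to one, so W is PSD
    and at least one skill is always active."""
    w = softmax(logits)
    n = sum(dims)
    W = np.zeros((n, n))
    i = 0
    for w_s, d in zip(w, dims):
        W[i:i + d, i:i + d] = w_s * np.eye(d)
        i += d
    return W

# Two skills: a 2-D arm output and a 1-D gripper output
W = diagonal_weight_matrix(np.array([2.0, 0.0]), dims=[2, 1])
assert np.all(np.linalg.eigvalsh(W) >= 0)     # PSD
assert np.isclose(W[0, 0] + W[2, 2], 1.0)     # skill weights sum to one
```

Scaling each block by a single scalar leaves the direction of each skill's output untouched, only its relative influence changes.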
Full weight matrices
Such matrices allow us to express correlations between different skills, i.e., between their control values, throughout the task. This naturally occurs in various tasks. For example, when approaching and grasping an object, the hand closure is correlated with the velocity at which the object is approached. We learn matrices

$$W(t) = \begin{pmatrix} W_{11}(t) & \cdots & W_{1S}(t) \\ \vdots & \ddots & \vdots \\ W_{S1}(t) & \cdots & W_{SS}(t) \end{pmatrix}, \qquad (8)$$

where the off-diagonal blocks $W_{ij}(t)$ encode the correlations between the outputs of the skills $i$ and $j$. To guarantee the positive semidefiniteness of the matrices $W(t)$, we propose to learn the diagonal and off-diagonal blocks separately. Firstly, the diagonal blocks $W_{ss}(t) = w_s(t) I$ are obtained as described in the previous paragraph. Secondly, the off-diagonal matrices are obtained by leveraging the properties of matrices with positive block-diagonal elements [4], namely

$$W_{ij} = W_{ii}^{1/2} \, C_{ij} \, W_{jj}^{1/2}, \qquad (9)$$

where $C_{ij}$ is a contraction matrix, i.e., $\|C_{ij}\| \leq 1$. Therefore, we use a second fully-connected layer to learn the contraction matrices, with a tanh and a sigmoid activation function applied to its outputs so that $\|C_{ij}\| \leq 1$. The off-diagonal elements are then computed recursively using the right-hand side of (9). For instance, in the case of a matrix composed of 3 skills, we first compute $W_{12}$ from $W_{11}$ and $W_{22}$; the resulting $2 \times 2$ block matrix is then combined with $W_{33}$ to obtain $W_{13}$ and $W_{23}$. Note that, to facilitate the training of full weight matrices, we initialize the parameters of the diagonal terms with a previously-trained diagonal model.
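Since the diagonal blocks are scalar multiples of the identity, their square roots are trivial, and the PSD construction can be sketched directly; the block sizes and contraction values below are illustrative.

```python
import numpy as np

def full_weight_matrix(w, C_blocks, dims):
    """Assemble a full PSD weight matrix from positive scalar weights w_s
    (diagonal blocks w_s * I) and contraction matrices C_ij with spectral
    norm <= 1, via W_ij = W_ii^{1/2} C_ij W_jj^{1/2}."""
    n = sum(dims)
    offs = np.cumsum([0] + list(dims))
    W = np.zeros((n, n))
    for s, d in enumerate(dims):
        W[offs[s]:offs[s+1], offs[s]:offs[s+1]] = w[s] * np.eye(d)
    for (i, j), C in C_blocks.items():
        # diagonal blocks are w_s * I, so their square roots are sqrt(w_s) * I
        Wij = np.sqrt(w[i]) * C * np.sqrt(w[j])
        W[offs[i]:offs[i+1], offs[j]:offs[j+1]] = Wij
        W[offs[j]:offs[j+1], offs[i]:offs[i+1]] = Wij.T
    return W

# Two skills (2-D and 1-D outputs) with a correlation block of norm < 1
C = np.array([[0.5], [0.2]])                     # ||C||_2 ~= 0.54 <= 1
W = full_weight_matrix([0.7, 0.3], {(0, 1): C}, dims=[2, 1])
assert np.all(np.linalg.eigvalsh(W) >= -1e-12)   # PSD up to round-off
```

If the contraction constraint were violated (spectral norm above one), the resulting matrix could have negative eigenvalues, which is exactly what the tanh/sigmoid parameterization rules out.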
V Experiments
In this section, we evaluate our approach with different robotic platforms and manipulation tasks. All computations were performed on a laptop with GHz CPU and GiB RAM. A video of the experiments accompanies the paper (https://youtu.be/00NXvTpLYU), and source codes are available at https://github.com/NoemieJaquier/sequencingblending/.
V-A Illustrative example: pick-and-place with planar robots
We first consider the pick-and-place task introduced in § IV-A and train our approach using diagonal and full weight matrices on the provided single manually-designed demonstration. In order to guarantee that one arm and one gripper skill are activated at each instant of the task, we use one softmax activation function for each of the arm and gripper pairs of skills. The task is then reproduced by the 4-DoF robot. As a baseline, we consider the case where the QP (4) with diagonal weights does not require additional constraints, so that its solution is the weight-normalized average of the skill outputs sharing each control value. In this case, as the QP solution is readily available, we do not need to solve a nested optimization to minimize a given loss. Instead, the loss (6) can be minimized independently for each value of $t$ with classical optimization methods. Finally, a student robot is requested to reproduce the learned sequence of skills with different pick and place positions. To do so, the arm DS skills are adapted to the new target points. For all reproductions, the QP is solved with respect to the arm and gripper joint velocities using (5).
Fig. 1(b) depicts the demonstrated trajectory, as well as the reproduction of the task by the robot with a diagonal weight matrix. Our approach successfully sequences the available skills and reproduces the task by picking and placing the object at the required locations. The differences in trajectory between the demonstration and the reproduction are due to the fact that the DS arm skills naturally follow a different trajectory than the demonstration between the target points (remember that the demonstration was generated independently from the given skills). For the same reason, the learned weights slightly differ from the manually-designed ones used to generate the demonstration (Fig. 1(f)). The differences in trajectory are attenuated when using a full weight matrix (see Fig. 1(c)), where correlations between skills are exploited to better match the demonstration. Note that only the diagonal weights are represented in Fig. 1(f). As expected, the baseline looks similar to our approach with a diagonal weight matrix (see Fig. 1(d)). Slight differences may be due to the different optimizations and to local minima in the loss. However, notice that the baseline applies only to very simple QPs, which are unrealistic for most applications (including the experiments of § V-B and § V-C). Also, in contrast to our approach, the baseline does not learn the weight matrix as a parametric function of the phase variable. Fig. 1(e) depicts the reproduction of the learned sequence by the student robot using a diagonal weight matrix, showing that our approach successfully generalizes to different pick and place locations. As the full weight matrix naturally overfits a single demonstration, it is not well suited to generalize in this case.
V-B Pouring task with a humanoid robot
Here, we apply our approach in a real-world scenario to learn a complex sequence of skills on the humanoid robot ARMAR-6 [3]. The robot is positioned in front of a table, on which an empty glass and a plastic bottle partially filled with orange juice are placed. The scenario consists of a pouring task, where the robot grasps the bottle, pours juice into the glass, and places the bottle back on the table. The positions of the objects are assumed to be known a priori by the robot, but could equally be inferred by a perception system.
As for the previous experiment, a set of skills is provided as black-box solutions. Specifically, four skills are defined for the arm, namely reaching for the bottle, pouring, placing the bottle back, and retracting the arm. Moreover, two joint-velocity-based skills are provided for the five-fingered hand, namely opening and closing the hand in a power cylindrical grasp. The four arm skills are defined by DS with radial vector fields pointing toward a fixed point attractor. Their desired control values correspond to the end-effector linear and angular velocities, which depend on the current end-effector position and orientation. The fixed point attractors of the four arm skills are the robot hand grasp pose on the bottle for the reaching skill, a tilted hand pose above the glass for the pouring skill, the hand pose at the position of the bottle on the table for the placing skill, and the hand resting pose for the retracting skill. The hand skills are defined similarly to the gripper skills of the pick-and-place example, and thus open and close all finger joints by controlling their velocity. We train our approach on seven manually-designed demonstrations for which an operator defined the arm and hand trajectories. The bottle and glass positions were varied along the two horizontal axes of the table. As previously, we use two softmax activation functions for the arm and hand skills, and the phase variable is defined from the elapsed time as in § IV-A.
After the learning phase, the robot successfully reproduced the pouring task using both diagonal and full weight matrices (see Fig. 3, top left). Moreover, our approach not only succeeded at learning the desired sequence of skills, but also resulted in seamless transitions, as indicated by the absence of pauses and by the smoothness of the trajectories depicted in Fig. 3 (bottom left). The learned weight matrices are represented in Fig. 3 (right) for the diagonal and full cases. Although the resulting trajectories look similar, the matrices still differ in the relative importance attributed to each skill. Notably, the model with full weight matrices exploits the correlations between the skills to shape the reproduced trajectory, thus featuring lower diagonal values than the diagonal model. Therefore, full weight matrices have better representation capabilities than their diagonal counterparts. However, this comes at the expense of generalization abilities. Indeed, as shown in Fig. 3, the diagonal model was able to generalize to bottle and glass locations that were outside the demonstrated range (here, the bottle and glass positions were swapped), which the full model could only achieve for locations close to the demonstrations. Finally, we compared our approach to a baseline obtained by manually sequencing the given skills without any learning or blending. As shown in Fig. 3 (bottom left), the baseline trajectory is characterized by obvious jerky transitions. The resulting timing would cause the robot to overfill the glass, thus failing the reproduction. Importantly, our approach is well-suited for learning and executing the sequence of skills on a real robot. Indeed, training for the pouring task lasted a couple of minutes, and the testing time was on the order of milliseconds per timestamp, which allowed us to execute our approach at the robot's control frequency.
V-C Bimanual sweeping task learned from human data




We aim at evaluating our approach to sequence and blend skills based on human demonstrations, i.e., on data for which no ground truth is easily available. To do so, we consider a bimanual sweeping task from the KIT motion database [18, 14], in which a human transfers cucumber slices from a cutting board to a bowl. At the beginning of the demonstrations, a subject stands in front of a table. A cutting board, on which cucumber slices are placed, is positioned along the edge of the table in front of the human. The human first grasps a plastic bowl with the left hand and a knife with the right hand using cylindrical power grasps. Then, s/he holds the bowl below the table edge next to the cutting board and pushes the cucumber slices into the bowl with the knife. Finally, the human places both the knife and the bowl back.
For the bimanual sweeping task, we consider the motion of each arm separately. Moreover, we use demonstrations of the aforementioned sweeping task performed by two different subjects. First, three naturally-varying demonstrations of the first subject are used to obtain a skill library. Here, we consider a set of four low-level skills per arm. Each human demonstration is manually segmented into four parts corresponding to these skills. In this experiment, we use via-points movement primitives (VMP) [32], which offer powerful skill representations that are easily adaptable to new start, goal, and via-points after training. Each skill is then represented by a time-dependent VMP trained on the corresponding segments of the demonstrations. The desired control values are the end-effector position and unit-quaternion-based orientation given by the mean trajectory retrieved by the VMPs, and thus depend on the time. All VMPs are executed with the start and goal poses defined by the desired task. The timing of the VMP skills is defined by the duration of the entire task. Within our model, every skill trajectory is then evaluated at the evolving time based on the overall phase variable. The resulting skills are illustrated in Fig. 4(a). As for the previous experiments, these skills are considered as black-box solutions, meaning that their representation is not directly known by our model. We then use three demonstrations provided by a second subject to train two models of our approach (left and right arm separately) with diagonal and full weight matrices. Note that these demonstrations include variations, as human motions naturally vary across executions of the same task.
A simulated kinematic human model, as well as models of the bowl, knife, and table, are used for the reproduction phase. In this case, the model with diagonal weight matrices could not reproduce the task, as it was not able to closely fit the demonstrations (see Fig. 4(b)–4(e)). This is due to the significant differences between the low-level skill trajectories (trained on the first subject) and the demonstrations (provided by the second subject). Notice that such differences also appeared in the pick-and-place experiment. However, as opposed to the sweeping task, the arm trajectories between the pick and place locations did not influence the task success, allowing both diagonal and full weight matrices to be used. For the bimanual sweeping task, only full weight matrices led to a successful reproduction by learning correlations between skills. Notice that, although two separate models were trained for the left and right arms, the learned full weight functions conserved the timing of the motions, allowing both arms to be synchronized during the reproduction. Also, the training and testing times were similar to the pouring task.
VI Conclusion
We proposed a skill-agnostic formulation to learn to sequence and blend skills using QP-based differentiable optimization layers. This allows us to represent the relative importance of skills as a function of the task progress and to optimize it for a given loss with gradient-based approaches. Our experiments showed that, provided a set of black-box skills and one or a few demonstrations of a task, our approach not only learns unknown sequences composed of various types of skills, but also generates smooth motions with seamless, blended transitions. Overall, our diagonal model is advantageous for generalization, while full weight matrices are beneficial when demonstrations must be closely followed.
It is worth noticing that the considered pouring and sweeping tasks are generally difficult to learn with a single model. Instead, our approach decomposes a task by combining several skills, which are easy to train and potentially reusable across tasks. Moreover, it requires only one or a few demonstrations of the complete task, making it less cumbersome to train than trial-and-error-based models. This is a major advantage compared to black-box optimization techniques used in multi-task control, although detailed performance comparisons are deferred to future work. Finally, in contrast to end-to-end methods, our formulation is modular, fast to train, and interpretable, as the relative importance of skills is directly embedded in the weight matrices.
Importantly, the performance of our approach highly depends on the capabilities of the given individual skills. Namely, a given task can be reproduced only if the provided skill library contains a set of skills that can be sequenced and combined to do so. Also, our approach generalizes to new object locations only under the condition that the corresponding skills successfully adapt to these locations. The dependency of the model parameters on a time-driven phase variable also limits generalization. This can be overcome by defining the phase variable as a time-independent, perception-based measure of task progress, which we will explore in the future.
One drawback of our approach is that the dimensionality of the optimization variable increases rapidly with the number of different types of skills, i.e., skills that provide different control variables. To be applicable to cases featuring a complex library with many different types of skills, we will extend our approach to handle hierarchies of skills. For instance, high-level skills, e.g., bringing cucumber slices to a bowl, may first be learned with our approach as sequences of low-level skills, and then combined in a complex task, e.g., preparing a salad, with an additional QP-based formulation. We will then evaluate our approach in more complex scenarios including, e.g., hierarchies and soft prioritization of skills.
References
[1] (2019) Differentiable convex optimization layers. In NeurIPS.
[2] (2017) OptNet: differentiable optimization as a layer in neural networks. In ICML.
[3] (2019) ARMAR-6: a high-performance humanoid for human-robot collaboration in real-world scenarios. IEEE RAM 26 (4), pp. 108–121.
[4] (2007) Positive definite matrices. Princeton University Press.
[5] (2016) On weight-prioritized multi-task control of humanoid robots. IEEE Trans. Autom. Control 63 (6), pp. 1542–1557.
[6] (2008) Robust balance optimization control of humanoid robots with multiple non-coplanar grasps and frictional contacts. In IEEE ICRA, pp. 3187–3193.
[7] (2015) Multiple task optimization with a mixture of controllers for motion generation. In IEEE/RSJ IROS, pp. 6416–6421.
[8] (2017) Inverse KKT: learning cost functions of manipulation tasks from demonstrations. IJRR 36 (13–14), pp. 1474–1488.
[9] (2005) Motor primitives in vertebrates and invertebrates. Curr. Opin. Neurobiol. 15 (6), pp. 660–666.
[10] (2011) Learning nonlinear multivariate dynamics of motion in robotic manipulators. IJRR 30 (1), pp. 80–117.
[11] (2009) Coding and use of tactile signals from the fingertips in object manipulation tasks. Nat. Rev. Neurosci. 10 (5), pp. 345–359.
[12] (2014) Learning control Lyapunov function to ensure stability of dynamical system-based robot reaching motions. Rob. Auton. Syst. 62 (6), pp. 752–765.
[13] (2012) Robot learning from demonstration by constructing skill trees. IJRR 31 (3), pp. 360–375.
[14] (2020–2021) The KIT bimanual manipulation dataset. In IEEE/RAS Humanoids.
[15] (1951) Nonlinear programming. In Berkeley Symp. on Mathematical Statistics and Probability, pp. 481–492.
[16] (2020) Sample-efficient learning of soft priorities for safe control with constrained Bayesian optimization. In IEEE IRC, pp. 406–407.
[17] (2012) Adaptive movement sequences and predictive decisions based on hierarchical dynamical systems. In IEEE/RSJ IROS, pp. 2082–2088.
[18] (2016) Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE TRO 32 (4), pp. 796–809.
[19] (2015) Learning movement primitive attractor goals and sequential skills from kinesthetic demonstrations. Rob. Auton. Syst. 74, pp. 97–107.
[20] (2015) Probabilistic progress prediction and sequencing of concurrent movement primitives. In IEEE/RSJ IROS, pp. 449–455.
[21] (2016) Learning soft task priorities for safe control of humanoid robots with constrained stochastic optimization. In IEEE/RAS Humanoids, pp. 101–108.
[22] (2014) Receding horizon optimization of robot motions generated by hierarchical movement primitives. In IEEE/RSJ IROS, pp. 129–135.
[23] (2000) Motor learning through the combination of primitives. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 355, pp. 1755–1769.
[24] (2006) Numerical optimization. Second edition, Springer.
[25] (2018) Using probabilistic movement primitives in robotics. Auton. Robot. 42 (3), pp. 529–551.
[26] (2016) Inverse reinforcement learning through policy gradient minimization. In AAAI, pp. 1993–1999.
[27] (2020) Learning and sequencing of object-centric manipulation skills for industrial tasks. In IEEE/RSJ IROS, pp. 9072–9079.
[28] (2011) Synthesis of complex humanoid whole-body behavior: a focus on sequencing and tasks transitions. In IEEE ICRA, pp. 1283–1290.
[29] (2019) Merging position and orientation motion primitives. In IEEE ICRA, pp. 7041–7047.
[30] (2012) Reinforcement learning with sequences of motion primitives for robust manipulation. IEEE TRO 28 (6), pp. 1360–1370.
[31] (2018) Sample-efficient learning of soft task priorities through Bayesian optimization. In IEEE/RAS Humanoids, pp. 1–6.
[32] (2019) Learning via-point movement primitives with inter- and extrapolation capabilities. In IEEE/RSJ IROS, pp. 4301–4308.