Auto-conditioned Recurrent Mixture Density Networks for Complex Trajectory Generation

09/29/2018 · by Hejia Zhang, et al. · University of Southern California

Recent advancements in machine learning research have given rise to recurrent neural networks that are able to synthesize high-dimensional motion sequences over long time horizons. By leveraging these sequence learning techniques, we introduce a state transition model (STM) that is able to learn a variety of complex motion sequences in joint position space. Given few demonstrations from a motion planner, we show in real robot experiments that the learned STM can quickly generalize to unseen tasks. Our approach enables the robot to accomplish complex behaviors from high-level instructions that would require laborious hand-engineered sequencing of trajectories with traditional motion planners. A video of our experiments is available at




I Introduction

Defining motion plans for robotic manipulators is a challenging task when the behavior specification cannot be simply expressed as a sequence of waypoints the end-effector has to follow while opening and closing. Often, laborious hand-engineering is required to compute such task space control inputs to the motion planner in order to generate the trajectory in joint position space. On the other hand, Learning from Demonstration (LfD) enables machine learning models to be trained from expert behavior, without a formal program that encodes the motion plan.

In this work, we investigate how a state-transition model can be learned from a few demonstrations to generate complex motion plans with high-level task inputs. As we show in our experiments, our model is able to synthesize circular trajectories with varying radii, which it was able to generalize to from a sparse set of demonstration trajectories.

Tapping into the potential of deep learning models for motion planning has been reported to lead to two orders of magnitude in speed improvements over conventional planning algorithms [1], such as optimal Rapidly-exploring Random Trees (RRT*) [2] and Batch Informed Trees (BIT*) [3].

In this work, we propose a deep learning architecture and training methodology that can efficiently learn complex motion plans for a seven degrees-of-freedom robot arm in joint position space. Based on a few demonstrations, our approach can efficiently learn state transitions for various trajectories while generalizing to new tasks.

Figure 1: Real-robot execution of a block stacking task on the Sawyer robot using the learned state transition model synthesizing complex trajectories in joint position space.

Our contributions are as follows:

  1. We present a training procedure and stochastic recurrent neural network architecture that can efficiently learn complex motions from demonstrations in joint position space.

  2. In combination with a learned inverse dynamics model, we show real-robot results on an end-to-end learnable open-loop control pipeline.

  3. We provide extensive real-robot experiments that demonstrate the ability of our STM to generalize to tasks that it has not been trained on.

  4. The generalizability allows our model to accomplish complex behaviors from high-level instructions which would traditionally require laborious hand-engineering and sequencing of trajectories from motion planners.

The paper is organized as follows: We first review related work in Sec. II and then describe our STM in Sec. III. Finally, we show our experiments and results in Sec. IV and Sec. V, respectively.

Figure 2: Visualization of our proposed auto-conditioned recurrent mixture density network to model the state transitions, unrolled over 6 time steps, with auto-conditioning length of 2 and ground truth length of 2.

II Related Work

Learning from Demonstration (LfD), also referred to as imitation learning, has been widely studied in the robotics research community.

Behavioral cloning approaches use supervised learning to train a model to imitate state-action sequences from an expert and have led to early successes in robot learning [4].


Given demonstrations from an expert policy which follows an unknown reward function $R: \mathcal{S} \to \mathbb{R}$, where $\mathcal{S}$ denotes the set of states, inverse reinforcement learning (IRL) and apprenticeship learning approaches attempt to recover the expert's reward function such that a separate policy can be trained in a different context via reinforcement learning given that reward function.

Inspired by state-of-the-art deep learning techniques for computer vision, such as generative adversarial networks, generative adversarial imitation learning (GAIL) [5, 6, 7] approaches learn a policy via reinforcement learning that aims to confound a separate discriminator network which classifies whether a roll-out stemmed from the policy or from the expert.

In this paper we study trajectory generation from a supervised learning perspective where we are given a set of expert trajectories that are represented by sequences of states. Borrowing architectures and training methodologies from state-of-the-art sequence learning techniques, our work addresses a fundamental issue in behavioral cloning which is the compounding error between the expert and the generated behavior over the course of the trajectory.

Long short-term memory networks (LSTM) [8] are widely used in time series prediction, especially in speech synthesis and speech recognition. Wang et al. [9] use auto-regressive recurrent mixture density networks for parametric speech synthesis. Graves et al. use LSTMs to recognize speech [10], generate text and synthesize handwriting [11].

Synthesizing complex motions has been a long-standing interest of the computer graphics community [12, 13]. Peng et al. [14] apply reinforcement learning to synthesize a motion sequence in a physics-based environment. Li et al. [15] use recurrent neural networks (RNN) to learn sequences of multimodal 3D human motions. Sun et al. [16] apply RNNs to predict a 3-DOF pedestrian trajectory using long-term data from an autonomous mobile robot deployment.

In our work, we leverage recent advancements in recurrent network training to learn sequences of robot states. Auto-conditioned recurrent neural networks [13] are used to synthesize complex trajectories over long time spans.

Several approaches have been proposed to improve the training of RNNs, e.g. Professor Forcing [17], Data as Demonstrator (DaD) [18], auto-conditioning [13] and Dataset Aggregation (DAgger) [19]. At scheduled intervals in the training procedure, these methods feed the RNN's previous outputs back into the RNN as input to the following cells to improve the prediction performance. Such a methodology makes the RNN more robust to deviations from expert states while the RNN is unrolled over longer time spans without training inputs. Such deviations would otherwise cause the error to accumulate over time.

While recurrent neural networks have been shown to learn and predict time series data over thousands of time steps [13], a roadblock toward their application in a robotics context is the lack of uncertainty representation in the predicted states. Besides the stochasticity of the real world, the trajectory generation model also needs to account for multiple possible solutions when finding trajectories. A commonly used machine learning model to capture multimodal distributions is the Mixture Density Network (MDN) [20], which represents multivariate Gaussian Mixture Models (GMM).

Combining an RNN with an MDN was first shown by Schuster [21], where the model is used to learn sequential data while capturing its stochasticity.

Similar to Rahmatizadeh et al. [22], we combine an LSTM with an MDN to architect the state transition model, but perform the trajectory synthesis in the higher-dimensional joint position space, in contrast to Cartesian space. Thanks to auto-conditioning, our method can generate trajectories from perfect demonstrations alone, since in our training procedure the STM automatically learns to correct from states deviating from the demonstrations, whereas the method presented in [22] uses explicit demonstrations that recover from undesired states back to the desired motion. Furthermore, we present results on training a separate, inverse dynamics model that serves as a torque controller which estimates the required actuator control commands to steer between the joint positions synthesized by the STM.

III Our Approach

The STM is trained via supervised learning on demonstrations from a motion planner and predicts the sequence of states $s_1, \dots, s_T$ given the start state $s_0$ and the desired goal state $g$. We model the state transitions via an LSTM combined with a mixture density network (MDN) to capture the probability distribution of future states (see Fig. 2).

The MDN models a multivariate mixture of Gaussians by estimating the distribution over next states as a linear combination of Gaussian kernels:

$$p(s_{t+1} \mid s_t) = \sum_{i=1}^{K} \pi_i(s_t)\, \phi_i(s_{t+1} \mid s_t),$$

where $K$ is the number of Gaussians modelled by the MDN, $\pi_i$ is the learned mixing coefficient and $\phi_i$ is the $i$-th Gaussian kernel of the form

$$\phi_i(s_{t+1} \mid s_t) = \frac{1}{(2\pi)^{d/2}\, \sigma_i^{d}} \exp\!\left(-\frac{\lVert s_{t+1} - \mu_i \rVert^2}{2\sigma_i^2}\right).$$

The kernel mean $\mu_i$ and its standard deviation $\sigma_i$ are learned by the model.

Given the ground-truth state pair $(s_t, s_{t+1})$, the MDN loss is defined as the negative log-likelihood:

$$\mathcal{L} = -\log \sum_{i=1}^{K} \pi_i(s_t)\, \phi_i(s_{t+1} \mid s_t).$$

We update the MDN's weights to minimize the loss via the Adam optimizer [23].
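As a concrete sketch of the loss above, the negative log-likelihood of a ground-truth next state under an isotropic Gaussian mixture can be computed as follows. This is a minimal NumPy illustration; in the actual model the parameters π, μ, σ are emitted by the LSTM, which we treat here as given arrays:

```python
import numpy as np

def mdn_nll(pi, mu, sigma, target):
    """Negative log-likelihood of `target` under an isotropic Gaussian
    mixture with K components in d dimensions.
    pi: (K,) mixing coefficients, mu: (K, d) means, sigma: (K,) stddevs."""
    d = mu.shape[1]
    diff = target[None, :] - mu                       # (K, d)
    sq = np.sum(diff ** 2, axis=1)                    # squared distances, (K,)
    log_phi = (-sq / (2 * sigma ** 2)
               - d * np.log(sigma)
               - 0.5 * d * np.log(2 * np.pi))         # log of each kernel
    log_mix = np.log(pi) + log_phi
    # log-sum-exp over components for numerical stability
    m = log_mix.max()
    return -(m + np.log(np.sum(np.exp(log_mix - m))))
```

The log-sum-exp trick avoids underflow when the kernels assign very low density to the target, which happens frequently early in training.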

As in [22], we combine an LSTM with an MDN to capture the multi-modal nature of trajectory generation since, in many cases, there are multiple possible solutions connecting start and goal states. We combine the recurrent MDN with auto-conditioning [13], a training schedule that, at regular intervals during training, feeds the LSTM's output back into the cell computing the next state for a fixed number of time steps (see Fig. 2). This enables the network to correct itself from states that deviate from demonstrations: by learning from inputs where the network diverges from expert behavior, we capture the distribution of inputs that would cause a compounding error when rolling out the STM in the real world, where the demonstrations are no longer available as inputs. This technique greatly improves the performance, as we report in Sec. V.
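The auto-conditioning schedule can be illustrated with a small sketch that builds the unrolled input sequence by alternating ground-truth states with the model's own predictions. The function name is illustrative, the lengths of 2 mirror the caption of Fig. 2, and `step_fn` stands in for one pass through the LSTM-MDN cell:

```python
import numpy as np

def autoconditioned_inputs(demo, step_fn, gt_len=2, ac_len=2):
    """Build the input sequence for one training unroll: alternate
    `gt_len` ground-truth states with `ac_len` of the model's own
    predictions, as in auto-conditioned RNN training.
    demo: (T, d) demonstration states; step_fn: maps a state to the
    model's predicted next state."""
    inputs = []
    pred = demo[0]
    for t in range(len(demo)):
        phase = t % (gt_len + ac_len)
        if phase < gt_len:
            x = demo[t]          # feed the expert state
        else:
            x = pred             # feed the network's own prediction back in
        inputs.append(x)
        pred = step_fn(x)        # prediction for the next time step
    return np.stack(inputs)
```

Because the network periodically sees its own (possibly drifting) outputs as inputs, the loss against the demonstration explicitly penalizes failing to steer back toward the expert trajectory.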

Figure 3: Gripper position trajectories for drawing circles of varying radii which the STM has not been explicitly trained on before. Units on the axes are in meters.

IV Experiments

In our experiments we focus on real-robot applications of our proposed STM architecture and training procedure. We rely on simulators to train the state transition model for the Sawyer robot, a seven degrees-of-freedom robot arm, equipped with a parallel gripper as end-effector.

We collected demonstration trajectories, i.e. sequences of states, in the Gazebo [24] simulator by using the inverse kinematics solver provided by Rethink Robotics for the Sawyer robot. Both the start and goal configuration for each demonstration trajectory are perturbed by uniform noise to capture a larger state space, which improves the generalizability of our method.

In the following experiments on the Sawyer robot, we model the state space as follows: the thirteen-dimensional state $s_t$ at time step $t$ is represented by the seven joint angles $\theta_t^{(1)}, \dots, \theta_t^{(7)}$, plus the current relative gripper position to the goal $x_g - x_t$ and the time-independent goal position $x_g$ in Cartesian coordinates:

$$s_t = \left[\theta_t^{(1)}, \dots, \theta_t^{(7)},\; x_g - x_t,\; x_g\right].$$

In our definition, the state does not rely on the environment dynamics. This assumption is crucial for the STM to be transferable between different simulation environments and the real world.
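For illustration, the thirteen-dimensional state described above can be assembled as in this hypothetical helper, assuming the joint angles and Cartesian positions are already available from the simulator:

```python
import numpy as np

def build_state(joint_angles, gripper_pos, goal_pos):
    """Assemble the 13-D state: 7 joint angles, the gripper position
    relative to the goal (3-D), and the goal position itself (3-D)."""
    joint_angles = np.asarray(joint_angles, dtype=float)
    gripper_pos = np.asarray(gripper_pos, dtype=float)
    goal_pos = np.asarray(goal_pos, dtype=float)
    assert joint_angles.shape == (7,) and gripper_pos.shape == (3,)
    # relative position keeps the state free of environment dynamics
    return np.concatenate([joint_angles, goal_pos - gripper_pos, goal_pos])
```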

IV-A Sawyer Reacher

Figure 4: Simulation environment in Gazebo for the Sawyer reaching task. The objective for the STM is to synthesize a sequence of joint positions that steer the gripper from a random start configuration to a given goal location (white sphere).

In the first experiment, we evaluate the STM on a basic servoing task: the STM is tasked to synthesize state sequences that move the gripper from a random initial joint configuration to a randomly sampled goal position, given in task space. In simulation, we collect 45 demonstration trajectories ranging from 50 to 70 states.

IV-B Sawyer Pick-and-Place

Figure 5: Gazebo environment for the pick-and-place world. Sawyer is tasked to grasp a block from a random location and place it to a designated goal location in Cartesian space.

In the second experiment, we evaluate the STM on a pick-and-place task: the STM is tasked to synthesize state sequences that control the gripper from a random initial joint configuration to a randomly sampled goal position. In simulation, we collect 150 demonstration trajectories ranging from 166 to 170 states.

Figure 6: 3D gripper positions from 45 demonstration trajectories on the Sawyer reaching environment. Every demonstration starts in the same initial joint configuration and ends in a random goal position (dots).

IV-C Sawyer Block Stacking

The block stacking task presents a more challenging environment where accuracy in placing blocks is key. We ask the robot to place two blocks on top of each other at a designated position on the table. We collect 150 demonstrations in the Gazebo simulator of Sawyer picking up blocks from random positions and placing them at random goal locations.

We see block stacking as a more complex version of pick-and-place, where the STM needs to learn to place blocks at different heights. For block stacking, we use the same network architecture and training process as pick-and-place, while training it from demonstrations under different target settings, i.e. random 3D positions.

IV-D High-level Control

In the next experiment, we evaluate how well our model can be used to generate trajectories given high-level task descriptions. We ask the robot to draw a circle of a defined radius and train an STM from a set of 10 circular motion sequences as demonstrations, covering circles of varying radii.

Defining such behavior for a traditional motion planning setup would require specifying the waypoints on the circle such that the inverse kinematics (IK) solver can find the joint angle transitions to have the gripper servo between them. Instead, a deep learning model can learn from demonstrations the connection between high-level goals (i.e. the given radius) and the desired behavior (i.e. circle-drawing trajectories).

IV-E Open-loop Control with Inverse Dynamics Model

Figure 7: Graphical model of our open-loop control approach combining learned models for state transitions (STM) and inverse dynamics (IDM). The action is computed by the IDM given a history of recent states from the environment, the desired next state from the STM, and the previous actions.

We trained an inverse dynamics model (IDM) to accomplish torque control on the real robot. Combining an STM and IDM has the intriguing advantage of transferring behaviors from simulation to reality: the STM, serving as joint position motion planner, remains unchanged between both environments. The IDM, on the other hand, can be trained separately on each environment and robot configuration as it is the only module that depends on the environment dynamics. Such decoupling of both models has the potential for a higher sample efficiency compared to the simulation-to-real transfer of entire policy networks, as commonly done in traditional deep reinforcement learning approaches that train entirely in simulation [25, 26, 27].
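The decoupled STM/IDM pipeline described above can be sketched as a simple open-loop control loop. The callables `stm` and `idm` are stand-ins for the two learned models; names and signatures are illustrative:

```python
import numpy as np

def open_loop_rollout(stm, idm, s0, goal, steps=50):
    """Open-loop control sketch: the STM proposes the next joint-position
    state, the IDM turns the (current, desired) state pair into a torque
    command. No sensor feedback enters the loop."""
    states, torques = [s0], []
    for _ in range(steps):
        desired = stm(states[-1], goal)   # next state proposed by the STM
        tau = idm(states[-1], desired)    # torque command from the IDM
        torques.append(tau)
        states.append(desired)            # open loop: assume tracking succeeds
    return states, torques
```

Because only the IDM depends on the environment dynamics, swapping the physics engine (or moving to the real robot) requires retraining only `idm` while `stm` stays fixed, which is the decoupling argued for in the text.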

The IDM is a three-layer MDN (each layer having 256 hidden units) that parameterizes a Gaussian mixture model consisting of fifteen normal distributions per action dimension (seven dimensions for the joint actuators). Through our experiments, we found an IDM conditioned on the current state, plus the two previous states and actions (cf. Fig. 7), to achieve the highest accuracy in steering between consecutive joint positions via torque control.
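The IDM input described above — the current state, the two previous states and actions, and the STM's desired next state — can be assembled as in this hypothetical sketch (the history length `h=2` mirrors the configuration reported in the text):

```python
import numpy as np

def idm_input(states, actions, desired_state, h=2):
    """Concatenate the current state, the h previous states and actions,
    and the STM's desired next state into one IDM input vector.
    states/actions: lists of past state/action arrays (newest last)."""
    hist_s = np.concatenate(states[-(h + 1):])   # current + h previous states
    hist_a = np.concatenate(actions[-h:])        # h previous actions
    return np.concatenate([hist_s, hist_a, desired_state])
```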

IV-E1 Sawyer Reacher (Open-loop Control with STM and IDM)

Figure 8: Tracking performance of the IDM to generate joint torque commands that follow the state sequence synthesized by the STM on the Sawyer reaching task. The IDM produces actions at a higher frequency (10 times as fast) than the STM. Shown are the Cartesian coordinates of the gripper where the joint angles from the STM are played back through forward kinematics and the IDM is deployed in Gazebo with the ODE physics engine.
Figure 9: Gripper position trajectories (lines) from forward kinematics rollouts on state sequences generated by the STM. The robot is tasked to servo its gripper to four goal positions (dots).

Our first Sawyer environment is a simple reaching testbed where the robot is tasked to servo the gripper to one of four desired goal locations. We train the IDM in Gazebo under the Open Dynamics Engine (ODE) [28] physics simulation. We test under both the ODE and the Bullet [29] physics engines.

Figure 10: Plot of gripper position trajectories in the ODE (left) and Bullet (right) physics simulation for the three pick-and-place tasks from open-loop control with a learned STM and IDM in combination. Dots represent the last states of the rollouts. The IDM, generating torque controls to follow the STM’s joint positions, has been trained solely on the ODE physics simulation and exhibits poor tracking performance under different simulation conditions.

IV-E2 Sawyer Pick-and-Place (Open-loop Control with STM and IDM)

The goal is to place a block at three designated goal positions. The robot always starts from the same joint configuration.

In Fig. 10, we visualize the position of the end-effector while the IDM is tracking the trajectory synthesized by the STM in the ODE physics simulation.

V Results

In Table I, we compare the performance of our auto-conditioned recurrent MDN against several other STM architectures as baselines: a plain LSTM, an auto-conditioned LSTM and a recurrent MDN. Performance is measured as success rates for the experiments described in Sec. IV over 20 rollouts. Our method outperforms other baseline models on all of the tasks, especially on pick-and-place and block stacking, where more complex trajectories need to be synthesized.

For reacher, we trained our proposed STM on an Nvidia GeForce GTX 1070 graphics card. For comparison, we use a three-layer LSTM with 64 hidden units per layer for all of the models, and three Gaussians in the MDN-based STMs, i.e. the vanilla recurrent MDN and the auto-conditioned recurrent MDN. We observed that even on such a simple task, the MDN yielded more robust behavior: the STM generated reasonable trajectories for targets that were outside the viable range of Sawyer, whereas STMs without such a stochastic model failed to find any trajectories that came close to such goals.

For pick-and-place, our proposed STM trained in ca. two hours on the Nvidia GeForce GTX 1070. We use a three-layer LSTM with 128 hidden units per layer for all of the models and 20 Gaussians in the MDN-based models. We observed that auto-conditioning significantly reduces the accumulation of error, which is a common problem in generating trajectories with RNNs.

To investigate the benefits of a deep learning model as a high-level controller, we evaluate the model's ability to associate high-level commands with demonstration trajectories, as described in Sec. IV-D. As shown in Fig. 3, our STM is able to learn from a few demonstrations the connection between the radius and the resulting trajectory, yielding accurate circular motions.

                     Reacher   Pick-and-place   Stacking
LSTM                 80%       0%               0%
a.c. LSTM            90%       50%              25%
Recurrent MDN        90%       60%              30%
a.c. Recurrent MDN   100%      100%             80%

Table I: Success rates for the experiments described in Sec. IV over 20 roll-outs with varying architectures and training procedures. All models have been trained on the same demonstrations. The reacher task is successful if the gripper is within a fixed distance threshold of the goal position by the end of the trajectory. The STMs for reacher are evaluated in simulation; pick-and-place and stacking success rates come from real-robot experiments.

VI Conclusion

In this work, we present a recurrent neural network architecture and training procedure that enables the efficient generation of complex joint position trajectories. Our experiments have shown that our STM can generalize to unseen tasks and is able to learn the underlying task specification, which enables it to follow high-level instructions. In combination with a learned inverse dynamics model, we have shown a fully trainable motion planning pipeline on a real robot that combines the state transition model, as planning module, with an IDM, as position controller, to generate joint torque commands that track the synthesized trajectories.

Future work is directed towards extending our work with a deeper connection to the inverse dynamics model. We plan to close the loop between the STM and IDM such that the STM can be re-evaluated after the trajectory has been executed for one or more time steps. Such an approach resembles model-predictive control and would allow the STM and IDM to be reactive to changes in the environment, such as dynamic obstacles, that would require re-planning.


  • [1] A. H. Qureshi, M. J. Bency, and M. C. Yip, “Motion Planning Networks,” ArXiv e-prints, June 2018.
  • [2] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” CoRR, vol. abs/1105.1186, 2011.
  • [3] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Batch informed trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 3067–3074.
  • [4] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed.    Morgan-Kaufmann, 1989, pp. 305–313.
  • [5] J. Ho and S. Ermon, “Generative Adversarial Imitation Learning,” in NIPS, 2016, pp. 4565–4573.
  • [6] K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. J. Lim, “Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets,” in Advances in Neural Information Processing Systems, 2017, pp. 1235–1245.
  • [7] Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” in Advances in Neural Information Processing Systems, 2017, pp. 3812–3822.
  • [8] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov 1997.
  • [9] X. Wang, S. Takaki, and J. Yamagishi, “An autoregressive recurrent mixture density network for parametric speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.    IEEE, 2017, pp. 4895–4899.
  • [10] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.    IEEE, 2013, pp. 6645–6649.
  • [11] A. Graves, “Generating sequences with recurrent neural networks,” CoRR, vol. abs/1308.0850, 2013.
  • [12] X. B. Peng, G. Berseth, K. Yin, and M. van de Panne, “Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Transactions on Graphics (Proc. SIGGRAPH 2017), vol. 36, no. 4, 2017.
  • [13] Y. Zhou, Z. Li, S. Xiao, C. He, Z. Huang, and H. Li, “Auto-conditioned recurrent networks for extended complex human motion synthesis,” in International Conference on Learning Representations, 2018.
  • [14] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” arXiv preprint arXiv:1804.02717, 2018.
  • [15] K. Li, X. Zhao, J. Bian, and M. Tan, “Sequential learning for multimodal 3d human activity recognition with long-short term memory,” in Mechatronics and Automation (ICMA), 2017 IEEE International Conference on.    IEEE, 2017, pp. 1556–1561.
  • [16] L. Sun, Z. Yan, S. Molina Mellado, M. Hanheide, T. Duckett, et al., “3dof pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data,” in International Conference on Robotics and Automation (ICRA) 2018.    IEEE, 2018.
  • [17] A. M. Lamb, A. G. A. P. GOYAL, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” in Advances In Neural Information Processing Systems, 2016, pp. 4601–4609.
  • [18] A. Venkatraman, M. Hebert, and J. A. Bagnell, “Improving multi-step prediction of learned time series models,” in AAAI Conference on Artificial Intelligence, 2015.
  • [19] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 627–635.
  • [20] C. M. Bishop, “Mixture density networks,” Aston University, Tech. Rep., 1994.
  • [21] M. Schuster, “Better generative models for sequential data problems: Bidirectional recurrent mixture density networks,” in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller, Eds.    MIT Press, 2000, pp. 589–595.
  • [22] R. Rahmatizadeh, P. Abolghasemi, and L. Bölöni, “Learning manipulation trajectories using recurrent neural networks,” CoRR, vol. abs/1603.03833, 2016.
  • [23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [24] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in International Conference on Intelligent Robots and Systems (IROS), vol. 3.    IEEE/RSJ, 2004, pp. 2149–2154.
  • [25] S. James, A. J. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” ser. Proceedings of Machine Learning Research, vol. 78.    PMLR, 2017.
  • [26] F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” Robotics: Science and Systems Conference (R:SS), 2017.
  • [27] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018.
  • [28] E. Drumwright, J. Hsu, N. Koenig, and D. Shell, “Extending open dynamics engine for robotics simulation,” in Simulation, Modeling, and Programming for Autonomous Robots, N. Ando, S. Balakirsky, T. Hemker, M. Reggiani, and O. von Stryk, Eds.    Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 38–50.
  • [29] E. Coumans, “Bullet physics simulation,” in ACM SIGGRAPH 2015 Courses, ser. SIGGRAPH ’15.    New York, NY, USA: ACM, 2015.