Defining motion plans for robotic manipulators is a challenging task when the behavior specification cannot simply be expressed as a sequence of waypoints the end-effector has to follow while opening and closing the gripper. Often, laborious hand-engineering is required to compute such task-space control inputs to the motion planner in order to generate the trajectory in joint position space. On the other hand, Learning from Demonstration (LfD) enables machine learning models to be trained from expert behavior, without a formal program that encodes the motion plan.
In this work, we investigate how a state-transition model can be learned from a few demonstrations to generate complex motion plans with high-level task inputs. As we show in our experiments, our model is able to synthesize circular trajectories with varying radii, generalizing from a sparse set of demonstration trajectories.
Tapping into the potential of deep learning models for motion planning has been reported to yield speed improvements of two orders of magnitude over conventional planning algorithms, such as asymptotically optimal Rapidly-exploring Random Trees (RRT*) and Batch Informed Trees (BIT*).
In this work, we propose a deep learning architecture and training methodology that can efficiently learn complex motion plans for a seven degrees-of-freedom robot arm in joint position space. Based on a few demonstrations, our approach can efficiently learn state transitions for various trajectories while generalizing to new tasks.
Our contributions are as follows:
We present a training procedure and stochastic recurrent neural network architecture that can efficiently learn complex motions from demonstrations in joint position space.
In combination with a learned inverse dynamics model, we show real-robot results on an end-to-end learnable open-loop control pipeline.
We provide extensive real-robot experiments that demonstrate the ability of our STM to generalize to tasks that it has not been trained on.
The generalizability allows our model to accomplish complex behaviors from high-level instructions which would traditionally require laborious hand-engineering and sequencing of trajectories from motion planners.
II. Related Work
Learning from Demonstration (LfD), also referred to as imitation learning, has been widely studied in the robotics research community. Behavioral cloning approaches use supervised learning to train a model to imitate state-action sequences from an expert and have led to early successes in robot learning.
Given demonstrations from an expert policy that follows an unknown reward function defined over the set of states, inverse reinforcement learning (IRL) and apprenticeship learning approaches attempt to recover the expert's reward function so that a separate policy can be trained in a different context via reinforcement learning using that reward function.
Inspired by state-of-the-art deep learning techniques for computer vision, such as generative adversarial networks, generative adversarial imitation learning (GAIL) [5, 6, 7] approaches learn a policy via reinforcement learning that aims to confound a separate discriminator network, which classifies whether a roll-out stemmed from the policy or from the expert.
In this paper we study trajectory generation from a supervised learning perspective, where we are given a set of expert trajectories represented by sequences of states. Borrowing architectures and training methodologies from state-of-the-art sequence learning techniques, our work addresses a fundamental issue in behavioral cloning: the error between the expert and the generated behavior compounds over the course of the trajectory.
Long short-term memory networks (LSTM) are widely used in time series prediction, especially in speech synthesis and speech recognition. Wang et al. use auto-regressive recurrent mixture density networks for parametric speech synthesis. Graves et al. use LSTMs to recognize speech, generate text, and synthesize handwriting.
Synthesizing complex motions has been a long-lasting interest to the computer graphics community [12, 13]. Peng et al.  apply reinforcement learning to synthesize a motion sequence in a physics-based environment. Li et al.  use recurrent neural networks (RNN) to learn sequences of multimodal 3D human motions. Sun et al.  apply RNNs to predict a 3-DOF pedestrian trajectory using long-term data from an autonomous mobile robot deployment.
In our work, we leverage recent advancements in recurrent network training to learn sequences of robot states. Auto-conditioned recurrent neural networks are used to synthesize complex trajectories over large time spans.
Several approaches have been proposed to improve the training of RNNs, e.g. Professor Forcing, Data as Demonstrator (DaD), auto-conditioning and Dataset Aggregation (DAgger). At scheduled intervals in the training procedure, these methods feed the RNN's previous outputs back into the RNN as input to the following cells to improve prediction performance. This methodology makes the RNN more robust to deviations from expert states while the RNN is unrolled over longer time spans without training inputs; such deviations would otherwise cause the error to accumulate over time.
While recurrent neural networks have been shown to learn and predict time series data over thousands of time steps, a roadblock toward their application in a robotics context is the lack of uncertainty in the state representation. Besides the stochasticity of the real world, the trajectory generation model also needs to account for multiple possible solutions when finding trajectories. A commonly used machine learning model to capture multimodal distributions is the Mixture Density Network (MDN), which represents multivariate Gaussian Mixture Models (GMM).
Combining an RNN with an MDN was first shown by Schuster, where the model is used to learn sequential data while capturing its stochasticity.
Similar to Rahmatizadeh et al., we combine an LSTM with an MDN to architect the state transition model, but perform trajectory synthesis in the higher-dimensional joint position space, in contrast to Cartesian space. Thanks to auto-conditioning, our method can generate trajectories from perfect demonstrations alone, since during training the STM automatically learns to correct from states deviating from the demonstrations, whereas the method of Rahmatizadeh et al. uses explicit demonstrations that recover from undesired states back to the desired motion. Furthermore, we present results on training a separate inverse dynamics model that serves as a torque controller, estimating the actuator commands required to steer between the joint positions synthesized by the STM.
III. Our Approach
The STM is trained via supervised learning on demonstrations from a motion planner and predicts the sequence of states given the start state and the desired goal state. We model the state transition model as an LSTM combined with a mixture density network (MDN) to capture the probability distribution over future states (see Fig. 2).
The MDN models a multivariate mixture of Gaussians by estimating the distribution over next states as a linear combination of Gaussian kernels:
$$p(s_{t+1} \mid s_t) = \sum_{k=1}^{K} \pi_k(s_t)\,\phi_k(s_{t+1} \mid s_t),$$
where $K$ is the number of Gaussians modelled by the MDN, $\pi_k$ is the learned mixing coefficient and $\phi_k$ is the $k$-th Gaussian kernel of the form
$$\phi_k(s_{t+1} \mid s_t) = \frac{1}{(2\pi)^{d/2}\,\sigma_k(s_t)^{d}} \exp\!\left(-\frac{\lVert s_{t+1} - \mu_k(s_t)\rVert^2}{2\,\sigma_k(s_t)^2}\right).$$
The kernel mean $\mu_k$ and its standard deviation $\sigma_k$ are learned by the model.
Given the ground-truth state pair $(s_t, s_{t+1})$, the MDN loss is defined as the negative log-likelihood:
$$\mathcal{L} = -\log \sum_{k=1}^{K} \pi_k(s_t)\,\phi_k(s_{t+1} \mid s_t).$$
We update the MDN's weights to minimize this loss via the Adam optimizer.
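As a concrete sketch, the loss above can be written in a few lines of numpy. The function below (our naming, not the paper's code) assumes isotropic kernels with one learned standard deviation per component, matching the formulation in the text:

```python
import numpy as np

def mdn_nll(pi, mu, sigma, target):
    """Negative log-likelihood of `target` (D,) under an isotropic Gaussian
    mixture with mixing coefficients `pi` (K,), means `mu` (K, D) and
    per-kernel standard deviations `sigma` (K,)."""
    d = target.shape[0]
    diff = target - mu                       # (K, D) broadcast differences
    sq = np.sum(diff ** 2, axis=1)           # squared distance to each mean
    # log of each isotropic Gaussian kernel phi_k(target)
    log_phi = -0.5 * sq / sigma ** 2 - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)
    # log-sum-exp over kernels for numerical stability
    log_mix = np.log(pi) + log_phi
    m = np.max(log_mix)
    return -(m + np.log(np.sum(np.exp(log_mix - m))))
```

In a training loop, the gradient of this quantity with respect to the network outputs $(\pi_k, \mu_k, \sigma_k)$ would be handled by the autodiff framework.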
As in prior work, we combine an LSTM with an MDN to capture the multi-modal nature of trajectory generation since, in many cases, multiple possible solutions connect start and goal states. We combine the recurrent MDN with auto-conditioning, a training schedule that, at scheduled intervals during training, feeds the LSTM's output back into the cell computing the next state for a number of consecutive time steps (see Fig. 2). This enables the network to correct itself from states that deviate from the demonstrations: by learning from inputs where the network diverges from expert behavior, we capture the distribution of inputs that would otherwise cause a compounding error when the STM is rolled out in the real world, where demonstrations are no longer available as inputs. This technique greatly improves performance, as we report in Sec. V.
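The alternating schedule can be sketched in plain Python. Here `step` stands in for one forward pass of the recurrent MDN (treated as a black-box one-step predictor), and the phase lengths are illustrative, not the paper's actual values:

```python
def autoconditioned_unroll(step, demo, gt_len=5, cond_len=5):
    """Unroll a one-step state-transition model over a demonstration.

    `step(state) -> next_state` stands in for one forward pass of the
    recurrent MDN. Inputs alternate between `gt_len` ground-truth states
    from `demo` and `cond_len` steps in which the model's own previous
    prediction is fed back in, as in auto-conditioned RNN training.
    Returns the list of predicted next states."""
    preds, prev_pred = [], None
    for t, gt_state in enumerate(demo[:-1]):
        phase = t % (gt_len + cond_len)
        # conditioning phase: feed the model's own output back in
        inp = prev_pred if phase >= gt_len and prev_pred is not None else gt_state
        prev_pred = step(inp)
        preds.append(prev_pred)
    return preds
```

Computing the loss against the ground-truth next states from `demo` during the conditioning phase is what teaches the model to recover from its own drift.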
In our experiments we focus on real-robot applications of our proposed STM architecture and training procedure. We rely on simulators to train the state transition model for the Sawyer robot, a seven degrees-of-freedom robot arm, equipped with a parallel gripper as end-effector.
We collected demonstration trajectories, i.e. sequences of states, in the Gazebo simulator using the inverse kinematics solver provided by Rethink Robotics for the Sawyer robot. Both the start and goal configuration of each demonstration trajectory are perturbed by uniform noise to cover a larger state space, which improves the generalizability of our method.
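A minimal sketch of this perturbation step, assuming the noise is applied per joint angle; the noise magnitude is an illustrative assumption, not the paper's actual value:

```python
import random

def perturb_configuration(joint_angles, noise=0.05):
    """Perturb a start or goal joint configuration with uniform noise
    (here +/- 0.05 rad, an assumed scale) so the collected demonstrations
    cover a wider region of the state space."""
    return [q + random.uniform(-noise, noise) for q in joint_angles]
```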
In the following experiments on the Sawyer robot, we model the state space as follows: the thirteen-dimensional state $s_t$ at time step $t$ is represented by the seven joint angles $\theta_t^{(1)}, \dots, \theta_t^{(7)}$, plus the current gripper position relative to the goal, $\Delta_t$, and the time-independent goal position $g$ in Cartesian coordinates:
$$s_t = \left(\theta_t^{(1)}, \dots, \theta_t^{(7)},\; \Delta_t,\; g\right).$$
In our definition, the state does not rely on the environment dynamics. This assumption is crucial for the STM to be transferable between different simulation environments and the real world.
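A minimal sketch of assembling this state vector (function and argument names are ours):

```python
import numpy as np

def make_state(joint_angles, gripper_pos, goal_pos):
    """Thirteen-dimensional state: seven joint angles, the gripper position
    relative to the goal, and the time-independent goal position."""
    joint_angles = np.asarray(joint_angles, dtype=float)
    goal = np.asarray(goal_pos, dtype=float)
    rel = np.asarray(gripper_pos, dtype=float) - goal   # relative position
    state = np.concatenate([joint_angles, rel, goal])
    assert state.shape == (13,), "7 joints + 3 relative + 3 goal"
    return state
```

Because the vector contains only kinematic quantities, it carries no information about the environment dynamics, consistent with the transferability argument above.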
IV-A. Sawyer Reacher
In the first experiment, we evaluate the STM on a basic servoing task: the STM is tasked to synthesize state sequences that move the gripper from a random initial joint configuration to a randomly sampled goal position, given in task space. In simulation, we collect 45 demonstration trajectories ranging from 50 to 70 states in length.
IV-B. Sawyer Pick-and-Place
In the second experiment, we evaluate the STM on a pick-and-place task: the STM is tasked to synthesize state sequences that control the gripper from a random initial joint configuration to a randomly sampled goal position. In simulation, we collect 150 demonstration trajectories ranging from 166 to 170 states in length.
IV-C. Sawyer Block Stacking
The block stacking task presents a more challenging environment where accuracy in placing blocks is key. We ask the robot to place two blocks on top of each other at a designated position on the table. We collect 150 demonstrations in the Gazebo simulator of Sawyer picking up blocks from random positions and placing them at random goal locations.
We see block stacking as a more complex version of pick-and-place, where the STM needs to learn to place blocks at different heights. For block stacking, we use the same network architecture and training process as pick-and-place, while training it from demonstrations under different target settings, i.e. random 3D positions.
IV-D. High-level Control
In the next experiment, we evaluate how well our model can be used to generate trajectories given high-level task descriptions. We ask the robot to draw a circle of a defined radius and train an STM from a set of 10 circular motion sequences as demonstrations, covering circles of varying radii.
Defining such behavior for a traditional motion planning setup would require defining the waypoints on the circle such that the inverse kinematics (IK) solver can find the joint angle transitions to have the gripper servo between them. Instead, a deep learning model could learn from demonstrations the connection between high-level goals (i.e. the given radius) and the desired behavior (i.e. circle-drawing trajectories).
IV-E. Open-loop Control with Inverse Dynamics Model
We trained an inverse dynamics model (IDM) to accomplish torque control on the real robot. Combining an STM and IDM has the intriguing advantage of transferring behaviors from simulation to reality: the STM, serving as joint position motion planner, remains unchanged between both environments. The IDM, on the other hand, can be trained separately on each environment and robot configuration as it is the only module that depends on the environment dynamics. Such decoupling of both models has the potential for a higher sample efficiency compared to the simulation-to-real transfer of entire policy networks, as commonly done in traditional deep reinforcement learning approaches that train entirely in simulation [25, 26, 27].
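The decoupled pipeline can be sketched as the following open-loop rollout; `stm_step`, `idm`, and `robot` are placeholders for the learned models and the hardware interface, not the paper's actual API:

```python
def open_loop_rollout(stm_step, idm, robot, start_state, goal, horizon):
    """Open-loop control: the STM synthesizes the next joint-position state,
    the IDM maps each transition to a torque command, and the robot executes
    it without feedback into the planner."""
    state = start_state
    torques = []
    for _ in range(horizon):
        next_state = stm_step(state, goal)   # planned joint positions
        tau = idm(state, next_state)         # torques to reach them
        robot.apply(tau)
        torques.append(tau)
        state = next_state                   # open loop: no sensing
    return torques
```

Only `idm` depends on the environment dynamics, so transferring to a new environment requires retraining the IDM alone while the STM stays fixed.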
The IDM is a three-layer MDN (each layer having 256 hidden units) that parameterizes a Gaussian mixture model consisting of fifteen normal distributions per action dimension (seven dimensions for the joint actuators). Through our experiments, we found an IDM conditioned on the current state plus the two previous states and actions (cf. Fig. 7) to achieve the highest accuracy in steering between consecutive joint positions via torque control.
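As a sketch of this conditioning, the IDM input at time t can be assembled from the current state, the desired next state, and a history of two previous states and actions; the exact ordering and concatenation are our assumptions:

```python
import numpy as np

def idm_input(states, actions, t):
    """Assemble the IDM input at time t: current state s_t, desired next
    state s_{t+1}, plus the two previous states and actions (history
    length 2). States are 13-D, actions 7-D (one torque per joint)."""
    feats = [states[t], states[t + 1],       # current and target state
             states[t - 1], states[t - 2],   # two previous states
             actions[t - 1], actions[t - 2]] # two previous actions
    return np.concatenate(feats)
```

The resulting vector would feed the three-layer MDN, whose mixture output is sampled (or its most likely component taken) to obtain the torque command.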
IV-E1. Sawyer Reacher (Open-loop Control with STM and IDM)
Our first Sawyer environment is a simple reaching testbed where the robot is tasked to servo the gripper to one of four desired goal locations. We train the IDM in Gazebo under the Open Dynamics Engine (ODE) physics simulation. We test in ODE and under the Bullet physics engine.
IV-E2. Sawyer Pick-and-Place (Open-loop Control with STM and IDM)
The goal is to place a block at three designated goal positions. The robot always starts from the same joint configuration.
In Fig. 10, we visualize the position of the end-effector while the IDM is tracking the trajectory synthesized by the STM in the ODE physics simulation.
In Table I, we compare the performance of our auto-conditioned recurrent MDN against several other STM architectures as baselines: a plain LSTM, an auto-conditioned LSTM and a recurrent MDN. Performance is measured as success rates for the experiments described in Sec. IV over 20 rollouts. Our method outperforms other baseline models on all of the tasks, especially on pick-and-place and block stacking, where more complex trajectories need to be synthesized.
For reacher, our proposed STM was trained on an Nvidia GeForce GTX 1070 graphics card. For comparison, we use a three-layer LSTM with 64 hidden units per layer for all of the models, and three Gaussians in the MDN-based STMs, i.e. the vanilla recurrent MDN and the auto-conditioned recurrent MDN. We observed that even on such a simple task, the MDN yielded more robust behavior: the STM generated reasonable trajectories for targets that were outside Sawyer's viable range, while STMs without such a stochastic model failed to find any trajectories that came close to such goals.
For pick-and-place, our proposed STM trained in ca. two hours on the Nvidia GeForce GTX 1070. We use a three-layer LSTM with 128 hidden units per layer for all of the models and 20 Gaussians in the MDN-based models. We observed that auto-conditioning significantly reduces the accumulation of error, a common problem in generating trajectories using RNNs.
To investigate the benefits of a deep learning model as a high-level controller, we evaluate its ability to associate high-level commands with demonstration trajectories, as described in Sec. IV-D. As shown in Fig. 3, our STM is able to learn from a few demonstrations the connection between the commanded radius and the resulting trajectory, yielding circular motions of accurate radius.
Table I (excerpt): our auto-conditioned recurrent MDN achieves success rates of 100%, 100%, and 80% across the evaluated tasks.
In this work, we present a recurrent neural network architecture and training procedure that enables the efficient generation of complex joint position trajectories. Our experiments have shown that our STM can generalize to unseen tasks and is able to learn the underlying task specification, which enables it to follow high-level instructions. In combination with a learned inverse dynamics model, we have shown a fully trainable motion planning pipeline on a real robot that combines the state transition model, as planning module, with an IDM, as position controller, to generate joint torque commands that track the synthesized trajectories.
Future work is directed towards extending our work with a deeper connection to the inverse dynamics model. We plan to close the loop between the STM and IDM such that the STM can be re-evaluated after the trajectory has been executed for one or more time steps. Such an approach resembles model-predictive control and would allow the STM and IDM to react to changes in the environment, such as dynamic obstacles, that would require re-planning.
-  A. H. Qureshi, M. J. Bency, and M. C. Yip, “Motion Planning Networks,” ArXiv e-prints, June 2018.
-  S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” CoRR, vol. abs/1105.1186, 2011. [Online]. Available: http://arxiv.org/abs/1105.1186
-  J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Batch informed trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 3067–3074.
-  D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed. Morgan-Kaufmann, 1989, pp. 305–313.
-  J. Ho and S. Ermon, “Generative Adversarial Imitation Learning,” in NIPS, 2016, pp. 4565–4573.
-  K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. J. Lim, “Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets,” in Advances in Neural Information Processing Systems, 2017, pp. 1235–1245.
-  Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” in Advances in Neural Information Processing Systems, 2017, pp. 3812–3822.
-  S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov 1997.
-  X. Wang, S. Takaki, and J. Yamagishi, “An autoregressive recurrent mixture density network for parametric speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4895–4899.
-  A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
-  A. Graves, “Generating sequences with recurrent neural networks,” CoRR, vol. abs/1308.0850, 2013.
-  X. B. Peng, G. Berseth, K. Yin, and M. van de Panne, “Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Transactions on Graphics (Proc. SIGGRAPH 2017), vol. 36, no. 4, 2017.
-  Y. Zhou, Z. Li, S. Xiao, C. He, Z. Huang, and H. Li, “Auto-conditioned recurrent networks for extended complex human motion synthesis,” in International Conference on Learning Representations, 2018.
-  X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” arXiv preprint arXiv:1804.02717, 2018.
-  K. Li, X. Zhao, J. Bian, and M. Tan, “Sequential learning for multimodal 3d human activity recognition with long-short term memory,” in Mechatronics and Automation (ICMA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1556–1561.
-  L. Sun, Z. Yan, S. Molina Mellado, M. Hanheide, T. Duckett, et al., “3dof pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data,” in International Conference on Robotics and Automation (ICRA) 2018. IEEE, 2018.
-  A. M. Lamb, A. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4601–4609.
-  A. Venkatraman, M. Hebert, and J. A. Bagnell, “Improving multi-step prediction of learned time series models,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
-  S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 627–635.
-  C. M. Bishop, “Mixture density networks,” Aston University, Tech. Rep., 1994.
-  M. Schuster, “Better generative models for sequential data problems: Bidirectional recurrent mixture density networks,” in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller, Eds. MIT Press, 2000, pp. 589–595.
-  R. Rahmatizadeh, P. Abolghasemi, and L. Bölöni, “Learning manipulation trajectories using recurrent neural networks,” CoRR, vol. abs/1603.03833, 2016.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in International Conference on Intelligent Robots and Systems (IROS), vol. 3. IEEE/RSJ, 2004, pp. 2149–2154.
-  S. James, A. J. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” ser. Proceedings of Machine Learning Research, vol. 78. PMLR, 2017.
-  F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” Robotics: Science and Systems Conference (R:SS), 2017.
-  J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018.
-  E. Drumwright, J. Hsu, N. Koenig, and D. Shell, “Extending open dynamics engine for robotics simulation,” in Simulation, Modeling, and Programming for Autonomous Robots, N. Ando, S. Balakirsky, T. Hemker, M. Reggiani, and O. von Stryk, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 38–50.
-  E. Coumans, “Bullet physics simulation,” in ACM SIGGRAPH 2015 Courses, ser. SIGGRAPH ’15. New York, NY, USA: ACM, 2015.