I Introduction
Defining motion plans for robotic manipulators is a challenging task when the behavior specification cannot simply be expressed as a sequence of waypoints the end-effector has to follow while opening and closing the gripper. Often, laborious hand-engineering is required to compute such task-space control inputs to the motion planner in order to generate the trajectory in joint position space. Learning from Demonstration (LfD), on the other hand, enables machine learning models to be trained from expert behavior without a formal program that encodes the motion plan.
In this work, we investigate how a state-transition model can be learned from a few demonstrations to generate complex motion plans from high-level task inputs. As we show in our experiments, our model is able to synthesize circular trajectories with varying radii, generalizing from a sparse set of demonstration trajectories.
Tapping into the potential of deep learning models for motion planning has been reported to yield two orders of magnitude in speed improvements over conventional planning algorithms [1], such as optimal Rapidly-exploring Random Trees (RRT*) [2] and Batch Informed Trees (BIT*) [3].
In this work, we propose a deep learning architecture and training methodology that can efficiently learn complex motion plans in joint position space for a seven degrees-of-freedom robot arm. Based on a few demonstrations, our approach can efficiently learn state transitions for various trajectories while generalizing to new tasks.
Our contributions are as follows:

We present a training procedure and stochastic recurrent neural network architecture that can efficiently learn complex motions from demonstrations in joint position space.

In combination with a learned inverse dynamics model, we show real-robot results on an end-to-end learnable open-loop control pipeline.

We provide extensive real-robot experiments that demonstrate the ability of our state transition model (STM) to generalize to tasks it has not been trained on.

The generalizability allows our model to accomplish complex behaviors from high-level instructions which would traditionally require laborious hand-engineering and sequencing of trajectories from motion planners.
II Related Work
Learning from Demonstration (LfD), also referred to as imitation learning, has been widely studied in the robotics research community.
Behavioral cloning approaches use supervised learning to train a model to imitate state-action sequences from an expert and have led to early successes in robot learning [4]. Given demonstrations from an expert policy that follows an unknown reward function $r: \mathcal{S} \to \mathbb{R}$, where $\mathcal{S}$ denotes the set of states, inverse reinforcement learning (IRL) and apprenticeship learning approaches attempt to recover the expert's reward function such that a separate policy can be trained in a different context via reinforcement learning given that reward function.
Inspired by state-of-the-art deep learning techniques for computer vision, such as generative adversarial networks, generative adversarial imitation learning (GAIL) [5, 6, 7] approaches learn a policy via reinforcement learning that aims to confound a separate discriminator network, which classifies whether a rollout stemmed from the policy or from the expert.
In this paper we study trajectory generation from a supervised learning perspective, where we are given a set of expert trajectories represented as sequences of states. Borrowing architectures and training methodologies from state-of-the-art sequence learning techniques, our work addresses a fundamental issue in behavioral cloning: the compounding error between the expert and the generated behavior over the course of the trajectory.
Long short-term memory (LSTM) networks [8] are widely used in time series prediction, especially in speech synthesis and speech recognition. Wang et al. [9] use autoregressive recurrent mixture density networks for parametric speech synthesis. Graves et al. use LSTMs to recognize speech [10] and to generate text and synthesize handwriting [11].
Synthesizing complex motions has been a long-standing interest of the computer graphics community [12, 13]. Peng et al. [14] apply reinforcement learning to synthesize motion sequences in a physics-based environment. Li et al. [15] use recurrent neural networks (RNN) to learn sequences of multimodal 3D human motions. Sun et al. [16] apply RNNs to predict 3DOF pedestrian trajectories using long-term data from an autonomous mobile robot deployment.
In our work, we leverage recent advancements in recurrent network training to learn sequences of robot states. Auto-conditioned recurrent neural networks [13] are used to synthesize complex trajectories over large time spans.
Several approaches have been proposed to improve the training of RNNs, e.g. Professor Forcing [17], Data as Demonstrator (DaD) [18], auto-conditioning [13] and Dataset Aggregation (DAgger) [19]. At scheduled intervals in the training procedure, these methods feed the RNN's previous outputs back into the RNN as inputs to the following cells to improve prediction performance. This methodology makes the RNN more robust to deviations from expert states when it is unrolled over longer time spans without training inputs; such deviations would otherwise cause the error to accumulate over time.
While recurrent neural networks have been shown to learn and predict time series data over thousands of time steps [13], a roadblock to their application in a robotics context is the lack of uncertainty representation in the state. Besides the stochasticity of the real world, the trajectory generation model also needs to account for multiple possible solutions when finding trajectories. A commonly used machine learning model to capture such multimodal distributions is the Mixture Density Network (MDN) [20], which represents multivariate Gaussian Mixture Models (GMM).
Combining an RNN with an MDN was first shown by Schuster [21], where the model learns sequential data while capturing its stochasticity.
Similar to Rahmatizadeh et al. [22], we combine an LSTM with an MDN to architect the state transition model, but perform the trajectory synthesis in the higher-dimensional joint position space, in contrast to Cartesian space. Thanks to auto-conditioning, our method can be trained from perfect demonstrations alone, since during training the STM automatically learns to correct from states deviating from the demonstrations, whereas the method presented in [22] uses explicit demonstrations that recover from undesired states back to the desired motion. Furthermore, we present results on training a separate inverse dynamics model that serves as a torque controller, estimating the actuator control commands required to steer between the joint positions synthesized by the STM.
III Our Approach
The STM is trained via supervised learning on demonstrations from a motion planner and predicts the sequence of states given the start state $s_1$ and the desired goal state $g$. We model the state transition model via an LSTM combined with a mixture density network (MDN) to capture the probability distribution of future states (see Fig. 2). The MDN models a multivariate mixture of Gaussians by estimating the distribution over next states as a linear combination of Gaussian kernels:

$$p(s_{t+1} \mid s_t) = \sum_{k=1}^{K} \pi_k(s_t)\, \phi_k(s_{t+1} \mid s_t)$$
where $K$ is the number of Gaussians modeled by the MDN, $\pi_k(s_t)$ is the learned mixing coefficient and $\phi_k$ is the $k$-th Gaussian kernel of the form

$$\phi_k(s_{t+1} \mid s_t) = \frac{1}{(2\pi)^{d/2}\,\sigma_k(s_t)^{d}} \exp\!\left(-\frac{\lVert s_{t+1} - \mu_k(s_t) \rVert^2}{2\,\sigma_k(s_t)^2}\right)$$

with mean $\mu_k(s_t)$, standard deviation $\sigma_k(s_t)$ and state dimensionality $d$.
Given the ground-truth state pair $(s_t, s_{t+1})$, the MDN loss is defined as the negative log-likelihood:

$$\mathcal{L} = -\log \sum_{k=1}^{K} \pi_k(s_t)\, \phi_k(s_{t+1} \mid s_t)$$
We update the MDN’s weights to minimize the loss via the Adam optimizer [23].
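The loss above can be computed in a few lines. The following is a minimal NumPy illustration, assuming isotropic Gaussian kernels (a common MDN parameterization) and using the log-sum-exp trick for numerical stability; in the actual model, $\pi_k$, $\mu_k$ and $\sigma_k$ would be produced by the LSTM's output layer rather than passed in directly.

```python
import numpy as np

def mdn_nll(pi, mu, sigma, target):
    """Negative log-likelihood of `target` under an isotropic Gaussian
    mixture: pi (K,) mixing coefficients, mu (K, d) component means,
    sigma (K,) per-component standard deviations."""
    K, d = mu.shape
    # log phi_k(target): log density of an isotropic Gaussian per component
    sq_dist = np.sum((target[None, :] - mu) ** 2, axis=1)  # (K,)
    log_phi = (-0.5 * sq_dist / sigma**2
               - d * np.log(sigma)
               - 0.5 * d * np.log(2.0 * np.pi))
    # log sum_k pi_k * phi_k via log-sum-exp for numerical stability
    log_mix = np.log(pi) + log_phi
    m = log_mix.max()
    return -(m + np.log(np.sum(np.exp(log_mix - m))))
```

For a single standard-normal component in two dimensions, the loss at the component mean reduces to $\log(2\pi)$, which is a convenient sanity check.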
As in [22], we combine an LSTM with an MDN to capture the multimodal nature of trajectory generation since, in many cases, there are multiple possible solutions connecting start and goal states. We combine the recurrent MDN with auto-conditioning [13], a training schedule that, every $n$ iterations and for a fixed number of time steps, feeds the LSTM's output back into the cell computing the next state (see Fig. 2). This enables the network to correct itself from states that deviate from the demonstrations: by learning from inputs where the network diverges from expert behavior, we capture the distribution of inputs that would otherwise cause a compounding error when rolling out the STM in the real world, where demonstrations are no longer available as inputs. This technique greatly improves performance, as we report in Sec. V.
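The feedback schedule can be sketched independently of the network. Below is an illustrative NumPy sketch of the unrolling logic, alternating `cond_every` teacher-forced steps with `cond_len` self-fed steps; `step_fn` is a hypothetical stand-in for the trained one-step model, and the interval lengths are placeholder hyperparameters, not values from this paper.

```python
import numpy as np

def autoconditioned_unroll(step_fn, demo, cond_every=5, cond_len=3):
    """Unroll a one-step transition model over a demonstration sequence.
    Out of every (cond_every + cond_len) steps, the first cond_every steps
    feed the ground-truth demo state (teacher forcing) and the remaining
    cond_len steps feed the model's own previous prediction back in."""
    inputs, preds = [], []
    prev_pred = demo[0]
    period = cond_every + cond_len
    for t in range(len(demo) - 1):
        # self-feed during the last cond_len steps of each period
        use_own = (t % period) >= cond_every
        x = prev_pred if use_own else demo[t]
        inputs.append(x)
        prev_pred = step_fn(x)
        preds.append(prev_pred)
    return np.array(inputs), np.array(preds)
```

Training the model on the self-fed inputs is what exposes it to its own drift and teaches it to recover.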
IV Experiments
In our experiments we focus on real-robot applications of our proposed STM architecture and training procedure. We rely on simulators to train the state transition model for the Sawyer robot, a seven degrees-of-freedom robot arm equipped with a parallel gripper as end-effector.
We collected demonstration trajectories, i.e. sequences of states $s_1, \dots, s_T$, in the Gazebo [24] simulator by using the inverse kinematics solver provided by Rethink Robotics for the Sawyer robot. Both the start and the goal configuration of each demonstration trajectory are perturbed by uniform noise to cover a larger state space, which improves the generalizability of our method.
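The perturbation step can be sketched as follows; the noise magnitude and the per-joint uniform distribution are assumptions for illustration, not values reported in this paper.

```python
import numpy as np

def perturb_configuration(q, noise=0.05, rng=None):
    """Perturb a nominal start/goal joint configuration with uniform noise
    in [-noise, +noise] per joint, spreading demonstrations over a larger
    region of state space. The magnitude 0.05 rad is an assumed value."""
    if rng is None:
        rng = np.random.default_rng()
    q = np.asarray(q, dtype=float)
    return q + rng.uniform(-noise, noise, size=q.shape)
```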
In the following experiments on the Sawyer robot, we model the state space as follows: the thirteen-dimensional state at time step $t$ is represented by the seven joint angles $q_1, \dots, q_7$, plus the current gripper position relative to the goal and the time-independent goal position, both in Cartesian coordinates:

$$s_t = (q_1, \dots, q_7, \Delta x, \Delta y, \Delta z, g_x, g_y, g_z)$$
In our definition, the state does not rely on the environment dynamics. This assumption is crucial for the STM to be transferable between different simulation environments and the real world.
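As a concrete illustration, the state defined above can be assembled as follows; the sign convention for the relative gripper position (goal minus gripper) and the ordering of the components are assumptions for illustration.

```python
import numpy as np

def build_state(joint_angles, gripper_pos, goal_pos):
    """Assemble the thirteen-dimensional state: seven joint angles, the
    gripper position relative to the goal (3D), and the goal position
    itself (3D, constant over the trajectory)."""
    joint_angles = np.asarray(joint_angles, dtype=float)
    goal_pos = np.asarray(goal_pos, dtype=float)
    rel = goal_pos - np.asarray(gripper_pos, dtype=float)
    return np.concatenate([joint_angles, rel, goal_pos])
```

Because nothing in this vector depends on contact forces or masses, the same state definition works unchanged across simulators and the real robot.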
IV-A Sawyer Reacher
In the first experiment, we evaluate the STM on a basic servoing task: the STM is tasked to synthesize state sequences that move the gripper from a random initial joint configuration to a randomly sampled goal position given in task space. In simulation, we collect 45 demonstration trajectories ranging from 50 to 70 states.
IV-B Sawyer Pick-and-Place
In the second experiment, we evaluate the STM on a pick-and-place task: the STM is tasked to synthesize state sequences that control the gripper from a random initial joint configuration to a randomly sampled goal position. In simulation, we collect 150 demonstration trajectories ranging from 166 to 170 states.
IV-C Sawyer Block Stacking
The block stacking task presents a more challenging environment where accuracy in placing blocks is key. We ask the robot to place two blocks on top of each other at a designated position on the table. We collect 150 demonstrations in the Gazebo simulator of Sawyer picking up blocks from random positions and placing them at random goal locations.
We see block stacking as a more complex version of pick-and-place, where the STM needs to learn to place blocks at different heights. For block stacking, we use the same network architecture and training process as for pick-and-place, while training from demonstrations under different target settings, i.e. random 3D positions.
IV-D High-level Control
In the next experiment, we evaluate how well our model can be used to generate trajectories given high-level task descriptions. We ask the robot to draw a circle of a defined radius and train an STM from a set of 10 circular motion sequences as demonstrations, spanning circles of varying radii.
Defining such behavior in a traditional motion planning setup would require specifying the waypoints on the circle such that the inverse kinematics (IK) solver can find the joint angle transitions to servo the gripper between them. Instead, a deep learning model can learn from demonstrations the connection between high-level goals (i.e. the given radius) and the desired behavior (i.e. circle-drawing trajectories).
IV-E Open-loop Control with Inverse Dynamics Model
We trained an inverse dynamics model (IDM) to accomplish torque control on the real robot. Combining an STM and an IDM has the intriguing advantage of transferring behaviors from simulation to reality: the STM, serving as joint position motion planner, remains unchanged between both environments. The IDM, on the other hand, can be trained separately for each environment and robot configuration, as it is the only module that depends on the environment dynamics. Such decoupling of both models has the potential for higher sample efficiency compared to the simulation-to-real transfer of entire policy networks, as commonly done in traditional deep reinforcement learning approaches that train entirely in simulation [25, 26, 27].
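The decoupling described above amounts to a simple open-loop rollout; the sketch below illustrates the control flow, where `stm_step` and `idm` are hypothetical stand-ins for the trained models.

```python
import numpy as np

def open_loop_rollout(stm_step, idm, s0, goal, horizon):
    """Open-loop pipeline: the STM plans the next joint-position state,
    the IDM maps consecutive planned states to a torque command.
    stm_step(s, goal) and idm(s, s_next) stand in for the trained models."""
    states, torques = [s0], []
    for _ in range(horizon):
        s_next = stm_step(states[-1], goal)      # planned next state (STM)
        torques.append(idm(states[-1], s_next))  # torque to reach it (IDM)
        states.append(s_next)
    return np.array(states), np.array(torques)
```

To transfer to a new environment, only `idm` would be retrained; the planner `stm_step` stays fixed.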
The IDM is a three-layer MDN (each layer having 256 hidden units) that parameterizes a Gaussian mixture model consisting of fifteen normal distributions per action dimension (seven dimensions for the joint actuators). Through our experiments, we found an IDM conditioned on the current state plus the two previous states and actions (cf. Fig. 7) to achieve the highest accuracy in steering between $s_t$ and $s_{t+1}$ via torque control.
IV-E1 Sawyer Reacher (Open-loop Control with STM and IDM)
Our first Sawyer environment is a simple reaching testbed in which the robot is tasked to servo the gripper to one of four desired goal locations. We train the IDM in Gazebo under the Open Dynamics Engine (ODE) [28] physics simulation. We test both in ODE and under the Bullet [29] physics engine.
IV-E2 Sawyer Pick-and-Place (Open-loop Control with STM and IDM)
The goal is to place a block at three designated goal positions. The robot always starts from the same joint configuration.
In Fig. 10, we visualize the position of the end-effector while the IDM tracks the trajectory synthesized by the STM in the ODE physics simulation.
V Results
In Table I, we compare the performance of our auto-conditioned recurrent MDN against several other STM architectures as baselines: a plain LSTM, an auto-conditioned LSTM and a recurrent MDN. Performance is measured as the success rate over 20 rollouts for the experiments described in Sec. IV. Our method outperforms the baseline models on all tasks, especially on pick-and-place and block stacking, where more complex trajectories need to be synthesized.
For reacher, we trained our proposed STM on an Nvidia GeForce GTX 1070 graphics card. For comparison, we use a three-layer LSTM with 64 hidden units per layer for all of the models, and three Gaussians in the MDN-based STMs, i.e. the vanilla recurrent MDN and the auto-conditioned recurrent MDN. We observed that even on such a simple task, the MDN yielded more robust behavior: the STM generated reasonable trajectories for targets that were outside the viable range of Sawyer, whereas STMs without such a stochastic model failed to find any trajectories that reached close to such goals.
For pick-and-place, our proposed STM trained in ca. two hours on the Nvidia GeForce GTX 1070. We use a three-layer LSTM with 128 hidden units per layer for all of the models and 20 Gaussians in the MDN-based models. We observed that auto-conditioning significantly reduces the accumulation of error, which is a common problem in generating trajectories with RNNs.
To investigate the benefits of a deep learning model as a high-level controller, we evaluate its ability to associate high-level commands with demonstration trajectories, as described in Sec. IV-D. As shown in Fig. 3, our STM is able to learn from a few demonstrations the connection between the radius and the resulting trajectory, yielding circular motions with accurate radii.
Table I: Success rates over 20 rollouts for each STM architecture.

                      Reacher   Pick-and-place   Stacking
LSTM                  80%       0%               0%
a.c. LSTM             90%       50%              25%
Recurrent MDN         90%       60%              30%
a.c. Recurrent MDN    100%      100%             80%
VI Conclusion
In this work, we present a recurrent neural network architecture and training procedure that enable the efficient generation of complex joint position trajectories. Our experiments have shown that our STM can generalize to unseen tasks and is able to learn the underlying task specification, which enables it to follow high-level instructions. In combination with a learned inverse dynamics model, we have shown a fully trainable motion planning pipeline on a real robot that combines the state transition model, as planning module, with an IDM, as position controller, to generate joint torque commands that track the synthesized trajectories.
Future work is directed toward a deeper connection to the inverse dynamics model. We plan to close the loop between the STM and IDM such that the STM can be re-evaluated after the trajectory has been executed for one or more time steps. Such an approach resembles model-predictive control and would allow the STM and IDM to react to changes in the environment, such as dynamic obstacles, that require replanning.
References
 [1] A. H. Qureshi, M. J. Bency, and M. C. Yip, “Motion Planning Networks,” ArXiv e-prints, June 2018.
 [2] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” CoRR, vol. abs/1105.1186, 2011. [Online]. Available: http://arxiv.org/abs/1105.1186

 [3] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Batch informed trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 3067–3074.
 [4] D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed. Morgan-Kaufmann, 1989, pp. 305–313.
 [5] J. Ho and S. Ermon, “Generative Adversarial Imitation Learning,” in NIPS, 2016, pp. 4565–4573.
 [6] K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. J. Lim, “Multimodal imitation learning from unstructured demonstrations using generative adversarial nets,” in Advances in Neural Information Processing Systems, 2017, pp. 1235–1245.
 [7] Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” in Advances in Neural Information Processing Systems, 2017, pp. 3812–3822.
 [8] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov 1997.
 [9] X. Wang, S. Takaki, and J. Yamagishi, “An autoregressive recurrent mixture density network for parametric speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4895–4899.
 [10] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
 [11] A. Graves, “Generating sequences with recurrent neural networks,” CoRR, vol. abs/1308.0850, 2013.
 [12] X. B. Peng, G. Berseth, K. Yin, and M. van de Panne, “Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Transactions on Graphics (Proc. SIGGRAPH 2017), vol. 36, no. 4, 2017.
 [13] Y. Zhou, Z. Li, S. Xiao, C. He, Z. Huang, and H. Li, “Autoconditioned recurrent networks for extended complex human motion synthesis,” in International Conference on Learning Representations, 2018.
 [14] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Exampleguided deep reinforcement learning of physicsbased character skills,” arXiv preprint arXiv:1804.02717, 2018.
 [15] K. Li, X. Zhao, J. Bian, and M. Tan, “Sequential learning for multimodal 3d human activity recognition with longshort term memory,” in Mechatronics and Automation (ICMA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1556–1561.
 [16] L. Sun, Z. Yan, S. Molina Mellado, M. Hanheide, T. Duckett, et al., “3DOF pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data,” in International Conference on Robotics and Automation (ICRA) 2018. IEEE, 2018.
 [17] A. M. Lamb, A. G. A. P. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4601–4609.
 [18] A. Venkatraman, M. Hebert, and J. A. Bagnell, “Improving multi-step prediction of learned time series models,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

 [19] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 627–635.
 [20] C. M. Bishop, “Mixture density networks,” Aston University, Tech. Rep., 1994.
 [21] M. Schuster, “Better generative models for sequential data problems: Bidirectional recurrent mixture density networks,” in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller, Eds. MIT Press, 2000, pp. 589–595.
 [22] R. Rahmatizadeh, P. Abolghasemi, and L. Bölöni, “Learning manipulation trajectories using recurrent neural networks,” CoRR, vol. abs/1603.03833, 2016.
 [23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [24] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in International Conference on Intelligent Robots and Systems (IROS), vol. 3. IEEE/RSJ, 2004, pp. 2149–2154.
 [25] S. James, A. J. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” ser. Proceedings of Machine Learning Research, vol. 78. PMLR, 2017.
 [26] F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” Robotics: Science and Systems Conference (RSS), 2017.
 [27] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018.
 [28] E. Drumwright, J. Hsu, N. Koenig, and D. Shell, “Extending open dynamics engine for robotics simulation,” in Simulation, Modeling, and Programming for Autonomous Robots, N. Ando, S. Balakirsky, T. Hemker, M. Reggiani, and O. von Stryk, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 38–50.
 [29] E. Coumans, “Bullet physics simulation,” in ACM SIGGRAPH 2015 Courses, ser. SIGGRAPH ’15. New York, NY, USA: ACM, 2015.