I Introduction
In robotic systems, the tasks are usually defined in coordinate space, whereas the control commands are defined in actuator space. In order to perform task level robot learning, an appropriate transformation from coordinate space to actuator space is required. If the intrinsic coordinates of a manipulator are defined as a vector of joint angles
, and the position and orientation vector of the end effector as a vector , then the forward kinematics function can be given by the following equation(1) 
The inverse kinematics problem[1, 2] is to find a mapping from the endeffector coordinates to actuator space which can be represented as
(2) 
For redundant robotic systems, that is, when the dimension of the taskspace is smaller than the dimension of the jointspace , is not a unique mapping. Given a taskspace position , there can be many corresponding jointspace configurations of . Thus, learning inverse kinematics relates to learning multivalued function.
Inverse kinematics for humanoid robots are important for applications like pick and place[3], physics engines[4, 5, 6, 7, 8] and humanrobot interactions like teleoperating a robot to grasp objects[9], or execute a series of coordinated gestures[10, 11]. Inverse kinematic approaches can be broadly divided into two categories, namely closedfrom analytical methods and numerical methods. Some examples of numerical methods are BFGS[12], pseudoJacobian inverse[1, 13, 14], Jacobian transpose[1, 15, 14], Damped Least Square method (DLS)[16, 14], and Cyclic Coordinate Descent (CCD). Unlike closedform analytical methods, the convergence time of numerical methods may vary and the results are not repeatable. On the top of that, computing inverse kinematics under constraints of stability and selfcollision avoidance cannot be done efficiently in real time.
Recent advancements in RL[17] like Deep learning (DQN)[18], Deterministic Policy gradients (DPG)[19], Guided Policy Search[20], Trust region policy optimization[21] and DDPG[22, 23]
provide us many frameworks for not only learning the complex problem of IK, but also to optimize the required criteria. Among these methods, DDPG learns an efficient policy when the action space is continuous which is the case with inverse kinematics. Reinforcement Learning works on the experienced data, and thus would avoid problems due to matrix inversions which may occur while solving general inverse kinematics. Therefore, learning would never demand impossible postures which occurs due to illconditioned matrix inversions.
In this paper we demonstrate how deep RL can be used to learn generalized solutions for inverse kinematics. We propose a DDPG based IK solver which takes into account criteria of stability and selfcollision avoidance while generating configurations. We validated the method by applying this framework to learn reachability tasks in the double support phase[24, 25].
The rest of the paper is organized as follows. In Section II, kinematic model of the robot is explained followed by the calculation of zero moment point (ZMP) and a brief description of general IK solvers. Section III explains DDPG algorithm and the proposed methodology to learn the stable IK solver. It also gives a brief description about the reward function used and the network architecture. Following that we show the results of training in Section IV along with numerical simulations. Finally conclusions and future work are discussed in Section V.
Ii Kinematic modelling and Inverse Kinematics
The humanoid model used for study is shown in Fig. 1. The robot is small, with total height of 84 cm, total weight below 5 Kg and 27 DoF. The main highlight is its vertebral column (5 DoF articulated torso), which makes the biped more close to human. In this work, we have tried to explore the usage of articulated torso for performing reachability hand tasks.
Iia Kinematic model of the robot
The kinematic model of the robot is represented using the DH convention with the base frame at the right leg sole and the first joint angle starting from right ankle. Starting from the base, a coordinate frame is defined at each joint and at the end of each endeffector (hands and left leg). The complete kinematic model with axes numbering is shown in Fig. 2.
IiB Stability and Calculation of ZMP
ZMP[26], [25] is one of the widely used dynamic stability measures which was proposed by Vukobratović and Stepanenko in 1972. In bipeds, a configuration is stable if ZMP lies inside the convex hull of the feet, which defines the support polygon[27].
In double support phase when the robot is stationary, ZMP is equal to the center of mass (CoM) projection in the support polygon. Whereas when the robot is not stationary, ZMP might deviate from the CoM projection. In order to have an accurate ZMP point we need to include momentum and angular momentum into the calculation. Hence the ZMP should be calculated as [28]
(3) 
where , are the x and y coordinates of CoM, M is the total Mass of the robot, is acceleration due to gravity and [, , ], [, , ] are the angular and linear momentum respectively w.r.t base frame.
IiC General Inverse Kinematics and Stability
Most of the general inverse kinematics solvers work in the velocity domain and solve for inverse kinematics using Jacobian or gradient descent method. Although constraints like singularity avoidance and joint limits can be included in these methods, stability criteria cannot be included directly in IK solver. Hence the resultant solution of inverse kinematics may not be stable in case of humanoids.
One way to avoid such scenarios is to keep on checking the intermediate configurations and reiterating until a stable configuration is achieved by the IK solver. Generally, this takes very long time to converge and in some cases, the solver might not be able to find a stable configuration at all. Therefore, there is a need for an IK solver which takes stability into account while solving and doesn’t require any external checking. This kind of solver can be easily learnt using learning based inverse kinematics[29]. RL requires only the kinematic model of the robot for learning a generalized solver. Using this idea, a generalized IK framework can be defined for complex robots like humanoids where balance and posture plays a great role apart from reaching the goal.
Iii Deep Deterministic Policy Gradient (DDPG) for Inverse Kinematics
Reinforcement learning [17] can be used to train an agent which learns from the environment directly without the use of any external data. An RL agent takes an action () depending on its state () and observes the reward () given out by the environment. This process is repeated. The aim of the RL agent is to maximize the cumulative reward in any task. Hence the underlying reward function and its modelling plays a crucial role in RL.
A RL agent can learn different types of policies pertaining to the given task. A policy function represents the agent’s behaviour needed to complete a given task. It is a mapping from state to action. Policy learning can be subdivided into two categories: 1) Stochastic 2) Deterministic. Stochastic policy learns the conditional probability of taking an action
, given that it is in a state , . A stochastic policy is useful only when number of actions are discrete and countable at any given state. In a continuous stateaction space, this leads to a large number of possibilities and hence large memory and search time. A deterministic policy on the other hand learns the action as a function of state,. This function is nonlinear in complex systems like humanoids and neuralnetworks serves as good model to learn such kind of functions. DDPG provides us with a very good frame work to train the neuralnetworks for learning highly nonlinear functions.
Iiia Deep Deterministic Policy Gradient
DDPG uses the underlying idea of DQN in the continuous stateaction space. It is an ActorCritic Policy learning[30] method with added target networks to stabilize the learning process. DDPG uses experience replay which addresses the issue of data being dependent and nonidentically distributed as most optimization algorithms need samples that are identical and independently distributed. Transitions are sampled from the environment according to the given exploration policy and the tuple are stored in a replay buffer of finite size. When this buffer becomes full, oldest samples are discarded. A minibatch of samples, is used to update the network. The critic network is learnt using Bellman equation[31] as in learning[32], and the actor updates the policy in the direction that improves
, i.e., critic provides the loss function for actor. In order to avoid the divergence of neural networks in
learning, target networks are used which track the original networks slowly.Suppose , represent critic and actor networks respectively and , represent their target networks, the loss function for critic network can be given as
(4) 
where is the reward and is discount factor.
The loss function for the actor is given as
(5) 
The target networks are updated as follows with
(6) 
As it can be observed from Eq. 4, reward function is an integral part of the network update and hence the underlying policy that is learnt by the network. Therefore reward function should be modelled carefully so that the RL agent learns the policy correctly.
IiiB State vector and network architecture
The chosen state vector consists of joint angles (), the end effector coordinates and the goal position coordinates. The action vector is a set of angular velocities, . Hence the policy learns a mapping from configuration space to velocity space. The state vector is of 21 dimensions and the action vector is of 13 dimensions. A 2 layered network consisting of fully connected layers with 400, 300 hidden units is used for both Actor and Critic. [33]
is taken as activation function and
is taken as 0.001. Batch normalization is used in the network to avoid overfitting and handle the scale variance problems.
IiiC Reward function
The main objective of an IK problem is to provide a set of angles () that are needed to reach the given position and orientation. Most of the Jacobian based methods solve for this using gradient descent and the solution is minimized in terms of . Therefore is included as a part of the reward function. In order to ensure that the configurations given out by the solver are within the stability region, a large negative reward is given whenever it goes out of stability bounds. The final reward function is shown in Eq. 7.
(7) 
where are the normalization constants, is the absolute distance between goal position and the current end effector position and is the angular difference between the starting configuration and the current configuration of the joint. In our case, is , is and is 20.
IiiD Environment modelling and Training
Modelling of the environment is a very crucial part for any RL algorithm. In order to learn a generalized inverse kinematics solution, the entire configuration space needs to be spanned while training. This is achieved by randomly sampling both start configuration and goal position for every episode. Algorithm 1 shows the environment used for the training.
The Actor and Critic networks are trained using the given Humanoid Environment. In DDPG, policy is learnt by the Actor network and Qvalue function is learnt by the Critic network. Target networks are used for both Actor and Critic and these are updated very slowly using as in Eq. 6. We used a Replay buffer of size . The pseudo code for training is given in Algorithm 2
. A normally distributed decaying random noise is used for the exploration noise which is observed to provide good results in training. Critic and Actor networks are updated as given in Eqs.
4 and 5 respectively. The training results and their evaluations are shown in the subsequent sections.Accuracy  Episodes  

900  1200  1500  1800  2100  2700  3000  3900  9900  
Min  23%  61%  72%  75%  79%  80%  76%  88%  87% 
Max  36%  69%  78%  82%  85%  84%  83%  89%  93% 
Mean  30%  66.33%  75%  78.33%  82.66%  82%  79.66%  88.66%  90% 
Iv Results and Simulations
The humanoid model was trained taking into account all the criteria explained in the previous sections. Training was run for 50000 episodes with 150 steps in each episode, totalling 7.5 million steps.
Iva Training results
Figs. 2(a) and 2(b) show the normalized value and reward of training. In Fig. 2(a), the plot started to nearly saturate after 30000 episodes showing the attainment of optimal value function. Error is defined as the difference between the endeffector and goal position at the end of an episode. The corresponding normalized error plot is shown in Fig. 2(c). The error goes on decreasing with training and reaches a minimum value soon after 1500 episodes showing that the network has learnt the required policy. The same is reflected in the reward function as shown in Fig. 2(b).
The trained model is tested for reachability tasks by giving random start configurations and goal positions. The accuracies of the network obtained by using 100 random samples over 3 different seeds at different points of learning are documented in Table. I. It can be observed from the table, that the accuracy goes on increasing with the training and oscillates in a small region after 2000 episodes. The highest mean accuracy obtained is 90% at 9900 episodes.
IvB Simulated Experiments
The trained IK solver is tested in the dynamic simulator of MSC Adams environment. The joint trajectories generated by the solver are given as input to the simulator for testing the solution. A set of three experiments which have high probability of losing balance are chosen in order to demonstrate the efficiency of the learnt IK solver and also to explore the advantages of an articulated torso.
In the first task, it had to reach a point in the far right end where it needs to use its spine to bend towards the right, as shown in Fig. 4a. In the second task, as shown in Fig. 5a, it has to reach a point in the leftback side, where the chest motion is tested. In the last task, it had to reach a point below its knee where it tried to explore the limitation of the pelvis and abdomen joints which is shown in Fig. 6a.
Figs. 4a, 5a and 6a show the end effector trajectories along with the final posture of the robot. The corresponding ZMP plots for tasks are shown in Figs. 4b, 5b and 6b. It was observed that the ZMP stays within the support polygon while performing each of these tasks.
In all of the three tasks, it was observed that the robot used the other hand to balance itself and stay within the stability region and also avoided self collision. The vertebral column played an important role in making the postures similar to humans, which can be observed from the Figs. 4, 5 and 6.
V Conclusions and Future work
This paper proposes a methodology for generating dynamically stable inverse kinematic solutions using deep RL. The approach was able to learn a robust IK solver within 2000 episodes. The robustness of the model was tested by giving various complex tasks. It was able to reach most of the points in its configuration space without losing its balance. Also, the solver converges to an inverse kinematics solution in less number of iterations as compared to general inverse kinematic solvers in most of the cases. Although the proposed model has limitations on precision, this model can serve as a good prototype for inverse kinematics solver of highly redundant manipulators.
Future work includes increasing the accuracy of the trained model and learning more complex tasks which involves movement of legs.
References
 [1] A. Colomé, “Smooth inverse kinematics algorithms for serial redundant robots,” Ph.D. dissertation, Master Thesis, Institut de Robotica i Informatica Industrial (IRI), Universitat Politecnica de Catalunya (UPC), Barcelona, Spain, 2011.
 [2] Y. Chua, K. P. Tee, and R. Yan, “Robust optimal inverse kinematics with selfcollision avoidance for a humanoid robot,” in ROMAN, 2013 IEEE. IEEE, 2013, pp. 496–502.
 [3] J.P. Saut, M. Gharbi, J. Cortés, D. Sidobre, and T. Siméon, “Planning pickandplace tasks with twohand regrasping,” in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on. IEEE, 2010, pp. 4528–4533.
 [4] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for modelbased control,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 5026–5033.
 [5] E. Rohmer, S. P. Singh, and M. Freese, “Vrep: A versatile and scalable robot simulation framework,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 1321–1326.
 [6] N. Koenig and A. Howard, “Design and use paradigms for gazebo, an opensource multirobot simulator,” in Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, vol. 3. IEEE, 2004, pp. 2149–2154.
 [7] S. Carpin, M. Lewis, J. Wang, S. Balakirsky, and C. Scrapper, “Usarsim: a robot simulator for research and education,” in Robotics and Automation, 2007 IEEE International Conference on. IEEE, 2007, pp. 1400–1405.
 [8] E. Coumans, “Bullet physics engine,” Open Source Software: http://bulletphysics. org, vol. 1, 2010.
 [9] N. Chen, C.M. Chew, K. P. Tee, and B. S. Han, “Humanaided robotic grasping,” in ROMAN, 2012 IEEE. IEEE, 2012, pp. 75–80.
 [10] M. Do, P. Azad, T. Asfour, and R. Dillmann, “Imitation of human motion on a humanoid robot using nonlinear optimization,” in Humanoid Robots, 2008. Humanoids 2008. 8th IEEERAS International Conference on. IEEE, 2008, pp. 545–552.
 [11] K. P. Tee, R. Yan, Y. Chua, and Z. Huang, “Singularityrobust modular inverse kinematics for robotic gesture imitation,” in Robotics and Biomimetics (ROBIO), 2010 IEEE International Conference on. IEEE, 2010, pp. 920–925.

[12]
M. F. Møller, “A scaled conjugate gradient algorithm for fast supervised learning,”
Neural networks, vol. 6, no. 4, pp. 525–533, 1993.  [13] D. E. Whitney, “Resolved motion rate control of manipulators and human prostheses,” IEEE Transactions on manmachine systems, vol. 10, no. 2, pp. 47–53, 1969.
 [14] S. R. Buss, “Introduction to inverse kinematics with jacobian transpose, pseudoinverse and damped least squares methods,” IEEE Journal of Robotics and Automation, vol. 17, no. 119, p. 16, 2004.
 [15] W. A. Wolovich and H. Elliott, “A computational technique for inverse kinematics,” in Decision and Control, 1984. The 23rd IEEE Conference on, vol. 23. IEEE, 1984, pp. 1359–1363.
 [16] J. J. Moré, “The levenbergmarquardt algorithm: implementation and theory,” in Numerical analysis. Springer, 1978, pp. 105–116.
 [17] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [18] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double qlearning.” in AAAI, 2016, pp. 2094–2100.

[19]
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller,
“Deterministic policy gradient algorithms,” in
Proceedings of the 31st International Conference on Machine Learning (ICML14)
, 2014, pp. 387–395.  [20] S. Levine and V. Koltun, “Guided policy search,” in Proceedings of the 30th International Conference on Machine Learning (ICML13), 2013, pp. 1–9.
 [21] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proceedings of the 32nd International Conference on Machine Learning (ICML15), 2015, pp. 1889–1897.
 [22] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [23] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3389–3396.
 [24] C.L. Shih and W. A. Gruver, “Control of a biped robot in the doublesupport phase,” IEEE transactions on systems, man, and cybernetics, vol. 22, no. 4, pp. 729–735, 1992.
 [25] M. Dekker, “Zeromoment point method for stable biped walking,” Eindhoven University of Technology, 2009.
 [26] M. Vukobratović and J. Stepanenko, “On the stability of anthropomorphic systems,” Mathematical biosciences, vol. 15, no. 12, pp. 1–37, 1972.
 [27] W. Vaughan Jr and R. Herrnstein, “Stability,” Melioration, and Natural Selection, vol. 1, pp. 185–215, 1987.
 [28] S. Kajita, H. Hirukawa, K. Harada, and K. Yokoi, Introduction to humanoid robotics. Springer, 2014, vol. 101.
 [29] A. D’Souza, S. Vijayakumar, and S. Schaal, “Learning inverse kinematics,” in Intelligent Robots and Systems, 2001. Proceedings. 2001 IEEE/RSJ International Conference on, vol. 1. IEEE, 2001, pp. 298–303.
 [30] M. Barto, “J. 4 supervised actorcritic reinforcement learning,” Handbook of learning and approximate dynamic programming, vol. 2, p. 359, 2004.
 [31] S. Peng, “A generalized dynamic programming principle and hamiltonjacobibellman equation,” Stochastics: An International Journal of Probability and Stochastic Processes, vol. 38, no. 2, pp. 119–134, 1992.
 [32] C. J. Watkins and P. Dayan, “Qlearning,” Machine learning, vol. 8, no. 34, pp. 279–292, 1992.

[33]
W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in
International Conference on Machine Learning, 2016, pp. 2217–2225.
Comments
There are no comments yet.