I Introduction
Ia Reinforcement Learning for Industrial Robotics
In a close future, it is likely to see robots programming themselves to carry out new industrial tasks. From manufacturing to assembling, there are a wide range of tasks that can be performed faster and with higher accuracy by robot manipulators. Over the past two decades, reinforcement learning in robotics
[2, 3] has made rapid progress and enabled robots to learn a wide variety of tasks [4, 5, 6]. The emergence of selfprogramming robots might speed up the development of industrial robotic platforms insofar as robots can learn to execute tasks with very high accuracy and precision.One major step in a robot learning algorithm is the exploration phase. In such phase, random commands are sent to the robot such that it discovers both its environment and the way it responds to commands sent. In this process, random commands are sent to the robot, which can result in any possible movement within its reachable workspace. In an industrial context, such unpredictable behavior is dangerous, for instance when a robot has to learn a task jointly with a human worker (e.g. an assembly task). For this reason, it seems interesting to work with KUKA LBR iiwa robot manipulators, which are very good for collaborative tasks as their compliance can be adjusted easily and they can be programmed to stop when feeling contact.
IB Literature Overview
Reinforcement Learning (RL) and Optimal Feedback Control (OFC) are very similar in their formulation : find a policy that minimizes a certain cost function under a certain dynamics (see section II for more details). They both enable to phrase many challenging robotic tasks. Indeed, a solution to such problem is both an optimal openloop trajectory and a feedback controller. If the dynamics is linear and the cost function quadratic, an optimal solution can be computed analytically using LinearQuadraticRegulators theory [7].
When the dynamics is more complex (nonlinear), the problem becomes more difficult but can still be solved with iterative LinearQuadraticRegulator algorithm (iLQR or iLQG) [8]. As its name suggests, this algorithm iteratively fits local linear approximations to the dynamics and computes a locally optimal solution under this linear model. In the context of RL, the dynamics is considered unknown. To deal with this issue, [9, 10]
have proposed to build the linear model by exploring the environment and make a linear regression.
In [1], we recently proposed another method that consists in computing the cost function likewise, using exploration and quadratic regression. This way, the model is more precise and can converge faster towards high precision tasks, which is the main purpose of our research. Indeed, in some tasks, for example Cartesian positioning, a typical approach [11] consists in including the Cartesian position in the state, build a linear model and then build a quadratic cost from this linear approximation. Such approach does not really make sens as this quantity has already been approximated in the first order and thus cannot produce a good second order model for update.
IC Main contribution and paper organization
In this paper, we extend the concepts of [1]. Second order methods have been implemented to compute trajectory update and this way increase the speed of convergence by reducing the number of iLQG pass required. We also study the influence of different parameters on the speed of convergence.Such parameters are compared and chosen using the VREP software [12]. Finally, we propose an experimental validation on the physical device, using the parameters found by simulation. The KUKA LBR iiwa learns a positioning task in Cartesian space using angular position control without any geometric model provided. This rather simple task enables to measure easily the accuracy of the policy found.
The paper is organized as follows. Derivation of iLQG with learnt dynamics and cost function is written in section II. In section III, we try to find the best learning parameters through simulating the same learning situation with different parameters using the VREP simulation software. Experimental validation on KUKA LBR iiwa is presented in section IV and section V proposes a discussion on the results and future work.
Ii Learning a Local Trajectory With High Precision
This section summarizes the method used. First, the derivation of iLQG in the context of unknown dynamics and learnt cost function is written. The second order method to compute the improved controller is explain in a second step.
Iia A few definitions
This section begins with some useful definitions:

A trajectory of length is defined by the repetition times of the pattern shown in Fig. 1. Mathematically, it can be denoted by
where and
represent respectively state and control vectors. In our problem (see section
III and IV), the state is the vector of joint positions and the actions are joints target positions. 
The cost and dynamics functions are defined as follows:
(1) (2) outputs the cost and the next state, both with respect to previous state and action.

The controller is the function we want to optimize. For a given state, it needs to output the action with smallest cost that follows the dynamics. In our case, it is denoted by and has the special special form of a timevarying linear controller:
(3)
The guiding principle of iLQG is to alternate between the two following steps. From a nominal trajectory, denoted by and , , compute a new local optimal controller. From a given controller, draw a new nominal trajectory.
IiB Local approximations of cost and dynamics
As explained earlier, from a given nominal trajectory , the goal is to update the controller such that the cost is minimized in its neighborhood. In this process, the first step is to compute local approximations of the cost function and dynamics around the nominal trajectory:
(4) 
(5) 
where and represent variations from the nominal trajectory and is the vector . The notation (resp. ) is the Jacobian (resp. Hessian) matrix of w.r.t. (resp. and ).
We propose to compute both approximations following an exploration and regression scheme. The first stage generates a certain number
of random trajectories around the nominal. These trajectories are normally distributed around
with a certain timevarying covariance . Hence, during the sample generation phase, the controller is stochastic and follows:(6) 
where stands for normal distribution. From these samples, we can make two regressions a linear one to get the dynamics and a second order polynomial one [13] to approximate the cost function.
IiC Update the controller
This section is about updating the controller once we have good Taylor expansions of the dynamics and cost function. In order to get a controller with low cost over the whole trajectory, we need to use the two value functions: and . represents the expected cost until the end of the trajectory if following after being in state and selecting action . is the same but conditioned only on , if is deterministic, these two functions are exactly the same. The reader can refer to [14] for more detailed definitions of these value functions.
First, we need to compute quadratic Taylor expansions of both value functions:
(7) 
(8) 
In the context of trajectory optimization defined above, [11] shows that these two functions can be approximated quadratically by
(9) 
These functions are computed backward for all the time steps, starting with , the final cost.
Under such quadratic value functions, following the derivation in [15], we can show that the optimal controller under such dynamics and cost is defined by
(10) 
A criterion to compute the new covariance is also needed. The goal being to explore the environment, we follow [16] and choose the covariance with highest entropy in order to maximize information gained during exploration. Such covariance matrix is:
(11) 
IiD Limit the deviation from nominal trajectory
The controller derived above is optimal only if the dynamics and cost are respectively linear and quadratic everywhere. The approximations being only valid locally, the controller needs to be kept close from the nominal trajectory to remain acceptable for update. This problem can be solved by adding a constraint to the cost minimization problem:
(12) 
where
is the statistical KullbackLeibler divergence.
andare the probability trajectory distributions under the current controller and the updated one.
[11] shows that such constrained optimization problem can be solved rather easily by introducing the modified cost function:
(13) 
Indeed, using dual gradient descent, we can find a solution to the constrained problem by alternating between the two following steps:

Compute the optimal unconstrained controller under for a given

If the controller does not satisfy (12), increase .
A large has the effect of increasing the importance on constraint satisfaction, so the larger is, the closer the new trajectory distribution will be from the previous one.
IiE Initialize and choose
The way is defined from approximation does not guaranty positive definiteness for . Which means that it might not be eligible to be a covariance matrix. This issue is addressed by increasing such that the distribution is close enough from the previous one. As the previous trajectory has a positive definite covariance, there must be an that will enforce positive definiteness. This gives a good way to initialize for a given pass.
Finally, the choice of is very important. If it is too small, the controller sequence won’t progress towards optimality. On the other hand, if it is too large, it might be unstable. The idea is to start with a certain and decrease it if the new accepted controller is worst than the previous one.
Iii Kuka Lbr Iiwa Positioning: Tune the Learning Parameters
A validation of the method is proposed on learning a simple inverse kinematics task. We consider a KUKA LBR iiwa robot (Fig. 2), where the geometric parameters are unknown. The state variables are the joints angular positions and the control vector gathers target joints positions for next state. The idea is to reach a Cartesian position of the end effector with high accuracy () without any geometric model.
Iiia Cost function
For this problem, the cost function needs to be expressed in terms of the Cartesian distance between the endeffector and the target point. We chose the cost function proposed in [11]:
(14) 
where and are both real user defined parameters. As we do not consider any geometric parameter of the robot, the distance cannot be obtained with direct model considerations and needs to be measured from sensors.
IiiB Tune the algorithm parameters
Previous work [1] showed that a number of samples around is a good balance between accurate quadratic regression and exploration time for d.o.f. robots. So we carry out our experiments with . Among all the parameters defined in previous sections, we identified critical ones : (the initial covariance, defined below), and from the cost function, . In this section, we learned optimal angular positions for the situation below with different sets of values on these parameters using VREP simulation software. The situation is the following:

Initial position : All angles at (straight position on Fig. 2)

Target position : Cartesian vector in , in the robot frame (red sphere on Fig. 2)

Initial mean command : target angular positions = initial positions (no move command).
Fig. 2 show a trajectory found by the algorithm.
The initial covariance matrix is also an important parameter as it defines the exploration range for all the future steps. Indeed, if it has large values, next iteration needs to have large covariance also because of (12). In our implementation, we start with diagonal covariance matrix where all the diagonal entries are the same. we denote the initial value of such diagonal entries, it is one of the parameters to be studied.
IiiC Results and analysis
From what we acknowledged running our algorithm, we picked up three values for each parameter and tried all the possible combinations to choose a good set of parameters for positioning task. Results obtained are summarized in table I. In our simulation, the robot was only allowed trials to reach precision. Thus, we insist that in table I, an underlined number represents the number of iLQG iterations before convergence whereas other numbers are the remaining distance to objective after iterations.
100  11  16  13  
1000  0.25  12  10  
10000  13  0.27  8  
100  0.11  14  16  
1000  10  12  10  
10000  0.10  1.69  0.24  
100  0.11  0.22  0.84  
1000  0.13  12  0.20  
10000  13  0.23  15 
100  0.32  0.15  0.39  

1000  0.45  0.28  0.22  
10000  0.30  0.29  0.31  
100  0.14  0.32  0.32  
1000  14  1.93  1.70  
10000  1.82  0.99  0.11  
100  0.34  0.38  0.39  
1000  0.71  0.29  0.53  
10000  0.70  0.14  2.31 
100  12.79  12.42  17.83  

1000  4.42  0.30  3.50  
10000  2.88  10.93  2.60  
100  24.37  15.75  10.13  
1000  7.66  6.32  1.87  
10000  2.67  8.37  6.44  
100  1.93  8.93  
1000  8.03  2.23  3.50  
10000  2.70  4.83  2.60 
[5pt]
Together with the raw data in table I, we plot the evolution of the distance within the iterations of a simulation for several sets of parameters. Looking at table I, it seems that the most critical parameter is . Fig. 3 shows three learning curves where only varies. From here it appears that the initial covariance is not crucial in the early stages of the learning process. However, looking at the bottom plot, which is a zoom on the final steps, we acknowledge that if the covariance is too large, the algorithm will not converge towards the desired accuracy behavior. Hence, we recommend to keep around to obtain the desired accurate behavior.
After setting to , we made the same plots for the other parameters (Fig. 4). These reveal that and do not appear to influence the behavior in this range of values. However, looking at the bottom plot, we can see that needs to be kept large enough such that an iLQG iteration can make enough progress towards optimality. For small , we waste time stuck near the initial configuration. For the three plots in Fig. 4, the zooms are not included as they do not reveal anything more.
Iv Experimental Validation on the Robot
In this section, we run our algorithm on a real KUKA LBR iiwa for a similar positioning task. The situation is slightly different:

Initial position : , angular positions in (Fig. 5, left picture)

Target position : , Cartesian position, in and in the robot frame.

Initial mean command : target angular positions = initial positions (no move command).
The choice of changing the initial configuration was motivated by two reasons. First, it enables to show that the parameters found in section III are not case dependant. Second, we had constraints with our working environment and this configuration was better regarding research going on with other robots simultaneously.
Fig. 5 shows the KUKA LBR iiwa in its initial and final configuration (after reaching the desired endeffector position).
Iva Results obtained on the robot
The learning process defined above resulted in the learning curve on Fig. 6. We note that it takes as many steps to go from initial configuration to accuracy than from to . The final command provided by the algorithm is . Regarding the learning time, the overall process took approximately minutes, for exploration and for calculations.
IvB Measure the Cartesian distance
On this experimental validation, the distance was computed from the endeffector position read from the robot internal values. Even if it was probably computed thanks to direct DH model, our algorithm used it as an output from a black box. Thus, similar results would have been obtained using any other distance measurement sensor (e.g. laser tracker). We just note that, the precision reached is relative to the measurement tool precision.
However, in future work, it will be useful to use an external measurement tool in order to compare our positioning method precision with other techniques. Indeed, the precision of the inverse kinematics of the robot cannot be defined with internal robot measurements. The previous statement is precisely the reason why we need to calibrate industrial robots. Hence, we will need to train the robot with external distance measurement sensors and to compare the precision with other methods using the same sensors.
V Discussion
In previous work [1], we showed that learning the cost function is more stable and converges faster than including distance in the state and approximate it in the first order. Here, we extend this work with second order improvement of the controller, which shows faster convergence properties under well chosen parameters. The high precision reached for this simple positioning task let us hope that such methods will be suitable for more complex industrial tasks.
In many applications, it is also interesting to handle orientation of the end effector. Such modification is not an issue, one just needs to make several points on the end effector match several target points ( or depending on the shape of the tool). This has been done with VREP and converges just as well, even if taking more time. We chose not to present these results in this paper as they do not show any additional challenge and learning curves are less easy to interpret as distances are to be averaged between the three points.
In future work, we plan on trying to handle manipulation tasks with contact, which is a major challenge as the functions to approximate will not be smooth anymore near contact points.
References
 [1] J. Guérin, O. Gibaru, E. Nyiri, and S. Thiery, “Locally optimal control under unknown dynamics with learnt cost function : application to industrial robot positioning.” to appear in ACD Advanced Control and Diagnosis, 2016.
 [2] J. Kober and J. Peters, “Reinforcement learning in robotics: A survey,” in Reinforcement Learning. Springer, 2012, pp. 579–610.
 [3] M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics.” Foundations and Trends in Robotics, vol. 2, no. 12, pp. 1–142, 2013.
 [4] C.K. Lin, “H∞ reinforcement learning control of robot manipulators using fuzzy wavelet networks,” Fuzzy Sets and Systems, vol. 160, no. 12, pp. 1765–1786, 2009.
 [5] M. P. Deisenroth, C. E. Rasmussen, and D. Fox, “Learning to control a lowcost manipulator using dataefficient reinforcement learning,” 2011.
 [6] J.J. Park, J.H. Kim, and J.B. Song, “Path planning for a robot manipulator based on probabilistic roadmap and reinforcement learning,” International Journal of Control Automation and Systems, vol. 5, no. 6, p. 674, 2007.
 [7] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. John Wiley & Sons, 2012.
 [8] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 4906–4913.
 [9] D. Mitrovic, S. Klanke, and S. Vijayakumar, “Adaptive optimal feedback control with learned internal dynamics models,” in From Motor Learning to Interaction Learning in Robots. Springer, 2010, pp. 65–84.

[10]
S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in
Neural Information Processing Systems (NIPS), 2014.  [11] S. Levine, N. Wagener, and P. Abbeel, “Learning contactrich manipulation skills with guided policy search,” CoRR, vol. abs/1501.05611, 2015. [Online]. Available: http://arxiv.org/abs/1501.05611
 [12] M. F. E. Rohmer, S. P. N. Singh, “Vrep: a versatile and scalable robot simulation framework,” in Proc. of The International Conference on Intelligent Robots and Systems (IROS), 2013.
 [13] S. M. Stigler, “Gergonne’s 1815 paper on the design and analysis of polynomial regression experiments,” Historia Mathematica, vol. 1, no. 4, pp. 431–439, 1974.
 [14] R. S. Sutton and A. G. Barto, Introduction to reinforcement learning. MIT Press Cambridge, 1998, vol. 135.
 [15] E. Todorov and W. Li, “A generalized iterative lqg method for locallyoptimal feedback control of constrained nonlinear stochastic systems,” in American Control Conference, 2005. Proceedings of the 2005, June 2005, pp. 300–306 vol. 1.

[16]
S. Levine and V. Koltun, “Guided policy search,” in
Proceedings of The 30th International Conference on Machine Learning
, 2013, pp. 1–9.
Comments
There are no comments yet.