ROS2Learn: a reinforcement learning framework for ROS 2

03/14/2019 ∙ by Yue Leire Erro Nuin, et al. ∙ 0

We propose a novel framework for Deep Reinforcement Learning (DRL) in modular robotics that provides an approach to train a robot directly from joint states, using traditional robotic tools. We use state-of-the-art implementations of the Proximal Policy Optimization, Trust Region Policy Optimization and Actor-Critic Kronecker-Factored Trust Region algorithms to learn policies in four different Modular Articulated Robotic Arm (MARA) environments. We support this process using a framework that communicates with typical tools used in robotics, such as Gazebo and Robot Operating System 2 (ROS 2). We compare the robustness of the performance of such methods in modular robots with an empirical study in simulation.




I Introduction

Current robot systems are designed, built and programmed by teams with multidisciplinary skills. The traditional approach to programming such systems is typically referred to as the robotics control pipeline and requires going from observations to final low-level control commands through: a) state estimation, b) modeling and prediction, c) planning, and d) low-level control translation [1]. As introduced by Zamalloa et al. [2], the entire process requires fine tuning of every step in the pipeline, incurring significant complexity, where optimization at every step is critical and has a direct impact on the final result.

Artificial Intelligence methods and, particularly, neuromorphic techniques such as artificial neural networks (ANNs) are becoming more and more relevant in robotics. Starting from 2016, promising results, such as the work of Levine et al. [3], showed a path towards simplifying the construction of robot behaviours through the use of deep neural networks as a replacement of the traditional approach outlined above. The described end-to-end approach for programming robots scales nicely when compared to traditional methods.

Reinforcement Learning (RL) is a field of machine learning that is concerned with making sequences of decisions. It considers an agent situated in an environment where at each timestep the agent takes an action and receives an observation and a reward. An RL algorithm seeks to maximize the agent's total reward through a trial-and-error learning process. Deep Reinforcement Learning (DRL) is the study of RL using neural networks as function approximators. In recent years, several DRL techniques have shown good success in learning complex behaviour skills and solving challenging control tasks in high-dimensional state spaces [4, 5, 6, 7, 8]. However, many of the benchmarked environments such as Atari [9] and Mujoco [10] rarely deal with realistic or complex environments (frequent in robotics) [11, 12], or use the tools commonly used in the field such as the Robot Operating System (ROS) [13]. The research conducted in previous work can only be translated into real-world robots with a considerable amount of effort for each particular robot. Hence, the scalability of previous methods for modular robots is questionable.

Modular robots can extend their components seamlessly by just adding modules to the robotic system. This brings clear advantages for the construction of robots; however, training them with current DRL methods becomes cumbersome for the following reasons: every small change in the physical structure of the robot requires a new training; building the tools to train modular robots (such as the simulation model and virtual drivers) is a time-consuming process; and transferring the results to the real robot is complex given the flexibility of these systems. In this work we present a framework that utilizes the traditional tools in the robotics environment, such as Gazebo [14] and ROS 2, which simplifies the process of building modular robots and their corresponding tools. Our framework includes baseline implementations [15] for the most common DRL techniques for policy iteration methods. Using this framework we present the results obtained benchmarking DRL methods in a modular robot with 6 degrees of freedom (DoF).

II Previous Work

Recent advances in the field of RL have led to the development of different approaches with neural network function approximators. Among the available techniques, the focus of this work is on model-free RL methods: Proximal Policy Optimization (PPO) [7] and natural gradient policy based methods, such as Trust Region Policy Optimization (TRPO) [6] and Actor Critic using Kronecker-Factored Trust Region (ACKTR) [8]. All of these are known as policy gradient methods, which perform updates at each episode to the policy parameters (on-policy).

TRPO [6] is a policy gradient method meant to solve RL problems more efficiently than "vanilla" policy gradient (VPG) [16]. The idea is to update the weights as fast as possible without diverging. To achieve this, TRPO uses a constraint on the KL-divergence [17], which gives a measure of how different two probability distributions are. TRPO can be applied both for learning non-trivial tasks in continuous control and for learning discrete control policies directly from raw pixel inputs. Thanks to the use of the natural policy gradient method, TRPO overcomes some limitations of VPG, such as choosing the step size and the low sample efficiency. Compared to other algorithms [18] [19], TRPO has proven to be a good approach for continuous control tasks. Details of the theoretical aspects of the TRPO method are given in Section III-B1.

ACKTR is an actor-critic RL method that applies trust region policy optimization using a Kronecker-factored approximation (K-FAC) to the curvature. This method uses the natural policy gradient and optimizes both the actor and the critic [8]. Similarly to TRPO, ACKTR also benefits from the natural policy gradient and can be applied in both continuous and discrete environments. Wu et al. [8] evaluated the sample and computational efficiency of ACKTR in Atari and several Mujoco environments, and compared it with the performance of Advantage Actor Critic (A2C) and TRPO. Their results indicate that ACKTR surpassed the performance of A2C and TRPO. In this work, we extend the evaluation of ACKTR to a set of environments that represent different robot configurations and scenarios. Details of the theoretical aspects of the ACKTR method are given in Section III-B3.


PPO [7] is a policy gradient method for RL which alternates between sampling data through interaction with the environment and optimizing the "surrogate" objective using Stochastic Gradient Descent. PPO differs from standard policy gradient methods by enabling multiple epochs of mini-batch updates. Compared to its predecessor (TRPO), PPO uses a "surrogate" objective that clips the policy probability ratio. In the original work, PPO was evaluated in the Atari, Mujoco and Roboschool [20] environments, where it had better performance compared to A2C, A2C + Trust region, VPG and TRPO. Details of the theoretical aspects of PPO are given in Section III-B2.

Previous works [21, 22, 23] present partial success in transferring behaviour learned in simulation to a real robot. These works explain the importance of having scenes in simulation that are as similar as possible to reality in order to simplify the process of transferring the learned behaviour to real scenarios. Yuke Zhu et al. [24] describe high-quality and realistic 3D scenes. The approach of Tobin et al. [25] randomizes the rendering in simulation, reaching enough variability. This allows the images in the real world to be considered as just another variation in the simulator. To the best of our knowledge, the work conducted in previous approaches focuses on restricted scenarios in a controlled environment where specific algorithms for solving particular tasks were used. This is not the case when a robotic system needs to be deployed in realistic scenarios, especially if the robot is modular and can present a number of different configurations.

The methods presented above take different theoretical approaches to solving RL tasks, each with its strengths and drawbacks, which makes it hard to determine which one is the most appropriate choice for a particular application and environment. The aim of this work is to evaluate the above-mentioned RL algorithms with the focus of determining which one of them is best suited for modular robotic applications. Section III describes the theoretical aspects of the evaluated RL methods and their adaptation to applications for modular robots. Section IV presents the experimental evaluation conducted on the 6DoF Modular Articulated Robotic Arm (MARA) robot. Section V summarizes the results and presents future perspectives and work.

III Methods

III-A Nomenclature

The methods presented below will be consistent with the following nomenclature, which is partially inspired by the work of Peters et al. [26]:

The three main components of an RL system for robotics are the state $s$ (also found in the literature as $x$), the action $a$ (also found as $u$) and the reward, denoted by $r$. We denote the current time step by $k$. The stochasticity of the environment is represented by a probability distribution $s_{k+1} \sim p(s_{k+1} \mid s_k, a_k)$ as model, where $a_k$ denotes the current action and $s_k$, $s_{k+1}$ denote the current and next state, respectively. Further, we assume that most policy gradient methods have actions that are generated by a policy $a_k \sim \pi_\theta(a_k \mid s_k)$, which is modeled as a probability distribution in order to incorporate exploratory actions.

The policy is assumed to be parametrized by policy parameters $\theta$. The sequence of states and actions forms a trajectory denoted by $\tau = [s_{0:H}, a_{0:H}]$, where $H$ denotes the horizon, which can be infinite. Often, trajectory, history, trial or roll-out are used interchangeably. At each instant of time, the learning system receives a reward $r_k = r(s_k, a_k)$.

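The general goal of policy optimization is to optimize the policy parameters $\theta$ so that the expected return is optimized:

$$ J(\theta) = \mathbb{E}\left\{ \sum_{k=0}^{H} \gamma^{k} r_k \right\} \tag{1} $$

where $\gamma$ is the discount factor.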
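As a minimal sketch of the return inside the expectation above, the following Python functions compute the discounted sum of rewards for a single roll-out and average it over several roll-outs as a sample estimate of the expected return. The function names and the example rewards are illustrative assumptions and are not taken from the framework.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards, sum_k gamma**k * rewards[k], for one roll-out."""
    total = 0.0
    # Horner-style accumulation from the last reward back to the first.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_expected_return(rollouts, gamma):
    """Sample average of the discounted return over a list of roll-outs."""
    returns = [discounted_return(rewards, gamma) for rewards in rollouts]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # Hypothetical rewards r_0, r_1, r_2 of two short roll-outs.
    rollouts = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
    print(estimate_expected_return(rollouts, gamma=0.5))
```

With a discount factor of 0.5, the first roll-out yields 1 + 0.5 + 0.25 = 1.75 and the second 0.25 · 4 = 1.0, so the sample estimate is 1.375.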

For real-world applications, we require that any change to the policy parameterization has to be smooth. Otherwise, drastic changes can be hazardous for the actor, and useful initializations of the policy based on domain knowledge would vanish after a single update step. For these reasons, policy gradient methods which follow the steepest ascent on the expected return are the method of choice. These methods update the policy parameters according to the gradient update rule

$$ \theta_{h+1} = \theta_h + \alpha_h \left. \nabla_\theta J \right|_{\theta = \theta_h} \tag{2} $$

where $\alpha_h$ denotes the learning rate and $h$ the current update number.
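The gradient update rule can be illustrated on a deliberately tiny problem. The sketch below assumes a two-armed bandit with a state-independent softmax policy, for which the expected return and its exact gradient are available in closed form. The bandit, the reward values and all function names are assumptions made for illustration; the benchmarked algorithms estimate the gradient from sampled roll-outs instead.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a vector of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def expected_return_and_gradient(theta, arm_rewards):
    """J(theta) = sum_a pi(a) r(a) and its exact gradient pi * (r - J)."""
    pi = softmax(theta)
    j = float(pi @ arm_rewards)
    gradient = pi * (arm_rewards - j)
    return j, gradient


def gradient_ascent(theta0, arm_rewards, learning_rate, num_updates):
    """Repeat theta <- theta + learning_rate * grad J(theta)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_updates):
        _, gradient = expected_return_and_gradient(theta, arm_rewards)
        theta = theta + learning_rate * gradient
    return theta


if __name__ == "__main__":
    rewards = np.array([1.0, 0.0])  # hypothetical reward of each arm
    theta = gradient_ascent([0.0, 0.0], rewards, learning_rate=0.5, num_updates=200)
    print(softmax(theta))  # probability mass shifts towards the better arm
```

A constant learning rate is used here for simplicity; the update rule in the text allows it to change with the update number.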

$s$: the state (also found in the literature as $x$)
$a$: the action (also found as $u$)
$r$: the reward
$k$: time step
$p(s_{k+1} \mid s_k, a_k)$: probability distribution representing the stochasticity of the environment
$\gamma$: discount factor
$J(\theta)$: return in the roll-outs
Table I: Summary of the terms used within the article.

III-B Benchmarked algorithms

One of the main distinctions between RL algorithms is whether they are value-based or policy-based. The first class, value-based, attempts to learn to correctly assess the reward obtained in a certain state and thus maximize the final expected reward. The second class, policy-based, attempts to learn what action to take in each state in order to maximize the final reward. Robotics is dominated by scenarios with continuous state and action spaces, which implies that most traditional off-the-shelf value-based RL approaches are not valid for treating such situations. As pointed out by Peters et al. [26], Policy Gradient (PG) methods differ significantly from others as they do not suffer from these problems in the same way other techniques do.
One of the typical problems experienced when doing RL in robotics, provided that there is no additional state estimator (actor-critic methods), is that uncertainty in the state might degrade the performance of the policy. PG methods suffer from this as well. However, they rely on optimization techniques for the policy that do not need to be changed when dealing with this uncertainty.
The nature of PG methods allows them to deal with continuous states and actions in exactly the same way as discrete ones. PG techniques can be used either on model-free or model-based approaches. The policy representation can be chosen in order to be meaningful for the task, and can incorporate domain knowledge. This often leads to the use of fewer parameters in the learning process. Additionally, its generic formulation shows that PG methods are valid even when the reward function is discontinuous or even unknown.

While PG techniques might seem interesting for a roboticist at first glance, they are by definition on-policy and need to forget data reasonably fast in order to avoid introducing a bias to the gradient estimator. In other words, they are not as good as other techniques at using the available data (their sample efficiency is low). Another typical problem with PG methods is that convergence is only guaranteed to a local maximum, while in tabular representations value function methods are guaranteed to converge to a global maximum.

III-B1 Trust Region Policy Optimization (TRPO)

Trust Region Policy Optimization is an attempt to improve on VPG by choosing the magnitude of the update appropriately at each iteration [6]. In order to know how much to update, TRPO uses the KL-divergence, which returns a measure of how different two probability distributions are. The formal definition of the problem is:

$$ \theta_{h+1} = \arg\max_{\theta} \; \mathcal{L}(\theta_h, \theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta \,\|\, \theta_h) \le \delta \tag{3} $$

where $\theta_h$ denotes the policy parameters after update $h$, $\bar{D}_{KL}$ is the average KL-divergence between the two policies, $\delta$ is the KL-divergence limit, and $\mathcal{L}(\theta_h, \theta)$ is the surrogate advantage, representing how good a policy $\pi_\theta$ is with respect to an old policy $\pi_{\theta_h}$:

$$ \mathcal{L}(\theta_h, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_h}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_h}(a \mid s)} A^{\pi_{\theta_h}}(s, a) \right] \tag{4} $$

where $A^{\pi_{\theta_h}}$ is the advantage function. In the baselines implementation this advantage function is calculated using a value estimation (Actor-Critic structure) [15].
Evaluating the KL-divergence exactly at each step is expensive, but its value can be approximated using a second-order Taylor expansion. The optimization problem in Eq. 3 is commonly solved with the Lagrangian method. The resulting expression can be Taylor expanded, and the result is known as the natural gradient [27]. In order to solve the natural gradient analytically, the Fisher information matrix (FIM), which is the Hessian of the KL-divergence, needs to be calculated, and it is not trivial to compute and store. For that reason TRPO uses a trick: it optimizes a sub-problem and finally performs a backtracking line search.
This approach is, more than a structure, a way of optimizing the search for the parameters. It can therefore be used, for example, with an Actor-Critic structure that calculates the value function and uses it in the advantage calculation.
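The two quantities in the TRPO problem can be sketched for discrete action distributions as follows. This is a minimal illustration under assumptions: the policies are given directly as probability vectors, the surrogate advantage is estimated as a sample mean over already collected (state, action) pairs, and all function names and the default KL limit are hypothetical.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same actions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


def surrogate_advantage(new_probs, old_probs, advantages):
    """Sample mean of (pi_new(a|s) / pi_old(a|s)) * advantage."""
    ratio = np.asarray(new_probs, dtype=float) / np.asarray(old_probs, dtype=float)
    return float(np.mean(ratio * np.asarray(advantages, dtype=float)))


def within_trust_region(new_policy, old_policy, kl_limit=0.01):
    """Check the KL constraint for a single state's action distribution."""
    return kl_divergence(new_policy, old_policy) <= kl_limit


if __name__ == "__main__":
    old_policy = [0.5, 0.5]
    new_policy = [0.55, 0.45]
    print(kl_divergence(new_policy, old_policy))
    print(within_trust_region(new_policy, old_policy))
```

When the new and old policies coincide, every ratio equals 1 and the surrogate advantage reduces to the mean advantage, while the KL-divergence is zero.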

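Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
Hyperparameters: KL-divergence limit $\delta$, backtracking coefficient $\beta$, maximum number of backtracking steps $K$
for $h = 0, 1, 2, \dots$ do
   Collect set of trajectories $\mathcal{D}_h = \{\tau_i\}$ by running policy $\pi_{\theta_h}$ in the environment.
   Compute rewards-to-go $\hat{R}_k$.
   Compute advantage estimates $\hat{A}_k$ (using any method of advantage estimation) based on the current value function $V_{\phi_h}$.
   Estimate policy gradient as:
      $$ \hat{g}_h = \frac{1}{|\mathcal{D}_h|} \sum_{\tau \in \mathcal{D}_h} \sum_{k=0}^{H} \left. \nabla_\theta \log \pi_\theta(a_k \mid s_k) \right|_{\theta_h} \hat{A}_k $$
   Use the conjugate gradient algorithm to compute:
      $$ \hat{x}_h \approx \hat{F}_h^{-1} \hat{g}_h $$
   where $\hat{F}_h$ is the Hessian of the sample average KL-divergence.
   Update the policy by backtracking line search with:
      $$ \theta_{h+1} = \theta_h + \beta^{j} \sqrt{\frac{2\delta}{\hat{x}_h^{T} \hat{F}_h \hat{x}_h}} \, \hat{x}_h $$
   where $j \in \{0, 1, 2, \dots, K\}$ is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint.
   Fit value function by regression on mean-squared error:
      $$ \phi_{h+1} = \arg\min_{\phi} \frac{1}{|\mathcal{D}_h| (H + 1)} \sum_{\tau \in \mathcal{D}_h} \sum_{k=0}^{H} \left( V_\phi(s_k) - \hat{R}_k \right)^2 $$
   typically via some gradient descent algorithm.
end for
Algorithm 1 Trust Region Policy Optimization (TRPO)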
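The conjugate gradient step and the backtracking line search of Algorithm 1 can be sketched in NumPy as below. This is an illustrative sketch, not the baselines implementation: the Hessian is accessed only through a Hessian-vector product callable (here called `hvp`), the acceptance test of the line search is passed in as a hypothetical callable `is_acceptable`, and the default coefficients are assumptions.

```python
import numpy as np


def conjugate_gradient(hvp, g, iterations=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v)."""
    g = np.asarray(g, dtype=float)
    x = np.zeros_like(g)
    residual = g.copy()  # equals g - H x because x starts at zero
    direction = residual.copy()
    rs_old = residual @ residual
    if rs_old == 0.0:
        return x
    for _ in range(iterations):
        h_dir = hvp(direction)
        step = rs_old / (direction @ h_dir)
        x = x + step * direction
        residual = residual - step * h_dir
        rs_new = residual @ residual
        if rs_new < tol:
            break
        direction = residual + (rs_new / rs_old) * direction
        rs_old = rs_new
    return x


def full_step(x, hvp, kl_limit):
    """Scale x so that the quadratic KL estimate 0.5 * s^T H s equals kl_limit."""
    return np.sqrt(2.0 * kl_limit / (x @ hvp(x))) * x


def backtracking_line_search(theta, step, is_acceptable, coeff=0.8, max_backtracks=10):
    """Try theta + coeff**j * step for j = 0..max_backtracks, keep the first accepted."""
    for j in range(max_backtracks + 1):
        candidate = theta + (coeff ** j) * step
        if is_acceptable(candidate):
            return candidate
    return theta  # no candidate accepted: keep the old parameters


if __name__ == "__main__":
    hessian = np.diag([2.0, 4.0])  # toy stand-in for the KL Hessian
    hvp = lambda v: hessian @ v
    x = conjugate_gradient(hvp, np.array([2.0, 4.0]))
    print(x, full_step(x, hvp, kl_limit=0.01))
```

In a full implementation `is_acceptable` would check both the improvement of the sample loss and the sample KL-divergence constraint, as stated in the algorithm.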

III-B2 Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is an alternative to Trust Region Policy Optimization (TRPO) [6] that attains the data efficiency and reliable performance of TRPO while using only first-order optimization. In standard PG methods, the gradient update is performed once per data sample. PPO, on the other hand, enables multiple epochs of mini-batch updates. There are a few variants of PPO in the literature, which optimize the clipped "surrogate" objective or use an adaptive KL penalty coefficient [7]. The Clipped Surrogate Objective is given as:

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_k \left[ \min\left( \rho_k(\theta) \hat{A}_k, \; \mathrm{clip}\left(\rho_k(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_k \right) \right] \tag{5} $$

where $\rho_k(\theta) = \frac{\pi_\theta(a_k \mid s_k)}{\pi_{\theta_{old}}(a_k \mid s_k)}$ is the probability ratio of the current policy $\pi_\theta$ and the previous policy $\pi_{\theta_{old}}$, $\hat{A}_k$ is an estimator of the advantage function at timestep $k$ and $\epsilon$ is a hyperparameter, for example $\epsilon = 0.2$. The first term inside the $\min$, $\rho_k(\theta) \hat{A}_k$, is the "surrogate" objective that is also used in TRPO (Eq. 4). The second term, $\mathrm{clip}(\rho_k(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_k$, clips the probability ratio $\rho_k(\theta)$ to the interval $[1 - \epsilon, 1 + \epsilon]$. The $\min$ takes the minimum of the unclipped and clipped values, which excludes the change in the probability ratio when it would improve the objective and includes it when it makes the objective worse. The clipping prevents PPO from making a large policy update. The Adaptive KL Penalty Coefficient is an alternative to the clipped "surrogate" objective, or an addition to it, where the goal is to penalize the KL divergence and update the penalty coefficient to achieve some target KL divergence ($d_{targ}$) at each policy update. As described in [7], the KL penalty coefficient performed worse than the clipped surrogate objective; therefore the presented pseudocode and experimental evaluation of PPO use the clipped surrogate objective.
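The effect of the clipping is easiest to see on single samples. The following NumPy sketch evaluates the clipped surrogate objective for given probability ratios and advantage estimates; the function name and the sample values are illustrative assumptions.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Mean over samples of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))


if __name__ == "__main__":
    # Ratio above 1 + epsilon with positive advantage: the gain is capped.
    print(clipped_surrogate([1.5], [1.0]))
    # Ratio above 1 + epsilon with negative advantage: the full penalty is kept.
    print(clipped_surrogate([1.5], [-1.0]))
```

With a clipping value of 0.2, the first call evaluates to 1.2 (the clipped term is smaller) and the second to -1.5 (the unclipped term is smaller), which matches the description of the minimum above.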

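Initialize the number of time steps
Initialize the clipping value $\epsilon$
for each iteration do
   for each time step $k$ do
      Run MLP policy $\pi_\theta$ and generate action $a_k$
      Execute action $a_k$ in emulator and observe reward $r_k$
      Update observation based on current joint positions and end-effector position
      Estimate advantage function $\hat{A}_k$
      for each optimization epoch do
         Compute SGD of loss function: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_k \left[ \min\left( \rho_k(\theta) \hat{A}_k, \; \mathrm{clip}\left(\rho_k(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_k \right) \right]$
      end for
   end for
end for
Algorithm 2 Proximal Policy Optimization (PPO)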

III-B3 Actor Critic using Kronecker-Factored Trust Region (ACKTR)

The idea of Actor Critic using Kronecker-Factored Trust Region (ACKTR) is to replace the Stochastic Gradient Descent (SGD), which explores the weight space inefficiently, and to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region [8]. ACKTR replaces SGD of A2C, the synchronous version of A3C [28], and instead computes the natural gradient update. The natural gradient update is applied both to the actor and the critic.

ACKTR uses K-FAC to compute the natural gradient update efficiently. In order to define a Fisher metric for RL policies, ACKTR uses a policy function that defines a distribution over actions given the current state, and takes the expectation over the trajectory distribution. The mathematical formulation for the Fisher metric is given by:

$$ F = \mathbb{E}_{p(\tau)} \left[ \nabla_\theta \log \pi_\theta(a_k \mid s_k) \left( \nabla_\theta \log \pi_\theta(a_k \mid s_k) \right)^{T} \right] \tag{6} $$

where $p(\tau)$ is the distribution of trajectories. In practice, the intractable expectation above is approximated with trajectories collected during training. Training the critic can be thought of as a least-squares function approximation problem. In this case, the most common second-order algorithm is Gauss-Newton, which approximates the curvature as the Gauss-Newton matrix $G := \mathbb{E}[\mathbf{J}^{T} \mathbf{J}]$, where $\mathbf{J}$ is the Jacobian of the mapping from parameters to outputs [29]. The Gauss-Newton matrix is equivalent to the Fisher matrix for a Gaussian observation model, which allows K-FAC to be applied to the critic as well. In more detail, the output of the critic $v$ is defined to be a Gaussian distribution with $p(v \mid s_k) \sim \mathcal{N}(v; V(s_k), \sigma^2)$. Setting $\sigma$ to 1 is equivalent to the vanilla Gauss-Newton method.
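The approximation of the Fisher metric from collected samples can be sketched for a very small policy. The example below assumes a state-independent softmax policy over discrete actions, whose score function has a simple closed form, and forms the full Fisher matrix explicitly; ACKTR avoids exactly this explicit matrix through K-FAC. The function names and the damping term are assumptions added for illustration and numerical stability.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a vector of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def score(theta, action):
    """Gradient of log pi_theta(action) for a softmax policy: one_hot - pi."""
    one_hot = np.zeros_like(theta)
    one_hot[action] = 1.0
    return one_hot - softmax(theta)


def empirical_fisher(theta, actions):
    """Average outer product of score vectors over the sampled actions."""
    scores = np.stack([score(theta, a) for a in actions])
    return scores.T @ scores / len(actions)


def natural_gradient(fisher, gradient, damping=1e-3):
    """Solve (F + damping * I) x = gradient instead of inverting F directly."""
    damped = fisher + damping * np.eye(fisher.shape[0])
    return np.linalg.solve(damped, gradient)


if __name__ == "__main__":
    theta = np.zeros(3)
    rng = np.random.default_rng(0)
    actions = rng.choice(3, size=1000, p=softmax(theta))
    fisher = empirical_fisher(theta, actions)
    print(natural_gradient(fisher, np.array([1.0, 0.0, -1.0])))
```

The damping is needed here because the Fisher matrix of a softmax policy is singular: adding the same constant to every logit leaves the policy unchanged.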

When the actor and the critic are disjoint, it is possible to apply K-FAC updates to each of them using the same metric as defined in Equation 6. To prevent instability during training, it is important to use an architecture where the two networks share lower-layer representations but have distinct output layers [30, 28]. The joint distribution of the policy and the value distribution can be defined by assuming independence of the two output distributions, for instance $p(a, v \mid s) = \pi(a \mid s) \, p(v \mid s)$, and constructing the Fisher metric with respect to $p(a, v \mid s)$. This is similar to the standard K-FAC, except that the two networks' outputs need to be sampled independently. In this case, K-FAC is applied to approximate the Fisher matrix:

$$ F = \mathbb{E}_{p(\tau)} \left[ \nabla \log p(a, v \mid s) \left( \nabla \log p(a, v \mid s) \right)^{T} \right] \tag{7} $$

The pseudocode presented gives an overview of the ACKTR implementation used in our evaluation.


Assume shared parameter vector $\theta$ for the actor and $\theta_v$ for the critic.
Assume global shared counter
Initialize step counter
repeat
   Reset action and state
   Get state $s_k$
   repeat
      Perform action $a_k$ according to policy $\pi_\theta(a_k \mid s_k)$
      Receive reward $r_k$ and new state $s_{k+1}$
   until terminal state or the step limit is reached
   for each collected time step do
      Calculate natural gradient for the actor:
      $F = \mathbb{E}_{p(\tau)} \left[ \nabla_\theta \log \pi_\theta(a_k \mid s_k) \left( \nabla_\theta \log \pi_\theta(a_k \mid s_k) \right)^{T} \right]$,
      with $p(\tau)$ as distribution of trajectories collected during training
      if the actor and the critic are joint then
         Output of the critic is defined to be a Gaussian distribution: $p(v \mid s_k) \sim \mathcal{N}(v; V(s_k), \sigma^2)$
         Apply Fisher matrix for the critic
      end if
      if the actor and the critic are disjoint then
         Apply K-FAC to approximate Fisher matrix for the critic
      end if
   end for
until the global shared counter reaches its limit
Algorithm 3 Actor Critic using Kronecker-Factored Trust Region (ACKTR)
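The Kronecker-factored approximation behind the natural gradient step can be sketched for a single fully connected layer. The sketch assumes the standard K-FAC factorization, in which the layer's Fisher block is approximated by the Kronecker product of the second-moment matrix of the layer inputs and that of the gradients with respect to the layer outputs; applying the inverse then only requires solving with the two small factors. The function names, the damping added to each factor and the random data are assumptions for illustration and do not reproduce the evaluated implementation.

```python
import numpy as np


def kfac_factors(activations, output_grads):
    """Second-moment factors A = E[a a^T] (inputs) and S = E[s s^T] (output grads)."""
    n = activations.shape[0]
    a_factor = activations.T @ activations / n
    s_factor = output_grads.T @ output_grads / n
    return a_factor, s_factor


def kfac_natural_gradient(weight_grad, a_factor, s_factor, damping=1e-3):
    """Compute S^-1 G A^-1 for a weight gradient G of shape (outputs, inputs)."""
    a_damped = a_factor + damping * np.eye(a_factor.shape[0])
    s_damped = s_factor + damping * np.eye(s_factor.shape[0])
    left = np.linalg.solve(s_damped, weight_grad)  # S^-1 G
    # A is symmetric, so solving with A on the transposed matrix gives (S^-1 G) A^-1.
    return np.linalg.solve(a_damped, left.T).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activations = rng.normal(size=(50, 3))   # hypothetical layer inputs
    output_grads = rng.normal(size=(50, 2))  # hypothetical gradients w.r.t. layer outputs
    grad = output_grads.T @ activations / 50
    a_factor, s_factor = kfac_factors(activations, output_grads)
    print(kfac_natural_gradient(grad, a_factor, s_factor))
```

The result equals what one would obtain by building the full Kronecker product of the two damped factors and solving against the column-stacked gradient, but without ever forming that larger matrix.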

IV Experiments

As previously presented by Zamora et al. [12], for the benchmark experiments we use an extension of the OpenAI gym which is tailored for robotics. We added four environments to evaluate the algorithms, which match the modular 6DoF MARA robot. The environments differ in how they reward the actions taken, and are described in detail in gym-gazebo2 [31]. For the training, we used the Gazebo simulator and the corresponding ROS 2 packages to convert the actions generated by each algorithm into appropriate trajectories that the robot can execute.

We set the initial position of the robot to zero for all joints and reset the robot to this initial position when the number of steps exceeds the maximum number of timesteps for an episode. We code this in an environment-specific variable denoted max_episode_steps, which in our case is set to 2048. For these specific experiments, we placed the target at fixed coordinates with respect to the origin of the environment, which in our case is the base of the 6DoF MARA robot, and at a fixed orientation, given as a quaternion, with respect to the table orientation. Each algorithm generates actions that are translated into the corresponding ROS 2 messages and executed in simulation. The simulation then returns the observations (current joint positions and end-effector pose) and gives them to the algorithm. Figure 1 illustrates the experimental environment. For each environment we perform one experiment consisting of training for 1 million steps in the environment.
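The interaction loop described above (act, observe, reset after max_episode_steps) can be sketched with a toy stand-in environment. `ToyReachEnv` below is a hypothetical class that only mimics a Gym-style reset/step interface; it is not the gym-gazebo2 API, involves no ROS 2 or Gazebo calls, and its target and reward are made up. Only the values 2048 and 1 million come from the text.

```python
MAX_EPISODE_STEPS = 2048          # episode length limit stated in the text
TOTAL_TRAINING_STEPS = 1_000_000  # training budget per environment stated in the text


class ToyReachEnv:
    """Hypothetical joint-space reaching task with a Gym-style interface."""

    def __init__(self, num_joints=6, max_episode_steps=MAX_EPISODE_STEPS):
        self.num_joints = num_joints
        self.max_episode_steps = max_episode_steps
        self.target = [0.1] * num_joints  # made-up joint-space target
        self.joints = [0.0] * num_joints
        self.steps = 0

    def reset(self):
        """Return all joints to zero, as done for the initial position."""
        self.joints = [0.0] * self.num_joints
        self.steps = 0
        return list(self.joints)

    def step(self, action):
        """Apply joint increments and return (observation, reward, done, info)."""
        self.joints = [q + a for q, a in zip(self.joints, action)]
        self.steps += 1
        reward = -sum(abs(q - t) for q, t in zip(self.joints, self.target))
        done = self.steps >= self.max_episode_steps
        return list(self.joints), reward, done, {}


def run(env, policy, total_steps):
    """Interact for total_steps steps and count how many episodes were completed."""
    observation = env.reset()
    completed_episodes = 0
    for _ in range(total_steps):
        observation, _reward, done, _info = env.step(policy(observation))
        if done:
            completed_episodes += 1
            observation = env.reset()
    return completed_episodes


if __name__ == "__main__":
    env = ToyReachEnv()
    zero_policy = lambda observation: [0.0] * env.num_joints
    print(run(env, zero_policy, TOTAL_TRAINING_STEPS))
```

With the values from the text, 1 million steps correspond to 488 full episodes of 2048 steps, with 576 steps left over in a final, unfinished episode.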

Figure 1: gym-gazebo2 MARA robot environments displayed on Gazebo gzclient simulator. All environments are included since their differences are in how the learning is rewarded and not in the model
Figure 2: Performance comparisons of the tested algorithms for the environment. The shaded region denotes the deviation with respect to the previous 100 steps. PPO and TRPO seem to achieve a similar level of performance by the end of the experiments, even though TRPO seems to learn faster at earlier stages. The reward obtained by ACKTR remains unchanged compared with the other algorithms.
Figure 3: Performance comparisons of the tested algorithms for the environment. The shaded region denotes the deviation of the rewards with respect to the previous 100 steps. PPO is able to get a slightly better result than TRPO in this environment, while ACKTR stays flat compared to the other two.
Figure 4: Performance comparisons of the tested algorithms for the environment trained. The shaded region denotes the deviation of the rewards with respect to the previous 100 steps. PPO and TRPO seem to have similar performance, while the reward obtained by ACKTR remains flat in comparison.
Figure 5: Performance comparisons of the tested algorithms for the environment. The shaded region denotes the deviation with respect to the previous 100 steps. TRPO shows better performance towards the end of the experiment compared to PPO, while this time, ACKTR shows some learning towards the end of the experiment, though not comparable with any of the other algorithms.

Figure 2, Figure 3, Figure 4 and Figure 5 show the reward obtained in the learning process for the different algorithms. In general, TRPO and PPO show the ability to learn at a similar pace, particularly in the non-orient environments, which is not surprising given that both have a similar formulation. The discrepancies in the orient environments might be due to those being more dependent on the random initialization. ACKTR does not seem to be an efficient learner for this task (see more details in Section V). This may be due to the hyperparameters used, which were kept the same across the three algorithms in order to compare them.
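The figure captions describe a shaded region computed with respect to the previous 100 steps. One plausible way to obtain such a curve and band is a trailing-window mean and standard deviation, sketched below. The exact smoothing used for the figures is not specified in the text, so the trailing window, the use of the population standard deviation and the function name are assumptions.

```python
import numpy as np


def rolling_mean_std(values, window=100):
    """Mean and standard deviation over the trailing `window` samples at each index."""
    values = np.asarray(values, dtype=float)
    means = np.empty(len(values))
    stds = np.empty(len(values))
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]  # shorter at the start
        means[i] = recent.mean()
        stds[i] = recent.std()
    return means, stds


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0, 4.0]  # hypothetical per-step rewards
    means, stds = rolling_mean_std(rewards, window=2)
    print(means, stds)
```

Plotting the mean as a line and the band from mean minus deviation to mean plus deviation would give a curve of the kind described in the captions.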

V Conclusion and Future work

We have presented an evaluation of different DRL techniques for modular robotics. Our setup and framework consist of tools, such as ROS 2 and Gazebo, that allow a more realistic representation of the environment. Our results show that our proposed framework is stable during the training of neural networks through RL with policy-based methods.

There still remain many challenges within the DRL field for robotics. The main problems are the long training times, the simulation-to-real robot transfer, reward shaping, sample efficiency and extending the behaviour to diverse tasks and robot configurations.

So far, our work with the modular robot MARA has focused on simple tasks such as reaching a point in space. In order to have an end-to-end training framework (from pixels to motor torques) and to perform more complex tasks, we aim to integrate additional rich sensory input such as vision. Inspired by the work of [32, 33], we intend to explore imitation learning, which provides high-quality human training data through demonstrations and might help the robot learn to perform more complex tasks.

We envision the future of robotics to be modular robots where the trained network can generalize online to modifications in the robot, such as the change of a component, or to dynamic obstacle avoidance. In order to accomplish this, we aim to explore methods that allow novel training approaches for every new environment or type of robot, or for when the original task for which the network was trained is changed. Inspired by [34], we aim to evaluate meta-learning and hierarchical RL methods that generalize to new tasks and environments by learning sub-policies, for instance motor primitives that can be reused across different sets of tasks, even generalizing to unseen new tasks.

Figure 6, Figure 7, Figure 8 and Figure 9 show the reward obtained in the learning process with ACKTR algorithm in each MARA robot environment. The shaded region denotes the deviation with respect to the previous 100 steps.

Figure 6: Performance of environment trained with ACKTR algorithm
Figure 7: Performance of environment trained with ACKTR algorithm
Figure 8: Performance of environment trained with ACKTR algorithm
Figure 9: Performance of environment trained with ACKTR algorithm