How Much Do Unstated Problem Constraints Limit Deep Robotic Reinforcement Learning?

09/20/2019 ∙ by W. Cannon Lewis II, et al. ∙ Rice University

Deep Reinforcement Learning is a promising paradigm for robotic control which has been shown to be capable of learning policies for high-dimensional, continuous control of unmodeled systems. However, Robotic Reinforcement Learning currently lacks clearly defined benchmark tasks, which makes it difficult for researchers to reproduce and compare against prior work. “Reacher” tasks, which are fundamental to robotic manipulation, are commonly used as benchmarks, but the lack of a formal specification elides details that are crucial to replication. In this paper we present a novel empirical analysis which shows that the unstated spatial constraints in commonly used implementations of Reacher tasks make it dramatically easier to learn a successful control policy with Deep Deterministic Policy Gradients (DDPG), a state-of-the-art Deep RL algorithm. Our analysis suggests that less constrained Reacher tasks are significantly more difficult to learn, and hence that existing de facto benchmarks are not representative of the difficulty of general robotic manipulation.




I Introduction

Robotic autonomy is challenging for several reasons: robots have many degrees of freedom, require an agent capable of continuous observation and action, and can exhibit action and sensing uncertainty. For many problems such as manipulation under uncertainty, navigation in unknown environments, or interaction with human beings, it is difficult or impossible to model the environment well a priori. It is for this reason that machine learning methods, which adapt to new environments and tasks, are a promising frontier in robotic autonomy. Reinforcement Learning (RL) [1] is a particularly promising paradigm, as defining an RL problem requires only the specification of a reward function encoding success.

In recent years, Deep RL (which extends RL with deep neural networks) has had demonstrable success on manipulation tasks [2, 3, 4, 5]. In reaction to these successes there has been a push toward standardization of benchmarks and testing conditions used to evaluate Deep RL methods [6, 7]. Simulation suites such as OpenAI Gym [8] and MuJoCo [9] have enabled this standardization, but no existing work has shown that the de facto benchmark tasks are truly representative of the challenges presented by autonomous motion of commercially available robot manipulators. More work is needed to clarify precisely when Deep RL methods work well for general robotic autonomy.

Fig. 1: An example of a typical goal constraint region in the ‘FetchReach-v0’ environment, a commonly-used example of the “Reacher” tasks that we analyze in this paper. The orange box here is approximately the constraint region from which goals are sampled. Image from [10] with added constraint visualization.

More recently, the authors of [7] have shown that several axes of variation serve as confounding variables in reported Deep RL results. They demonstrate that hyperparameter setting, network architecture, reward scaling, random seeds, environment specification, and codebase choice can have significant impact on empirical learning behavior. Our goal in this paper is to extend the binary observation that environment specification can affect learning to a qualitative one; we wish to understand to what degree small changes in problem description impact the difficulty of a particular family of robotic tasks. We show this by introducing subtle variations of a popular robotic RL environment, the “Reacher” family of tasks. Indeed, the authors of [7] perform some of their analysis on the “Reacher” task variant introduced by [10], and so our analysis is highly related to this prior work. “Reacher” tasks are a prime example of the fundamental manipulation and locomotion tasks found in prior work [11, 12, 10]. These tasks are necessary building blocks for more complex control tasks such as search-and-rescue or construction, but there has been no extensive analysis of how their specification can affect learning.

The main contribution of this paper is an analysis showing that the “Reacher” tasks used in prior works are not representative of the full difficulty presented by general robotic manipulation (or even basic pick-and-place tasks). More precisely, we show that existing benchmarks (e.g., those proposed in [10]) constrain goal sampling, which has a significant impact on learning. We perform this analysis by constructing a series of “Reacher” tasks which interpolate between tasks similar to prior work and a more general unconstrained task. “Reacher” tasks challenge an agent to use low-level (joint position, velocity, or torque) control to move the end effector of a manipulator to a point in the robot’s workspace. Note that if position control is used, this family of tasks amounts to learning the inverse kinematics of the manipulator; if velocity or torque control is used then the task is closely related to learning inverse dynamics [13]. Thus, “Reacher” tasks expose some fundamental challenges in robotics. However, prior work uses goal constraint regions to restrict the effective workspace of the manipulator being controlled to such a degree that the underlying learning problem is changed and, we argue, simplified. Our empirical analysis using the state-of-the-art DDPG [14] algorithm on a simulated UR5 robot supports this point and shows that this task comparison is apt, as we find that DDPG fails to generalize in unexpected ways as the effective workspace is expanded. Following the methods used by [7], we fix our algorithm, code, and hyperparameter settings across all experiments, and focus our analysis on the task definition.

This analysis is supported and systematized by a software framework (ROSGym) that we developed to connect standard implementations of RL algorithms to commonly used robot control software. More precisely, we wrote a Python interface that integrates the Robot Operating System (ROS) [15] and OpenAI Gym, which we then used to generate the results in this paper. The flexibility of ROS allowed us to easily compare variations of the “Reacher” task specification in order to better understand the influence of factors such as the number of joints and goal constraint region on learning.

II On Deep RL

In this work, we focus our analysis on the specification of “Reacher” benchmark tasks for Robotic RL. In order to eliminate the other sources of variability identified by [7], we use a fixed learning algorithm (DDPG) and fixed hyperparameters to perform our experiments. In this section, we will provide the minimal necessary background in Deep RL to contextualize our choice of DDPG for this analysis.

II-A RL Background

We consider a standard reinforcement learning problem definition [1], in which we model a task of interest as a learning agent interacting with a Markov Decision Process (MDP). An MDP $M = (S, A, P, R, \rho_0)$ is a tuple describing an environment which an agent interacts with in discrete time steps $t = 0, 1, \ldots, T$. More precisely, at each time step $t$ our agent occupies a state $s_t \in S$, initially sampled from $\rho_0$. At each time step, the agent takes an action $a_t \in A$, and experiences a probabilistic state transition according to the probability distribution over $S$ defined by the transition function $P(\cdot \mid s_t, a_t)$. As a result of this action, the agent also receives a reward $r_t = R(s_t, a_t)$. The tuple $(s_t, a_t, r_t, s_{t+1})$ is typically called a transition, and the full sequence of these transitions over $t = 0, \ldots, T$ is called a trajectory or rollout. In this paper we restrict our attention to continuous control; specifically, the case where $S \subseteq \mathbb{R}^n$ and $A \subseteq \mathbb{R}^m$ for $n, m \in \mathbb{N}$.

In reinforcement learning we are concerned with learning a policy $\pi : S \to A$ which maximizes the total reward $\sum_{t=0}^{T} r_t$. This policy may also be probabilistic, in which case we learn $\pi : S \to \mathcal{P}(A)$, where $\mathcal{P}(A)$ is the set of probability distributions over $A$, and seek to maximize $\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} r_t\right]$. In practice, RL methods often modify this definition of the total reward to include a discount for future states. This discounted total reward is called the value function, and is defined as $V^{\pi}(s_t) = \mathbb{E}\left[\sum_{i=t}^{T} \gamma^{i-t} r_i\right]$, where $\gamma \in [0, 1]$ is the discount factor and $a_i = \pi(s_i)$ at all time steps. This definition of the value function naturally gives rise to the action-value function $Q^{\pi}(s_t, a_t)$, which intuitively assigns a value to being in a state $s_t$ and taking action $a_t$, while prioritizing earlier sources of reward.

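From previous results [1], it is known that the action-value function satisfies the Bellman equation:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \, Q^{\pi}\big(s_{t+1}, \pi(s_{t+1})\big) \right]$$

The recursive nature of this equation allows RL methods to iteratively estimate the action-value function from experience. If $Q^{\pi}$ is represented by a function approximator with parameters $\theta$, we can derive a differentiable loss:

$$L(\theta) = \mathbb{E}\left[ \big( Q(s_t, a_t \mid \theta) - y_t \big)^2 \right], \qquad y_t = r_t + \gamma \, Q\big(s_{t+1}, \pi(s_{t+1}) \mid \theta\big) \tag{1}$$

Gradients from this loss can then be used to adjust the parameters $\theta$ and improve the approximation of $Q^{\pi}$ [16].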
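As a rough illustration of the temporal-difference loss described above, the following sketch computes the mean squared Bellman error over a small batch of transitions. The linear critic `q_value`, the stand-in policy `policy`, and the sample batch are hypothetical and chosen only to keep the example self-contained; they are not the networks used in this paper, and the sketch omits the target networks that DDPG adds on top of this loss.

```python
GAMMA = 0.98  # discount factor reported in Table I


def q_value(theta, state, action):
    """Hypothetical linear critic: Q(s, a | theta) = theta . [s, a]."""
    features = list(state) + list(action)
    return sum(w * x for w, x in zip(theta, features))


def policy(state):
    """Hypothetical deterministic policy used only for this illustration."""
    return [-0.5 * x for x in state]


def td_loss(theta, batch, gamma=GAMMA):
    """Mean squared error between Q(s, a) and r + gamma * Q(s', pi(s'))."""
    total = 0.0
    for state, action, reward, next_state in batch:
        target = reward + gamma * q_value(theta, next_state, policy(next_state))
        total += (q_value(theta, state, action) - target) ** 2
    return total / len(batch)


# Two made-up transitions (state, action, reward, next_state).
batch = [
    ([1.0], [0.0], -1.0, [0.5]),
    ([0.5], [0.0], -0.5, [0.0]),
]
loss_at_zero = td_loss([0.0, 0.0], batch)
loss_at_guess = td_loss([-1.0, 0.0], batch)
print(loss_at_zero, loss_at_guess)
```

In this toy setting the second parameter vector gives a smaller loss than the all-zero one, which is the signal a gradient step on the parameters would follow.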

In problems with finite action spaces, performing the above optimization and using the greedy policy which selects the action with the highest action-value at every time step is known as Q-learning. However, for problems with continuous action spaces it is impractical to find the optimal action at each time step. In [14] the authors show that this problem can be made tractable using an actor-critic method in which both the deterministic policy and action-value function are approximated by deep neural networks, which they term Deep Deterministic Policy Gradients (DDPG). By decomposing (1) and isolating the policy component of the action-value function gradient, DDPG allows an agent to optimize a policy over a continuous action space. Optimizing a policy in this way causes learning instability, and so DDPG utilizes an experience replay buffer to decorrelate experienced transitions and improve stability. Transitions are added to this experience replay buffer when the agent receives them from the environment, and then the agent samples training examples from the buffer during training. We refer the interested reader to [14] for further algorithmic details.
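The experience replay mechanism described above can be sketched as a fixed-capacity buffer with uniform sampling. This is a minimal illustration under our own assumptions, not the implementation used for the experiments; the class name `ReplayBuffer`, its methods, and the capacity chosen here are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently discards the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sampling without replacement from the whole buffer breaks up the
        # temporal correlation between consecutive transitions.
        k = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), k)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=1000, seed=0)
for t in range(1500):
    buffer.add([float(t)], [0.0], -1.0, [float(t + 1)])
minibatch = buffer.sample(64)  # batch size reported in Table I
```

Because the buffer is bounded, only the most recent 1000 of the 1500 transitions added here remain available for sampling.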

II-B Why DDPG?

In this paper, we focus on Deep RL as applied to robotic control, particularly in manipulation settings. Among recent Deep RL methods, DDPG demonstrates a number of desirable features. First, DDPG learns continuous control policies which eliminate the need for action discretization, the previously dominant methodology enabling robotic RL [11]. Second, DDPG learns a deterministic control policy, which is advantageous for robotic applications because the learned policy can be reproducibly tested and verified once learning has converged. Though other methods such as TRPO [17] and Soft Q-Learning [18] are promising for problems with continuous action spaces, these methods learn stochastic policies which are more difficult to verify. Third, DDPG is model-free, which means that it can be applied to novel tasks and robots without extensive feature engineering or incorporation of expert knowledge. Finally, because DDPG is an off-policy method, it can be modified to make use of additional sources of experience (such as human demonstration) which have been shown to improve learning [4].

To the best of our knowledge, few results exist which successfully apply Deep RL on simulated or real commercially available robots. Some of the most impressive results in this field use demonstration for initialization or make use of a dynamics model to simplify the learning task [4, 19, 20]. Model-free methods are often demonstrated on a variety of simulated tasks from MuJoCo or OpenAI Gym, and occasionally in tabletop manipulation tasks on a commercially available robot such as a Fetch, UR5, or Baxter [2, 21, 5]. However, examining the robotic benchmarks proposed in [10] or implemented in MuJoCo reveals a core similarity with manipulation benchmarks commonly demonstrated on commercial robots: task goals (such as goal end effector position) are sampled from a goal constraint region above a “table” surface. Figure 1 shows a visualization of a typical goal constraint region. Recently, [22] and [7] have argued that seemingly innocuous unstated assumptions such as algorithm implementation, parameterization, initialization, and reward scale can have inordinate effects on the success of learning. Here, we argue that the goal constraint region is another significant assumption which affects robotic reinforcement learning. This goal constraint region is an often unstated part of the Reacher specification, and our recognition of this phenomenon comes from examining the publicly available implementations of environments in [10, 9].

III Methods

III-A Algorithmic Details

In this work, we conduct experiments using DDPG [14] with Hindsight Experience Replay (HER) [5]. We implemented our own version of DDPG for this purpose, which we intend to open-source along with the ROSGym interface described below. The authors of [5] demonstrate that, in reinforcement learning tasks similar to the “Reacher” tasks considered in our current work, augmenting the learning agent’s experience with counterfactual experience can speed up learning convergence and result in a higher success rate for the convergent policy. We employ HER as follows: for each episode of training, we append a modified trajectory to DDPG’s experience replay buffer in which the rewards are recalculated as if the final end effector position reached by the agent during the episode were the goal position.
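The hindsight relabeling step described above can be sketched roughly as follows. This is an illustration under our own assumptions rather than the code used for the experiments: the per-step dictionary layout, the function names, and the stand-in reward (negative Euclidean distance only) are all hypothetical.

```python
import math


def dense_reward(achieved, goal):
    """Hypothetical stand-in reward: negative Euclidean distance to the goal."""
    return -math.dist(achieved, goal)


def relabel_with_final_goal(episode, reward_fn=dense_reward):
    """Return a copy of the episode whose goal is the final achieved position.

    Each step is a dict with keys 'obs', 'action', 'achieved' (the end
    effector position after the action) and 'goal'.
    """
    new_goal = episode[-1]["achieved"]
    relabeled = []
    for step in episode:
        relabeled.append({
            "obs": step["obs"],
            "action": step["action"],
            "achieved": step["achieved"],
            "goal": new_goal,
            # Rewards are recomputed as if new_goal had been the goal all along.
            "reward": reward_fn(step["achieved"], new_goal),
        })
    return relabeled


# A made-up two-step episode that never reaches its original goal.
episode = [
    {"obs": [0.0], "action": [0.1], "achieved": (0.0, 0.0, 0.3), "goal": (1.0, 0.0, 0.3)},
    {"obs": [0.1], "action": [0.1], "achieved": (0.0, 0.0, 0.7), "goal": (1.0, 0.0, 0.3)},
]
hindsight = relabel_with_final_goal(episode)
```

Both the original and the relabeled trajectory would then be appended to the replay buffer. In a setup where the goal is part of the observation, the observation would need to be rebuilt with the new goal as well; the sketch leaves `obs` untouched for brevity.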

While conducting this research, we examined a number of other modifications to DDPG and our environment specification that are not employed in the following experiments. Some prior work (e.g., [23]) suggests that Prioritized Experience Replay may improve learning stability and rate of convergence, but our experience was that this tended to destabilize or prevent learning. The experiments below utilize uniform sampling from the experience replay buffer. Other work [5] has suggested that sparse rewards give rise to better learning than dense rewards, but we found the opposite to be true and so used the dense reward formulated in (2).

III-B Environmental Specification

When conducting an experiment using RL, an experimenter’s choice of environment specification can have a significant effect on learning. Though we developed our own simulated environment using ROS, we took inspiration from the existing “Reacher-v2” implementation in OpenAI Gym in order to follow previous results. In our setup, as in [8], each state $s$ is formed from $q$, $f$, and $g$, where $q$ is the vector of joint angles of the simulated UR5 robot, $f$ is a function that maps joint angles to end effector position in Cartesian space, and $g$ is the current goal location in Cartesian space. By including the goal in the state description, we allow an agent to learn a policy parameterized by the goal for a particular episode. Our experimental setup allows us to specify the number of joints controlled by a learning agent, so the dimension of the state depends on $n$, where $n$ is the number of joints being controlled. The action vector is simply the vector of desired absolute joint angles, and hence $a \in \mathbb{R}^n$. Finally, with a slight abuse of notation, we formulate our tasks’ reward functions as:


This is a fairly standard sort of reward function definition. The first term in (2) penalizes the current Euclidean distance between the agent’s end effector and the goal position, the second term penalizes large actions, and the final term gives a large positive reward when the agent reaches the goal. This reward function is an example of a “shaped” reward, which means that it attempts to steer the learning agent toward regions of high reward. The first and second term in (2) accomplish this by driving the agent to minimize the distance to the goal and to minimize the sequence of controls necessary to accomplish this. Though one could argue that in a position control regime it is inappropriate to penalize large absolute actions, we do so here by analogy to velocity and torque control, in which we would seek to minimize absolute actions to avoid sudden, jerky motions. In contrast, a “sparse” reward would only provide the agent with nonzero rewards upon reaching the goal (e.g., using just the third term in (2) as a reward function).
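To make the three-term structure described above concrete, the following sketch implements a shaped reward and its sparse counterpart. The weights, the bonus, and the success radius are hypothetical placeholders chosen for illustration; they are not the constants used in (2).

```python
import math

DISTANCE_WEIGHT = 1.0  # hypothetical weight on the distance penalty
ACTION_WEIGHT = 0.1    # hypothetical weight on the action penalty
GOAL_BONUS = 10.0      # hypothetical bonus for reaching the goal
EPSILON = 0.05         # hypothetical success radius in meters


def shaped_reward(end_effector, goal, action):
    """Distance penalty + action penalty + bonus when inside the goal ball."""
    distance = math.dist(end_effector, goal)
    action_norm = math.sqrt(sum(a * a for a in action))
    bonus = GOAL_BONUS if distance < EPSILON else 0.0
    return -DISTANCE_WEIGHT * distance - ACTION_WEIGHT * action_norm + bonus


def sparse_reward(end_effector, goal):
    """Only the third term: nonzero reward solely upon reaching the goal."""
    return GOAL_BONUS if math.dist(end_effector, goal) < EPSILON else 0.0


far = shaped_reward((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), [0.0, 0.0])
at_goal = shaped_reward((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), [3.0, 4.0])
print(far, at_goal)
```

Away from the goal the sparse variant returns zero everywhere, whereas the shaped variant still varies with distance, which is the guidance the text attributes to the first two terms.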

IV Results

We examine several general variants of the popular “Reacher” task, exemplified in prior work by the Reacher-v2 (planar) and FetchReach-v0 (3D) tasks in OpenAI Gym [8, 9]. (All results in this paper were produced using a Docker container.) We make use of a UR5 sitting on an impermeable plane, which we simulate using ROS and our ROSGym interface. Deep RL methods are commonly tested on discrete tasks, video games, and simple control tasks. For these tasks, established simulation suites such as OpenAI Gym and the Arcade Learning Environment [24] are commonly employed, but there is not currently a commonly accepted testing suite that integrates well with existing control software for commercial robots. This limits the reproducibility and generality of results in robotic RL, as environments for new robots or tasks must often be hand-engineered. It is our hope that, when open-sourced, ROSGym will help researchers to validate results gathered using existing robotic RL simulation suites (e.g., [10, 9]) on commercially available robots.

As in the established planar case, our experiments start the UR5 robot from a fixed position such that the end effector is within the goal constraint region. We consider two broad categories of Reacher task: the unconstrained version, in which goals are sampled from the whole workspace of the robot, and several constrained versions, in which goals are sampled from a goal constraint region. The unconstrained and constrained versions use different starting joint angles; these starting configurations are visualized in Figures 2-3. The learning agent is then tasked with controlling the end effector to a randomly sampled goal location. We use joint position control as the action space of the DDPG agent. In general, our ROS–OpenAI Gym interface allows us to use joint position, velocity, or torque control, but in our experiments only joint position control resulted in a non-negligible success rate. It is also worth noting that the choice of reward function had a significant impact on the success rate achieved by DDPG on the Reacher tasks that we tested.

Hyperparameter | Symbol | Value
Discount factor | | 0.98
Replay Buffer Size | |
Batch Size | | 64
Exploration Rate | | 0.01
Target Update Ratio | | 0.001
Actor Learning Rate | | 0.0001
Critic Learning Rate | | 0.001
Episodes of Training | | 20,000
Steps per Episode | | 100
TABLE I: DDPG training hyperparameters

For all versions of the Reacher task described below, we derive goals by uniformly sampling in the joint space of a simulated UR5 robot. For each set of joint values generated, we compute the end effector position by utilizing the forward kinematics of the UR5. If the experiment involves a constraint region, we reject candidate goals until a set of joint values places the corresponding end effector location within the constraint region. Goals could also be sampled directly in the Cartesian space within the goal constraint region, but this could generate goal locations without solutions for a given robot. Our goal sampling strategy allows us to guarantee that each sampled goal location is hit by at least one configuration of the robot. When we restrict to fewer than 6 joints of the UR5, we sample goals within the reachable space effected by the joints being controlled.
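The goal sampling procedure described above amounts to rejection sampling in joint space. The sketch below illustrates the idea with a hypothetical two-link planar arm standing in for the UR5 forward kinematics; the link lengths, the joint limits, the region bounds, and all function names are assumptions made for this example only.

```python
import math
import random

LINK_LENGTHS = (0.4, 0.4)  # hypothetical link lengths in meters (not the UR5's)


def forward_kinematics(joint_angles):
    """End effector (x, z) of a planar arm; a stand-in for the UR5 kinematics."""
    x = z = 0.0
    cumulative = 0.0
    for length, angle in zip(LINK_LENGTHS, joint_angles):
        cumulative += angle
        x += length * math.cos(cumulative)
        z += length * math.sin(cumulative)
    return (x, z)


def in_region(point, region):
    """region is ((x_min, x_max), (z_min, z_max)), or None if unconstrained."""
    if region is None:
        return True
    return all(lo <= c <= hi for c, (lo, hi) in zip(point, region))


def sample_goal(rng, region=None, max_tries=100_000):
    """Sample joints uniformly and reject goals outside the constraint region."""
    for _ in range(max_tries):
        joints = [rng.uniform(-math.pi, math.pi) for _ in LINK_LENGTHS]
        goal = forward_kinematics(joints)
        if in_region(goal, region):
            # The joints that produced the goal certify that it is reachable.
            return goal, joints
    raise RuntimeError("no reachable goal found inside the constraint region")


rng = random.Random(0)
z_height_region = ((-0.8, 0.8), (0.0, 0.4))  # hypothetical z-height style region
goal, joints = sample_goal(rng, z_height_region)
```

Because every accepted goal comes with the joint values that generated it, reachability holds by construction. The resulting goal distribution is uniform in joint space, not in Cartesian space.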

The figures below show success rates for policies learned by DDPG over 20,000 episodes of training. Every 100 episodes of training, we conducted a testing session comprising 100 episodes in which we executed the deterministic policy learned by DDPG without exploration noise. By default, episodes are 100 steps long, which corresponds to 2 seconds of simulated time at a control rate of 50 Hz. Episodes are terminated early upon success, which we define as the robot’s end effector entering an $\epsilon$-ball surrounding the goal position. The graphs below display the number of goals (out of 100) that were successfully hit by the learned policy in each testing session. Each graph shows the mean and 65% confidence interval for 5 independent attempts at learning using the same hyperparameters and task setup. (Each training run takes approximately 72 hours on an Amazon EC2 c4.xlarge instance.) We use the same actor and critic networks as [14] uses for low-dimensional experiments. Table I summarizes the other hyperparameters for DDPG which are held constant across our experiments.

In all of the following experiments, we strive to follow the prescriptions presented by [7]. We report all of our hyperparameters in Table I, and report all runs of all experiments (i.e., the results shown in this paper were not cherry-picked to demonstrate the trend that we highlight). Unfortunately, the baseline implementation of DDPG given by OpenAI [25] was not available when we conducted these experiments, so we used our own codebase. However, we did replicate existing results using our code before beginning the experiments reported in this work, and we intend to open-source our code (including our implementation of DDPG and ROSGym).

IV-A Unconstrained

(a) Joint numbers and bounding goal space region.
(b) Joints 1–2.
(c) Joints 1–3.
(d) Joints 1–4.
(e) Joints 1–5.
(f) Joints 1–6.
Fig. 2: Results for the unconstrained case.

Figure 2 shows the results of running the variant of DDPG described in Section III on the unconstrained Reacher task. In this version of the task we place no constraint on the sampling of goal end effector locations; hence, goals are sampled throughout the robot’s workspace. It is easy to see that, while DDPG succeeds in learning a highly capable policy for 2 joint control, for 3–6 joints DDPG fails to learn a policy that can hit more than 50% of the sampled goal locations. This makes some intuitive sense, as the 2 joints at the base of the UR5 effect roughly orthogonal motions and only a simple control policy must be learned, whereas for 3–6 joints a competent policy must involve coupled motion of multiple joints.

It is interesting to note that the asymptotic behavior of DDPG is approximately equivalent for 3–6 joints, as shown in Figure 2. This may not be particularly surprising, however; we note that the Cartesian workspaces for 3, 4, 5, and 6 joints of the UR5 robot are almost identical, since joints 4, 5, and 6 are primarily responsible for the orientation, rather than the position, of the end effector. For the unconstrained experiment we reported training curves for all of these joint configurations in order to verify that learning occurs similarly for all joint variants, but for all following experiments we report 3 and 6 joint results as these provide a lower and upper bound on learning difficulty, respectively. (The 2-joint variant was learned easily in the unconstrained case and in all other cases, and so is not further analyzed in this paper.)

IV-B Z-Height and Close Box Constraints

(a) Z-height goal constraint.
(b) Joints 1–3, z-height.
(c) Joints 1–6, z-height.
(d) Close box goal constraint.
(e) Joints 1–3, close box.
(f) Joints 1–6, close box.
(g) Far box goal constraint.
(h) Joints 1–3, far box.
(i) Joints 1–6, far box.
Fig. 3: Z-height and close box constrained experiments.

Figure 3 shows the results of running DDPG on the z-height constrained and close box constrained versions of the Reacher task. In the z-height constrained version, goals are only sampled below a height of 0.4 meters from the plane on which the robot sits. Note that the end effector of the simulated UR5 can normally reach approximately 1.0 meters above the plane on which the robot sits. In the close box constrained version, goals are sampled from a box that includes the robot’s base. See Figure 3(a) and Figure 3(d) for visualizations of these goal constraint regions. We only show the 3 and 6 joint cases here, as DDPG is able to learn the unconstrained task well for 2 joint control, and throughout our experiments we saw little variation in learning among the 3, 4, 5, and 6 joint versions of this task.

Though these two forms of constraint cover quite different workspace regions, the asymptotic performance of DDPG on these constraint regions is identical. Note that the asymptotic success rate of the policy learned by DDPG decreases from approximately 40% in the unconstrained case to 20% in the z-height and close box constrained cases. This lends further support to the idea that certain regions of the robot’s workspace (such as the region above z = 0.4 m) are easier to learn than others (such as the constraint regions previously described). Finally, we can qualitatively observe that learning takes longer to converge for 6 joint control, which lends some validation to the conventional wisdom that RL scales poorly to higher dimensions. However, this scaling effect is strongly dominated by the similarly poor asymptotic behavior common to the 3 and 6 joint cases.

IV-C Far Box Constraint

Figures 3(g) and 3(i) show the results of running DDPG on the far box constrained version of the Reacher task. In this version, goals are only sampled from within a box which is meters along the -axis from the robot’s base. The dimensions of this box were changed because the robot’s workspace does not extend beyond meters from the base of the robot. Though the volume of this constraint region is approximately half of that of the goal constraint region in the close box case, in our experiments an extension of the height of the far box constraint region to meters resulted in a convergent success rate similar to that in the reported far box case. See Figure 3(g) for a visualization of the reported far box constraint region. As before, we only show the 3 and 6 joint cases here.

It is immediately apparent that the far box constraint is easier to learn than any of the previously considered goal regions. Learning converges to nearly 100% success within 10 testing sessions, or 1000 training episodes, for all independent runs in the 3 joint case, and for 4 of 5 runs in the 6 joint case (1 run experienced the type of catastrophic forgetting that DDPG is known for [6]). We also found this rapid learning and near-perfect asymptotic behavior holds even if the z-height constraint of the box is removed. By simply moving the goal sampling region away from the base of the robot, the apparent asymptotic performance and sample complexity of DDPG on this task are significantly improved. Note that this form of constraint is the most similar to the goal constraints in existing Reacher task implementations, and also displays the greatest ease of learning.

(a) Run 1, 0.7 .
(b) Run 2, 0.7 .
Fig. 4: Goal success regions for independent runs on unconstrained 3 joint task.

IV-D Policy Success Visualization

Finally, Figure 4 shows how two runs of DDPG can learn two very different policies. These visualizations were created by taking two convergent ( success) policies learned by DDPG as described for the unconstrained 3-joint control Reacher task and running them without exploration noise. We measured the two policies’ rates of success on a coarse tiling of the workspace of the UR5 by randomly sampling goal locations in the previously described manner, executing the learnt policy, and binning the successes for each grid cell. The figures above represent only a slice of the robot’s workspace (between and ).
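The binning step described above can be sketched as follows for a two-dimensional slice of the workspace. The function name `success_rate_grid`, the cell size, and the sample results are hypothetical; the actual tiling resolution used for Figure 4 is not stated here.

```python
from collections import defaultdict


def success_rate_grid(results, cell_size):
    """Map each grid cell to the fraction of goals in it that were reached.

    results is an iterable of ((x, y), succeeded) pairs, one per test episode.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, attempts]
    for (x, y), succeeded in results:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell][1] += 1
        if succeeded:
            counts[cell][0] += 1
    return {cell: wins / tries for cell, (wins, tries) in counts.items()}


# Made-up outcomes of running a learned policy on four sampled goals.
results = [
    ((0.05, 0.05), True),
    ((0.15, 0.10), False),
    ((0.25, 0.05), True),
    ((-0.05, 0.05), False),
]
grid = success_rate_grid(results, cell_size=0.2)
```

Cells that received no sampled goals are simply absent from the returned dictionary, so a plot built from it should treat missing cells as "no data", not as failures.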

Though they were learned on the same task and with the same hyperparameters, the two policies display very distinct regions of competence (lighter regions, where nearly 100% of goals are successfully hit). This could result from DDPG attempting to learn a single unifying policy for regions of differing difficulty. From the qualitative difference in these two policies, we can infer that biases at the start of training may have an inordinate effect on the resulting policy success regions. This phenomenon could account for recent successes in robotic RL: tasks have been restricted to small enough spaces that a single policy can be learned which accounts for the entire constrained space.

V Discussion

Our results strongly suggest that 1) more exploration is necessary into how Deep Robotic RL benchmarks should be defined and run and that 2) more work is needed before popular Deep RL methods will be capable of learning control policies for general robotic tasks. Standard robotic RL benchmark tasks can elide much of what makes robotic control difficult, and much is still unknown about the true capability of popular algorithms such as DDPG to learn to perform more general tasks. Recent work has established that Deep RL methods may be more difficult to generalize to complex tasks than prior, non-deep methods [22, 26, 27], but it is still unknown how broadly this applies. However, this does not necessarily mean that the more general forms of these tasks are impossible to learn; rather, it suggests that more work needs to be done in assessing how popular learning algorithms perform on non-constrained versions of robotic tasks. Given the efficacy of ensemble methods [28] for supervised learning, we expect that a regional Deep RL ensemble algorithm may perform better on the Reacher tasks considered in this paper. We fixed our choice of algorithm and hyperparameters in this paper to focus on the effect of varying the “Reacher” task definition on learning, but we have not yet attempted the same analysis with other popular Deep RL algorithms such as TRPO. We expect, however, to observe a similar trend of learning difficulty as the goal constraint region is expanded and the dynamics of the robot’s effective workspace become more complicated. More research is necessary to confirm that the results presented in this paper are general across Deep RL algorithms, but our initial analysis has produced unexpected learning behavior that merits further investigation.


We would like to thank Zak Kingston and Bryce Willey for their help in developing ROSGym and this manuscript.