When random search is not enough: Sample-Efficient and Noise-Robust Blackbox Optimization of RL Policies

03/07/2019 ∙ by Krzysztof Choromanski, et al.

Interest in derivative-free optimization (DFO) and "evolutionary strategies" (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they match state-of-the-art methods for policy optimization tasks. However, blackbox DFO methods suffer from high sampling complexity since they require a substantial number of policy rollouts for reliable updates. They can also be very sensitive to noise in the rewards, actuators or the dynamics of the environment. In this paper we propose to replace the standard ES derivative-free paradigm for RL, based on simple reward-weighted averaged random perturbations for policy updates, which has recently become a subject of voluminous research, with an algorithm where gradients of blackbox RL functions are estimated via regularized regression methods. In particular, we propose to use L1/L2 regularized regression-based gradient estimation to exploit sparsity and smoothness, as well as LP decoding techniques for handling adversarial stochastic and deterministic noise. Our methods can be naturally combined with sliding trust region techniques for efficient sample reuse to further reduce sampling complexity. This is not the case for standard ES methods, which require independent sampling in each epoch. We show that our algorithms can be applied in locomotion tasks, where training is conducted in the presence of substantial noise, e.g. for learning, in simulation, transferable stable walking behaviors for quadruped robots, or for training quadrupeds to follow a path. We further demonstrate our methods on several OpenAI Gym MuJoCo RL tasks. We manage to train effective policies even if up to 25% of all measurements are arbitrarily corrupted, whereas standard ES methods produce sub-optimal policies or do not manage to learn at all. Our empirical results are backed by theoretical guarantees.







I Introduction

Consider the following blackbox optimization problem:

max_{θ ∈ Θ} F(θ),    (1)

where F: Θ → ℝ takes as input a sequence of parameters θ ∈ Θ ⊆ ℝ^d encoding a policy π_θ: S → A (S and A standing for the state space and action space respectively) and outputs the total (expected) reward obtained by an agent applying this policy in a given environment. Since typically the environment is a blackbox physics simulator, or even a piece of real hardware, F only admits function evaluation and cannot be paired with explicit analytical gradients. Blackbox/ES, or derivative-free ([1, 2, 3, 4]), algorithms for RL and robotics aim to maximize F
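Concretely, such an F is just a rollout wrapper that exposes function evaluations only. The toy dynamics and the tiny linear policy below are hypothetical stand-ins for a simulator, not part of the paper:

```python
def make_blackbox(env_step, horizon=100):
    """Return F(theta): the total reward of one rollout under a tiny
    linear policy. `env_step` is a hypothetical simulator interface
    mapping (state, action) -> (next_state, reward). Only evaluations
    of F are exposed; no analytical gradients exist."""
    def F(theta):
        state, total = 1.0, 0.0
        for _ in range(horizon):
            action = theta[0] * state + theta[1]  # linear policy
            state, reward = env_step(state, action)
            total += reward
        return total
    return F

def env_step(state, action):
    """Stand-in dynamics: reward is highest when the state is driven to 0."""
    next_state = 0.9 * state + 0.1 * action
    return next_state, -next_state ** 2

F = make_blackbox(env_step)
```

Any optimizer that touches F only through calls like `F(theta)` is a blackbox method in the sense used here.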

by applying various random search techniques, while avoiding explicit gradient computation. Typically, in each epoch a parameter vector θ_t encoding policy π_{θ_t} is updated by the following general rule ([1]):

θ_{t+1} = θ_t + (η / (σ|U_t|)) Σ_{g ∈ U_t} w(g) g,    (2)

where U_t is a subset of the set {g_1, ..., g_N} of chosen random perturbations/samples (for some N > 0) defining perturbed versions θ_t ± σg_i of the given policy, function w translates rewards obtained by perturbed policies to actual weights and η > 0 is a step size. Directions g_i are taken independently at random, usually from a multivariate Gaussian distribution N(0, I_d), scaled by some σ > 0. Weight functions w include: w(g) = F(θ_t + σg), w(g) = F(θ_t + σg) − F(θ_t − σg) and more. Subsets U_t are chosen according to different filtering heuristics, e.g. F is evaluated at points θ_t ± σg_i for i = 1, ..., N and the algorithm chooses perturbations corresponding to policies producing the highest rewards.
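The update rule above can be sketched in a few lines. The antithetic weights, the top-k filtering rule and all hyperparameter values below are illustrative choices rather than the exact configuration of any cited method:

```python
import random

def es_update(F, theta, n_perturbations=20, sigma=0.1, eta=0.02, top_k=10):
    """One epoch of the reward-weighted random-search update: antithetic
    weights w = F(theta + sigma*g) - F(theta - sigma*g), keeping only the
    top_k perturbations by best observed reward (a filtering heuristic)."""
    d = len(theta)
    scored = []
    for _ in range(n_perturbations):
        g = [random.gauss(0.0, 1.0) for _ in range(d)]
        f_plus = F([t + sigma * gi for t, gi in zip(theta, g)])
        f_minus = F([t - sigma * gi for t, gi in zip(theta, g)])
        scored.append((max(f_plus, f_minus), f_plus - f_minus, g))
    scored.sort(key=lambda s: s[0], reverse=True)  # keep best-reward directions
    kept = scored[:top_k]
    step = [0.0] * d
    for _, w, g in kept:
        for i in range(d):
            step[i] += w * g[i]
    return [t + eta / (sigma * len(kept)) * s for t, s in zip(theta, step)]

# Illustration on a concave quadratic F(x) = -||x||^2 (maximum at 0).
random.seed(0)
F = lambda x: -sum(xi * xi for xi in x)
theta = [1.0, -1.0, 0.5]
for _ in range(200):
    theta = es_update(F, theta)
```

Note that each epoch spends 2 × n_perturbations fresh rollouts and discards them afterwards; this is precisely the sampling cost the paper targets.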

Fig. 1: Execution of the RBO policy on a real quadruped robot for stable walking. The algorithm learns robust policies with few perturbations per epoch in the presence of substantial measurement noise.

Despite not relying on the internal structure of the RL problem, these methods can be highly competitive with state of the art policy gradient approaches ([5], [6], [7], [8]), while admitting much simpler and embarrassingly parallelizable implementations, where different workers compute rewards obtained by different perturbed policies independently.

However, in order to obtain good policies π_θ, even in a completely noiseless setting, a large number of long-horizon rollouts may be required, which quickly becomes a computational bottleneck. For instance, the ES algorithms proposed in [1] require thousands of CPUs to get competitive results. Furthermore, policies obtained by updates as in Equation 2 are very sensitive to noisy measurements, e.g. when the dynamics model used in the simulator does not accurately represent the true dynamics in certain regions of the state space. Hence, the central motivation for this paper is improving the data efficiency of such methods and proposing an approach that is much more robust to the noisy measurements notoriously present in robotics.

We propose a class of new blackbox optimization algorithms for optimizing RL policies that inherit all the benefits of the standard random search methods described above, such as conceptual simplicity and agnosticism to the structure of the optimized blackbox functions, yet can handle substantial adversarial measurement noise and are characterized by lower sampling complexity than the above methods. We call them robust blackbox optimization algorithms, or simply: RBO.

Our approach fundamentally differs from the previously described random search techniques. We propose to estimate gradients of the blackbox function F to conduct optimization by solving generalized regression/compressed sensing regularized optimization problems. The related computational overhead is negligible in comparison to the time spent querying the blackbox function F, and the reconstruction remains accurate in the noisy measurement setting even if a substantial constant fraction of all the measurements of the interactions with the environment are arbitrarily inaccurate, as our theoretical results show (see: Appendix). Our proposed LP decoding-based ES optimization, an instantiation of the RBO class, is particularly resilient to substantial noise in the measurement space (see: Fig. 1).

Our methods can be naturally combined with sliding trust region techniques for efficient sample reuse to further reduce sampling complexity. In contrast, standard ES methods require independent sampling in each epoch. We show that RBO can be applied in locomotion tasks, where training is conducted in the presence of substantial noise, e.g. for learning, in simulation, transferable walking behaviors for quadruped robots or training quadrupeds to follow a path. We further demonstrate RBO on several OpenAI Gym MuJoCo RL tasks. We manage to train effective policies even if up to 25% of all measurements are arbitrarily corrupted, whereas standard ES methods produce sub-optimal policies or do not manage to learn at all. Our experiments are backed by theoretical results.

To summarize, we propose and comprehensively benchmark a combination of the following techniques:

  • By sampling the function locally, we recover the gradient via under- or overconstrained linear regression (depending on the noise level) wherein the sparsity and smoothness of unknown gradients can be exploited by L1 or L2 regularizers,

  • Noise coming from sources such as stochastic environment dynamics or rewards, or even deterministic error associated with Taylor approximation, can be handled via robust regression loss functions such as L1, Huber or Least trimmed loss,

  • We use a sliding trust region to sample the blackbox function and reuse samples that overlap with previous iterates. This brings an off-policy flavor to blackbox methods, reminiscent of mature methods developed in the DFO literature [9],

  • In conjunction with the ideas above, we use structured policy networks [10], [11] to bring the problem dimensionality into the DFO “sweetspot”.

This paper is organized as follows. In Section II we introduce our RBO algorithm for ES policy optimization. In Section III we give convergence results for certain sub-classes of our algorithm based on Linear Programming (LP) decoding techniques with strong noise robustness guarantees. In Section IV we provide an exhaustive empirical evaluation of our methods on OpenAI Gym tasks as well as quadruped robot locomotion tasks.

II Robust Blackbox Optimization Algorithm

RBO relies on the fact that by using different directions g defining perturbed policies, one can obtain good estimates of the dot-products of the gradient of the RL blackbox function F at a given point θ with the vectors defined by these directions, simply by performing rollouts of the perturbed policies and collecting the obtained rewards. If F is not differentiable, the notion of the gradient should be replaced by that of a smoothing of F (applied on a regular basis to optimize non-convex non-differentiable functions, see [12]), but since the analysis is completely analogous, from now on we will without loss of generality assume that F is smooth (it can actually be shown that by applying our techniques in the non-differentiable setting, one conducts blackbox optimization using different smooth proxies of F, depending on the probabilistic distributions used to sample perturbations).

Let θ, g ∈ ℝ^d and σ > 0. Notice that the following holds:

F(θ + σg) = F(θ) + σ∇F(θ)ᵀg + O(σ²).

Thus for σ small enough the following is true:

∇F(θ)ᵀg ≈ (F(θ + σg) − F(θ)) / σ.

We call the right-hand side the forward finite-difference estimation of the action of the gradient ∇F(θ) on g. By a similar analysis as above, we can obtain the antithetic finite-difference estimation of the action of the gradient on g:

∇F(θ)ᵀg ≈ (F(θ + σg) − F(θ − σg)) / (2σ).
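The forward and antithetic finite-difference estimators just described can be checked numerically; the cubic test function and all values below are illustrative:

```python
def forward_fd(F, theta, g, sigma):
    """Forward finite-difference estimate of <grad F(theta), g>; error O(sigma)."""
    shifted = [t + sigma * gi for t, gi in zip(theta, g)]
    return (F(shifted) - F(theta)) / sigma

def antithetic_fd(F, theta, g, sigma):
    """Antithetic estimate; even-order Taylor terms cancel, so error is O(sigma^2)."""
    plus = [t + sigma * gi for t, gi in zip(theta, g)]
    minus = [t - sigma * gi for t, gi in zip(theta, g)]
    return (F(plus) - F(minus)) / (2.0 * sigma)

# Smooth test function with known gradient: F(x) = x0^3 + 2*x1^2,
# so grad F = (3*x0^2, 4*x1) and at theta = (1, 2) with g = (1, 1):
F = lambda x: x[0] ** 3 + 2.0 * x[1] ** 2
theta, g, sigma = [1.0, 2.0], [1.0, 1.0], 1e-3
exact = 3.0 + 8.0  # <grad F(theta), g> = 11
```

With sigma = 1e-3 the antithetic estimate is several orders of magnitude more accurate than the forward one on this cubic, matching the Taylor argument above.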
With this characterization, we can formulate the problem of finding an approximate gradient as a regression problem (we choose the forward finite-difference estimation of the action of the gradient, but a completely analogous analysis can be done for the antithetic one).

Given scalars F(θ + σg_1), ..., F(θ + σg_N) (corresponding to rewards obtained by different perturbed versions of the policy encoded by θ), we formulate the regression problem by considering input vectors σg_i with regression values y_i = F(θ + σg_i) − F(θ) for i = 1, ..., N. We propose to solve this regression task via the following minimization problem:

min_{v ∈ ℝ^d}  ‖y − Zv‖_p^p + α‖v‖_q^q,

where p, q ≥ 1, Z ∈ ℝ^{N×d} is the matrix with rows σg_iᵀ encoding perturbations, where the sequence of these rows is sampled from some given joint multivariate distribution P, vector y ∈ ℝ^N consists of the regression values (i.e. y_i = F(θ + σg_i) − F(θ) for i = 1, ..., N) and α ≥ 0 is a regularization parameter.
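In the L2-regularized (ridge) case the minimization above has a closed form, v = (ZᵀZ + αI)⁻¹Zᵀy. A pure-Python sketch for small dimensions follows; the quadratic test function and all hyperparameter values are illustrative:

```python
import random

def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, d):
            f = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * d
    for r in range(d - 1, -1, -1):
        x[r] = (M[r][d] - sum(M[r][c] * x[c] for c in range(r + 1, d))) / M[r][r]
    return x

def ridge_gradient(F, theta, n=30, sigma=0.05, alpha=1e-6):
    """Estimate grad F(theta) via min_v ||y - Zv||_2^2 + alpha*||v||_2^2,
    with rows sigma*g_i and y_i = F(theta + sigma*g_i) - F(theta).
    Closed form: v = (Z^T Z + alpha*I)^{-1} Z^T y."""
    d = len(theta)
    f0 = F(theta)
    Z, y = [], []
    for _ in range(n):
        g = [random.gauss(0.0, 1.0) for _ in range(d)]
        Z.append([sigma * gi for gi in g])
        y.append(F([t + sigma * gi for t, gi in zip(theta, g)]) - f0)
    A = [[sum(Z[k][i] * Z[k][j] for k in range(n)) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(Z[k][i] * y[k] for k in range(n)) for i in range(d)]
    return gauss_solve(A, b)

# Illustration: F(x) = 3*x0 - 2*x1 + x0*x1 has gradient (3 + x1, -2 + x0).
random.seed(0)
F = lambda x: 3.0 * x[0] - 2.0 * x[1] + x[0] * x[1]
v = ridge_gradient(F, [1.0, 2.0])  # close to the true gradient (5, -1)
```

The curvature of F enters only through the O(σ²) residuals of the finite differences, which the regression averages out.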

Input: blackbox function F; scaling parameter sequence {σ_t}; initial parameter vector θ_0; number of perturbations N; step size sequence {η_t}; sampling distribution P; parameters p, q, α; trust region fraction ε; number of epochs T.
Output: Vector of parameters θ_T.
Initialize: G_{−1} = ∅, Y_{−1} = ∅ (the stored perturbed points and their rewards).
for t = 0, 1, ..., T − 1 do
      1. Compute all distances from θ_t to the points in G_{t−1}.
      2. Find the closest ε-percentage of vectors from G_{t−1} and call it G_t^near. Call the corresponding subset of Y_{t−1}: Y_t^near.
      3. Sample g_1, ..., g_N from P.
      4. Compute F(θ_t + σ_t g_j) and F(θ_t) for all j = 1, ..., N.
      5. Let Z_t be the matrix obtained by concatenating rows σ_t g_jᵀ and those of the form (x − θ_t)ᵀ, where x ∈ G_t^near.
      6. Let y_t be the vector obtained by concatenating values F(θ_t + σ_t g_j) − F(θ_t) with those of the form F(x) − F(θ_t), where x ∈ G_t^near.
      7. Let v_t be the resulting vector after solving the following optimization problem: min_v ‖y_t − Z_t v‖_p^p + α‖v‖_q^q.
      8. Take ∇̂F(θ_t) = v_t.
      9. Take θ_{t+1} = Proj_Θ(θ_t + η_t ∇̂F(θ_t)).
      10. Update G_t to be the set of points θ_t + σ_t g_j together with G_t^near, and Y_t to be the set of the corresponding rewards F(θ_t + σ_t g_j) together with Y_t^near.
Algorithm 1: Robust Blackbox Optimization Algorithm (dynamic trust region variant)

Note that various known regression methods arise by instantiating the above optimization problem with different values of p, q and α. In particular, p = 2, q = 2 leads to the ridge regression algorithm ([13]); p = 2, q = 1 to the Lasso method ([14]); and p = 1, α = 0 to LP decoding ([15]). The latter is especially important since it is applied in several areas ranging from dimensionality reduction techniques ([16]) to database retrieval algorithms ([15]) and compressive sensing methods in medical imaging ([17]). We show empirically and theoretically that in the ES context it leads to the most robust policy learning algorithms, insensitive to substantial adversarial noise. Note also that, as opposed to the standard random search ES methods described before, the perturbations do not need to be taken from a multivariate Gaussian distribution and they do not even need to be independent. This observation will play a crucial role when we propose to reduce sampling complexity by reusing policy rollouts.
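The noise robustness of the L1 (LP-decoding) variant can be illustrated on synthetic data. The sketch below minimizes the L1 objective by iteratively reweighted least squares, which is an implementation convenience for this illustration rather than the LP solver referred to in the text; the data and corruption pattern are invented:

```python
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    d = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, d):
            f = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * d
    for r in range(d - 1, -1, -1):
        x[r] = (M[r][d] - sum(M[r][c] * x[c] for c in range(r + 1, d))) / M[r][r]
    return x

def l1_regression(Z, y, iters=50, eps=1e-8):
    """Approximately minimize ||y - Zv||_1 via iteratively reweighted least
    squares: each residual gets weight 1/|residual|, so gross outliers are
    progressively ignored."""
    n, d = len(Z), len(Z[0])
    v = [0.0] * d
    for _ in range(iters):
        w = []
        for k in range(n):
            r = y[k] - sum(Z[k][j] * v[j] for j in range(d))
            w.append(1.0 / max(abs(r), eps))
        A = [[sum(w[k] * Z[k][i] * Z[k][j] for k in range(n)) for j in range(d)]
             for i in range(d)]
        b = [sum(w[k] * Z[k][i] * y[k] for k in range(n)) for i in range(d)]
        v = solve(A, b)
    return v

# Synthetic linear measurements y = Z v_true, with 25% of the entries
# arbitrarily corrupted (the adversarial-noise regime discussed above).
random.seed(0)
v_true = [5.0, -1.0, 2.0]
Z = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]
y = [sum(zi * vi for zi, vi in zip(row, v_true)) for row in Z]
for k in range(0, 40, 4):  # corrupt every 4th measurement
    y[k] += random.choice([-1.0, 1.0]) * 100.0
v_hat = l1_regression(Z, y)  # recovers v_true despite the corruptions
```

An ordinary least-squares fit on the same corrupted data is pulled far away from v_true, which mirrors the robustness gap between the p = 2 and p = 1 instantiations described in the text.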

II-A Sliding Trust Region for ES Optimization

The main advantage of the above regression-based optimization algorithms for RL blackbox function gradient approximation is that it is not necessary to sample from a fixed distribution at each step in order to apply them. Instead, we use the observation that at any vector encoding the current policy, a good quality estimator of the corresponding gradient can be deduced from evaluations of the blackbox function at any parameter point cloud around it. This prompts the idea of using a trust region approach [18], where perturbations are reused from epoch to epoch. Reusing samples allows us to reduce sampling complexity, since it reduces the number of times the blackbox function is called, e.g. the number of times the simulator is used. We propose two simple trust region techniques for sample reuse and show that they work very well in practice (see: Section IV). Denote by θ_t the current parameter vector obtained throughout the optimization process. In the first strategy, called the static trust region method, all previously evaluated perturbed policies that are within a radius R from θ_t are reused to approximate the gradient of F at θ_t (where R is a tuned hyperparameter). In the second strategy, which we call the dynamic trust region method, only a fixed fraction ε of the previously evaluated policies that are closest to θ_t is reused (where ε is another hyperparameter).
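The dynamic trust region selection can be sketched as a simple archive filter; the function name and all values below are illustrative:

```python
def dynamic_trust_region(archive, theta, fraction):
    """Select the `fraction` of archived (point, value) pairs closest to theta.

    `archive` holds (x, F(x)) pairs from earlier epochs; the selected pairs
    can be turned into extra regression rows (x - theta, F(x) - F(theta)),
    so the simulator is queried only for the fresh perturbations."""
    if not archive:
        return []
    dist = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, theta)) ** 0.5
    ranked = sorted(archive, key=lambda pair: dist(pair[0]))
    keep = max(1, int(fraction * len(ranked)))
    return ranked[:keep]

# Reuse the 25% of old rollouts nearest the current policy parameters.
archive = [([0.0, 0.0], 1.0), ([1.0, 1.0], 2.0),
           ([0.1, 0.0], 1.1), ([5.0, 5.0], 0.0)]
reused = dynamic_trust_region(archive, [0.0, 0.1], fraction=0.25)
```

This is possible precisely because the regression-based gradient estimator does not require the perturbations to be independent or Gaussian.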

Our algorithm taking advantage of the above techniques is presented in the Algorithm 1 box (we present the dynamic trust region version). At each epoch of the procedure the regularized regression problem is solved to estimate the gradient. The estimate is then used to update the policy parameters. The projection step in the for-loop of the algorithm is conducted to make sure that the resulting parameter vector is in the domain Θ of allowed parameter vectors.

III Convergence Results for Robust Gradient Recovery

In this section we provide results regarding the convergence of the sub-class of RBO algorithms with p = 1 and α = 0 (i.e. using LP decoding to reconstruct the gradient of the blackbox function F) in the noisy setting. This noisy setting is valid in a variety of practically relevant scenarios, e.g. when a few of the trajectories used during training suffer from noisy state measurements, when a constant fraction of the trajectories are generated via a simulator while the remaining ones are produced on the real system, or when the training procedure has access to a fairly robust simulator that may nevertheless substantially diverge from the real dynamics for up to twenty-three percent of all the trajectories. To obtain rigorous theoretical guarantees, we will need certain structural assumptions regarding F. However, as we see in Section IV, those are not actually required in practice, and furthermore other sub-classes of RBO algorithms are also capable of learning good policies. All proofs are given in the Appendix.

We need the following definitions.

Definition 1 (coefficient ρ*)

Let g ∼ N(0, 1) and consider the function of the corruption level ρ that arises in the LP decoding analysis of [15], built from the pdf and cdf of g. This function is continuous and decreasing on the interval of interest and changes sign exactly once, hence there exists a unique point at which it vanishes. We define ρ* as that point. Its numerical value is approximately 0.239, consistent with the twenty-three-percent corruption level mentioned above.

Definition 2 (β-smoothness)

A differentiable concave function F: Θ → ℝ is smooth with parameter β > 0 if for every pair of points x, y ∈ Θ:

F(y) ≥ F(x) + ∇F(x)ᵀ(y − x) − (β/2)‖y − x‖²₂.

If F is twice differentiable, this is equivalent to ∇²F(x) ⪰ −βI for all x ∈ Θ.

Definition 3 (L-Lipschitz)

We say that F: Θ → ℝ is Lipschitz with parameter L > 0 if for all x, y ∈ Θ it satisfies |F(x) − F(y)| ≤ L‖x − y‖₂.

We are ready to state our main theoretical result.

Theorem 1

Consider a blackbox function F: Θ → ℝ. Assume that F is concave, Lipschitz with parameter L and smooth with smoothness parameter β. Assume furthermore that the domain Θ ⊆ ℝ^d is convex and has diameter B. Consider Algorithm 1 with p = 1, α = 0 and the noisy setting in which at each step a fraction of at most ρ of all measurements is arbitrarily corrupted, for some ρ ≤ ρ*. Then there exists a universal constant c > 0 such that, for a suitable choice of the step sizes η_t and scalings σ_t and for any δ ∈ (0, 1), with probability at least 1 − δ the optimality gap F(θ*) − F(θ_T) is bounded by a term decaying with the number of epochs T plus an error term controlled by σ, where θ* = argmax_{θ ∈ Θ} F(θ).

If F has extra curvature properties, such as being strongly concave, we can get a linear convergence rate.

Definition 4 (strong concavity)

A function F is strongly concave with parameter λ > 0 if for all x, y ∈ Θ:

F(y) ≤ F(x) + ∇F(x)ᵀ(y − x) − (λ/2)‖y − x‖²₂.
The following theorem holds:

Theorem 2

Assume the conditions from Theorem 1 and furthermore that F is strongly concave with parameter λ. Take Algorithm 1 with p = 1, α = 0, acting in the noisy environment in which at each step a fraction of at most ρ of all measurements is arbitrarily corrupted, for some ρ ≤ ρ*. Then there exists a universal constant c > 0 such that, for a suitable choice of the step sizes and scalings and for any δ ∈ (0, 1), with probability at least 1 − δ the optimality gap F(θ*) − F(θ_T) decays at a linear (geometric) rate in T, up to an error term controlled by σ.

IV Experiments

We conducted an exhaustive analysis of the proposed class of RBO algorithms on several OpenAI Gym [19] benchmark RL tasks. We also used RBO to learn policies for the quadruped locomotion tasks (see: Section IV-A). We tested both large noise settings, where as many as 25% of all the measurements were corrupted, and heavily underconstrained settings, where the number of chosen perturbations per epoch is substantially smaller than the policy dimensionality. In these settings we still learned good quality policies with a small number of rollouts, whereas other methods either failed or needed more perturbations. We trained locomotion policies for the quadruped robot in the PyBullet simulator ([20]).

Our main focus is to compare RBO with other ES methods, in particular the state-of-the-art Augmented Random Search (ARS) algorithm from [2]. ARS can be easily adjusted to different policy architectures, is characterized by lower sampling complexity than other ES algorithms and was shown to outperform the algorithms from [1] on many different robotics tasks. This is possible due to the use of various efficient ES heuristics, such as state and reward renormalization as well as perturbation filtering methods, to substantially lower sampling complexity. ARS was also used to train quadruped locomotion tasks in [21]. Additionally, we added a comparison with non-ES policy gradient methods, even though this class exploits the MDP structure of the RL blackbox function under consideration, to which ARS and RBO are agnostic.

We tested two policy architectures: linear (on its own for some tasks, or as a part of a larger hybrid policy for the quadruped locomotion tasks) and nonlinear with two hidden layers, with nonlinearities and with connections encoded by low displacement rank (LDR) Toeplitz matrices from [10]. Below we refer to them as: linear and LDR/nonlinear policies respectively.

On some plots we present different variants of the RBO algorithm, whereas on others we show one (e.g. if all variants gave similar training curves, or a particular version was most suitable, as LP decoding is for the overconstrained setting with substantial noise). Presented results were obtained using the dynamic trust region method.

IV-A Locomotion for quadruped robots

We tested our algorithm on two different quadruped locomotion tasks derived from a real quadruped robot ('Minitaur' from Ghost Robotics, see: Fig. 1), with different reward functions encouraging different behaviors. Minitaur has four legs and eight degrees of freedom, where each leg has the ability to swing and extend to a certain degree using the PD controller provided with the robot. We train our policies in simulation using the PyBullet environment modeled after the robot ([20]). To learn walking for quadrupeds, we use architectures called Policies Modulating Trajectory Generators (PMTGs) that have recently been proposed in [21]. The architecture incorporates basic cyclic characteristics of locomotion and leg movement primitives by using trajectory generators: parameterized functions that provide cyclic leg positions. The policy is responsible for modulating and adjusting the leg trajectories as needed for the environment.

Straight walking with changing speeds

The robot is rewarded for walking at the desired speed which is changed during the episode. This task is identical to the original one tested with PMTG. More details about the environment such as speed profile during the episode, reward calculation as well as observation and action space definitions can be found in [21].

(a) phase I (b) phase II (c) phase III
Fig. 2: A policy trained by the RBO algorithm in action. We used LP decoding to handle arbitrarily corrupted measurements per epoch. The training was run on a cluster of machines, each handling one perturbation. The algorithm manages to recover the gradient of the blackbox function F, as our theoretical results predict, and produces stable walking behaviors (demonstration in the attached video). We obtained similar results with substantially fewer perturbations per epoch and a higher noise level, which shows that RBO handles substantial noise even in the underconstrained regime.

The RBO algorithm was compared with other methods on the task of training the linear component of the PMTG architecture. We compared with different variants of the ARS algorithm (with the filtering heuristic turned on or off), since it was the method of choice in [21]. In the noiseless setting all methods produce similar training curves, as we see in Fig. 3 (a). The results are very different if 25% of all measurements are arbitrarily corrupted (Fig. 3 (b)). Then, RBO manages to learn good walking behavior much faster than ARS. In that setting ARS with filtering turned on (used in [21]), which is much more sensitive to noise, does not work at all, and thus we did not present it on the plot. Results from Fig. 3 were obtained for a fixed number of perturbations per epoch; we had similar results for other configurations of training (see: Fig. 2).

(a) noiseless (b) 25% noise
Fig. 3: Comparison of RBO with other policy learning methods for learning quadruped stable walking. Even though in the noiseless setting all methods produce similar training curves, when noise is injected RBO learns stable walking policies much faster.

Path tracking problem for quadruped robots

The second task is to learn walking while staying within a specific curved 2D path (shown in black in Fig. 4), where the path forces the robot to steer left and right. The task is much harder because the legs of the robot do not have the third degree of freedom which would provide the sideways motion needed to turn easily. The robot is rewarded for staying within the path boundary and moving forward. The observations include the robot's IMU data, its position and orientation. The robot's joint motor positions are output as actions. We use the same PMTG architecture as before.

(a) phase I (b) phase II (c) phase III (d) phase IV (e) phase V (f) phase VI
Fig. 4: Policy steering the quadruped to follow the path, trained by the RBO algorithm, in action. The algorithm uses only a small number of perturbations per epoch to train the policy.

As for learning stable walking behaviors, RBO was responsible here for training the linear component of a larger hybrid policy. We obtained similar results as before. The RBO policy successfully steering the robot to follow the path is presented in Fig. 4. Training curves obtained for two random seeds for this task are presented in Fig. 5. The rewards translate to good steering policies.

(a) : seed I (b) : seed II
Fig. 5: Training curves for learning quadruped steering policies guiding the robot to follow the path. In both runs a modest number of epochs suffices to train policies successfully completing the task.

IV-B OpenAI Gym tasks

We focused on the locomotion tasks, since RBO worked particularly well on the locomotion tasks for quadruped robots described in detail above. However, we show that it is not limited to locomotion, applying it successfully also to a robotic-arm environment where ARS, with a similar number of perturbations per epoch, did not manage to learn good quality policies. All the plots were created by averaging over several random seeds.

(a) noiseless (b) noiseless
Fig. 6: Comparison of RBO with other policy learning methods on two tasks with no extra noise injected.
(a) phase I (b) phase II (c) phase III (d) phase I (e) phase II (f) phase III
Fig. 7: Comparison of ARS and RBO policies trained with only a few perturbations per epoch. The ARS policy (first row) is suboptimal and drives the agent to move to the side. The RBO policy (second row) steers the agent to efficiently move forward.
(a) phase I (b) phase II (c) phase III (d) phase I (e) phase II (f) phase III
Fig. 8: As in Fig. 7, but this time for the robotic-arm environment. The ARS policy does not manage to steer the robotic arm close to the object.

The first set of results is presented in Fig. 6, where we compare RBO with other methods on two tasks in the noiseless underconstrained setting, where the number of perturbations is substantially smaller than the number of parameters of the nonlinear policy. In both cases RBO learns effective policies whereas other methods do not. In particular, ARS learns a suboptimal policy for the first task, where the agent moves to the side instead of moving forward, which we show in the first row of Fig. 7. In the second row of Fig. 7 we show the effective policy learned by RBO. An analogous pictorial representation of the policies learned for the second task is presented in Fig. 8. The ARS policy does not manage to steer the robotic arm close to the object, but the RBO policy does it successfully.

(a) noiseless (b) 15% noise
Fig. 9: Comparison of the RBO algorithm with other policy learning methods: (a) no noise injected, (b) 15% of all the measurements arbitrarily corrupted.

In Fig. 9 we present results for the noiseless and noisy settings, where as many as 15% of all the measurements are arbitrarily corrupted. In the noiseless setting RBO and ARS provide similar training curves, being superior to the policy gradient baselines, but in the noisy setting RBO outperforms ARS. All trained policies are nonlinear.

(a) phase I (b) phase II (c) phase III (d) phase I (e) phase II (f) phase III
Fig. 10: The first and second rows present RBO policies learned for two further tasks. Both policies lead to optimal behaviors and were learned with the use of only a few perturbations per epoch.
(a) noiseless (b) 15% noise (c) 20% noise
Fig. 11: Comparison of RBO with other methods on three more tasks. RBO provides clear gains in the underconstrained setting (LDR policy) and in the overconstrained setting (linear policy) in the presence of substantial noise.

In Fig. 11 we present the results for three more tasks. For the first one we tested a noiseless setting with an LDR policy; with this number of perturbations RBO learns policies getting high rewards, outperforming ARS. For the second, with 15% of the measurements corrupted and the same policy architecture, RBO and ARS give similar training curves. For the third we chose a linear policy architecture, which was shown in [2] to work well for most tasks with the ARS algorithm. The setting is overconstrained. We injected substantial noise, leading to 20% of the measurements per epoch being arbitrarily inaccurate. In that setting ARS does not manage to learn at all (as is the case also for the other baseline algorithms, not shown on the plot), whereas RBO (using LP decoding) learns effective policies.

RBO policies in action, trained with only a few perturbations per epoch for two tasks with LDR nonlinear architectures, are presented in Fig. 10. Both policies lead to optimal behaviors.

V Conclusion

We proposed a new class of algorithms, called RBO, for RL blackbox optimization, with better sampling complexity than baselines applying standard random search ES methods. They rely on careful gradient reconstruction via regularized regression/LP decoding methods. We show empirically and theoretically that not only do our algorithms learn good quality policies faster than the state of the art, but they are also much less sensitive to the noisy measurement regimes notoriously present in robotics applications.


VI Appendix

VI-A Proof of Theorem 1

Let ∇F(θ) denote the true gradient of F at θ.

Lemma 1

The regression values satisfy y = Z∇F(θ) + e with ‖e‖_∞ ≤ (β/2)σ² max_j ‖g_j‖²₂.

This follows immediately from a Taylor expansion and the smoothness assumption on F.

Lemma 2

For any θ, if up to a ρ ≤ ρ* fraction of the entries of y are arbitrarily corrupted, the solution v of the gradient recovery optimization problem with input (Z, y) satisfies a recovery guarantee of the form ‖v − ∇F(θ)‖₂ = O(σ), whenever the number of measurements N is sufficiently large relative to d, with high probability.

The proof of Lemma 2 and the constants follow from a direct application of Theorem 1 in [15].

As a consequence, we can show that the first-order Taylor approximation of F around θ that uses the true gradient and the one using the RBO gradient estimate v are uniformly close:

Lemma 3

The following bound holds for all θ' ∈ Θ: |∇F(θ)ᵀ(θ' − θ) − vᵀ(θ' − θ)| ≤ ‖v − ∇F(θ)‖₂ · ‖θ' − θ‖₂, and hence the difference is bounded by the recovery error of Lemma 2 times the diameter of Θ.

The next lemma provides us with the first step in our convergence bound:

Lemma 4

For any θ* in Θ, the per-step decrease of the optimality gap is controlled by the recovery error of the estimated gradient.

Recall that θ_{t+1} is the projection of θ_t + η_t v_t onto the convex set Θ, and that projections onto convex sets are contractions. As a consequence:

‖θ_{t+1} − θ*‖₂ ≤ ‖θ_t + η_t v_t − θ*‖₂.

Lemma 2 and the triangle inequality imply that v_t is close to ∇F(θ_t). This observation, plus Lemma 3 applied to the resulting inequality, yields the claimed bound. Since concavity of F implies F(θ*) − F(θ_t) ≤ ∇F(θ_t)ᵀ(θ* − θ_t), the result follows.

We proceed with the proof of Theorem 1. Summing the bound of Lemma 4 over t = 0, ..., T − 1 and telescoping yields a bound on the average optimality gap. The first inequality in that bound is a direct consequence of Lemma 4; the second follows from the diameter bound on Θ, which controls ‖θ_t − θ*‖₂ for all t. As long as the step sizes η_t and scalings σ_t are chosen appropriately, the error terms are controlled, and since F(θ*) − F(θ_T) is bounded by the resulting expression, Theorem 1 follows.

VI-B Proof of Theorem 2

In this section we flesh out the convergence results for robust gradient descent when F is assumed to be Lipschitz with parameter L, smooth with parameter β and strongly concave with parameter λ.

Lemma 5

For any θ* in Θ, the per-step contraction of the optimality gap is controlled by the recovery error of the estimated gradient.

Recall that θ_{t+1} is the projection of θ_t + η_t v_t onto the convex set Θ, and that projections onto convex sets are contractions. As a consequence:

‖θ_{t+1} − θ*‖₂ ≤ ‖θ_t + η_t v_t − θ*‖₂.

Lemma 2 and the triangle inequality imply that v_t is close to ∇F(θ_t). This observation, plus Lemma 3 applied to the resulting inequality, yields the claimed bound. Since strong concavity of F strengthens the concavity inequality by the additional term −(λ/2)‖θ* − θ_t‖²₂, the result follows.

The proof of Theorem 2 follows from Lemma 5. Indeed, unrolling the per-step contraction over t = 0, ..., T − 1, the first inequality is a direct consequence of Lemma 5, and the second follows from the diameter bound on Θ for all t. For an appropriate choice of the step sizes η_t, the term labeled I in the resulting inequality vanishes, and as long as σ is chosen small enough we obtain a geometric decay of the optimality gap. Theorem 2 follows.