I Introduction
Consider the following blackbox optimization problem:

(1)    max_{θ ∈ ℝ^d} F(θ),

where F: ℝ^d → ℝ takes as input a sequence of parameters θ encoding a policy π_θ : S → A (S and A standing for the state space and action space respectively) and outputs the total (expected) reward obtained by an agent applying this policy in a given environment. Since typically the environment is a blackbox physics simulator, or even a piece of real hardware, F admits only function evaluations and cannot be paired with explicit analytical gradients. Blackbox/ES, or derivative-free ([1, 2, 3, 4]), algorithms for RL and robotics aim to maximize F
by applying various random search techniques, while avoiding explicit gradient computation. Typically, in each epoch a parameter vector θ_t encoding a policy π_{θ_t} is updated by the following general rule ([1]):

(2)    θ_{t+1} = θ_t + η ∑_{g ∈ S_t} w(F(θ_t + g)) g,

where S_t is a subset of the set {g_1, …, g_k} of chosen random perturbations/samples (for some k ∈ ℕ) defining perturbed versions π_{θ_t + g} of the given policy π_{θ_t}, the function w translates rewards obtained by perturbed policies into actual weights, and η > 0 is a step size. Directions g_1, …, g_k are taken independently at random, usually from a multivariate Gaussian distribution N(0, σ²I_d) for some σ > 0. Weight functions w include rescaled raw rewards, centered ranks of rewards, and more. Subsets S_t are chosen according to different filtering heuristics, e.g. F is evaluated at the perturbed points θ_t + g_j for j = 1, …, k and the algorithm keeps only the perturbations corresponding to the policies producing the highest rewards. Despite not relying on the internal structure of the RL problem, these methods can be highly competitive with state-of-the-art policy gradient approaches ([5], [6], [7], [8]), while admitting much simpler and embarrassingly parallelizable implementations, where different workers compute rewards obtained by different perturbed policies independently.
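The mechanics of update rule (2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy quadratic reward, the sampling budget `num_samples` and the weight choice (a forward finite difference, w(r) = (r − F(θ))/σ) are assumptions for the sake of the example.

```python
import numpy as np

def es_gradient_estimate(F, theta, num_samples=5000, sigma=0.1, seed=0):
    """One-epoch ES gradient estimate in the spirit of Eq. (2):
    average over Gaussian perturbations g of w(F(theta + sigma*g)) * g,
    with the weight w chosen as the finite difference (r - F(theta)) / sigma."""
    rng = np.random.default_rng(seed)
    f0 = F(theta)
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        g = rng.standard_normal(theta.shape[0])
        grad += (F(theta + sigma * g) - f0) / sigma * g
    return grad / num_samples

# Toy blackbox reward: concave quadratic with known gradient -theta.
F = lambda th: -0.5 * float(th @ th)
theta = np.array([1.0, -2.0])
estimate = es_gradient_estimate(F, theta)  # should approximate [-1.0, 2.0]
```

Note that each inner-loop evaluation of `F` corresponds to a full rollout of a perturbed policy, which is why sampling complexity dominates the cost of such methods.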
However, in order to obtain good policies, even in a completely noiseless setting, a large number of long-horizon rollouts may be required, which quickly becomes a computational bottleneck. For instance, the ES algorithms proposed in [1] require thousands of CPUs to get competitive results. Furthermore, policies obtained by updates as in Equation 2 are very sensitive to noisy measurements, e.g. when the dynamics model used in the simulator does not accurately represent the true dynamics in certain regions of the state space. Hence, the central motivation for this paper is improving the data efficiency of such methods and proposing an approach that is much more robust to the noisy measurements notoriously present in robotics.
We propose a class of new blackbox optimization algorithms for optimizing RL policies that inherit all the benefits of the standard random search methods described above, such as conceptual simplicity and agnosticism to the structure of the optimized blackbox functions, yet can handle substantial adversarial measurement noise and are characterized by lower sampling complexity than the above methods. We call them robust blackbox optimization algorithms, or simply: RBO.
Our approach fundamentally differs from the previously described random search techniques. We propose to estimate gradients of the blackbox function, and to conduct optimization with them, by solving generalized regression/compressed sensing regularized optimization problems. The related computational overhead is negligible in comparison to the time spent querying the blackbox function, and the reconstruction is accurate in the noisy measurement setting even if a constant fraction of all the measurements of the interactions with the environment are arbitrarily inaccurate, as our theoretical results show (see: Appendix). Our proposed LP decoding-based ES optimization, an instantiation of the RBO class, is particularly resilient to substantial noise in the measurement space (see: Fig. 1).
Our methods can be naturally aligned with sliding trust region techniques for efficient sample reuse, to further reduce sampling complexity. In contrast, ES methods require independent sampling in each epoch. We show that RBO can be applied in locomotion tasks where training is conducted in the presence of substantial noise, e.g. for learning in simulation transferable walking behaviors for quadruped robots or for training quadrupeds to follow a path. We further demonstrate RBO on OpenAI Gym RL tasks. We manage to train effective policies even if a substantial fraction of all measurements is arbitrarily corrupted, where standard ES methods produce suboptimal policies or do not manage to learn at all. Our experiments are backed by theoretical results.
To summarize, we propose and comprehensively benchmark a combination of the following techniques:

- By sampling the function locally, we recover the gradient via under- or over-constrained linear regression (depending on the noise level), wherein the sparsity and smoothness of unknown gradients can be exploited by L1 or L2 regularizers.

- Noise coming from sources such as stochastic environment dynamics or rewards, or even the deterministic error associated with the Taylor approximation, can be handled via robust regression loss functions such as L1, Huber, or least trimmed loss.

- We use a sliding trust region to sample the blackbox function and reuse samples that overlap with previous iterates. This brings an off-policy flavor to blackbox methods, reminiscent of mature methods developed in the DFO literature [9].
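The robust-loss idea above can be illustrated with a minimal sketch: recovering a gradient from linear measurements when a fraction of them are outliers, using a Huber loss minimized by iteratively reweighted least squares (IRLS). The problem sizes, the corruption pattern and the helper `huber_irls` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def huber_irls(Z, y, delta=1.0, iters=50):
    """Minimize the Huber loss of the residuals Z v - y over v via IRLS.
    Residuals larger than delta are down-weighted by delta/|r|."""
    v = np.linalg.lstsq(Z, y, rcond=None)[0]   # least-squares warm start
    for _ in range(iters):
        r = Z @ v - y
        w = np.where(np.abs(r) <= delta, 1.0,
                     delta / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        v = np.linalg.lstsq(sw[:, None] * Z, sw * y, rcond=None)[0]
    return v

rng = np.random.default_rng(0)
k, d = 50, 5
g_true = rng.standard_normal(d)            # unknown gradient to recover
Z = rng.standard_normal((k, d))            # perturbation matrix
y = Z @ g_true
y[:10] += 50.0                             # corrupt 20% of the measurements
v_ols = np.linalg.lstsq(Z, y, rcond=None)[0]   # plain least squares
v_huber = huber_irls(Z, y)                     # robust recovery
```

Under this corruption, plain least squares is pulled far from the true gradient while the Huber fit stays close, which is exactly the behavior the robust losses in the second bullet are meant to provide.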
This paper is organized as follows. In Section II we introduce our RBO algorithm for ES policy optimization. In Section III we give convergence results for certain subclasses of RBO algorithms based on Linear Programming (LP) decoding techniques with strong noise robustness guarantees. In Section IV we provide an exhaustive empirical evaluation of our methods on OpenAI Gym tasks as well as quadruped robot locomotion tasks.

II Robust Blackbox Optimization Algorithm
RBO relies on the fact that, by using different directions g defining perturbed policies, one can obtain good estimates of the dot-product of the gradient of the RL blackbox function F at a given point with the vectors defined by these directions, simply by performing rollouts of these perturbed policies and collecting the obtained rewards. If F is not differentiable, the notion of the gradient should be replaced by that of a smoothing of F (applied on a regular basis to optimize nonconvex non-differentiable functions, see [12]), but since the analysis is completely similar, from now on, without loss of generality, we will assume that F is smooth (it can actually be shown that by applying our techniques in the non-differentiable setting, one conducts blackbox optimization using different smooth proxies of F, depending on the probabilistic distributions used to sample perturbations).
Fix a perturbation direction g ∈ ℝ^d and a smoothing parameter σ > 0. Notice that the following holds:

(3)    F(θ + σg) = F(θ) + σ∇F(θ)ᵀg + O(σ²‖g‖²).

Thus for small σ > 0 the following is true:

(4)    ∇F(θ)ᵀg ≈ (F(θ + σg) − F(θ)) / σ.

We call the right-hand side of Equation 4 the forward finite-difference estimation of the action of the gradient on g. By a similar analysis, we can obtain the antithetic finite-difference estimation of the action of the gradient on g:

(5)    ∇F(θ)ᵀg ≈ (F(θ + σg) − F(θ − σg)) / (2σ).
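A quick numerical check of estimators (4) and (5) on a smooth toy function (the function, evaluation point and direction are arbitrary choices for illustration): the antithetic estimator cancels the second-order Taylor term and is typically more accurate for the same σ.

```python
import numpy as np

def forward_fd(F, theta, g, sigma):
    # Eq. (4): forward finite-difference estimate of grad F(theta) . g
    return (F(theta + sigma * g) - F(theta)) / sigma

def antithetic_fd(F, theta, g, sigma):
    # Eq. (5): antithetic finite-difference estimate of grad F(theta) . g
    return (F(theta + sigma * g) - F(theta - sigma * g)) / (2.0 * sigma)

F = lambda th: np.exp(th[0]) + np.sin(th[1])      # smooth toy blackbox
theta = np.array([0.5, 0.3])
g = np.array([1.0, 2.0])
true_dot = np.exp(0.5) * 1.0 + np.cos(0.3) * 2.0  # exact grad F(theta) . g
err_fwd = abs(forward_fd(F, theta, g, 0.01) - true_dot)
err_anti = abs(antithetic_fd(F, theta, g, 0.01) - true_dot)
```

The forward estimator incurs an O(σ) error while the antithetic one incurs only O(σ²), at the cost of one extra rollout per direction.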
With this characterization, we can formulate the problem of finding an approximate gradient as a regression problem (we choose the forward finite-difference estimation of the action of the gradient, but a completely analogous analysis can be done for the antithetic one). Given scalars F(θ + σg_1), …, F(θ + σg_k) (corresponding to rewards obtained by k different perturbed versions of the policy encoded by θ), we formulate the regression problem by considering input vectors σg_j with regression values y_j = F(θ + σg_j) − F(θ) for j = 1, …, k. We propose to solve this regression task by solving the following minimization problem:

(6)    min_{v ∈ ℝ^d}  ‖Zv − y‖_p^p + α‖v‖_q,

where p, q ≥ 1, Z ∈ ℝ^{k×d} is the matrix with rows σg_jᵀ encoding perturbations (and where the sequence of these rows is sampled from some given joint multivariate distribution P), the vector y ∈ ℝ^k consists of the regression values (i.e. y_j = F(θ + σg_j) − F(θ) for j = 1, …, k) and α ≥ 0 is a regularization parameter.
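Two natural instantiations of problem (6) can be contrasted numerically: a smooth ℓ2 (ridge-type) solver versus the ℓ1 solver (LP decoding, expressed below as a linear program). The problem sizes and the corruption model are illustrative assumptions; the point is that the ℓ1 fit recovers the gradient essentially exactly even with a constant fraction of arbitrarily corrupted measurements, while the ℓ2 fit does not.

```python
import numpy as np
from scipy.optimize import linprog

def ridge(Z, y, alpha=1e-3):
    # p = q = 2: closed-form ridge regression.
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(d), Z.T @ y)

def lp_decode(Z, y):
    # p = 1, alpha = 0: min_v ||Z v - y||_1 as an LP over (v, t),
    # minimizing sum(t) subject to -t <= Z v - y <= t.
    k, d = Z.shape
    c = np.concatenate([np.zeros(d), np.ones(k)])
    A_ub = np.block([[Z, -np.eye(k)], [-Z, -np.eye(k)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * k
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]

rng = np.random.default_rng(1)
k, d = 60, 6
g_true = rng.standard_normal(d)
Z = rng.standard_normal((k, d))
y = Z @ g_true
idx = rng.choice(k, size=12, replace=False)   # corrupt 20% of the entries
y[idx] += rng.standard_normal(12) * 30.0
err_ridge = np.linalg.norm(ridge(Z, y) - g_true)
err_lp = np.linalg.norm(lp_decode(Z, y) - g_true)
```

The LP reformulation introduces one slack variable per measurement bounding the absolute residual, which is the standard way to express an ℓ1 objective in linear programming.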
Note that various known regression methods arise by instantiating the above optimization problem with different values of p, q and α. In particular, p = 2, q = 2 leads to the ridge regression algorithm ([13]); p = 2, q = 1 leads to the Lasso method ([14]); and p = 1, α = 0 leads to LP decoding ([15]). The latter is especially important, with applications ranging from dimensionality reduction techniques ([16]) to database retrieval algorithms ([15]) and compressive sensing methods in medical imaging ([17]). We show empirically and theoretically that in the ES context it leads to the most robust policy learning algorithms, insensitive to substantial adversarial noise. Note also that, as opposed to the standard random search ES methods described before, the perturbations do not need to be taken from a multivariate Gaussian distribution and they do not even need to be independent. This observation will play a crucial role when we propose to reduce sampling complexity by reusing policy rollouts.

II-A Sliding Trust Region for ES optimization
The main advantage of the above regression-based optimization algorithms for RL blackbox function gradient approximation is that it is not necessary to sample from a fixed distribution at each step in order to apply them. Instead, we use the idea that at any vector θ_t encoding the current policy, a good-quality estimator of the corresponding gradient can be deduced from evaluations of the blackbox function at any parameter point cloud around it. This prompts the idea of using a trust region approach [18], where perturbations are reused from epoch to epoch. Reusing samples allows us to reduce sampling complexity, since it reduces the number of times the blackbox function is called, e.g. the number of times the simulator is used. We propose two simple trust region techniques for sample reuse and show that they work very well in practice (see: Section IV). Denote by θ_t the current parameter vector obtained throughout the optimization process. In the first strategy, called the static trust region method, all previously evaluated perturbed policies that are within a radius R from θ_t are reused to approximate the gradient of F at θ_t (where R is a tuned hyperparameter). In the second strategy, which we call the dynamic trust region method, only a fixed fraction of the previously evaluated perturbed policies that are closest to θ_t is reused (where this fraction is another hyperparameter). Our algorithm taking advantage of the above techniques is presented in the Algorithm 1 box (we present the dynamic trust region version). At each epoch of the procedure the regularized regression problem is solved to estimate the gradient. The estimate is then used to update the policy parameters. The last step in the loop of the algorithm makes sure that the resulting parameter vector is in the domain Θ of allowed parameter vectors.
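A minimal sketch of the dynamic trust region idea, with hypothetical names (`rbo_step_with_reuse`), a toy quadratic blackbox, plain least-squares gradient recovery and arbitrary hyperparameters: keep a buffer of evaluated perturbations, reuse the stored points closest to the current iterate, and top the set up with fresh rollouts before solving the regression.

```python
import numpy as np

def rbo_step_with_reuse(F, theta, buffer, k=20, sigma=0.1, frac_reuse=0.5,
                        lr=0.1, rng=None):
    """One epoch: reuse the frac_reuse * k stored evaluations closest to
    theta, sample the rest fresh, fit the gradient by least squares."""
    rng = rng or np.random.default_rng(0)
    n_reuse = int(frac_reuse * k)
    pts = sorted(buffer, key=lambda p: np.linalg.norm(p[0] - theta))[:n_reuse]
    while len(pts) < k:                        # top up with fresh rollouts
        g = rng.standard_normal(theta.shape[0])
        x = theta + sigma * g
        pts.append((x, F(x)))                  # one new simulator call
    buffer[:] = pts                            # slide the trust region buffer
    f0 = F(theta)
    Z = np.stack([x - theta for x, _ in pts])  # displacements from theta
    y = np.array([fx - f0 for _, fx in pts])   # finite-difference values
    v = np.linalg.lstsq(Z, y, rcond=None)[0]   # gradient estimate
    return theta + lr * v

F = lambda th: -float((th - 1.0) @ (th - 1.0))   # optimum at the all-ones vector
theta = np.zeros(4)
buffer = []
for _ in range(60):
    theta = rbo_step_with_reuse(F, theta, buffer)
```

Because half of each epoch's regression points come from the buffer, the number of fresh simulator calls per epoch is roughly halved relative to resampling everything, which is the sampling-complexity saving the trust region is after.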
III Convergence Results for Robust Gradient Recovery
In this section we provide results regarding the convergence of the subclass of RBO algorithms with p = 1 and α = 0 (i.e. using LP decoding to reconstruct the gradient of the blackbox function F) in the noisy setting. This noisy setting is valid in a variety of practically relevant scenarios, e.g. when some of the trajectories used during training suffer from noisy state measurements, when a constant fraction of the trajectories are generated via a simulator while the remaining ones are produced on the real system, or when the training procedure has access to a fairly robust simulator that may nevertheless substantially diverge from the real dynamics for up to twenty-three percent of all the trajectories. To obtain rigorous theoretical guarantees, we will need certain structural assumptions regarding F. However, as we see in Section IV, those are not actually required in practice, and furthermore other subclasses of RBO algorithms are also capable of learning good policies. All proofs are given in the Appendix.
We need the following definitions.
Definition 1 (coefficient )
Let and denote: . Let be the of and be its function. Define . Function is continuous and decreasing in the interval and furthermore . Since , there exists such that . We define as:
(7) 
Its exact numerical value is .
Definition 2 (smoothness)
A differentiable concave function F: Θ → ℝ is smooth with parameter β > 0 if for every pair of points x, y ∈ Θ:

F(y) ≥ F(x) + ∇F(x)ᵀ(y − x) − (β/2)‖y − x‖².

If F is twice differentiable, this is equivalent to −βI ⪯ ∇²F(x) ⪯ 0 for all x ∈ Θ.
Definition 3 (Lipschitz)
We say that F is Lipschitz with parameter L if for all x, y ∈ Θ it satisfies |F(x) − F(y)| ≤ L‖x − y‖.
We are ready to state our main theoretical result.
Theorem 1
Consider a blackbox function F: Θ → ℝ. Assume that F is concave, Lipschitz with parameter L and smooth with smoothness parameter β. Assume furthermore that the domain Θ ⊆ ℝ^d is convex and has diameter B. Consider Algorithm 1 with p = 1, α = 0 in the noisy setting in which at each step a fraction of at most ρ of all measurements is arbitrarily corrupted, for some ρ below the coefficient of Definition 1. Then there exists a universal constant c > 0 such that for any δ > 0 and a sufficiently large number of steps T, with probability at least 1 − δ the suboptimality gap F(θ*) − max_{t ≤ T} F(θ_t), where θ* = argmax_{θ ∈ Θ} F(θ), decays to zero with T at a rate governed by L, β, B and ρ.
If F presents extra curvature properties, such as being strongly concave, we can get a linear convergence rate.
Definition 4 (Strong concavity)
A function F is strongly concave with parameter λ > 0 if for all x, y ∈ Θ:

F(y) ≤ F(x) + ∇F(x)ᵀ(y − x) − (λ/2)‖y − x‖².
The following theorem holds:
Theorem 2
Assume the conditions of Theorem 1 and furthermore that F is strongly concave with parameter λ. Take Algorithm 1 with p = 1, α = 0, acting in the noisy environment in which at each step a fraction of at most ρ of all measurements is arbitrarily corrupted, for some ρ below the coefficient of Definition 1. Then there exists a universal constant c > 0 such that for any δ > 0 and a sufficiently large number of steps T, with probability at least 1 − δ the optimality gap F(θ*) − F(θ_T) decays at a linear (geometric) rate in T.
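The mechanism behind Theorem 2 can be sketched end to end on a strongly concave quadratic. This is an idealized illustration, not the paper's algorithm: we model each of the k measurements as an exact directional derivative, adversarially corrupt 20% of them at every step, recover the gradient by LP decoding, and take a gradient ascent step; the iterates still contract geometrically toward the optimum. All problem sizes and constants are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

def lp_decode(Z, y):
    # min_v ||Z v - y||_1 as an LP over (v, t): minimize sum(t),
    # subject to -t <= Z v - y <= t.
    k, d = Z.shape
    c = np.concatenate([np.zeros(d), np.ones(k)])
    A_ub = np.block([[Z, -np.eye(k)], [-Z, -np.eye(k)]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d + [(0, None)] * k, method="highs")
    return res.x[:d]

rng = np.random.default_rng(2)
d, k, steps = 4, 40, 25
theta_star = rng.standard_normal(d)                # unknown optimum
grad = lambda th: -2.0 * (th - theta_star)         # gradient of -||th - th*||^2
theta = np.zeros(d)
for _ in range(steps):
    Z = rng.standard_normal((k, d))
    y = Z @ grad(theta)                            # idealized exact measurements
    idx = rng.choice(k, size=8, replace=False)     # corrupt 20% per step
    y[idx] += rng.standard_normal(8) * 25.0
    theta = theta + 0.25 * lp_decode(Z, y)         # gradient ascent step
```

Because the corrupted entries are sparse, the ℓ1 fit recovers the true gradient at every step and the usual strongly-concave contraction argument goes through unchanged.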
IV Experiments
We conducted an exhaustive analysis of the proposed class of RBO algorithms on a suite of OpenAI Gym [19] benchmark RL tasks. We also used RBO to learn policies for the quadruped locomotion tasks (see: Section IV-A). We tested both large-noise settings, where a substantial fraction of all the measurements was corrupted, and heavily underconstrained settings, where the number of chosen perturbations per epoch is much smaller than the dimensionality of the policy. In these settings we still learned good quality policies with a small number of rollouts, whereas other methods either failed or needed more perturbations. We trained locomotion policies for the quadruped robot in the PyBullet simulator ([20]).
Our main focus is to compare RBO with other ES methods, in particular the state-of-the-art Augmented Random Search (ARS) algorithm from [2]. ARS can be easily adjusted to different policy architectures, is characterized by lower sampling complexity than other ES algorithms, and was shown to outperform algorithms such as those from [1] on many different robotics tasks. This is possible due to the use of various efficient ES heuristics such as state and reward renormalization, as well as perturbation filtering methods that substantially lower sampling complexity. ARS was also used to train quadruped locomotion policies in [21]. Additionally, we added a comparison with non-ES policy gradient methods, even though this class uses the MDP structure of the RL blackbox function under consideration, which ARS and RBO are agnostic to.
We tested two policy architectures: linear (on its own for some tasks, or as part of a larger hybrid policy for the quadruped locomotion tasks) and nonlinear with two hidden layers and nonlinear activations, with connections encoded by low displacement rank (LDR) Toeplitz matrices from [10]. Below we refer to them as linear and LDR/nonlinear policies respectively.
On some plots we present different variants of the RBO algorithm, whereas on others we show one (e.g. if all variants gave similar training curves, or if a particular version was most suitable, such as the LP-decoding one for the overconstrained setting with substantial noise). Presented results were obtained using the dynamic trust region method.
IV-A Locomotion for quadruped robots
We tested our algorithm on two different quadruped locomotion tasks derived from a real quadruped robot (‘Minitaur’ from Ghost Robotics, see: Fig. 1), with different reward functions encouraging different behaviors. Minitaur has four legs and eight degrees of freedom, where each leg has the ability to swing and extend to a certain degree using the PD controller provided with the robot. We train our policies in simulation using the PyBullet environment modeled after the robot ([20]). To learn walking for quadrupeds, we use architectures called Policies Modulating Trajectory Generators (PMTGs) that have recently been proposed in [21]. The architecture incorporates basic cyclic characteristics of locomotion and leg movement primitives by using trajectory generators: parameterized functions that provide cyclic leg positions. The policy is responsible for modulating and adjusting leg trajectories as needed for the environment.
Straight walking with changing speeds
The robot is rewarded for walking at the desired speed which is changed during the episode. This task is identical to the original one tested with PMTG. More details about the environment such as speed profile during the episode, reward calculation as well as observation and action space definitions can be found in [21].
The RBO algorithm was compared with other methods on the task of training the linear component of the PMTG architecture. We compared with different variants of the ARS algorithm (with the filtering heuristic turned on or off), since it was the method of choice in [21]. In the noiseless setting all methods produce similar training curves, as we see in Fig. 3 (a). The results are very different if a substantial fraction of all measurements is arbitrarily corrupted (Fig. 3 (b)). Then, RBO manages to learn good walking behavior much faster than ARS. In that setting ARS with filtering turned on (used in [21]), which is much more sensitive to noise, does not work at all, and thus we did not present it on the plot. Results from Fig. 3 were obtained with a fixed number of perturbations per epoch; we obtained similar results for other training configurations (see: Fig. 2).
Path tracking problem for quadruped robots
The second task is to learn walking while staying within a specific curved 2D path (shown in black in Fig. 4), where the path forces the robot to steer left and right. The task is much harder because the legs of the robot do not have the third degree of freedom that would provide the sideways motion needed to turn easily. The robot is rewarded for staying within the path boundary and moving forward. The observations include the robot's IMU data, its position and orientation. The robot's joint motor positions are output as actions. We use the same PMTG architecture as before.
As for learning stable walking behaviors, RBO was responsible here for training the linear component of a larger hybrid policy. We obtained similar results as before. The policy successfully steering the robot to follow the path is presented in Fig. 4. Training curves obtained for two random seeds for this task are presented in Fig. 5. The rewards translate to good steering policies.
IV-B OpenAI Gym tasks
We focused on locomotion tasks, since RBO worked particularly well on the locomotion tasks for quadruped robots described in detail above. However, we show that it is not limited to locomotion, applying it successfully also to a manipulation environment, where ARS with a similar number of perturbations per epoch did not manage to learn good quality policies. All the plots were created by averaging over several random seeds.
The first set of results is presented in Fig. 6, where we compare RBO with other methods on two tasks in the noiseless underconstrained setting, where the number of perturbations is substantially smaller than the number of parameters of the nonlinear policy. In both cases RBO learns effective policies whereas other methods do not. In particular, ARS learns a suboptimal policy for the first task, where the agent moves more to the side instead of moving forward, which we show in the first row of Fig. 7. In the second row of Fig. 7 we show the effective policy learned by RBO. An analogous pictorial representation of the policies learned for the manipulation task is presented in Fig. 8. The ARS policy does not manage to steer the robotic arm close to the object, but the RBO policy does so successfully.
In Fig. 9 we present results in the noiseless and noisy settings, where a substantial fraction of all the measurements is arbitrarily corrupted. In the noiseless setting RBO and ARS provide similar training curves, both superior to the policy gradient baselines, but in the noisy setting RBO outperforms ARS. All trained policies are nonlinear.
In Fig. 11 we present the results for three more tasks. For the first one we tested a noiseless setting with a small number of perturbations per epoch for an LDR policy. With this number of perturbations RBO learns policies obtaining high rewards, outperforming ARS. For the second, with a fraction of the measurements corrupted and the same policy architecture, RBO and ARS give similar training curves. For the third we chose a linear policy architecture, which was shown in [2] to work well for most tasks with the ARS algorithm. The setting is overconstrained. We injected substantial noise, causing a number of measurements per epoch to be arbitrarily inaccurate. In that setting ARS does not manage to learn at all (as is also the case for the other baseline algorithms not shown on the plot), whereas RBO (using LP decoding) learns good policies.
RBO policies in action, trained with only a few perturbations per epoch with LDR nonlinear architectures, are presented in Fig. 10. Both policies lead to optimal behaviors.
V Conclusion
We proposed a new class of algorithms, called RBO, for RL blackbox optimization, with better sampling complexity than baselines applying standard random search ES methods. They rely on careful gradient reconstruction via regularized regression/LP decoding methods. We show empirically and theoretically that not only do our algorithms learn good quality policies faster than the state of the art, but they are also much less sensitive to noisy measurement regimes, notoriously present in robotics applications.
References
 [1] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” 2017.
 [2] H. Mania, A. Guy, and B. Recht, “Simple random search provides a competitive approach to reinforcement learning,” CoRR, vol. abs/1803.07055, 2018. [Online]. Available: http://arxiv.org/abs/1803.07055

 [3] J. Lehman, J. Chen, J. Clune, and K. O. Stanley, “ES is more than just a traditional finite-difference approximator,” in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2018, Kyoto, Japan, July 15-19, 2018, 2018, pp. 450–457. [Online]. Available: https://doi.org/10.1145/3205455.3205474
 [4] S. Ha and C. K. Liu, “Evolutionary optimization for parameterized whole-body dynamic motor skills,” in 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, Stockholm, Sweden, May 16-21, 2016, 2016, pp. 1390–1397. [Online]. Available: https://doi.org/10.1109/ICRA.2016.7487273
 [5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

 [6] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning (ICML), 2015.
 [7] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [8] P. Hämäläinen, A. Babadi, X. Ma, and J. Lehtinen, “PPOCMA: proximal policy optimization with covariance matrix adaptation,” CoRR, vol. abs/1810.02541, 2018. [Online]. Available: http://arxiv.org/abs/1810.02541
 [9] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization. SIAM, 2009, vol. 8.
 [10] K. Choromanski, M. Rowland, V. Sindhwani, R. E. Turner, and A. Weller, “Structured evolution with compact architectures for scalable policy optimization,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018, 2018, pp. 969–977. [Online]. Available: http://proceedings.mlr.press/v80/choromanski18a.html
 [11] M. Rowland, K. Choromanski, F. Chalus, A. Pacchiano, T. Sarlos, R. E. Turner, and A. Weller, “Geometrically coupled Monte Carlo sampling,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
 [12] Y. Nesterov and V. Spokoiny, “Random gradientfree minimization of convex functions,” Found. Comput. Math., vol. 17, no. 2, pp. 527–566, Apr. 2017.
 [13] H. Avron, K. L. Clarkson, and D. P. Woodruff, “Sharper bounds for regression and lowrank approximation with regularization,” CoRR, vol. abs/1611.03225, 2016. [Online]. Available: http://arxiv.org/abs/1611.03225
 [14] F. Santosa and W. W. Symes, “Linear inversion of bandlimited reflection seismograms,” in SIAM Journal on Scientific and Statistical Computing, 1986, pp. 1307–1330.

 [15] C. Dwork, F. McSherry, and K. Talwar, “The price of privacy and the limits of LP decoding,” in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing. ACM, 2007, pp. 85–94.
 [16] E. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” 2005.

 [17] L. Ning, K. Setsompop, O. V. Michailovich, N. Makris, C. Westin, and Y. Rathi, “A compressed-sensing approach for super-resolution reconstruction of diffusion MRI,” in Information Processing in Medical Imaging - 24th International Conference, IPMI 2015, Sabhal Mor Ostaig, Isle of Skye, UK, June 28 - July 3, 2015, Proceedings, 2015, pp. 57–68. [Online]. Available: https://doi.org/10.1007/978-3-319-19992-4_5
 [18] H. Ghanbari and K. Scheinberg, “Black-box optimization in machine learning with trust region based derivative free algorithm,” CoRR, vol. abs/1703.06925, 2017. [Online]. Available: http://arxiv.org/abs/1703.06925
 [19] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.
 [20] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2017.
 [21] A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, and V. Vanhoucke, “Policies modulating trajectory generators,” in 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Switzerland, 2931 October 2018, Proceedings, 2018, pp. 916–926. [Online]. Available: http://proceedings.mlr.press/v87/iscen18a.html
VI Appendix
VI-A Proof of Theorem 1
Let ∇F(θ_t) denote the true gradient of F at θ_t.
Lemma 1
with .
This follows immediately from a Taylor expansion and the smoothness assumption on F.
Lemma 2
For any θ, if up to a ρ fraction of the entries of y are arbitrarily corrupted, the gradient recovery optimization problem with input (Z, y) satisfies:
(8) 
Whenever and with probability
As a consequence, we can show that the first-order Taylor approximation of F around θ_t that uses the true gradient and the one using the RBO gradient are uniformly close:
Lemma 3
The following bound holds: For all :
The next lemma provides us with the first step in our convergence bound:
Lemma 4
For any θ in Θ, it holds that:
Recall that θ_{t+1} is the projection of the updated iterate onto the convex set Θ. As a consequence:
(9) 
Lemma 2 and the triangle inequality imply:
Since concavity of F implies F(θ*) − F(θ_t) ≤ ∇F(θ_t)ᵀ(θ* − θ_t), the result follows.
We proceed with the proof of Theorem 1:
(10) 
where we set . The first inequality is a direct consequence of Lemma 4. The second inequality follows because and for all .
As long as, and we have:
Since , Theorem 1 follows.
VI-B Proof of Theorem 2
In this section we flesh out the convergence results for robust gradient descent when F is assumed to be Lipschitz with parameter L, smooth with parameter β and strongly concave with parameter λ.
Lemma 5
For any θ in Θ, it holds that:
Recall that θ_{t+1} is the projection of the updated iterate onto the convex set Θ. As a consequence:
(11) 
Lemma 2 and the triangle inequality imply:
Since strong concavity of F implies F(θ*) − F(θ_t) ≤ ∇F(θ_t)ᵀ(θ* − θ_t) − (λ/2)‖θ* − θ_t‖², the result follows.
The proof of Theorem 2 follows from it. Indeed, we have:
where we set the quantities as above. The first inequality is a direct consequence of Lemma 5. The second inequality follows from the bounds above, and the term labeled I in the inequality above vanishes.
As long as , we have:
Since , Theorem 2 follows.