A typical approach to machine learning problems is to define an objective function and optimize it over a dataset. First-order optimization methods are widely used for this purpose since they can scale to high-dimensional problems and their convergence rates are independent of problem dimensionality in most cases. However, gradients are not available in many important settings such as control, black-box optimization, and interactive learning with humans in the loop. Derivative-free optimization (DFO) can be used to tackle such problems. The challenge is that the sample complexity of DFO scales poorly with the problem dimensionality. The design of DFO methods that solve high-dimensional problems with low sample complexity is a major open problem.
The success of deep learning methods suggests that high-dimensional data that arises in real-world settings can commonly be represented in low-dimensional spaces via learned nonlinear features. In other words, while the problems of interest are high-dimensional, the data typically lies on low-dimensional manifolds. If we could perform the optimization directly in the manifold instead of the full space, intuition suggests that we could reduce the sample complexity of DFO methods, since their convergence rates are generally a function of the problem dimensionality (Nesterov & Spokoiny, 2017; Dvurechensky et al., 2018). In this paper, we focus on high-dimensional data distributions that are drawn from low-dimensional manifolds. Since the manifold is typically not known prior to the optimization, we pose the following question: can we develop an adaptive derivative-free optimization algorithm that learns the manifold in an online fashion while performing the optimization?
. However, they are limited to linear subspaces. In contrast, we propose to use expressive nonlinear models (specifically, neural networks) to represent the manifold. Our approach not only increases expressivity but also enables utilization of domain knowledge concerning the geometry of the problem. For example, if the function of interest is known to be translation invariant, convolutional networks can be used to represent the underlying manifold structure. On the other hand, the high expressive power and flexibility brings challenges. Our approach requires solving for the parameters of the nonlinear manifold at each iteration of the optimization. To address this, we develop an efficient online method that learns the underlying manifold while the function is being optimized.
We specifically consider random search methods and extend them to the nonlinear manifold learning setting. Random search methods choose a set of random directions and perform perturbations to the current iterate in these directions. Differences of the function values computed at perturbed points are used to compute an estimator for the gradient of the function. We first extend this to random search over a known manifold and show that sampling directions in the tangent space of the manifold provides a similar estimate. We then propose an online learning method that estimates this manifold while jointly performing the optimization. We theoretically analyze sample complexity and show that our method reduces it. We conduct extensive experiments on continuous control problems, continuous optimization benchmarks, and gradient-free optimization of an airfoil. The results indicate that our method significantly outperforms state-of-the-art derivative-free optimization algorithms from multiple research communities.
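We are interested in high-dimensional stochastic optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}_{\xi}\left[F(x, \xi)\right], \qquad (1)$$

where $x \in \mathbb{R}^d$ is the optimization variable and $f: \mathbb{R}^d \to \mathbb{R}$ is the function of interest, which is defined as expectation over a noise variable $\xi$. We assume that the stochastic function $F$ is bounded ($|F(x, \xi)| \le \Omega$), $L$-Lipschitz, and $\mu$-smooth with respect to $x$ for all $\xi$, and has uniformly bounded variance ($\mathbb{E}_{\xi}[(F(x, \xi) - f(x))^2] \le V_F$). Here $L$-Lipschitz and $\mu$-smooth mean $|f(x) - f(y)| \le L \|x - y\|_2$ and $\|\nabla f(x) - \nabla f(y)\|_2 \le \mu \|x - y\|_2$ for all $x, y$. In DFO, we have no access to the gradients. Instead, we only have zeroth-order access by evaluating the function $F$ (i.e. sampling $F(x, \xi)$ for the input $x$).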
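We are specifically interested in random search methods in which an estimate of the gradient is computed using function evaluations at points randomly sampled around the current iterate. Before we formalize this, we introduce some definitions. Denote the $d$-dimensional unit sphere and unit ball by $\mathbb{S}^{d-1}$ and $\mathbb{B}^{d}$, respectively. We define a smoothed function following Flaxman et al. (2005). For a function $f: \mathbb{R}^d \to \mathbb{R}$, its $\delta$-smoothed version is $\hat{f}(x) = \mathbb{E}_{v \sim \mathbb{B}^d}[f(x + \delta v)]$. The main workhorse of random search is the following result by Flaxman et al. (2005). Let $s \sim \mathcal{U}(\mathbb{S}^{d-1})$. Then $\frac{d}{\delta} f(x + \delta s)\, s$ is an unbiased estimate of the gradient of the smoothed function:

$$\mathbb{E}_{s \sim \mathbb{S}^{d-1}}\left[\frac{d}{\delta} f(x + \delta s)\, s\right] = \nabla_x \hat{f}(x). \qquad (2)$$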
A simple way to optimize the function of interest is to use the gradient estimate in stochastic gradient descent (SGD), as summarized in Algorithm 1. This method has been analyzed in various forms and its convergence is characterized well for nonconvex smooth functions. We restate the convergence rate and defer the constants and proof to Appendix A.2.
Let be differentiable, -Lipschitz, and -smooth. Consider running random search (Algorithm 1) for steps. Let for simplicity. Then
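Algorithm 1 (Random Search) and Algorithm 2 (Manifold Random Search) share the same loop: at each step, compute a gradient estimate (GradEst in Algorithm 1, ManifoldGradEst in Algorithm 2) and take an SGD step with it.

procedure GradEst: Sample random directions; Query the function at the perturbed points; form the Estimator from the queried values; return the estimate.

procedure ManifoldGradEst: Normalize the Jacobian with GramSchmidt; Sample random directions; Query the function at the perturbed points; form the Estimator from the queried values; return the estimate.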
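As an illustration of the random search loop described above, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the common antithetic two-point form of the estimator (function values at $x \pm \delta s$), averaged over `k` directions; the names `grad_est`, `random_search`, `step_size`, and `k` are hypothetical and chosen only for this example.

```python
import numpy as np

def sample_unit_sphere(k, d, rng):
    """Draw k directions uniformly from the unit sphere in R^d."""
    s = rng.standard_normal((k, d))
    return s / np.linalg.norm(s, axis=1, keepdims=True)

def grad_est(F, x, delta, k, rng):
    """Two-point random-search gradient estimate (assumed antithetic form)."""
    d = x.shape[0]
    S = sample_unit_sphere(k, d, rng)
    # y[i] = F(x + delta * s_i) - F(x - delta * s_i)
    y = np.array([F(x + delta * s) - F(x - delta * s) for s in S])
    # d / (2 * delta * k) * sum_i y_i s_i
    return (d / (2.0 * delta * k)) * (y @ S)

def random_search(F, x0, step_size, delta, k, num_steps, seed=0):
    """Plain SGD driven by the zeroth-order gradient estimate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad_est(F, x, delta, k, rng)
    return x
```

On a noiseless quadratic such as `F = lambda x: float(x @ x)`, this loop drives the objective toward zero; on real problems the step size and `delta` would need tuning.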
3 Online Learning to Guide Random Search
Proposition 1 implies that the sample complexity of random search scales linearly with the dimensionality. This dependency is problematic when the function of interest is high-dimensional. We argue that in many practical problems, the function of interest lies on a low-dimensional nonlinear manifold. This structural assumption will allow us to significantly reduce the sample complexity of random search, without knowing the manifold a priori.
Assume that the function of interest is defined on an -dimensional manifold () and this manifold can be defined via a nonlinear parametric family (e.g. a neural network). Formally, we are interested in derivative-free optimization of functions with the following properties:
Smoothness: is -smooth and -Lipschitz for all .
Manifold: is defined on an -dimensional manifold for all .
Representability: The manifold and the function of interest can be represented using parametrized function classes and . Formally, given , there exist and such that .
We will first consider an idealized setting where the manifold is already known (i.e. we know ). Then we will extend the developed method to the practical setting where the manifold is not known in advance and must be estimated with no prior knowledge as the optimization progresses.
3.1 Warm-up: Random Search over a Known Manifold
If the manifold is known a priori, we can perform random search directly over the manifold instead of the full space. Consider the chain rule applied to $f(x) = g(r(x; \theta^\star))$ as $\nabla_x f(x) = J(x; \theta^\star) \nabla_r g(r)$, where $J(x; \theta^\star) = \partial r(x; \theta^\star) / \partial x$ and $r = r(x; \theta^\star)$. The gradient of the function of interest lies in the column space of the Jacobian of the parametric family. In light of this result, we can perform random search in the column space of the Jacobian, which has lower dimensionality than the full space.
For numerical stability, we will first orthonormalize the Jacobian using the Gram-Schmidt procedure, and perform the search in the column space of this orthonormal matrix since it spans the same space. We denote the orthonormalized version of by .
In order to perform random search, we sample an -dimensional vector uniformly () and lift it to the input space via . As a consequence of the manifold Stokes’ theorem, using the lifted vector as a random direction gives an unbiased estimate of the gradient of the smoothed function as
where the smoothed function is defined as . We show this result as Lemma 1 in Appendix A.1. We use the resulting gradient estimate in SGD. The following proposition summarizes the sample complexity of this method. The constants and the proof are given in Appendix A.2.
Let be differentiable, -Lipschitz, and -smooth. Consider running manifold random search (Algorithm 2) for steps. Let for simplicity. Then
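The manifold gradient estimate of this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions: the Jacobian `J` is passed in as a `d x n` matrix with full column rank, a reduced QR factorization stands in for Gram-Schmidt (both yield an orthonormal basis of the same column space), and the estimator uses an assumed two-point form. The function name and arguments are hypothetical.

```python
import numpy as np

def manifold_grad_est(F, x, J, delta, k, rng):
    """Random-search gradient estimate restricted to the column space of J (d x n)."""
    # Orthonormalize the columns of J; reduced QR spans the same space as Gram-Schmidt.
    U, _ = np.linalg.qr(J)                      # U has shape (d, n)
    n = U.shape[1]
    # Sample k directions uniformly on the unit sphere in R^n.
    S = rng.standard_normal((k, n))
    S /= np.linalg.norm(S, axis=1, keepdims=True)
    lifted = S @ U.T                            # rows are U s_i, unit vectors in R^d
    y = np.array([F(x + delta * v) - F(x - delta * v) for v in lifted])
    # The scaling uses the manifold dimension n instead of the ambient dimension d.
    return (n / (2.0 * delta * k)) * (y @ lifted)
```

By construction the returned vector lies in the column space of `J`, so the search only spends function evaluations on the `n` directions along which the function can change.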
3.2 Joint Optimization and Manifold Learning
When , the reduction in the sample complexity of random search (summarized in Proposition 2) is significant. However, the setting of Algorithm 2 and Proposition 2 is impractical since the manifold is generally not known a priori. We thus propose to minimize the function and learn the manifold jointly. In other words, we start with an initial guess of the parameters and solve for them at each iteration using all function evaluations that have been performed so far.
Our major objective is to improve the sample efficiency of random search. Hence, minimizing the sample complexity with respect to manifold parameters is an intuitive way to approach the problem. We analyze the sample complexity of SGD using biased gradients in Appendix A.3.1 and show the following informal result. Consider running manifold random search with a sequence of manifold parameters for steps. Then the additional suboptimality caused by biased gradients, defined as , is bounded as follows:
where is the suboptimality of the oracle case (Algorithm 2). Our aim is to minimize the additional suboptimality with respect to and . However, we do not have access to since we are in a derivative-free setting. Hence we cannot directly minimize (4).
At each iteration, we observe . Moreover, , due to the smoothness. Since we observe the projection of the gradient onto the chosen directions, we minimize the projection of (4) onto these directions. Formally, we define our one-step loss as
We use Follow the Regularized Leader (FTRL) to minimize the aforementioned loss function and learn the manifold parameters:
where the regularizer is a temporal smoothness term that penalizes sudden changes in the gradient estimates.
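To make the idea of fitting projected gradients concrete, the sketch below scores a candidate gradient by how well it explains the observed finite differences along the sampled directions. It is an assumed squared-error form for illustration only and does not reproduce the exact loss or regularizer of the paper; `one_step_loss`, `model_grad`, and `directions` are hypothetical names. It relies on the fact that, for a smooth function, $F(x + \delta s) - F(x - \delta s) \approx 2 \delta\, s^\top \nabla F(x)$.

```python
import numpy as np

def one_step_loss(model_grad, directions, y, delta):
    """Mean squared gap between observed and predicted directional derivatives.

    model_grad: (d,) gradient predicted by the learned model at the iterate.
    directions: (k, d) unit directions used for the queries.
    y:          (k,) observed differences F(x + delta s_i) - F(x - delta s_i).
    """
    observed = y / (2.0 * delta)            # approximately s_i^T grad F(x)
    predicted = directions @ model_grad     # s_i^T grad of the surrogate
    return float(np.mean((observed - predicted) ** 2))
```

Only directional information enters this loss, which matches the derivative-free setting: the full gradient is never observed, only its projections onto the queried directions.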
Algorithm 3 summarizes our algorithm. We add exploration by sampling a mix of directions from the manifold and the full space. In each iteration, we sample directions and produce two gradient estimates using the samples from the tangent space and the full space, respectively. We mix them to obtain the final estimate . We discuss the implementation details of the FTRL step in Section 4. In our theoretical analysis, we assume that (6) can be solved optimally. Although this is a strong assumption, experimental results suggest that neural networks can easily fit any training data (Zhang et al., 2017). Our experiments also support this observation.
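The mixing of exploration and exploitation described above can be illustrated as follows. This is a hedged sketch: the convex combination with weight `beta` is an assumed mixing rule, not necessarily the one used in Algorithm 3, and all function names are hypothetical. The Jacobian `J` is taken as a given `d x n` matrix with full column rank.

```python
import numpy as np

def _unit_rows(rng, k, dim):
    """k directions drawn uniformly from the unit sphere in R^dim."""
    s = rng.standard_normal((k, dim))
    return s / np.linalg.norm(s, axis=1, keepdims=True)

def _two_point_estimate(F, x, directions, delta, dim):
    """Two-point estimate; dim is the dimension of the sampled subspace."""
    y = np.array([F(x + delta * v) - F(x - delta * v) for v in directions])
    return (dim / (2.0 * delta * len(directions))) * (y @ directions)

def mixed_grad_est(F, x, J, delta, k, beta, rng):
    """Blend a tangent-space estimate with a full-space estimate (assumed rule)."""
    d = x.shape[0]
    U, _ = np.linalg.qr(J)                        # orthonormal basis of col(J)
    n = U.shape[1]
    tangent_dirs = _unit_rows(rng, k, n) @ U.T    # exploitation: learned manifold
    full_dirs = _unit_rows(rng, k, d)             # exploration: full space
    g_manifold = _two_point_estimate(F, x, tangent_dirs, delta, n)
    g_full = _two_point_estimate(F, x, full_dirs, delta, d)
    return (1.0 - beta) * g_manifold + beta * g_full
```

The full-space directions keep the learning problem well posed even when the current manifold estimate is poor, while the tangent-space directions carry most of the optimization once the estimate improves.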
Theorem 1 states our main result concerning the sample complexity of our method. As expected, the sample complexity includes both the input dimensionality and the manifold dimensionality . On the other hand, the sample complexity only depends on rather than . Thus our method significantly decreases sample complexity when .
Let be bounded, -Lipschitz, and -smooth. Consider running learned manifold random search (Algorithm 3) for steps. Let and for simplicity. Then
We provide a short proof sketch here and defer the detailed proof and constants to Appendix A.3. We start by analyzing SGD with bias. The additional suboptimality of using instead of can be bounded by (4).
The empirical loss we minimize is the projection of (4) onto randomly chosen directions. Next, we show that the expectation of the empirical loss is (4) when the directions are chosen uniformly at random from the unit sphere:
A crucial argument in our analysis is the concentration of the empirical loss around its expectation. In order to study this concentration, we use Freedman’s inequality (Freedman, 1975), inspired by the analysis of generalization in online learning by Kakade & Tewari (2009). Our analysis bounds the difference , where .
Next, we use the FTL-BTL Lemma (Kalai & Vempala, 2005) to analyze the empirical loss . We bound the empirical loss in terms of the distances between the iterates . Such a bound would not be useful in an adversarial setting since the adversary chooses , but we set appropriate step sizes, which yield sufficiently small steps and facilitate convergence.
Our analysis of learning requires the directions in (7) to be sampled from a unit sphere. On the other hand, our optimization method requires directions to be chosen from the tangent space of the manifold. We mix exploration (directions sampled from ) and exploitation (directions sampled from the tangent space of the manifold) to address this mismatch. We show that mixing weight yields both fast optimization and no-regret learning. Finally, we combine the analyses of empirical loss, concentration, and SGD to obtain the statement of the theorem. ∎
4 Implementation Details and Limitations
We initialize our models with standard normal distributions. Our method thus starts with random search at initialization and transitions to manifold random search as the learning progresses.
Solving FTRL. Results on training deep networks suggest that local SGD-based methods perform well. We thus use SGD with momentum as a solver for FTRL in (6). We do not solve each learning problem from scratch but initialize with the previous solution. Since this warm-started process may be vulnerable to local optima, we also periodically re-solve (6) from scratch.
Computational complexity. Our method increases the amount of computation since we need to learn a model while performing the optimization. However, in DFO, the major computational bottleneck is typically the function evaluation. When efficiently implemented on a GPU, the time spent on learning the manifold is negligible in comparison to function evaluations.
Parallelization. Random search is highly parallelizable since directions can be processed independently. Communication costs include i) sending the current iterate to workers, ii) sending directions to each corresponding worker, and iii) workers sending the function values back. When the directions are chosen independently, they can be indicated to each worker via a single integer by first creating a shared noise table in preprocessing. For a $d$-dimensional problem with $k$ random directions, these costs are $O(d)$, $O(k)$, and $O(k)$, respectively. The total communication cost is therefore $O(d + k)$. In our method, each worker also needs a copy of the Jacobian, resulting in a communication cost of $O(nd)$. Hence our method increases communication cost from $O(d + k)$ to $O(nd + k)$.

Algorithm 3 (Learned Manifold Random Search). See Algorithms 1 & 2 for definitions of GradEst and ManifoldGradEst.
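The shared noise table mentioned in the parallelization paragraph can be sketched as below. This is one possible realization under assumptions of ours: every worker builds the same table from a common seed, and a direction is identified by an integer offset into that table. The class and method names are hypothetical.

```python
import numpy as np

class SharedNoiseTable:
    """Fixed table of Gaussian noise, built identically on every worker from one seed."""

    def __init__(self, size, seed):
        self.noise = np.random.default_rng(seed).standard_normal(size)

    def sample_offset(self, d, rng):
        """Pick a valid starting index for a d-dimensional slice."""
        return int(rng.integers(0, len(self.noise) - d + 1))

    def direction(self, offset, d):
        """Return the unit direction encoded by an integer offset."""
        v = self.noise[offset:offset + d]
        return v / np.linalg.norm(v)
```

The coordinator sends only `offset` to a worker, which reconstructs the same direction locally. Slices that overlap in the table are not strictly independent, so the table should be much larger than `d`.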
We empirically evaluate the presented method (referred to as Learned Manifold Random Search (LMRS)) on the following sets of problems. i) We use the MuJoCo simulator (Todorov et al., 2012) to evaluate our method on high-dimensional control problems. ii) We use 46 single-objective unconstrained functions from the Pagmo suite of continuous optimization benchmarks (Biscani et al., 2019). iii) We use the XFoil simulator (Drela, 1989) to benchmark gradient-free optimization of an airfoil.
We consider the following baselines. i) Augmented Random Search (ARS): Random search with all the augmentations from Mania et al. (2018). ii) Guided ES (Maheswaranathan et al., 2019): A method to guide random search by adapting the covariance matrix. iii) CMA-ES (Hansen, 2016): Adaptive derivative-free optimization based on evolutionary search. iv) REMBO (Wang et al., 2016): A Bayesian optimization method which uses random embeddings in order to scale to high-dimensional problems. Although CMA-ES and REMBO are not based on random search, we include them for the sake of completeness. Additional implementation details are provided in Appendix B.
5.1 Learning Continuous Control
Following the setup of Mania et al. (2018), we use random search to learn control of highly articulated systems. The MuJoCo locomotion suite (Todorov et al., 2012) includes six problems of varying difficulty. We evaluate our method and the baselines on all of them. We use linear policies and include all the tricks (whitening the observation space and scaling the step size using the variance of the rewards) from Mania et al. (2018). We report average reward over five random experiments versus the number of episodes (i.e. number of function evaluations) in Figure 1. We also report the average number of episodes required to reach the prescribed reward threshold at which the task is considered ‘solved’ in Table 1. We include proximal policy optimization (PPO) (Schulman et al., 2017; Hill et al., 2018) for reference. Note that our results are slightly different from the numbers reported by Mania et al. (2018) as we use 5 random seeds instead of 3.
| Task | Threshold | LMRS | ARS | CMA-ES | Guided ES | PPO | REMBO | No learning | Offline learning |
The results suggest that our method improves upon ARS in all environments. Our method also outperforms all other baselines. The improvement is particularly significant for high-dimensional problems such as Humanoid. Our method is at least twice as efficient as ARS in all environments except Swimmer, which is the only low-dimensional problem in the suite. Interestingly, Guided-ES fails to solve the Humanoid task, which we think is due to biased gradient estimation. Furthermore, CMA-ES performs similarly to ARS. These results suggest that a challenging task like Humanoid is out of reach for heuristics like local adaptation of the covariance matrix due to high stochasticity and nonconvexity.
REMBO only solves the Swimmer task and fails to solve others. We believe this is due to the fact that these continuous control problems have no global structure and are highly nonsmooth. The number of possible sets of contacts with the environment is combinatorial in the number of joints, and each contact configuration yields a distinct reward surface. This contradicts the global structure assumption in Bayesian optimization.
Wall-clock time analysis. Our method performs additional computation as we learn the underlying manifold. In order to quantify the effect of the additional computation, we perform wall-clock time analysis and plot average reward vs wall-clock time in Figure 2. Our method outperforms all baselines with similar margins to Figure 1. The trends and shapes of the curves in Figures 1 and 2 are similar. This is not surprising since computation requirements of all the optimizers are rather negligible when compared with the simulation time of MuJoCo.
The only major differences we notice are on the Hopper task. Here the margin between our method and the baselines narrows and the relative ordering of Guided-ES and ARS changes. This is due to the fact that simulation stops when the agent falls. In other words, the simulation time depends on the current solution. Methods that query the simulator for these unstable solutions lose less wall-clock time.
Quantifying manifold learning performance. In order to evaluate the learning performance, we project the gradient of the function to the tangent space of the learned manifold and plot the norm of the residual. Since we do not have access to the gradients, we estimate them at time instants, evenly distributed through the learning process. We perform accurate gradient estimation using a very large number of directions (). We compute the norm of the residual of the projection as , where is projection onto the column space of . The results are visualized in Figure 3. Our method successfully and quickly learns the manifold in all cases.
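The residual computation described above can be sketched as follows. This is a minimal sketch assuming the Jacobian is available as a `d x n` matrix with full column rank and that the unnormalized Euclidean norm of the residual is reported; whether the paper normalizes the residual is not recoverable from the text. The function name is hypothetical.

```python
import numpy as np

def projection_residual(grad, J):
    """Norm of the part of grad that lies outside the column space of J (d x n)."""
    Q, _ = np.linalg.qr(J)              # orthonormal basis for col(J)
    projected = Q @ (Q.T @ grad)        # P grad, with P = Q Q^T
    return float(np.linalg.norm(grad - projected))
```

A residual near zero indicates that the estimated gradient is almost entirely captured by the tangent space of the learned manifold.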
Ablation studies. Our method uses three major ideas. i) We learn a manifold that the function lies on. ii) We learn this manifold in an online fashion. iii) We perform random search on the learned manifold. To study the impact of each of these ideas, we perform the following experiments. i) No learning. We randomly initialize the manifold by sampling the entries of from the standard normal distribution. Then we perform random search on this random manifold. ii) No online learning. We collect an offline training dataset by sampling values uniformly at random from a range that includes the optimal solutions. We evaluate function values at sampled points and learn the manifold. We perform random search on this manifold without updating the manifold model. iii) No search. We use the gradients of the estimated function () as surrogate gradients and minimize the function of interest using first-order methods.
We list the results in Table 1. We do not include the no-search baseline since it fails to solve any of the tasks. Failure of the no-search baseline suggests that the estimated functions are powerful enough to guide the search, but not accurate enough for optimization. The no-learning baseline outperforms ARS on the simplest problem (Swimmer), but either fails completely or increases sample complexity on other problems, suggesting that random features are not effective, especially on high-dimensional problems. Although the offline learning baseline solves more tasks than the no-learning one, it has worse sample complexity since initial offline sampling is expensive. This study indicates that all three of the ideas that underpin our method are important.
5.2 Continuous Optimization Benchmarks
We use continuous optimization problems from the Pagmo problem suite (Biscani et al., 2019). This benchmark includes minimization of 46 functions such as Rastrigin, Rosenbrock, Schwefel, etc. (See Appendix B for the complete list.) We use ten random starting points and report the average number of function evaluations required to reach a stationary point. Figure 4 reports the results as performance profiles (Dolan & Moré, 2002). Performance profiles represent how frequently a method is within distance $\tau$ of optimality. Specifically, if we denote the number of function evaluations that method $m$ requires to solve problem $p$ by $T_m(p)$ and the number of function evaluations used by the best method by $T^\star(p) = \min_m T_m(p)$, the performance profile is the fraction of problems for which method $m$ is within distance $\tau$ of the best: $\rho_m(\tau) = \frac{1}{N_p} \sum_{p} \mathbb{1}\left[T_m(p) - T^\star(p) \le \tau\right]$, where $\mathbb{1}[\cdot]$ is the indicator function and $N_p$ is the number of problems.
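The performance profile described above can be computed in a few lines. The sketch below assumes the evaluation counts are stored in an array `T` with one row per method and one column per problem, with `np.inf` marking an unsolved problem; the array layout and function name are illustrative choices of ours.

```python
import numpy as np

def performance_profile(T, tau):
    """Fraction of problems on which each method is within tau of the best method.

    T[m, p]: function evaluations method m needs on problem p (np.inf if unsolved).
    Assumes every problem is solved by at least one method.
    """
    T = np.asarray(T, dtype=float)
    best = T.min(axis=0)                # best evaluation count per problem
    within = (T - best) <= tau          # indicator per (method, problem)
    return within.mean(axis=1)          # average over problems, one value per method
```

For example, with `T = [[100, 200, 300], [150, 180, inf]]` and `tau = 50`, the first method is within 50 evaluations of the best on all three problems and the second on two of them, giving profile values of 1 and 2/3.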
As can be seen in Figure 4, our method outperforms all baselines. The success of our method is not surprising since the functions are typically defined as nonconvex functions of some statistics, inducing manifold structure by construction. REMBO (Bayesian optimization) is close to our method and outperforms the other baselines. We believe this is due to the global geometric structure of the considered functions. Both CMA-ES and Guided-ES outperform ARS.
5.3 Optimization of an Airfoil
We apply our method to gradient-free optimization of a 2D airfoil. We use a computational fluid dynamics (CFD) simulator, XFoil (Drela, 1989), which can simulate an airfoil using its contour plot. We parametrize the airfoils using smooth polynomials of up to degrees. We model the upper and lower parts of the airfoil with different polynomials. The dimensionality of the problem is thus . XFoil can simulate various viscosity properties, speeds, and angles of attack. The details are discussed in Appendix B. We plot the resulting airfoil after 1500 simulator calls in Table 2. We also report the lift and drag of the resulting shape. The objective we optimize is . Table 2 suggests that all methods find airfoils that can fly (). Our method yields the highest . Bayesian optimization outperforms the other baselines.
|Airfoil after 1500 Simulations|
5.4 Effect of Manifold and Problem Dimensionality
In this section, we perform a controlled experiment to understand the effect of problem dimensionality ($d$) and manifold dimensionality ($n$). We generate a collection of synthetic optimization problems. All synthesized functions follow the manifold hypothesis: $f(x) = g(r(x))$, where $r$ is a multilayer perceptron and $g$ is a randomly sampled convex quadratic function. In order to sample a convex quadratic function, we sample the parameters of the quadratic function from a Gaussian distribution and project the result to the space of convex quadratic functions.
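A synthetic problem of this kind could be generated as in the sketch below. The details are assumptions made for illustration: the multilayer perceptron has a single `tanh` hidden layer of width `hidden`, and the projection onto convex quadratics is carried out by symmetrizing a Gaussian matrix and clipping its negative eigenvalues, which is the Frobenius-norm projection onto the positive semidefinite cone. None of these choices is confirmed by the text.

```python
import numpy as np

def sample_convex_quadratic(n, rng):
    """Sample g(z) = 0.5 z^T A z + b^T z with A projected onto the PSD cone."""
    M = rng.standard_normal((n, n))
    A = 0.5 * (M + M.T)                              # symmetrize the Gaussian sample
    w, V = np.linalg.eigh(A)
    A_psd = (V * np.clip(w, 0.0, None)) @ V.T        # clip negative eigenvalues
    b = rng.standard_normal(n)
    return A_psd, b

def make_synthetic_problem(d, n, hidden, rng):
    """Build f(x) = g(r(x)) with an assumed one-hidden-layer MLP r: R^d -> R^n."""
    W1 = rng.standard_normal((hidden, d)) / np.sqrt(d)
    W2 = rng.standard_normal((n, hidden)) / np.sqrt(hidden)
    A, b = sample_convex_quadratic(n, rng)

    def r(x):
        return W2 @ np.tanh(W1 @ x)

    def f(x):
        z = r(x)
        return float(0.5 * z @ A @ z + b @ z)

    return f, r, A
```

Varying `n` for a fixed `d` then gives a family of problems on which the effect of manifold dimensionality can be studied in isolation.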
We choose and plot the objective value with respect to the number of function evaluations for various manifold dimensionalities in Figure 5. The results suggest that for a given ambient dimensionality , the lower the dimensionality of the data manifold , the more sample-efficient our method. In concordance with our theoretical analysis, the improvement is very significant when , as can be seen in the cases and .
Interestingly, our method is effective even when the manifold assumption is violated (). We hypothesize that this is due to anisotropy in the geometry of the problem. Although all directions are important when , some will result in faster search since the function changes more along them. It appears that our method can identify these directions and thus accelerate the search.
6 Related Work
Derivative-free optimization. We summarize the work on DFO that is relevant to our paper. For a complete review, readers are referred to Custódio et al. (2017) and Conn et al. (2009). We are specifically interested in random search methods, which have been developed as early as Matyas (1965) and Rechenberg (1973). Convergence properties of these methods have recently been analyzed by Agarwal et al. (2010), Bach & Perchet (2016), Nesterov & Spokoiny (2017), and Dvurechensky et al. (2018). A lower bound on the sample complexity for the convex case has been given by Duchi et al. (2015) and Jamieson et al. (2012). Bandit convex optimization is also highly relevant and we utilize the work of Flaxman et al. (2005) and Shamir (2013).
Random search for learning continuous control. Learning continuous control is an active research topic that has received significant interest in the reinforcement learning community. Recently, Salimans et al. (2017) and Mania et al. (2018) have shown that random search methods are competitive with state-of-the-art policy gradient algorithms in this setting. Vemula et al. (2019) analyzed this phenomenon theoretically and characterized the sample complexity of random search and policy gradient methods for continuous control.
Adaptive random search. There are various methods in the literature that adapt the search space by using anisotropic covariance as in the case of CMA-ES (Hansen et al., 2003; Hansen, 2016), guided evolutionary search (Maheswaranathan et al., 2019), and active subspace methods (Choromanski et al., 2019). There are also methods that enforce structure such as orthogonality in the search directions (Choromanski et al., 2018). Other methods use information geometry tools as in Wierstra et al. (2014) and Glasmachers et al. (2010). Lehman et al. (2018) use gradient magnitudes to guide neuro-evolutionary search. Staines & Barber (2012) use a variational lower bound to guide the search. In contrast to these methods, we explicitly posit nonlinear manifold structure and directly learn this latent manifold via online learning. Our method is the only one that can learn an arbitrary nonlinear search space given a parametric class that characterizes its geometry.
Adaptive Bayesian optimization. Bayesian optimization (BO) is another approach to zeroth-order optimization with desirable theoretical properties (Srinivas et al., 2010). Although we are only interested in methods based on random search, some of the ideas we use have been utilized in BO. Calandra et al. (2016) used the manifold assumption for Gaussian processes. In contrast to our method, they use autoencoders for learning the manifold and assume initial availability of offline data. Similarly, Djolonga et al. (2013) consider the case where the function of interest lies on some linear manifold and collect offline data to identify this manifold. In contrast, we only use online information and our models are nonlinear. Wang et al. (2016) and Kirschner et al. (2019) propose using random low-dimensional features instead of adaptation. Rolland et al. (2018) design adaptive BO methods for additive models. Major distinctions between our work and the adaptive BO literature include our use of nonlinear manifolds, no reliance on offline data collection, and formulation of the problem as online learning.
We presented Learned Manifold Random Search (LMRS): a derivative-free optimization algorithm. Our algorithm learns the underlying geometric structure of the problem online while performing the optimization. Our experiments suggest that LMRS is effective on a wide range of problems and significantly outperforms prior derivative-free optimization algorithms from multiple research communities.
- Agarwal et al. (2010) Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory (COLT), 2010.
- Bach & Perchet (2016) Francis R. Bach and Vianney Perchet. Highly-smooth zero-th order online optimization. In Conference on Learning Theory (COLT), 2016.
- Biscani et al. (2019) Francesco Biscani, Dario Izzo, Wenzel Jakob, Marcus Martens, Alessio Mereta, Cord Kaldemeyer, Sergey Lyskov, Sylvain Corlay, Benjamin Pritchard, Kishan Manani, et al. Esa/Pagmo2: Pagmo 2.10, 2019.
- Calandra et al. (2016) Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold Gaussian processes for regression. In International Joint Conference on Neural Networks (IJCNN), 2016.
- Choromanski et al. (2018) Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning (ICML), 2018.
- Choromanski et al. (2019) Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, and Vikas Sindhwani. From complexity to simplicity: Adaptive ES-active subspaces for blackbox optimization. In Neural Information Processing Systems, 2019.
- Conn et al. (2009) Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009.
- Custódio et al. (2017) Ana Luísa Custódio, Katya Scheinberg, and Luís Nunes Vicente. Methodologies and software for derivative-free optimization. Advances and Trends in Optimization with Engineering Applications, 2017.
- Djolonga et al. (2013) Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Neural Information Processing Systems, 2013.
- Dolan & Moré (2002) Elizabeth D. Dolan and Jorge J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91, 2002.
- Drela (1989) Mark Drela. XFOIL: An analysis and design system for low Reynolds number airfoils. In Low Reynolds Number Aerodynamics, 1989.
- Duchi et al. (2015) John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 2015.
- Dvurechensky et al. (2018) Pavel Dvurechensky, Alexander Gasnikov, and Eduard Gorbunov. An accelerated method for derivative-free smooth stochastic convex optimization. arXiv:1802.09022, 2018.
- Flaxman et al. (2005) Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Symposium on Discrete Algorithms (SODA), 2005.
- Freedman (1975) David A Freedman. On tail probabilities for martingales. Annals of Probability, 3(1), 1975.
- Gardner et al. (2018) Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Neural Information Processing Systems, 2018.
- Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2013.
- Ginsbourger et al. (2010) David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems. Springer, 2010.
- Glasmachers et al. (2010) Tobias Glasmachers, Tom Schaul, Yi Sun, Daan Wierstra, and Jürgen Schmidhuber. Exponential natural evolution strategies. In Genetic and Evolutionary Computation Conference (GECCO), 2010.
- Hansen (2016) Nikolaus Hansen. The CMA evolution strategy: A tutorial. arXiv:1604.00772, 2016.
- Hansen et al. (2003) Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1), 2003.
- Hansen et al. (2019) Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on Github. Zenodo, DOI:10.5281/zenodo.2559634, February 2019.
- Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4), 2016.
- Hill et al. (2018) Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, et al. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.
- Jamieson et al. (2012) Kevin G. Jamieson, Robert D. Nowak, and Benjamin Recht. Query complexity of derivative-free optimization. In Neural Information Processing Systems, 2012.
- Kakade & Tewari (2009) Sham M Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Neural Information Processing Systems, 2009.
- Kalai & Vempala (2005) Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3), 2005.
- Kirschner et al. (2019) Johannes Kirschner, Mojmír Mutný, Nicole Hiller, Rasmus Ischebeck, and Andreas Krause. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (ICML), 2019.
- Kroese et al. (2013) Dirk P Kroese, Thomas Taimre, and Zdravko I Botev. Handbook of Monte Carlo Methods. John Wiley & Sons, 2013.
- Lehman et al. (2018) Joel Lehman, Jay Chen, Jeff Clune, and Kenneth O. Stanley. Safe mutations for deep and recurrent neural networks through output gradients. In Genetic and Evolutionary Computation Conference (GECCO), 2018.
- Maheswaranathan et al. (2019) Niru Maheswaranathan, Luke Metz, George Tucker, and Jascha Sohl-Dickstein. Guided evolutionary strategies: Escaping the curse of dimensionality in random search. In International Conference on Machine Learning (ICML), 2019.
- Mania et al. (2018) Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Neural Information Processing Systems, 2018.
- Marmin et al. (2015) Sébastien Marmin, Clément Chevalier, and David Ginsbourger. Differentiating the multipoint expected improvement for optimal batch design. In Workshop on Machine Learning, Optimization, and Big Data, 2015.
- Matyas (1965) J Matyas. Random optimization. Automation and Remote Control, 26(2), 1965.
- Nesterov & Spokoiny (2017) Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 2017.
- Rechenberg (1973) Ingo Rechenberg. Evolutionsstrategie - optimierung technischer systeme nach prinzipien der biologischen information. Stuttgart-Bad Cannstatt: Friedrich Frommann Verlag, 1973.
- Rolland et al. (2018) Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
- Rontsis et al. (2017) Nikitas Rontsis, Michael A. Osborne, and Paul J. Goulart. Distributionally ambiguous optimization techniques for batch Bayesian optimization. arXiv:1707.04191, 2017.
- Salimans et al. (2017) Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864, 2017.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
- Shalev-Shwartz (2012) Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2), 2012.
- Shamir (2013) Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT), 2013.
- Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
- Staines & Barber (2012) Joe Staines and David Barber. Variational optimization. arXiv:1212.4507, 2012.
- Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.
- Vemula et al. (2019) Anirudh Vemula, Wen Sun, and J. Andrew Bagnell. Contrasting exploration in parameter and action space: A zeroth-order optimization perspective. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
- Wang et al. (2016) Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55, 2016.
- Wierstra et al. (2014) Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research (JMLR), 15, 2014.
- Wilson & Nickisch (2015) Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning (ICML), 2015.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Appendix A Proofs
A.1 Gradient Estimator
In this section, we show that when random directions are sampled in the column space of the orthonormal matrix , perturbations give biased gradient estimates of the manifold smoothed function. Moreover, when , resulting gradients are unbiased. We formalize this with the following lemma.
Let be an orthonormal basis for the tangent space of the -dimensional manifold at point , be another orthonormal matrix, and be a function defined on this manifold. Fix . Then
where . Moreover, bias is and the resulting estimator is unbiased when .
Without loss of generality, we can assume and . Using this remark, the proof of the lemma is a straightforward application of the manifold Stokes’ theorem
where vol denotes volume, and we use the definition of the expectation in (a, e), orthonormality of in (b, d), the manifold Stokes’ theorem in (c), and the fact that the ratio of the volume to the surface area of a dimensional ball of radius is in (e). Moreover, the bias vanishes when . ∎
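The statement of the lemma can be checked numerically in a simple case. The sketch below is an illustration under stated assumptions, not the construction used in the proof: it writes U for the orthonormal matrix, takes a hypothetical linear test function (for which smoothing does not change the gradient), and uses a two-sided difference, which has the same expectation as the one-sided perturbation but lower variance. All function and variable names are chosen here for illustration. For a linear function with gradient a, the scaled estimator averages to U Uᵀ a, so it recovers a only when a lies in the column space of U.

```python
import math
import random

def sample_unit_sphere(n, rng):
    """Draw a point uniformly from the unit sphere in R^n (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def subspace_gradient_estimate(f, x, basis, delta, num_samples, rng):
    """Monte Carlo mean of (n / (2 delta)) (f(x + delta U s) - f(x - delta U s)) U s.

    `basis` stores the n orthonormal columns of U as a list of vectors in R^d.
    """
    n, d = len(basis), len(x)
    total = [0.0] * d
    for _ in range(num_samples):
        s = sample_unit_sphere(n, rng)
        # direction = U s, a unit vector inside the column space of U
        direction = [sum(s[j] * basis[j][i] for j in range(n)) for i in range(d)]
        x_plus = [x[i] + delta * direction[i] for i in range(d)]
        x_minus = [x[i] - delta * direction[i] for i in range(d)]
        scale = n * (f(x_plus) - f(x_minus)) / (2.0 * delta)
        for i in range(d):
            total[i] += scale * direction[i]
    return [t / num_samples for t in total]

def project_onto_span(basis, a):
    """Return U U^T a for the orthonormal columns stored in `basis`."""
    out = [0.0] * len(a)
    for u in basis:
        coeff = sum(ui * ai for ui, ai in zip(u, a))
        for i in range(len(a)):
            out[i] += coeff * u[i]
    return out

# Hypothetical test problem: linear f whose gradient `a` is NOT in span(U).
a = [1.0, 3.0, 2.0]
f = lambda x: sum(ai * xi for ai, xi in zip(a, x))
r = 1.0 / math.sqrt(2.0)
basis = [[r, r, 0.0], [0.0, 0.0, 1.0]]  # two orthonormal columns in R^3

rng = random.Random(0)
estimate = subspace_gradient_estimate(f, [0.5, -1.0, 2.0], basis, 0.1, 50000, rng)
projected = project_onto_span(basis, a)          # U U^T a = [2, 2, 2]
bias = [p - ai for p, ai in zip(projected, a)]   # [1, -1, 0]: nonzero off the subspace
```

In this example the Monte Carlo average approaches the projected gradient rather than the true gradient. Replacing `a` with a vector inside the span of `basis` makes the bias vanish, which mirrors the unbiased case of the lemma.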
A.2 Sample Complexity for Random Search and Manifold Random Search
In this section, we bound the sample complexity of the random search (Algorithm 1) and the manifold random search (Algorithm 2). Our analysis starts with studying the relationship between the function () and its smoothed () as well as manifold smoothed () versions in section A.2.2. We show that Lipschitzness and smoothness of the function extend to the smoothed functions. Moreover, we also bound the difference between the gradients of the original function and the gradients of the smoothed versions. Next, we study the second moment of the gradient estimator in section A.2.3. Finally, we state the sample complexity of SGD on non-convex functions in section A.2.1. Combining these results, we state the final sample complexity of random search and manifold random search in sections A.2.4 and A.2.5.
A.2.1 Convergence of SGD for non-convex functions
The convergence of SGD has been widely studied. Here we state its convergence result for non-convex functions from Ghadimi & Lan (2013) as a lemma and give its proof for the sake of completeness.
Consider running SGD on that is -smooth and -Lipschitz for steps starting with initial solution . Denote where is the globally optimal point and assume that the unbiased gradient estimate has second moment bounded with . Then,
We denote the step size as and the unbiased gradient estimate as . We analyze the step at as:
where we used the smoothness of the function. Taking expectation of the inequality,
Using the bounded second moment of the gradient, and summing from step to ,
Re-arranging the terms, we obtain,
Set , and divide the inequality by to obtain the required inequality as
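For reference, the standard form of this argument can be written out as follows. The notation is assumed for this sketch and the constants may differ from those in the lemma: $f$ is $\mu$-smooth, $g_t$ is an unbiased gradient estimate at $x_t$ with $\mathbb{E}\|g_t\|_2^2 \le V$, $\alpha$ is the step size, $x_{t+1} = x_t - \alpha g_t$, and $\Omega = f(x_1) - f(x^\star)$.

```latex
\begin{align*}
f(x_{t+1}) &\le f(x_t) - \alpha \langle \nabla f(x_t), g_t \rangle
              + \frac{\mu \alpha^2}{2} \|g_t\|_2^2, \\
\mathbb{E}[f(x_{t+1})] &\le \mathbb{E}[f(x_t)]
              - \alpha\, \mathbb{E}\|\nabla f(x_t)\|_2^2 + \frac{\mu \alpha^2}{2} V, \\
\alpha \sum_{t=1}^{T} \mathbb{E}\|\nabla f(x_t)\|_2^2
           &\le f(x_1) - \mathbb{E}[f(x_{T+1})] + \frac{T \mu \alpha^2 V}{2}
            \le \Omega + \frac{T \mu \alpha^2 V}{2}, \\
\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\|\nabla f(x_t)\|_2^2
           &\le \frac{\Omega}{\alpha T} + \frac{\mu \alpha V}{2}
            = \sqrt{\frac{2 \Omega \mu V}{T}}
            \quad \text{for } \alpha = \sqrt{\frac{2 \Omega}{\mu V T}}.
\end{align*}
```

The step size shown balances the two terms on the right-hand side. It is one common choice and is not necessarily the one used in the lemma above.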
A.2.2 Preliminary Results on Smoothed Functions
First, we show that the smoothness and Lipschitzness properties of apply to and .
where we use Lipschitz continuity of in (a), and,
where we use Jensen’s inequality and convexity of the norm in (b) and smoothness of in (c). Hence, smoothness and Lipschitzness apply to for any . Taking and , smoothness and Lipschitzness also apply to .
Next, we will study the impact of using the gradients of the smoothed function instead of the original function.
where we use . We further bound the left term as
using the dominated convergence theorem in (a), the smoothness of in (b), and the orthonormality of together with the fact that the norm of any point in a unit ball is bounded by 1. By taking and , this result also implies the same for . Hence,
A.2.3 Second Moment of the Gradient Estimator
We will start with studying the second moment of our gradient estimate for the manifold case. We bound the expected square norm of the gradient estimate as
where we use orthonormality of and the unit norm property of in (a), add and subtract and use in (b), and use the bounded variance of and the Lipschitz smoothness of in (c).
The second moment of the random search estimator can be computed similarly, and the resulting bound would be