Structured Monte Carlo Sampling for Nonisotropic Distributions via Determinantal Point Processes

05/29/2019 ∙ by Krzysztof Choromanski, et al. ∙ Google Columbia University berkeley college 8

We propose a new class of structured methods for Monte Carlo (MC) sampling, called DPPMC, designed for high-dimensional nonisotropic distributions where samples are correlated to reduce the variance of the estimator via determinantal point processes. We successfully apply DPPMCs to problems involving nonisotropic distributions arising in guided evolution strategy (GES) methods for RL, CMA-ES techniques and trust region algorithms for blackbox optimization, improving state-of-the-art in all these settings. In particular, we show that DPPMCs drastically improve exploration profiles of the existing evolution strategy algorithms. We further confirm our results, analyzing random feature map estimators for Gaussian mixture kernels. We provide theoretical justification of our empirical results, showing a connection between DPPMCs and structured orthogonal MC methods for isotropic distributions.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 20

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Structured Monte Carlo (MC) sampling has recently received significant attention [42, 12, 13, 14, 33, 11, 34] as a universal tool to improve MC methods for applications ranging from dimensionality reduction techniques and random feature map (RFM) kernel approximation [14, 11]

to evolution strategy methods for reinforcement learning (RL)

[33, 34] and estimating sliced Wasserstein distances between high-dimensional probabilistic distributions [34]

. Structured MC methods rely on choosing samples from joint distributions where different samples are correlated in a particular way to reduce the variance of the estimator. They are also related to the class of

Quasi Monte Carlo (QMC) methods that aim to improve concentration properties of MC estimators by using low discrepancy sequences of samples to reduce integration error [41, 21].

However, the key limitation of the above techniques is that they can only be applied to isotropic distributions, since they rely on samples’ orthogonalization. For this class of methods the unbiasedness or asymptotic near-unbiasedness (for large enough dimensionality ) of the resulted orthogonal estimator follows directly from the isotropicity of the corresponding multivariate distribution.

We propose a new class of structured methods for MC sampling, called DPPMC, designed for high-dimensional non-isotropic distributions where samples are correlated to reduce the variance of the estimator via learned or non-adaptive determinantal point processes (DPPs) [23, 17]. DPPMCs are designed to work with highly non-isotropic distributions, yet they inherit accuracy gains coming from structured estimators for the isotropic ones. As opposed to other sampling mechanisms using DPPs [25, 39], we propose a general hybrid DPP-MC architecture that can be applied in a wide range of scenarios from kernel estimation to RL.

We successfully applied DPPMCs to problems involving high-dimensional nonisotropic distributions naturally arising in guided evolution strategy (GES) methods for RL [26], CMA-ES techniques and trust region methods for blackbox optimization, improving state-of-the-art in all of these settings. In particular, we show that DPPMCs drastically improve exploration profiles of the existing evolution strategy algorithms. We further confirm our results analyzing RFM-estimators for Gaussian mixture kernels [40, 36], presenting detailed comparison with state-of-the-art density quantization methods. We use MC sampling as a preprocessing step from which a DPP downsamples to construct a final set of samples. Furthermore, we provide theoretical justification of our empirical results, showing a connection between DPPMCs and structured orthogonal MC methods for isotropic distributions.

To motivate our approach, we mention the striking result from [6] showing that mixing quadratures with repulsive sampling provided by DPPs provably improves convergence rates of MC estimators. However, our algorithm is different - we do not rely on sampling from DPPs associated with multivariate orthogonal polynomials which requires cubic time. To the best of our knowledge, we are also the first to provide an extensive empirical evaluation showing that our approach is not only theoretically sound, but leads to efficient algorithms across a variety of settings.

This paper is organized as follows: (1) In Section 2 we introduce Monte Carlo methods and Determinantal Point Processes, (2) In Section 3 we introduce our DPPMC algorithm, (3) In Section 4 we present theoretical guarantees for the class of DPPMC estimators, (4) In Section 5 we present all experimental results, in particular applications to a wide spectrum of reinforcement learning tasks.

2 Towards DPPMCs: MC Methods and Determinantal Point Processes

2.1 Unstructured and Structured MC Sampling

Consider a function defined as follows:

(1)

where: is a -dimensional (not necessarily isotropic) distribution and

is some function. Several important machine learning quantities can be expressed as in Equation

1. For instance, many classes of kernel functions admit representation given by Equation 1. The celebrated Bochner’s theorem [31] states for every shift-invariant kernel :

(2)

for some distribution with density function (sometimes called spectral density

) which is a Fourier Transform of

defined as . According to Equation 2, values of the stationary kernel can be written as: for some distribution . If furthermore a stationary kernel

is a radial basis function (RBF) kernel, i.e. there exists

such that , then the above distribution is isotropic. RBF kernels include in particular the classes of Gaussian, Matérn and Laplace kernels. Other prominent classes of kernels such as angular kernels or more general Pointwise Nonlinear Gaussian kernels [14] can be also expressed via Equation 1.

Finally, in evolution strategies (ES), a blackbox optimization method frequently applied to learn policies for reinforcement learning and robotics [35, 13, 33, 10], gradients of Gaussian -smoothings of blackbox functions (ES gradients) are defined as:

(3)

An unbiased baseline MC estimator of from Equation 1 relies on independent sampling from distribution and is of the form:

(4)

where and stands for the number of samples used. In the context of dot-product kernel approximation that estimator leads to the so-called Johnson-Lindenstrauss Transforms [1, 15] and for nonlinear kernel approximation to the celebrated class of random feature map methods (see: [31]). In blackbox optimization domains it is a core part of many state-of-the-art ES methods [35, 27, 10].

In all the above applications distributions from which samples were taken are isotropic. For such , we can further enforce different samples to be exactly orthogonal, while preserving their marginal distributions. This leads to the class of the so-called orthogonal estimators [42], often characterized by lower variance than their unstructured counterparts [12, 14] followed by downstream gains (in ES optimization [13]

, Wassterstein GAN and autoencoder algorithms

[34]

or even complicated hybrid predictive state recurrent neural network architectures as in

[9]).

2.2 The Landscape of Nonisotropic Distributions

Two fundamental limitations of the class of estimators is that they need the underlying distributions to be isotropic for their (near)unbiasedness and they require the number of samples to satisfy . Unfortunately, in practice the number of MC samples required even for a relatively modest task of spherical Gaussian kernel approximation with precision

with any constant probability is of the order

(see: [31]). That problem can be addressed by stacking independent orthogonal blocks of samples. However the former problem cannot be solved since the geometry of orthogonal structured transforms is intrinsically intertwined with the isotropicity of .

Nonisotropic distributions arise in many important applications of machine learning. Several classes of non-RBF kernels are used as a more expressive tool to apply Gaussian processes (GPs) for learning hidden representation in data

[40]

. The effectiveness of GPs depends on the quality of the interpolation mechanism applying given kernel function. As noticed in

[32], RBF kernels lead to neighborhood-dominated interpolation that is unable of modelling different parts of the input space in several domains such as: geostatistics, bioinformations, signal processing.

A much more expressive family of non-monotonic (yet still stationary) kernels can be obtained by modelling corresponding spectral density (leading straightforwardly to MC estimators) with the use of Gaussian mixture distributions that are no longer isotropic.

To be more specific, take the family of Gaussian mixture kernels defined as:

(5)

where: , , is the number of Gaussian mixture components, weights define their relative contributions, and finally and stand for the mean and covariance matrix of the component. The spectral distribution for that class of kernels

is a mixture Gaussian distributions with relative weights

, means and covariance matrices of different mixture components. Thus the values of these kernels can be expressed as: for the nonisotropic defined above.

Since mixtures of Gaussians are dense in the set of distribution functions (in a weak topology sense), by applying Bochner’s theorem, we can conclude that Gaussian mixture kernels are dense in the space of all stationary kernels. The generalizations of Gaussian mixture kernels were also proved to be dense in the space of all non-stationary kernels [36].

Nonisotropic distributions also play a very important role in blackbox optimization, for instance in the CMA-ES algorithm [3, 2]

to create the populations of samples of parameters to be evaluated in each epoch of the algorithm. Finally, learned nonisotropic distributions are applied on a regular basis in guided ES algorithms for policy optimization

[26] that estimate gradients of Gaussian smoothings of the RL function by sampling from nonisotropic distributions.

2.3 Determinantal Point Processes

Consider a finite set of datapoints , where . A determinantal point process is a distribution over the subsets of of such that for some real, symmetric matrix indexed by the elements of the following holds for every :

(6)

where is sampled from and stands for the submatrix of obtained by taking rows and columns indexed by the elements of . Note that is positive semidefinite since all principal minors are nonnegative. Determinantal point processes (DPPs) satisfy several so-called negative dependence property conditions, such as: for , which can be directly derived from their algebraical definition. This makes them an interesting mechanism in applications where the goal is to subsample a diverse set of samples from a given set. To see it even more clearly, we can consider a restricted class of DPPs, the so-called L-ensembles [7], where the probability that a particular subset is chosen satisfies:

(7)

for some matrix that as before, has to be positive semidefinite. If we interpret as a kernel matrix , where is a corresponding feature map and stands for the dot-product form in the corresponding Hilbert space, then we see that under the DPP sampling process the sets of near-orthogonal samples in the Hilbert space are favorable over nearly-collinear ones. For instance, if for some (as it is the case for example for random feature map representations from [31]) then probabilities

are proportional to squared volumes of the parallelepipeds defined by feature vectors

for . Thus samples that are similar according to a given kernel are less likely to appear together in the subsampled set than those that correspond to the orthogonal elements in the corresponding Hilbert space (see also Subsection 4.1).

The DPPs described above construct subsampled sets of different sizes, but if a fixed-size subset is needed a variant of the DPP called a k-DPP can be used (see: [22]).

3 DPPMC Algorithm

We propose to estimate the expression from Equation 1 by the following procedure. We first choose the number of samples that we will average over (as in a standard baseline MC method). We then conduct oversampling by sampling independently at random samples from for some fixed multiplier

(which is the hyperparameter of the algorithm) to obtain set

. Optionally, we renormalize datapoints of so that they are all of equal lengths. We then downsample from the using -DPP and get an -element set . Finally, we estimate as:

(8)

In most practical applications it suffices to use a DPP determined by a fixed kernel function (see for instance: [28]) and we show in Section 5.2 this approach is successful for RL tasks. However, for completeness we also present a learning framework. In order to learn the right kernel determining matrix for the DPP (see: Subsection 2.3), we model this kernel as , where function is the output of the feedforward fully connected neural network.

There is an extensive literature on learning DPPs via learned mappings produced by neural networks (see: [17]

). However, most approaches focus on a different setting, where the goal is to learn the DPP from the subsets it produces (via negative maximal log-likelihood loss functions). Our neural network training is conducted as follows.

We approximate distribution by the Gaussian mixture distribution . In most interesting practical applications the nonisotropic distributions under consideration are already Gaussian mixtures (thus no approximation is needed), but in principle the method can also be applied to other nonisotropic distributions. Then we fix a training set of datapoints . In practice we use publicly available datasets (see: Subsection 5.1) with dimensionalities matching that of distribution . One can also consider synthetic datasets. Next we train the neural network to minimize the empirical mean squared error () of the DDPMC estimator of the Gaussian mixture kernel from Equation 5 corresponding to on the pairs of points from the training set (this is just one of many loss functions that can be effectively used here).

For given datapoints , the empirical of the DPPMC approximator of the Gaussian mixture kernel is given as: where , and sets are constructed by independent runs of the above procedure, where is a fixed hyperparameter determining accuracy of the estimation of

. The final loss function that we backpropagate through is the average empirical MSE over pairs of points from

.

The empirical mean squared error of kernels associated with nonisotropic distributions under consideration was chosen on purpose as an objective function minimized during training. For isotropic distributions the orthogonal structure (see: discussion about in Subsection 2.1) that was first introduced as an effective tool for minimizing mean squared error of associated kernels (via random feature map mechanism) was later rediscovered as superior to baseline methods in other downstream tasks, as we discussed in Subsection 2.1.

4 Theoretical Results

In this section we consider functions from Equation 1. All proofs of the presented results are given in the Appendix. We start by showing that DPPs can be used to provably reduce the MSE of downsampled estimators. Let be evaluation points of 111An important special case is when for all although it is not necessary for some of the results in this section to hold.. Consider the case where each datapoint is selected as part of the estimator with probability . More formally, let

be an ensemble of Bernoulli random variable with values in

and marginal probabilities . Define the unbiased downsampled estimator as:

(9)

Notice that . Let be a set of importance weights with . We show that ensembles of Bernoulli random variables sampled from a DPP can yield downsampling estimators with better variance than if these are produced i.i.d. with . Let be a marginal kernel matrix defining a DPP with marginal probabilities and such that the ensemble follows the DPP process. We consider the following subsampled ES estimator:

(10)

where . Recall that here we have: and for . We define in the analogous way, where this time samples are i.i.d. Bernoulli with parameters . In the theorem below we assume that :

Theorem 4.1.

If for all , there exists a Marginal Kernel such that:

(11)

and furthermore: .

Thus DPP-based mechanism provides more accurate estimators. As a consequence of the above theorem, we obtain guarantees for estimators of gradients of Gaussian smoothings. Let and let be its Gaussian smoothing. Let denote the ES gradient of , as defined in equation 3, and call and the corresponding unbiased downsampled iid and DPP versions of the estimator of .

Corollary 4.1.

Let be

iid normally distributed perturbations and let

such that for all be an ensemble of downsampling parameters. For any there is a marginal kernel such that:

where: the first expectation is taken with respect to both and and the second expectation is taken with respect to both and . The variance satisfies:

where the variance on the LHS of the inequality is computed with respect to and the variance on the RHS is computed with respect to .

This implies that provided we select an appropriate DPP-Kernel matrix

, DPPMC yields an unbiased estimator of the gradient of the Gaussian smoothing

of smaller variance than iid estimator. The proof of this theorem can be turned into a procedure to produce such a Kernel . When the probabilities for all , the importance weighted estimator is equivalent (with high probability) to the downsampled estimators we use in Section 5 that already outperform other methods.

4.1 Connections with Orthogonality

In this section we formalize the intuition that the most likely sets sampled under a Determinantal Point Process correspond to subsets of the dataset with orthogonal features in the kernel space. In [13] the authors study the benefits of coupling sensing directions used to build ES estimators by enforcing orthogonality between the sampling directions while preserving Gaussian marginals. It can be shown this strategy provably reduces the variance of the resulting gradient estimators. We shed light on this phenomenon through the perspective of DPPs. In what follows assume with and let be a possibly infinite feature map defining a kernel.

Theorem 4.2.

Let be an ensemble, where for all . Let with and assume there exist samples in satisfying for all . If denotes the measure over subsets of size of defined by , the most likely outcomes from are the size- pairwise orthogonal subsets of .

5 Experiments

We aim to address here the following questions: (1) Do DPPMCs help to achieve better concentration results for MC estimation? (2) Do DPPMCs provide benefits for downstream tasks? To address (1), we consider estimating kernels using random features. To address (2), we analyze applications of DPPMCs for high-dimensional blackbox optimization. We present extended ablation studies regarding parameter in the Appendix (see: Section 8.2).

Complexity:

We emphasize the conceptual simplicity of our algorithm. Improving state-of-the-art in the RL setting, where we fix an RBF kernel defining the DPP (i.e. learning is not needed) requires adding few lines of code (we include a generic 11-line example of standard DPP python implementation in Section 8.1). Learning a DPP follows the standard supervised framework. Sampling from DPPs requires a priori the eigen-decomposition of matrix , however we use fast sub-cubic (k)-DPP sampling mechanisms [19, 24]. For blackbox optimization, time complexity of DPP sampling was negligible in comparison with that for function querying. Thus wall-clock time is accurately measured by the number of timesteps/function evaluations and we show that DPPMC enhancements need substantially fewer of them. For kernel approximation, time complexity of estimating kernel values is exactly the same for the DPPMC and baseline estimator (and reduces to that of matrix-vector multiplication). DPPMC requires DPP sampling, but in that setting it is a one-time cost.

5.1 Kernel Estimation

We compare the accuracy of the baseline MC estimator of values of Gaussian mixture kernels from Equation 5 using independent samples () with those applying Quasi Monte Carlo methods () [5], estimators based on state-of-the-art quantization methods: [4], [29] and our DPPMC mechanism. We applied different QMC estimators and on each plot show the best one. We compare empirical mean squared errors of the above methods. The results are presented on dataset. DPP mechanism was trained on dataset. Mapping was encoded by standard feedforward fully connected neural network architectures with two hidden layers of size each and with nonlinearities. We analyzed Gaussian mixture kernels with different number of components . Fig. 1 shows that in all settings, DPPMC substantially outperforms all other methods. We did not include orthogonal sampling method, since it did not work for the considered kernels.

(a) (b) (c) (d)
Figure 1: Comparison of different estimators of Gaussian mixture kernels for different number of components: on dataset. On the horizontal axis: the ratio of the number of samples used and dimensionality of the datapoints. On the vertical axis: obtained empirical mean squared error.

5.2 Blackbox Optimization

ES blackbox optimization algorithms rely on sampling perturbation directions for function evaluations to optimize sets of parameters [35, 13]. We propose to improve these baseline algorithms by augmenting their sampling subroutines with DPPMCs. We consider the following baseline methods: (1) recently proposed guided ES methods, such as Guided Evolution Strategies [26], (2) Trust-Region based ES methods resusing certain samples for better time complexity [10], (3) Covariance Matrix Adaptation Evolution Strategy -, a state-of-the-art blackbox optimization algorithm [18].

In each setting, the key difference between the baseline algorithm and our proposed method is that the former carries out uniform sampling from a given distribution , while our method diversifies the set of samples using DPPMC. Using a diverse set of samples leads to more efficient exploration in the parameter space and benefits downstream training, as we show later. We used a fixed Gaussian kernel with tuned variance to determine DPP. We consider two sets of benchmark problems.

Reinforcement Learning:

In reinforcement learning (RL), at each time step an agent observes state , takes action , receives reward and transitions to the next state . A policy is a mapping from states to actions that will be conducted in that states and is parameterized by vector . The goal is to optimize that mapping to maximize expected cumulative reward over given time horizon . When framing RL as a blackbox optimization problem, the input to the blackbox function is usually a vectorized neural network and the output is a noisy estimate of the cumulative reward, obtained by executing policy in a particular environment. We consider environments: -, -, - and from the library and trained policies encoded by fully connected feedforward neural networks.

Nevergrad Functions:

Blackbox functions from the recently open-sourced library [37], using the well-known open-source implementation of - (from \(\mathrm{https://github.com/CMA}\)-). We tested functions: , , and .

We are ready to describe the considered ES algorithms.

Figure 2: Standard Guided ES versus their DPPMC enhancements on tasks. Presented are median-curves from

seeds and with inter-quartile ranges as shadowed regions.

Guided ES:

In each iteration, Guided ES methods sample perturbation vectors from the non-isotropic Gaussian distribution with an adaptive covariance matrix computed from the empirical covariance matrix of gradients obtained via a biased oracle [26] or previous estimation, as it is the case in recently proposed approaches based on ES-active subspaces. Such an adaptive non-isotropic sensing often leads to more sample-efficient estimation of the gradient by exploring subspaces where the true gradients are most likely to be. In the DPPMC enhancement of those techniques, we first sample vectors from for , and down-sample to get a subset of vectors via DPPs.

In Fig.2, we compare baseline Guided ES with its enhanced DPPMC version. The vertical axis shows the expected cumulative reward during training and the horizontal axis - the number of time steps. Each plot shows the average performance with shaded area indicating inter-quartiles across random seeds. DPPMC leads to substantially better training curves. To achieve reward in HalfCheetah-v2, baseline algorithm requires steps while DPPMC only .

Trust Region ES:

Trust Region ES methods, as those recently proposed in [10], rely on reusing perturbations from previous epochs for some and applying regression techniques to estimate gradients of blackbox functions. Those methods do not require perturbations to be independent. DPPMCs can be applied in this setting by sampling new perturbations (instead of ) and then downsampling from the set of all perurbations ( new and reused) only of them. By doing it, we do not reuse all samples, but obtain much more diverse set of perturbations that ultimately improves sampling complexity. We take .

Figure 3: trust region method using MC/ridge gradients versus its DPPMC enhancements on tasks. All curves are median-curves from seeds and with inter-quartile ranges as shadowed regions.

As we can see in Fig.3, for most of the cases DPPMC-based Trust Region ES method outperforms algorithm from [10] that uses standard Trust Region ES mechanism and was already showed to outperform vanilla ES baselines. In particular, for - the only method that manages to learn in a given timeframe is based on DPPMC sampling.

Figure 4: - (baseline) versus its DPPMC version for functions. Presented are median-curves from seeds and with inter-quartile ranges as shadowed regions.
Cma-Es:

In each iteration, - samples a set of perturbation vectors from a non-isotropic Gaussian distribution for function evaluations. Unlike for the above Guided ES methods, the covariance matrix is adapted by running weighted regression over sampled perturbations, where the weights are the function evaluations for different perturbations. Such an adaptive mechanism allows also for efficient exploration in the parameter space, and has performed robustly even for high-dimensional tasks [18, 16]. To construct the candidate pool for -, we first sample non-isotropic Gaussian vectors for , and then downsample elements via DPPs.

We compare - baseline with its DPPMC enhancement in Fig. 4. The horizontal axis shows the cumulative number of function evaluations we make as the optimization progresses, while the vertical axis shows the expected loss. Each plot shows the average performance with shaded area indicating inter-quartiles across random seeds. DPPMC achieves consistent gains across all presented benchmarks. We remark that since the open source implementation of is highly optimized, obtaining even marginal improvements across multiple benchmarks is not trivial.

6 Conclusion

We presented new sampling mechanism DPPMC based on determinantal point processes to improve standard MC methods for nonisotropic distributions. We furthermore showed the effectiveness of our approach on several downstream tasks (guided ES search, CMA-ES and trust-region methods for blackbox optimization) and provided theoretical guarantees.

References

  • [1] N. Ailon and E. Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. In SODA, 2011.
  • [2] Y. Akimoto and N. Hansen. CMA-ES and advanced adaptation mechanisms. In

    Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO 2018, Kyoto, Japan, July 15-19, 2018

    , pages 720–744, 2018.
  • [3] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. Theoretical foundation for CMA-ES from information geometry perspective. Algorithmica, 64(4):698–716, 2012.
  • [4] M. Alamgir, G. Lugosi, and U. Luxburg. Density-preserving quantization with application to graph downsampling. In M. F. Balcan, V. Feldman, and C. Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 543–559, Barcelona, Spain, 13–15 Jun 2014. PMLR.
  • [5] H. Avron, V. Sindhwani, J. Yang, and M. W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. The Journal of Machine Learning Research, 17(1):4096–4133, 2016.
  • [6] R. Bardenet and A. Hardy. Monte carlo with determinantal point processes. 2016.
  • [7] A. Borodin and E. Rains. Eynardmehta theorem, schur process, and their pfaffian analogs. In Journal of Statistical Physics, page 291–317, 2005.
  • [8] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • [9] K. Choromanski, C. Downey, B. Boots, D. Holtmann-Rice, and S. Kumar. Initialization matters: Orthogonal predictive state recurrent neural networks. In ICLR, 2018.
  • [10] K. Choromanski, A. Pacchiano, J. Parker-Holder, J. Hsu, A. Iscen, D. Jain, and V. Sindhwani. When random search is not enough: Sample-efficient and noise-robust blackbox optimization of rl policies. In https://arxiv.org/pdf/1903.02993.pdf, 2019.
  • [11] K. Choromanski, A. Pacchiano, J. Pennington, and Y. Tang. Kama-nns: low-dimenaional rotation-based neural networks. In

    International Conference on Artificial Intelligence and Statistics, AISTATS

    , 2019.
  • [12] K. Choromanski, M. Rowland, T. Sarlos, V. Sindhwani, R. Turner, and A. Weller. The geometry of random features. In AISTATS 2018, 2018.
  • [13] K. Choromanski, M. Rowland, V. Sindhwani, R. E. Turner, and A. Weller. Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 969–977, 2018.
  • [14] K. M. Choromanski, M. Rowland, and A. Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 218–227, 2017.
  • [15] A. Dasgupta, R. Kumar, and T. Sarlós. A sparse Johnson-Lindenstrauss transform. In STOC, 2010.
  • [16] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
  • [17] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 1912–1918, 2017.
  • [18] N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evol. Comput., 11(1):1–18, Mar. 2003.
  • [19] B. Kang. Fast determinantal point process sampling with application to clustering. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 2319–2327, 2013.
  • [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [21] P. Kritzer, H. Niederreiter, F. Pillichshammer, and A. Winterhof, editors. Uniform Distribution and Quasi-Monte Carlo Methods - Discrepancy, Integration and Applications, volume 15 of Radon Series on Computational and Applied Mathematics. De Gruyter, 2014.
  • [22] A. Kulesza and B. Taskar. k-dpps: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1193–1200, 2011.
  • [23] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
  • [24] C. Li, S. Jegelka, and S. Sra. Efficient sampling for k-determinantal point processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pages 1328–1337, 2016.
  • [25] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for nystrom with application to kernel methods. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2061–2070, 2016.
  • [26] N. Maheswaranathan, L. Metz, G. Tucker, D. Choi, and J. Sohl-Dickstein. Guided evolutionary strategies: augmenting random search with surrogate gradients. ICML, 2019.
  • [27] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. CoRR, abs/1803.07055, 2018.
  • [28] Z. Mariet and S. Sra. Diversity networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • [29] B. Mirzasoleiman, A. Karbasi, A. Badanidiyuru, and A. Krause. Distributed submodular cover: Succinctly summarizing massive data. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2881–2889, 2015.
  • [30] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.
  • [31] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
  • [32] S. Remes, M. Heinonen, and S. Kaski. Non-stationary spectral kernels. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4645–4654, 2017.
  • [33] M. Rowland, K. Choromanski, F. Chalus, A. Pacchiano, T. Sarlós, R. E. Turner, and A. Weller. Geometrically coupled monte carlo sampling. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 195–205, 2018.
  • [34] M. Rowland, J. Hron, Y. Tang, K. Choromanski, T. Sarlos, and A. Weller. Orthogonal estimation of wasserstein distances. In International Conference on Artificial Intelligence and Statistics, AISTATS, 2019.
  • [35] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. 2017.
  • [36] Y.-L. K. Samo and S. J. Roberts. Generalized spectral kernels. In arXiv:1506.02236, 2015.
  • [37] O. Teytaud and J. Rapin. Nevergrad: An open source tool for derivative-free optimization. https://code.fb.com/ai-research/nevergrad/, 2018.
  • [38] S. Van Der Walt, S. C. Colbert, and G. Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.
  • [39] C. Wachinger and P. Golland. Sampling from determinantal point processes for scalable manifold learning. In Inf Process Med Imaging, page 687–698, 2015.
  • [40] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1067–1075, 2013.
  • [41] J. Yang, V. Sindhwani, H. Avron, and M. W. Mahoney. Quasi-monte carlo feature maps for shift-invariant kernels. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 485–493, 2014.
  • [42] F. Yu, A. Suresh, K. Choromanski, D. Holtmann-Rice, and S. Kumar. Orthogonal random features. In NIPS, pages 1975–1983, 2016.

7 APPENDIX: Structured Monte Carlo Sampling for Nonisotropic Distributions via Determinantal Point Processes

7.1 Variance Reduction for Evolution Strategies using DPPs

The goal of this section is to show that it is possible to use DPPs to reduce the variance of Evolution Strategies gradient estimators.

7.1.1 One dimensional variance reduction using DPPs

We start by showing an auxiliary sequence of one dimensional lemmas. We consider the problem of computing an estimator of the sum of real numbers . In Lemma 7.1 we first show that using DPPs it is always possible to produce an unbiased estimator of the sum of a sequence of real numbers with less or equal variance than the i.i.d estimator that samples each element of the sequence i.i.d. with probability . We then show in Lemma 7.2 that it is possible to produce a DPP kernel such that the corresponding sum estimator has strictly less variance than the i.i.d. one.

We follow the discussion regarding Determinantal Point Process from [23]. Recall that a Determinantal Point Pricess (DPP) on a ground set with is a probability measure over power set . When is a random subset drawn according to , we have, for every .

for some real symmetric matrix indexed by the elements of . Here and adopt . is known as the marginal kernel.

Notice that whenever , and that .

We start by showing a basic variance reduction result regarding DPPs. Let be set of real numbers. Let be their sum. We are interested in analyzing the following two estimators of :

  1. where are sampled independent from each other with .

  2. where is a subset of sampled from a DPP with kernel satisfying for all .

Notice that and and therefore and are unbiased estimators of .

Lemma 7.1.

If for all , the estimator has smaller variance than whenever for all .

Proof.

Since and

are unbiased, it is enough to compare the second moments of the said estimators.

The last inequality holds whenever for all .

We can also show that under appropriate conditions there exists a kernel matrix such that such that the inequality is strict.

Lemma 7.2.

If , for all and there exists such that , then there exists a matrix defining a DPP over - not necessarily nonnegative- satisfying and such that .

Proof.

Let be a matrix defining a DPP with for all Following the exact same proof as in Lemma 7.1, we conclude that iff:

(12)

We show the existence of a kernel matrix for which the inequality 12 holds and for all .

Indeed, let be such that:

For some . Under this definition, notice that and notice that since , there exists a choice of such that , thus defining a valid DPP kernel matrix .

7.1.2 Towards variance reduction for vector estimators using DPPs.

In this section we extend the results of the previous section to the multi dimensional case of Monte Carlo gradient estimators. We start with an auxiliary lemma that will be used in the variance reduction Theorems of the following sections. The following Lemma characterizes the maximum number of vectors that can all be pairwise negatively correlated. This Lemma will be used later on to argue the existence of a DPP kernel for which its subsampling estimator of the Evolution Strategies gradient estimator achieves less variance than the i.i.d. subsampling estimator.

Lemma 7.3.

Let vectors such that for all . Then .

Proof.

We proceed with a proof by contradiction. Let’s assume . Let be a subset of vectors of . There exist such that:

If for all then which would result in a contradiction. If are not all nonnegative, there exist disjoint subsets and such that , and and and with for all (with at least one ) and (with at least one ) for all such that:

Therefore by assumption which would cause a contradiction since .

Recall the gradient estimator corresponding to Evolution Strategies. If , the ES gradient estimator at equals:

We denote by where are all samples from a standard Gaussian .

7.1.3 Subsampling strategies in ES

In this section we consider subsampling strategies for Evolution strategies when we have a dictionary of sensing directions . Let be the ensemble of probabilities with which to sample (according to a Bernoulli trial with probability ) each sensing .

We recognize two cases:

  1. Unbiased sampling In this case we consider a subsampled-importance sampling weighted version of the empirical estimator of the form .

  2. Biased In this case we consider a subsampled version of the empirical estimator of the form where is a set of importance weights, not necessarily equal to .

The crucial observation behind these estimators is that the evaluation of need not be performed at the points that are not subsampled. This allows us to trade off computation with variance (or mean squared error). We would like to achieve the optimal tradeoff.

Unbiased subsampling

The goal of this section is to show that for any i.i.d. subsampling strategy to build an unbiased estimator for the ES gradient, there exists a DPP kernel such that the DPP unbiased subsampling estimator achieves less variance than the i.i.d. one.

The main result of this section, Theorem 7.4 concerns the estimation of functions of the form as defined in Section 1, and shows that for any fixed subsampling i.i.d. strategy (encoded by subsampling probabilities ), there exists a marginal kernel whose corresponding estimator achieves the same mean but has (strictly) less variance. We prove Theorem 7.5 which specializes Theorem 7.4 to the case of ES gradients. A simple notational change would render the proof valid for Theorem 7.4.

The following corresponds to Theorem 4.1 in the main text.

Theorem 7.4.

If and for all , there exists a Marginal Kernel such that:

And:

We show the corresponding result for the case when . The proof is exactly the same as in the case when considering any other type of function defined as in Section 1.

Let be a marginal kernel matrix defining a DPP whose samples we index as with and such that the ensemble follows the DPP process. We consider the following subsampled ES estimator:

Theorem 7.5.

There exists a marginal kernel such that

Proof.

Since , it is enough to show the desired statement for the square norms of these vectors.

Therefore:

Let for all . The expression above becomes: