1 Introduction
An evolution strategy [Schwefel, Rechenberg, 1978] is a black-box optimization algorithm which iteratively updates a population of candidates within the feasible region of the search space. The population is updated by a process of random mutation, followed by fitness evaluation and subsequent recombination of best-performing members to form the next generation. The focus of this paper is on the natural evolution strategies (NES) algorithm [Wierstra et al., 2014] and its quantum variants, in which the population is represented by a smoothly parametrized family of search distributions defined on the search space. Mutation is achieved by sampling new candidates from the search distribution, which yields a gradient estimator of the expected fitness. The recombination step involves updating the parameters of the search distribution in the direction of steepest ascent, with respect to the information geometry implicit in the choice of search distribution.
Natural evolution strategies has recently demonstrated considerable progress in solving black-box optimization problems in high dimensions, including continuous optimization problems relevant to reinforcement learning [Salimans et al., 2017]. Comparatively little work has been done on the discrete optimization side (see, however, [Malagò et al., 2008, 2011, Ollivier et al., 2017]). It was very recently shown by Gomes et al. [2019] that techniques from the quantum variational Monte Carlo literature [Carleo and Troyer, 2017] can be adapted for approximately solving combinatorial optimization problems. Meanwhile, in the quantum computation literature, considerable effort has focused on approximately solving combinatorial optimization problems using low-depth quantum circuits [Farhi et al., 2014]. The unifying principle shared by the above algorithms is their utilization of Monte Carlo samples drawn from a particular probability distribution, the choice of which is determined by a local optimization problem over a variational family of search distributions. The algorithms differ in the choice of variational family, the geometry underlying the local optimization problem, and the use of quantum or classical resources to perform sampling. In this work, we provide a unified view of the relationship between the geometries which underpin these algorithms. In addition, we revisit the experiments of
Gomes et al. [2019] and show that natural evolution strategies can achieve state-of-the-art approximation ratios for MaxCut, albeit at the expense of significantly increased computational cost. The key factor impacting performance is identified to be the batch size of the stochastic gradient estimator. It is hoped that these findings will provide guidance in the search for quantum advantage in optimization using noisy intermediate-scale quantum devices. The paper is structured as follows. First, the natural evolution strategies algorithm is reviewed, highlighting its information-geometric origin. Next, we clarify the relationship between natural evolution strategies and quantum approximate optimization, including the classical/quantum hybrid approach introduced in [Gomes et al., 2019]. Finally, numerical experiments are presented, focusing on combinatorial optimization for the MaxCut problem.
2 Background
Consider the problem of optimizing a real-valued, but otherwise arbitrary function $f \colon \mathcal{X} \to \mathbb{R}$ defined on a search space $\mathcal{X}$. For simplicity of presentation we focus on finite $\mathcal{X}$, although the method generalizes in a straightforward way to infinite search spaces, as required for continuous optimization. Now consider the probability simplex defined as,

$$\mathcal{P} = \Big\{ p \colon \mathcal{X} \to \mathbb{R}_{\geq 0} \;:\; \sum_{x \in \mathcal{X}} p(x) = 1 \Big\} \qquad (1)$$
The natural evolution strategies algorithm can be motivated by the following two observations, which concern the convex and Riemannian geometry of the set $\mathcal{P}$:
1. The optimization problem $\min_{x \in \mathcal{X}} f(x)$ admits the following equivalent convex relaxation,

$$\min_{x \in \mathcal{X}} f(x) = \min_{p \in \mathcal{P}} \mathbb{E}_{x \sim p}[f(x)], \qquad (2)$$

whose global minimizers form the subsimplex consisting of the following convex hull of Dirac distributions,

$$\mathcal{P}^\star = \operatorname{conv}\big\{ \delta_{x^\star} : x^\star \in \operatorname{argmin}_{x \in \mathcal{X}} f(x) \big\}. \qquad (3)$$
2. There is a natural Riemannian metric, called the Fisher–Rao metric, defined on the interior $\mathcal{P}_+$ of the probability simplex, consisting of strictly positive probability vectors. The distance between $p, q \in \mathcal{P}_+$ is defined as

$$d_{\mathrm{FR}}(p, q) = 2 \arccos \big\langle \sqrt{p}, \sqrt{q} \big\rangle, \qquad (4)$$

where $\sqrt{p}$ denotes the elementwise square root of the probability vector $p$.
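As a quick sanity check on (4), the distance can be computed directly from the elementwise square roots (a minimal illustrative helper; the function name is ours):

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao distance (4): twice the great-circle distance between the
    elementwise square roots of two probability vectors."""
    inner = np.sum(np.sqrt(p) * np.sqrt(q))
    # Clip guards against floating-point overshoot outside [-1, 1].
    return 2.0 * np.arccos(np.clip(inner, -1.0, 1.0))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
d = fisher_rao_distance(p, q)
```

The maximal distance, attained by distributions with disjoint support, is $\pi$ under this convention.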
Given a smoothly parametrized family of search distributions $\{p_\theta\}_{\theta \in \Theta} \subseteq \mathcal{P}$, one obtains a trivial variational bound as follows,

$$\min_{x \in \mathcal{X}} f(x) \leq \min_{\theta \in \Theta} \mathbb{E}_{x \sim p_\theta}[f(x)]. \qquad (5)$$
Natural evolution strategies seeks to obtain a good approximation ratio by optimizing the above variational upper bound using Riemannian gradient descent in the geometry induced by the Fisher–Rao metric, otherwise known as natural gradient descent [Amari, 1998].
Specifically, given a learning rate $\eta > 0$ and an initial condition $\theta_0 \in \Theta$, one considers the deterministic sequence in $\Theta$ defined by,

$$\theta_{t+1} = \operatorname*{argmin}_{\theta \in \Theta} \Big[ \big\langle \nabla_\theta \mathbb{E}_{x \sim p_{\theta_t}}[f(x)], \, \theta - \theta_t \big\rangle + \frac{1}{2\eta} (\theta - \theta_t)^\top F(\theta_t) \, (\theta - \theta_t) \Big], \qquad (6)$$

where $F(\theta)$ denotes the Fisher information matrix, evaluated at the parameter vector $\theta$,

$$F(\theta) = \mathbb{E}_{x \sim p_\theta} \big[ \nabla_\theta \log p_\theta(x) \, \nabla_\theta \log p_\theta(x)^\top \big]. \qquad (7)$$
Indeed, it can be shown that $F(\theta)$ is the local coordinate representation of the Riemannian metric tensor induced by (4), and the restriction to the interior of the probability simplex ensures that the logarithm is well defined.¹ Observe that the iteration (6) defining the sequence involves the unknown function $f$. Natural evolution strategies can now be defined as the randomized algorithm inspired by (6), in which the gradient $\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)]$ is replaced by a stochastic gradient estimator obtained by sampling from the search distribution $p_\theta$. Likewise, if the Fisher information (7) cannot be evaluated in closed form, then it can be replaced by an associated estimator.

¹ This discussion ignores the fact that, unlike a bona fide Riemannian metric, the Fisher information matrix can be degenerate, in which case we choose the minimizer of (6) to be $\theta_{t+1} = \theta_t - \eta F^+(\theta_t) \nabla_\theta \mathbb{E}_{x \sim p_{\theta_t}}[f(x)]$, where $F^+$ denotes the Moore–Penrose pseudoinverse of $F$.
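To make the stochastic variant concrete, the following toy sketch performs natural-gradient steps for a softmax search distribution over a small finite search space (an illustration under our own choices of family, estimator, and diagonal regularization; not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def nes_step(theta, f_values, eta=0.1, batch_size=64, shift=1e-3):
    """One stochastic natural-gradient step for p_theta = softmax(theta).

    The exact gradient and Fisher matrix (7) are replaced by Monte Carlo
    estimates built from samples of the search distribution."""
    p = np.exp(theta - theta.max())
    p /= p.sum()
    xs = rng.choice(len(p), size=batch_size, p=p)
    # Score function of the softmax family: grad log p_theta(x) = e_x - p.
    scores = np.eye(len(p))[xs] - p
    grad = (f_values[xs, None] * scores).mean(axis=0)   # estimate of grad E[f]
    fisher = scores.T @ scores / batch_size             # estimate of F(theta)
    # A small diagonal shift regularizes the (possibly singular) Fisher matrix.
    return theta - eta * np.linalg.solve(fisher + shift * np.eye(len(p)), grad)

f_values = np.array([3.0, 1.0, 2.0, 0.0])   # objective values; minimum at x = 3
theta = np.zeros(4)
for _ in range(200):
    theta = nes_step(theta, f_values)
p_final = np.exp(theta - theta.max())
p_final /= p_final.sum()
```

After a few hundred steps the search distribution concentrates on the minimizer; the diagonal shift plays the same role as the regularization used in stochastic reconfiguration.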
3 Quantum approximate optimization as natural evolution strategies
The passage to quantization of natural evolution strategies proceeds by replacing the search space $\mathcal{X}$ by a complex Euclidean space,

$$\mathcal{H} = \mathbb{C}^{|\mathcal{X}|}, \qquad (8)$$

whose orthonormal basis elements are $\{ |x\rangle \}_{x \in \mathcal{X}}$. Moreover, the probability simplex $\mathcal{P}$ is replaced by the convex set of density operators,

$$\mathcal{D} = \big\{ \rho \in \operatorname{Herm}(\mathcal{H}) : \rho \succeq 0, \; \operatorname{tr}(\rho) = 1 \big\}, \qquad (9)$$
where $\operatorname{Herm}(\mathcal{H})$ denotes the set of Hermitian operators on $\mathcal{H}$. It is clear that any classical probability distribution $p \in \mathcal{P}$ can be encoded as the following diagonal density operator $\rho_p = \sum_{x \in \mathcal{X}} p(x) \, |x\rangle\langle x|$. It is equally clear that this does not exhaust the space of admissible density operators: another possibility being the rank-1 projection operator onto the one-dimensional subspace spanned by the vector $\sum_{x \in \mathcal{X}} \sqrt{p(x)} \, |x\rangle$. Any admissible density operator $\rho$ gives rise to a valid probability distribution, which we call $p_\rho$, obtained from the diagonal matrix representation in the standard basis, $p_\rho(x) = \langle x | \rho | x \rangle$.
It is conceptually useful to introduce the following Hermitian operator (diagonalized by the standard basis),

$$H_f = \sum_{x \in \mathcal{X}} f(x) \, |x\rangle\langle x|, \qquad (10)$$

whose ground-state subspace encodes the solution of the optimization problem,

$$\ker\big( H_f - \lambda_{\min}(H_f) \, \mathbb{1} \big) = \operatorname{span}\big\{ |x^\star\rangle : x^\star \in \operatorname{argmin}_{x \in \mathcal{X}} f(x) \big\}. \qquad (11)$$
Then for any density operator $\rho \in \mathcal{D}$ we see that the quantum expectation value of $H_f$, evaluated in the state $\rho$, is computed by the following classical expectation value,

$$\operatorname{tr}(\rho H_f) = \mathbb{E}_{x \sim p_\rho}[f(x)], \qquad (12)$$

which is an obvious upper bound for $\min_{x \in \mathcal{X}} f(x)$. The above identity demonstrates the widely known fact that for diagonal operators of the form (10), unconstrained optimization over the space of quantum states $\mathcal{D}$ is equivalent to optimization over the probability simplex $\mathcal{P}$.
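The identity (12) is easy to verify numerically for a random rank-1 density operator (a small self-contained sketch in the notation above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
dim = 2 ** n                               # |X| for a search space of n bits

f = rng.standard_normal(dim)               # arbitrary objective values f(x)
H = np.diag(f)                             # diagonal operator (10)

# Rank-1 density operator rho = |psi><psi| from a random normalized state.
psi = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())

p_rho = np.real(np.diag(rho))              # Born distribution <x|rho|x>
lhs = np.real(np.trace(rho @ H))           # quantum expectation tr(rho H)
rhs = p_rho @ f                            # classical expectation E_{p_rho}[f]
```

Both sides agree, and each is bounded below by the minimum of $f$, as claimed.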
Gomes et al. [2019] ask whether constrained optimization within a parametrized subset of density operators provides a useful heuristic for approximate combinatorial optimization. In particular, they consider the case of rank-1 projectors, for which there exists a natural Riemannian metric called the Fubini–Study metric, defined as follows,

$$d_{\mathrm{FS}}(\psi, \phi) = \arccos \frac{ |\langle \psi, \phi \rangle| }{ \|\psi\| \, \|\phi\| }. \qquad (13)$$
Thus, given a smoothly parametrized subset $\{\psi_\theta\}_{\theta \in \Theta}$ of a complex Euclidean space, one can define quantum natural evolution strategies as the local optimization of the following variational upper bound, via Riemannian gradient descent in the geometry induced by the Fubini–Study metric,

$$\min_{x \in \mathcal{X}} f(x) \leq \min_{\theta \in \Theta} \frac{ \langle \psi_\theta, H_f \psi_\theta \rangle }{ \langle \psi_\theta, \psi_\theta \rangle }. \qquad (14)$$
The choice to restrict to rank-1 projection operators involves no loss of generality compared to classical natural evolution strategies because if we choose $\psi_\theta = \sqrt{p_\theta}$, then the Fubini–Study geometry reduces to Fisher–Rao. Thus, if the parametric family is chosen to be strictly positive, $\psi_\theta > 0$, then Riemannian gradient descent in the Fubini–Study geometry coincides with natural gradient descent² and we recover classical natural evolution strategies. We therefore henceforth use the terminology 'natural gradient' to refer to both geometries interchangeably.

² The local coordinate representations of the Fubini–Study and Fisher–Rao metric tensors agree if the Berry connection vanishes [Stokes et al., 2019].
The natural gradient has been thoroughly explored in the variational Monte Carlo literature for ground-state optimization of non-diagonal Hermitian operators [Sorella, 1998, Carleo and Troyer, 2017], and more recently in the variational quantum algorithm literature [Yuan et al., 2019, Stokes et al., 2019, Koczor and Benjamin, 2019]. It is clearly of interest to explore the natural gradient for variational optimization of the Quantum Approximate Optimization Algorithm (QAOA)³ [Farhi et al., 2014]. Finally, we note the possibility of generalizing to higher-rank non-classical density operators, the investigation of which is left to future work.

³ In the QAOA one identifies $\psi_\theta = U(\theta)\,\psi_0$, where $\psi_0$ is a normalized reference tensor product state and $U(\theta)$ is a parametrized low-depth unitary circuit.
4 Experiments
Consider combinatorial optimization problems defined on $\mathcal{X} = \{-1, 1\}^n$. For concreteness we focus on the MaxCut problem corresponding to an undirected graph on $n$ nodes with graph Laplacian $L$. The function to be minimized is simply,

$$f(x) = -C(x) = -\tfrac{1}{4} \, x^\top L \, x, \qquad (15)$$
where $C(x)$ is the size of the cut corresponding to the configuration $x \in \{-1, 1\}^n$. After fixing a parametrized family of wavefunctions, we locally optimize the variational upper bound (14) using stochastic natural gradient descent. Following Gomes et al. [2019], we choose the variational wavefunction to be of Boltzmann form [Carleo and Troyer, 2017],

$$\psi_\theta(x) = \exp\Big( \sum_{j=1}^{n} a_j x_j \Big) \prod_{k=1}^{n_h} 2 \cosh\Big( b_k + \sum_{j=1}^{n} W_{jk} x_j \Big), \qquad (16)$$

where the variational parameters are $\theta = (a, b, W) \in \mathbb{F}^n \times \mathbb{F}^{n_h} \times \mathbb{F}^{n \times n_h}$ and $\mathbb{F}$ denotes either $\mathbb{R}$ or $\mathbb{C}$. In the case $\mathbb{F} = \mathbb{R}$ we have $\psi_\theta > 0$ and the optimization problem is equivalent to natural evolution strategies [Wierstra et al., 2014]. An example illustrating the increased expressiveness of complex restricted Boltzmann machines compared to their real-valued counterparts for representing classical probability distributions is presented in the supplementary material.
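Summing out the hidden units analytically makes (16) cheap to evaluate pointwise. A minimal sketch (the helper and parameter names are ours; the 0.01-scale Gaussian initialization matches the setup used in the experiments below):

```python
import numpy as np

def rbm_amplitude(x, a, b, W):
    """Boltzmann-form amplitude (16) with hidden units traced out:
    psi(x) = exp(a . x) * prod_k 2 cosh(b_k + (W^T x)_k).
    Parameters may be real (rRBM) or complex (cRBM)."""
    return np.exp(a @ x) * np.prod(2.0 * np.cosh(b + W.T @ x))

rng = np.random.default_rng(2)
n = n_hidden = 4                      # hidden-unit density 1
scale = 0.01                          # small Gaussian initialization
a = scale * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
b = scale * (rng.standard_normal(n_hidden) + 1j * rng.standard_normal(n_hidden))
W = scale * (rng.standard_normal((n, n_hidden))
             + 1j * rng.standard_normal((n, n_hidden)))

x = np.array([1, -1, 1, -1])
prob = abs(rbm_amplitude(x, a, b, W)) ** 2   # unnormalized Born probability
```

With real parameters the amplitude is strictly positive, recovering the classical case $\psi_\theta = \sqrt{p_\theta}$ up to normalization.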
4.1 Performance Evaluation on MaxCut problem
Although no polynomial-time algorithm is known for solving MaxCut on general graphs, many approximation algorithms have been developed in the past decades. The Random Cut algorithm is a simple randomized 0.5-approximation algorithm that assigns each node to a partition uniformly at random (see, e.g., Mitzenmacher and Upfal [2005]). Goemans and Williamson [1995] improved the performance ratio from 0.5 to at least 0.87856 by making use of the semidefinite programming (SDP) relaxation of the original integer quadratic program. Burer and Monteiro [2001] reformulated the SDP for MaxCut as a nonconvex problem, with the benefit of lower dimension and no conic constraint.
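As a concrete reference point, the cut value from (15) and the Random Cut baseline can be sketched in a few lines (an illustrative toy implementation, not the benchmark code used in the experiments; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def cut_size(L, x):
    """Cut size C(x) = (1/4) x^T L x for a spin configuration x in {-1,+1}^n,
    where L is the graph Laplacian."""
    return float(x @ L @ x) / 4.0

def random_cut(L, trials=100):
    """Random Cut baseline: assign each node to a side uniformly at random
    and keep the best cut seen.  A single trial cuts half of the edges in
    expectation, which yields the 0.5-approximation guarantee."""
    n = L.shape[0]
    return max(cut_size(L, rng.choice([-1, 1], size=n)) for _ in range(trials))

# Triangle graph: every nontrivial bipartition cuts exactly 2 of the 3 edges.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
L = np.diag(A.sum(axis=1)) - A
```

On this toy instance `cut_size(L, np.array([1, -1, -1]))` evaluates to 2.0, and `random_cut(L)` recovers the optimal cut.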
The above algorithms were used as benchmark solvers for comparison with quantum natural evolution strategies. The implementation of the Goemans–Williamson algorithm used the CVXPY package [Diamond and Boyd, 2016, Agrawal et al., 2018], and the Burer–Monteiro reformulation with the Riemannian trust-region method [Absil et al., 2007] used the Manopt toolbox [Boumal et al., 2014], which essentially implements the optimization algorithm proposed by Journée et al. [2010].
The classical and quantum variants of natural evolution strategies were realized using the variational Monte Carlo (VMC) [McMillan, 1965] method with Stochastic Reconfiguration (SR) [Sorella, 1998], as implemented in the NetKet toolbox [Carleo et al., 2019]. The SR optimization was performed for 90 iterations with a fixed regularization parameter and learning rate. At each iteration, the number of Monte Carlo sampled observables (the training batch size) was 4096. The mean performance of the observable batch drawn from the trained model is reported. For the Restricted Boltzmann Machine (RBM) model, the number of hidden variables was set equal to the number of spins; the weights are complex-valued, initialized from a Gaussian distribution with mean 0 and standard deviation (std) 0.01. Throughout the experiments, the timing benchmarks were performed on a single core of an 8-core Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz processor, with 128 GB of memory.
For evaluation, we constructed a problem instance for each graph size by randomly generating graph Laplacians with fixed edge density, for $n \in \{50, 70, 100, 150, 200, 250\}$. This defines a set of problem instances, indexed by $n$, which are held fixed throughout the experiments. For each fixed problem instance, each algorithm was executed 10 times using 10 random seeds/initializations. In Table 1, we report the mean and std of the performance over graph instances of different sizes. Since the optimal cut for a given problem instance cannot be computed for large-scale problems, we approximate it with an upper bound [Boumal et al., 2016], namely the optimal value of the SDP relaxation, following the arguments in [Goemans and Williamson, 1995]. In Figure 1(L), we present the corresponding approximation ratios for the same algorithms and graph instances as in Table 1 with box plots.
Table 1: Cut number and elapsed time for each solver and graph size (mean ± std over 10 random seeds).

| # of Nodes | 50 | 70 | 100 | 150 | 200 | 250 |
| --- | --- | --- | --- | --- | --- | --- |
| Cut number | | | | | | |
| Random | 149.60 ± 7.41 | 297.10 ± 11.48 | 614.30 ± 16.94 | 1436.80 ± 27.14 | 2467.30 ± 30.27 | 3888.00 ± 39.06 |
| Goemans–Williamson | 203.40 ± 3.61 | 380.90 ± 8.48 | 752.50 ± 9.22 | 1685.70 ± 13.10 | 2875.10 ± 22.34 | 4439.90 ± 26.07 |
| Burer–Monteiro | 206.30 ± 0.46 | 390.90 ± 0.54 | 776.60 ± 1.56 | 1719.90 ± 1.58 | — | — |
| Natural Evolution Strategies | — | — | — | — | 2927.82 ± 12.10 | 4515.72 ± 11.41 |
| Time elapsed (sec) | | | | | | |
| Random | * | * | * | * | * | * |
| Goemans–Williamson | 0.32 ± 0.11 | 0.51 ± 0.04 | 1.59 ± 0.02 | 5.34 ± 0.20 | 12.24 ± 0.34 | 21.09 ± 0.54 |
| Burer–Monteiro | 0.77 ± 0.12 | 0.96 ± 0.08 | 1.36 ± 0.08 | 1.52 ± 0.16 | 2.32 ± 0.13 | 2.63 ± 0.14 |
| Natural Evolution Strategies | 964.20 ± 12.58 | 1883.46 ± 20.39 | 3683.93 ± 65.30 | 8606.25 ± 203.14 | 15693.06 ± 431.68 | 23621.38 ± 847.84 |
The choice of batch size was found to be a crucial factor controlling the performance of the algorithm. Intuitively, it quantifies the exploration capability in the state space: the algorithm has a better chance of discovering the ground state if it is allowed to explore more. The performance (as measured by the approximation ratio) and the training time, as functions of batch size, are reported in Figure 1(R) for the graph instance with 150 nodes. The ablation study also reveals that increasing the hidden-unit density is correlated with improved performance, provided that the training batch size is correspondingly increased. This is expected behaviour, since increased model capacity is required to capture the increased complexity of the sampled observables.
| Architecture \ Optimizer | Adadelta | Adamax | Momentum | RMSprop | SGD |
| --- | --- | --- | --- | --- | --- |
| Without natural gradient | | | | | |
| cRBM1 | 1692.25 ± 6.13 | 1648.19 ± 12.67 | 1656.60 ± 6.83 | 1642.60 ± 15.07 | 1608.20 ± 52.05 |
| With natural gradient | | | | | |
| rRBM1 | 1646.96 ± 6.80 | 1687.16 ± 10.34 | 1694.02 ± 7.71 | 1692.76 ± 4.79 | 1704.16 ± 7.89 |
| cRBM1 | 1699.74 ± 11.60 | 1697.04 ± 17.24 | 1701.32 ± 8.32 | 1700.39 ± 10.00 | 1704.06 ± 5.28 |
| cRBM3 | 1713.13 ± 8.78 | 1693.07 ± 5.39 | 1700.33 ± 6.77 | 1686.20 ± 10.91 | 1709.14 ± 8.70 |
| FC | 1700.01 ± 7.50 | 1697.11 ± 8.84 | 1702.02 ± 6.02 | 1702.20 ± 14.54 | 1702.95 ± 8.13 |
The role of the optimization algorithm and model architecture was also investigated, focusing on the following optimizers (with and without natural gradient updates): Adadelta [Zeiler, 2012], Adamax [Kingma and Ba, 2015], Momentum [Sutskever et al., 2013], RMSprop, and SGD. The architecture was chosen to be the restricted Boltzmann form (16) with real-valued weights (rRBM) or complex weights (cRBM), with the hidden-unit density indicated by the trailing digit.
Natural gradient descent [Amari, 1998, Sorella, 1998] proved essential for converging to a good local optimum; the results on cRBM1 show that optimizers equipped with natural gradient updates consistently outperformed those without. The use of a complex RBM yielded some improvement relative to the real-valued case, although the performance gap largely disappeared once natural gradient updates were applied. In addition, we found that architectures other than the RBM can achieve high performance: the single-layer perceptron (FC) is comparable with the RBM across all optimizers. In general, stochastic natural gradient descent consistently achieved the best performance over all architectures, in comparison with the other optimizers.
5 Conclusions
The MaxCut approximation ratio achieved by natural evolution strategies is competitive with state-of-the-art solvers, although this comes at the expense of significantly increased computation time. It will be interesting to investigate other combinatorial optimization problems, particularly from the spin-glass literature, which do not admit semidefinite programming relaxations.
It is legitimate to inquire about possible advantages for classical stochastic optimization, given access to efficiently simulable subsets of quantum states, such as those parametrized by the complex Boltzmann machine. In the case of natural evolution strategies, our ablation study indicates that the use of natural gradient descent levels the playing field between the quantum and classical variants.
Not all classical probability distributions can be efficiently and accurately sampled using classical resources. It remains an interesting open question whether access to probability distributions corresponding to low-depth quantum circuits can impact the above conclusions.
Appendix A Supplementary material
The following example illustrates how the complex restricted Boltzmann machine can efficiently represent probability distributions that are hard to represent using classical neural networks.
Let $\mathcal{X} = \{0, 1\}^n$ be the set of all bitstrings of length $n$ and let $E \subset \mathcal{X}$ denote the subset consisting of all bitstrings with even Hamming weight. Define $p \in \mathcal{P}$ as follows,

$$p(x) = \frac{\mathbb{1}_E(x)}{2^{n-1}}, \qquad (17)$$

where $\mathbb{1}_E$ is the indicator function for the subset $E$. Approximate representation of $p$ by a real-valued restricted Boltzmann machine requires a number of hidden units scaling exponentially in $n$ [Montúfar et al., 2011] (see also [Stokes and Terilla, 2019]). In contrast, $p$ can be exactly represented by a complex RBM with a single hidden unit.
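One concrete construction realizing this claim (our illustration; the original construction may differ): taking a single hidden unit with zero biases and purely imaginary weights $w_j = i\pi/2$ gives $\psi(x) = \cosh(\frac{i\pi}{2}\sum_j x_j) = \cos(\frac{\pi}{2}\sum_j x_j)$, so $|\psi(x)|^2$ vanishes exactly on bitstrings of odd Hamming weight:

```python
import itertools
import numpy as np

n = 6
w = 1j * np.pi / 2 * np.ones(n)       # purely imaginary weights, zero biases

# Enumerate |psi(x)|^2 for every bitstring x in {0,1}^n.
probs = {}
for bits in itertools.product([0, 1], repeat=n):
    x = np.array(bits)
    psi = np.cosh(w @ x)              # single complex hidden unit
    probs[bits] = abs(psi) ** 2

# The support is exactly the set of bitstrings with even Hamming weight.
support = {bits for bits, p in probs.items() if p > 1e-12}
```

Up to normalization by $2^{n-1}$, the resulting Born distribution is exactly (17).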
References
 Absil et al. [2007] P.A. Absil, C. G. Baker, and K. A. Gallivan. Trustregion methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
 Agrawal et al. [2018] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
 Amari [1998] S.I. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 Boumal et al. [2014] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res., 15(1), 2014. ISSN 1532-4435.
 Boumal et al. [2016] N. Boumal, V. Voroninski, and A. S. Bandeira. The nonconvex Burer–Monteiro approach works on smooth semidefinite programs. In NeurIPS, 2016.
 Burer and Monteiro [2001] S. Burer and R. D. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, Series B, 95, 2003.
 Carleo and Troyer [2017] G. Carleo and M. Troyer. Solving the quantum manybody problem with artificial neural networks. Science, 355(6325):602–606, 2017.
 Carleo et al. [2019] G. Carleo, K. Choo, D. Hofmann, J. E. Smith, T. Westerhout, F. Alet, E. J. Davis, S. Efthymiou, I. Glasser, S.-H. Lin, et al. NetKet: A machine learning toolkit for many-body quantum systems. SoftwareX, 10:100311, 2019. ISSN 2352-7110.
 Diamond and Boyd [2016] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
 Farhi et al. [2014] E. Farhi, J. Goldstone, and S. Gutmann. A quantum approximate optimization algorithm. arXiv preprint arXiv:1411.4028, 2014.
 Goemans and Williamson [1995] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
 Gomes et al. [2019] J. Gomes, K. A. McKiernan, P. Eastman, and V. S. Pande. Classical quantum optimization with neural network quantum states. arXiv preprint arXiv:1910.10675, 2019.
 Journée et al. [2010] M. Journée, F. Bach, P.A. Absil, and R. Sepulchre. Lowrank optimization on the cone of positive semidefinite matrices. SIAM J. on Optimization, 20(5):2327–2351, May 2010.
 Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Koczor and Benjamin [2019] B. Koczor and S. C. Benjamin. Quantum natural gradient generalised to nonunitary circuits. arXiv preprint arXiv:1912.08660, 2019.
 Malagò et al. [2008] L. Malagò, M. Matteucci, and B. Dal Seno. An information geometry perspective on estimation of distribution algorithms: boundary analysis. In Proceedings of the 10th Annual Conference Companion on Genetic and Evolutionary Computation, pages 2081–2088, 2008.
 Malagò et al. [2011] L. Malagò, M. Matteucci, and G. Pistone. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop Proceedings on Foundations of Genetic Algorithms, pages 230–242, 2011.
 McMillan [1965] W. L. McMillan. Ground state of liquid He4. Phys. Rev., 138:A442–A451, 1965.
 Mitzenmacher and Upfal [2005] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, USA, 2005. ISBN 0521835402.
 Montúfar et al. [2011] G. F. Montúfar, J. Rauh, and N. Ay. Expressive power and approximation errors of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 415–423, 2011.
 Ollivier et al. [2017] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. Informationgeometric optimization algorithms: A unifying picture via invariance principles. The Journal of Machine Learning Research, 18(1):564–628, 2017.
 Rechenberg [1978] I. Rechenberg. Evolutionsstrategien. In Simulationsmethoden in der Medizin und Biologie, pages 83–114. Springer, 1978.
 Salimans et al. [2017] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
 Schwefel H.-P. Schwefel. Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie: mit einer vergleichenden Einführung in die Hill-Climbing- und Zufallsstrategie, volume 1. Springer.
 Sorella [1998] S. Sorella. Green function Monte Carlo with stochastic reconfiguration. Physical Review Letters, 80(20):4558–4561, 1998.
 Stokes and Terilla [2019] J. Stokes and J. Terilla. Probabilistic modeling with matrix product states. Entropy, 21(12):1236, 2019.
 Stokes et al. [2019] J. Stokes, J. Izaac, N. Killoran, and G. Carleo. Quantum natural gradient. Quantum, in press, arXiv preprint arXiv:1909.02108, 2019.
 Sutskever et al. [2013] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, PMLR, 2013.
 Wierstra et al. [2014] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
 Yuan et al. [2019] X. Yuan, S. Endo, Q. Zhao, Y. Li, and S. C. Benjamin. Theory of variational quantum simulation. Quantum, 3:191, 2019.
 Zeiler [2012] M. D. Zeiler. Adadelta: An adaptive learning rate method, 2012.