1 Introduction
This paper concerns the development of heuristic approximation algorithms for determining the ground state (minimal eigenpair) of certain large random Hermitian matrices drawn from known ensembles of relevance to physics. Concretely, we focus on the real vector subspace of Hermitian matrices consisting of 2-local, $n$-qubit Hamiltonians, which are enumerated by a number of parameters only polynomial in $n$. These 2-local Hamiltonians are, in particular, expressive enough to capture all quadratic unconstrained binary optimization (QUBO) problems.

Variational Monte Carlo (VMC) [9] is a heuristic algorithm which, given a local Hamiltonian $H$, produces an estimate for the minimal eigenvalue $\lambda_{\min}(H)$ and a description of an associated eigenvector. By exploiting neural networks as trial wavefunctions, Carleo and Troyer [3] showed that VMC can achieve state-of-the-art results for the ground-state energies of physically important magnetic spin models. The domain of applicability of so-called neural-network quantum states has since been expanded considerably. It was recently shown, for example, that in the case of binary optimization, VMC is equivalent to Natural Evolution Strategies (NES) [7], and state-of-the-art results can be achieved for the MaxCut Hamiltonian [16], albeit at the expense of significantly increased computation time compared to the best known classical heuristic approximation algorithms. The slowdown is attributed to the gradient-based training loop within the VMC. In this step, the neural-network parameters $\theta$ are guided in the direction of steepest descent of the following unbiased estimator of the Rayleigh quotient, which upper bounds the lowest eigenvalue:
$$ L(\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\!\left[ \frac{\langle x \,|\, H \,|\, \psi_\theta \rangle}{\langle x \,|\, \psi_\theta \rangle} \right] \qquad (1) $$
where $x$ ranges over the computational basis of $\mathbb{C}^{2^n}$ and the vector $\psi_\theta \in \mathbb{C}^{2^n}$ denotes a neural-network quantum state in the sense that the amplitude $\langle x \,|\, \psi_\theta \rangle$ is computed by a neural network with variational parameters $\theta$. The expectation value is taken with respect to the associated Born probabilities $\pi_\theta(x)$, which are proportional to $|\langle x \,|\, \psi_\theta \rangle|^2$. In practice, the objective function is optimized using a variant of stochastic mini-batch gradient descent called Stochastic Reconfiguration [12].

In contrast to the canonical formulation of VMC, which accepts a single Hamiltonian as input, in this paper we propose the meta-VMC, which asks for an approximation of the ground energy for an ensemble of 2-local Hamiltonians $H_c$ indexed by a random disorder parameter $c$ sampled from a known distribution $p$. The simplest strategy of retraining a separate neural-network quantum state from scratch for each realization of the disorder parameter is clearly impractical. The goal is thus shifted to finding a neural network that is maximally adaptive to new realizations of the disorder. For experimental support, we focus in this paper on the special case of QUBO problems, which can be identified with Hamiltonians that are diagonal in the Pauli-$Z$ basis. These preliminary results are viewed as a stepping stone to the more physically interesting problem of random quantum spin models, in which we anticipate similar optimization considerations to apply.
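Since the QUBO Hamiltonians considered here are diagonal, the ratio $\langle x|H|\psi_\theta\rangle / \langle x|\psi_\theta\rangle$ collapses to the classical energy $H(x)$, and Eq. (1) becomes a plain expectation under the Born distribution. The following is a minimal NumPy sketch of this population objective; the function names and the exhaustive enumeration over basis states are ours, for illustration on small $n$ only, assuming a real-valued log-amplitude network:

```python
import numpy as np

def born_probs(log_psi, configs):
    """Normalized Born probabilities |<x|psi>|^2 over a list of configurations."""
    amps = np.exp([log_psi(x) for x in configs])
    p = amps ** 2
    return p / p.sum()

def rayleigh_quotient_diag(log_psi, H_diag, n):
    """Exact population value of Eq. (1) for a diagonal Hamiltonian on n qubits.

    For diagonal H, <x|H|psi>/<x|psi> = H(x), so the Rayleigh quotient is
    simply the expectation of H(x) under the Born probabilities.
    """
    configs = [np.array(bits) for bits in np.ndindex(*(2,) * n)]
    p = born_probs(log_psi, configs)
    local_energies = np.array([H_diag(x) for x in configs])
    return float(np.dot(p, local_energies))
```

In a practical VMC loop the sum over all $2^n$ basis states is, of course, replaced by Monte Carlo samples drawn from the Born distribution.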
The formulation of meta-VMC exhibits obvious parallels with meta-learning, or learning to learn, in the machine-learning literature [13], where data from previously encountered learning tasks is employed to accelerate performance on new tasks drawn from an underlying task distribution. In the language of meta-learning, $c$ indexes the learning task and $p$ denotes the distribution over all tasks. In meta-VMC, we assume the task distribution $p$ is known to the learner; in contrast, conventional meta-learning assumes $p$ is unknown but possesses sufficient regularity to render meta-learning feasible. A second, less significant distinction is that the per-task objective function for meta-VMC is an unbiased estimator for the population objective, whereas meta-learning typically focuses on empirical risk minimization objectives, which suffer from nonzero bias.

2 Theory
A simple strategy that has proven successful in meta-learning of deep neural networks is multi-task transfer learning [4, 2], which aims to learn an initialization for subsequent tasks by jointly optimizing the learning objectives of multiple tasks simultaneously, using a mini-batch training strategy that interleaves batches across the tasks. Multi-task learning is, however, prone to catastrophic interference [8], making it unsuitable for generalization to the VMC. The problem is exemplified by some of the simplest examples of disordered spin systems: suppose $H_c$ is a random Hamiltonian whose expected value under the disorder parameter vanishes, $\mathbb{E}_{c \sim p}[H_c] = 0$. As a concrete example, consider the Sherrington-Kirkpatrick Hamiltonian, in which the disorder parameter $c$ represents a collection of i.i.d. centered Gaussian random variables representing the exchange energies. If we denote by $L_c(\theta)$ the objective function corresponding to disorder parameter $c$, then the multi-task learning objective function, expressed in the population limit, is given by

$$ L_{\mathrm{MT}}(\theta) \;=\; \mathbb{E}_{c \sim p}\big[L_c(\theta)\big] \;=\; \frac{\langle \psi_\theta \,|\, \mathbb{E}_{c \sim p}[H_c] \,|\, \psi_\theta \rangle}{\langle \psi_\theta \,|\, \psi_\theta \rangle} \;=\; 0. \qquad (2) $$
The fact that the multi-task learning objective loses dependence on $\theta$ in the population limit implies that the associated mini-batch algorithm makes no progress asymptotically.
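This degeneracy is easy to verify numerically: averaging many Sherrington-Kirkpatrick coupling matrices drives the averaged Hamiltonian, and hence the population multi-task objective, toward zero. Below is a small NumPy check; it is our illustration, and the system size and sample count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_tasks = 6, 20000

# i.i.d. centered Gaussian exchange energies J_ij (Sherrington-Kirkpatrick):
# the disorder average of H_c is the zero operator, so the population
# multi-task objective is flat in the variational parameters.
J_bar = np.zeros((n, n))
for _ in range(num_tasks):
    J = np.triu(rng.standard_normal((n, n)), 1)  # keep i < j couplings only
    J_bar += J / num_tasks

# Disorder-averaged classical energy of an arbitrary spin configuration -> ~0
s = rng.choice([-1.0, 1.0], size=n)
avg_energy = s @ J_bar @ s
print(abs(avg_energy))  # shrinks toward 0 as num_tasks grows
```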
In order to define an objective function which is asymptotically nonvacuous and which promotes adaptation to new realizations of disorder, we propose to optimize the following meta-learning objective function, again presented in population form for simplicity [1]:
$$ L_{\mathrm{meta}}(\theta) \;=\; \mathbb{E}_{c \sim p}\Big[ L_c\big(\Phi_c^k(\theta)\big) \Big] \qquad (3) $$
where $\Phi_c^k$ denotes the $k$-fold application of a task-adaptation operator $\Phi_c$, which in the simplest case of gradient descent with step size $\alpha$ is given by $\Phi_c(\theta) = \theta - \alpha \nabla L_c(\theta)$. Optimization of the meta-learning objective ensures that when a new realization of the disorder parameter is drawn, the initialization performs well after one or more steps of gradient descent. Loosely speaking, meta-learning can be justified when one has a budget for running a few steps of gradient descent.
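Concretely, the $k$-fold adaptation operator is nothing more than $k$ steps of task-specific gradient descent. A minimal sketch, with hypothetical helper names and assuming access to the task gradient $\nabla L_c$:

```python
import numpy as np

def adapt(theta, grad_L, alpha=0.1, k=3):
    """k-fold application of the task-adaptation operator
    Phi(theta) = theta - alpha * grad_L(theta)."""
    for _ in range(k):
        theta = theta - alpha * grad_L(theta)
    return theta
```

In meta-VMC, `grad_L` would itself be a stochastic estimate computed from Born-distributed samples for the current disorder realization.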
The fact that the meta-learning objective function manages to avoid the catastrophic-interference phenomenon can be illustrated by the following toy model, a quadratic model that has also been analyzed in the context of convergence theory in [5]. Rather than considering the Rayleigh quotient (1), consider the following ensemble of quadratic functions, specified by a random positive-definite matrix $A_c$ and a random vector $b_c$:
$$ L_c(\theta) \;=\; \tfrac{1}{2}\, \theta^\top A_c\, \theta \;-\; b_c^\top \theta, \qquad (4) $$
where the random variable $c$ now corresponds to the task label. In the simplest setting of single-step ($k = 1$) meta-learning with the vanilla update operator $\Phi_c(\theta) = \theta - \alpha \nabla L_c(\theta)$, the optimal solutions of the multi-task and meta-learning objectives can be found in closed form:
$$ \theta^\star_{\mathrm{MT}} \;=\; \mathbb{E}[A_c]^{-1}\, \mathbb{E}[b_c], \qquad \theta^\star_{\mathrm{meta}} \;=\; \mathbb{E}\big[(I - \alpha A_c)\, A_c\, (I - \alpha A_c)\big]^{-1}\, \mathbb{E}\big[(I - \alpha A_c)^2\, b_c\big]. \qquad (5) $$
In the limit $\alpha \to 0$, corresponding to multi-task learning, the optimal solution depends only on the mean values of the random variables $A_c$ and $b_c$, whereas the meta-learner ($\alpha > 0$) exploits information in their higher-order moments.

In the case of meta-VMC, we consider gradient-based optimization. Specifically, we focus on model-agnostic meta-learning (MAML) [6], a gradient-based algorithm that has been proposed for optimizing the meta-learning objective. Straightforward application of the chain rule gives rise to the following gradient estimator for the meta-learning objective:
$$ \nabla_\theta L_{\mathrm{meta}}(\theta) \;=\; \mathbb{E}_{c \sim p}\Big[ J_{\Phi_c^k}(\theta)^\top\, \nabla L_c\big(\Phi_c^k(\theta)\big) \Big], \qquad (6) $$
where $J_{\Phi_c^k}(\theta)$ denotes the Jacobian matrix of the function $\Phi_c^k$. The pseudocode for MAML is outlined in Algorithm 1. In order to facilitate readability, we have presented the algorithm with batching only in the task index, leaving the remaining expectation values (with respect to Born probabilities) in population form. In a practical algorithm, the intermediate variables $\Phi_c^k(\theta)$ and $\nabla L_c(\Phi_c^k(\theta))$ are estimated stochastically using independent batches of data generated by the same task $c$. Since the computation of the Jacobian involves an expensive backpropagation, first-order MAML (foMAML) has been proposed (e.g., [6, 10]), a simplification of MAML in which the Jacobian matrix is approximated by the identity matrix.
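For the quadratic toy model (4), both the exact MAML gradient and its first-order approximation can be written in a few lines. The sketch below is our illustration for a single task with $k = 1$; it makes explicit that foMAML simply drops the Jacobian factor $I - \alpha A_c$:

```python
import numpy as np

def maml_grads(A, b, theta, alpha):
    """Single-task, single-step (k=1) MAML gradient for the quadratic model
    L(theta) = 0.5 theta^T A theta - b^T theta of Eq. (4).

    Inner update: Phi(theta) = theta - alpha * (A theta - b).
    The exact (MAML) gradient of L(Phi(theta)) uses the Jacobian of Phi,
    J = I - alpha * A; foMAML replaces J by the identity.
    """
    phi = theta - alpha * (A @ theta - b)    # adapted parameters Phi(theta)
    outer = A @ phi - b                      # grad L evaluated at Phi(theta)
    J = np.eye(len(theta)) - alpha * A       # Jacobian of the inner update
    return J.T @ outer, outer                # (MAML gradient, foMAML gradient)
```

Because the inner update is affine in $\theta$ here, the MAML gradient can be checked exactly against finite differences of $L(\Phi(\theta))$.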
3 Relationship with previous work
In this section, we differentiate our proposal from the uses of meta-learning that have been proposed in the quantum computing literature. In [15], for example, meta-learning has been proposed to mitigate various sources of noise, specifically shot noise and parameter noise. In the context of VMC, shot noise is analogous to the variance associated with finite mini-batches, whereas parameter noise has no clear analogue. Ref. [14] is the most similar to ours in that it considers meta-learning from known distributions. It differs in its choice to focus on variational quantum algorithms such as VQE and QAOA, and in the fact that it does not use model-agnostic meta-learning; instead, the meta-learning outer loop involves training a separate recurrent neural network, similar to [1].

4 Experiments
The experiments focus on the MaxCut problem, which is defined in terms of a simple, undirected graph with binary-indicator adjacency matrix $A$. In order to attack the MaxCut problem with the VMC, we encode the solution in the ground state of the following classical antiferromagnetic Ising Hamiltonian, whose exchange interaction energy matrix is given by $A$:
$$ H \;=\; \sum_{1 \le i < j \le n} A_{ij}\, Z_i Z_j \qquad (7) $$
where $Z_i$ denotes the Pauli-$Z$ operator acting locally on the $i$th qubit. Since the MaxCut Hamiltonian is diagonal in the Pauli-$Z$ basis, it acts as a multiplication operator, and the ground state can be chosen as a computational basis vector corresponding to a maximal cut. The variational wavefunction was chosen to be a real-valued Boltzmann machine with a layer of hidden units. The task distribution defining the ensemble of MaxCut Hamiltonians was constructed by the following procedure. The adjacency matrix $A_0$ of a Bernoulli random graph with fixed edge probability was first chosen and held fixed throughout the experiments. Sampling an adjacency matrix from the task distribution is performed by rounding $A_0 + E_\sigma$ to a 0/1 matrix, where $E_\sigma$ denotes an $n \times n$ matrix with i.i.d. entrywise Gaussian noise of standard deviation $\sigma$. The hyperparameter controlling the task diversity for this ensemble is thus the noise variance $\sigma^2$.

During training, each iteration of the meta-learning loop involved independently sampling a batch of tasks from the task distribution. During testing, adjacency matrices were sampled from the task distribution and held fixed for evaluation purposes. The inner loop used vanilla SGD with batch size 128, while the outer loop used 100 iterations of vanilla SGD with batch size 16. The meta-learning experiments were conducted using MAML [6] and first-order MAML (foMAML) [6, 10]. For baselines, we compared against training a neural-network quantum state from scratch, as well as a pretrained initialization with fine-tuning. The learning curves, illustrated for different values of $\sigma$ in Fig. 1, clearly show that model-agnostic meta-learning dramatically accelerates training compared to the baselines on the test MaxCut instances, consistent with the goal of MAML in promoting adaptivity. The networks trained using MAML also found larger cut values than the baselines, which appear to converge prematurely to suboptimal states.
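The task-sampling procedure and the cut value associated with a spin configuration can be sketched as follows. This is an illustrative reconstruction: the function names are ours, and symmetrizing the noise matrix is our assumption, made to keep the sampled graph undirected.

```python
import numpy as np

def sample_task(A0, sigma, rng):
    """Sample a perturbed adjacency matrix: round A0 + Gaussian noise to 0/1."""
    n = A0.shape[0]
    E = sigma * rng.standard_normal((n, n))
    E = np.triu(E, 1) + np.triu(E, 1).T       # symmetric noise, zero diagonal
    A = np.clip(np.rint(A0 + E), 0, 1)        # round entrywise to {0, 1}
    np.fill_diagonal(A, 0)                    # no self-loops
    return A

def cut_value(A, z):
    """Cut value of a +/-1 spin assignment z; relates to Eq. (7) via
    cut = sum_{i<j} A_ij (1 - z_i z_j) / 2."""
    n = len(z)
    return sum(A[i, j] * (1 - z[i] * z[j]) / 2
               for i in range(n) for j in range(i + 1, n))
```

With $\sigma = 0$ the sampler returns the base graph unchanged, and increasing $\sigma$ flips more edges, matching its role as the task-diversity hyperparameter.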
5 Discussion and future directions
The preliminary experimental results for MaxCut ensembles indicate that MAML effectively solves meta-VMC by accelerating training and improving convergence. While the MaxCut problem is exactly solvable for the graph sizes considered here (e.g., by the branch-and-bound method [11]), our work paves the way to investigating physically interesting matrix ensembles that cannot be diagonalized by a local change of basis. The ideas presented in this paper also extend naturally to variational quantum algorithms (VQAs) such as the variational quantum eigensolver. The key difference in the case of VQAs is that the denominator in the Rayleigh quotient (1) is automatically normalized, and stochastic estimation of the quantum expectation value involves performing measurements in multiple bases if the Hamiltonian contains noncommuting terms. The exploration of meta-VQA and associated learning algorithms is left to future work.
Broader Impact
The authors have not identified any ethical impacts or future societal consequences of this work.
The authors would like to thank Giuseppe Carleo for many helpful discussions. The authors gratefully acknowledge support from NSF under grant DMS-2038030.
References
 [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 [2] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
 [3] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
 [4] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
 [5] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1082–1092, 2020.
 [6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1126–1135. JMLR.org, 2017.
 [7] Joseph Gomes, Keri A McKiernan, Peter Eastman, and Vijay S Pande. Classical quantum optimization with neural network quantum states. arXiv preprint arXiv:1910.10675, 2019.
 [8] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
 [9] W. L. McMillan. Ground state of liquid He4. Phys. Rev., 138:A442–A451, 1965.
 [10] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
 [11] Franz Rendl, Giovanni Rinaldi, and Angelika Wiegele. Solving Max-Cut to optimality by intersecting semidefinite and polyhedral relaxations. Mathematical Programming, 121(2):307, 2010.
 [12] Sandro Sorella. Green function monte carlo with stochastic reconfiguration. Physical Review Letters, 80(20):4558–4561, 1998.
 [13] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
 [14] Guillaume Verdon, Michael Broughton, Jarrod R McClean, Kevin J Sung, Ryan Babbush, Zhang Jiang, Hartmut Neven, and Masoud Mohseni. Learning to learn with quantum neural networks via classical neural networks. arXiv preprint arXiv:1907.05415, 2019.
 [15] Max Wilson, Sam Stromswold, Filip Wudarski, Stuart Hadfield, Norm M Tubman, and Eleanor Rieffel. Optimizing quantum heuristics with meta-learning. arXiv preprint arXiv:1908.03185, 2019.
 [16] Tianchen Zhao, Giuseppe Carleo, James Stokes, and Shravan Veerapaneni. Natural evolution strategies and variational Monte Carlo. arXiv preprint arXiv:2005.04447, 2020.