# Meta Variational Monte Carlo

An identification is found between meta-learning and the problem of determining the ground state of a randomly generated Hamiltonian drawn from a known ensemble. A model-agnostic meta-learning approach is proposed to solve the associated learning problem and a preliminary experimental study of random Max-Cut problems indicates that the resulting Meta Variational Monte Carlo accelerates training and improves convergence.


## 1 Introduction

This paper concerns the development of heuristic approximation algorithms for determining the ground state (minimal eigenpair) of certain large random Hermitian matrices drawn from known ensembles of relevance to physics. Concretely, we focus on the real vector subspace of Hermitian matrices consisting of 2-local, $n$-qubit Hamiltonians, which are enumerated by a number of parameters only polynomial in $n$. These 2-local Hamiltonians in particular are expressive enough to capture all quadratic unconstrained binary optimization (QUBO) problems.
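To make the QUBO connection concrete, the change of variables $x_i = (1 - z_i)/2$ rewrites any QUBO objective $x^\top Q x$ over $x_i \in \{0,1\}$ as a diagonal 2-local Ising Hamiltonian over spins $z_i \in \{-1,+1\}$. A minimal sketch (the helper `qubo_to_ising` and its coupling conventions are our own illustration, not from the paper):

```python
import numpy as np

def qubo_to_ising(Q):
    """Map a QUBO objective x^T Q x (x_i in {0,1}) to Ising form
    sum_{i<j} J_ij z_i z_j + sum_i h_i z_i + const with z_i in {-1,+1},
    via the substitution x_i = (1 - z_i) / 2."""
    Q = np.asarray(Q, dtype=float)
    Qs = (Q + Q.T) / 2.0               # symmetrize; x^T Q x depends only on Qs
    J = Qs / 2.0                       # pairwise couplings; use upper triangle i < j
    np.fill_diagonal(J, 0.0)
    h = -Qs.sum(axis=1) / 2.0          # local fields
    const = (Qs.sum() + np.trace(Qs)) / 4.0
    return J, h, const
```

The returned `(J, h, const)` are exactly the polynomially many parameters of a 2-local diagonal Hamiltonian.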

The Variational Monte Carlo (VMC) mcmillan-pr65 is a heuristic algorithm which, given a local Hamiltonian $H$, produces an estimate for the minimal eigenvalue $\lambda_{\min}(H)$ and a description of an associated eigenvector. By exploiting neural networks as trial wavefunctions, Carleo and Troyer carleo2017solving showed that VMC can achieve state-of-the-art results for the ground-state energies of physically important magnetic spin models. The domain of applicability of so-called neural-network quantum states has since been expanded considerably. It was recently shown, for example, that in the case of binary optimization, VMC is equivalent to Natural Evolution Strategies (NES) gomes2019classical and state-of-the-art results can be achieved for the Max-Cut Hamiltonian zhao2020natural, albeit at the expense of significantly increased computation time compared to the best known classical heuristic approximation algorithms. The slowdown is attributed to the gradient-based training loop within the VMC. In this step, the neural-network parameters $\theta$ are guided in the direction of steepest descent of the following unbiased estimator of the Rayleigh quotient, which upper bounds the lowest eigenvalue:

$$ L(\theta) := \frac{\langle \psi_\theta | H | \psi_\theta \rangle}{\langle \psi_\theta | \psi_\theta \rangle} = \mathbb{E}_{x \sim |\psi_\theta(\cdot)|^2}\!\left[ \frac{\langle x | H | \psi_\theta \rangle}{\psi_\theta(x)} \right] \ge \lambda_{\min}(H), \tag{1} $$

where $\{|x\rangle\}$ denotes the computational basis of $(\mathbb{C}^2)^{\otimes n}$ and the vector $|\psi_\theta\rangle$ denotes a neural-network quantum state in the sense that the amplitude $\psi_\theta(x) = \langle x | \psi_\theta \rangle$ is computed by a neural network with variational parameters $\theta$. The expectation value is taken with respect to the associated Born probabilities, which are proportional to $|\psi_\theta(x)|^2$. In practice, the objective function is optimized using a variant of stochastic minibatch gradient descent called Stochastic Reconfiguration sorella_aps98.
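For a Hamiltonian that is diagonal in the computational basis (the QUBO case studied below), the local energy $\langle x | H | \psi_\theta \rangle / \psi_\theta(x)$ reduces to the diagonal entry $E(x)$, so the estimator in (1) is just an average of diagonal energies under the Born distribution. A minimal sketch for small $n$, where the Born distribution can be enumerated exactly (the helpers `log_psi` and `diag_energy` are hypothetical interfaces, not the paper's code):

```python
import numpy as np

def vmc_energy_estimate(log_psi, diag_energy, n, num_samples, rng):
    """Monte Carlo estimate of the Rayleigh quotient (1) for a diagonal
    Hamiltonian: L(theta) = E_{x ~ |psi_theta(x)|^2}[E(x)].

    log_psi maps a batch of bitstrings (shape (m, n)) to log-amplitudes;
    diag_energy maps bitstrings to the diagonal entries E(x). For small n
    we sample from the exact Born distribution by enumerating all 2^n states.
    """
    states = (np.arange(2 ** n)[:, None] >> np.arange(n)) & 1  # all bitstrings
    log_p = 2.0 * log_psi(states)          # unnormalized log Born probabilities
    p = np.exp(log_p - log_p.max())
    p /= p.sum()                           # normalized Born probabilities
    idx = rng.choice(2 ** n, size=num_samples, p=p)
    return diag_energy(states[idx]).mean()
```

In a full VMC, the exact enumeration would be replaced by Markov-chain sampling and the estimate would feed a Stochastic Reconfiguration update.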

In contrast to the canonical formulation of VMC, which accepts a single Hamiltonian as input, in this paper we propose the meta-VMC, which asks for an approximation of the ground energy for an ensemble of 2-local Hamiltonians $H_\tau$ indexed by a random disorder parameter $\tau$ sampled from a known distribution $\mathcal{T}$. The simplest strategy of retraining a separate neural-network quantum state from scratch for each realization of the disorder parameter is clearly impractical. The goal is thus shifted to finding a neural-network initialization that is maximally adaptive to new realizations of the disorder. For experimental support, we focus in this paper on the special case of QUBO problems, which can be identified with Hamiltonians that are diagonal in the Pauli-$Z$ basis. These preliminary results are viewed as a stepping stone to the more physically interesting problem of random quantum spin models, in which we anticipate similar optimization considerations to apply.

The formulation of meta-VMC exhibits obvious parallels with meta-learning, or learning to learn, in the machine-learning literature thrun2012learning, where data from previously encountered learning tasks is employed to accelerate performance on new tasks drawn from an underlying task distribution. In the language of meta-learning, $\tau$ indexes the learning task and $\mathcal{T}$ denotes the distribution over all tasks. In meta-VMC, we assume the task distribution $\mathcal{T}$ is known to the learner; in contrast, conventional meta-learning assumes $\mathcal{T}$ is unknown but possesses sufficient regularity to render meta-learning feasible. A second, less significant distinction is that the per-task objective function for meta-VMC is an unbiased estimator for the population objective, whereas meta-learning typically focuses on empirical risk minimization objectives, which suffer from nonzero bias.

## 2 Theory

A simple strategy that has proven successful in meta-learning of deep neural networks is multi-task transfer learning caruana1997multitask; baxter2000model, which aims to learn an initialization for subsequent tasks by jointly optimizing the learning objectives of multiple tasks simultaneously, using a minibatch training strategy that interleaves batches across the tasks. Multi-task learning is, however, prone to catastrophic interference mccloskey1989catastrophic, making it unsuitable for generalization to the VMC. The problem is exemplified by some of the simplest examples of disordered spin systems: suppose $H_\tau$ is a random Hamiltonian whose expected value under the disorder parameter vanishes, $\mathbb{E}_\tau[H_\tau] = 0$. As a concrete example, consider the Sherrington-Kirkpatrick Hamiltonian, in which the disorder parameter represents a collection of i.i.d. centered Gaussian random variables representing the exchange energies. If we denote by $L_\tau$ the objective function corresponding to disorder parameter $\tau$, then the multi-task learning objective function, expressed in the population limit, is given by

$$ L_{\mathrm{MTL}}(\theta) := \mathbb{E}_{\tau \sim \mathcal{T}}[L_\tau(\theta)] = \mathbb{E}_{\tau \sim \mathcal{T}}\!\left\{ \mathbb{E}_{x \sim |\psi_\theta(\cdot)|^2}\!\left[ \frac{\langle x | H_\tau | \psi_\theta \rangle}{\psi_\theta(x)} \right] \right\} = \frac{\langle \psi_\theta | \mathbb{E}[H_\tau] | \psi_\theta \rangle}{\langle \psi_\theta | \psi_\theta \rangle} = 0. \tag{2} $$

The fact that the multi-task learning objective loses its dependence on $\theta$ in the population limit implies that the associated minibatch algorithm makes no progress asymptotically.

In order to define an objective function which is asymptotically non-vacuous and which promotes adaptation to new realizations of disorder, we propose to optimize the following meta-learning objective function, again presented in population form for simplicity andrychowicz2016learning,

$$ L_{\mathrm{ML}}(\theta) := \mathbb{E}_{\tau \sim \mathcal{T}}\big[L_\tau\big(U_\tau^t(\theta)\big)\big] = \mathbb{E}_{\tau \sim \mathcal{T}}\big[L_\tau\big(\underbrace{U_\tau \circ \cdots \circ U_\tau}_{t \text{ times}}(\theta)\big)\big], \tag{3} $$

where $U_\tau^t$ denotes the $t$-fold application of a task-adaptation operator $U_\tau$, which in the simplest case of gradient descent with step size $\beta$ is given by $U_\tau(\theta) = \theta - \beta \nabla L_\tau(\theta)$. Optimization of the meta-learning objective ensures that when a new realization of the disorder parameter is drawn, the initialization performs well after one or more steps of gradient descent. Loosely speaking, meta-learning can be justified when one has a budget for running a few steps of gradient descent per new task.
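The inner-loop adaptation in (3) can be sketched directly (a vanilla-gradient-descent illustration; the gradient oracle `grad_L` is a hypothetical stand-in for the per-task VMC gradient):

```python
def adapt(theta, grad_L, beta, t):
    """t-fold application of the vanilla gradient-descent task-adaptation
    operator U_tau(theta) = theta - beta * grad_L_tau(theta) from Eq. (3).
    Works on floats or NumPy arrays alike."""
    for _ in range(t):
        theta = theta - beta * grad_L(theta)
    return theta
```

Optimizing the meta-objective then requires differentiating through this adaptation loop, which is precisely what MAML provides.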

The fact that the meta-learning objective function manages to avoid the catastrophic-interference phenomenon can be illustrated by the following toy model (this quadratic model has also been analyzed in the context of convergence theory in fallah2020convergence). Rather than considering the Rayleigh quotient (1), consider the following ensemble of quadratic functions specified by a random positive-definite matrix $A$ and a random vector $b$,

$$ L_\tau(\theta) = \tfrac{1}{2} \langle \theta, A\theta \rangle - \langle b, \theta \rangle, \tag{4} $$

where the random variable $\tau = (A, b)$ now corresponds to the task label. In the simplest setting of single-step ($t = 1$) meta-learning with vanilla update operator $U_\tau(\theta) = \theta - \beta \nabla L_\tau(\theta)$, the optimal solutions of the multi-task and meta-learning objectives can be found in closed form; for the meta-learning objective,

$$ \operatorname*{argmin}_{\theta \in \mathbb{R}^d} L_{\mathrm{ML}}(\theta) = \mathbb{E}\big[A(I - \beta A)^2\big]^{-1} \, \mathbb{E}\big[(I - \beta A)^2 b\big]. \tag{5} $$

In the limit $\beta \to 0$ corresponding to multi-task learning, the optimal solution $\mathbb{E}[A]^{-1}\mathbb{E}[b]$ is found to depend only on the mean value of the random variable $\tau = (A, b)$, whereas the meta-learner ($\beta > 0$) exploits information in the higher-order moments of $\tau$.
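The closed form (5) can be checked numerically by replacing the population expectations with empirical averages over a finite task ensemble (the dimension, step size, and ensemble below are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
d, num_tasks, beta = 3, 50, 0.05
I = np.eye(d)

# Empirical task ensemble: random positive-definite A_tau and vectors b_tau.
tasks = []
for _ in range(num_tasks):
    M = rng.normal(size=(d, d))
    tasks.append((M @ M.T + d * I, rng.normal(size=d)))

# Closed-form minimizer from Eq. (5), with population expectations
# replaced by empirical averages over the task ensemble.
lhs = np.mean([A @ (I - beta * A) @ (I - beta * A) for A, _ in tasks], axis=0)
rhs = np.mean([(I - beta * A) @ (I - beta * A) @ b for A, b in tasks], axis=0)
theta_ml = np.linalg.solve(lhs, rhs)

# Stationarity check: the meta-gradient E[(I - beta A)(A theta' - b)],
# with adapted parameters theta' = theta - beta (A theta - b), vanishes.
grads = [(I - beta * A) @ (A @ (theta_ml - beta * (A @ theta_ml - b)) - b)
         for A, b in tasks]
assert np.linalg.norm(np.mean(grads, axis=0)) < 1e-8
```

Setting `beta = 0` in the same script recovers the multi-task solution $\mathbb{E}[A]^{-1}\mathbb{E}[b]$.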

In the case of meta-VMC, we consider gradient-based optimization. Specifically, we focus on model-agnostic meta-learning (MAML) finn2017model, a gradient-based algorithm that has been proposed for optimizing the meta-learning objective. Straightforward application of the chain rule gives rise to the following gradient estimator for the meta-learning objective,

$$ \nabla L_{\mathrm{ML}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}}\!\left[ (U_\tau^t)'(\theta) \, \nabla L_\tau\big(U_\tau^t(\theta)\big) \right], \tag{6} $$

where $(U_\tau^t)'(\theta)$ denotes the Jacobian matrix of the function $U_\tau^t$. The pseudocode for MAML is outlined in Algorithm 1. In order to facilitate readability, we have presented the algorithm with batching only in the task index, leaving the remaining expectation values (with respect to Born probabilities) in population form. In a practical algorithm, the intermediate quantities appearing in (6) are estimated stochastically using independent batches of data generated by the same task $\tau$. Since the computation of the Jacobian involves an expensive back-propagation, first-order MAML (foMAML) has been proposed (e.g., finn2017model; nichol2018first), which is a simplification of MAML in which the Jacobian matrix is approximated by the identity matrix.
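A sketch of the foMAML outer loop under this identity-Jacobian approximation (the gradient-oracle interface is our own illustration; a real implementation would estimate each per-task gradient from independent minibatches):

```python
import numpy as np

def fomaml_step(theta, sample_task, beta, t, alpha, num_tasks, rng):
    """One outer-loop update of first-order MAML (foMAML).

    sample_task(rng) -> grad_L, a per-task gradient oracle (hypothetical
    interface for this sketch). The Jacobian (U_tau^t)'(theta) in Eq. (6)
    is approximated by the identity, so the meta-gradient for each task is
    simply grad_L evaluated at the adapted parameters."""
    meta_grad = np.zeros_like(theta)
    for _ in range(num_tasks):
        grad_L = sample_task(rng)
        phi = theta
        for _ in range(t):                   # inner loop: t steps of U_tau
            phi = phi - beta * grad_L(phi)
        meta_grad = meta_grad + grad_L(phi)  # first-order meta-gradient
    return theta - alpha * meta_grad / num_tasks
```

Full MAML would additionally back-propagate through the inner loop to obtain the Jacobian factor in (6).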

## 3 Relationship with previous work

In this section, we differentiate our proposal from uses of meta-learning that have previously been proposed in the quantum computing literature. In wilson2019optimizing, for example, meta-learning has been proposed to mitigate various sources of noise, specifically shot noise and parameter noise. In the context of VMC, shot noise is analogous to the variance associated with finite minibatches, whereas parameter noise has no clear analogue. Ref. verdon2019learning is the most similar to ours in that they consider meta-learning from known distributions. They differ by the choice to focus on variational quantum algorithms such as VQE and QAOA, and by the fact that they do not use model-agnostic meta-learning. Instead, their meta-learning outer loop involves training a separate recurrent neural network, similar to andrychowicz2016learning.

## 4 Experiments

The experiments focus on the Max-Cut problem, which is defined in terms of a simple, undirected graph with binary-indicator adjacency matrix $A$. In order to attack the Max-Cut problem with the VMC, we encode the solution in the ground state of the following classical antiferromagnetic Ising Hamiltonian with exchange interaction energy matrix given by $A$,

$$ H_\tau = \sum_{1 \le i < j \le n} A_{ij} Z_i Z_j, \tag{7} $$

where $Z_i$ denotes the Pauli-$Z$ operator acting locally on the $i$-th qubit. Since the Max-Cut Hamiltonian is diagonal in the Pauli-$Z$ basis, it acts as a multiplication operator, and the ground state can be chosen as a computational basis vector $|x\rangle$ corresponding to a maximal cut. The variational wavefunction was chosen to be a real-valued Boltzmann machine with a fixed number of

hidden units. The task distribution defining the ensemble of Max-Cut Hamiltonians was defined by the following procedure. The adjacency matrix $A_0$ of a Bernoulli random graph with fixed edge probability was first chosen and held fixed throughout the experiments. Sampling an adjacency matrix from the task distribution is performed by rounding the perturbed matrix $A_0 + E$ to a 0/1 matrix, where $E$ denotes an $n \times n$ symmetric matrix with entrywise Gaussian noise of variance $\sigma^2$. The hyperparameter controlling the task diversity for this ensemble is thus the variance $\sigma^2$.
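The setup above can be sketched as follows, assuming the standard antiferromagnetic convention $H_\tau = \sum_{i<j} A_{ij} Z_i Z_j$ and a rounding threshold of 0.5 for the noisy adjacency matrix (both conventions are assumptions on our part):

```python
import numpy as np

def ising_energy(A, z):
    """Diagonal energy <z|H_tau|z> = sum_{i<j} A_ij z_i z_j for a spin
    configuration z in {-1,+1}^n."""
    return np.sum(np.triu(A, 1) * np.outer(z, z))

def cut_value(A, z):
    """Cut size of the bipartition given by the signs of z: each cut edge
    contributes -1 and each uncut edge +1 to the Ising energy, so
    cut = (|E| - <z|H_tau|z>) / 2."""
    num_edges = np.triu(A, 1).sum()
    return (num_edges - ising_energy(A, z)) / 2.0

def sample_adjacency(A0, sigma, rng):
    """Sample a Max-Cut task: perturb the base adjacency matrix A0 with
    symmetric entrywise Gaussian noise of standard deviation sigma and
    round back to a 0/1 adjacency matrix (threshold 0.5 is an assumption)."""
    n = A0.shape[0]
    noise = np.triu(rng.normal(scale=sigma, size=(n, n)), 1)
    A = ((A0 + noise + noise.T) >= 0.5).astype(int)  # keep the graph undirected
    np.fill_diagonal(A, 0)                           # simple graph: no self-loops
    return A
```

Minimizing `ising_energy` over spin configurations is thus equivalent to maximizing `cut_value`, which is what the VMC ground-state search accomplishes.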

During training, each iteration of the meta-learning loop involved independently sampling a batch of tasks from $\mathcal{T}$. During testing, adjacency matrices were sampled from $\mathcal{T}$ and held fixed for evaluation purposes. The inner loop used $t$ iterations of vanilla SGD (batch size 128), while the outer-loop training used 100 iterations of vanilla SGD (batch size 16). The meta-learning experiments were conducted using MAML finn2017model and first-order MAML (foMAML) finn2017model; nichol2018first. For baselines, we compared against training a neural-network quantum state from scratch, as well as a pre-trained initialization with fine-tuning. The learning curves, illustrated for different values of the task-diversity variance $\sigma^2$ in Fig. 1, clearly show that model-agnostic meta-learning dramatically accelerates training on the testing Max-Cut instances compared to the baselines, consistent with the goal of MAML in promoting adaptivity. The networks trained using MAML also found larger cut values than the baselines, which appear to converge prematurely to suboptimal states.

## 5 Discussion and future directions

The preliminary experimental results for Max-Cut ensembles indicate that MAML effectively solves meta-VMC by accelerating training and improving convergence. While the Max-Cut problem is exactly solvable for the graph sizes considered here (e.g., by the Branch and Bound method rendl2010solving), our work paves the way to investigate physically interesting matrix ensembles that cannot be diagonalized by a local change of basis. The ideas presented in this paper also extend naturally to variational quantum algorithms (VQAs) such as the variational quantum eigensolver. The key difference in the case of VQAs is that the denominator in the Rayleigh quotient (1) is automatically normalized, and stochastic estimation of the quantum expectation value involves performing measurements in multiple bases if the Hamiltonian contains non-commuting terms. The exploration of meta-VQA and associated learning algorithms is left to future work.