I. Introduction
Recently, the fields of machine learning and quantum information science have seen a great deal of cross-fertilization. On the one hand, a number of promising results suggest the potential for performing quantum or classical machine learning tasks on a quantum computer Biamonte et al. (2017). In particular, the variational quantum eigensolver Peruzzo et al. (2014) – perhaps the most promising quantum algorithm for first-generation quantum computers – is based on the variational optimization of a cost function evaluated on a quantum device, providing a new playground for hybrid quantum-classical learning McClean et al. (2016); Kandala et al. (2017). However, arguably the most significant advances have been in the field of classical variational algorithms for quantum many-body systems. A number of studies have shown that machine-learning-inspired sampling algorithms can reach state-of-the-art precision in tasks including ground state energy estimation Carleo and Troyer (2017); Choo et al. (2019), time evolution Carleo and Troyer (2017); Carleo et al. (2012), identifying phase transitions Van Nieuwenburg et al. (2017); Carrasquilla and Melko (2017); Broecker et al. (2017), and decoding quantum error correcting codes Sweke et al. (2018); Andreasson et al. (2019).
A model that has gathered a particularly large amount of attention is the complex restricted Boltzmann machine (RBM) state Ansatz with stochastic reconfiguration optimization introduced by Carleo and Troyer Carleo and Troyer (2017). The authors show that ground state energy evaluations can match state-of-the-art tensor network methods on benchmark problems.
At present, however, a theoretical underpinning is lacking for why complex RBM wavefunctions – or any other machine-learning-inspired parametrization – are a good Ansatz for describing ground states of physical Hamiltonians. In particular, it is difficult to assess and quantify the role of entanglement in these new classes of wavefunctions. This is sometimes referred to as the ‘black box’ problem of machine-learning-inspired approaches: success rests on loose heuristic arguments rather than a proper theoretical underpinning. Such a situation might be sufficient for real-world commercial applications of machine learning, but it is unsatisfactory when it comes to describing physical models, where we specifically hope to gain insight about underlying or emergent physical principles.
Some studies relate complex RBM states to tensor network states Chen et al. (2018); Collura et al. (2019), where the role of entanglement is naturally built into the model. However, these studies are mostly based on constructing abstract mappings between RBM wavefunctions and tensor network states, and usually provide at best existence proofs.
In this paper, we aim to obtain a better understanding of the learning dynamics of complex RBM wavefunctions by analyzing the geometry induced in parameter space. Indeed, the stochastic reconfiguration method updates the variational parameters of the wavefunction by gradient descent of the energy, weighted by a ‘quantum Fisher matrix’, the quantum analogue of the Fisher information matrix. The Fisher information matrix is known to be the unique Riemannian metric on a probability space that is invariant under sufficient statistics Cencov (2000). Hence it is the natural candidate for associating an ‘information geometry’ to a statistical model. We analyse the spectral properties of the quantum Fisher matrix for a variety of physics models. We argue that the information geometry provides us with clues about both the expressibility of the Ansatz state and the underlying physics, provided the optimization converges. In particular, we identify a number of features which we believe to be universal for spin models:
(i) The spectrum of the quantum Fisher matrix becomes singular in phases connected to a product state (in the computational basis). The singularity is more pronounced the closer one gets to the product state;
(ii) Critical phases have a smooth and extended spectrum, which is also reminiscent of image recognition models in classical machine learning;
(iii) Kinks in the spectrum reveal symmetries in the state.
(iv) The eigenvalues decay exponentially in magnitude. The largest eigenvalues have eigenvectors that are dominated by first moments; i.e. they do not contain much information about correlations in the system. This feature is more accentuated the sharper the spectral profile of the quantum Fisher matrix.
The above insights were extracted from extensive numerical data calculated for quantum spin Hamiltonians such as the transverse field Ising and Heisenberg XXZ models, as well as coherent Gibbs states of the two-dimensional classical Ising model. Various Monte Carlo sampling strategies were used to optimize the results on large system sizes.
Importantly, we observe that the bare values of the variational parameters reveal very little information about the physical properties of the system, contrary to the often-made claim that ‘activations indicate regions of activity in the underlying data’. We take this as evidence that there are many equivalent representations of the states in the vicinity of the ground state, suggesting that the optimizer preferentially chooses a robust representation of the ground state. Robustness of the Monte Carlo methods might be related to the generalization property in supervised learning. Our methods promise to be an essential diagnostic tool for further exploration with complex RBM wavefunctions as well as other machine-learning-inspired wavefunctions.
I.1 Complex RBM and optimization by stochastic reconfiguration
The complex Restricted Boltzmann Machine (RBM) neural network quantum state specifies the amplitudes of a wavefunction in some chosen computational basis by the exponential family:
(1)  \psi_\theta(\sigma) = \frac{1}{Z} \sum_{h \in \{-1,1\}^m} \exp\Big( \sum_i a_i \sigma_i + \sum_j b_j h_j + \sum_{ij} W_{ij}\, \sigma_i h_j \Big)
where the vectors a, b and the matrix W contain complex parameters to be varied in the optimization, and h is a binary vector indexing the ‘hidden’ units. Z is a constant guaranteeing normalization of the state. The complex RBM can be visualized as a bipartite graph between the visible nodes and the hidden nodes (see Fig. 1). To each edge we associate a variational parameter W_{ij}, and to each vertex we associate a bias weight a_i or b_j for a visible (\sigma_i) or hidden (h_j) binary degree of freedom. We will often express the variational parameters as a concatenated vector labelled \theta = (a, b, W).
For classical RBMs, the normalization constant Z is the partition function of a joint probability distribution on the hidden and visible units. This is generally not true in the complex case.
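As a concrete illustration, the hidden units in Eq. (1) can be summed out analytically, which is how RBM amplitudes are evaluated in practice. A minimal sketch, assuming hidden and visible units taking values ±1 (the function name is ours):

```python
import numpy as np

def log_psi(sigma, a, b, W):
    """Unnormalized log-amplitude of the complex RBM of Eq. (1).

    Summing the binary hidden units h_j in {-1, +1} out analytically gives
    log psi(sigma) = a . sigma + sum_j log(2 cosh(b_j + sum_i W_ij sigma_i)).
    """
    theta = b + sigma @ W  # effective hidden fields theta_j(sigma)
    return a @ sigma + np.sum(np.log(2.0 * np.cosh(theta)))

# With all parameters zero every basis state has the same amplitude:
n, m = 4, 8
a, b, W = np.zeros(n, complex), np.zeros(m, complex), np.zeros((n, m), complex)
sigma = np.array([1, -1, 1, 1])
print(log_psi(sigma, a, b, W))  # equals m * log(2)
```

Working with log-amplitudes avoids overflow in the product of cosh factors for large systems.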
The goal of variational Monte Carlo is to find the optimal parameters that minimize the energy of a given Hamiltonian in the state \psi_\theta. The standard approach would be to use gradient descent, but this performs very poorly for spin Hamiltonians, as the updates tend to get stuck oscillating back and forth along steep wells of the energy landscape rather than falling down the more shallow directions. The stochastic reconfiguration (SR) method Sorella (2001) for energy minimization is derived as a second order iterative approximation to the imaginary time ground state projection method (see Appendix A for a self contained derivation). In SR, the parameters of the Ansatz wavefunction are iteratively updated as
(2)  \theta^{(t+1)} = \theta^{(t)} - \eta\, S^{-1} F
where \eta is a constant specifying the rate of learning and F = \nabla_\theta \langle H \rangle_\theta is the gradient of the energy, given in Eq. (6) below. The second order effects which take curvature into account are determined by the matrix

(3)  S_{kl} = \langle O_k^\dagger O_l \rangle_{\psi_\theta} - \langle O_k^\dagger \rangle_{\psi_\theta} \langle O_l \rangle_{\psi_\theta}

of the diagonal operators O_k, one for each variational parameter \theta_k, which act for instance as

(4)  O_k \,|\sigma\rangle = \big( \partial_{\theta_k} \log \psi_\theta(\sigma) \big) |\sigma\rangle
in the computational basis \{|\sigma\rangle\}. We will call the matrix S the quantum Fisher matrix, because of its connection with information geometry as discussed in detail in the next section. The quantum Fisher matrix can be reformulated as a classical covariance matrix of the operators O_k,

(5)  S_{kl} = \langle O_k^* O_l \rangle_p - \langle O_k^* \rangle_p \langle O_l \rangle_p
and similarly the energy gradient can be written as

(6)  F_k = \langle E_{\rm loc}\, O_k^* \rangle_p - \langle E_{\rm loc} \rangle_p \langle O_k^* \rangle_p
where \langle \cdot \rangle_p denotes the classical expectation with respect to the Born distribution p(\sigma) = |\psi_\theta(\sigma)|^2 / \langle \psi_\theta | \psi_\theta \rangle, and

(7)  E_{\rm loc}(\sigma) = \frac{\langle \sigma | H | \psi_\theta \rangle}{\langle \sigma | \psi_\theta \rangle}

is called the local energy.
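For intuition, the local energy of Eq. (7) involves only amplitude ratios between connected configurations, which is why it is cheap to sample. A sketch, where `h_row` is a hypothetical helper returning the nonzero matrix elements of the Hamiltonian row at a configuration:

```python
import numpy as np

def local_energy(sigma, log_psi, h_row):
    """E_loc(sigma) = sum_{sigma'} H_{sigma,sigma'} psi(sigma') / psi(sigma), Eq. (7).

    h_row(sigma) yields the nonzero pairs (sigma_prime, H_element) of the
    Hamiltonian row at sigma; only amplitude *ratios* enter, so no
    normalization constant is needed.
    """
    return sum(h * np.exp(log_psi(sp) - log_psi(sigma)) for sp, h in h_row(sigma))

# Toy check: H = -sigma^x on a single spin; the uniform state |+> has E = -1.
flip_row = lambda s: [(-s, -1.0)]   # sigma^x flips the spin, coefficient -1
uniform = lambda s: 0.0             # log psi = const, so all ratios are 1
print(local_energy(np.array([1]), uniform, flip_row))  # -1.0
```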
For the RBM Ansatz, the diagonal operators take on the simple form:

(8)   O_{a_i}(\sigma) = \sigma_i
(9)   O_{b_j}(\sigma) = \tanh(\theta_j(\sigma))
(10)  O_{W_{ij}}(\sigma) = \sigma_i \tanh(\theta_j(\sigma))

where \theta_j(\sigma) = b_j + \sum_i W_{ij} \sigma_i, the index i runs over the n visible vertices and j runs over the m hidden vertices. Thus the size of the quantum Fisher matrix is (n + m + nm) \times (n + m + nm).
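Eqs. (8)-(10) and the covariance form Eq. (5) translate directly into a Monte Carlo estimator. A minimal sketch (hidden units in ±1 assumed, hence the tanh; function names are ours):

```python
import numpy as np

def O_vec(sigma, b, W):
    """Concatenated log-derivatives (O_a, O_b, O_W) of Eqs. (8)-(10):
    O_{a_i} = sigma_i, O_{b_j} = tanh(theta_j), O_{W_ij} = sigma_i tanh(theta_j)."""
    t = np.tanh(b + sigma @ W)
    return np.concatenate([sigma.astype(complex), t, np.outer(sigma, t).ravel()])

def quantum_fisher(O_samples):
    """Sample estimate of S_kl = <O_k* O_l> - <O_k*><O_l>, Eq. (5), from O
    values evaluated on configurations drawn from |psi(sigma)|^2."""
    Oc = O_samples - O_samples.mean(axis=0)
    return Oc.conj().T @ Oc / O_samples.shape[0]

# The estimate is Hermitian and positive semidefinite by construction.
rng = np.random.default_rng(0)
n, m = 3, 6
b = 0.1 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
W = 0.1 * (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m)))
samples = rng.choice([-1.0, 1.0], size=(200, n))
S = quantum_fisher(np.stack([O_vec(s, b, W) for s in samples]))
assert np.allclose(S, S.conj().T) and np.linalg.eigvalsh(S).min() > -1e-10
```

In a real run the configurations would of course come from the MCMC sampler rather than uniformly, as described below.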
The SR method is computationally efficient when the following conditions hold:

(1) The operators O_k and the local energy E_loc can be computed efficiently for every configuration \sigma.

(2) The probability distribution p(\sigma) can be sampled from for any values of \theta; meaning that any single Monte Carlo update can be computed efficiently. In practice we require that the cost of each Monte Carlo update is independent of system size; i.e. updates are local.

(3) The sampling procedure converges rapidly (in subpolynomial time) to the desired distribution p(\sigma).
The complex RBM Ansatz guarantees that (1) and (2) hold whenever the number of hidden units is a constant multiple of the number of visible units. However, as for essentially any sampling algorithm, provably guaranteeing (3) seems nearly impossible in any practically relevant problem. Experience has shown, though, that convergence is often rapid in practice whenever one steers clear of frustration or the fermionic sign problem. It is worth pointing out that convergence of the sampler can depend sensitively on the chosen basis and the initial state, as evidenced in Sec. III.2.
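Given estimates of S and of the energy gradient, a single SR iteration of Eq. (2) is just a regularized linear solve. A sketch, with illustrative default values (the diagonal shift is the regularization discussed in Sec. II.3):

```python
import numpy as np

def sr_step(theta, S, grad, eta=0.02, eps=1e-3):
    """One stochastic reconfiguration update, Eq. (2):
    theta <- theta - eta * (S + eps * I)^{-1} grad.

    eps is a small diagonal regularization that keeps the solve stable
    when S is nearly singular; eta and eps here are illustrative only.
    """
    delta = np.linalg.solve(S + eps * np.eye(S.shape[0]), grad)
    return theta - eta * delta

# With S = identity this reduces to plain gradient descent:
theta = np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
new = sr_step(theta, np.eye(3), grad, eta=0.1, eps=0.0)
print(new)  # [-0.1, 0.2, -0.05]
```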
I.2 Natural gradient and SR
The stochastic reconfiguration method can be understood as a quantum extension of Amari’s natural gradient optimization Amari (1998). Plain vanilla gradient descent optimizes a multivariate function f(\theta) by updating the parameters in the direction of steepest descent:

(11)  \theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla_\theta f(\theta^{(t)})

at a certain rate \eta.
In systems where the landscape of the function is very steep in certain directions and shallow in others, convergence can be very slow as the updates fluctuate back and forth in a deep valley, but take a long time to ‘drift’ down a shallow one. The natural gradient method proposes to update the parameters according to the natural (Riemannian) geometric structure of the information space, so that the landscape is made locally euclidean before the update. Suppose the coordinate space is a curved manifold in the sense that the infinitesimal squared length is given by the quadratic form

(12)  ds^2 = \sum_{kl} g_{kl}(\theta)\, d\theta_k\, d\theta_l
where the matrix g(\theta) is the Riemannian metric tensor. Amari showed that the steepest descent direction of the function f in the Riemannian space is given by

(13)  -\tilde{\nabla} f(\theta) = -g^{-1}(\theta)\, \nabla f(\theta)
The action of the inverse of g can be heuristically understood as locally ‘flattening’ out the space before the update. For general optimization problems, the Hessian is a natural choice for g, as it reproduces Newton’s second order method. In machine learning applications, and with RBMs in particular, the Hessian is hard to estimate by sampling. It also appears to be attracted to saddle points Dauphin et al. (2014).
When the parameter space in question is naturally associated with a classical probability distribution, the ‘natural’ geometry is chosen to be the Fisher information matrix, as it is the unique metric that is invariant under sufficient statistics Cencov (2000). For pure parametrized quantum states, the natural Riemannian metric is derived from the Fubini–Study distance:

(14)  D(\psi, \phi) = \arccos \frac{ |\langle \psi | \phi \rangle| }{ \|\psi\| \, \|\phi\| }
Infinitesimal distances are given by:

(15)  ds^2 = \sum_{kl} \left( \frac{\langle \partial_{\theta_k}\psi | \partial_{\theta_l}\psi \rangle}{\langle \psi | \psi \rangle} - \frac{\langle \partial_{\theta_k}\psi | \psi \rangle \langle \psi | \partial_{\theta_l}\psi \rangle}{\langle \psi | \psi \rangle^2} \right) d\theta_k^*\, d\theta_l

which reproduces the quantum Fisher matrix for the parametrization \psi_\theta as g = S.
In particular, when the wavefunction is positive in a given computational basis, the quantum state can be written as \psi_\theta(\sigma) = \sqrt{p_\theta(\sigma)}, and the quantum Fisher matrix is

(16)  S_{kl} = \tfrac{1}{4} \big( \mathbb{E}_p[\partial_{\theta_k} \log p_\theta\, \partial_{\theta_l} \log p_\theta] - \mathbb{E}_p[\partial_{\theta_k} \log p_\theta]\, \mathbb{E}_p[\partial_{\theta_l} \log p_\theta] \big)
(17)        = \tfrac{1}{4} F_{kl}

where \mathbb{E}_p denotes the expectation with respect to p_\theta, and F is the classical Fisher information matrix associated to the probability distribution p_\theta. Thus, the SR method reproduces the natural gradient method for positive wave functions. For this reason, we call the matrix S associated to a pure quantum state the quantum Fisher matrix.
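The relation S = F/4 of Eq. (17) can be checked on the smallest possible example, a two-outcome distribution p_t = (t, 1 - t) with \psi = (\sqrt{t}, \sqrt{1-t}); this toy example is our own:

```python
import numpy as np

def qfm_scalar(t):
    """Quantum Fisher 'matrix' (a scalar for one parameter) of
    psi(sigma) = sqrt(p_t(sigma)) with p_t = (t, 1 - t), built from the
    log-derivatives O(sigma) = d log psi(sigma) / dt via Eq. (5)."""
    O = np.array([1.0 / (2 * t), -1.0 / (2 * (1 - t))])
    p = np.array([t, 1.0 - t])
    return p @ O**2 - (p @ O) ** 2

t = 0.3
fisher = 1.0 / (t * (1 - t))   # classical Fisher information of a Bernoulli(t)
assert np.isclose(qfm_scalar(t), fisher / 4)   # Eq. (17): S = F / 4
```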
I.3 Spectral analysis of the quantum Fisher matrix
In this paper, we will argue that spectral properties of the quantum Fisher matrix reveal essential information about the physical properties of the system under study as well as the dynamics of optimization.
The quantum Fisher matrix is positive semidefinite, implying that its spectrum is real and there exists a set of orthonormal eigenvectors. The magnitude of an eigenvalue determines how steep the learning landscape is in that particular direction. The spectrum will generically be sloppy Waterfall et al. (2006), with a spectral function bounded above by a decaying exponential.
It is often argued in the machine learning community that gradient descent algorithms favor regions in parameter space where most eigenvalues are close to zero Sagun et al. (2016); Papyan (2018). This implies that at convergence, most directions in the landscape are nearly flat, suggesting that nearby points in parameter space encode much of the same physical properties. In classical supervised learning, the flatness of the landscape has been associated with the ‘generalization’ ability of the learned model Hochreiter and Schmidhuber (1997); in the physics setting we interpret it to mean that the representation is robust.
Because of the bipartite graph structure of the RBM Ansatz, it is natural to talk about correlations between the visible and hidden units. The quantum Fisher matrix is an (n + m + nm)-dimensional square matrix, with the first two blocks corresponding to the biases a and b, and the third block corresponding to the weight matrix W. The W block describes the directions in parameter space that can affect correlations in the model. We will see later that eigenvectors associated to eigenvalues of large magnitude are typically close to a product state between the visible and hidden part, meaning that they mostly just affect the first moments of the spin variables.
To measure correlations in an eigenvector v of the quantum Fisher matrix, we truncate the first two blocks of v associated with the biases, reshape the remaining ‘W’ part into an n \times m matrix, and renormalize it to have Hilbert-Schmidt norm 1, yielding a pure state |v_W\rangle on the visible-hidden space. We then calculate the entanglement in the state |v_W\rangle:

(18)  E(v) = S\big( \mathrm{Tr}_h\, |v_W\rangle\langle v_W| \big)

where \mathrm{Tr}_h is the partial trace over the hidden layer, and S(\rho) = -\mathrm{Tr}[\rho \log \rho] is the von Neumann entropy of the reduced density matrix.
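Since |v_W\rangle is a pure state, the entropy of Eq. (18) can be obtained from the Schmidt (singular) values of the reshaped W block. A sketch:

```python
import numpy as np

def eigvec_entanglement(v_W, n, m):
    """Von Neumann entropy, Eq. (18), of the reduced state Tr_h |v_W><v_W|,
    computed from the singular values (Schmidt coefficients) of the W block
    reshaped to an n x m (visible x hidden) matrix."""
    M = v_W.reshape(n, m)
    M = M / np.linalg.norm(M)                       # Hilbert-Schmidt norm 1
    p = np.linalg.svd(M, compute_uv=False) ** 2     # Schmidt weights
    p = p[p > 1e-14]
    return float(-(p * np.log(p)).sum())

# A product (rank-1) block carries no entanglement; a maximally
# entangled one on 2x2 gives log(2).
u, w = np.ones(2), np.array([1.0, -1.0])
assert np.isclose(eigvec_entanglement(np.outer(u, w).ravel(), 2, 2), 0.0)
assert np.isclose(eigvec_entanglement(np.eye(2).ravel(), 2, 2), np.log(2))
```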
II. Results
In this section, we analyze the spectral properties of the quantum Fisher matrix during the learning process of finding the ground state of the transverse field Ising (TFI) model. The TFI Hamiltonian is given by
(19)  H = -\sum_i \sigma^z_i \sigma^z_{i+1} - h \sum_i \sigma^x_i
where \sigma^{x,z}_i are Pauli spin operators and h is the external field. The system has Z_2 symmetry (\sigma^z \to -\sigma^z) which is spontaneously broken for h < 1 in the thermodynamic limit (N \to \infty). A second order phase transition occurs at h = 1. At zero external field the model has two degenerate ground states |00\cdots 0\rangle and |11\cdots 1\rangle, whereas in the limit h \to \infty the ground state is unique, given by |++\cdots +\rangle.
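For small chains, Eq. (19) can be diagonalized exactly, which is how benchmark ground state energies like those in Fig. 2(a) can be cross-checked. A dense-matrix sketch (for small N only; function names are ours):

```python
import numpy as np

def tfi_hamiltonian(N, h):
    """Dense TFI Hamiltonian of Eq. (19) with periodic boundaries:
    H = -sum_i sz_i sz_{i+1} - h * sum_i sx_i."""
    sz = np.diag([1.0, -1.0])
    sx = np.array([[0.0, 1.0], [1.0, 0.0]])

    def site_op(op, i):
        # Tensor product placing `op` at site i and identities elsewhere
        mats = [op if k == i else np.eye(2) for k in range(N)]
        full = mats[0]
        for M in mats[1:]:
            full = np.kron(full, M)
        return full

    H = np.zeros((2**N, 2**N))
    for i in range(N):
        H -= site_op(sz, i) @ site_op(sz, (i + 1) % N)
        H -= h * site_op(sx, i)
    return H

# At h = 0 the ground energy is -N (all bonds aligned, periodic chain)
E0 = np.linalg.eigvalsh(tfi_hamiltonian(3, 0.0)).min()
assert np.isclose(E0, -3.0)
```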
The spectral properties of the quantum Fisher matrix, as well as the energy during the learning process, are plotted in Fig. 2. Figure 2(a) confirms that the optimization procedure successfully finds the ground state for all values of h, albeit at different speeds. The quantum Fisher matrix is constructed approximately by Monte Carlo sampling and its full spectrum is evaluated every 5 epochs during learning. The eigenvalues at some representative epochs are plotted in decreasing order in Fig. 2(b).
The dynamics of the learning process proceeds in two distinct stages. The first stage is observed at the very beginning of the learning and lasts a fixed number of epochs (the duration of this first stage appears to depend on the hyperparameters, i.e. the learning rate and regularization, but not on the system size); it is the same for all values of h. The initial shape of the spectrum has two sharp drops (see Fig. 2(c)). This is a consequence of the random initialization with small weights. An analytic justification of this behavior is provided in Appendix B. The spectrum then gets pushed up over the course of the first stage, revealing that more and more dimensions in the information space become relevant. The second stage of learning then slowly transforms the distribution to that of the final converged state. We observe that the spectrum falls off very sharply (exponentially) in all cases examined (Fig. 2(b)), but the exact spectral profile depends strongly on the details of the model, yet not on the system size or on the specific values of the learned weights (see Appendix C for an in-depth discussion). We take this as evidence that the learned state not only minimizes the energy, but also closely matches the actual ground state of the model. The behavior of the spectrum of the quantum Fisher matrix in each phase of the TFI model is discussed in the next subsection.
II.1 Phases of the TFI model
The ferromagnetic phase (h < 1).
Let us start by considering the extreme case h = 0. The quantum Fisher matrix after convergence becomes proportional to a rank-one (pure state) projector up to numerical precision. The singularity of the quantum Fisher matrix in this case can be explained from the properties of the ground state: when h = 0, the Hamiltonian Eq. (19) has two ground states |00\cdots 0\rangle and |11\cdots 1\rangle. We first note that the optimization consistently found a solution with vanishing magnetization, leading to a symmetric state. Let us therefore assume that the solution exactly describes the symmetric ground state; i.e. |\psi\rangle = (|00\cdots 0\rangle + |11\cdots 1\rangle)/\sqrt{2}. Writing \sigma^\pm for the two fully polarized configurations, the RBM amplitudes then satisfy \psi(\sigma^\pm) = 1/\sqrt{2}, and zero otherwise.
Moreover, assuming the biases vanish at the solution, we have, with w_j = \sum_i W_{ij},

(20)  O_{a_i}(\sigma^\pm) = \pm 1, \qquad O_{b_j}(\sigma^\pm) = \pm\tanh(w_j)
(21)  O_{W_{ij}}(\sigma^\pm) = \tanh(w_j)

Taking the covariance Eq. (5) over these two equally weighted configurations, the quantum Fisher matrix is

(22)  S = u\, u^\dagger, \qquad u = (1, \ldots, 1, \tanh(w_1), \ldots, \tanh(w_m), 0, \ldots, 0)^T

which is rank 1. We note that the above argument does not depend on the details of the weights W_{ij}, but only on the row sums w_j, so that any set of RBM weights that accurately models the ground state will exhibit the same behavior. The SR optimization typically favors small weights.
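The rank argument can be verified directly by building the covariance Eq. (5) over just the two fully polarized configurations (with vanishing biases, as assumed above); random weights illustrate that the details of W do not matter:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 8
W = 0.1 * (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m)))
b = np.zeros(m, dtype=complex)

def O_full(sigma):
    # Eqs. (8)-(10) with zero biases, concatenated into one vector
    t = np.tanh(b + sigma @ W)
    return np.concatenate([sigma.astype(complex), t, np.outer(sigma, t).ravel()])

# |psi|^2 is supported only on the two fully polarized configurations
O = np.stack([O_full(np.ones(n)), O_full(-np.ones(n))])
Oc = O - O.mean(axis=0)
S = Oc.conj().T @ Oc / 2
assert np.linalg.matrix_rank(S, tol=1e-10) == 1   # Eq. (22): S = u u^dagger
```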
As the external field h increases, the number of computational basis states supporting the ground state increases, and thus we expect the rank of S to increase with h. This is consistent with the results from our numerical data in Fig. 2(b). Importantly, rank deficiency is observed throughout the ferromagnetic phase, albeit much more pronounced in the vicinity of h = 0. We interpret this behavior as a signature that the phase is connected to a product state in the physical basis. For values of h close to one, the rank deficiency can only be seen at large system sizes, and after many training epochs.
The critical point (h = 1)
At the critical point, the distribution of eigenvalues after convergence is smooth and decreases exponentially. This behavior is also seen in many classical image processing tasks in machine learning Papyan (2018); Grosse and Salakhudinov (2015), suggesting that it might be a signature of (critical) long range order. Indeed, each element of the quantum Fisher matrix can be expanded in terms of correlation functions, all of which are sizeable in the critical case. This eigenvalue distribution is characteristic of ‘sloppy model universality’, which has been shown to reflect systems with certain forms of scale invariance Waterfall et al. (2006), further corroborating the claim. We will see in Sec. III.1 that this behavior appears in many other systems and reveals that the RBM is fine-tuning a solution with the help of a large number of hidden units.
The paramagnetic phase (h > 1)
In this case, we see that the energy converges rapidly and the eigenvalues barely change after the initial learning stage. In particular, the second jump in the spectrum of the initial random RBM survives until the end. For large h, the jump reveals that the quantum Fisher matrix has no support on the antisymmetric subspace (see Appendix B). Precisely, in our numerical data the spectrum drops by several orders of magnitude between the 406th eigenvalue and the next one.
To understand the stepwise behavior, we first focus on the randomly initialized RBM; i.e. at epoch 0. As we initialize the parameters of the RBM with small random Gaussian values, the associated classical probability distribution is close to the one with all parameters zero. When all parameters are zero, the RBM gives the uniform distribution. We can then perturbatively expand the quantum Fisher matrix in terms of the parameters. The derivation to leading order is given in Appendix B. Our derivation gives eigenvalues of order one associated with the visible-bias block of the matrix and parametrically smaller eigenvalues in the weights block of the quantum Fisher matrix. This explains the first and the second jumps in the eigenvalue distribution of the random RBM.
The randomly initialized RBM also hints at the fact that the quantum Fisher matrix throughout the paramagnetic phase strongly retains properties of the h \to \infty limit with product ground state |++\cdots +\rangle. We compare the spectra of the quantum Fisher matrix for the converged paramagnetic state and the randomly initialized case in Fig. 2(c). It shows that the second step is preserved while the first step disappears. This is because the first step depends on the details of the weights, whereas the second one is a consequence of the symmetry. We make a detailed comparison between the quantum Fisher matrix for the paramagnetic phase and the randomly initialized RBM in Appendix C. There we show that the converged matrix has larger diagonal elements in the weights part of the matrix than the random RBM, which also supports eigenvalues in the intermediate range between the two steps.
Throughout the phase diagram of the TFI, the spectrum of the quantum Fisher matrix at convergence has two special points, as seen in Fig. 2(c). The location of these points is independent of the number of hidden units, suggesting that they originate from the nature of the physical system and the overall bipartite structure of the RBM, rather than from any details of the RBM graph.
II.2 Eigenvectors
Above, we have argued that the eigenvalues of the quantum Fisher matrix reveal signatures of the phase of matter being simulated. We now ask whether the eigenvectors can teach us anything about how correlations are conveyed in the learning landscape. In particular, since the complex RBM is constructed from a bipartite graph with no connections within the hidden or within the visible layer, we know that all correlations have to be mediated by the weights. Entanglement in the information manifold is therefore completely contained in the weights block of the Fisher matrix.
In Fig. 2(d), we plot the entanglement between the visible and hidden units of the W part of each eigenvector (see Eq. (18)). We observe that the leading eigenvectors have very little entanglement away from the critical point. This suggests that the directions of largest curvature are almost exclusively associated with the biases, or first moments, of the distribution. Note that this does not imply that the values of the weights are small, as representations of the first moments are distributed over the biases and the weights. Rather it is a reminder that the actual values of the weights of the network reveal little information about the correlations in the system, as is manifest in Fig. 6 of Appendix C. This behavior is less pronounced at the critical point, where the quantum Fisher matrix behaves more like a random matrix, whose eigenvectors are expected to have more homogeneous amounts of entanglement.
The entanglement increases in the bulk of the spectrum. Interestingly, this means that the directions in parameter space that encode information about correlations are typically dense, smooth and flat. In the context of classical ML, these properties are akin to good generalization ability of the learning models, whereas in the present physics context, we interpret it to mean that the algorithm preferentially learns stable configurations, where changes (even large ones) in most directions in configuration space will not affect the physically observable properties of the system. Similar conclusions have been alluded to in the context of sloppy model universality in statistical mechanics Machta et al. (2013).
II.3 Numerics
For the numerical simulations, we set the ratio \alpha = m/n between the number of hidden and visible units of the complex RBM to a fixed constant. Thus the RBM has n + m + nm parameters overall (n and m for the biases a and b, and nm for the weight matrix W). To sample from the RBM, a Markov chain Monte Carlo (MCMC) method enhanced with parallel tempering was employed Choo et al. (2018). We used 16 parallel Markov chains at linearly spaced temperatures. For each Markov chain, we used local spin flip updates. To directly compare the results from variational Monte Carlo with exact diagonalization, we use a moderate system size and impose periodic boundary conditions throughout the paper unless otherwise stated. In practice, SR has two hyperparameters: the learning rate (\eta in Eq. (2)) and the regularization that is added to the diagonal of the quantum Fisher matrix for numerical stability when computing the inverse. These hyperparameters were held fixed for the simulation of the TFI.

II.4 Predictions
From the spectral analysis of the quantum Fisher matrix for the transverse field Ising model, we make the following predictions, which we expect to hold more generally for ferromagnetic quantum spin models:

(i) The spectral profile is universal within a phase of the model, and is only weakly dependent on system size away from phase transition points. The spectrum of the quantum Fisher matrix is therefore a good indicator of the existence of a phase transition if it is possible to find two points in phase space with vastly different spectral profiles.

(ii) The leading eigenvectors are close to product states, and hence do not encode correlations in the system. They mostly pertain to first moments of the distribution.

(iii) A rank deficient quantum Fisher matrix is evidence that the state is in a phase connected to a product state in the chosen computational basis. A smoothly decaying spectrum is a sign that the system contains a lot of correlation; often a critical phase with polynomially decaying correlation functions.

(iv) Kinks in the spectrum reveal symmetries in the model. In the case of the TFI, the persistent kink is a sign that the symmetric and antisymmetric subspaces are strictly separated everywhere except at the critical point.
III. Further experiments
In this section, we study two further models to test whether the predictions made in Sec. II.4 extend to more general spin systems. The first model is the two-dimensional coherent Gibbs state, whose quantum Fisher matrix is evaluated exactly without recourse to learning. The second is the XXZ model, where we explore all three phases with the tools developed above.
III.1 Coherent Gibbs state of the two-dimensional classical Ising model
We consider the RBM representation of the coherent Gibbs state of the two-dimensional classical Ising model. Recall the classical Ising model

(23)  H(\sigma) = -J \sum_{\langle i,j \rangle} \sigma_i \sigma_j

where \sigma is a configuration of the spins and \langle i,j \rangle runs over nearest neighbors on a two-dimensional lattice. For convenience, we set J = 1. We consider a system in thermal equilibrium at inverse temperature \beta. At high temperature (small \beta), the system exhibits a disordered paramagnetic phase characterized by zero magnetization, whereas it shows a symmetry-broken ferromagnetic phase with nonzero magnetization at sufficiently low temperature Baxter (1982). The phase transition takes place at \beta_c = \log(1 + \sqrt{2})/2 \approx 0.4407 in the thermodynamic limit and is second order. We thus have polynomial decay of the correlation functions at the critical point.
The coherent Gibbs state for the model at inverse temperature \beta is given by

(24)  |\psi_\beta\rangle = \frac{1}{\sqrt{Z}} \sum_\sigma e^{-\beta H(\sigma)/2}\, |\sigma\rangle

in a chosen computational basis, where Z is the normalization factor, which coincides with the partition function of the classical model. A key observation is that correlation functions of \sigma^z operators in this state are exactly the same as those of the classical model, i.e. \langle \psi_\beta | \sigma^z_i \sigma^z_j | \psi_\beta \rangle = \mathbb{E}_{p_\beta}[\sigma_i \sigma_j], where p_\beta(\sigma) = e^{-\beta H(\sigma)}/Z is the Boltzmann distribution. Thus we also have polynomially decaying quantum correlation functions for this state at \beta = \beta_c.
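Because the Born probabilities of |\psi_\beta\rangle are exactly the Boltzmann weights, quantum \sigma^z correlators reduce to classical thermal averages. This can be checked by brute-force enumeration on a tiny lattice (the 2x2 geometry below is our own toy choice):

```python
import numpy as np
from itertools import product

beta = 0.4
bonds = [(0, 1), (2, 3), (0, 2), (1, 3)]          # nearest neighbours on a 2x2 grid

def energy(s):
    """Classical Ising energy, Eq. (23), with J = 1."""
    return -sum(s[i] * s[j] for i, j in bonds)

configs = [np.array(s) for s in product([-1, 1], repeat=4)]
E = np.array([energy(s) for s in configs])

amps = np.exp(-beta * E / 2)                      # Eq. (24), unnormalized
born = amps**2 / np.sum(amps**2)                  # |<s|psi_beta>|^2
boltz = np.exp(-beta * E) / np.sum(np.exp(-beta * E))

corr = np.array([s[0] * s[1] for s in configs])   # sigma^z_0 sigma^z_1
assert np.allclose(born, boltz)                   # Born = Boltzmann weights
assert np.isclose(born @ corr, boltz @ corr)      # quantum = classical correlator
```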
It is known that coherent Gibbs states of Ising-type models can be represented exactly as an RBM Gao and Duan (2017) by associating each edge of the lattice to one hidden unit (we provide a self-contained derivation in Appendix D). In particular, the coherent Gibbs state of an Ising-type model defined on a graph can be described using an RBM whose number of hidden units equals the number of edges and whose weight matrix is sparse.
Using this mapping, we construct the quantum Fisher matrix of the RBM representation of the coherent Gibbs state |\psi_\beta\rangle. To sample from the distribution, we have employed the Wolff algorithm Wolff (1989) instead of the usual local update scheme, as it is more efficient close to the transition point. The spectral profiles of the quantum Fisher matrix for different values of \beta are shown in Fig. 3a).
The figure shows a very similar shape to that of the TFI case deep in the ferromagnetic or paramagnetic phases. The eigenvalues exhibit a collapsing distribution in the ferromagnetic phase (large \beta) and get progressively more singular as we increase \beta. Compare this behavior to the TFI deep in the ferromagnetic phase depicted in Fig. 2. In the paramagnetic phase (\beta < \beta_c), we see a stepwise distribution, very much like the TFI model at large h. Thus for coherent Gibbs states that are deep in each phase, we get the same qualitative behavior of the quantum Fisher matrix in both models.
In contrast to the learned TFI case in Sec. II, the drop-off survives also at criticality. This can be understood from the fact that the quantum Fisher matrix is constructed from the exact coherent Gibbs state, which is exactly symmetric under a global spin flip. Hence the quantum Fisher matrix has zero support on the antisymmetric subspace also at criticality. In Fig. 3c), we have plotted the quantum Fisher information, which is simply the trace of the quantum Fisher matrix, for different values of \beta. We see that the quantum Fisher information reaches a maximum in the vicinity of the phase transition point, hence acting as an order parameter reminiscent of the magnetic susceptibility. A more detailed analysis of the quantum Fisher information as a witness of phase transitions for this and other models will be presented elsewhere.
III.2 The XXZ model
We now consider the Heisenberg XXZ model

(25)  H = \sum_i \big( \sigma^x_i \sigma^x_{i+1} + \sigma^y_i \sigma^y_{i+1} + \Delta\, \sigma^z_i \sigma^z_{i+1} \big)

This model is exactly solvable using the Bethe Ansatz. The solution shows three distinct phases: (1) a gapped ferromagnetic phase for \Delta < -1, (2) a critical phase for -1 < \Delta \leq 1, and (3) a gapped antiferromagnetic phase for \Delta > 1. At \Delta = -1 the ground space includes superpositions of |00\cdots 0\rangle and |11\cdots 1\rangle. It is also known that the ground state lies in the zero total magnetization subspace for \Delta > -1. In the critical phase (-1 < \Delta \leq 1), the Hamiltonian is gapless in the thermodynamic limit and the correlation length diverges. The phase transition at \Delta = -1 is first order and an infinite order Kosterlitz-Thouless transition takes place at \Delta = 1.
We again look at the spectral properties of the quantum Fisher matrix in this model, for representative values of \Delta. For the values whose ground state lies in the zero magnetization sector, we have restricted the wave function to that subspace by applying a swap update rule in the MCMC. Fig. 4(a) shows the convergence of the sampled energy over SR iterations. We see that SR successfully finds the ground states in all cases, but the initial drift starts later in the XXX case (\Delta = 1). The spectrum of the quantum Fisher matrix shown in Fig. 4(b) also reflects the slow initial learning in the XXX case: the spectrum begins to change only slowly compared to the other cases. We suspect that the enhanced symmetry of the Hamiltonian at this point is related to the slow learning in the initial stage. Comparing to the other values of \Delta, the quantum Fisher matrix does not differ much, as it only depends on the parameters of the RBM, but the gradient of the energy is much smaller at \Delta = 1 than in the other cases.
We plot the converged spectra in Fig. 4(c). From these, we can extract some information about the converged ground state at \Delta = -1. As the first order phase transition occurs at this point, the system has two different types of ground states: one that is a superposition of |00\cdots 0\rangle and |11\cdots 1\rangle, continuing from \Delta < -1, and another living in the zero magnetization subspace, continuing from \Delta > -1. As the converged spectrum is singular, we can expect that the ground state found in our simulation is ferromagnetic. Indeed, we calculated the squared total magnetization from Monte Carlo samples, and it is close to its maximal value, which means a large portion of the state is supported on |00\cdots 0\rangle and |11\cdots 1\rangle. For the other values of \Delta, we see broader converged spectra. We note that for one of them there is a small step even though the whole spectrum is dense; in comparison, an even smoother spectrum is obtained in the other case.
One should also ask about the behavior of the quantum Fisher matrix in the antiferromagnetic phase. However, we found that the usual MCMC does not produce unbiased samples in the antiferromagnetic phase, so plain SR does not converge to the true ground state (even though this problem can be solved by applying a local basis transformation that makes the Hamiltonian stoquastic, we did not use such a technique, as we want to see how the RBM encodes a quantum state without pre-manipulation of the problem). As a consequence, we instead performed the optimization using the exactly constructed quantum Fisher matrix, computed from the probability distribution |\psi(\sigma)|^2 for small enough systems. The result obtained from the exact simulation is shown in Appendix E. One observation is that we see a dense converged spectrum in the antiferromagnetic phase despite the system being gapped. Thus the presence of a gap alone does not imply a singular spectrum of the quantum Fisher matrix.
IV Implications for optimization
In this section, we use the insight gained about the structure of the quantum Fisher matrix to construct a new optimization method for quantum spin systems. The new method allows for significant savings in evaluation time by avoiding the inverse linear problem of stochastic reconfiguration. Precisely, in each step of SR, we need to solve the linear equation

S δθ = f    (26)

for the quantum Fisher matrix S and the force vector f. Even when the matrix is well-conditioned, the complexity of solving this equation scales polynomially in d, the dimension of the matrix, i.e. the number of parameters. As d itself scales like N² for a fixed density of hidden units, the time cost grows as a high power of the system size N. This is one of the main reasons why second-order methods, including natural gradient descent, are not widely used in classical large-scale deep learning applications.
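As a rough illustration of this cost difference, the following sketch contrasts the dense solve required by SR with the cheap diagonal rescaling that motivates the method below. The matrix here is a random well-conditioned stand-in, not the actual quantum Fisher matrix, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200  # number of variational parameters (toy size)

# A well-conditioned Hermitian stand-in for the quantum Fisher matrix
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
S = A @ A.conj().T / d + 0.01 * np.eye(d)
f = rng.normal(size=d) + 1j * rng.normal(size=d)  # force (energy-gradient) vector

# Exact natural-gradient direction: dense solve, cubic in d
delta_exact = np.linalg.solve(S, f)

# Diagonal (RMSProp-like) approximation: O(d) elementwise rescaling
delta_diag = f / np.diag(S).real

assert np.allclose(S @ delta_exact, f)  # the dense solve is exact
```

Only the first direction solves Eq. (26) exactly; the second merely rescales the gradient by the diagonal of S, which is why its quality depends on how much of S lives on its diagonal.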
Our new optimization method can be seen as an extension of RMSProp Hinton (2012). The method provides a significant advantage in computation time, as it does not involve solving a large system of linear equations. However, the method is not always a good approximation of the natural gradient; its quality depends decisively on the structure of the quantum Fisher matrix.

Before describing our method, we briefly review RMSProp for classical machine learning and how it is related to the Fisher information metric from the viewpoint of Ref. Martens (2014). For convenience, the original RMSProp is described in Appendix F. This algorithm improves naive stochastic gradient descent by keeping a running average of the squared gradients and using it to rescale the instantaneous gradient when updating the weights. An observation in Ref. Martens (2014) is that this running average is a diagonal approximation of the uncentered covariance matrix of the gradients when the learning is in a steady state. When the function we want to optimize is the log-likelihood (which is typical in classical machine learning), it recovers the diagonal part of the Fisher information metric at stationarity. The additional square root and prefactor in the last step are added to correct for “poor conditioning” Martens (2010). This provides a plausible argument for why such a simple algorithm works incredibly well. One can also argue that other popular and efficient optimizers such as Adagrad, Adadelta, and Adam similarly use a type of diagonal approximation of the Fisher information metric Martens (2014).

We now describe our variant of RMSProp applied to the ground-state optimization problem. Using the same principle as above, one may keep a running average of the squared log-derivatives of the wave function to estimate the diagonal part of the uncentered quantum Fisher matrix. The details of the algorithm are outlined in Alg. 1. A distinguishing property of this algorithm compared to the original RMSProp is that it uses different vectors for the descent direction and for estimating the curvature: the running average is computed from the log-derivatives of the wave function, but the gradient of the energy is used for the update in the last step. The algorithm suggested here is also different from the method used in Refs. Kessler et al. (2019); Yang et al. (2019), which feed the energy gradient directly into classical optimizers.
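One step of such an update can be sketched as follows. This is our own minimal reading of the scheme just described, not a transcription of Alg. 1: the function and variable names are ours, and the samples of the log-derivatives O_k(σ) and local energies are assumed to come from an external MCMC routine.

```python
import numpy as np

def rmsprop_qfm_step(theta, o_samples, e_loc, v, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp-like update that rescales the energy gradient by a running
    estimate of the diagonal of the uncentered quantum Fisher matrix.

    o_samples: (n_samples, n_params) log-derivatives O_k(sigma) of the wave function
    e_loc:     (n_samples,) local energies
    """
    o_mean = o_samples.mean(axis=0)
    # Stochastic energy gradient f_k = 2 Re(<O_k* E_loc> - <O_k*><E_loc>)
    f = 2.0 * np.real((o_samples.conj() * e_loc[:, None]).mean(axis=0)
                      - o_mean.conj() * e_loc.mean())
    # Diagonal of the *uncentered* quantum Fisher matrix: <|O_k|^2>
    diag_s = np.mean(np.abs(o_samples) ** 2, axis=0)
    v = beta * v + (1.0 - beta) * diag_s         # running curvature estimate
    theta = theta - lr * f / (np.sqrt(v) + eps)  # rescaled gradient update
    return theta, v
```

Note that, as in the text, the curvature estimate comes from the log-derivatives while the descent direction comes from the energy gradient; this is what distinguishes the scheme from plain RMSProp applied to the energy.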
We have tested the proposed version of RMSProp using different learning rates for the TFI. The results for the ferromagnetic phase and the critical case are shown in Fig. 5. For small transverse fields, we see that RMSProp easily gets stuck in local minima, unlike SR. For intermediate fields, the figure shows that the energy converges to that of the ground state for some learning rates. However, such convergence is probabilistic: we ran the same simulation several times and found that, for any learning rate, some instances converge to the ground state whereas others get stuck in local minima. In contrast, SR works properly for a wide range of hyperparameters, for which the energy converges to the ground state regardless of the choice of the learning rate.
For larger transverse fields, the proposed RMSProp shows better convergence behavior for most learning rates, but it still shows stepwise dynamics. In the critical case, the learning curves of RMSProp are smooth and insensitive to the choice of the learning rate, suggesting that the system no longer gets stuck in problematic local minima.
Our results suggest that preserving the singular nature of the quantum Fisher matrix is essential for ensuring convergence to the ground-state energy. Indeed, for the converged quantum Fisher matrices studied in Appendix C, the diagonal is strongly rank deficient deep in the ferromagnetic phase but (numerically) full rank for the other values of the field. In contrast, the true ranks of the quantum Fisher matrices, measured by counting the number of eigenvalues larger than a fixed threshold, remain small in all these cases.
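The rank measurement used here can be sketched as follows; the threshold and the toy matrix are illustrative only, and the function name is ours.

```python
import numpy as np

def effective_rank(s_matrix, tol=1e-10):
    """Count eigenvalues of a Hermitian (quantum Fisher) matrix above tol."""
    eigvals = np.linalg.eigvalsh(s_matrix)
    return int(np.sum(eigvals > tol))

# A rank-2 toy example: a sum of two projectors embedded in a 5x5 matrix
v1 = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
v2 = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
S = np.outer(v1, v1) + 0.5 * np.outer(v2, v2)
assert effective_rank(S) == 2
```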
We note that even though the rank provides a plausible argument for the behavior of the learning curves, it does not explain the converged energies: for some values of the field the converged energies are slightly larger than the ground-state energies. Moreover, the convergence behavior in the paramagnetic phase is more complicated and cannot be explained from the quantum Fisher matrix alone. A partial reason is that the path taken by RMSProp deviates from that of SR in the initial stage of learning (see Appendix F). Detailed investigations in this regime remain for future work.
V Conclusion
We have initiated a detailed study of the quantum information geometry of learning ground states of spin chains in the artificial neural network framework. We have focused on complex restricted Boltzmann states and the stochastic reconfiguration method, which implements a quantum version of Amari's natural gradient update scheme. Our main result is that the eigenvalues and eigenvectors of the quantum Fisher matrix reflect both the learning dynamics, which is unsurprising, as well as the intrinsic static phase information of the model under study – which is rather surprising. In particular, we found that in the entire noncritical ferromagnetic phase of a number of models, the spectrum of the quantum Fisher matrix has reduced rank. The matrix becomes highly singular in regions of the phase that are close to product states. In critical phases, the spectrum becomes smooth, with more and more eigenvectors contributing to the information geometry landscape.
We have identified a universal behavior of the leading eigenvectors of the quantum Fisher matrix: they all convey little entanglement, as measured by the entanglement entropy between the visible and hidden layers. This, in combination with the insight that critical models have smooth spectra, suggests that correlations in the complex RBM Ansatz are preferentially represented in the bulk of the information geometry space. Our interpretation of this key dynamical feature of RBM learning is that the model preferentially chooses stable representations, where the entropy of the landscape dominates over the energy. A similar phenomenon in classical supervised machine learning is frequently observed in discussions of ‘generalization’. Finally, we explored strategies for diagonal approximations of the quantum Fisher matrix, and found that their success crucially depends on the phase of the model under study. We therefore do not expect any diagonal approximation of the quantum Fisher matrix to be effective in general.
VI Acknowledgements
We thank S. Trebst and D. Gross for helpful discussions. We acknowledge support from the DFG (CRC TR 183), and the ML4Q excellence cluster. Source codes for the current manuscript can be found in CYP’s github repository CYP . The numerical simulations were performed on the CHEOPS cluster at RRZK Cologne.
References
 Biamonte et al. (2017) Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd, “Quantum machine learning,” Nature 549, 195 (2017).
 Peruzzo et al. (2014) Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J Love, Alán Aspuru-Guzik, and Jeremy L O’Brien, “A variational eigenvalue solver on a photonic quantum processor,” Nature Communications 5, 4213 (2014).
 McClean et al. (2016) Jarrod R McClean, Jonathan Romero, Ryan Babbush, and Alán Aspuru-Guzik, “The theory of variational hybrid quantum-classical algorithms,” New Journal of Physics 18, 023023 (2016).
 Kandala et al. (2017) Abhinav Kandala, Antonio Mezzacapo, Kristan Temme, Maika Takita, Markus Brink, Jerry M Chow, and Jay M Gambetta, “Hardwareefficient variational quantum eigensolver for small molecules and quantum magnets,” Nature 549, 242 (2017).
 Carleo and Troyer (2017) Giuseppe Carleo and Matthias Troyer, “Solving the quantum manybody problem with artificial neural networks,” Science 355, 602–606 (2017).
 Choo et al. (2019) Kenny Choo, Titus Neupert, and Giuseppe Carleo, “Two-dimensional frustrated J1-J2 model studied with neural network quantum states,” Physical Review B 100, 125124 (2019).
 Carleo et al. (2012) Giuseppe Carleo, Federico Becca, Marco Schiró, and Michele Fabrizio, “Localization and glassy dynamics of manybody quantum systems,” Scientific reports 2, 243 (2012).
 Van Nieuwenburg et al. (2017) Evert PL Van Nieuwenburg, Ye-Hua Liu, and Sebastian D Huber, “Learning phase transitions by confusion,” Nature Physics 13, 435 (2017).
 Carrasquilla and Melko (2017) Juan Carrasquilla and Roger G Melko, “Machine learning phases of matter,” Nature Physics 13, 431 (2017).
 Broecker et al. (2017) Peter Broecker, Juan Carrasquilla, Roger G Melko, and Simon Trebst, “Machine learning quantum phases of matter beyond the fermion sign problem,” Scientific reports 7, 8823 (2017).
 Sweke et al. (2018) Ryan Sweke, Markus S Kesselring, Evert PL van Nieuwenburg, and Jens Eisert, “Reinforcement learning decoders for fault-tolerant quantum computation,” arXiv preprint arXiv:1810.07207 (2018).

 Andreasson et al. (2019) Philip Andreasson, Joel Johansson, Simon Liljestrand, and Mats Granath, “Quantum error correction for the toric code using deep reinforcement learning,” Quantum 3, 183 (2019).
 Chen et al. (2018) Jing Chen, Song Cheng, Haidong Xie, Lei Wang, and Tao Xiang, “Equivalence of restricted boltzmann machines and tensor network states,” Physical Review B 97, 085104 (2018).
 Collura et al. (2019) Mario Collura, Luca Dell’Anna, Timo Felser, and Simone Montangero, “On the descriptive power of neural networks as constrained tensor networks with exponentially large bond dimension,” arXiv preprint arXiv:1905.11351 (2019).
 Cencov (2000) Nikolai Nikolaevich Cencov, Statistical decision rules and optimal inference, 53 (American Mathematical Soc., 2000).
 Sorella (2001) Sandro Sorella, “Generalized lanczos algorithm for variational quantum monte carlo,” Physical Review B 64, 024512 (2001).
 Amari (1998) Shun-Ichi Amari, “Natural gradient works efficiently in learning,” Neural Computation 10, 251–276 (1998).
 Dauphin et al. (2014) Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio, “Identifying and attacking the saddle point problem in highdimensional nonconvex optimization,” in Advances in neural information processing systems (2014) pp. 2933–2941.
 Waterfall et al. (2006) Joshua J Waterfall, Fergal P Casey, Ryan N Gutenkunst, Kevin S Brown, Christopher R Myers, Piet W Brouwer, Veit Elser, and James P Sethna, “Sloppymodel universality class and the vandermonde matrix,” Physical review letters 97, 150601 (2006).
 Sagun et al. (2016) Levent Sagun, Leon Bottou, and Yann LeCun, “Eigenvalues of the hessian in deep learning: Singularity and beyond,” arXiv preprint arXiv:1611.07476 (2016).
 Papyan (2018) Vardan Papyan, “The full spectrum of deep net hessians at scale: Dynamics with sample size,” arXiv preprint arXiv:1811.07062 (2018).
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and J Schmidhuber, “Flat minima,” Neural Computation 9, 1–42 (1997).
 (23) The duration of the first phase appears to depend on the hyperparameters (learning rate, regularization), but not on the system size.
 Grosse and Salakhudinov (2015) Roger Grosse and Ruslan Salakhudinov, “Scaling up natural gradient by sparsely factorizing the inverse fisher matrix,” in International Conference on Machine Learning (2015) pp. 2304–2313.
 Machta et al. (2013) Benjamin B Machta, Ricky Chachra, Mark K Transtrum, and James P Sethna, “Parameter space compression underlies emergent theories and predictive models,” Science 342, 604–607 (2013).
 Choo et al. (2018) Kenny Choo, Giuseppe Carleo, Nicolas Regnault, and Titus Neupert, “Symmetries and manybody excitations with neuralnetwork quantum states,” Phys. Rev. Lett. 121, 167204 (2018).
 Baxter (1982) Rodney J Baxter, Exactly solved models in statistical mechanics (Academic Press, 1982).
 Gao and Duan (2017) Xun Gao and LuMing Duan, “Efficient representation of quantum manybody states with deep neural networks,” Nature communications 8, 662 (2017).
 Wolff (1989) Ulli Wolff, “Collective monte carlo updating for spin systems,” Physical Review Letters 62, 361 (1989).
 Hinton (2012) Geoffrey Hinton, “Neural networks for machine learning lecture notes,” http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf (2012).
 Martens (2014) James Martens, “New insights and perspectives on the natural gradient method,” arXiv preprint arXiv:1412.1193 (2014).
 Martens (2010) James Martens, “Deep learning via hessianfree optimization,” Proceedings of the 27th International Conference on International Conference on Machine Learning , 735–742 (2010).
 Kessler et al. (2019) Jan Kessler, Francesco Calcavecchia, and Thomas D Kühne, “Artificial neural networks as trial wave functions for quantum monte carlo,” arXiv preprint arXiv:1904.10251 (2019).
 Yang et al. (2019) Li Yang, Zhaoqi Leng, Guangyuan Yu, Ankit Patel, WenJun Hu, and Han Pu, “Deep learningenhanced variational monte carlo method for quantum manybody physics,” arXiv preprint arXiv:1905.10730 (2019).
 (36) https://github.com/chaeyeunpark.
 Sehayek et al. (2019) Dan Sehayek, Anna Golubeva, Michael Albergo, Bohdan Kulchytskyy, Giacomo Torlai, and Roger G Melko, “The learnability scaling of quantum states: restricted boltzmann machines,” arXiv preprint arXiv:1908.07532 (2019).
Appendix A Stochastic reconfiguration
For the reader's convenience, we derive the stochastic reconfiguration method of Sorella Sorella (2001). The main idea of stochastic reconfiguration (SR) is to modify the parameters of a trial wave function in such a way that it approaches the ground state along the path dictated by the projection (𝟙 − εH)|ψ⟩, where ε is chosen sufficiently small.
Let |ψ_θ⟩ be a state in our Ansatz class, with θ its vector of parameters. From now on, we will suppress the parameters θ. Then, for sufficiently small ε, we can write

(𝟙 − εH)|ψ⟩ = α₀|ψ⟩ + Σ_k α_k 𝒪_k|ψ⟩ + |ψ_⊥⟩,    (27)

where α₀ and α_k are coefficients, and |ψ_⊥⟩ is a state in the orthogonal subspace. Note the identity ∂_{θ_k}|ψ⟩ = 𝒪_k|ψ⟩, where the operators 𝒪_k are defined as:

𝒪_k = Σ_σ ∂_{θ_k} log ψ(σ) |σ⟩⟨σ|,    (28)

where {|σ⟩} is the computational basis.
We can now obtain a system of linear equations for the coefficients by multiplying Eqn. (27) by ⟨ψ| and by ⟨ψ|𝒪_k† to get

⟨ψ|(𝟙 − εH)|ψ⟩ = α₀ + Σ_k α_k ⟨𝒪_k⟩,    (29)
⟨𝒪_k†(𝟙 − εH)⟩ = α₀⟨𝒪_k†⟩ + Σ_{k'} α_{k'} ⟨𝒪_k†𝒪_{k'}⟩.    (30)

The averages are taken in the state |ψ⟩. We can then solve for the α_k to get

α = −ε S⁻¹ f,    (31)

where the matrix S is given by

S_{kk'} = ⟨𝒪_k†𝒪_{k'}⟩ − ⟨𝒪_k†⟩⟨𝒪_{k'}⟩,    (32)

and the vector f is given by

f_k = ⟨𝒪_k†H⟩ − ⟨𝒪_k†⟩⟨H⟩.    (33)

We can now identify the coefficients α_k as the update coefficients for the variables θ_k, up to an overall constant ε, which can be interpreted as the learning rate. The SR update scheme can then be summarized as:

θ^{(t+1)} = θ^{(t)} − γ (S + λ𝟙)⁻¹ f,    (34)

for some learning rate γ. Here, λ is a regularization constant that is typically small.
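Eqs. (32)-(34) translate directly into code. The sketch below assembles S and f from Monte Carlo samples and performs one regularized update; the function and array names are ours, and in practice the log-derivatives 𝒪_k(σ) and local energies would come from the sampler.

```python
import numpy as np

def sr_update(theta, o_samples, e_loc, gamma=0.02, lam=1e-3):
    """One stochastic-reconfiguration step, Eqs. (32)-(34).

    o_samples: (n_samples, n_params) log-derivatives O_k(sigma)
    e_loc:     (n_samples,) local energies
    """
    n = len(e_loc)
    o_mean = o_samples.mean(axis=0)
    # Covariance matrix S_kk' = <O_k* O_k'> - <O_k*><O_k'>  (Eq. 32)
    S = o_samples.conj().T @ o_samples / n - np.outer(o_mean.conj(), o_mean)
    # Force vector f_k = <O_k* E_loc> - <O_k*><E_loc>  (Eq. 33)
    f = (o_samples.conj() * e_loc[:, None]).mean(axis=0) - o_mean.conj() * e_loc.mean()
    # Regularized natural-gradient step  (Eq. 34)
    delta = np.linalg.solve(S + lam * np.eye(len(theta)), f)
    return theta - gamma * delta
```

The regularization λ𝟙 is what keeps the solve stable when S is (nearly) singular, which, as discussed in the main text, is the generic situation in noncritical phases.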
Appendix B Quantum Fisher matrix of a random RBM
We provide an explanation of the stepwise structure of the spectrum of the quantum Fisher matrix upon small random initialization of the weights. The quantum Fisher matrix breaks up into three main sectors, corresponding to the visible biases a, the hidden biases b, and the weights W. As in the main text, we use N and M to indicate the number of visible and hidden units, respectively. In our simulations, the weights are initialized to be Gaussian distributed with a small average magnitude. We therefore make the following assumption about the initial state: the classical probability distribution associated with the initial quantum state is close to uniform, and in particular is separable. This implies that each spin has zero expectation value at initialization, ⟨σ_i⟩ ≈ 0 for all i, and that ⟨σ_iσ_j⟩ ≈ δ_{ij} for all i, j.

As the entries of the visible-bias block are:
S_{a_i a_j} = ⟨σ_iσ_j⟩ − ⟨σ_i⟩⟨σ_j⟩ ≈ δ_{ij},    (35)

we get the identity matrix for the visible-bias part. The covariance between the visible and hidden units involves the term ⟨σ_i tanh θ_j⟩. Recall that the arguments of the hyperbolic tangents are

θ_j = b_j + Σ_i W_{ij} σ_i,    (36)
where b_j are the hidden biases and W_{ij} are the weights connecting the hidden and visible units. Under the assumption that all parameters are small, we approximate tanh θ_j ≈ θ_j. Then

⟨σ_i tanh θ_j⟩ ≈ ⟨σ_i (b_j + Σ_k W_{kj} σ_k)⟩ = W_{ij}.    (37)
Likewise, we can obtain the full unary part (visible and hidden biases) of the matrix as

S_unary ≈ [[𝟙_N, W], [W†, W†W]].    (38)

We can easily see this has rank N, as the first N rows generate the remaining rows. This explains the first N eigenvalues, which are of order 1.
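This rank argument is easy to check numerically. The snippet below builds the approximate unary block of Eq. (38) for small random weights (taken real for simplicity, so the conjugate transpose reduces to the transpose) and verifies its rank; the sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 9                          # visible and hidden units (toy sizes)
W = 0.01 * rng.normal(size=(N, M))   # small random real weights

# Approximate unary block of the quantum Fisher matrix, Eq. (38):
# the hidden rows [W^T, W^T W] are W^T times the visible rows [I_N, W]
S_unary = np.block([[np.eye(N), W], [W.T, W.T @ W]])

# Hence the rank is N, not N + M
assert np.linalg.matrix_rank(S_unary) == N
```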
Next, the weight part of the quantum Fisher matrix is given by

S_{(ij),(kl)} = ⟨(σ_i tanh θ_j)† σ_k tanh θ_l⟩    (39)
        − ⟨(σ_i tanh θ_j)†⟩⟨σ_k tanh θ_l⟩,    (40)

where i, k label the visible units and j, l label the hidden units. Using the expansion
tanh(x) = x − x³/3 + O(x⁵),    (41)
we have
(42) 
where W is the matrix of weights and b is the vector form of the hidden biases. Using the assumption of small initial weights, we have
(43) 
Then the matrix is approximately
(44) 
where F is the swap operator. The rank of the resulting matrix is given by the dimension of the corresponding symmetric subspace. Moreover, the associated operator acts as a projector that preserves the symmetric states except the ‘copied’ states with repeated indices.
When the hidden biases vanish, the whole covariance matrix is block diagonal and the rank of the weight part (Eq. (42)) is reduced accordingly. This explains the small subleading eigenvalues.
However, the block-diagonal assumption breaks down when we have nonzero biases in the hidden layer (b ≠ 0), as off-diagonal blocks between the unary and weight parts appear. Additional terms also enter the weight part itself. Still, it is not difficult to see that this does not change the overall rank. A precise calculation gives

(45)

up to third-order corrections. It is simple to see that the first rows still generate the following rows. Moreover, applying the same linear combinations to the first rows produces the additional terms in the last rows, so the rank of the contribution from the weight part also does not change. Thus we have exactly the same rank even when we turn on the hidden biases b.
Appendix C Further properties of the quantum Fisher matrix
In this section, we investigate further properties of the quantum Fisher matrix. We use the same numerical data as in the main text: the TFI with the system size used there.
C.1 Converged weights
Converged parameters of neural networks are often claimed to reveal features of the data or system under study Carleo and Troyer (2017); Sehayek et al. (2019). We compare the converged weights and the quantum Fisher matrix for different values of the transverse field in Fig. 6. We find that, in contrast with the spectral information of the quantum Fisher matrix, it is difficult to infer any information from the converged weights of the network. For example, the converged weights for two different phases are not sensibly different, whereas the quantum Fisher matrices reveal essential features of the phase of the system.
This brings to light one of the key subtleties of RBM Ansätze: the extreme redundancy of the representation. Let us illustrate this fact by constructing three completely different solutions of the RBM parameters that (approximately) represent the same quantum state. As a first solution, consider the one obtained from our numerical simulation, Fig. 6(a). This solution is fully complex, i.e. the real and imaginary parts of the weights are both nonzero. On the other hand, a real solution can be found from the coherent Gibbs state of a classical Ising model, as discussed in Appendix D; this construction works for a classical Ising model defined on any graph that does not have an isolated vertex, and the parameters obtained from it are real (see Appendix D for details). Finally, it is also possible to represent this state using only purely imaginary parameters, by setting the biases to zero and choosing the weights as
(46) 
It is clear from these examples that inferring information about quantum states solely from the activation parameters of the RBM is very ambiguous.
C.2 Nonzero elements of the Fisher information matrix
We investigate the rank of the quantum Fisher matrix more closely. Let us first focus on the ferromagnetic phase. In the main text, we have shown that the rank of the quantum Fisher matrix increases with the transverse field. A question we are interested in is how the nonzero elements are distributed between the unary and weight parts of the matrix. To answer this question, we use the quantum Fisher matrix itself after convergence, plotted in Fig. 6(b). Deep in the ferromagnetic phase, we see that the Fisher information matrix only has nonzero elements in the unary part. In contrast, the weight part of the matrix shows nonzero elements (especially on the diagonal) for larger fields. To see this clearly, we have counted the number of diagonal elements of the quantum Fisher matrix that are larger than a small threshold: only a few such diagonal elements exist deep in the ferromagnetic phase, but many more appear for all larger fields. As the rank of the full matrix remains small even for larger fields, the nonzero elements in the weight part imply that the eigenvectors with dominant eigenvalues have a significant weight-part component. In addition, this provides an argument for why the RMSProp variant studied in Sec. IV works badly deep in the ferromagnetic phase.
Next, we consider the paramagnetic phase. In the main text, we have shown that the spectrum of the Fisher information matrix in this phase shows a step. The overall shape of the spectrum remains similar for smaller fields, even though the location of the step can shift slightly. Compared to the randomly initialized RBM, we see larger diagonal elements in the weight part. As Fig. 2 shows that the intermediate eigenvalues are much larger for the converged Fisher information matrix than for the random RBM, we expect the weight part of the matrix to contribute to these eigenvalues. To test this, we have diagonalized only the weight part of the quantum Fisher matrix, where we could indeed observe a step. Thus, even though the whole spectrum does not show a clear step at the corresponding eigenvalue, we may still consider the leading eigenvalues to come from the unary part and the following ones from the weight part. We also found that all diagonal elements of the quantum Fisher matrix are larger than the threshold in this phase, so the diagonal approximation of the quantum Fisher matrix is full rank.
C.3 System-size dependence of the spectral profile
When we use the same parameters of the Hamiltonian, we observe that the spectrum of the converged Fisher information matrix behaves almost the same for varying system size N. In Fig. 7, we show the spectra of the converged quantum Fisher matrix for different system sizes, using the TFI at several values of the transverse field. We clearly see that the eigenvalue distributions at the same field vary only little with the system size N. Still, it is not easy to make an exact correspondence between the results from different N, as the order of the quantum Fisher matrix, set by the number of parameters, is not a simple monomial in N; thus there is no single constant scale factor we can use to rescale the results. Nevertheless, this suggests that the spectrum of the quantum Fisher matrix can be used as a faithful diagnostic tool on small systems to infer qualitative behavior on larger systems.
Appendix D Coherent Gibbs states for classical Ising models
We consider a classical Ising model defined on a graph G = (V, E), where V is the set of vertices and E is the set of edges. We assign a binary value σ_i = ±1 to each vertex i and an interaction strength J_{ij} to each edge (i, j). The Hamiltonian of this model is given by

H_cl = −Σ_{(i,j)∈E} J_{ij} σ_i σ_j.    (47)
Our objective is then to find parameters of the RBM that describe the coherent Gibbs state for the given H_cl, i.e. to solve the equations

e^{Σ_i a_i σ_i} Π_j 2cosh(b_j + Σ_i W_{ij}σ_i) = C e^{−βH_cl(σ)/2}    (48)

for all σ. Here, β is the inverse temperature, and C is a constant that can be freely chosen as our RBM does not use a specific normalization.
As H_cl is symmetric under the overall spin flip (σ → −σ), we first consider a symmetric RBM that has zero biases, i.e. a = b = 0. Then we can simplify the equation to

Π_j 2cosh(Σ_i W_{ij}σ_i) = C e^{−βH_cl(σ)/2}.    (49)
We can find such a W easily by assigning one hidden unit to each edge and equating each factor on the left-hand side, given by a column of W, to the factor on the right-hand side associated with that edge. In other words, we solve

2cosh(Σ_i W_{ie}σ_i) = c_e e^{βJ_{ij}σ_iσ_j/2}    (50)
for all σ, where c_e is a constant assigned to each edge e such that Π_e c_e = C. Setting W_{ke} = 0 whenever vertex k does not belong to edge e = (i, j), we then need to solve the coupled equations

2cosh(W_{ie} + W_{je}) = c_e e^{βJ_{ij}/2},    (51)
2cosh(W_{ie} − W_{je}) = c_e e^{−βJ_{ij}/2}.    (52)
These equations can be solved for any J_{ij}, as W is allowed to be a complex matrix.
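For a single edge, Eqs. (51)-(52) can be solved in closed form with the complex arccosh: writing A = W_ie + W_je and B = W_ie − W_je reduces them to cosh A = (c_e/2)e^{βJ/2} and cosh B = (c_e/2)e^{−βJ/2}. The sketch below implements this; the choice c_e = 2 and the function name are ours.

```python
import numpy as np

def edge_weights(beta, J, c=2.0):
    """Solve Eqs. (51)-(52) for one edge: 2cosh(W_i +/- W_j) = c e^{+/- beta J/2}."""
    # Complex arccosh: the argument for B drops below 1 for J > 0,
    # which is why W is a complex matrix in general.
    A = np.arccosh(np.asarray(c * np.exp(+beta * J / 2) / 2, dtype=complex))
    B = np.arccosh(np.asarray(c * np.exp(-beta * J / 2) / 2, dtype=complex))
    return (A + B) / 2, (A - B) / 2  # W_i, W_j

# Verify Eq. (50) on all four spin configurations of one edge
beta, J = 0.7, 1.0
Wi, Wj = edge_weights(beta, J)
for si in (+1, -1):
    for sj in (+1, -1):
        lhs = 2 * np.cosh(Wi * si + Wj * sj)
        rhs = 2.0 * np.exp(beta * J * si * sj / 2)  # c_e = 2
        assert np.isclose(lhs, rhs)
```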
For the two-dimensional Ising model we consider in the main text, J_{ij} takes the same ferromagnetic value on all edges connecting neighboring vertices of the 2D lattice. In this case, we can easily obtain a real solution W.
Appendix E The XXZ model using exact wave functions
In the main text, we studied the Heisenberg XXZ model using variational quantum Monte Carlo. There, observables such as the quantum Fisher matrix and the energy gradient are calculated from the samples obtained from MCMC. In this section, we study the same system using exactly constructed wave functions instead of MCMC. A modified step of each SR iteration proceeds as follows. First, we calculate all components of the wave function in the computational basis. Then we obtain the normalization factor by calculating the exponential sum over all basis states. Using this result, the energy gradient and the Fisher information matrix are calculated by evaluating Eqs. (5,6) exactly, and the parameters are updated accordingly. As we do not sample from the distribution, the algorithm is no longer stochastic; we therefore call this method exact reconfiguration (ER) instead of SR. We note that ER is computationally extremely expensive, since we need to calculate several exponential sums in each iteration.
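The exact (ER) evaluation of the quantum Fisher matrix from a fully enumerated wave function can be sketched as follows. The array names are ours, and in practice the log-derivatives would be computed analytically from the RBM rather than passed in.

```python
import numpy as np

def exact_qfm(log_psi_grad, probs):
    """Exact (un-sampled) quantum Fisher matrix, Eq. (32), from a fully
    enumerated wave function.

    log_psi_grad: (D, P) array of O_k(sigma) = d log psi(sigma) / d theta_k
                  over all D basis states
    probs:        (D,) normalized probabilities |psi(sigma)|^2 / sum_s |psi(s)|^2
    """
    o_mean = probs @ log_psi_grad                                     # <O_k>
    second = log_psi_grad.conj().T @ (probs[:, None] * log_psi_grad)  # <O_k* O_k'>
    return second - np.outer(o_mean.conj(), o_mean)

# Tiny check: two basis states with O = +1 and O = -1 at equal probability
S = exact_qfm(np.array([[1.0 + 0j], [-1.0 + 0j]]), np.array([0.5, 0.5]))
assert np.isclose(S[0, 0], 1.0)  # variance of O over the two configurations
```

Since probs sums over all 2^N configurations, the exponential cost of ER mentioned above is explicit in the shape of these arrays.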
Using ER, we have simulated the XXZ model with a system size small enough to be tractable on current CPUs. The result is shown in Fig. 8. There are two noteworthy features. First, the converged spectrum at the first-order transition point is broader than the corresponding one in Fig. 4 in the main text. We conjecture that this is related to the fact that the ground state found using ER has a larger component outside the fully polarized subspace compared to the SR case; indeed, the measured polarization is slightly smaller than what is found in the SR case in the main text. Second, the converged quantum Fisher matrix shows a smooth spectrum in the antiferromagnetic case, even though the system has a gapped antiferromagnetic ground state. This implies that a smooth spectrum of the converged quantum Fisher matrix is not sufficient to infer criticality.
Appendix F RMSProp in the paramagnetic phase
Here we study the RMSProp variant introduced in Sec. IV in the paramagnetic phase of the TFI. The learning curves for different learning rates are shown in Fig. 9. We can see that the learning curves are more complex than those for the ferromagnetic and critical cases. Specifically, we make three distinct observations. First, there is a spike in the rescaled energy in the initial stage of learning, and the size of the spike grows with the learning rate. This means that the initial direction the optimizer selects differs from the optimal direction. Second, the properties of the quantum Fisher matrix after convergence are not helpful for understanding the learning: in Appendix C, we have shown that the properties of the quantum Fisher matrix do not change much within the paramagnetic phase, yet there does not seem to be a common property of the learning curves for different learning rates. Third, the converged energy can be as low as that of the ground state. This is interesting, as it indicates that the optimizer sometimes finds the proper solution even though the learning dynamics are poor.
From these observations, we suspect that RMSProp takes a different learning pathway than SR in the paramagnetic phase. To better understand the applicability and the details of the learning dynamics of the algorithm, more detailed investigations, such as tracking the path of optimization, are required. We leave such a detailed investigation of this optimizer and the comparison to other optimizers for future work.