# A statistical physics approach to learning curves for the Inverse Ising problem

Using methods of statistical physics, we analyse the error of learning couplings in large Ising models from independent data (the inverse Ising problem). We concentrate on learning based on local cost functions, such as the pseudo-likelihood method for which the couplings are inferred independently for each spin. Assuming that the data are generated from a true Ising model, we compute the reconstruction error of the couplings using a combination of the replica method with the cavity approach for densely connected systems. We show that an explicit estimator based on a quadratic cost function achieves minimal reconstruction error, but requires the length of the true coupling vector as prior knowledge. A simple mean field estimator of the couplings which does not need such knowledge is asymptotically optimal, i.e. when the number of observations is much large than the number of spins. Comparison of the theory with numerical simulations shows excellent agreement for data generated from two models with random couplings in the high temperature region: a model with independent couplings (Sherrington-Kirkpatrick model), and a model where the matrix of couplings has a Wishart distribution.

## Authors

• 2 publications
• 30 publications
09/03/2021

### Variational Bayes algorithm and posterior consistency of Ising model parameter estimation

Ising models originated in statistical physics and are widely used in mo...
06/15/2020

### Lower Bounds and a Near-Optimal Shrinkage Estimator for Least Squares using Random Projections

In this work, we consider the deterministic optimization using random pr...
03/10/2021

### Mean-field methods and algorithmic perspectives for high-dimensional machine learning

The main difficulty that arises in the analysis of most machine learning...
05/04/2020

### A Dynamical Mean-Field Theory for Learning in Restricted Boltzmann Machines

We define a message-passing algorithm for computing magnetizations in Re...
12/10/2020

### Spectral analysis of some random matrices based schemes for stable and robust nonparametric and functional regression estimators

In the first part of this work, we develop and study a random pseudo-inv...
11/03/2021

### Interpolation Estimator for Infinite Sets of Random Vectors

We propose an approach to the estimation of infinite sets of random vect...
05/24/2019

### Large-Scale Statistical Survey of Magnetopause Reconnection

The Magnetospheric Multiscale Mission (MMS) seeks to study the micro-phy...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In recent years, there has been an increasing interest in applying classical Ising models to data modelling. Applications range from modelling the dependencies of spikes recorded from ensembles of neurons

[1, 2] to protein structure determination [3] or gene expression analysis [4]. An important issue for such applications is the so–called inverse Ising problem, i.e. the statistical problem of fitting model parameters, external fields and couplings, to a set of data. Unfortunately, the exact computation of statistically efficient estimators such as the maximum likelihood estimator is computationally intractable for large systems. Hence, to overcome this problem researchers have suggested two possible solutions: the first one tries to approximate maximum likelihood estimators by computationally efficient procedures such as Monte Carlo sampling [5] or mean field types of analytical computations, see e.g. [6, 7, 8, 9]. A second line of research abandons the idea of maximising the likelihood function and replaces it by other cost functions which are easier to optimise. The most prominent example is the so–called pseudo–likelihood method [10, 11, 12, 13, 14]. In general it is not clear which of the two methods leads to better reconstruction of an Ising model. The quality of such estimators, e.g. measured by the mean squared reconstruction error of network parameters, will depend on the problem at hand.

As an alternative to analysing specific instances of problems, one may study the typical prediction performance of algorithms assuming that the true Ising parameters are drawn at random from a given ensemble distribution. For such random problem cases, one can apply powerful methods of statistical physics to compute (scaled) reconstruction errors exactly

in the limit where the number of spins grows to infinity and the number of data is increased proportionally to the number of spins. Such an approach has been applied extensively to statistical learning in large neural networks in the past

[15, 16, 17]. In a previous paper [18]

we have applied this method to the learning from dynamical data which are modelled by a kinetic Ising model with random independent couplings. This problem is theoretically simpler compared to the static, ’equilibrium’ Ising case discussed in the present paper. This is because the spin statistics of the dynamical model is fairly simple in the ’thermodynamic’ limit of a large network and gives rise to Gaussian distributed fields.

We will show in the following that a related approach is possible to data drawn independently from an equilibrium Ising model when we assume that couplings are learnt independently for each spin using local cost functions. Although the spin statistics is more complicated, computations are possible, when the so–called ’cavity’ method [19] is applicable to the true teacher Ising model.

The paper is organised as follows: Chapter 2 explains the inverse Ising problem and maximum likelihood estimation. Chapter 3 introduces simpler estimators which are derived from local cost functions. In Chapter 4, we review the statistical physics approach for analysing learning performances within the so–called teacher student scenario. In Chapter 5 we explain the cavity method for performing quenched averages over spin configurations. Chapter 6 presents explicit results of our method applied to the inverse Ising model with independent Gaussian couplings (SK–model). In Chapter 7 we study the learning performance of algorithms based on local quadratic cost functions and we compute the optimal local quadratic cost function. In Chapter 8 we show that an optimal quadratic function provides the best local estimator for the couplings. Chapter 9 introduces further applications of the cavity method which allow us to simplify order parameters corresponding to the true teacher couplings. As an example, we compute the reconstruction error for an Ising model with Wishart distributed, i.e. weakly dependent couplings. The method is also applied to re-derive a simple mean field approximation to the maximum likelihood estimator. Chapter 10 explains how the mean field estimator can be obtained from a local cost function and presents results for the reconstruction errors. Chapter 11 discusses the asymptotics of the reconstruction errors for large number of data and relates these results to expressions known from classical statistics. Chapter 12 contains comparisons of our results with those of simulations of the estimators and Chapter 13 presents a summary and an outlook.

## 2 Estimators for the Inverse Ising model

Let us consider a system of binary spin variables connected by pairwise interactions and subject to external local fields

. The probability distribution of the spin set is given by the Boltzmann equilibrium distribution

 P(σ|J,H)=Z−1\tiny{Ising}exp[β∑i

where is the partition function and is the inverse temperature. Given a set of independent observations drawn independently from (1), the inverse Ising problem consists of estimating the model parameters and from the data. A standard approach for parameter estimation is the maximum likelihood (ML) method, which has the properties of consistency and asymptotic efficiency [20]. Maximum likelihood can be formulated as the minimisation of the following cost function (negative log–likelihood)

 EML(J,H)=−M∑k=1lnP(σk|J,H) (2)

with respect to the matrix of couplings and the field vector . As is well known, the minimisation of (2

) is equivalent to a simple set of conditions for the first and second moments of the ensemble (

1) of spins: the parameters estimated by ML lead to the matching of the empirical (data averaged) magnetisations to the magnetisation given by the model (1). Likewise we have the matching of all empirical pair correlations of spins with their model counterparts. Despite the simplicity of this rule, the practical minimisation of (2) requires the computation of these spin moments for a given set of couplings and fields which is equivalent to averaging over spin configurations, which is intractable for larger . An approximation of such averages by Monte Carlo sampling is possible but requires sufficient time for equilibration. Alternatively, different approximation techniques have been developed to provide a good estimate of the parameters at a smaller computational cost, see e.g. [8, 9, 11, 12, 21, 22, 23, 24].

## 3 Local Learning

If we neglect the symmetry of coupling matrix, i.e. the equality , we can develop estimators which learn the ’ingoing’ coupling vectors for for each spin independently. It turns out that the corresponding (local) algorithms can often be performed in a much more efficient way compared to the ML method.

In the following we will concentrate on the estimation of the couplings only and set the external fields to zero. We will specialise on the couplings for spin and assuming that the typical couplings are variables with magnitude scaling like for large . We define a vector of rescaled couplings (weights) as

 W=(W1,…,WN−1)≐√N(J01,…,J0N−1). (3)

We will assume that an estimator for is defined by the minimisation of a cost function,

 E(W)=M∑k=1E(W;σk), (4)

which is additive in the observed data. An important and widely used case is the pseudo–likelihood approach, where the cost function

 E(W;σ)=−lnP(σ0|σ∖0,W)=−βσ0∑j≠0Wjσj√N+ln⎛⎝2coshβ∑j≠0Wjσj√N⎞⎠ (5)

is given by the negative log–probability of spin conditioned on all other spins . In contrast to the ML approach, the gradient of this function can be computed in an efficient way.

## 4 Teacher–student scenario and statistical physics analysis

We assume in the following that data are generated independently at random from a ’teacher’ network with coupling matrix . A local learning algorithm based on the minimisation of (4) produces ’student’ network couplings as estimators for the teacher network couplings . To measure the quality of a given local learning algorithm, we will compute the average square reconstruction error given by

 ε=N−1¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯(W∗−W)2=Y−2ρ+Q, (6)

where we define order parameters

 Y=N−1¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯(W∗)2,Q=N−1¯¯¯¯¯¯¯¯¯¯¯¯(W)2,ρ=N−1¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯W∗⋅W, (7)

representing, respectively, the squared lengths of the teacher and student coupling vectors and the overlap between teacher and a student coupling vectors. Here the overline defines an expectation over the ensemble of training data drawn at random from an Ising model with teacher couplings , i.e.

 ¯¯¯¯¯¯¯¯¯¯¯(…)=∑σ1,…,σMM∏k=1P(σk|J∗)(…). (8)

Since there is often no explicit analytical solution to the minimisers of (4), we will resort to a statistical physics approach which has been successfully applied to the analysis of a great variety of problems related to learning in neural networks [15, 16, 17]. In this approach [25], one defines a statistical ensemble of student weights by a Gibbs distribution

 p(W)=1Zexp[−νE(W)], (9)

with the partition function

 Z=∫dWexp[−νE(W)], (10)

where represents an effective temperature which controls the fluctuations of the ’training energy’ . Using techniques from statistical physics of disordered systems one computes order parameters at nonzero temperature and performs the limit at the end of the calculation. The ’thermal average’ with respect to the distribution (9) converges to the minimiser of the cost function . Order parameters can be extracted from the quenched average of the free energy corresponding to (8) using the replica method:

 F=−N−1ν−1¯¯¯¯¯¯¯¯¯lnZ=−limn→0N−1ν−1∂∂nln¯¯¯¯¯¯¯Zn, (11)

where the average replicated partition function for integer is given by

 ¯¯¯¯¯¯¯Zn=∫n∏a=1dWa{∑σP(σ|J∗)exp[−νn∑a=1E(Wa;σ)]}αN. (12)

To allow for an analytical treatment, we assume that the local cost function depends on the spins and couplings only via and the local field in the following way:

 E(W;σ)=Φ(σ0h). (13)

Obviously, the pseudo–likelihood cost function (5) belongs to this class of functions. The goal of the following Section is to perform the expectation (12). The resulting expression depends on a set of order parameters and can for integer be evaluated by standard saddle–point methods in the limit . Performing an analytical continuation for yields both the free energy and the self–averaging values of these order parameters. While in most previous applications [15, 16, 17] of this programme to learning in neural networks, the quenched average over data in (12) is straightforward, the required average over Ising spin configurations drawn from the distribution (1) cannot be performed (for arbitrary ) in closed form. One might attempt a solution to this problem by introducing a second set of replicas which would deal with the partition function in the denominator of (1). We expect that such an approach can be carried on for random teacher couplings but may lead to complicated expressions which have to be carefully evaluated for . In the next Section we will use a simpler approach using ideas of the cavity method [19] which allows, under certain assumptions on the teacher coupling matrix , the explicit computation of the quenched average for .

## 5 Cavity approach I: quenched averages

In order to perform the quenched average in (12), we will combine the replica approach with ideas of the so–called cavity method. In doing so we write the Gibbs distribution (1) corresponding to the teacher couplings in the form

 P(σ|J∗)∝exp⎡⎣βσ0∑j≠0J∗0jσj⎤⎦P\tiny{cav}(σ∖σ0), (14)

where denotes the distribution of the remaining spins in a system where the spin was removed, creating a cavity at this site, which gives the method its name. The replicated partition function depends only on the fields where . The cavity assumption for the statistics of such fields in densely connected systems can be summarised as follows: in performing expectations over

, we can assume that dependencies between spins are so weak that random variables

become jointly Gaussian distributed in the limit

. Hence, the joint distribution of spin

and the fields can be expressed as

 P(σ0,h∗,h1,…hn)=1Z0eβσ0h∗p\tiny{cav}(h∗,h1,…hn) (15)

with the normalisation

 Z0=2∫cosh(βh∗)p\tiny{cav}(h∗)dh∗ . (16)

Assuming that in absence of external fields we have vanishing magnetisations (paramagnetic phase), the distribution is a multivariate Gaussian density with zero mean and covariance

 ⟨hahb⟩=1N∑i,j≠0WaiC∖0ijWbj. (17)

The matrix is the correlation matrix of the reduced spin system (without ), which does not depend on the couplings . We have and assume that typically for and large . However, this scaling does not mean that we can neglect the non–diagonal matrix elements. We will later see that they give nontrivial contributions to the final reconstruction error. Within this framework, the quenched average in (12) is rewritten in terms of integrals over the random variables as follows:

 ∑σP(σ|J∗)exp[−νn∑a=1E(Wa;σ)]=∑σ0∫dh∗n∏a=1dha1Z0exp[βσ0h∗]exp[−νn∑a=1Φ(σ0ha)]p\tiny{cav}(h∗,h1,…hn). (18)

This result can be expressed by the covariances (17) which in the limit will become self averaging order parameters which will be computed by the replica method (Appendix A). Under the assumption of replica symmetry (which is expected to be correct for convex cost functions, which holds e.g. in the case of pseudo–likelihood), these new order parameters and their physical meaning are denoted as:

 V≐1N∑i,j≠0W∗iC∖0ijW∗jR≐1N∑i,j≠0W∗iC∖0ij⟨Wj⟩w=1N∑i,j≠0W∗iC∖0ijWaja≠∗,q0≐1N∑i,j≠0⟨WiWj⟩wC∖0ij=1N∑i,j≠0WaiC∖0ijWaja≠∗,q≐1N∑i,j≠0⟨Wi⟩wC∖0ij⟨Wj⟩w=1N∑i,j≠0WaiC∖0ijWbja≠b≠∗, (19)

where the brackets denote averages with respect to the distribution of couplings (9).

## 6 Replica result

Using a replica symmetric ansatz, the computations follow the approach summarised in Appendix A. In the zero temperature limit , the fluctuations of student couplings vanish and we obtain the convergence of the order parameters with the limiting ’susceptibility’

 x≐limν→∞(q0−q)ν=limν→∞νN∑i,j≠0(⟨WiWj⟩w−⟨Wi⟩w⟨Wj⟩w)C∖0ij

remaining finite and nonzero. As a main result, we find that the auxiliary order parameters (19) are obtained by extremizing the limiting free energy function

 F=−extrq,R,x{12q−R2/Vx+α∫dvGβR,q(v)maxy[−(y−v)22x−Φ(y)]}, (20)

where denotes a scalar Gaussian density with mean

. Remarkably, this free energy does (for any fixed cost function ) only depend on the teacher couplings via the order parameter , defined in equation (19). To compute the prediction error, however, we need the ’original’ order parameters (7). These can be expressed by the auxiliary ones , and . This relation can be derived from the free energy (Appendix C) in a standard way by adding corresponding external fields to the ’Hamiltonian’ in the Gibbs free energy (9). This relation brings back further statistics related to the teacher couplings via

 ρ =RYV, (21) Q =(q−R2V)1NTr¯¯¯¯¯¯¯¯¯¯C−1+R2YV2,

with the corresponding reconstruction error

 ε=(q−R2V)1NTr¯¯¯¯¯¯¯¯¯¯C−1+Y(1−RV)2. (22)

In deriving these results, we have also assumed that for , . Note that the prediction error is larger than the one we would get if we had neglected the off–diagonal elements of the correlation matrix . The error (22) depends on the teacher couplings through the parameter and the parameter

(the cavity variance of the teacher field) and through the trace of the inverse correlation matrix

corresponding to the teacher’s spin distribution. We will show later that the latter quantity can be expressed by the former using a second application of the cavity method. In the next section, we will see that the parameter can be estimated from the data.

We will illustrate the result (22) for the case of random teacher couplings drawn independently for from a Gaussian density of variance . This corresponds to the celebrated Sherrington–Kirkpatrick (SK) model [26]. For , i.e. outside of the spin–glass phase, our simple form of the cavity arguments are known to be correct [19] and one finds the values

 V=Y=1,limN→∞1NTr¯¯¯¯¯¯¯¯¯¯¯¯¯¯(C)−1=1+β2, (23)

for zero magnetisations in the literature [27]. A comparison of the theory (22) with numerical simulations is shown in Section 12.

Among the simplest functions satisfying the property (13), we consider quadratic cost functions of the form

 Eη(W)=12∑i≠0,j≠0Wi^CijWj−η√N∑j≠0^C0jWj, (24)

where the empirical correlation matrix is defined as

 ^Cij≐1MM∑k=1σkiσkj. (25)

These allow for an explicit computation of the estimator in terms of a matrix inversion. The estimator minimizing (24) is given by

 Wηi=η√N∑j≠0(^C−1−0)ji^C0ji≠0, (26)

where the matrix is the submatrix of where the -th column and -th row are deleted (not to be confused with the cavity matrix ) and is a free parameter. The estimation error can be computed from the free energy (20) by setting

 Φ(h)=h22−ηh, (27)

and gives (see Appendix D)

 ε=(βη1+β2V−1)2Y+η2(α−1)(1+β2V)1NTr¯¯¯¯¯¯¯¯¯¯C−1. (28)

The optimal choice for the quadratic cost function (24) is found by fixing the parameter to the value that minimizes the error (28), namely

 ηopt=(α−1)(1+β2V)βY(α−1)β2Y+(1+β2V)1NTr¯¯¯¯¯¯¯¯¯¯C−1, (29)

with the corresponding minimal error

 εopt=(1+β2V)Y1NTr¯¯¯¯¯¯¯¯¯¯C−1(α−1)β2Y+(1+β2V)1NTr¯¯¯¯¯¯¯¯¯¯C−1. (30)

In general, the computation of the optimal parameter requires the knowledge of the three parameters , and which characterise the statistical ensemble to which he unknown teacher matrix belongs. However, (29) simplifies as and we get

 limα→∞ηopt=1+β2Vβ. (31)

We will now show that the remaining parameter can be estimated from the observed data. We use the fact that at its minimum, the cost function (24) equals

 Eη(Wη)=−N2η2Δ, (32)

where we have used (26) and defined

 Δ=∑i≠0,j≠0^C0i(^C−1−0)ij^C0j, (33)

which only depends on the spin data. On the other hand in the situations where our statistical physics formalism applies, the minimal training energy (32) will be self–averaging in the thermodynamic limit and can be computed as the zero temperature limit of the free energy, i.e. the free energy function (20) evaluated at the stationary values of the order parameters. The calculation in (Appendix D) yields

 Δ=1+αβ2Vα(1+β2V). (34)

This shows, that the unknown parameter and the asymptotically optimal parameter can be directly estimated from the observed spin correlations.

In the next section, we will show that the optimal quadratic cost function yields in fact the total optimum of the reconstruction error with respect to free variations of the cost function .

## 8 The optimal local cost function

In this section, we will derive the form of the optimal local cost function within the cavity/replica approach and show that it is quadratic. Hence, the results of the previous section can be applied, where the optimal quadratic cost function was already computed. We will give a derivation of this fact for the case of finite inverse ’temperature’ , assuming that the argument can be continued to .

The optimisation of cost functions for learning problems within the replica approach goes back to the work of Kinouchi and Caticha [28]. We will follow the framework of [29] (see also [30]). Our goal is to minimise an error measure for a learning problem which is of the form such as (22). It depends on order parameters which are computed by setting the derivatives of a free energy function (such as 79) equal to zero. The main idea is to take these conditions into account within a Lagrange function

 ε(R,q,q0)+∑S∈R,q,q0λS∂∂SFΦ(R,q,q0), (35)

where the are the corresponding Lagrange multipliers. The optimal function is obtained from the variation

 δδΦ∑S∈R,q,q0λS∂∂SFΦ(R,q,q0)=0. (36)

For our problem, we can write (see 71, 79)

 FΦ(R,q,q0)=F0(R,q,q0)−αν∫GβR,q(v)lnΨq0−q(v)dv, (37)

where is independent of and denotes a scalar Gaussian density with mean and standard deviation . The free energy depends on through the function

 Ψq0−q(v)≐∫Gv,q0−q(y)e−νΦ(y)dy. (38)

We will first derive a condition on the form of the optimal function from the variation

 δδΨ∑S∈R,q,q0λS∂∂S∫GβR,q(v)Ψq0−q(v)dv=0. (39)

From this, we will recover the form of the optimal . To obtain the derivatives with respect to the order parameters we use the following rules for expectations over Gaussian measures, which can be easily derived using integration by parts

 ∂∂μ∫Gμ,ω(v)f(v)dv =∫Gμ,ω(v)∂vf(v)dv, (40) ∂∂ω∫Gμ,ω(v)f(v)dv =12∂2∂μ2∫Gμ,ω(v)f(v) (41) =12∫Gμ,ω(v)∂2vf(v)dv. (42)

Hence, the derivatives required for (39) are

 ddR∫GβR,q(v)lnΨq0−q(v)dv=β∫GβR,q(v)∂vlnΨq0−q(v)dv, (43) ddq0∫GβR,q(v)lnΨq0−q(v)=12∫GβR,q(v)∂2vΨq0−q(v)Ψq0−q(v)dv= (44) 12∫GβR,q(v){∂2vlnΨq0−q(v)+(∂vlnΨq0−q(v))2}dv, ddq∫GβR,q(v)lnΨq0−q(v)dv=β22∫GβR,q(v)∂2vlnΨq0−q(v)dv− (45) ddq0∫GβR,q(v)lnΨq0−q(v)dv.

An application of standard variational calculus to a linear combination of these order parameter derivatives shows that

 ∂vlnΨq0−q(v)=c1+c2∂vlnGβR,q(v), (46)

where are independent of . Since the logarithm of the Gaussian density is a quadratic function in , we conclude that also is a quadratic expression in the variable , making a (non–normalised) Gaussian density.

To conclude our argument on the optimal form of , we use relation (38). This shows that the Gaussian density is the convolution of a (non–normalised) Gibbs density of a random variable with the density of a Gaussian random variable . As a convolution corresponds to the addition two random variables, we know that is also a Gaussian random variable. Since is Gaussian, then is also a Gaussian density and is quadratic in . We have a already computed the best quadratic cost function in the previous Section, and we conclude that the estimator (26) with (29) is the best local estimator of the couplings.

## 9 Cavity approach II: TAP equations and approximate mean field ML estimator

So far we have ignored the symmetry of the coupling matrix by restricting ourselves to estimators derived from local cost functions. In this Section, we will discuss a well known approximation [31] of the (symmetric) maximum likelihood estimator which is based on mean field theory. We will re–derive this estimator using the more advanced (adaptive) TAP mean field theory, because its results for the spin correlation matrix will also be needed in the following. We will later compute its reconstruction error in Section 10. Our starting point is a generalisation of the well known TAP mean field approach developed for the SK model. Using the cavity approach [31] one derives the following ’adaptive’ TAP equations for the magnetisations

 mi=tanh(β∑jJijmj−β2Vimi+βHi), (47)

where

 Vi=⟨{∑jJij(σj−⟨σj⟩∖i)}2⟩∖i (48)

is the variance of the cavity field at spin . Using a linear response argument (i.e. by taking the derivative of eq. (47) with respect to ), one obtains the following cavity approximation to the susceptibility , i.e. the covariance matrix of the spins:

 χ(J)=(Λ−βJ)−1, (49)

where the diagonal matrix has elements

 Λii=β2Vi+1χii=β2Vi+11−m2i. (50)

From this result, we can draw the following conclusions:

1. Writing the moment matching conditions for the maximum likelihood estimator as

 Cij(J)≐⟨σiσj⟩=^Cij≐1MM∑k=1σkiσkj (51)

and specialising to the paramagnetic case , we have . Hence, the cavity approximation (49) yields the mean field (MF) estimator given by [6]

 JMFij=−1β(^C−1(J))ijfori≠j . (52)

At first glance, this simple and explicit form of a (symmetric) coupling estimator does not seem to fit into the framework developed in this paper. Surprisingly, we will derive a local cost function in the next Chapter which allows for the computation of the reconstruction error using the statistical physics approach.

2. Inverting (49) and using (50) for , we get an expression for the trace of the inverse spin correlation matrix in terms of the variances of the cavity fields at all spins which is given by

 1NTr¯¯¯¯¯¯¯¯¯¯C−1=β2N∑iVi+1 . (53)

If we assume that the teacher couplings

can be viewed as generated from a random matrix ensemble for which the

become self–averaging, i.e. as we finally obtain the simple result

 limN→∞1NTr¯¯¯¯¯¯¯¯¯¯C−1=β2V+1 . (54)

With this result, we can eliminate another unknown parameter of the teacher’s ensemble of couplings, as we have shown that can be estimated from the observed spin data, see (33) and (34).

Equation (54) agrees with the special result (23) for the SK model, since the ’Onsager correction’ in the TAP equations for the SK model gives . As an application of the general result (54), we present numerical results for the reconstruction error for the Wishart ensemble in Section 12, where the couplings are given by

 J∗ij=1NγN∑μ=1ξμiξμj (55)

and the are independent zero mean Gaussian random variables with unit variance. The thermodynamics of this model agrees with that of the celebrated Hopfield model of a neural network (where ) [32], in the phase where there is no macroscopic overlap between the spin configurations and a stored pattern. Hence, we can read off the cavity variance from the TAP mean field equations obtained by [25], setting . One finds

 V=γ1−β. (56)

For other random matrix ensembles which are invariant against orthogonal transformations it is possible to obtain a general expression for the cavity variance in terms of the so–called R–transform of the matrix ensemble (for details, see [33, 34]

) and can be expressed by the limiting eigenvalue spectrum of the matrices.

## 10 Reconstruction error for MF–ML estimator

We will now turn to the computation of the reconstruction error for the MF–ML estimator (52). At first glance, this estimator does not seem to be related to a local cost function in the style of (4). But surprisingly, it is not hard to construct such a function. If we specialise again to the estimation of the coupling vector corresponding to spin , we can simplify the estimator (52) using the matrix inversion lemma [35] in the form

 WMFi=√NJMF0i=−√Nβ(^C−1)0i=√Nβϕ0∑j≠0(^C−1−0)ji^C0ji≠0, (57)

where

 ϕ0=11−∑i,j≠0^C0i(^C−1−0)ij^C0j=11−Δ, (58)

where was introduced in (33). Assuming as before, that is self–averaging for , the mean field estimator is of the form (26) and is associated to a cost function of the form (24). Hence, the results of sections 7 apply. In particular, from (34) and (58) we compute

 ϕ0=11−Δ=α(1+β2V)α−1, (59)

and the estimation error is given by (28) with the parameter :

 εMF=Y(α−1)2+α2β2(α−1)3(1+β2V)1NTr¯¯¯¯¯¯¯¯¯¯C−1. (60)

## 11 Asymptotics

We will now investigate the limiting scaling of the reconstruction error as the number of data grows much larger than the number of parameters (per spin) to be estimated. This means we consider the limit . This is of special interest, because we can compare the results obtained by our replica/cavity approach with results derived independently by standard arguments of classical statistics. From (30), (60) and Appendix E we can see that, as , the scaling of the reconstruction errors for the pseudo–likelihood estimator, the optimal local estimator and the mean field estimator is

 ε≃cα, (61)

where

 c\tiny{PLM} = 1β21∫dvGβV,V(v)(1−tanh2(βv))1NTr¯¯¯¯¯¯¯¯¯¯C−1, (62) c\tiny{OPT} = 1+β2Vβ21NTr¯¯¯¯¯¯¯¯¯¯C−1=(1+β2V)2β2, (63) c\tiny{MF--ML} = c\tiny{OPT}, (64)

where in the second equality of the second line, we have used (54). Hence, asymptotically the simple mean field estimator and the optimal estimator converge to the true couplings at the same speed. Thus, one might conjecture that the mean field estimator is equivalent to the true maximum likelihood estimator in the thermodynamic limit, assuming that the cavity approach is correct.

The validity of the inequality given by (62) and (63) will depend on the temperature and can be established at least for small enough. For the SK model, this region covers the entire paramagnetic phase where our simple cavity method is valid. However, the difference between the two is not very large for small . In fact, expanding (62) in powers of shows that error coefficients for both estimators agree up to terms of order . One may however argue that the comparison between the two estimators is not fair, because the pseudo–likelihood estimator does not yield symmetric couplings whereas the mean field one (and hence, asymptotically the optimal one) does. One might thus get a better estimate by a final symmetrisation. Unfortunately, with our present method, the effect of symmetrisation on the reconstruction error cannot be computed. We expect that methods of random matrix theory would be needed for this. Hence, we postpone a treatment of this problem to future publications. On the other hand, preliminary simulations show that the improvement of the pseudo likelihood estimator after symmetrisation is rather weak (at least for the systems with random couplings studied in this paper). This result is further supported by the fact that for small , the pseudo likelihood estimator is already almost symmetric, a fact that can be easily shown, if we expand (5) for small . The lowest order term yields an explicit result which is symmetric.

We want to compare the replica based asymptotics (62) and (63-64) with exact asymptotic expressions for the errors of statistical estimators which are defined by the minimisation of smooth cost functions of the type (4), see e.g. [20] or [25] for an alternative derivation using replicas. The idea behind such asymptotic results is an expansion of the cost function in terms of the parameters around the teacher parameters (assuming convergence to the teacher in the infinite data limit). Setting

and using the law of large numbers and central limit arguments one can show the following equation for the data averaged correlations

 ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯δWiδWj≃1Nα[(U−1BU−1)ij]forα→∞, (65)

together with . The matrices are given by

 Uij=⟨∂i∂jE(W∗;σ)⟩,Bij=⟨∂iE(W∗;σ)∂jE(W∗;σ)⟩. (66)

The partial derivatives are with respect to the components of and the brackets are averages over spins using the distribution . For the pseudo–likelihood case, (65) can be further simplified. In Appendix F, we show that in this case and we finally obtain

 ε≃N−1∑i¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯(δWi)2=1NαTrB−1, (67)

with

 Bij=β2⟨σiσj⎡⎣1−tanh2(β1√N∑j≠0W∗0jσj)⎤⎦⟩. (68)

If we neglect the correlations between and the field for large and note that , this result is in agreement with (62).

A similar calculation is possible for the OPT/MF–ML case. Here we get

 Uij=βϕ0NCij,Bij=β2ϕ20N⟨σiσjh2∗⟩+1NCij−2βϕ0N⟨h∗σ0σiσj⟩. (69)

To obtain the asymptotics of the replica result (63) form these matrices, we assume that the dependencies between the random variables on the one hand and respectively and on the other hand can be neglected for . Using the facts that , and finally yields (63) .

## 12 Numerical Results

In the previous Sections we saw that the error of any algorithm that infers the network couplings by minimizing a cost function of the kind (13) satisfies (22), in the large limit, when the cavity arguments apply. The order parameters are the ones extremizing the free energy (20). For pseudo–likelihood maximization, the set of equations (84) for the order parameters has to be solved numerically, whereas for the local optimal and MF–ML estimators we computed analytically the error in the form, respectively, (30) and (60). Note that the error (22) is expressed in terms of three parameters that depend on the distribution of the teacher couplings: , and the trace of the inverse correlation matrix . As an example, we considered the the Gaussian ensemble of the SK model, with parameters given by and relation (23) and the Wishart ensamble of (55) with parameters given by and relations (54) and (56). Figure 1 compares the predicted error with the mean squared error that we get from simulations, as a function of . We show results for the pseudo–likelihood, local optimal and MF–ML algorithms applied to the SK and the Wishart model. We only report results for the high–temperature (paramagnetic) region, i.e. for where defines the onset of spin–glass ordering. In this region, we expect that on the one hand, the cavity arguments are exact and the other hand, the convergence of the spin simulations to the thermal equilibrium is sufficiently fast. For the SK model, we have and for the Wishart model for zero magnetization and small [36], where is the Edwards-Anderson order parameter. The data are generated by Monte Carlo sampling with a burn in time of spin updates and sampling every updates, and the couplings are recovered either by minimizing the pseudo–likelihood cost function (5) using a Newton method or from the empirical correlation matrices: see (26, 29) for the local optimal algorithm and (52) for MF–ML. The plot shows that the replica calculation predicts rather well the results of the simulations for systems of spins. In addiction, it is clear that the optimal local algorithm outperforms the other two methods and, in the high–temperature regime considered here, the MF–ML algorithm performs better than pseudo–likelihood maximization. This performance difference is more relevant for increasing and in the small region, whereas it is almost negligible for large , in agreement with the asymptotic expansions. Finally, we compare the analytical results for the asymptotic behavior of the error computed in Section 11 with the results from simulations. Assuming the scaling (61), we fitted the function to the mean squared error of the couplings inferred from simulations at large . In Table 1 we show that this ’experimental’ value of is consistent with (62) and (63, 64). We then plot the predicted value of as a function of in Figure 2, where we can see that the difference between pseudo–likelihood maximization, the local optimal and MF–ML algorithms is almost indistinguishable and goes to zero for small , as we would expect by noticing that the analytical formula for (62) agrees with (63, 64) up to second order in . From the plot it is also clear that for larger –i.e. smaller stochasticity of the spins– the error in predicting the couplings is smaller.

The three algorithms show different behaviors in the small region. As the MF–ML algorithm relies on the inversion of the correlation matrix (52), that becomes singular at , its error diverges at , as can be seen from (60). On the contrary, the error of the optimal local algorithm shows no divergence, since and at (see 29, 30). From simulations we also observe that the error of the pseudo–likelihood estimator increases for decreasing and for it reaches large values, with large variations across trials, while the extremization of the free energy (see 84) fails in the region . A way to overcome this divergence is to introduce a regularizing term in the objective function. We postpone the study of regularized estimators to future work. We present additional plots in Appendix G, showing the error dependence on the system size.

## 13 Discussion and outlook

We have presented a statistical physics approach for calculating the reconstruction error of algorithms for learning the couplings of large Ising models. Our method assumes local cost functions for learning and is based on a combination of the replica trick and of cavity arguments for computing quenched averages over spin configurations which are drawn at random from a teacher network. A replica symmetric ansatz seems to be justified as long as the learning algorithms are based on convex cost functions. The cavity approach assumes a large densely connected network with couplings that are roughly of the same size leading to only weakly correlated spins. These assumptions are correct in the thermodynamic limit for certain statistical ensembles of network couplings but may also give good approximations for realistic networks. While our method is so far restricted to problems which are realisable by pairwise spin–interactions, it could nevertheless be of practical interest in providing approximate statistics for hypotheses testing against more complicated network models (having e.g. 3–spin interactions).

Our results show that the learning problem is, at least within our framework, surprisingly simple: An explicit estimator based on a quadratic cost function achieves minimal error and outperforms the more complicated pseudo–likelihood estimator. This optimal estimator only requires prior knowledge of the length of the true coupling vector. Moreover, a simple (symmetric) mean field approximation to the maximum likelihood estimator is asymptotically optimal and can be computed without such prior knowledge. In the case of the SK model, the region of small in which these results hold covers the entire paramagnetic phase, where our simple cavity arguments are known to be valid. It would be interesting to work out analytically how well the mean field estimator approximates the exact maximum likelihood estimator in the thermodynamic limit.

Our work is only a first step to an understanding of the typical performance of learning algorithms for the inverse Ising problem. From a technical point of view our method could be generalised in several directions. We have restricted ourselves to models where data are sampled from the paramagnetic phase of a teacher network. While it is possible to generalise the analysis, the average over samples from a spin–glass phase would usually require more complex types of cavity arguments [37] which are related to the breaking of the replica symmetry of the teacher network. In such a case, the simple Gaussian distribution of cavity fields on which our analysis strongly relies is no longer valid. One might expect that now the quadratic cost functions may no longer be optimal (and not even consistent) but could be outperformed by a pseudo–likelihood method.

We also expect that our cavity framework could be extended to sparse networks as long as the number of nonzero couplings per spin is large enough to allow for the application of the central limit arguments used in our work.

After finishing our work we became aware of a recent preprint [30] where similar learning problems (focussing on a teacher model with independent Gaussian couplings) were studied. The author applied a double replica calculation (the other set of replica are used for dealing with the partition function in the quenched average over the spins) instead of using cavity arguments. This results in a free energy function which agrees essentially with our result (20). However, the order parameters appearing in the free energy are not defined by (19) but by (7) instead, and the reconstruction error differs from ours. The major difference is that the result for the error in [30] does not contain the spin–correlation matrix as in our equation (22). We believe that this could be related to an implicit approximation of the correlation matrix by a unit matrix.

## Acknowledgements

We would like to thank Andrea Pagnani for fruitful discussions that significantly motivated our work.

## Appendix A Details of the replica calculation

From (11-12-18) one can see that the free energy can be written as

 F=−limn→0N−1ν−1∂∂nln∫n∏a=1dWa{∑σ0∫dh∗n∏a=1dha1Z0exp[βσ0h∗]exp[−νn∑a=1Φ(σ0ha)]p\tiny{cav}(h∗,h1,…hn)}αN, (70)

where is a multivariate Gaussian density with zero mean and covariance given by (17). The average over the Gaussian fields yields quadratic terms in and , that can be simplified by introducing the order parameters