The replica method from statistical mechanics has been applied to Bayesian inference problems (e.g., coding, estimation) for some two decades Tanaka ; kabashima2003cdma . Rigorous proofs of the formulas for the mutual informations/entropies/free energies stemming from this method have long been only partial, generally consisting of one-sided bounds franz2003replica ; Franz-Leone-Toninelli ; KoradaMacris_CDMA ; MontanariTse ; Montanari-codes ; Macris05CorrInequ ; Macris07LDPC ; Kudekar-Macris-2009 . It is only quite recently that there has been a surge of progress using various methods – namely spatial coupling Giurgiu_SCproof ; XXT ; 2018arXiv181202537B ; barbier_ieee_replicaCS ; barbier_allerton_RLE , information theory private , and rigorous versions of the cavity method coja-2016 ; 2016arXiv161103888L ; 2017arXiv170108010L – to derive full proofs, which are however typically quite complicated. Recently we introduced in BarbierM17a a powerful evolution of the Guerra-Toninelli interpolation method generalised-mean-field ; guerra2002thermodynamic ; guerra2005introduction – called adaptive interpolation – that allows one to fully prove the replica formulas in a simple and unified way for Bayesian inference problems. In this contribution we give a pedagogical introduction to this new method.
We first illustrate how the adaptive interpolation method allows one to solve the well-known Curie-Weiss model. For this model a simple version of the method already contains most of the crucial ingredients. The present rigorous method of solution is new and arguably simpler than the usual ones. Most of these ideas can be transferred to generic Bayesian inference problems, and here this is reviewed in detail for the rank-one matrix estimation or factorisation problem (one of the simplest non-linear estimation problems). The solution of Bayesian inference problems requires only one supplementary ingredient: the concentration of the “overlap” with respect to both the thermal and the quenched disorder. Remarkably, this can be proven for Bayesian inference problems in a setting often called Bayesian optimal inference, i.e. when the prior and hyper-parameters are all known.
The adaptive interpolation method has been fruitfully applied to a range of more difficult problems with a dense underlying graphical structure. So far these include matrix and tensor factorisation 2017arXiv170910368B , estimation in traditional and generalised linear models barbier2017phase (e.g., compressed sensing and many of its non-linear variants) with random i.i.d. as well as special structured measurement matrices RLEStructuredMatrices , learning problems in the teacher-student setting barbier2017phase ; NIPS2018_7584 (e.g., the single-layer perceptron network) and even multi-layer versions NIPS2018_7453 . For inference problems with an underlying sparse graphical structure, full proofs of replica formulas are scarce and much more involved, e.g., Giurgiu_SCproof ; coja-2016 . The interpolation method and replica bounds for sparse systems were pioneered by Franz and Leone in franz2003replica ; Franz-Leone-Toninelli (see also bayati2013 ; salez2016interpolation ), but so far the adaptive interpolation method is still in its infancy for sparse systems eric-censor-block and it would be desirable to develop it further. The method was initially formulated with a more technical discrete interpolation scheme, but it was already observed there that a continuous interpolation is natural BarbierM17a . The continuous analysis presented here was then explicitly developed for the tensor factorisation problem in 2017arXiv170910368B . Here we directly use the continuous version, which is certainly more natural when the underlying graphical structure is dense. So far, however, for sparse systems, which present new difficulties, only the discrete analysis has been developed eric-censor-block .
2 The Curie-Weiss model revisited
Consider the Hamiltonian of the ferromagnetic Ising model on the complete graph of $n$ spins $\sigma_i \in \{-1,1\}$, $i=1,\dots,n$, or Curie-Weiss model:
$$\mathcal{H}_n(\sigma) = -\frac{J}{n}\sum_{1\le i<j\le n} \sigma_i\sigma_j - h\sum_{i=1}^n \sigma_i,$$
where $J>0$ is the coupling constant and $h$ the external field. The free energy is
$$f_n := \frac{1}{n}\ln \mathcal{Z}_n,$$
with $\mathcal{Z}_n := \sum_{\sigma \in \{-1,1\}^n} e^{-\beta \mathcal{H}_n(\sigma)}$ the partition function and $\beta$ the inverse temperature.
Let and the potential:
Note that this potential verifies at its stationary point(s)
The purpose of this section is to prove in a new fashion the following variational formula for the free energy:
The thermodynamic limit of the free energy for the Curie-Weiss model verifies
A trivial maximization over yields
where the second equality is obtained from the first one by parametrizing , . This well known variational formula can also be obtained directly by a simpler formulation of the adaptive interpolation discussed in section 3. However, this simpler formulation is not powerful enough for more complicated problems.
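As a concrete numerical sanity check (not part of the proof), the classical one-parameter form of this variational formula, $\lim_{n\to\infty} f_n = \sup_{m\in[-1,1]}\{\ln 2\cosh(\beta(Jm+h)) - \beta J m^2/2\}$, can be compared against an exact finite-$n$ evaluation of the partition function by summing over magnetization sectors. The sketch below assumes the standard convention $f_n = \frac1n\ln\mathcal{Z}_n$; the function names are ours.

```python
import math

def free_energy_exact(n, beta, J, h):
    """Exact f_n = (1/n) ln Z_n for the Curie-Weiss model, summing over
    magnetization sectors; uses the (equivalent up to O(1/n)) Hamiltonian
    -J M^2/(2n) - h M with M = sum_i sigma_i, and log-sum-exp for stability."""
    terms = []
    for k in range(n + 1):                      # k spins equal to -1
        M = n - 2 * k
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        terms.append(log_binom + beta * J * M * M / (2 * n) + beta * h * M)
    t_max = max(terms)
    return (t_max + math.log(sum(math.exp(t - t_max) for t in terms))) / n

def potential(m, beta, J, h):
    """One-parameter Curie-Weiss potential (standard form, our convention)."""
    return -beta * J * m * m / 2 + math.log(2 * math.cosh(beta * (J * m + h)))

def free_energy_variational(beta, J, h, grid=100001):
    return max(potential(-1 + 2 * i / (grid - 1), beta, J, h)
               for i in range(grid))

beta, J, h = 1.0, 1.0, 0.1
f_exact = free_energy_exact(4000, beta, J, h)
f_var = free_energy_variational(beta, J, h)
print(f_exact, f_var)   # the two agree up to O(ln n / n) corrections
```

The agreement improves as $n$ grows, the finite-size correction being of order $\ln n / n$.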
It is possible to check that
For the present model this can be checked by explicit computation of both sides. Otherwise, it follows directly from the fact that the potential equals minus a convex function of minus another convex function of (see, e.g., Appendix D of barbier2017phase ). Finally, we also note that maximization over on the r.h.s. yields another well-known formula for the free energy:
with the binary entropy function.
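The equivalence of the two well-known variational expressions, the “effective field” form with $\ln 2\cosh$ and the entropy form involving the binary entropy, can be verified numerically. The sketch below assumes both standard forms (related by a Legendre-type identity); `H2` denotes the binary entropy in nats and the parameter values are arbitrary.

```python
import math

def H2(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

beta, J, h = 1.2, 1.0, 0.3
grid = [-1 + 2 * i / 200000 for i in range(200001)]

# "effective field" form of the variational formula
f1 = max(-beta * J * m * m / 2
         + math.log(2 * math.cosh(beta * (J * m + h))) for m in grid)
# entropy form: energy term plus binary entropy of the magnetization
f2 = max(beta * J * m * m / 2 + beta * h * m + H2((1 + m) / 2) for m in grid)
print(f1, f2)   # the two maxima coincide
```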
2.1 Adaptive interpolation
Let for a sequence that tends to as for . Here is interpreted as a “perturbation field” that will soon play a crucial role (note that this field could also belong to without changing the subsequent analysis). Let be a “trial” function depending on an interpolating parameter and on , to be chosen (adapted) later on. Then set . Define an interpolating Hamiltonian, with interpolating parameter , as
and the corresponding interpolating free energy
In this definition is the partition function associated with the interpolating Hamiltonian. This model interpolates between the Curie-Weiss model at (with a slightly different external field ) and a decoupled spin model at with mean-field controlled by the trial function and the perturbation. It is then easy to verify that
To obtain the equality in the first line we use that , where the magnetization and is the thermal average, i.e. the expectation w.r.t. the measure proportional to . Therefore by the mean value theorem, where . In the second equality of the second line the perturbation term has been extracted by continuity.
In order to compare the free energy of the Curie-Weiss model with the potential we use the fundamental theorem of calculus. Using (3) we find directly
We now naturally turn to the calculation of . Let us introduce the notation for the Gibbs bracket of the interpolating system, which is the thermal average of a function of the spins:
2.2 Simplifying the sum rule: concentration of the magnetization
At this stage we need a concentration result for the magnetization, that states
and which holds for all values of the temperature, coupling constant and magnetic field. This is where the perturbation field plays a crucial role. For the Curie-Weiss model the proof is elementary and goes as follows. The thermal fluctuations of the magnetization are precisely given by the second derivative of the interpolating free energy (denoted when we need to emphasize its explicit -dependence) w.r.t. :
Assume that the map is a diffeomorphism (recall that a diffeomorphism is a bijection which is continuously differentiable and whose inverse is also continuously differentiable) whose Jacobian is greater than or equal to one for all ; we will say in this case that is regular. Under this assumption we can write
Then using (7) this leads to
Moreover note that which is bounded by in absolute value. Therefore the r.h.s. is bounded by which proves (6).
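The concentration statement can also be illustrated numerically (outside of the proof): with a nonzero external field, the variance of the magnetization under the Curie-Weiss Gibbs measure decays like $1/n$, as an exact enumeration over magnetization sectors shows. The parameter values below are arbitrary illustrative choices.

```python
import math

def moments(n, beta, J, h):
    """Exact first two moments of m = (1/n) sum_i sigma_i under the
    Curie-Weiss Gibbs measure, by summing over magnetization sectors."""
    logw, ms = [], []
    for k in range(n + 1):
        M = n - 2 * k
        logw.append(math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + beta * J * M * M / (2 * n) + beta * h * M)
        ms.append(M / n)
    t = max(logw)
    w = [math.exp(lw - t) for lw in logw]
    Z = sum(w)
    m1 = sum(wi * mi for wi, mi in zip(w, ms)) / Z
    m2 = sum(wi * mi * mi for wi, mi in zip(w, ms)) / Z
    return m1, m2

beta, J, h = 1.0, 1.0, 0.1          # nonzero field: away from phase coexistence
vars_ = []
for n in (500, 1000, 2000, 4000):
    m1, m2 = moments(n, beta, J, h)
    vars_.append(m2 - m1 * m1)
print(vars_)   # the variance shrinks roughly like 1/n
```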
Now, integrating the fundamental sum rule over and using this concentration result (with a sequence vanishing more slowly than as , i.e. with )
where is uniform in , and . Note that this identity is valid for an arbitrary trial function as long as is regular, a condition that we have to verify when using this sum rule for specific trial functions.
2.3 Matching bounds
2.3.1 Upper bound
2.3.2 Lower bound
Now we choose to be the solution of
Here one has to be careful and ask whether this equation possesses a solution, as the r.h.s. depends on the interpolation path through the function . The crucial observation is the following: from the set-up of the interpolation, the l.h.s. is and the r.h.s. is a function , so (9) is a first-order ordinary differential equation (ODE)
Thus the “perturbation” actually serves as the initial condition of this ODE. The function is with bounded derivative w.r.t. its second argument. Indeed , which is finite for finite . Therefore we can apply the Cauchy-Lipschitz theorem to assert that (10) possesses a unique global solution over , which we denote , with .
We want to substitute this solution into (8). Thus we have to check that the flow of this ODE is regular (i.e. a diffeomorphism with Jacobian greater than or equal to one). The argument is as follows. The flow is injective by uniqueness of the solution and since is itself . By the Liouville formula for the Jacobian (see hartmanordinary , Corollary 3.1 in Chapter V)
which is greater than one because . Also, this Jacobian never vanishes so the local inversion theorem combined with the fact that the flow is injective implies that this flow is a diffeomorphism. We can thus use the sum rule (8).
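The mechanism behind this regularity argument can be illustrated on a toy ODE. The right-hand side `F` below is a hypothetical stand-in for the actual Gibbs-bracket right-hand side (it is not taken from the text), chosen only because it is smooth with a non-negative $h$-derivative: the finite-difference Jacobian of the flow then matches the Liouville formula $\exp\int_0^t \partial_h F\,ds$ and is indeed at least one.

```python
import math

def F(t, h):
    # Hypothetical stand-in right-hand side: smooth in (t, h), dF/dh >= 0.
    return math.tanh(t + h)

def dFdh(t, h):
    return 1.0 - math.tanh(t + h) ** 2

def rk4(h0, T=1.0, steps=2000):
    """Integrate h'(t) = F(t, h) by RK4 and simultaneously accumulate the
    Liouville integral of dF/dh along the trajectory."""
    h, t, dt, liouville = h0, 0.0, T / steps, 0.0
    for _ in range(steps):
        k1 = F(t, h)
        k2 = F(t + dt / 2, h + dt * k1 / 2)
        k3 = F(t + dt / 2, h + dt * k2 / 2)
        k4 = F(t + dt, h + dt * k3)
        liouville += dFdh(t, h) * dt   # left-endpoint rule is enough here
        h += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return h, math.exp(liouville)

eps = 1e-6
h_a, jac_liouville = rk4(0.3)
h_b, _ = rk4(0.3 + eps)
jac_fd = (h_b - h_a) / eps            # finite-difference flow Jacobian
print(jac_fd, jac_liouville)          # both exceed one; they agree closely
```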
The concavity of allows us to extract the -integral from the first term in (8):
by recognizing the expression of the potential (1). A crucial observation is that with our particular choice of interpolating function we have
and thus . This ends the proof of Theorem 2.1.
3 Alternative formulation for the Curie-Weiss model in a random field
In this section we repeat the proof but directly obtain the simpler expression of Remark 1 for the free energy, with the variational formula involving a potential depending on the single parameter representing the magnetization. Moreover, we consider this time random i.i.d. external local fields, in order to show that these are easily handled in our approach (this case could also have been treated with the previous method); this model was first rigorously treated in PhysRevB.15.1519 . For convenience we consider to be supported on . Standard limiting arguments allow one to extend the support to the whole real line as long as the first few moments exist.
Looking at the derivation below, the reader might wonder why we took a seemingly more complicated path in the previous section by introducing a two-parameter potential depending on both and an “effective field” . This is because the two proofs are, as will soon become clear, based on different types of convexity arguments, and in many problems of interest, such as generalized linear estimation barbier2017phase or non-symmetric tensor estimation 2017arXiv170910368B , only the interpolation based on a two-parameter potential seems to be effective for obtaining a full proof of the replica formulas (instead of the one-sided bounds reachable with a single-parameter potential). The arguments are very similar to those of the previous section and we will be brief.
The Hamiltonian with random external fields, free energy and potential are this time
where is the expectation w.r.t. the random external fields h.
The thermodynamic limit of the free energy for the Curie-Weiss model with random external fields verifies
Set with taking values in . The interpolating Hamiltonian and free energy are this time
By computations similar to those of the previous section, the sum rule becomes in this case (here the trial function is unconstrained as we have not yet used concentration):
where is uniform in , , .
From there a first bound is straightforward. Set
. Then the “variance” in (14) cancels, while the “remainder” , and thus
. Note that this first bound did not require the concentration of the magnetization, nor the use of the degree of freedom allowed by the possible time-dependence of the interpolation function. These two ingredients are used now for the converse bound.
We now set to be the unique solution of the ODE with initial condition ; this solution exists for all by the Cauchy-Lipschitz theorem. With this choice we check that the partial derivative appearing in the exponential in the Liouville formula equals . Thus the flow of the ODE is regular. We can then average the sum rule (14) over a small interval in order to use, thanks to the regularity of the flow, the following concentration for the magnetization (see the appendices for the proof):
for a positive constant depending only on the support of and the coupling strength . This allows to simplify the sum rule by cancelling the remainder (up to a vanishing correction). Only the non-negative variance term survives which leads directly to (choosing a sequence going to at an appropriate rate) . This finally yields , and thus proves Theorem 3.1.
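For this random-field model, the single-parameter potential and its stationarity condition $m = \mathbb{E}_h \tanh(\beta(Jm+h))$ are easy to explore numerically. The sketch below assumes the standard averaged potential $-\beta J m^2/2 + \mathbb{E}_h \ln 2\cosh(\beta(Jm+h))$ and a two-point field law chosen purely for illustration.

```python
import math

beta, J = 1.5, 1.0
fields = [(-0.2, 0.5), (0.2, 0.5)]   # two-point field law: (value, probability)

def phi(m):
    """E_h-averaged single-parameter potential (standard form, our convention)."""
    return -beta * J * m * m / 2 + sum(
        p * math.log(2 * math.cosh(beta * (J * m + h))) for h, p in fields)

# stationarity condition m = E_h tanh(beta (J m + h)), solved by fixed-point
# iteration from a positive initialization
m = 0.5
for _ in range(500):
    m = sum(p * math.tanh(beta * (J * m + h)) for h, p in fields)

# brute-force maximization of the potential on a fine grid
m_grid = max((-1 + 2 * i / 100000 for i in range(100001)), key=phi)
print(m, phi(m), phi(m_grid))   # the fixed point attains the maximum
```

Note that for this symmetric field law the potential is even in $m$, so the grid search may return the opposite-sign maximizer; the attained values coincide.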
4 Rank-one matrix estimation, or Wigner spike model
Consider the following matrix estimation model, also called Wigner spike model in the statistics literature, which serves as a simple model of low-rank information extraction from a noisy data matrix, such as in principal component analysis. One has access to the symmetric matrix of observations obtained as
where the signal-vector to infer has i.i.d. components with , and the Gaussian noise is i.i.d. for and symmetric . In order to ease the proof, we consider a prior supported on a bounded interval . Then technical limiting arguments, as found in 2016arXiv161103888L ; barbier2017phase , allow, if desired, to extend the final result to unbounded support as long as the first few moments exist.
We consider the problem in the “high-dimensional” setting where the total signal-to-noise ratio (SNR) per parameter , is an order-one quantity, where denotes the SNR per observation. In the present case we have access to independent observations, with SNR for the off-diagonal terms and for the diagonal ones. Therefore we check
This explains the presence of the scaling in the observation model (16). Note that any other scaling would make the estimation task either trivial, if the total SNR per parameter tends to infinity, or impossible, if it tends to zero.
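A quick simulation illustrates why this is the interesting regime: with per-observation SNR of order $1/n$ but order $n^2$ observations, recovery is non-trivial yet possible. The sketch below uses a Rademacher signal and a simple spectral (power iteration) estimator; these are illustrative choices of ours, not the estimator discussed in the text.

```python
import math, random

random.seed(1)
n, lam = 300, 2.0          # lam: SNR per observation; note the sqrt(lam/n) scaling

x = [random.choice([-1.0, 1.0]) for _ in range(n)]   # Rademacher signal (assumption)
Y = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        z = random.gauss(0.0, 1.0)
        Y[i][j] = Y[j][i] = math.sqrt(lam / n) * x[i] * x[j] + z

# Power iteration on Y: a simple spectral estimator of the rank-one spike
v = x[:]                   # warm start; any vector not orthogonal to the spike works
for _ in range(100):
    w = [sum(Yi[j] * v[j] for j in range(n)) for Yi in Y]
    norm = math.sqrt(sum(wi * wi for wi in w))
    v = [wi / norm for wi in w]

# squared overlap with the (normalized) signal, a number in [0, 1]
overlap2 = sum(vi * xi for vi, xi in zip(v, x)) ** 2 / n
print(overlap2)   # partial but clearly non-trivial recovery at order-one SNR
```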
We suppose that we are in the Bayesian optimal setting where the prior as well as the noise distribution are known. The posterior, or Gibbs distribution, is of the form . It is convenient to re-express it in terms of the independent variables , instead of . Replacing by its expression (16), expanding the square, and then simplifying all the -independent terms with the normalization, it becomes
where the Hamiltonian and partition function are
The free energy of this model is then defined as
always denotes the expectation w.r.t. all (quenched) random variables in the ensuing expression (here and ). This quantity is directly related to the mutual information between the observations and the input signal through the simple relation
and variational expressions for this quantity are thus of fundamental interest. We will show that such expressions can be rigorously determined using the adaptive interpolation method.
The Hamiltonian is nothing but that of the so-called planted Sherrington-Kirkpatrick spin glass when the signal is binary with a Bernoulli prior; in this case the -integral becomes a sum over spin configurations . We shall see that, because this spin glass model stems from a Bayesian optimal setting, the replica symmetric formula for the free energy is exact. Let the replica symmetric potential be
with , and . This potential verifies at its stationary point(s)
where, by definition, the minimum mean-square error (MMSE) function of a scalar r.v. observed through a Gaussian channel , , with SNR , is
We will prove the following theorem, already proved using the adaptive interpolation method in BarbierM17a (there formulated in a more technical discrete-time setting). Note that the theorem was also proved earlier in koradamacris for a binary Bernoulli signal, and more recently using different (and more involved) techniques in XXT ; 2016arXiv161103888L ; 2018arXiv180101593E .
The thermodynamic limit of the free energy for the Wigner spike model verifies
We make two remarks similar to those made for the Curie-Weiss model.
Here the maximum over is attained at and one finds
as is usually found in the literature. This one-parameter variational expression can also be obtained directly by a simpler formulation of the adaptive interpolation discussed in section 5. For more complicated models, however, the simpler formulation is not powerful enough.
The r.h.s. can be optimized over by inverting the second equation in (18). There is a unique solution since the MMSE function is monotone decreasing in , which yields
In contrast to the Curie-Weiss model, for general priors we do not have explicit analytic expressions for and .
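Even without closed forms, the stationarity conditions are easy to handle numerically. As an illustration for a binary (Rademacher) signal, for which $\mathbb{E}[X|Y]=\tanh(\sqrt{\gamma}\,Y)$ and hence ${\rm mmse}(\gamma) = 1 - \mathbb{E}\,\tanh(\gamma + \sqrt{\gamma}\,Z)$ (a standard identity), one can iterate the fixed-point equation $q = 1 - {\rm mmse}(\lambda q)$. The quadrature scheme and parameter values are our choices.

```python
import math

def mmse_rademacher(g, N=2000, L=8.0):
    """mmse(g) = 1 - E_Z tanh(g + sqrt(g) Z): scalar MMSE of a Rademacher
    signal through a Gaussian channel at SNR g (midpoint quadrature on Z)."""
    if g <= 0.0:
        return 1.0
    dz = 2 * L / N
    s = 0.0
    for k in range(N):
        z = -L + (k + 0.5) * dz
        s += (math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
              * math.tanh(g + math.sqrt(g) * z) * dz)
    return 1.0 - s

def rs_fixed_point(lam, iters=200):
    """Iterate q <- 1 - mmse(lam * q), the stationarity condition of the
    replica symmetric potential, from a positive initialization."""
    q = 0.5
    for _ in range(iters):
        q = 1.0 - mmse_rademacher(lam * q)
    return q

q2, q4 = rs_fixed_point(2.0), rs_fixed_point(4.0)
print(q2, q4)   # the overlap grows with the SNR
```

Since the MMSE function is monotone decreasing, the iterates converge monotonically and the resulting overlap increases with the SNR, consistent with the uniqueness argument above.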
Such replica formulas have also been proven for (non-symmetric) low-rank matrix and tensor estimation (or factorisation) with varying degrees of generality koradamacris ; 2017arXiv170108010L ; 2017arXiv170200473M ; 2017arXiv170910368B ; 2018arXiv180101593E , for random linear estimation barbier_ieee_replicaCS ; barbier_allerton_RLE ; private , and for generalized estimation and learning barbier2017phase ; NIPS2018_7584 ; NIPS2018_7453 . The proof reviewed here, based on the adaptive interpolation method, is one of the simplest and most generic.
4.1 Adaptive interpolation
Let , for some sequence that tends to as for (as opposed to the Curie-Weiss model, here it is crucial that this perturbation as it plays the role of an SNR). Let and set . Consider the following interpolating -dependent estimation model, where , with accessible observations and obtained through
with i.i.d. Gaussian noise , and with . The function thus plays the role of an SNR in a scalar (i.e. decoupled) denoising problem. The associated interpolating posterior, written in Gibbs form, is