This decade is witnessing a burst of mathematical studies related to high-dimensional inference and learning problems. One reason is that an important arsenal of methods, developed in particular by the physicists and mathematicians working on the rigorous aspects of spin glasses, has found a rich new playground where it can be applied with success [1, 2, 3, 4, 5, 6, 7]. Models in learning like the perceptron and Hopfield neural networks have been analyzed in depth since the eighties by the physics community [8, 9, 10, 11, 12, 13], as have models in inference, e.g., in the context of communications and error-correcting codes [14, 15], using powerful but non-rigorous techniques such as the replica and cavity methods [16, 17]. But due to the difficulty and richness of these models, rigorous results experienced some delay with respect to (w.r.t.) the physics approaches and were restricted to very specific models such as the famous Sherrington-Kirkpatrick model [2, 6, 7]. The trend is changing and it is fair to say that the gap between heuristic (yet often exact) physics approaches and rigorous ones is quickly shrinking. In particular, important progress towards the vindication of the replica and cavity methods has been made recently in the context of high-dimensional Bayesian inference and learning. Examples of problems in this class where the physics approaches are now rigorously settled include low-rank matrix and tensor factorization [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], random linear and generalized estimation [30, 31, 32, 33, 34, 35], models of neural networks in the teacher-student scenario [34, 36, 37], or sparse graphical models such as error-correcting codes and block models [38, 39, 40].
All these results are based in some way or another on the control of the fluctuations of the order parameter of the problem, the overlap, which quantifies the quality of the inference. Optimal Bayesian inference (optimal meaning that the true posterior is known) is a ubiquitous setting in which the overlap can be shown to concentrate, and this in the whole regime of parameters (amplitude of the noise, number of observations/data points divided by the number of parameters to infer, etc.). When the overlap is self-averaging (which is the case in optimal Bayesian inference under a proper perturbation, see Theorems 3 and 4) then one expects replica symmetric variational formulas for the asymptotic free energy or mutual information density, as understood a long time ago by physicists [41, 42]. Actually, in the physics literature replica symmetry is generally the term used to mean precisely that the order parameter concentrates. This is in contrast with models where the overlap is not self-averaging, like in spin glasses at low temperature or combinatorial optimization problems, which leads to more complicated formulas for the free energy computed using Parisi’s replica symmetry breaking scheme [43, 16, 17, 5, 6, 7].
In most of the studied statistical models the overlap order parameter is a scalar. In the context of optimal Bayesian inference it is now quite standard to show that when the overlap is a scalar it is self-averaging in the whole phase diagram, see, e.g., [44, 27]. The techniques to do so were originally developed in the context of communications starting with [45, 46, 30] and then generalized in [47, 38]. In this paper we consider instead Bayesian inference problems where the signal to be reconstructed is made of vectorial components. In this case the overlap is a matrix and the associated replica formulas are variational formulas over matrices. The concentration techniques developed for scalar overlaps do not apply directly, and need to be extended using new non-trivial ideas. In particular, new difficulties appear w.r.t. the scalar case due to the fact that overlap matrices are not symmetric objects. Examples of problems where matrix overlaps appear are the factorization of matrices and tensors of rank greater than one, or the so-called committee machine neural network with few hidden neurons [48, 49, 50, 36]. In the context of spin glasses, matrix overlap order parameters have also appeared recently in studies of vectorial versions of the Potts and mixed p-spin models by Panchenko [51, 52]; in these models replica symmetry breaking occurs and the overlap does not concentrate. Let us also mention the recent work by Agliari and co-workers on a “multi-species” version of the Hopfield model, where a matrix order parameter also appears. There, concentration of the overlap, in the replica symmetric region where concentration is expected, is assumed based on strong physical arguments.
In the context of optimal Bayesian inference the situation is more favorable than in spin glasses: thanks to special identities that follow from Bayes’ rule, known as “Nishimori identities” in statistical physics (see, e.g., [54, 55]), we show in this paper how to control the overlap fluctuations in the whole phase diagram. (Footnote 1: Let us mention another particular setting where the (scalar) overlap as well as its multi-body generalizations (which appear in diluted problems, i.e., problems defined by sparse graphical models) can be controlled under proper perturbations for all parameter values in the phase diagram: ferromagnetic models.)
Section 2 presents the general setting, gives a few examples of models covered by our results, and explains the important Nishimori identity for optimal Bayesian inference problems. In section 3 we introduce the perturbation needed in order to prove overlap concentration, and then give our main results, Theorems 3 and 4. Then in section 4 we provide the proof of Theorem 4. Finally in section 5 we prove an important intermediate concentration result for another matrix, which will be key in controlling the overlap.
2 Optimal Bayesian inference of signals with vector entries
Consider a model where a hidden signal made of components (indexed by ), each of which is a -dimensional bounded vector (with dimensions indexed by ), is generated probabilistically. Its probability distribution, called the prior, may depend on a generic hyper-parameter with an arbitrary set, i.e., . We assume that the prior has bounded support (with arbitrarily large but independent of ). Then some data (also called observations) are generated conditionally on the unknown signal and a hyper-parameter , where is again generic. Namely, the data , with for a generic : the data and hyper-parameters can be vectors, tensors, etc. The conditional distribution is called the likelihood, or “output channel”. We also assume that the hyper-parameters and are probabilistic, with respective probability distributions supported on , and supported on .
The inference task is to recover the signal as accurately as possible given the data . We assume that the hyper-parameters , the likelihood and the prior are known to the statistician, and call this setting optimal Bayesian inference.
The information-theoretical optimal way of reconstructing the signal follows from its posterior distribution. Using Bayes’ formula the posterior reads
Employing the language of statistical mechanics we call
the base Hamiltonian, while the posterior normalization is the partition function of the base inference model. Finally the averaged free energy is minus the averaged log-partition function:
The average is over the randomness of the ground truth signal, the observations and hyper-parameters . These are called the quenched variables as they are fixed by the realization of the problem, in contrast with the dynamical variable which fluctuates according to the posterior. In general
will be used for an average w.r.t. all random variables in the ensuing expression. Note that the averaged free energy is nothing else than the Shannon entropy density of the observations (given the hyper-parameters). Therefore it is simply related to the mutual information density between the observations and the signal:
The conditional entropy density is often easy to compute, as opposed to the averaged free energy.
We call model (2.1) the “base model” in contrast with the perturbed model presented in section 3, a slightly modified version of the base model where additional side-information is given, and for which overlap concentration can be proved without altering the thermodynamic limit of the averaged free energy (if it exists), see Lemma 2.
The central object of interest is the overlap matrix (or simply overlap) defined as with
Here is a sample drawn according to the posterior distribution and is the signal (all vectors are columns, transposed vectors are rows). The overlap contains a lot of information. E.g., the minimum mean-square error (MMSE), an error metric often considered in signal processing, is related to it through
where we denote the expectation w.r.t. the posterior (2.1) of the base model. The minimization is over all functions of in , is the Frobenius norm. A simple fact from Bayesian inference is that the estimator minimizing the MMSE is the posterior mean . Another metric of interest in problems where, e.g., the sign of the signal is lost due to symmetries is the matrix-MMSE, again related to the overlap (the notation means for some positive constant depending only on the prior support ):
Finally, if one is interested in estimating the sum over a subset of the signal entries, a possible error metric is
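To make the overlap matrix concrete, here is a minimal sketch in plain Python. It assumes the convention that the overlap is the average over components of the outer product of a posterior sample with the signal; the function names, the toy signal and the toy "sample" are all hypothetical and serve only as an illustration.

```python
def overlap_matrix(sample, signal):
    """Return the K x K matrix (1/n) * sum_i x_i X_i^T (assumed convention)."""
    n = len(signal)
    K = len(signal[0])
    Q = [[0.0] * K for _ in range(K)]
    for x_i, X_i in zip(sample, signal):
        for l in range(K):
            for lp in range(K):
                Q[l][lp] += x_i[l] * X_i[lp] / n
    return Q


def trace(M):
    return sum(M[l][l] for l in range(len(M)))


# Toy signal with n = 4 components, each a K = 2 dimensional vector.
signal = [[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, 1.0]]
# A hypothetical "sample" that agrees with the signal except on the last component.
sample = [[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [-1.0, 1.0]]

Q = overlap_matrix(sample, signal)          # overlap between sample and signal
Q_self = overlap_matrix(signal, signal)     # overlap of the signal with itself
```

In this toy case the overlap between sample and signal is not a symmetric matrix, whereas the overlap of the signal with itself is, which illustrates why the scalar techniques need to be adapted.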
Let us provide some examples of models that fall under the setting of optimal Bayesian inference with vector variables as described in the previous section. In the symmetric order- rank- tensor factorization problem, the data-tensor is generated as
Here is a Gaussian noise tensor with independent and identically distributed (i.i.d.) entries for , and the signal components are i.i.d., i.e., with a prior of the form with a probability distribution supported on . The order-two (matrix) case is known as the Wigner spike model, or low-rank matrix factorization, and is one of the simplest probabilistic models for principal component analysis. The Wigner spike model is an example of a model where the signal’s sign is lost, and therefore a relevant error metric is the matrix-MMSE (2.2).
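As an illustration of the matrix case, the following sketch generates data from a low-rank spike plus symmetric Gaussian noise. The normalization by the square root of the number of components, the uniform sign prior and all names are assumptions of this sketch, not taken from the model definition above.

```python
import math
import random


def spiked_wigner(signal, rng):
    """Draw Y_ij = <X_i, X_j> / sqrt(n) + Z_ij with symmetric standard Gaussian noise.

    The 1/sqrt(n) normalization is an assumption of this sketch.
    """
    n = len(signal)
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            inner = sum(a * b for a, b in zip(signal[i], signal[j]))
            Y[i][j] = inner / math.sqrt(n) + rng.gauss(0.0, 1.0)
            Y[j][i] = Y[i][j]  # the data matrix is symmetric
    return Y


rng = random.Random(0)
n, K = 50, 2
# Hypothetical prior: i.i.d. uniform signs, hence bounded support.
X = [[rng.choice((-1.0, 1.0)) for _ in range(K)] for _ in range(n)]
Y = spiked_wigner(X, rng)

# Flipping the sign of the whole signal leaves the noiseless part unchanged,
# so the sign of X cannot be recovered from Y.
X_flipped = [[-a for a in row] for row in X]
same_spike = all(
    sum(a * b for a, b in zip(X[i], X[j]))
    == sum(a * b for a, b in zip(X_flipped[i], X_flipped[j]))
    for i in range(n)
    for j in range(n)
)
```

The final check makes explicit the sign ambiguity that motivates the matrix-MMSE as an error metric.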
Another model is the following generalized linear model (GLM):
Note that here the observations are i.i.d. given and ; this is the reason for the notation instead of , the latter representing the full likelihood while the former is the conditional distribution of a single data point. We also assume that the prior is decoupled over the signal components and . A particular and simple deterministic case is
This model is the committee machine mentioned in the introduction [34, 36]. Here represents the weights of the -th hidden neuron, and are -dimensional data points used to generate the labels . The teacher-student scenario in which our results apply corresponds to the following: the teacher network (2.5) generates from the data . The pairs are then used in order to train (i.e., learn the weights of) a student network with exactly the same architecture.
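The teacher side of this scenario can be sketched as follows. The sketch assumes a committee machine with sign activations whose output is the majority vote of the hidden neurons; the Gaussian weights and data points, the sizes and all names are hypothetical choices made for illustration.

```python
import random


def sign(v):
    return 1 if v >= 0 else -1


def committee_label(weights, point):
    """Label of one data point: majority vote over K hidden sign-neurons.

    weights[i][l] is the i-th weight of the l-th hidden neuron (n rows, K columns).
    The sign activations and the majority vote are assumptions of this sketch.
    """
    K = len(weights[0])
    hidden = [
        sign(sum(a * w_row[l] for a, w_row in zip(point, weights)))
        for l in range(K)
    ]
    return sign(sum(hidden))


rng = random.Random(0)
n, K, m = 20, 3, 30  # K is odd, so the majority vote is never tied

# Teacher weights and data points (hypothetical Gaussian choices).
teacher = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(n)]
data = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
labels = [committee_label(teacher, point) for point in data]

# Relabeling the hidden neurons (permuting the K columns) leaves every label unchanged.
permuted = [[row[2], row[0], row[1]] for row in teacher]
labels_permuted = [committee_label(permuted, point) for point in data]
```

A student with the same architecture would receive only `data` and `labels`; the invariance under permutation of the hidden neurons is one reason a matrix-valued overlap is the natural order parameter here.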
A richer example is a multi-layer version of the GLM above:
with which is factorized as . Here represent intermediate hidden variables, while is the data, and for . This scaling for the variable sizes is often assumed so that the inference problem is neither impossible nor trivial. This multi-layer GLM has been studied by various authors for the case and when the output components are scalars [57, 58, 59, 37, 60]. But one can define generalizations where these are multi-dimensional, in which case overlap matrices naturally arise.
A final example could be another combination of complex statistical models such as, e.g., the following symmetric matrix factorization problem where the low-rank representation of the tensor ( is a hidden variable) is itself generated from a generalized linear model over a more primitive signal :
Here again some factorization structure may be assumed, and .
2.3 The Nishimori identity
The following identity is a simple consequence of Bayes’ formula, and applies to optimal Bayesian inference.
Lemma 1 (Nishimori identity).
Let be a couple of random variables with joint distribution and conditional distribution . Let and let be i.i.d. samples from the conditional distribution (called “replicas”). Let us denote the expectation operator w.r.t. the conditional distribution and the expectation w.r.t. the joint distribution. Then, for any continuous bounded function we have
Proof. It is equivalent to sample the couple according to its joint distribution or to sample first according to its marginal distribution and then to sample conditionally on from the conditional distribution. Thus the two -tuples and have the same law. ∎
In practice the Nishimori identity allows one to “replace” the ground truth signal by an independent replica and vice-versa in expressions involving only other replicas and the observations, where by replicas we mean i.i.d. samples drawn according to the posterior. (Footnote 2: This identity has been abusively called “Nishimori identity” in the statistical physics literature even though it is a simple consequence of Bayes’ formula. The “true” Nishimori identity concerns models with one extra feature, namely a gauge symmetry which allows one to eliminate the input signal, so that the expectation over the signal in expressions of the form can be dropped.)
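The replacement of the ground truth by a replica can be checked numerically on a toy model. The sketch below assumes a scalar signal with a uniform prior on plus or minus one observed through an additive Gaussian channel, for which the posterior is known in closed form; the model, the signal-to-noise value and the variable names are illustrative assumptions, and the check is a Monte Carlo estimate rather than a proof.

```python
import math
import random

rng = random.Random(1)
snr = 1.0         # hypothetical signal-to-noise ratio
trials = 100_000


def posterior_sample(y):
    # For a uniform +-1 prior, P(x = +1 | y) = (1 + tanh(sqrt(snr) * y)) / 2.
    p_plus = 0.5 * (1.0 + math.tanh(math.sqrt(snr) * y))
    return 1.0 if rng.random() < p_plus else -1.0


sum_signal_replica = 0.0  # accumulates X * x1 (ground truth times one replica)
sum_two_replicas = 0.0    # accumulates x1 * x2 (two replicas)

for _ in range(trials):
    X = rng.choice((-1.0, 1.0))                    # ground truth drawn from the prior
    y = math.sqrt(snr) * X + rng.gauss(0.0, 1.0)   # observation
    x1 = posterior_sample(y)                       # replicas: i.i.d. posterior samples
    x2 = posterior_sample(y)
    sum_signal_replica += X * x1
    sum_two_replicas += x1 * x2

lhs = sum_signal_replica / trials  # estimate of the average of X * x1
rhs = sum_two_replicas / trials    # estimate of the average of x1 * x2
```

Up to Monte Carlo error the two estimates agree, as the identity predicts when the sampler uses the true posterior. If the sampler used a mismatched signal-to-noise ratio, the agreement would in general be lost.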
3 The vectorial Gaussian channel perturbation
In order to force the overlap to concentrate we need to have access to infinitesimal side-information, in addition to the observations , coming from the following vectorial Gaussian channel:
Here the signal is the same as in the base inference model. The observations , the signal components and i.i.d. Gaussian noise variables are all -dimensional vectors. The signal-to-noise (SNR) matrix controlling the signal strength
with a sequence that tends to slowly enough (the rate will be specified later), and belongs to defined as
We also denote . Matrices belonging to are symmetric strictly diagonally dominant with positive entries and thus , where is the set of symmetric positive definite matrices of dimension , see . As it possesses a unique principal square root matrix denoted . The advantage of working with the ensemble is the following. We require that the SNR matrix always belongs to so that its square root is real and unique. For a generic positive matrix in , but not necessarily in , one cannot vary its (symmetric) elements independently, as doing so might make the matrix no longer positive definite; the constraint is a “global” constraint over the matrix elements. In contrast if we can vary its elements independently (as long as it remains in ) without the possibility that falls out of .
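The point about varying the entries independently can be illustrated with a short sketch. It draws symmetric matrices with positive entries whose diagonal strictly dominates each row, and checks positive definiteness through a Cholesky factorization. The intervals used for the entries are hypothetical and chosen only to guarantee strict diagonal dominance; the Cholesky factor serves here as a positive-definiteness test and is not the principal square root mentioned above.

```python
import math
import random


def random_dominant_matrix(K, rng):
    """Symmetric matrix with positive entries and strict diagonal dominance.

    The intervals below are hypothetical: off-diagonal entries lie in (1, 2), so each
    row has off-diagonal sum below 2(K-1), while diagonal entries are at least 2K.
    """
    M = [[0.0] * K for _ in range(K)]
    for l in range(K):
        for lp in range(l + 1, K):
            M[l][lp] = M[lp][l] = rng.uniform(1.0, 2.0)
        M[l][l] = rng.uniform(2.0 * K, 2.0 * K + 1.0)
    return M


def cholesky(M):
    """Return lower-triangular L with L L^T = M; raise ValueError if M is not positive definite."""
    K = len(M)
    L = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


rng = random.Random(0)
all_positive_definite = True
for _ in range(1000):
    # Every entry is drawn independently within its interval.
    try:
        cholesky(random_dominant_matrix(3, rng))
    except ValueError:
        all_positive_definite = False

# A symmetric matrix with positive entries that is not diagonally dominant
# can fail to be positive definite: here the determinant is 1 - 4 < 0.
try:
    cholesky([[1.0, 2.0], [2.0, 1.0]])
    counterexample_rejected = False
except ValueError:
    counterexample_rejected = True
```

Every independently drawn dominant matrix passes the test, while the non-dominant example fails it, which is the "local versus global constraint" distinction made in the text.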
The perturbed inference model is then
It is called “perturbed model” because the original observation model has been slightly modified by adding new observations coming from (3.1) that are “weak” (as ). The perturbation Hamiltonian associated with the observation channel (3.1) is
using the symmetry of the SNR matrix. The total Hamiltonian is therefore the sum of the base Hamiltonian and the perturbation one. The posterior of the perturbed model, written in the standard Gibbs-Boltzmann form of statistical mechanics, is
where again the partition function is simply the normalization constant. We also define the Gibbs-bracket as the expectation operator w.r.t. the posterior of the perturbed model:
for any function s.t. its expectation exists. Thus depends on the quenched variables and the perturbation parameter .
It is crucial to notice the following. The perturbation is constructed from an inference channel (3.1) whose form is known ( is given). Therefore the perturbed model (3.3) is a proper inference problem in the optimal Bayesian inference setting. This means that in addition to the data , the statistician fully knows the data generating model, namely the likelihood and the additive Gaussian nature of the noise in the second channel in (3.3), the prior as well as all hyper-parameters , and is therefore able to write the true posterior (3.5) of the model when estimating the signal. As a consequence the Nishimori identity Lemma 1 applies to the perturbed model and its bracket .
An important quantity is the averaged free energy of the perturbed model:
where the expectation carries over the random hyper-parameters, the ground truth signal (given ) and the data generated according to (3.3), but not over which remains fixed. Later we will average quantities w.r.t. , but in this case we will explicitly write . For proving the concentration of the overlap we need the following crucial hypothesis:
Hypothesis (Free energy concentration). The free energy (3.7) of the perturbed model concentrates at the optimal rate, namely there exists a constant that may depend on everything but , and s.t.
There are some remarks to be made here. The first one is related to the scenarios where this hypothesis can be verified. For purely generic optimal inference models without any restricting assumptions on the form of the distributions it is generally very hard to prove (3.8), and the statement may even be false. The model must be “random enough” and possess some underlying factorization structure for such a hypothesis to be true. The most studied case in the literature is when the prior and the likelihood factorize, namely and the data points are i.i.d. given . The examples (2.3)–(2.5) fall in this class. Under such independence/factorization assumptions it is quite straightforward to prove that the free energy concentrates using standard techniques (see, e.g., [27, 34]). But such simple factorization properties are not always there, as illustrated by examples (2.6), (2.7). In these last two examples it is a perfectly valid question to wonder whether the overlaps of the hidden variables do concentrate (this question is crucial in the analysis of ). The hidden variables have very complex structured priors (i.e., probability distributions), with highly non-trivial factorization properties, in which case proving (3.8) requires work. See, e.g.,  where this has been done for the multi-layer GLM (2.6) with a single hidden layer (), which is already challenging. (Footnote 3: Note that proving concentration of the overlap for a hidden variable requires a perturbation of the form (3.1) over the hidden variable, not over , which in this case is just interpreted as a constitutive element of the prior of the hidden variable of interest, see  where this is done.)
The second remark is that the perturbation does not change the limit of the averaged free energy:
Lemma 2 (The base and perturbed models have the same asymptotic averaged free energy).
There exists a constant s.t. . Therefore and have the same thermodynamic limit, provided it exists.
3.1 Main results
Throughout this paper we denote a generic positive numerical constant depending only on the parameters . E.g., depends only on appearing in (3.8), the dimensionality of the variables and on the prior support . Let us denote the average over the matrix appearing in the perturbation (3.1) as
Here is the volume of which vanishes as (there are independent entries in as it is symmetric). Recall the notation for the expectation w.r.t. the posterior of the perturbed inference model (3.6).
In order to give our first result we need to introduce the overlap between two replicas
where, again, replicas are i.i.d. random variables drawn according to the posterior measure (3.5) of the perturbed model (and thus share the same quenched variables): . By a slight abuse of notation let us continue to use the same bracket notation for the expectation of functions of replicas w.r.t. the product posterior measure: .
Our main results are the following concentration theorems for the overlap in a (perturbed) model of optimal Bayesian inference. We start with the first type of fluctuations, namely the fluctuations of the overlap w.r.t. the posterior distribution, or what is called “thermal fluctuations” in statistical mechanics. Note that for controlling these fluctuations we do not need the free energy to concentrate, i.e., the hypothesis (3.8) is not required. As a consequence this result is valid even for very complex models without any factorization properties for the signal’s prior or for the likelihood (as long as they are defined in the optimal Bayesian setting). This is a consequence of the presence of the perturbation.
Theorem 3 (Thermal fluctuations of ).
Assume that the perturbed inference model is s.t. the Nishimori identity Lemma 1 holds. Let be a sequence verifying and . There exist positive constants s.t.
The next, stronger, result takes care of the additional fluctuations due to the quenched randomness, and requires the free energy concentration hypothesis:
Theorem 4 (Total fluctuations of ).
Before entering the proof let us make a very last remark. There are problems with multiple overlaps. For example one may also consider the non-symmetric version of the tensor factorization problem. In this case matrices , of respective size with scaling linearly with and with a possibly matrix-dependent prior , are to be reconstructed from a data-tensor of the form
In this case there is one overlap per matrix-signal to be inferred: . It should be clear to the reader that the whole setting described in this paper can be straightforwardly extended to include this case: one has to consider one perturbation channel of the form (3.1) per variable to be reconstructed (i.e., per matrix in the non-symmetric tensor factorization problem), each with its own independent matrix SNR: , . Then the total Hamiltonian is the sum of the base one and the perturbation Hamiltonians, and so forth.
4 Proof of concentration of the overlap matrix
For the sake of readability we now drop the index in the matrix SNR:
We use , and , for the variables dimension indices which are running from to , and , for the variables indices running from to . When we write and we always mean .
Let us start with some preliminary computations.
4.1 Preliminaries: properties of the matrix
The proof that the overlap concentrates relies on the concentration of another matrix defined as
where we used
The fluctuations of this matrix are easier to control than the ones of the overlap. This comes from the fact that is related to derivatives of the averaged free energy, which is self-averaging by hypothesis (3.8). First consider . We have
We used the Nishimori identity Lemma 1 which in this case reads . Each time we use an identity that is a consequence of Lemma 1 we write a on top of the equality (that stands for Nishimori). We integrate the Gaussian noise by parts thanks to the formula for and any bounded function . This leads to
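The Gaussian integration by parts invoked in this step is, in its scalar form for a standard Gaussian variable, the statement that the average of Z g(Z) equals the average of g'(Z). The following sketch checks this scalar form by numerical quadrature; the test function, the integration window and the helper name are hypothetical choices made for illustration.

```python
import math


def gaussian_expectation(f, half_width=10.0, steps=20_000):
    """Approximate E[f(Z)], Z ~ N(0, 1), by the trapezoidal rule on [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def g(z):
    return math.tanh(z)  # a bounded, differentiable test function


def g_prime(z):
    return 1.0 - math.tanh(z) ** 2  # derivative of tanh


lhs = gaussian_expectation(lambda z: z * g(z))  # E[Z g(Z)]
rhs = gaussian_expectation(g_prime)             # E[g'(Z)]
```

The two quadratures agree to high precision; in the derivation the same identity is applied to each Gaussian noise component with the posterior bracket playing the role of the test function.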
We used that the derivative of the Hamiltonian (3.4) is
We now exploit the symmetry of the matrices and in order to symmetrize the terms in (4.4) and then use the formula
Identity (4.4) then becomes
Similarly we obtain for the diagonal terms . Using this (4.1) becomes
Therefore the expectation of is directly related to the one of . It is thus natural to guess that if concentrates onto its mean, the overlap should concentrate too. Indeed, the following concentration identity for is key in proving Theorems 3 and 4. Note that the following proposition does not require the Nishimori identity (i.e., to be in the optimal Bayesian setting). But it will be crucial when linking the fluctuations of to those of .
Proposition 5 (Concentration of ).
Let be a sequence verifying and . There exists a positive constant s.t.
Moreover if and the free energy concentrates as in identity (3.8), then there exists a constant s.t.
Let us assume this result and show how it implies concentration of . We will prove it later in section 5.
4.2 Thermal fluctuations: proof of Theorem 3
Let us start with the control of the fluctuations due to the posterior distribution. Recall that the overlap between two replicas is , and that by a slight abuse of notation we continue to use the bracket notation for the expectation of functions of replicas w.r.t. the product posterior measure: .