
Phase transitions and sample complexity in Bayes-optimal matrix factorization

02/06/2014, by Yoshiyuki Kabashima, et al.

We analyse the matrix factorization problem. Given a noisy measurement of a product of two matrices, the problem is to estimate the original matrices. It arises in many applications, such as dictionary learning, blind matrix calibration, sparse principal component analysis, blind source separation, low rank matrix completion, robust principal component analysis or factor analysis. It is also important in machine learning: unsupervised representation learning can often be studied through matrix factorization. We use the tools of statistical mechanics - the cavity and replica methods - to analyze the achievability and computational tractability of these inference problems in the setting of Bayes-optimal inference, which amounts to assuming that the two matrices have random independent elements generated from some known distribution, and that this information is available to the inference algorithm. In this setting, we compute the minimal mean-squared error achievable in principle in any computational time, and the error that can be achieved by an efficient approximate message passing algorithm. The computation is based on the asymptotic state-evolution analysis of the algorithm. The performance that our analysis predicts, both in terms of the achieved mean-squared error and in terms of sample complexity, is extremely promising and motivates further development of the algorithm.


I. Introduction

We study in this paper a variety of questions which all deal with the general problem of matrix factorization. Generically, this problem is stated as follows: given a matrix obtained from noisy element-wise measurements of the product of two unknown matrices, one seeks a factorization in which the two factor matrices satisfy some specific requirements such as sparsity, low rank or non-negativity. This type of problem appears in many applications, for instance dictionary learning Olshausen et al. (1996); Olshausen & Field (1997); Kreutz-Delgado et al. (2003), sparse principal component analysis Zou et al. (2006), blind source separation Belouchrani et al. (1997), low rank matrix completion Candès & Recht (2009); Candès & Tao (2010) or robust principal component analysis Candès et al. (2011), all of which will be described below.

Theoretical limits on when matrix factorization is possible and computationally tractable are still rather poorly understood. In this work we make a step towards this understanding by predicting the limits of matrix factorization and of its algorithmic tractability when the measured matrix is created from randomly generated factor matrices and observed element-wise through a known noisy output channel. Our results are derived in the limit where the matrix dimensions grow to infinity with fixed ratios between them. We predict the existence of sharp phase transitions in this limit and provide the explicit formalism to locate them.

The methods that we use in this paper are based on a generalization of approximate message passing (AMP) Donoho et al. (2009) to the matrix factorization problem, and on its asymptotic analysis, which is known in statistical physics as the cavity method Mézard et al. (1987); Mézard & Montanari (2009) and has been called state evolution in the context of compressed sensing Donoho et al. (2009). We also use the replica method, whose results are equivalent to those of the cavity method Mézard et al. (1987). These methods are widely believed to be exact in the context of theoretical statistical physics, but most of the results that we shall obtain in the present work are not rigorously established. Our predictions thus have the status of conjectures.

This work builds upon some previous steps that we described in earlier reports Sakata & Kabashima (2013); Krzakala et al. (2013). The message passing algorithm related to our analysis was first presented in Krzakala et al. (2013) and is very closely related to the Big-AMP algorithm developed and tested in Schniter et al. (2012); Parker et al. (2013, 2014); relations and differences with Big-AMP will be mentioned in several places throughout the paper. Our main focus here, besides the detailed derivation of the algorithm, is the asymptotic analysis and the phase diagrams, which were not studied in Schniter et al. (2012); Parker et al. (2013, 2014).

Our general method provides a unifying framework for the study of computational tractability and identifiability of various matrix factorization problems. The first step towards this unification is the formulation of the problem via a graphical model (see Fig. 1) that is amenable to analysis with the present methods.

The phase diagrams that we shall derive establish, for each problem, two types of thresholds in the plane of control parameters: the threshold beyond which the problem of matrix factorization ceases to be solvable in principle, and the threshold beyond which AMP ceases to find the best solution. In most existing works the computation of phase transitions was treated separately for each of the various problems. For instance, redundant dictionaries for sparse representations and low-rankness are usually thought of as two different kinds of dimensional reduction. Interestingly, in our work these two reductions (and others) are described within a unified formalism; this is theoretically interesting in the context of recent developments Elad (2012).

I.1 Statement of the problem

In a general matrix factorization problem one measures some information about the matrix elements of the product of two unknown matrices. Let us denote this product, with elements

(1)

The element-wise measurement of this product is then specified by some known probability distribution function, so that:

(2)

The goal of matrix factorization is to estimate both matrices and from the measurements .
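In schematic notation (the letters D, X, Y, z and the dimensions M, R, N below are illustrative placeholders, not necessarily the symbols of eqs. (1)-(2)), the setup can be sketched as

z_{\mu l} = \sum_{r=1}^{R} D_{\mu r} X_{r l}, \qquad D \in \mathbb{R}^{M \times R}, \quad X \in \mathbb{R}^{R \times N},

P(Y \mid z) = \prod_{\mu=1}^{M} \prod_{l=1}^{N} P_{\rm out}\!\left( Y_{\mu l} \mid z_{\mu l} \right),

i.e. each element of the product is observed independently through the known output channel P_out.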

In this paper we will treat this problem in the framework of Bayesian inference. In particular we will assume that the two unknown matrices were both generated from known separable probability distributions

(3)
(4)

In the following we shall mostly study the case where the distributions are all identical (there is a single distribution ), and the distributions for various (as well as for various ) are also all identical. Our approach can be generalized to the case where , , depend in a known way on the indices , , and , provided that this dependence is through parameters that themselves are taken from separable probability distributions. Examples of such dependence include the blind matrix calibration or the factor analysis. On the other hand our theory does not cover the case of an arbitrary matrix for which we would have .

The posterior distribution of and given the measurements is written as

(5)

where is the normalization constant, known as the partition function in statistical physics.
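In the same schematic (and hypothetical) notation, the posterior of eq. (5) combines the element-wise likelihood with the separable priors,

P(D, X \mid Y) = \frac{1}{Z(Y)} \prod_{\mu, l} P_{\rm out}\!\left( Y_{\mu l} \,\Big|\, \sum_{r} D_{\mu r} X_{r l} \right) \prod_{\mu, r} P_D(D_{\mu r}) \prod_{r, l} P_X(X_{r l}),

with Z(Y) the partition function.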

Notice that, while the original problem of finding the two factors, given the measurements, is not well determined (because of the possibility to obtain, from a given solution, an infinity of other solutions through a transformation by any nonsingular matrix and its inverse), the fact of using well-defined priors actually lifts the degeneracy: the problem of finding the most probable factors given the measurements and the priors is well defined. In case the priors do not depend on the matrix indices, we are left with a permutational symmetry between the columns of one factor and the rows of the other. Both in the algorithm and in the asymptotic analysis this symmetry is broken and one of the solutions is chosen at random.

Typically, in most applications, the prior and output distributions will depend on a set of parameters (such as the mean, variance, sparsity, noise strength, etc.) that we usually will not write explicitly in the general case, in order to simplify the notation. The prior knowledge of these parameters is not necessarily required in our approach: these parameters can be learned via an expectation-maximization-like algorithm that we will discuss briefly in section II.5.

Note also that both sides of Eq. (1) can be multiplied by an arbitrary constant: with a corresponding change in the output function the problem is not modified. In the derivations of this paper we choose this constant in such a way that the elements of the factor matrices are of a fixed order, and the elements of their product scale consistently, meaning that the mean and the variance of each element of the product are of the same order.

I.2 Bayes-optimal inference

Our paper deals with the general case of incomplete information. This happens when the reconstruction assumes that the matrices , and were generated with some distributions , and , whereas in reality the matrices were generated using some other distributions , and . The message passing algorithm and its asymptotic evolution will be derived in this general case.

However, our most important results concern the Bayes-optimal setting, i.e. when we assume that

(6)

In this case, an estimator that minimizes the mean-squared error (MSE) with respect to the original signal, defined as

(7)

is obtained from the marginals of the posterior probability measure, i.e.,

(8)

where the quantity appearing in (8) is the marginal probability distribution of the corresponding variable. The mean squared error achieved by this optimal estimator is called the minimum mean squared error (MMSE) in this paper.

A similar result holds for the estimator of that minimizes the mean-squared error

(9)

which is obtained from the mean of with respect to the posterior probability measure . In the remainder of this article we will be using these estimators.
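For concreteness, and still in schematic notation, the estimators of eqs. (7)-(9) are the means of the single-variable marginals of the posterior,

\hat{X}_{r l} = \int x \, \nu_{r l}(x) \, \mathrm{d}x, \qquad \nu_{r l}(x) = \int \mathrm{d}D \, \mathrm{d}X \; P(D, X \mid Y) \, \delta(X_{r l} - x), \qquad \mathrm{MSE}_X = \frac{1}{R N} \sum_{r, l} \left( X^{0}_{r l} - \hat{X}_{r l} \right)^{2},

and analogously for the other factor; the value of this MSE reached by the posterior-mean estimator is the MMSE.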

I.3 Statement of the main result

The main results of this paper are explicit formulas for the MMSE achievable in the Bayes-optimal setting (as defined above) for the matrix factorization problem in the “thermodynamic limit”, i.e. when the matrix dimensions go to infinity with fixed ratios between them. When sparsity is involved we consider that a finite fraction of the matrix elements are non-zero. Similarly, when we treat matrices with low rank we consider the rank to be a finite fraction of the total dimension. We also derive the AMP-MSE, i.e. the mean square error achievable by the approximate message passing algorithm as derived in this paper.

To compute the MMSE and the AMP-MSE we need to analyze the fixed points of the following iterative equations.

(10)
(11)
(12)

where the notation stands for a Gaussian probability measure. The so-called input functions are defined using the prior distributions as

(13)
(14)

The output function is defined using the output probability as

(15)

The function is related to the output probability via

(16)

The MSE is then computed from the quantities evaluated at the fixed points of (10)-(12).

The AMP-MSE is obtained from a fixed point reached with uninformative initialization corresponding to squares of the means of the prior distributions

(17)

In case the prior distribution depends on another random variable, as e.g. in blind matrix calibration, we take an additional average with respect to that variable. If the above initialization yields vanishing values, then it is an exact fixed point of the iterative equations; this is due to the permutational symmetry between the columns of one factor matrix and the rows of the other. To obtain a nontrivial fixed point we then initialize at a very small but non-zero value, corresponding to an infinitesimal amount of prior information about the matrix elements.

To compute the MMSE we need to initialize the iterations in an informative way, i.e. infinitesimally close to the solution

(18)

If the resulting fixed point agrees with the one reached from the uninformative initialization, then the corresponding MSE is also the MMSE. In case this informative initialization leads to a different fixed point, the MMSE is given by the one of the two for which the following free entropy is larger

(19)
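The explicit forms of eqs. (10)-(19) are not written out above. As a purely schematic illustration of the selection procedure just described, the following Python snippet iterates a generic state-evolution update from a slightly perturbed uninformative initialization and from an informative one, and, when the two fixed points differ, keeps the one with the larger free entropy. The functions se_update and free_entropy are placeholders for the actual equations of the paper; the toy map used at the end (a Curie-Weiss-like magnetization map, whose fixed points are exactly the stationary points of the accompanying toy potential) has nothing to do with matrix factorization and is there only to make the snippet runnable.

import numpy as np

def run_to_fixed_point(se_update, q_init, damping=0.5, tol=1e-10, max_iter=10000):
    # Iterate q <- (1 - damping) * q + damping * se_update(q) until convergence.
    q = np.array(q_init, dtype=float)
    for _ in range(max_iter):
        q_new = (1.0 - damping) * q + damping * np.asarray(se_update(q), dtype=float)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return q

def amp_and_bayes_fixed_points(se_update, free_entropy, q_uninformative, q_informative):
    # AMP-MSE corresponds to the fixed point reached from the uninformative side,
    # the MMSE to the thermodynamically dominant one (largest free entropy).
    q_amp = run_to_fixed_point(se_update, q_uninformative)
    q_inf = run_to_fixed_point(se_update, q_informative)
    if np.allclose(q_amp, q_inf, atol=1e-6):
        return q_amp, q_amp            # the two initializations agree: AMP is asymptotically optimal
    q_mmse = q_amp if free_entropy(q_amp) > free_entropy(q_inf) else q_inf
    return q_amp, q_mmse

# Toy demonstration: q -> tanh(beta * q), whose fixed points are the stationary
# points of phi(q) = -q^2/2 + log(2 cosh(beta q)) / beta (a Curie-Weiss potential).
beta = 3.0
toy_update = lambda q: np.tanh(beta * q)
toy_free_entropy = lambda q: float(-0.5 * q[0] ** 2 + np.log(2.0 * np.cosh(beta * q[0])) / beta)
print(amp_and_bayes_fixed_points(toy_update, toy_free_entropy, [1e-6], [1.0]))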

Sections III, IV, V are devoted to the derivation of these formulas for the MMSE and the AMP-MSE. In section VI we then give a number of examples of phase diagrams for specific applications of the matrix factorization problem.

We reiterate at this point that, whereas our results are exact within the theoretical physics scope of the cavity/replica method, we do not provide their proofs and their mathematical status is that of conjectures. We devote section II.3 to explaining the reasons why this is so. The situation is comparable to the predictions-conjectures that were made for the satisfiability threshold Mézard et al. (2002), which have recently been partly established rigorously Ding et al. (2014), or the predictions-conjectures that were made for the CDMA problem Tanaka (2002); Guo & Verdú (2005), which were later proved by Montanari & Tse (2006); Bayati & Montanari (2011).

I.4 Applications of matrix factorization

Several important problems which have received a lot of attention recently are special cases of the matrix factorization problem as set above. In this paper we will analyse the following ones.

Dictionary learning

Many signals of interest are sparse in some basis; this fact is widely used in data compression and more recently in compressed sensing. A lot of work has been devoted to analyzing bases in which different data are sparse. The goal of dictionary learning is to infer a basis in which the data are sparse, based purely on a large number of samples of the data. The columns of the data matrix then represent the samples of the data, and the goal is to decompose this matrix into the product of a dictionary matrix and a sparse matrix of coefficients, up to noise.

In this paper we will analyse the following teacher-student scenario of dictionary learning. We will generate a random Gaussian matrix with iid elements of zero mean and variance , and a random Gauss-Bernoulli matrix with fraction of non-zero elements. The non-zero elements of will be iid Gaussian with mean and variance . The noise with elements is also iid Gaussian with zero mean and variance . We hence have

(20)
(21)
(22)

where is the Dirac delta function. The goal is to infer and from the knowledge of with the smallest possible number of samples .
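In the illustrative symbols used above (with ρ the fraction of non-zero elements and Δ the noise variance; the particular means and variances written below are generic placeholders rather than the values of eqs. (20)-(22)), this teacher-student ensemble reads

P_D(D_{\mu r}) = \mathcal{N}(D_{\mu r}; 0, \sigma_D^2), \qquad P_X(X_{r l}) = (1-\rho)\, \delta(X_{r l}) + \rho\, \mathcal{N}(X_{r l}; \bar{x}, \sigma_x^2),

P_{\rm out}(Y_{\mu l} \mid z_{\mu l}) = \mathcal{N}(Y_{\mu l}; z_{\mu l}, \Delta),

where δ(·) is the Dirac delta and N(·; m, v) denotes a Gaussian of mean m and variance v.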

In the noiseless case, exact reconstruction might be possible only when the observed information is larger than the information that we want to infer. This provides a simple counting bound on the number of needed samples:

(23)
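A rough form of this counting argument, in the illustrative dimensions used above (Y of size M×N observed without noise, dictionary D of size M×R, coefficients X of size R×N with a fraction ρ of non-zeros), equates the number of observed values with the number of continuous unknowns,

M N \;\gtrsim\; M R + \rho\, R N \quad \Longrightarrow \quad N \;\gtrsim\; \frac{M R}{M - \rho R} \quad (\text{for } \rho R < M),

ignoring the gauge degeneracy of the factorization and the information needed to specify the support of the sparse factor.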

Note that the above assumptions on the factors and the noise likely do not hold in any realistic problem. However, it is of theoretical interest to analyze the average performance and computational tractability of such a problem, as it gives a well defined benchmark. Moreover, we anticipate that if we develop an algorithm working well in the above case it might also work well in many real applications where the above assumptions are not satisfied, in the same spirit as the approximate message passing algorithm derived for compressed sensing with zero mean Gaussian measurement matrices Donoho et al. (2009) also works for other kinds of matrices.

Typically, we would look for an invertible basis, in which case we speak of a square dictionary. However, with the compressed-sensing application in mind, it is also very interesting to consider that the data might be under-sampled measurements of the actual signal. Hence we will be interested in the whole range of the corresponding ratio. The regime where the dictionary has more columns than rows corresponds to an overcomplete dictionary, in which case each of the measurements is a sparse linear combination of the columns (atoms) of the dictionary.

We recall that the columns of the dictionary matrix can always be permuted arbitrarily and multiplied by a sign. The same is true for the rows of the coefficient matrix: none of these operations changes the product, nor the posterior probability distribution. This is hence an intrinsic freedom of the dictionary learning problem that we have to keep in mind. Note that many works consider the dictionary to be column normalized, which lifts part of the degeneracy in some optimization formulations of the problem. In our setting the equivalent of column normalization is asymptotically determined by the properties of the prior distribution.

Blind matrix calibration

In dictionary learning one does not have any specific information about the elements of the dictionary. In some applications of compressed sensing one might have an approximate knowledge of the measurement matrix: it is often possible to use known samples of the signal in order to calibrate the matrix in a supervised manner (i.e. using known training samples of the signal). Sometimes, however, known training samples are not available, and hence the only way to calibrate is to measure a number of unknown samples and to perform their reconstruction and the calibration of the matrix at the same time; such a scenario is called blind calibration.

In blind matrix calibration, the properties of the signal and the output function are the same as in dictionary learning, eqs. (21-22). As for the matrix elements one knows a noisy estimation . In this work we will assume that this estimation was obtained from as follows

(24)

where the added perturbation is a Gaussian random variable with zero mean. The construction is such that, if the true matrix elements have zero mean and a given variance, then the same is true for the elements of the noisy estimate.

The control parameter of this construction then quantifies how well one knows the measurement matrix. It provides a way to interpolate between pure compressed sensing, where one knows the measurement matrix exactly, and the dictionary learning problem, where one has no specific information about it. Explicitly, the prior distribution of a given element of the matrix is

(25)

where the notation stands for a Gaussian distribution with the corresponding mean and variance.
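The exact forms of eqs. (24)-(25) are not reproduced above; one construction consistent with the stated properties (illustrative notation, with η ∈ [0, 1] the control parameter, W a standard Gaussian variable, and unit-variance matrix elements) is

D^{*}_{\mu r} = \sqrt{1-\eta}\; D_{\mu r} + \sqrt{\eta}\; W_{\mu r},

which preserves the mean and variance of the elements and leads to the element-wise prior

P_D(D_{\mu r} \mid D^{*}_{\mu r}) = \mathcal{N}\!\left( D_{\mu r};\; \sqrt{1-\eta}\, D^{*}_{\mu r},\; \eta \right),

interpolating between compressed sensing (η = 0, matrix known exactly) and dictionary learning (η = 1, no information about the matrix).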

Low-rank matrix completion

Another special case of matrix factorization that is often studied is low-rank matrix completion. In that case one “knows” only a small (but in our case finite in the thermodynamic limit) fraction of the elements of the measured matrix. One also knows which elements are known and which are not; the known elements form a given set whose size is a finite fraction of all elements. In this case the output function is:

(26)

The precise choice of the function on the second line is arbitrary as long as it does not depend on the matrix element being estimated. In what follows we will assume that the known elements were chosen uniformly at random.
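In illustrative notation, with S the set of observed positions and Δ the measurement-noise variance, the output channel of eq. (26) can be sketched as

P_{\rm out}(Y_{\mu l} \mid z_{\mu l}) = \mathcal{N}(Y_{\mu l}; z_{\mu l}, \Delta) \;\; \text{if } (\mu, l) \in S, \qquad P_{\rm out}(Y_{\mu l} \mid z_{\mu l}) = p_0(Y_{\mu l}) \;\; \text{otherwise},

where p_0 is an arbitrary distribution that does not depend on z_{\mu l}.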

In low-rank matrix completion the rank is small compared to the matrix dimensions, hence the corresponding aspect ratios are relatively large. Note, however, that the limit we analyse in this paper keeps these ratios fixed while the dimensions diverge, whereas in many previous works on low-rank matrix completion the rank was considered to grow much more slowly than the dimensions, so that these ratios diverge. Compared to those works the analysis here applies to “not-so-low-rank” matrix completion. The question is what fraction of the elements needs to be known in order to be able to reconstruct the two factor matrices.

For negligible measurement noise, , a simple counting bound gives that the fraction of known elements we need for reconstruction is at least

(27)
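This counting reflects the fact that a rank-R matrix of size M×N has R(M+N-R) continuous degrees of freedom (in the illustrative dimensions used above), so that the fraction π of observed entries must roughly satisfy

\pi \, M N \;\gtrsim\; R\,(M + N - R),

with the precise statement given by eq. (27).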

Again we will study the student-teacher scenario when a low-rank is generated from and having iid elements distributed according to eq. (20) and (21) with (no sparsity). To construct we keep a random fraction of elements of , and the goal is to reconstruct and from that knowledge.

Sparse PCA and blind source separation

Principal component analysis (PCA) is a well-known tool for dimensionality reduction. One usually considers the singular value decomposition (SVD) of a given matrix and keeps a given number of the largest singular values, thus minimizing the mean square error between the original matrix and its low-rank approximation. The SVD is computationally tractable. However, with additional constraints there is no general computationally tractable approach.

A variant of PCA that is relevant for a number of practical applications requires that one of the low-rank components is sparse. The goal is then to approximate a matrix by a product of a tall matrix and a wide sparse matrix. The teacher-student scenario for sparse PCA that we will analyse in this paper uses eqs. (20)-(22), and the matrix dimensions are such that the two aspect ratios are both large, but still finite, and comparable to each other. Hence it is only the region of interest for these ratios that makes this problem different from dictionary learning. Note that, in the same way as for matrix completion, many works in the literature consider a much lower rank, whereas here the rank remains a finite fraction of the dimensions. Hence we work with low rank, but not as low as in most of the existing literature.

In the zero measurement noise case, , the simple counting bound gives that the rank for which the reconstruction problem may be solvable needs to be smaller than

(28)

One important application where sparse PCA is relevant is the blind source separation problem. Given sparse (in some known basis) source signals, they are mixed in an unknown way via a mixing matrix into a set of channel measurements. In blind source separation, typically both the number of sources and the number of channels (sensors) are small compared to the number of samples. When the number of sensors exceeds the number of sources we obtain an overdetermined problem, which may be solvable even for non-sparse signals. More interesting is the underdetermined case, with the number of sensors smaller than the number of sources, which would not be solvable unless the signal is sparse (in some basis); in that case the bound (28) applies.

Robust PCA

Another variant of PCA that arises often in practice is the robust PCA, where the matrix is very close to a low rank matrix plus a sparse full rank matrix. The interpretation is that was created as low rank but then a small fraction of elements was distorted by a large additive noise. The resulting is hence not low rank.

In this paper we will analyse a case of robust PCA when and are generated from eq. (20-21) with and the output function is

(29)

where the parameters appearing in (29) are the fraction of elements that were not strongly distorted, the small measurement-noise variance on the non-distorted elements, and the large noise variance on the distorted elements. We will require the large variance to be comparable to the variance of the elements of the product, such that there is no reliable way to tell which elements were distorted by simply looking at the distribution of the measurements. The parameter regime we are interested in here is that of aspect ratios that are both relatively large and comparable to one another.
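In illustrative notation, the output channel of eq. (29) is a two-component Gaussian mixture: writing π for the fraction of non-distorted elements, Δ for the small noise variance and Δ' for the large one,

P_{\rm out}(Y_{\mu l} \mid z_{\mu l}) = \pi\, \mathcal{N}(Y_{\mu l}; z_{\mu l}, \Delta) + (1-\pi)\, \mathcal{N}(Y_{\mu l}; z_{\mu l}, \Delta'),

so that the distorted positions cannot be identified from the measurements alone when Δ' is comparable to the variance of the elements of the product.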

In robust PCA, in the zero measurement noise case, the simple counting bound gives that the fraction of non-distorted elements needed for reconstruction to remain possible satisfies the same bound as for matrix completion (27). Indeed this counting bound does not distinguish between the case of matrix completion, where the positions of the known elements are known, and the case of RPCA, where their positions are unknown.

Factor analysis

One major objective of multivariate data analysis is to infer an appropriate mechanism from which observed high-dimensional data are generated. Factor analysis (FA) is a representative methodology for this purpose. Let us suppose that a set of high-dimensional vectors is given and that its mean has been set to zero by pre-processing. Under such a setting, FA assumes that each observed vector is generated by a lower-dimensional common factor and a unique factor, combined through a matrix termed the loading matrix. The goal is to determine the loading matrix, the common factors and the unique factors from the observations only. Therefore, FA is also expressed, in matrix terms, as a factorization problem of the same form as above.

The characteristic feature of FA is to take into account the site dependence of the output function as

(30)

where the site-dependent parameter denotes the variance of the corresponding component of the unique factor. In addition, it is normally assumed that the common factor independently obeys a zero-mean multivariate Gaussian distribution, the covariance of which is set to the identity matrix in the basic case. These assumptions make it possible to express the log-likelihood of the observations in a compact manner. In a standard scheme, the loading matrix and the unique-factor variances, which parameterize the generative mechanism of the observed data, are determined by maximizing this log-likelihood function Jöreskog (1969). After obtaining these, the posterior distribution with the parameters set to their maximum likelihood estimators is used to estimate the common factor. Finally, the unique factor is determined from the generative relation. Other heuristics, which minimize certain discrepancies between the sample variance-covariance matrix and the model covariance, are also conventionally used for determining the parameters. As an alternative approach, we employ Bayesian inference for FA.
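In standard factor-analysis notation (used here for illustration; the original symbols of eq. (30) are not reproduced above), each zero-mean observed vector y_l is modelled as

y_l = \Lambda f_l + u_l, \qquad f_l \sim \mathcal{N}(0, I_R), \qquad u_l \sim \mathcal{N}(0, \Psi), \qquad \Psi = \mathrm{diag}(\psi_1, \dots, \psi_M),

so that the output channel has the site-dependent form P_out(y_{\mu l} | z_{\mu l}) = N(y_{\mu l}; z_{\mu l}, \psi_\mu) with z_l = \Lambda f_l, and each sample is marginally Gaussian with covariance \Lambda \Lambda^T + \Psi, which is what allows the compact log-likelihood maximized in the standard scheme Jöreskog (1969).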

Non-negative matrix factorization

In many applications of matrix factorization it is known that both the coefficients of the signal and the coefficients of the dictionary must be non-negative. In our setting this can be taken into account easily by imposing that the prior distributions have non-negative support. In the teacher-student scenario of this paper we can hence consider the nonzero elements of the factors to be drawn from distributions with non-negative support.

I.5 Related work and positioning of our contribution

Matrix factorization and its special cases mentioned above are well-studied problems with an extensive theoretical and algorithmic literature that we cannot fully cover here. We will hence only give examples of relevant works, and will try to be exhaustive only concerning papers that are very closely related to our work (i.e. papers using message-passing algorithms or analyzing the same phase transitions in the same scaling limits).

The dictionary learning problem was identified in the work of Olshausen et al. (1996); Olshausen & Field (1997) in the context of image representation in the visual cortex, and has been studied extensively since, see e.g. Kreutz-Delgado et al. (2003). Learning of overcomplete dictionaries for sparse representations of data has many applications, see e.g. Lewicki & Sejnowski (2000). One of the principal algorithms that is used is based on K-SVD Aharon et al. (2006). Several authors studied the identifiability of the dictionary under various (in general weaker than ours) assumptions, e.g. Michal Aharon (2006); Vainsencher et al. (2011); Jenatton et al. (2012); Spielman et al. (2013); Arora et al. (2013); Agarwal et al. (2013); Gribonval et al. (2013). An interesting view on the place of sparse and redundant representations in today's signal processing is given in Elad (2012).

The closely related problems of sparse principal component analysis and blind source separation are also explored in a number of works, see e.g. Lee et al. (1999); Zibulevsky & Pearlmutter (2001); Bofill & Zibulevsky (2001); Georgiev et al. (2005); d’Aspremont et al. (2007). A short survey on the topic with relevant references can be found in Gribonval et al. (2006).

Matrix completion is another problem that belongs to the class treated in this paper. Again, many important works were devoted to this problem, giving theoretical guarantees, algorithms and applications, see e.g. Candès & Recht (2009); Candès & Tao (2010); Candes & Plan (2010); Keshavan et al. (2010).

Another related problem is the robust principal component analysis that was also studied by many authors; algorithms and theoretical limits were analyzed in Chandrasekaran et al. (2009); Yuan & Yang (2009); Wright et al. (2009); Candès et al. (2011).

Our work differs from the mainstream of existing literature in its general positioning. Let us mention here some of the main differences with most other works.

  • Most existing works concentrate on finding theoretical guarantees and algorithms that work in the worst possible case of matrices and . In our work we analyze the typical cases when elements of and are generated at random. Arguably a worst case analysis is useful for practical applications in that it provides some guarantee. On the other hand, in some large-size applications one can be confronted in practice with a situation which is closer to the typical case that we study here. Our typical-case analysis provides results that are much tighter than those usually obtained in literature in terms of both achievability and computational tractability. For instance our results imply much smaller sample complexity, or much smaller mean-squared error for a given signal-to-noise ratio, etc.

  • Our main contribution is the asymptotic phase diagram of Bayes-optimal inference in matrix factorization. Special cases of our result cover important problems in signal processing and machine learning. In the present work we do not concentrate on the validation of the associated approximate message-passing algorithm on practical examples, nor on its comparison with existing algorithms. A contribution in this direction can be found in Krzakala et al. (2013); Parker et al. (2014).

  • A large part of existing machine-learning and signal-processing literature provides theorems. Our work is based on statistical physics methods that are conjectured to give exact results. While many results obtained with these methods on a variety of problems have been indeed proven later on, a general proof that these methods are rigorous is not known yet.

  • Many existing algorithms are based on convex relaxations of the corresponding problems. In this paper we analyze the Bayes-optimal setting. Not surprisingly, this setting gives a much better performance than convex relaxation. The reason why it is less studied is that it is often considered as hopelessly complicated from the algorithmic perspective; however, some recent results using message-passing algorithms, which stand at the core of our analysis, have shown very good performance in the Bayes-optimal setting. Besides, the Bayes-optimal setting offers an optimal benchmark for testing algorithmic methods.

  • When treating low-rank matrices we consider the rank to be a finite fraction of the total dimension, whereas most of the existing literature considers the rank to be a finite constant. The AMP algorithm derived in this paper can of course be used as a heuristic also in the constant-rank case, but our results about the MMSE are exact only in the limit where the rank is a finite fraction of the total dimension. Also note that for the special case of rank one the AMP algorithm was derived and analysed rigorously in Rangan & Fletcher (2012); Deshpande & Montanari (2014).

  • When treating sparsity we consider the number of non-zeros to be a finite fraction of the total dimension, whereas the existing literature often considers a constant number of non-zeros. Again, the AMP algorithm derived in this paper can of course be used as a heuristic also in the constant-number-of-non-zeros case, but our results about the MMSE are exact only in the limit where the number of non-zeros is a finite fraction of the total dimension.

The paper is organized as follows. First, in Sec. III we give a detailed derivation of the approximate message passing algorithm for the matrix factorization problem. This algorithm is equivalent to maximization of the Bethe free entropy, whose expression is discussed in Sec. IV. These first two sections thus state our algorithmic approach to matrix factorization. The asymptotic performance of the AMP algorithm and the Bayes-optimal MMSE are analyzed in Sec. V using two techniques: i) the state evolution of the AMP algorithm and ii) the replica method. As usual, the two methods are found to agree and to give the same predictions. We then use these results in order to study some examples of matrix factorization problems in Sec. VI. In particular we derive the phase diagram, the MMSE and the sample complexity of dictionary learning, blind matrix calibration, sparse PCA, blind source separation, low rank matrix completion, robust PCA and factor analysis. Our results are summarized and discussed in the conclusion in Sec. VII.

II. Elements of the background from statistical physics

II.1 The Nishimori identities

There are important identities that hold for Bayes-optimal inference and that simplify many of the calculations that follow. In the physics of disordered systems these identities are known as the Nishimori conditions Opper & Haussler (1991); Iba (1999); Nishimori (2001). Basically, the Nishimori identities follow from the fact that the true signal is an equilibrium configuration with respect to the Boltzmann measure (5). Hence many properties of the true signal can be computed by using averages over the distribution (5) even if one does not know the true signal precisely.

In order to derive the Nishimori identities, we need to define three types of averages: the thermodynamic average, the double thermodynamic average, and the disorder average.

  • Consider a function depending on a “trial” configuration and . We define the “thermodynamic average” of given as:

    (31)

    where is given by Eq. (5). The thermodynamic average is a function of .

  • Similarly, for a function that depends on two trial configurations , and , , we define the “double thermodynamic average” of given as:

    (32)

    this is again a function of .

  • For a function that depends on the measurement and on the true signal , we define the “disorder average” as

    (33)

    Note that if the quantity depends only on , then we have

    (34)

    This is simply because the partition function is .

Let us now derive the Nishimori identities. We consider a function that depends on the trial configuration and on the true signal . Its thermodynamic average is a function of the measurement and the true signal . The disorder average of this quantity can be written as

(35)

In this last expression, renaming to and to , we see that the average over is nothing but the double thermodynamic average:

(36)

where in the last step we have used the form (34) of the disorder average.

The identity

(37)

is the general form of the Nishimori identity. Written in this way, it holds for many inference problems where the model for signal generation is known.

It is well known (although it may be sometimes hard to prove) that in the thermodynamic limit (as introduced in the first paragraph of section I.3) the “self-averaging” property holds for many quantities. This means that the distribution of concentrates with respect to the realization of disorder , i.e. for large system sizes the quantity converges with probability one to its average over disorder, . This self-averaging property makes the Nishimori identity very useful in practice.

To give one particularly useful example, let us define and . The Nishimori condition states that . Assuming that the quantities and are self-averaging, we obtain in the thermodynamic limit, for almost all : . Explicitly, this gives:

(38)

This is a remarkable identity concerning the mean of with the posterior distribution . The left-hand side measures the overlap between this mean and the sought true value . The right-hand side measures the self overlap of the mean, which can be estimated without any knowledge of the true value , by generating two independent samples from .
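As a self-contained numerical illustration of identity (38) in the scalar case, the following Python snippet uses an assumed Gauss-Bernoulli prior and an additive Gaussian channel (chosen here only for the example, with illustrative parameters rho and delta) and checks that the overlap of the posterior mean with the true value coincides with its self-overlap, up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(1)
rho, delta = 0.3, 0.5          # sparsity of the prior and variance of the Gaussian channel
n = 2_000_000                  # number of Monte Carlo samples

def gauss(y, var):
    return np.exp(-y ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def posterior_mean(y):
    # Bayes-optimal denoiser for the prior (1-rho) delta(x) + rho N(x; 0, 1)
    # observed through y = x + sqrt(delta) * noise.
    w_zero = (1.0 - rho) * gauss(y, delta)
    w_gauss = rho * gauss(y, 1.0 + delta)
    return w_gauss / (w_zero + w_gauss) * y / (1.0 + delta)

# Generate the "true signal" and its noisy measurement from the matching prior and channel.
x0 = rng.standard_normal(n) * (rng.random(n) < rho)
y = x0 + np.sqrt(delta) * rng.standard_normal(n)
xhat = posterior_mean(y)

print("overlap with the truth  E[<x> x0] =", np.mean(xhat * x0))
print("self-overlap            E[<x>^2]  =", np.mean(xhat ** 2))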

By symmetry, all these examples apply also to averages and functions of the matrix :

(39)

Validity of the relation (38) also follows straightforwardly from the self-averaging property and an elementary relation for conditional expectations .

Another example of a Nishimori identity used in our derivation is eq. (106). Note that this one is crucial for maintaining certain variance-related variables strictly positive, as explained in section III.3. Without eq. (106) the AMP algorithm may exhibit pathological numerical behavior, in which quantities that are by definition strictly positive become negative. A deeper understanding of this problem in cases where the Nishimori conditions cannot be imposed is under investigation.

The Nishimori conditions also greatly simplify the state evolution equations of section V.1.1. In general there are nine state evolution equations; however, after imposing the Nishimori conditions, eqs. (186)-(188), we are left with only three.

II.2 Easy/hard and possible/impossible phase transitions

Fixed points of equations (10)-(12) allow us to evaluate the MMSE of the two factor matrices. We investigate two fixed points: the one that is reached from the “uninformative” initialization (17), and the fixed point that is reached with the informative (planted) initialization (18). If these two initializations lead to different fixed points, it is the one with the largest value of the Bethe free entropy (19) of section I.3 that corresponds to the Bayes-optimal MMSE. The reason for this is that the Bethe free entropy is the exponent of the normalization constant of the posterior probability distribution: a larger Bethe free entropy is hence related to a fixed point that corresponds to an exponentially larger probability. Furthermore, in cases where the uninformative initialization does not lead to this Bayes-optimal MMSE, we anticipate that the optimal solution is hard to reach for a large class of algorithms.

Depending on the application in question and on the values of the control parameters, we can sometimes identify a phase transition, i.e. a sharp change in the behavior of the MMSE. As in statistical physics, it is useful to distinguish two kinds of phase transitions:

  • Second order phase transition: In this case there are two regions of parameters. In one, the recovery performance is poor (positive MMSE); in the other one the recovery is perfect (zero MMSE). This situation can arise only at zero noise; with positive noise there is a smooth “crossover” and the transition between the phases of good and poor recovery is not sharply defined. Interestingly, as we will see, in all examples where we observed the second order phase transition, its location coincides with the simple counting bounds discussed in Section I.4. In the phase with poor MMSE, there is simply not enough information in the data to recover the original signal (independently of the recovery algorithm or its computational complexity). Experience from statistical physics tells us that in problems where such a second order phase transition happens, and more generally in cases where the state evolution (10)-(12) has a unique fixed point, there is no fundamental barrier that would prevent good algorithms from attaining the MMSE. In particular, our analysis suggests that in the limit of large system sizes the AMP algorithm derived in this paper should be able to achieve this Bayes-optimal MMSE.

  • First order phase transition: A more subtle phenomenology can be observed in the low-noise regime of dictionary learning, sparse PCA and blind matrix calibration. In some region of parameters, which we call the “spinodal” region, the informative (planted) and uninformative initializations do not lead to the same fixed point of the state evolution equations. The spinodal region itself may divide into two parts: the one where the uninformative fixed point has a larger free entropy, and the “solvable-but-hard” phase where the informative fixed point has a larger free entropy. The boundary between these two parts is the first order phase transition point. We anticipate that in the solvable-but-hard phase the Bayes-optimal MMSE is not achievable by the AMP algorithm, nor by a large class of other known algorithms. The first order phase transition is associated with a discontinuity in the MMSE. The MSE reached from the uninformative initialization, which we denote AMP-MSE, is also discontinuous.

In the case of the first order phase transition it will hence be useful to distinguish in our notation between the minimal achievable MSE, which we denote MMSE, and the MSE achievable by AMP-like algorithms, which we denote AMP-MSE. When MMSE = AMP-MSE in the large-size limit we say that AMP is asymptotically optimal. The region where, in the large-size limit, MMSE < AMP-MSE is called the spinodal regime/region.
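The state evolution (10)-(12) for the extensive-rank problem is not written out above, so the following Python sketch illustrates the spinodal phenomenology on a much simpler stand-in: the well-known Bayes-optimal state evolution of the rank-one spiked matrix model with a sparse Rademacher prior, the special case for which rigorous results are quoted at the end of section I.5. The two initializations play the roles of (17) and (18); the sparsity and signal-to-noise values are illustrative. In the printed scan, the region where the two columns differ is the spinodal region; deciding which of the two fixed points gives the MMSE would additionally require the free entropy (19), which is not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
_z = rng.standard_normal(100_000)
Z = np.concatenate([_z, -_z])        # symmetrized Gaussian samples for the channel-noise average

def se_update(m, lam, rho):
    # One step of Bayes-optimal state evolution for the rank-one spiked matrix model
    # with prior P(x) = (1-rho) delta_0 + (rho/2)(delta_{+1} + delta_{-1}) and SNR lam.
    # m is the overlap E[x0 <x>]; the MSE per component is rho - m.
    gamma = lam * m                  # effective SNR of the equivalent scalar channel
    if gamma <= 0.0:
        return 0.0
    y = gamma + np.sqrt(gamma) * Z   # channel output conditioned on x0 = +1
    num = rho * np.exp(-gamma / 2.0) * np.sinh(y)
    den = (1.0 - rho) + rho * np.exp(-gamma / 2.0) * np.cosh(y)
    return rho * float(np.mean(num / den))

def fixed_point(m0, lam, rho, n_iter=2000, tol=1e-9):
    m = m0
    for _ in range(n_iter):
        m_new = se_update(m, lam, rho)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

rho = 0.02
print("  lam   m (uninformative init)   m (informative init)")
for lam in [200, 500, 1000, 1500, 2000, 2400, 2800]:
    m_uninf = fixed_point(1e-4, lam, rho)   # role of the perturbed uninformative initialization (17)
    m_inf = fixed_point(rho, lam, rho)      # role of the informative (planted) initialization (18)
    print(f"{lam:5d}   {m_uninf:22.6f}   {m_inf:20.6f}")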

II.3 Why are our results not rigorous? Why do we conjecture that they are exact?

In this section we aim at summarizing the assumptions that we make, but do not prove, in the derivation of our results presented in sections III, IV, V. In section VI we then give the MMSE and the AMP-MSE as obtained from the numerical evaluation of eqs. (10)-(12).

II.3.1 The statistical-physics strategy for solving the problem

Let us first state the general strategy from a physics point of view. We use two complementary approaches, the cavity method and the replica method.

Our main tool is the cavity method; it presents two advantages: 1) during the derivation we obtain the AMP algorithm as a side-product, and 2) based on the experience of the last three decades, it is likely that a rigorous proof of our conjectures will follow very closely this path. In the cavity method we first assume that there is a fixed point of the belief propagation equations that correctly describes the posterior probability distribution, and that the BP equations (initialized randomly, or, in the case where a first order phase transition is present, initialized close to the planted configuration) converge to it. Then we use the large-size limit in order to describe the evolution of the iteration on average. While doing so, we perform a statistical analysis of the BP equations in which we keep only the leading terms in the large-size limit; this is called the cavity method Mézard et al. (1987). The result of this analysis is the set of state evolution equations, which should be asymptotically exact and no longer depend on the size of the system. There is one known way by which the assumptions made in this derivation can fail; it is called replica symmetry breaking (RSB). Fortunately, in Bayes-optimal inference RSB is excluded, as we explain below.

Our second analysis uses the replica method. In the replica approach one computes the average of an integer power of the normalization of the posterior probability distribution. Then one takes the limit in which this power goes to zero in order to obtain the average of its logarithm. In doing so, one needs to evaluate a saddle point of a certain potential and one assumes that this saddle point is restricted to a certain class that is called “replica symmetric”, eq. (222). Again this assumption is justified in Bayes-optimal inference, because we know that there is no RSB in this case.

Our two analyses, using the cavity method and the replica method, lead to the same set of closed equations for the MMSE of the problem. This, and the fact that apart from the above assumptions we did not make any additional approximation in the derivations, makes us claim that the results described in this paper are exact.

II.3.2 What are the assumptions?

Let us start with the cavity method, which is in general much more amenable to a rigorous analysis. Let us hence describe the assumptions made in section III.

  • The main assumption we make is that, to leading order in the system size, the true marginal probabilities can be obtained from a fixed point of the belief propagation equations (40)-(43) as written for the factor graph in Fig. 1. For this to be possible the incoming messages on the right hand side of eqs. (40)-(43) must be conditionally independent (in the thermodynamic limit, as defined in section I.3), in which case we can write the joint probabilities as products over the messages. If the factor graph were a tree this assumption would be obviously correct. On factor graphs that are locally tree-like these assumptions can be justified (often also rigorously) by showing that the correlations decay fast enough (see for instance Mézard & Montanari (2009)). The factor graph in Fig. 1 is far from a tree. However, from the point of view of conditional independence between messages, this factor graph resembles the one of compressed sensing, for which the corresponding results were proven rigorously Bayati & Montanari (2011); Donoho et al. (2012). It is hence reasonable to expect that matrix factorization belongs to the same class of problems, where BP is asymptotically exact in the above sense.

  • We assume that the iterations of equations (40)-(43) converge for very large systems () to this fixed point that describes the true marginal probabilities, provided that the iteration is started from the right initial condition (in the presence of a first order phase transition we need to consider initialization of the iteration in the planted configuration in order to reach the right fixed point).

Given these assumptions the rest of section III is justified by the fact that in the derivation we only neglect terms that do not contribute to the final MSE in the thermodynamic limit.

The main assumptions made in the replica calculation are that “self-averaging” applies (this is a statement on the concentration of the measure, which basically assumes that the averaged properties are the same as the properties of a generic single instance of the problem) and that we can exchange the thermodynamic limit with the limit of a vanishing number of replicas. On top of these, it relies on the replica symmetric calculation, which is justified here because we are interested in evaluating the Bayes-optimal MMSE. Unfortunately, in the mathematical literature there is so far very little progress in proving these basic assumptions of the replica method, even for problems that are much easier than matrix factorization. It is worth remembering that the original motivation for introducing the cavity method Mézard et al. (1986) was precisely to provide an alternative to the replica computations, one that could be stated in clear mathematical terms.

In this paper we follow the physics strategy and we hence provide a conjecture for evaluating the MMSE in matrix factorization. We do anticipate that further work using the path of the cavity approach will eventually provide a proof of our conjecture. Let us mention that rigorous results exist for closely related problems, namely the problem of compressed sensing Bayati & Montanari (2011); Donoho et al. (2012), and also rank-one matrix factorization Rangan & Fletcher (2012); Deshpande & Montanari (2014). None of these apply directly to our case where the rank is extensive and where the matrix is not known. A non-trivial generalization of these works will be needed in order to prove our results.

II.4 Absence of replica symmetry breaking in Bayes-optimal inference

The study of Bayes-optimal inference, i.e. the case when the prior probability distribution matches the true distribution with which the data have been generated, is fundamentally simpler than the general case. The reason for this simplicity is that, in the Bayes-optimal case, a major complication referred to as static replica symmetry breaking (RSB) in the statistical physics literature does not appear Nishimori (2001). RSB is a source of computational complications in many optimization and constraint satisfaction problems such as random K-satisfiability Mézard et al. (2002) or models of spin glasses Mézard et al. (1987).

One common way to define RSB is via the overlap distribution Nishimori (2001); Mézard et al. (1987); Mézard & Montanari (2009). We define an overlap between two configurations and as