I Problem statement
ML-GLM model: Consider known matrices , , , of dimension . A control parameter which will be important is the ratio of the number of rows to columns in each of these matrices,
. The components of each of these matrices are drawn independently at random, from a probability distribution
having zero mean and variance. We consider a signal with elements , sampled independently from a distribution . We then collect observations of the signal as
where the so-called activation functions , , are applied element-wise. These functions can be deterministic or stochastic and are, in general, non-linear. Assuming depends on a variable and some noise distributed with , we can define the probability distribution of the output of the function as
where is the Dirac function. is then interpreted as a noisy channel through which the variable is observed, being the observation. With the above definition of we can rewrite eq. (1) in an equivalent form introducing hidden auxiliary variables for as
The inference problem of interest in this paper is the MMSE estimation of the signal from the knowledge of the observation and the matrices , for all . This inference is done through the computation of marginals of the corresponding posterior distribution .
Using the Bayes theorem and the above definition of the hidden variables, the posterior is written as
We focus here on the “Bayes-optimal inference” where the generative model and all its parameters are known, i.e. the only unknown variables are and the , . In this case, in order to minimize the expected mean-squared error between the ground truth value of and its estimator one needs to compute averages of the marginals of the posterior distribution.
As usual, computing the marginals of the posterior (4) is in general intractable. In this paper we develop an analysis of the posterior and its marginals that is asymptotically exact, in the sense that, with high probability, the estimated marginal probabilities differ from the true ones by a factor that goes to zero in the “thermodynamic” limit for every (at fixed ratios ). Our analysis is based on an AMP-type algorithm and an analysis of the corresponding Bethe/replica free energy.
with fixed known random weights. It can also be seen as the decoder-side of an autoencoder neural network with fixed known weights (corresponding to a randomly generated hierarchy of features), the goal being to infer the vector of latent variables that is the closest to some ground-truth values used to generate the data. One of the main assets of the considered ML-GLM is that the activation functions in each of the layers can be non-linear, as is the case in deep neural networks, and our analysis handles these non-linearities. When the intermediate layers are linear, the ML-GLM can be interpreted as a model for compressed sensing with a structured measurement matrix obtained as a product of random matrices. Similarly, when the external layer has a threshold activation function and all the other activation functions are linear, the ML-GLM can be seen as a single layer perceptron that aims to classify correlated patterns obtained by products of random matrices.
Ii Algorithm and analysis
ML-AMP: We consider a probability distribution defined as the posterior distribution (4) without the integral over the auxiliary variables , and represented by the graphical model depicted in Fig. 1. We derive ML-AMP by first writing the belief propagation equations 
for this graphical model. As every factor relates to many variables and every variable is contained in many factors, we use the central limit theorem to keep track of only the means and variances of the messages. Furthermore, we express the iterations in terms of node-variables instead of messages, giving rise to the so called Onsager reaction terms. This derivation was presented for the case of single-layer generalized linear estimation in e.g.[13, 14]. The resulting multi-layer-AMP (ML-AMP) algorithm then iteratively updates estimators of the means and of the variances of the auxiliary variables and of the signal . For simplicity of the multi-level notation we denote , and similarly for the variance.
We first write the ML-AMP update equations for an intermediate layer and then specify how they change for the very first () and the very last () layers. At each layer , there are two vectors, and , associated with the factor nodes, and two vectors, and , associated with the variable nodes. Their update reads
To define functions , and to state how to use eqs. (5) to update the estimators and variances , we need to define an auxiliary function for as
With this definition, the estimators of marginal mean of the auxiliary variables , and the quantity , is computed as
where the quantities on the right hand side of (6) are evaluated at time index .
The output function in the first layer is obtained as in the standard AMP algorithm for generalized linear estimation:
In the last layer, the estimator is obtained from
Finally, the expressions and in (5) are defined as: and .
Note that the ML-AMP is closely related to AMP for generalized linear estimation . The form of ML-AMP is the one we would obtain if we treated each layer separately, while defining an effective prior depending on the variables of the next layer, and an effective output channel depending on the variables of the preceding layer:
State evolution: A very useful property of AMP, compared to other algorithms commonly used for estimating marginals of posterior distributions such as (4), is that its performance can be traced analytically in the thermodynamic limit using the so-called state evolution (SE) [16, 13]. This is a version of the density evolution  for dense graphs. It is known as the cavity method in statistical physics , where it is in general non-rigorous.
Specifically, the cavity method implies that at each layer , the overlap between the ground true and its estimate provided by the ML-AMP algorithm at iteration concentrates around an “overlap”
In the case of the Bayes-optimal inference of , the state evolution for the ML-AMP algorithm is written also in terms of a parameter , defined as the value around which the components of concentrate.
For the intermediate layers, , the SE reads
where the scalar functions and are defined in eq. (6). The quantity
is given by the second moment of the components of the vector. It can be computed from the knowledge of and for
. The expectations are taken over the joint distribution
In the above distribution,
are random variables representing respectively: the ground truth hidden variables at layer, the components of the estimator of , the components of , and the components of . Note that the probability distribution (12) is similar to the one appearing in the single-layer G-AMP algorithm of  with the exception that, in , a known measurement is replaced by the distribution of the hidden variable at the previous layer. This makes the multi-layer SE quite intuitive for readers well familiar with the SE for G-AMP.
At the first (leftmost) and last (rightmost) layers, the SE order parameters are given by the same fixed point equations as in the single-layer G-AMP setting, that is
where expectations are taken over and respectively. Thus, if , the state evolution equations of ML-AMP reduce to that of the standard G-AMP algorithm.
The state evolution are iterative equations. We initialize close to zero (or otherwise, corresponding to the initialization of the ML-AMP algorithm), then compute for all layers. We then compute for the next time step for all layers, then at the same time step and all the layers, etc. Finally to obtain the mean-squared error on from the state evolution, we evaluate .
Iii Free energy and phase transitions
We define the free energy of the posterior (4) as
where denotes the thermodynamic limit. This free energy is self-averaging, meaning that the above limit is with high probability independent of the realization of the random variable , it only depends on the parameters of the model , , , and . A computation analogous to the one that leads from belief propagation to the ML-AMP algorithm and its SE can be used to rewrite the Bethe free energy  into a single instance free energy evaluated using the fixed points of the ML-AMP algorithm. Using averaging analogous to the one of state evolution one then rewrites this into the so-called replica symmetric free energy
with , and
One can check, by computing derivatives of this free energy with respect to and , that the stationary points of are fixed points of the state evolution equations (11) and (13). Let us now use (11) and (13) to express in terms of and consider the free energy only as a function of the overlaps (10). Define
In the setting of Bayes-optimal inference, the replica symmetric free energy is equal to the free energy (14). At the same time the minimum mean squared error of the Bayes-optimal estimation of is given by
where is defined in (16) and is the second moment of . This has been recently proven for the single layer linear estimation [8, 7], and based on statistical physics arguments we conjecture it to be true also in the present problem.
We divide the region of parameters of the present problem into three phases. If the MMSE is not low enough (defined in a way depending on the application) we say that inference of the signal is information-theoretically impossible. If the MMSE is sufficiently low and the ML-AMP algorithm analyzed via state evolution matches it, then we say inference is easy. Finally, and most interestingly, if MMSE is sufficiently low, but ML-AMP achieves worse MSE we talk about a region of algorithmically hard inference. Defining an algorithmically hard region by the performance of one given algorithm might seem in general unjustified. However, in the case of single layer generalized linear estimation we know of no other polynomial algorithm that would outperform AMP in the thermodynamic limit for random . This leads us to the above definition of the hard phase also for the present multi-layer problem.
Phase diagram for sparse linear regression (left) and perceptron (center) with correlated data/patterns, defined by (18) and (19) respectively, and for the two-layer decoder (right), problem (20). Model parameters in SLR are , and , and in the two-layer decoder . The ML-AMP algorithm succeeds to reconstruct the signal with very small MSE above the dotted red line. Between the two lines reconstruction is information-theoretically possible, but ML-AMP does not achieve it. Below the full red line good reconstruction is impossible. The grey lines plotted for the perceptron is a comparison with the case of random patterns. Insets: Comparisons between the MSE predicted by the state evolution (lines) and that provided by ML-AMP on a single instance with (symbols), for the left and center figures, and for the right. Dashed lines indicate the MMSE. Red triangles in the right inset compare to the performance of normal AMP in solving the decoder problem in the first layer assuming a binary i.i.d. prior on , this works for . ML-AMP takes into account correlations in and performs better.
Iv Results for selected problems
We now focus on three examples of two-layer models and draw their phase diagrams that indicate for which layer sizes (i.e. for which values of ) reconstruction of the signal is information-theoretically possible. We also run the ML-AMP on single instances sampled from the model and illustrate that the mean-squared error it reaches after convergence matches the one predicted by the state evolution.
Sparse linear regression using correlated data: Among the simplest non-trivial cases of the ML-GLM model is a two-layer analogue of sparse linear regression (SLR) defined as
where the vector we seek to infer is sparse, , and .
When the model (18) is equivalent to a SLR with a structured data matrix , a problem previously studied by [18, 19, 20, 21, 22]. Interestingly, the state evolution equations (11) have at their fixed point , where gives the R-transform of (see e.g. ). Together with eq. (13) these are the same equations obtained by adaptive approaches in [18, 19, 20, 21, 22]. This confirms that the adaptative methods are exact in the case of matrix product, as was already noted for the Hopfield model .
In Fig. 2 (left) we draw the phase diagram as a function of and . Remarkably, in the noiseless case, the phase diagram of the problem with structured matrix is reduced to the statement that both and need to be larger than the corresponding threshold known for the usual compressed sensing problem .
For this simple mapping does not hold and, as far as we know, the ML-AMP and its SE analysis give new results. We compared numerically the performance of ML-AMP to the VAMP of . Whereas for the two algorithm agree (within errorbars), for the ML-AMP gives a distinguishably lower MSE.
with . These works focused on random patterns where the elements of are iid.
It was recently argued that learning and generalization of combinatorially structured patterns, defined as , is best studied using multilayer networks and presents major differences . Our analysis of this case in Fig. 2 (center) quantifies how many extra samples are needed so that a binary perceptron is able to correctly classify combinatorially correlated patterns with respect to random ones.
Two-layer decoder: The most exciting potential applications for the present results perhaps lie in the realm of deep neural networks where models such as ML-GLM with learned weights are used. A crucial ingredient of such neural networks are non-linear activation functions present among the layers. These activation functions can be seen as noisy channels, e.g. in the context of the decoder-side of an auto-encoder with known weights : one is interested in how well a vector of latent variables can be reconstructed when is observed at the output. In Fig. 2 (right) we draw a phase diagram for the following example
with . Our results illustrate that the ML-AMP algorithm provides better results than a layer-wise estimation with ordinary AMP, because it takes into account correctly the correlations among the hidden variables.
This work has been supported by the ERC under the European Union’s FP7 Grant Agreement 307087-SPARCS.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
R. Ranganath, D. Tran, and D. M. Blei, “Hierarchical Variational Models,”
International Conference on Machine Learning (ICML), 2016.
-  P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela, “Community structure in time-dependent, multiscale, and multiplex networks,” Science, vol. 328, no. 5980, pp. 876–878, 2010.
-  M. Mézard and A. Montanari, Information, physics, and computation. Oxford University Press, 2009.
-  T. Tanaka, “A statistical-mechanics approach to large-system analysis of cdma multiuser detectors,” IEEE Trans. Info. Theory, vol. 48, no. 11, pp. 2888–2910, 2002.
-  Y. Wu and S. Verdú, “Optimal phase transitions in compressed sensing,” IEEE Trans. Info. Theory, vol. 58, no. 10, pp. 6241–6263, 2012.
-  J. Barbier, M. Dia, N. Macris, and F. Krzakala, “The mutual information in random linear estimation,” arXiv preprint arXiv:1607.02335, 2016.
-  G. Reeves and H. D. Pfister, “The replica-symmetric prediction for compressed sensing with gaussian matrices is exact,” in 2016 IEEE International Symposium on Information Theory (ISIT), 2016, p. 665.
-  D. J. Thouless, P. W. Anderson, and R. G. Palmer, “Solution of’solvable model of a spin glass’,” Philosophical Magazine, vol. 35, p. 593, 1977.
-  M. Mézard, “The space of interactions in neural networks: Gardner’s computation with the cavity method,” J. Phys. A: Math. & Th., vol. 22, no. 12, p. 2181, 1989.
-  Y. Kabashima, “Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels,” J. Phys.: Conference Series, vol. 95, p. 012001, 2008.
-  D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl. Acad. Sci, vol. 106, p. 18914, 2009.
-  S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in 2011 IEEE International Symposium on Information Theory (ISIT), Jul. 2011, pp. 2168–2172.
-  F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, “Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices,” J. Stat Mech.: Theory and Experiment, vol. 2012, no. 08, p. P08009, 2012.
-  R. Ranganath, L. Tang, L. Charlin, and D. Blei, “Deep Exponential Families,” in AISTATS, 2015, pp. 762–771.
-  M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Transactions on Information Theory, vol. 57, no. 2, pp. 764–785, 2011.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation
and its generalizations,”
Exploring artificial intelligence in the new millennium, vol. 8, pp. 236–239, 2003.
K. Takeda, S. Uda, and Y. Kabashima, “Analysis of cdma systems that are characterized by eigenvalue spectrum,”EPL, vol. 76, p. 1193, 2006.
-  A. M. Tulino, G. Caire, S. Verdú, and S. Shamai, “Support Recovery With Sparsely Sampled Free Random Matrices,” IEEE Trans. Info. Theory, vol. 59, no. 7, pp. 4243–4271, Jul. 2013.
-  Y. Kabashima and M. Vehkaperä, “Signal recovery using expectation consistent approximation for linear observations,” in 2014 IEEE International Symposium on Information Theory (ISIT), Jun. 2014, p. 226.
-  B. Çakmak, O. Winther, and B. H. Fleury, “S-AMP: Approximate message passing for general matrix ensembles,” in 2014 IEEE Information Theory Workshop (ITW), Nov. 2014, pp. 192–196.
-  S. Rangan, P. Schniter, and A. Fletcher, “Vector Approximate Message Passing,” Oct. 2016, arXiv: 1610.03082.
-  A. M. Tulino and S. Verdú, Random Matrix Theory and Wireless Communications. Now Publishers Inc, 2004.
-  M. Mézard, “Mean-field message-passing equations in the Hopfield model and its generalizations,” arXiv:1608.01558, 2016.
-  E. Gardner and B. Derrida, “Three unfinished works on the optimal storage capacity of networks,” Journal of Physics A: Mathematical and General, vol. 22, no. 12, p. 1983, 1989.
-  A. Engel and C. Van den Broeck, Statistical mechanics of learning. Cambridge University Press, 2001.