I Problem statement
ML-GLM model: Consider L known matrices W^{(1)}, …, W^{(L)}, where W^{(l)} has n_l rows and n_{l−1} columns. A control parameter which will be important is the ratio of the number of rows to columns in each of these matrices, α_l ≡ n_l / n_{l−1}. The components of each of these matrices are drawn independently at random from a probability distribution having zero mean and variance 1/n_{l−1}. We consider a signal x ∈ R^{n_0} with elements x_i, sampled independently from a distribution P_0(x_i). We then collect n_L observations of the signal as
(1) y = f^{(L)}( W^{(L)} f^{(L−1)}( ⋯ f^{(1)}( W^{(1)} x ) ⋯ ) ),
where the so-called activation functions f^{(l)}, l = 1, …, L, are applied elementwise. These functions can be deterministic or stochastic and are, in general, nonlinear. Assuming f^{(l)} depends on a variable z and some noise ξ distributed with P_ξ(ξ), we can define the probability distribution of the output h of the function as
(2) P^{(l)}(h | z) = ∫ dξ P_ξ(ξ) δ( h − f^{(l)}(z; ξ) ),
where δ is the Dirac delta function. P^{(l)}(h | z) is then interpreted as a noisy channel through which the variable z is observed, h being the observation. With the above definition of P^{(l)} we can rewrite eq. (1) in an equivalent form introducing hidden auxiliary variables h^{(l)} for l = 1, …, L − 1 as
(3) h^{(1)} ∼ P^{(1)}( · | W^{(1)} x ), h^{(l)} ∼ P^{(l)}( · | W^{(l)} h^{(l−1)} ) for 2 ≤ l ≤ L − 1, y ∼ P^{(L)}( · | W^{(L)} h^{(L−1)} ).
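As a concrete illustration, the generative process (3) can be sampled layer by layer. The sketch below uses a two-layer toy instance with Gaussian weights of variance 1/(number of columns), a sparse Gaussian signal, a noisy linear first channel and a deterministic sign second channel; all sizes, priors, and channels are our illustrative choices, not ones prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ml_glm(n0=200, n1=150, n2=100, rho=0.2, noise_std=0.05):
    """Sample one toy instance of a two-layer ML-GLM."""
    # Sparse (Gauss-Bernoulli) signal: fraction rho of nonzero Gaussian entries.
    x = rng.standard_normal(n0) * (rng.random(n0) < rho)
    # Random weight matrices, zero-mean entries of variance 1/(n columns).
    W1 = rng.standard_normal((n1, n0)) / np.sqrt(n0)
    W2 = rng.standard_normal((n2, n1)) / np.sqrt(n1)
    # Layer 1: noisy linear channel, a draw from P(. | W1 x).
    h = W1 @ x + noise_std * rng.standard_normal(n1)
    # Layer 2: deterministic sign channel, a draw from P(. | W2 h).
    y = np.sign(W2 @ h)
    return x, (W1, W2), h, y

x, (W1, W2), h, y = sample_ml_glm()
```

The inference task of the paper is the reverse direction: given y, W1 and W2 (and the channel definitions), estimate x.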
The inference problem of interest in this paper is the MMSE estimation of the signal x from the knowledge of the observation y and the matrices W^{(l)}, for all l = 1, …, L. This inference is done through the computation of marginals of the corresponding posterior distribution P( x | y, {W^{(l)}} ).
Using the Bayes theorem and the above definition of the hidden variables h^{(l)}, the posterior is written as
(4) P( x | y, {W^{(l)}} ) = (1/Z) P_0(x) ∫ ∏_{l=1}^{L−1} dh^{(l)} P^{(1)}( h^{(1)} | W^{(1)} x ) ∏_{l=2}^{L−1} P^{(l)}( h^{(l)} | W^{(l)} h^{(l−1)} ) P^{(L)}( y | W^{(L)} h^{(L−1)} ).
We focus here on “Bayes-optimal inference” where the generative model and all its parameters are known, i.e. the only unknown variables are x and the h^{(l)}, l = 1, …, L − 1. In this case, in order to minimize the expected mean-squared error between the ground-truth value of x and its estimator one needs to compute averages of the marginals of the posterior distribution.
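The statement that posterior-marginal means minimize the expected mean-squared error follows from a one-line decomposition; writing x̂_i for any estimator of the i-th component,

```latex
\mathbb{E}\big[(x_i-\hat{x}_i)^2 \,\big|\, y\big]
  \;=\; \operatorname{Var}(x_i \mid y) \;+\; \big(\hat{x}_i - \mathbb{E}[x_i \mid y]\big)^2 ,
```

so the optimum is x̂_i = E[x_i | y], the mean of the i-th marginal of the posterior.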
As usual, computing the marginals of the posterior (4) is in general intractable. In this paper we develop an analysis of the posterior and its marginals that is asymptotically exact, in the sense that, with high probability, the estimated marginal probabilities differ from the true ones by a factor that goes to zero in the “thermodynamic” limit n_l → ∞ for every l (at fixed ratios α_l). Our analysis is based on an AMP-type algorithm and an analysis of the corresponding Bethe/replica free energy.
Context: The ML-GLM model considered in this paper has a range of applications and is related to other models considered in the literature. It is similar to the deep exponential families of [2, 15] with fixed known random weights. It can also be seen as the decoder side of an autoencoder neural network with fixed known weights (corresponding to a randomly generated hierarchy of features), the goal being to infer the vector of latent variables that is the closest to some ground-truth values used to generate the data. One of the main assets of the considered ML-GLM is that the activation functions in each of the layers can be nonlinear, as is the case in deep neural networks, and our analysis handles these nonlinearities. When the intermediate layers are linear, the ML-GLM can be interpreted as a model for compressed sensing with a structured measurement matrix obtained as a product of random matrices. Similarly, when the external layer has a threshold activation function and all the other activation functions are linear, the ML-GLM can be seen as a single-layer perceptron that aims to classify correlated patterns obtained by products of random matrices.
II Algorithm and analysis
ML-AMP: We consider a probability distribution defined as the posterior distribution (4) without the integral over the auxiliary variables h^{(l)}, represented by the graphical model depicted in Fig. 1. We derive ML-AMP by first writing the belief propagation equations [17] for this graphical model. As every factor relates to many variables and every variable is contained in many factors, we use the central limit theorem to keep track of only the means and variances of the messages. Furthermore, we express the iterations in terms of node variables instead of messages, giving rise to the so-called Onsager reaction terms. This derivation was presented for the case of single-layer generalized linear estimation in e.g. [13, 14]. The resulting multi-layer AMP (ML-AMP) algorithm then iteratively updates estimators of the means and of the variances of the auxiliary variables h^{(l)} and of the signal x. For simplicity of the multi-level notation we denote the signal as the zeroth hidden variable, x ≡ h^{(0)}, and similarly for the variance.
We first write the ML-AMP update equations for an intermediate layer l and then specify how they change for the very first (l = 1) and the very last (l = L) layers. At each layer l, there are two vectors associated with the factor nodes, and two vectors associated with the variable nodes. Their update reads
(5)  
To define the functions entering eqs. (5), and to state how to use eqs. (5) to update the estimators and variances, we need to define a layer-wise auxiliary function as
With this definition, the estimators of the marginal means of the auxiliary variables h^{(l)}, and the corresponding variances, are computed as
(6)  
where the quantities on the right-hand side of (6) are evaluated at the corresponding time index of the iteration.
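As an illustration of the kind of input-side update in (6): for a Gauss–Bernoulli variable observed through an effective Gaussian channel, the posterior mean and variance have a closed form. This is a standard AMP-type scalar denoiser; the prior choice and the function name below are our illustrative choices, not notation fixed by this paper.

```python
import numpy as np

def gauss_bernoulli_denoiser(R, Sigma, rho=0.2):
    """Posterior mean/variance of x ~ (1-rho)*delta_0 + rho*N(0,1),
    given pseudo-data R = x + N(0, Sigma)."""
    # Gaussian evidences of the two mixture components.
    ev0 = np.exp(-R**2 / (2 * Sigma)) / np.sqrt(2 * np.pi * Sigma)
    ev1 = np.exp(-R**2 / (2 * (Sigma + 1))) / np.sqrt(2 * np.pi * (Sigma + 1))
    pi = rho * ev1 / (rho * ev1 + (1 - rho) * ev0)   # P(x != 0 | R)
    m = R / (Sigma + 1)       # mean of the nonzero component's posterior
    v = Sigma / (Sigma + 1)   # variance of the nonzero component's posterior
    f_a = pi * m                          # posterior mean
    f_v = pi * (v + m**2) - f_a**2        # posterior variance
    return f_a, f_v
```

The posterior mean shrinks weak pseudo-data toward zero, which is what makes such denoisers effective for sparse signals.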
The output function in the first layer is obtained as in the standard AMP algorithm for generalized linear estimation:
(7) 
(8) 
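For orientation, a hedged special case: when the output channel is additive white Gaussian noise, the standard GAMP output function has the simple closed form below. The function name and the noise parameter `delta` are our illustrative choices; this is the textbook AWGN case, not the general expression of the paper.

```python
def g_out_awgn(y, omega, V, delta=0.01):
    """Scalar GAMP output function for y = z + N(0, delta),
    where z has cavity mean omega and cavity variance V."""
    g = (y - omega) / (delta + V)    # score of the Gaussian likelihood
    dg = -1.0 / (delta + V)          # its derivative w.r.t. omega
    return g, dg
```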
In the last layer, the estimator of the signal x is obtained from
(9)  
Finally, the two remaining expressions in (5) are defined as the corresponding partial derivatives of the functions above.
Note that ML-AMP is closely related to AMP for generalized linear estimation [13]. The form of ML-AMP is the one we would obtain if we treated each layer separately, while defining an effective prior depending on the variables of the next layer, and an effective output channel depending on the variables of the preceding layer.
State evolution: A very useful property of AMP, compared to other algorithms commonly used for estimating marginals of posterior distributions such as (4), is that its performance can be tracked analytically in the thermodynamic limit using the so-called state evolution (SE) [16, 13]. This is a version of the density evolution [4] for dense graphs. It is known as the cavity method in statistical physics [4], where it is in general non-rigorous.
Specifically, the cavity method implies that at each layer l, the overlap between the ground truth and its estimate provided by the ML-AMP algorithm at iteration t concentrates around an “overlap”
(10) 
In the case of the Bayes-optimal inference of x, the state evolution for the ML-AMP algorithm is also written in terms of a conjugate parameter at each layer, defined as the value around which the components of the corresponding conjugate variables concentrate.
For the intermediate layers, 1 < l < L, the SE reads
(11)  
where the scalar functions are those defined in eq. (6). The remaining quantity is given by the second moment of the components of the corresponding vector, and can be computed from the knowledge of the order parameters at the previous layers. The expectations are taken over the joint distribution
(12)  
In the above distribution, the random variables represent respectively: the ground-truth hidden variables at layer l, the components of their estimator, and the components of the two associated auxiliary vectors. Note that the probability distribution (12) is similar to the one appearing in the single-layer GAMP algorithm of [13] with the exception that, in (12), a known measurement is replaced by the distribution of the hidden variable at the previous layer. This makes the multi-layer SE quite intuitive for readers well familiar with the SE for GAMP.
At the first (leftmost) and last (rightmost) layers, the SE order parameters are given by the same fixed-point equations as in the single-layer GAMP setting, that is
(13)  
where expectations are taken over the prior and the output channel, respectively. Thus, if L = 1, the state evolution equations of ML-AMP reduce to those of the standard GAMP algorithm.
The state evolution equations are iterated as follows. We initialize the overlaps close to zero (or otherwise, corresponding to the initialization of the ML-AMP algorithm), then compute the conjugate parameters for all layers. We then compute the overlaps for the next time step for all layers, then the conjugate parameters at the same time step and all the layers, etc. Finally, to obtain the mean-squared error on x from the state evolution, we evaluate the difference between the second moment of the signal and its overlap, as in (17).
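To make the iteration scheme concrete, here is a hedged toy version of a single-layer state evolution (Gaussian prior, linear AWGN channel, the standard GAMP setting), where both the denoiser MSE and the effective noise have closed forms. It only illustrates the fixed-point iteration; the parameter names and the single-layer simplification are ours, not the multi-layer equations (11) of the paper.

```python
def state_evolution_awgn(alpha=2.0, delta=0.01, iters=50):
    """Iterate scalar SE for y = W x + noise of variance delta,
    with x ~ N(0,1) and W having iid entries of variance 1/n."""
    E = 1.0  # initial MSE: the prior variance (uninformative start)
    for _ in range(iters):
        Sigma = (delta + E) / alpha   # effective Gaussian noise on x
        E = Sigma / (Sigma + 1.0)     # MMSE of scalar Gaussian denoising
    return E

mse = state_evolution_awgn()
```

As expected, increasing the measurement ratio `alpha` drives the fixed-point MSE down.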
III Free energy and phase transitions
We define the free energy of the posterior (4) as
(14) 
where the limit denotes the thermodynamic limit. This free energy is self-averaging, meaning that the above limit is with high probability independent of the realization of the random variables y and W^{(l)}; it only depends on the parameters of the model: the distributions P_0 and P^{(l)}, the ratios α_l, and the number of layers L. A computation analogous to the one that leads from belief propagation to the ML-AMP algorithm and its SE can be used to rewrite the Bethe free energy [17] into a single-instance free energy evaluated using the fixed points of the ML-AMP algorithm. Using averaging analogous to that of the state evolution one then rewrites this into the so-called replica-symmetric free energy
(15)  
with the layer-wise terms defined analogously to the state evolution quantities above.
One can check, by computing derivatives of this free energy with respect to the overlaps and their conjugates, that the stationary points of the replica-symmetric free energy are fixed points of the state evolution equations (11) and (13). Let us now use (11) and (13) to express the conjugate parameters in terms of the overlaps, and consider the free energy only as a function of the overlaps (10). Define
(16)  
In the setting of Bayes-optimal inference, the replica-symmetric free energy is equal to the free energy (14). At the same time the minimum mean-squared error of the Bayes-optimal estimation of x is given by
(17) 
where the extremizing overlap is defined in (16) and the second moment is that of the prior P_0. This has been recently proven for single-layer linear estimation [8, 7], and based on statistical-physics arguments we conjecture it to be true also in the present problem.
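The form of (17), the second moment minus the extremizing overlap, can be motivated by the Nishimori identity of Bayes-optimal inference, under which E[x x̂] = E[x̂²] = m; in one line,

```latex
\mathbb{E}\big[(x-\hat{x})^2\big]
 \;=\; \mathbb{E}[x^2] - 2\,\mathbb{E}[x\hat{x}] + \mathbb{E}[\hat{x}^2]
 \;=\; \mathbb{E}[x^2] - m .
```

This is a standard identity in this line of work, stated here as motivation rather than as a derivation of (17).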
We divide the region of parameters of the present problem into three phases. If the MMSE is not low enough (defined in a way depending on the application) we say that inference of the signal is information-theoretically impossible. If the MMSE is sufficiently low and the ML-AMP algorithm analyzed via state evolution matches it, then we say inference is easy. Finally, and most interestingly, if the MMSE is sufficiently low, but ML-AMP achieves a worse MSE, we talk about a region of algorithmically hard inference. Defining an algorithmically hard region by the performance of one given algorithm might seem in general unjustified. However, in the case of single-layer generalized linear estimation we know of no other polynomial-time algorithm that would outperform AMP in the thermodynamic limit for random matrices. This leads us to the above definition of the hard phase also for the present multi-layer problem.
Fig. 2: Phase diagram for sparse linear regression (left) and perceptron (center) with correlated data/patterns, defined by (18) and (19) respectively, and for the two-layer decoder (right), problem (20). The ML-AMP algorithm succeeds in reconstructing the signal with very small MSE above the dotted red line. Between the two lines reconstruction is information-theoretically possible, but ML-AMP does not achieve it. Below the full red line good reconstruction is impossible. The grey lines plotted for the perceptron are a comparison with the case of random patterns. Insets: comparisons between the MSE predicted by the state evolution (lines) and that provided by ML-AMP on a single instance (symbols). Dashed lines indicate the MMSE. Red triangles in the right inset compare to the performance of ordinary AMP in solving the decoder problem in the first layer assuming a binary i.i.d. prior on the hidden variables; ML-AMP takes into account correlations in the hidden variables and performs better.
IV Results for selected problems
We now focus on three examples of two-layer models and draw their phase diagrams, which indicate for which layer sizes (i.e. for which values of the ratios α_l) reconstruction of the signal is information-theoretically possible. We also run ML-AMP on single instances sampled from the model and illustrate that the mean-squared error it reaches after convergence matches the one predicted by the state evolution.
Sparse linear regression using correlated data: Among the simplest nontrivial cases of the ML-GLM model is a two-layer analogue of sparse linear regression (SLR) defined as
(18) y = W^{(2)} ( W^{(1)} x ) + ξ,
where the vector x we seek to infer is sparse and the noise ξ is Gaussian.
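A minimal numerical sketch of an instance of this kind: a sparse signal observed through a product of two random matrices with additive Gaussian noise, recovered here with plain ISTA as a generic sparse-regression baseline. The sizes, noise level, regularization strength, and solver are our illustrative choices; this is not the ML-AMP algorithm of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def soft(u, t):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

n0, n1, n2, rho = 400, 300, 200, 0.05
x0 = rng.standard_normal(n0) * (rng.random(n0) < rho)  # sparse ground truth
W1 = rng.standard_normal((n1, n0)) / np.sqrt(n0)
W2 = rng.standard_normal((n2, n1)) / np.sqrt(n1)
W = W2 @ W1                              # structured (product) data matrix
y = W @ x0 + 0.001 * rng.standard_normal(n2)

# Plain ISTA with a conservative step size (inverse squared spectral norm).
step = 1.0 / np.linalg.norm(W, 2) ** 2
xh = np.zeros(n0)
for _ in range(500):
    xh = soft(xh + step * W.T @ (y - W @ xh), 0.01 * step)
mse = np.mean((xh - x0) ** 2)
```

Even this generic baseline exploits sparsity; the point of the paper's analysis is to characterize the optimal error on such product-matrix instances.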
In a certain range of parameters, the model (18) is equivalent to a SLR with a structured data matrix W = W^{(2)} W^{(1)}, a problem previously studied by [18, 19, 20, 21, 22]. Interestingly, the state evolution equations (11) have at their fixed point a form involving the R-transform of the spectral density of the product matrix (see e.g. [23]). Together with eq. (13) these are the same equations obtained by adaptive approaches in [18, 19, 20, 21, 22]. This confirms that the adaptive methods are exact in the case of a matrix product, as was already noted for the Hopfield model [24].
In Fig. 2 (left) we draw the phase diagram as a function of the two aspect ratios. Remarkably, in the noiseless case, the phase diagram of the problem with a structured matrix reduces to the statement that both aspect ratios need to be larger than the corresponding threshold known for the usual compressed sensing problem [14].
Outside of this regime the simple mapping does not hold and, as far as we know, ML-AMP and its SE analysis give new results. We compared numerically the performance of ML-AMP to the VAMP of [22]. Whereas in part of the parameter space the two algorithms agree (within error bars), elsewhere ML-AMP gives a distinguishably lower MSE.
Perceptron learning with correlated patterns: A lot of work has been dedicated to learning and generalization in a perceptron with binary weights [25, 26], defined by:
(19) 
with a binary signal. These works focused on random patterns, where the elements of the pattern matrix are i.i.d.
It was recently argued that learning and generalization with combinatorially structured patterns, defined as products of random matrices, is best studied using multi-layer networks and presents major differences [24]. Our analysis of this case in Fig. 2 (center) quantifies how many extra samples are needed for a binary perceptron to correctly classify combinatorially correlated patterns with respect to random ones.
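A hedged sketch of what "combinatorially structured patterns" means here: a pattern matrix built as a product of two i.i.d. random matrices, so that its rows are correlated (the matrix is low-rank), with labels produced by a binary teacher. The sizes and the teacher construction are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, k = 300, 100, 80                 # patterns, input dim, inner dim
A = rng.standard_normal((n, k)) / np.sqrt(k)
B = rng.standard_normal((k, d)) / np.sqrt(d)
F = A @ B                               # correlated patterns: rank <= k
w_teacher = rng.choice([-1.0, 1.0], size=d)   # binary teacher weights
y = np.sign(F @ w_teacher)              # labels the student must fit
```

Compared with i.i.d. patterns, the rows of F are linearly dependent whenever k < n, which is exactly the correlation structure studied in this section.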
Two-layer decoder: The most exciting potential applications of the present results perhaps lie in the realm of deep neural networks, where models such as the ML-GLM with learned weights are used. A crucial ingredient of such neural networks are the nonlinear activation functions present among the layers. These activation functions can be seen as noisy channels, e.g. in the context of the decoder side of an autoencoder with known weights: one is interested in how well a vector of latent variables can be reconstructed when the output is observed. In Fig. 2 (right) we draw a phase diagram for the following example
(20) 
Our results illustrate that the ML-AMP algorithm provides better results than a layer-wise estimation with ordinary AMP, because it correctly takes into account the correlations among the hidden variables.
Acknowledgment
This work has been supported by the ERC under the European Union’s FP7 Grant Agreement 307087-SPARCS.
References
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] R. Ranganath, D. Tran, and D. M. Blei, “Hierarchical variational models,” in International Conference on Machine Learning (ICML), 2016.
[3] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela, “Community structure in time-dependent, multiscale, and multiplex networks,” Science, vol. 328, no. 5980, pp. 876–878, 2010.
 [4] M. Mézard and A. Montanari, Information, physics, and computation. Oxford University Press, 2009.
[5] T. Tanaka, “A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors,” IEEE Trans. Info. Theory, vol. 48, no. 11, pp. 2888–2910, 2002.
 [6] Y. Wu and S. Verdú, “Optimal phase transitions in compressed sensing,” IEEE Trans. Info. Theory, vol. 58, no. 10, pp. 6241–6263, 2012.
 [7] J. Barbier, M. Dia, N. Macris, and F. Krzakala, “The mutual information in random linear estimation,” arXiv preprint arXiv:1607.02335, 2016.
 [8] G. Reeves and H. D. Pfister, “The replicasymmetric prediction for compressed sensing with gaussian matrices is exact,” in 2016 IEEE International Symposium on Information Theory (ISIT), 2016, p. 665.
[9] D. J. Thouless, P. W. Anderson, and R. G. Palmer, “Solution of ‘Solvable model of a spin glass’,” Philosophical Magazine, vol. 35, p. 593, 1977.
 [10] M. Mézard, “The space of interactions in neural networks: Gardner’s computation with the cavity method,” J. Phys. A: Math. & Th., vol. 22, no. 12, p. 2181, 1989.
 [11] Y. Kabashima, “Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels,” J. Phys.: Conference Series, vol. 95, p. 012001, 2008.
[12] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl. Acad. Sci., vol. 106, p. 18914, 2009.
 [13] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in 2011 IEEE International Symposium on Information Theory (ISIT), Jul. 2011, pp. 2168–2172.
[14] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, “Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices,” J. Stat. Mech.: Theory and Experiment, vol. 2012, no. 08, p. P08009, 2012.
 [15] R. Ranganath, L. Tang, L. Charlin, and D. Blei, “Deep Exponential Families,” in AISTATS, 2015, pp. 762–771.
 [16] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Transactions on Information Theory, vol. 57, no. 2, pp. 764–785, 2011.

[17] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation and its generalizations,” Exploring Artificial Intelligence in the New Millennium, vol. 8, pp. 236–239, 2003.
[18] K. Takeda, S. Uda, and Y. Kabashima, “Analysis of CDMA systems that are characterized by eigenvalue spectrum,” EPL, vol. 76, p. 1193, 2006.
[19] A. M. Tulino, G. Caire, S. Verdú, and S. Shamai, “Support recovery with sparsely sampled free random matrices,” IEEE Trans. Info. Theory, vol. 59, no. 7, pp. 4243–4271, Jul. 2013.
 [20] Y. Kabashima and M. Vehkaperä, “Signal recovery using expectation consistent approximation for linear observations,” in 2014 IEEE International Symposium on Information Theory (ISIT), Jun. 2014, p. 226.
[21] B. Çakmak, O. Winther, and B. H. Fleury, “S-AMP: Approximate message passing for general matrix ensembles,” in 2014 IEEE Information Theory Workshop (ITW), Nov. 2014, pp. 192–196.
[22] S. Rangan, P. Schniter, and A. Fletcher, “Vector approximate message passing,” arXiv:1610.03082, Oct. 2016.
 [23] A. M. Tulino and S. Verdú, Random Matrix Theory and Wireless Communications. Now Publishers Inc, 2004.
 [24] M. Mézard, “Meanfield messagepassing equations in the Hopfield model and its generalizations,” arXiv:1608.01558, 2016.
 [25] E. Gardner and B. Derrida, “Three unfinished works on the optimal storage capacity of networks,” Journal of Physics A: Mathematical and General, vol. 22, no. 12, p. 1983, 1989.
 [26] A. Engel and C. Van den Broeck, Statistical mechanics of learning. Cambridge University Press, 2001.