Mutual Information for Low-Rank Even-Order Symmetric Tensor Factorization

04/09/2019
by Jean Barbier, et al.
EPFL

We consider a statistical model for finite-rank symmetric tensor factorization and prove a single-letter variational expression for its mutual information when the tensor is of even order. The proof uses the adaptive interpolation method, for which rank-one matrix factorization was one of the first problems to which it was successfully applied. We show how to extend the adaptive interpolation to finite-rank symmetric tensors of even order, which requires new ideas with respect to the proof for the rank-one case. We also underline where the proof falls short when dealing with odd-order tensors.



I Introduction

Tensor factorization is a generalization of principal component analysis to tensors, in which one wishes to exhibit the closest low-rank approximation to a tensor. It has numerous applications in signal processing and machine learning, e.g., for compressing data while keeping as much information as possible, or in data visualization [1].

An approach to explore computational and/or statistical limits of tensor factorization is to consider a statistical model, as done in [2]. The model is the following: draw a fixed number of column vectors, evaluate for each of them its tensor power of a given order, and sum the resulting symmetric tensors. For order two, and if no degeneracy occurs, this sum is exactly the eigendecomposition of a low-rank positive semidefinite matrix. Tensor factorization can then be studied as an inference problem, namely, to estimate the initial vectors from noisy observations of the tensor and to determine information-theoretic limits for this task. To do so, we focus on proving formulas for the asymptotic mutual information between the noisy observed tensor and the original vectors. Such formulas were first rigorously derived for rank-one matrix factorization (order two, rank one): see [3] for the case of a binary input vector, [4] for the restricted case in which no discontinuous phase transition occurs, [5] for a single-sided bound and, finally, [6] for the fully general case. The proof in [6] combines interpolation techniques with spatial coupling and an analysis of the Approximate Message-Passing (AMP) algorithm. Later, and still for matrices, [7] went beyond rank one by using a rigorous version of the cavity method. Reference [8] applied the heuristic replica method to conjecture a formula for any order and finite rank, which is then proved for particular orders and ranks. Reference [8] also details the AMP algorithm for tensor factorization and shows how the single-letter variational expression for the mutual information allows one to give guarantees on AMP's performance. Afterwards, [9, 10] introduced the adaptive interpolation proof technique, which they applied to the rank-one matrix case. Other proofs based on interpolations recently appeared; see [11] and [12].

In this work, we prove the conjectured replica formula for any finite rank and any even order using the adaptive interpolation method. We also underline what is missing to extend the proof to odd orders.

The adaptive interpolation method was introduced in [9, 10] as a powerful improvement to the Guerra-Toninelli interpolation scheme [13]. Since then, it has been applied to many other inference problems in order to prove formulas for the mutual information, e.g., [14, 15]. While our proof outline is similar to [10], there are two important new ingredients. First, to establish the tight upper bound, we have to prove the regularity of a change of variables given by the solutions to an ordinary differential equation. This is non-trivial when the rank becomes greater than one. Second, the same bound requires one to prove the concentration of the overlap (a quantity that fully characterizes the system in the high-dimensional limit). When the rank is greater than one, this overlap is a matrix, and a recent result [16] on the concentration of overlap matrices can be adapted to obtain the required concentration in our interpolation scheme.

II Low-rank symmetric tensor factorization

We study the following statistical model. Let the rank be a fixed positive integer. The signal consists of a large number of random column vectors, each of dimension equal to the rank, independent and identically distributed (i.i.d.) according to a known prior distribution. These vectors are not directly observed. Instead, for each nondecreasing tuple of indices (one index per mode of the tensor), one is given access to the noisy observation

(1)

where the signal-to-noise ratio (SNR) is known and the noise is i.i.d. with respect to the standard normal distribution. Let the signal matrix be the one whose rows are the signal vectors. All the observations (1) are combined into the symmetric observation tensor, whose order equals the number of indices per observation.
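To make the setting concrete, here is a minimal Python sketch generating such noisy tensor observations for a small instance. The symbols N (number of vectors), K (rank), p (order), the Rademacher prior and the scaling sqrt(snr * (p-1)! / N^(p-1)) are illustrative assumptions, not taken verbatim from (1).

import itertools
import math
import numpy as np

def sample_observations(N=8, K=2, p=4, snr=1.0, seed=0):
    # N row vectors X_1, ..., X_N in R^K, i.i.d. from a placeholder Rademacher prior.
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(N, K))
    # Assumed spiked-tensor scaling; the exact prefactor is fixed by (1) in the paper.
    scale = math.sqrt(snr * math.factorial(p - 1) / N ** (p - 1))
    Y = {}
    # One noisy scalar observation per p-tuple i_1 <= ... <= i_p.
    for idx in itertools.combinations_with_replacement(range(N), p):
        signal = sum(np.prod([X[i, k] for i in idx]) for k in range(K))
        Y[idx] = scale * signal + rng.standard_normal()
    return X, Y

X, Y = sample_observations()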

Our main result is the proof of a formula for the mutual information in the limit where the number of signal vectors grows to infinity while the rank is kept fixed. This formula is given as the optimization of a potential over the cone of symmetric positive semi-definite matrices. Define the convex (see [17, Appendix A]) auxiliary function and the potential

(2)

in which a Hadamard power of the matrix argument appears. Note that, by the Schur product theorem [18], the Hadamard product of two matrices in the cone is also in the cone. Introducing the second moment matrix of the prior, the conjectured replica formula [8] reads

(3)
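As a quick numerical illustration of the Schur product theorem invoked above (a sanity check only, not part of the proof), one can verify that the entrywise product of two positive semidefinite matrices has nonnegative eigenvalues:

import numpy as np

rng = np.random.default_rng(1)
K = 4
# Two random symmetric positive semidefinite K x K matrices.
A = rng.standard_normal((K, K)); A = A @ A.T
B = rng.standard_normal((K, K)); B = B @ B.T
H = A * B  # Hadamard (entrywise) product
# All eigenvalues of H are nonnegative up to numerical precision.
print(np.linalg.eigvalsh(H).min() >= -1e-10)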

Remark: We can reduce the proof of (3) to the case of unit SNR by a suitable rescaling. From now on, we set the SNR to one.

Before proving (3), we introduce important information-theoretic quantities, adopting the statistical mechanics terminology. Define the Hamiltonian associated with the observation model (1), namely the signal-dependent part of minus the log-likelihood. Using Bayes' rule, the posterior density written in Gibbs-Boltzmann form is proportional to the prior times the exponential of minus the Hamiltonian, with a normalization factor (the partition function). Finally, we define the free entropy

(4)

which is linked to the mutual information through the identity

(5)

In (5), the remainder term remains bounded uniformly in the dimension. Thanks to (5), the replica formula (3) will follow directly from the next two bounds on the asymptotic free entropy.
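For reference, here is a generic derivation of an identity of the type (5), under conventions chosen for illustration (they may differ from the exact ones used in the paper): write the likelihood as $P(\mathbf{Y}\,|\,x) = G(\mathbf{Y})\, e^{-\mathcal{H}(x;\mathbf{Y})}$ with $G$ independent of $x$, and let $\mathcal{Z}(\mathbf{Y}) = \int dP_X(x)\, e^{-\mathcal{H}(x;\mathbf{Y})}$ be the normalization factor of the posterior. Then

\[
I(\mathbf{X};\mathbf{Y}) \;=\; \mathbb{E}\ln P(\mathbf{Y}\,|\,\mathbf{X}) \;-\; \mathbb{E}\ln P(\mathbf{Y})
\;=\; -\,\mathbb{E}\,\mathcal{H}(\mathbf{X};\mathbf{Y}) \;-\; \mathbb{E}\ln \mathcal{Z}(\mathbf{Y})\,,
\]

since $P(\mathbf{Y}) = G(\mathbf{Y})\,\mathcal{Z}(\mathbf{Y})$ and the $G$ terms cancel. Dividing by the number of signal vectors and using the definition (4) of the free entropy expresses the mutual information as minus the free entropy plus an explicitly computable Gaussian-channel term, which is the content of (5).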

Theorem 1

(Lower bound) Assume the order is even and the prior distribution has finite moments up to the required order. Then

(6)
Theorem 2

(Upper bound) Assume the order is even and the prior distribution has finite moments up to the required order. Then

(7)

III Adaptive path interpolation

We introduce a "time" parameter running from 0 to 1. The adaptive interpolation interpolates from the original channel (1) at time 0 to a decoupled channel at time 1. In between, we follow an interpolation path, a continuously differentiable matrix-valued function of time parametrized by a "small perturbation" that fixes its initial value. More precisely, at each intermediate time we observe

(8)

The additional noise is independent of both the signal and the noise in (1). The associated interpolating Hamiltonian reads

(9)

The posterior distribution of the signal given the interpolating observations then takes the corresponding Gibbs-Boltzmann form. The interpolating free entropy is defined as in (4), i.e.,

(10)

Evaluating (10) at both extremes of the interpolation gives:

(11)

In (11), the norm is the Frobenius norm and the last quantity is a remainder that is controlled uniformly in the dimension. It is useful, for the computations to come, to introduce the Gibbs bracket, which denotes an expectation with respect to the posterior distribution, i.e.,

(12)
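In generic form, consistent with this definition (the symbols $\mathcal{H}_{t,\epsilon}$ for the interpolating Hamiltonian, $\mathcal{Z}_{t,\epsilon}$ for its normalization and $dP_X$ for the product prior over all signal vectors are chosen here for illustration), the Gibbs bracket of a test function $g$ reads

\[
\big\langle g(x) \big\rangle_{t,\epsilon} \;=\; \frac{1}{\mathcal{Z}_{t,\epsilon}} \int dP_X(x)\, g(x)\, e^{-\mathcal{H}_{t,\epsilon}(x)}\,,
\qquad
\mathcal{Z}_{t,\epsilon} \;=\; \int dP_X(x)\, e^{-\mathcal{H}_{t,\epsilon}(x)}\,.
\]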

Combining (11) with the fundamental theorem of calculus

(13)

with the integrand being the time derivative of the interpolating free entropy, we obtain the sum rule of the adaptive interpolation.

Proposition 1 (Sum-rule)

Let the overlap matrix be the rank-by-rank matrix whose entries are the scalar products, normalized by the number of vectors, between the columns of a posterior sample and those of the ground-truth signal matrix. Assume the prior has finite moments of high enough order. Then

(14)

where the two remainder terms are independent of the interpolation path and of the perturbation, respectively.

Proof:

See [17, Appendix B] for the computation of the time derivative of the interpolating free entropy.

Theorems 1 and 2 are proved in the next section by plugging two different choices of the interpolation path into the sum rule (14).

IV Matching bounds

IV-A Lower bound: proof of Theorem 1

A lower bound on the asymptotic free entropy is obtained by choosing a linear interpolation path whose constant time derivative is a fixed symmetric positive semidefinite matrix. Then the sum rule (14) reads

(15)

If the order is even, the last summand on the r.h.s. of (15) is non-negative, and (15) directly implies a lower bound on the free entropy by the potential evaluated at the chosen matrix. Taking the liminf on both sides of this inequality, and bearing in mind that it holds for every symmetric positive semidefinite matrix, ends the proof of Theorem 1. ∎

We have at our disposal a wealth of interpolation paths when considering all continuously differentiable choices. However, to establish the lower bound (6), we only need a simple linear interpolation. Such an interpolation dates back to Guerra [13], and was already used in [8, 7] to derive the lower bound (6) both for rank one with any order and for order two with any finite rank. Now we turn to the proof of the upper bound (7), where we will see how the flexibility in the choice of the interpolation path constitutes an improvement over the classical interpolation.
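For concreteness, writing $R(t,\epsilon)$ for the path, $Q$ for the fixed symmetric positive semidefinite matrix and $\epsilon$ for the small perturbation (notation chosen here for illustration), the linear path can be taken of the form

\[
R(t,\epsilon) \;=\; \epsilon + t\,Q\,, \qquad \frac{\mathrm{d}}{\mathrm{d}t}R(t,\epsilon) \;=\; Q \quad \text{for all } t\in[0,1]\,,
\]

so that the time derivative entering the sum rule (14) is constant along the path.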

IV-B Upper bound: proof of Theorem 2

IV-B1 Interpolation determined by an ordinary differential equation (ODE)

The sum rule (14) suggests picking an interpolation path satisfying

(16)

The integral in (14) can then be split into two terms: one similar to the second summand in (2), and one that vanishes in the high-dimensional limit if the overlap concentrates. The next proposition states that (16) indeed admits a solution, which at first sight is not clear since the Gibbs bracket itself depends on the interpolation path. Non-trivial properties required to show the upper bound (7) are also proved.
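Schematically, and with illustrative notation ($R(t,\epsilon)$ the interpolation path, $\mathbf{Q}$ the overlap matrix and $\langle\,\cdot\,\rangle_{t,\epsilon}$ the interpolating Gibbs bracket; the exact display (16) may carry additional prefactors), such a choice is the matrix-valued ODE

\[
\frac{\mathrm{d}}{\mathrm{d}t} R(t,\epsilon) \;=\; \mathbb{E}\,\big\langle \mathbf{Q} \big\rangle_{t,\epsilon}\,,
\qquad R(0,\epsilon) \;=\; \epsilon\,,
\]

whose right-hand side depends on the path itself through the Gibbs bracket; this is why the existence and regularity of a solution require the proof given in Proposition 2.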

Proposition 2

For every perturbation there exists a unique global solution to the first-order ODE (16). This solution is continuously differentiable and bounded. If the order is even then, at each fixed time, the map sending the perturbation to the value of the solution is a continuously differentiable diffeomorphism from the open cone of symmetric positive definite matrices onto its image, whose Jacobian determinant is greater than or equal to one, i.e.,

(17)

Here the Jacobian matrix is taken with respect to the perturbation.

Proof:

We now rewrite (16) explicitly as an ODE. Fix a symmetric positive semidefinite matrix and consider the problem of inferring the signal from the following observations:

(18)

This problem is reminiscent of the interpolating problem (8). One can form a Hamiltonian similar to (9), in which the interpolation path is simply replaced by the fixed matrix, together with the Gibbs bracket associated to the posterior of this model. One then defines the relevant matrix-valued function of the time and of the fixed matrix, built from Gibbs averages of the overlap.

Note that this function takes values among symmetric positive semi-definite matrices. Indeed, this follows from the Nishimori identity¹ combined with the Schur product theorem, as explained next.

¹The Nishimori identity is a direct consequence of the Bayes formula. In our setting, it states that the expectation of the Gibbs average of a function of two samples drawn independently from the posterior distribution equals the expectation of the Gibbs average of the same function evaluated at one posterior sample and the ground-truth signal; the function may also depend explicitly on the observations.
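In its generic form (illustrative notation: $x^{(1)}, x^{(2)}$ are i.i.d. samples from the posterior, $\mathbf{X}$ is the ground-truth signal, and $g$ is any bounded function, possibly depending also on the observations), the Nishimori identity reads

\[
\mathbb{E}\,\big\langle g\big(x^{(1)}, x^{(2)}\big) \big\rangle \;=\; \mathbb{E}\,\big\langle g\big(x^{(1)}, \mathbf{X}\big) \big\rangle\,.
\]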

By the Nishimori identity, the building block of this function is positive semidefinite; by the Schur product theorem [18], its Hadamard powers also belong to the cone, so the function indeed takes values among symmetric positive semi-definite matrices. Moreover, it is continuously differentiable. Therefore, by the Cauchy-Lipschitz theorem, there exists a unique global solution to the resulting finite-dimensional ODE. Each initial condition is tied to a unique solution, which implies that the map sending the initial condition to the solution at a given time is injective. Its Jacobian determinant is given by Liouville's formula [19]:

(19)
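For reference, the generic statement of Liouville's formula for the flow of an autonomous ODE $\dot{R} = F(R)$ with initial condition $R(0) = \epsilon$ (stated here in illustrative notation rather than in the exact form of (19)) is

\[
\det \nabla_{\epsilon} R(t,\epsilon) \;=\; \exp\Big( \int_0^t \big(\nabla\!\cdot F\big)\big(R(s,\epsilon)\big)\, \mathrm{d}s \Big)\,,
\]

so the Jacobian determinant is at least one as soon as the divergence of the vector field is nonnegative along the trajectory, which is exactly what is shown next.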

Thanks to (19), we can show that the Jacobian determinant is greater than (or equal to) one by proving that the divergence of the right-hand side of the ODE is nonnegative everywhere on the cone. A lengthy computation (see [17, Appendix C]) leads to the identity

(20)

where

(21)

If the order is even then the prefactor appearing in (20) is nonnegative. We show next that the terms defined in (21) are nonnegative as well, thus ending the proof of (17). The second expectation on the right-hand side (r.h.s.) of (21) satisfies a chain of bounds (we omit the subscripts of the Gibbs bracket): the inequality is a simple application of Jensen's inequality, while the equality that follows is an application of the Nishimori identity. The final upper bound is nothing but the first expectation on the r.h.s. of (21). Therefore each term in (21) is nonnegative.

IV-B2 Proof of Theorem 2

Let the perturbation be a symmetric positive definite matrix. We interpolate with the unique solution to (16). Under this choice, the sum rule (14) reads:

(22)

Using the convexity of the function defined in Section II, we obtain by Jensen's inequality:

(23)

Combining (22) and (23) directly gives

(24)

In order to end the proof of (7), we must show that the second line of the upper bound (24) vanishes when the dimension goes to infinity. This will be the case if the overlap matrix concentrates on its expectation. Indeed, provided that the prior has finite moments of high enough order, there exists a constant, independent of the dimension, such that

(25)

However, proving that the r.h.s. of (25) vanishes is only possible after integrating over a well-chosen set of "perturbations" (which play the role of initial conditions for the ODE of Proposition 2). In essence, the integration over perturbations smooths the phase transitions that might appear for particular choices of the perturbation when the dimension goes to infinity.

We now describe the set of perturbations over which to integrate. Consider a sequence of positive numbers that goes to zero slowly enough (a suitable rescaling of it still diverges with the dimension), and use it to define a sequence of subsets of matrices.

These subsets consist of symmetric, strictly diagonally dominant matrices with positive diagonal entries; hence they are included in the open cone of symmetric positive definite matrices (see [20, Corollary 7.2.3]), and their volume is explicit.

Fix a time. First using the Cauchy-Schwarz inequality, and then making the change of variables given by the solution map of the ODE, which is justified because this map is a continuously differentiable diffeomorphism (see Proposition 2), one obtains

(26)

The last inequality follows from (17). It is not difficult to show that all these subsets are included in a common convex set. Their convex hulls are therefore uniformly bounded subsets of the cone of positive definite matrices. This uniform boundedness ensures that the free entropy associated to (18) has a variance that vanishes in the high-dimensional limit (see [17, Appendix D]), uniformly in the perturbation.

Such concentration of the free entropy is essential to guarantee the concentration of overlap matrices in a Bayesian inference framework. We can then adapt the proof of [16, Theorem 3] to show

(27)

Here the constant does not depend on the dimension. Note that the integral over the convex hull is an upper bound on the integral over the original subset. Combining (25), (26) and (27), one finally obtains:

(28)

To conclude the proof, we integrate the inequality (24) over the set of perturbations and then make use of (28) together with a control of the remaining terms. This yields an inequality that directly implies the upper bound (7). ∎

V Future work

We leave for future work the extension of both Theorems 1 and 2 to the odd-order case. For Theorem 1, it requires proving that the last summand on the r.h.s. of (15) is nonnegative. For rank one, the relevant quantities are nonnegative, so that the non-negativity of this summand follows [8]. However, for higher ranks, we only have weaker information on the overlap matrix. Regarding Theorem 2, the whole proof directly applies to odd orders if we can show that the divergence (20) is nonnegative, which is more difficult than in the even case. Indeed, while the terms (21) are still nonnegative, this is not necessarily true of the other factor in (20) when the order is odd.

Acknowledgment

C. L. acknowledges funding from the Swiss National Science Foundation, under grant no 200021E-175541.

References

  • [1] I. T. Jolliffe, Principal Component Analysis, 2nd ed., ser. Springer Series in Statistics.   New York, NY, USA: Springer-Verlag, 2002.
  • [2] E. Richard and A. Montanari, “A statistical model for tensor PCA,” in Advances in Neural Information Processing Systems 27, 2014, pp. 2897–2905.
  • [3] S. B. Korada and N. Macris, “Exact solution of the gauge symmetric p-spin glass model on a complete graph,” Journal of Statistical Physics, vol. 136, no. 2, pp. 205–230, 2009.
  • [4] Y. Deshpande, E. Abbe, and A. Montanari, “Asymptotic mutual information for the two-groups stochastic block model,” arXiv:1507.08685, 2015.
  • [5] F. Krzakala, J. Xu, and L. Zdeborová, “Mutual information in rank-one matrix estimation,” in 2016 IEEE Information Theory Workshop (ITW).   IEEE, 2016, pp. 71–75.
  • [6] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, and L. Zdeborová, “Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula,” in Advances in Neural Information Processing Systems (NIPS) 29, 2016, pp. 424–432.
  • [7] M. Lelarge and L. Miolane, “Fundamental limits of symmetric low-rank matrix estimation,” Probability Theory and Related Fields, Apr 2018.
  • [8] T. Lesieur, L. Miolane, M. Lelarge, F. Krzakala, and L. Zdeborová, “Statistical and computational phase transitions in spiked tensor estimation,” in IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 511–515.
  • [9] J. Barbier and N. Macris, “The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference,” Probability Theory and Related Fields, Oct 2018.
  • [10] J. Barbier and N. Macris, “The adaptive interpolation method for proving replica formulas. Applications to the Curie-Weiss and Wigner spike models,” arXiv e-prints, p. arXiv:1901.06516, Jan 2019.
  • [11] A. El Alaoui and F. Krzakala, “Estimation in the spiked Wigner model: A short proof of the replica formula,” in 2018 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2018, pp. 1874–1878.
  • [12] J.-C. Mourrat, “Hamilton-Jacobi equations for mean-field disordered systems,” arXiv preprint arXiv:1811.01432, 2018.
  • [13] F. Guerra and F. Toninelli, “The thermodynamic limit in mean field spin glass models,” Communications in Mathematical Physics, vol. 230, no. 1, pp. 71–79, 2002.
  • [14] J. Barbier, N. Macris, and L. Miolane, “The layered structure of tensor estimation and its mutual information,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Oct 2017, pp. 1056–1063.
  • [15] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová, “Optimal errors and phase transitions in high-dimensional generalized linear models,” Proceedings of the National Academy of Sciences, vol. 116, no. 12, pp. 5451–5460, 2019.
  • [16] J. Barbier, “Overlap matrix concentration in optimal Bayesian inference,” arXiv preprint arXiv:1904.02808, 2019.
  • [17] [Online]. Available: https://drive.google.com/file/d/1GhibO06rdRfqXX7VnxH0Zc9CFksFUzqq/view?usp=sharing
  • [18] J. Schur, “Bemerkungen zur theorie der beschränkten bilinearformen mit unendlich vielen veränderlichen.” Journal für die reine und angewandte Mathematik, vol. 140, pp. 1–28, 1911.
  • [19] P. Hartman, Ordinary Differential Equations: Second Edition, ser. Classics in Applied Mathematics.   Society for Industrial and Applied Mathematics, 1982.
  • [20] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed.   New York, NY, USA: Cambridge University Press, 2012.