Statistical properties of large data sets with linear latent features

by Philipp Fleig, et al.

Analytical understanding of how low-dimensional latent features reveal themselves in large-dimensional data is still lacking. We study this by defining a linear latent feature model with additive noise constructed from probabilistic matrices, and analytically and numerically computing the statistical distributions of pairwise correlations and eigenvalues of the correlation matrix. This allows us to resolve the latent feature structure across a wide range of data regimes set by the number of recorded variables, observations, latent features and the signal-to-noise ratio. We find a characteristic imprint of latent features in the distribution of correlations and eigenvalues and provide an analytic estimate for the boundary between signal and noise even in the absence of a clear spectral gap.





Appendix A Data distribution for the latent feature model with no noise, its variance, and the limit of many latent features

Each entry of the latent features data matrix is given by the sum of products of two i.i.d. Gaussian random variables and :


The product, , is distributed according to the normal product distribution [1]:


where $K_0$ is the modified Bessel function of the second kind:


To derive the probability density of the latent feature model entries , we first compute the characteristic function

by taking the Fourier transform of the normal product distribution. We then use the fact that the characteristic function

of the sum of products is given by . The inverse Fourier transform of then yields the sought after probability density.

Specifically, the characteristic function of the normal product distribution is


for , and is the Dirac delta function.
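As a numerical sanity check on the characteristic functions discussed here, the following minimal sketch compares an empirical characteristic function with the standard closed form for a sum of products of independent zero-mean Gaussians, $(1+\sigma_u^2\sigma_v^2 t^2)^{-m/2}$. The names `sigma_u`, `sigma_v`, `m` and both helper functions are assumptions made for illustration; they are not notation taken from the original equations.

```python
import numpy as np

def empirical_cf(samples, t):
    """Empirical characteristic function E[exp(i t x)] evaluated on a grid t."""
    return np.exp(1j * np.outer(t, samples)).mean(axis=1)

def sum_of_products_cf(t, m, sigma_u, sigma_v):
    """Closed form for a sum of m independent products u*v of zero-mean Gaussians."""
    return (1.0 + (sigma_u * sigma_v * t) ** 2) ** (-0.5 * m)

rng = np.random.default_rng(0)
m, sigma_u, sigma_v = 4, 1.0, 0.5           # assumed illustrative values
n_samples = 100_000

u = rng.normal(0.0, sigma_u, size=(n_samples, m))
v = rng.normal(0.0, sigma_v, size=(n_samples, m))
x = (u * v).sum(axis=1)                     # one simulated latent-feature entry per row

t = np.linspace(-6.0, 6.0, 13)
cf_err = np.max(np.abs(empirical_cf(x, t) - sum_of_products_cf(t, m, sigma_u, sigma_v)))
print(f"max |empirical - closed form| = {cf_err:.4f}")
```

The discrepancy should be of the order of the sampling error, roughly one over the square root of the number of samples.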

The characteristic function of the sum of products is given by


Finally, performing the inverse transformation we obtain the probability density function of the sum


Since the probability density function of is symmetric around zero, the mean of the distribution vanishes:

Figure S1: Comparison between the variances computed from simulated data of the latent feature model for (dashed gray lines) and the numerically evaluated limit expression for in Eq. (37) as a function of the limit value , for even values. We have chosen .

The variance is


The integral above can be evaluated in terms of generalized hypergeometric functions [2]. We present the calculation for when is even in detail:




with parameters


and is the generalized hypergeometric function


In the expression above, is the Pochhammer symbol, and is the cosecant. Since is even, we also have . Putting everything together, we obtain the following expression for the variance


where we have used the fact that the numerator after the first equality vanishes at . We can evaluate the limit , on the right-hand side numerically as shown in Fig. S1 and find that the variance of the latent feature data values is


This is in agreement with the intuition that every latent dimension contributes its own variance to the variance of the data.
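The additivity of the variance over latent dimensions can be checked with a short simulation. The sketch below assumes that each data entry is a sum of `m` products of independent zero-mean Gaussians with standard deviations `sigma_u` and `sigma_v` (names chosen here for illustration), so that the expected variance is `m * sigma_u**2 * sigma_v**2`.

```python
import numpy as np

def latent_entries(m, sigma_u, sigma_v, n_samples, rng):
    """Simulate n_samples entries, each a sum of m products of independent Gaussians."""
    u = rng.normal(0.0, sigma_u, size=(n_samples, m))
    v = rng.normal(0.0, sigma_v, size=(n_samples, m))
    return (u * v).sum(axis=1)

rng = np.random.default_rng(1)
sigma_u, sigma_v = 1.5, 0.8                 # assumed illustrative values
results = {}
for m in (1, 2, 8, 32):
    x = latent_entries(m, sigma_u, sigma_v, 100_000, rng)
    predicted = m * sigma_u**2 * sigma_v**2  # each latent dimension adds sigma_u^2 * sigma_v^2
    results[m] = (x.var(), predicted)
    print(f"m={m:3d}  empirical variance={x.var():.3f}  predicted={predicted:.3f}")
```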

We note that, for a large number of latent features, the distribution in Eq. (30) becomes normal, in agreement with the central limit theorem:


Crucially, the variance remains dependent on the number of latent features. Figure S2 compares the exact analytical expression of the probability distribution and its Gaussian approximation to numerical simulations.

As a final note, if we were interested in the distribution of data with noise, we would need to convolve the density in Eq. (30) with the Gaussian density of the noise.

Figure S2: Comparison of simulated data (gray) and the analytical distribution (orange). In the limit of large , the distribution approaches a Gaussian form (blue). Simulated data constitutes a single realization of the model with .
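The kind of comparison described in this caption can be sketched numerically. The code below assumes the standard closed form for the density of a sum of `m` products of independent zero-mean Gaussians (a Bessel-function density with scale `s = sigma_u * sigma_v`); the function names and parameter values are illustrative, and the notation may differ from Eq. (30).

```python
import numpy as np
from scipy.special import gamma, kv

def latent_pdf(x, m, s):
    """Density of a sum of m products of independent zero-mean Gaussians, s = sigma_u * sigma_v."""
    z = np.maximum(np.abs(x), 1e-8) / s      # small floor avoids 0 * inf at x = 0
    nu = 0.5 * (m - 1)
    return (0.5 * z) ** nu * kv(nu, z) / (np.sqrt(np.pi) * gamma(0.5 * m) * s)

def gaussian_pdf(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2.0 * np.pi * var)

def max_gap_to_gaussian(m, s=1.0, n_grid=4000):
    """Largest pointwise gap between the two densities, in standardized units."""
    std = np.sqrt(m) * s
    x = np.linspace(-10.0 * std, 10.0 * std, n_grid)
    return std * np.max(np.abs(latent_pdf(x, m, s) - gaussian_pdf(x, m * s**2)))

m, s = 6, 1.0                                # assumed illustrative values
x = np.linspace(-40.0, 40.0, 8000)
dx = x[1] - x[0]
pdf = latent_pdf(x, m, s)
norm = pdf.sum() * dx                        # should be close to 1
var = (x**2 * pdf).sum() * dx                # should be close to m * s**2
gaps = {k: max_gap_to_gaussian(k) for k in (4, 16, 64)}
print(norm, var, gaps)
```

The gap to the Gaussian shrinks as `m` grows, which is the behavior the caption describes for the large-feature limit.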

Appendix B Probability density of the correlation coefficients

Here we calculate the probability distribution of the entries of the empirical data correlation matrix for our latent feature model with noise. Before doing this, a few notes are in order. First, the correlations depend on the basis in which the variables are measured; the correlation matrix becomes diagonal in the special case when the measured variables are the principal axes of the data cloud. Thus, to make statements independent of the basis, we consider the distribution of typical correlations, that is, correlations in a basis that is random with respect to the principal axes of the data. For a given realization, the -dimensional data cloud is typically anisotropic, with long directions dominated by the latent feature signal and short directions dominated by noise. When , the principal axes of the data cloud do not align with the measured variables for the vast majority of random rotations, and correlations between any random pair of variables have contributions from all latent dimensions. Thus we expect the number of latent dimensions to be imprinted in the distribution of the elements of the correlation matrix, so that the statistics of the elements carry information about the underlying structure of the model.

B.1 Preliminaries: Density of the correlation coefficient of two random Gaussian variables

The correlation coefficient of two independent zero-mean variables and sampled times is


where the vectors’ components are mutually independent, i.i.d. random variables. The correlation coefficient is distributed according to [6]


This can be rewritten in terms of a Beta distribution


where and is the Beta function. Specifically, the density of correlations is given by the symmetric Beta distribution


where the location and scale are set such that the density is defined on the interval of correlation values [-1,1], and


We also note that the variance of a symmetric Beta distribution with the scale is


B.2 Density of correlations in the latent feature model

Figure S3: Distribution of pairwise correlations in the regimes of finite and small signal-to-noise ratio with and latent features. Analytic form (magenta) and simulated data (gray). (a) and (b) (large noise limit). Each simulation is run with variables and observations and constitutes independent model realisations.

There are multiple contributions to the correlations among the measured variables. We compute them individually and then combine them. We find that each contribution is distributed according to a symmetric Beta distribution. To obtain the overall density, we approximate the distribution of the sum by a single Beta distribution, the parameter of which is obtained by matching its variance to the sum of the variances of the individual components. To perform these analyses, we only keep terms to the leading order in the or the limit. Further, we assume that is small in accordance with the classical and intensive regime limits.

We start with the pure noise contribution to the correlations


The expression on the right-hand side is the correlation coefficient between two random Gaussian variables. Using Eq. (43), we arrive at




and the variance of this density is


Next we compute the density of the pure signal contribution


and similarly for . Rearranging, we find


The expressions in parentheses in both equations above are (co)variances of Gaussian random numbers. For , it follows the scaled $\chi^2$-distribution with degrees of freedom. For , it is given by a rescaled version of the distribution in Eq. (30), with instead of . Crucially, the variance of either is . Thus in the limit , the terms in parentheses are , where the correction is probabilistic, but will be neglected in what follows. We get


We see that the sought after correlation is a correlation coefficient between Gaussian variables, but with samples instead of . Using again Eq. (43), we write


with parameter


We remind the reader that Eq. (57) holds to . The variance of this density is


This expression agrees with numerical simulations very well, cf. Fig. 1.
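A minimal simulation can illustrate the statement that pure-signal correlations behave like correlation coefficients computed from as many samples as there are latent features. The sketch assumes noise-free data `X = U @ V` with unit-variance Gaussian entries, `T` observations, `N` variables and `m` latent features (names chosen for illustration), and checks that the variance of the off-diagonal correlations is close to `1/m` when `T` is much larger than `m`.

```python
import numpy as np

def signal_correlations(T, N, m, rng):
    """Off-diagonal correlations between the N columns of noise-free data X = U V."""
    U = rng.standard_normal((T, m))
    V = rng.standard_normal((m, N))
    X = U @ V                                # T observations of N variables
    C = np.corrcoef(X, rowvar=False)         # N x N correlation matrix
    return C[np.triu_indices(N, k=1)]

rng = np.random.default_rng(2)
T, N = 2000, 200                             # assumed illustrative sizes
signal_var = {}
for m in (5, 20, 40):
    c = signal_correlations(T, N, m, rng)
    signal_var[m] = c.var()
    print(f"m={m:3d}  var(corr)={c.var():.4f}  1/m={1.0 / m:.4f}")
```

Small deviations of order `1/T` are expected, since the latent coordinates are only approximately orthonormal at finite `T`.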

Finally, for the signal-noise cross terms in the correlation, we have


For the quantity in parentheses in Eq. (60), we define


This is a covariance between two independent Gaussian random numbers and again follows a rescaled form of the distribution in Eq. (30) with variance . Since is large, the distribution approaches a Gaussian and we further define , such that is a unit Gaussian random variable. Thus we obtain


where we have extracted the factor of to highlight that the expression in parentheses is the correlation between Gaussian random numbers. From this, using Eq. (43), we conclude that


with parameter


The variance of this density is


An analogous expression holds for the contribution.

The empirical correlation matrix is given by




Using Eqs. (46), (51) and (60), the correlation matrix can be written as a weighted sum of the three types of contributions


Each term on the right-hand side of this equation follows a Beta distribution as computed above. However, the parameter of each distribution is modified by the corresponding weight in the above sum. Consequently, the variance of each distribution is rescaled by the weight:


To determine an expression for the combined distribution of signal and noise correlations, we make use of the observation that the distribution of a sum of Beta-distributed variables can be well approximated by a single Beta distribution [3]. We determine the parameter of this single Beta distribution by adding the means and variances of the distributions in the sum and matching them analytically.
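The moment-matching step can be illustrated generically. The sketch below draws a weighted sum of independent symmetric Beta variables on [-1, 1], fits a single symmetric Beta by matching the variance (using the fact that a symmetric Beta with shape `alpha` on [-1, 1] has variance `1/(2*alpha + 1)`), and compares fourth moments as a rough quality check. The weights and shape parameters are hypothetical and are not the ones derived in the text.

```python
import numpy as np

def symmetric_beta(alpha, size, rng):
    """Draws from Beta(alpha, alpha) rescaled from [0, 1] to [-1, 1]."""
    return 2.0 * rng.beta(alpha, alpha, size=size) - 1.0

def match_alpha(variance):
    """Shape of the symmetric Beta on [-1, 1] with the given variance."""
    return 0.5 * (1.0 / variance - 1.0)

def fourth_moment(alpha):
    """Fourth moment of the symmetric Beta on [-1, 1] with shape alpha."""
    return 3.0 / ((2.0 * alpha + 1.0) * (2.0 * alpha + 3.0))

rng = np.random.default_rng(3)
weights = np.array([0.5, 0.3, 0.2])          # hypothetical weights
alphas = np.array([4.5, 99.5, 49.5])         # hypothetical shape parameters
n = 400_000

total = sum(w * symmetric_beta(a, n, rng) for w, a in zip(weights, alphas))
var_sum = np.sum(weights**2 / (2.0 * alphas + 1.0))   # variances add for independent terms
alpha_fit = match_alpha(var_sum)
m4_sample = np.mean(total**4)
m4_fit = fourth_moment(alpha_fit)
print(f"alpha_fit={alpha_fit:.2f}  sample m4={m4_sample:.5f}  fitted m4={m4_fit:.5f}")
```

The variance matches by construction, while higher moments agree only approximately; how close they are depends on how strongly one component dominates the sum.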

The means of the Beta distributions in Eq. (48), Eq. (57), and Eq. (B.2) are zero and thus the mean of the density of the combined contributions is also zero. Taking the sum of variances we obtain


In the limit when and are large enough such that contributions of and can be neglected, we have the following convergence of the empirical quantities


Consequently the variances of the contributions take the form


Thus, in this limit, the variance of the Beta distribution, Eq. (72), is of the form

var (80)

Finally, from the relation in Eq. (45), we obtain the parameter of the sought after Beta distribution.


A comparison between the analytic form of the density and simulated data is shown in Fig. 1 for , and in Fig. S3 for finite and . In the extreme noise limits, the analytic form closely matches the simulation. In the large noise limit of , shown in Fig. S3 (b), the density is close to a Gaussian, because the number of observations is large. In the regime of finite , shown in Fig. S3 (a), deviations between the analytic form and the simulation appear for small values of . We expect that these deviations will disappear by removing the various approximations made in the above analytic derivation.
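A comparison of this kind can be reproduced in outline with a small simulation of the full model. The sketch assumes data of the form `X = U @ V + sigma * R` with unit-variance Gaussian entries, and it assumes that the signal-to-noise ratio is the ratio of signal variance to noise variance per entry, so that `sigma**2 = m / snr`; these conventions and all names are illustrative and may differ from the definitions in the main text. It reports the variance of the pairwise correlations and the shape of the variance-matched symmetric Beta.

```python
import numpy as np

def model_correlations(T, N, m, snr, rng):
    """Off-diagonal correlations for X = U V + sigma R, with sigma^2 = m / snr (assumed convention)."""
    U = rng.standard_normal((T, m))
    V = rng.standard_normal((m, N))
    R = rng.standard_normal((T, N))
    sigma = np.sqrt(m / snr)
    X = U @ V + sigma * R
    C = np.corrcoef(X, rowvar=False)
    return C[np.triu_indices(N, k=1)]

rng = np.random.default_rng(4)
T, N, m = 500, 200, 10                       # assumed illustrative sizes
stats = {}
for snr in (1e-3, 1.0, 1e3):
    c = model_correlations(T, N, m, snr, rng)
    var = c.var()
    alpha = 0.5 * (1.0 / var - 1.0)          # variance-matched symmetric Beta on [-1, 1]
    stats[snr] = (var, alpha)
    print(f"snr={snr:g}  var(corr)={var:.5f}  matched alpha={alpha:.1f}")
```

In this setup the large-noise variance should be close to `1/T` and the small-noise variance close to `1/m`, with intermediate values in between.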

Appendix C Spectrum of the normalized empirical covariance matrix

To compute the eigenvalue density of the NECM, we use methods of random matrix theory [4]. The standard approach is to compute the finite-size Stieltjes transform



is the identity matrix,

and is a complex function. In the limit of large matrices – large or thermodynamic limit – the finite size Stieltjes transform becomes, . Then the eigenvalue density is obtained as the imaginary part of the limit of the Stieltjes transform:


where denotes the imaginary part.
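The recipe of reading off the eigenvalue density from the imaginary part of the Stieltjes transform can be sketched for a matrix with a known spectrum. The code below assumes the common convention `g(z) = (1/N) * sum_k 1/(z - lambda_k)` with the density given by `(1/pi) * Im g(x - i*eta)` for small positive `eta`; sign conventions vary between references, so this may differ from the one used here. A white Wishart matrix and the Marchenko-Pastur density serve as the test case.

```python
import numpy as np

def stieltjes(eigs, z):
    """Finite-size Stieltjes transform g(z) = (1/N) sum_k 1 / (z - lambda_k)."""
    return np.mean(1.0 / (z[:, None] - eigs[None, :]), axis=1)

def density_from_stieltjes(eigs, x, eta):
    """Smoothed eigenvalue density (1/pi) Im g(x - i eta)."""
    return stieltjes(eigs, x - 1j * eta).imag / np.pi

def marchenko_pastur(x, q):
    """Marchenko-Pastur density for a white Wishart matrix with ratio q = N/T < 1."""
    lo, hi = (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2
    inside = np.clip((hi - x) * (x - lo), 0.0, None)
    return np.sqrt(inside) / (2.0 * np.pi * q * x)

rng = np.random.default_rng(5)
N, T = 300, 1200                             # assumed illustrative sizes, q = 0.25
H = rng.standard_normal((T, N))
eigs = np.linalg.eigvalsh(H.T @ H / T)

x = np.linspace(-3.0, 6.0, 1801)
eta = 0.05                                   # finite eta broadens the density slightly
rho = density_from_stieltjes(eigs, x, eta)
total_mass = rho.sum() * (x[1] - x[0])
rho_at_1 = np.interp(1.0, x, rho)
print(total_mass, rho_at_1, marchenko_pastur(1.0, N / T))
```

Decreasing `eta` sharpens the estimate but requires larger matrices to keep it smooth.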

We start with writing again the definition of the normalized empirical covariance matrix (NECM), which differs from the correlation matrix only by :


The NECM contains three different types of contributions: one from the pure latent feature signal, one from pure noise, and two cross terms between the latent signal and the noise. Each contribution is a random matrix. Critical to computing the eigenvalue density of random matrices is the concept of matrix freeness [5], which is the generalization of statistical independence to matrices. The eigenvalue spectrum of sums and products of free matrices can be computed from the spectra of the summands and factors using the $R$- and the $S$-transforms, which are related to the Stieltjes transform and are additive and multiplicative, respectively. The signal-signal and the noise-noise contributions in the NECM definition are certainly free with respect to each other. We will argue in Appendix C.3 that, in our regimes of interest (the zero-noise limit, the classical statistics limit from Eq. (9), and the intensive limit from Eq. (10)), the cross-term contributions are negligible, so that we can drop them and approximate the NECM as


so that free matrix theory applies.

C.1 Parameterizing the random matrix problem and the large matrix limit

To calculate the spectrum of the signal-signal contribution to the NECM,


we note that, assuming , this matrix is of rank . Thus we can work in the basis, where




There are non-trivial eigenvalues associated with , while the remaining eigenvalues are zero. The finite size Stieltjes transform, , is then of the form


where are the eigenvalues of and is its finite size Stieltjes transform.
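The reduction from the full low-rank matrix to a small matrix carrying the non-trivial eigenvalues can be verified numerically. The sketch below assumes noise-free data `X = U @ V` with `T` observations, `N` variables and `m` latent features, and an unnormalized signal covariance `X.T @ X / T`; the normalization and all names are assumptions for illustration. It checks that the nonzero eigenvalues coincide with those of an `m x m` product of two Wishart-type matrices, and that the Stieltjes transforms are related by `g_N(z) = (m/N) * g_m(z) + (1 - m/N) / z`.

```python
import numpy as np

def stieltjes_at(eigs, z):
    """Finite-size Stieltjes transform at a single complex point z."""
    return np.mean(1.0 / (z - eigs))

rng = np.random.default_rng(6)
T, N, m = 400, 120, 6                        # assumed illustrative sizes
U = rng.standard_normal((T, m))
V = rng.standard_normal((m, N))
X = U @ V                                    # noise-free data, T x N

C_big = X.T @ X / T                          # N x N matrix of rank m
A = U.T @ U / T                              # m x m Wishart-type matrix
B = V @ V.T                                  # m x m Wishart-type matrix (unnormalized)

eigs_big = np.linalg.eigvalsh(C_big)         # ascending; first N - m are numerically zero
eigs_small = np.sort(np.linalg.eigvals(A @ B).real)

z = 50.0 + 10.0j
g_big = stieltjes_at(eigs_big, z)
g_small = stieltjes_at(eigs_small, z)
g_from_small = (m / N) * g_small + (1.0 - m / N) / z
print(np.max(np.abs(eigs_big[-m:] - eigs_small)), abs(g_big - g_from_small))
```

Working with the small matrix is what makes the product-of-Wishart structure explicit.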

Now we note that in Eq. (88) is the product of two white Wishart matrices




is the Wishart matrix, and is a matrix with i.i.d. standard normal entries. The key parameter characterizing such a standard Wishart matrix is the ratio of the number of columns to that of rows


Since and are and matrices, respectively, a natural characterisation of is then


with , so that there are only two independent parameters.

It is now convenient to define


where we used Eq. (38), so that Eq. (90) becomes


In the following, we only consider the limit of large matrices. Here , , and go to infinity in such a way that , and SNR are all constant. Then in the thermodynamic limit the finite size Stieltjes transform in Eq. (C.1) becomes


where and are the large matrices limits of the Stieltjes transforms of and , respectively.

C.2 The spectrum of

We now compute the eigenvalue density of . The first step is to compute the Stieltjes transform . From Eq. (95), it is clear that this reduces to the problem of computing the eigenvalue spectrum of a product of two Wishart matrices.

The spectrum of a product of two free matrices can be computed with the help of the $S$-transform, which is defined for a random matrix as


where is the functional inverse of the -transform . In turn, the -transform is related to the Stieltjes transform of through the relation


Crucially, for free matrices and , the $S$-transform is multiplicative


and, furthermore, for a scalar ,


For the white Wishart matrix, Eq. (91), the $S$-transform is known to be [4]
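As a numerical sanity check, the sketch below assumes the standard textbook conventions: the transform `t(zeta) = zeta * g(zeta) - 1`, the S-transform `S(t) = (t + 1) / (t * zeta(t))`, and the white Wishart result `S(t) = 1 / (1 + q * t)` with `q` the ratio of matrix dimension to number of samples. These conventions and all names are assumptions and may differ in notation from the equations here. The code evaluates the empirical S-transform at one point outside the spectrum, for a single Wishart matrix and for a product of two independent ones, where multiplicativity predicts the product of the two individual S-transforms.

```python
import numpy as np

def s_transform_at(eigs, zeta):
    """Return (t, S) with t = mean(lambda / (zeta - lambda)) and S = (t + 1) / (t * zeta)."""
    t = np.mean(eigs / (zeta - eigs))
    return t, (t + 1.0) / (t * zeta)

def white_wishart(n, n_samples, rng):
    """n x n white Wishart matrix built from n_samples Gaussian observations."""
    H = rng.standard_normal((n_samples, n))
    return H.T @ H / n_samples

rng = np.random.default_rng(7)
n = 400
q_a, q_b = 0.25, 0.5                         # assumed illustrative ratios
A = white_wishart(n, int(n / q_a), rng)
B = white_wishart(n, int(n / q_b), rng)

# Single Wishart matrix: zeta = 6 lies outside the spectrum for q_a = 0.25.
t_a, s_a = s_transform_at(np.linalg.eigvalsh(A), zeta=6.0)
s_a_theory = 1.0 / (1.0 + q_a * t_a)

# Product of two independent Wishart matrices: eigenvalues of A @ B are real and positive.
eigs_ab = np.linalg.eigvals(A @ B).real
t_ab, s_ab = s_transform_at(eigs_ab, zeta=12.0)
s_ab_theory = 1.0 / ((1.0 + q_a * t_ab) * (1.0 + q_b * t_ab))

print(s_a, s_a_theory, s_ab, s_ab_theory)
```

Agreement is expected only up to finite-size corrections, since freeness of independent Wishart matrices holds asymptotically.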


Thus we only need to use the multiplicative property of the $S$-transform to compute the signal-signal contributions to the NECM. Specifically,


Equation (97) then yields


We now solve the equation for the functional inverse, , using the definition of the -transform, Eq. (98), and dividing by a common factor of . We obtain a cubic equation for the Stieltjes transform :


Finally, we divide by to obtain