Dimension-free Bounds for Sums of Independent Matrices and Simple Tensors via the Variational Principle

08/18/2021
by Nikita Zhivotovskiy, et al.
ETH Zurich

We consider deviation inequalities for sums of independent d by d random matrices, as well as rank one random tensors. Our focus is on the non-isotropic case and on bounds that do not depend explicitly on the dimension d, but rather on the effective rank. In an elementary and unified manner, we show the following results: 1) A deviation bound for sums of independent positive semi-definite matrices of any rank. This result generalizes the dimension-free bound of Koltchinskii and Lounici [Bernoulli, 23(1): 110-133, 2017] on the sample covariance matrix in the sub-Gaussian case. 2) A dimension-free version of the bound of Adamczak, Litvak, Pajor and Tomczak-Jaegermann [Journal of Amer. Math. Soc., 23(2): 535-561, 2010] on the sample covariance matrix in the log-concave case. 3) Dimension-free bounds for the operator norm of sums of random tensors of rank one formed either by sub-Gaussian or by log-concave random vectors. This complements the result of Guédon and Rudelson [Adv. in Math., 208: 798-823, 2007]. 4) A non-isotropic version of the result of Alesker [Geom. Asp. of Funct. Anal., 77: 1-4, 1995] on the deviation of the norm of sub-exponential random vectors. 5) A dimension-free lower tail bound for sums of positive semi-definite matrices with heavy-tailed entries, sharpening the bound of Oliveira [Prob. Th. and Rel. Fields, 166: 1175-1194, 2016]. Our approach is based on the duality formula between entropy and moment generating functions. In contrast to the known proofs of dimension-free bounds, we avoid Talagrand's majorizing measure theorem, as well as generic chaining bounds for empirical processes. Some of our tools were pioneered by O. Catoni and co-authors in the context of robust statistical estimation.

1 Introduction and main results

We study non-asymptotic bounds for sums of certain independent random matrices, as well as a closely related question of estimating the largest and smallest singular values of random matrices with independent rows. Assume that we are given a random n by d matrix A such that all of its rows X_1, …, X_n are isotropic independent sub-Gaussian vectors (see the formal definitions below). We are interested in providing upper and lower bounds on its singular values.

The question of upper bounding the largest singular value and lower bounding the smallest singular value is known to be essentially equivalent (see [47, Chapter 4]) to providing an upper bound on the deviation of the sample covariance matrix formed by the rows X_1, …, X_n. That is, one is interested in providing a high probability, non-asymptotic bound on

‖(1/n) ∑_{i=1}^{n} X_i X_i^⊤ − E X_1 X_1^⊤‖.    (1)

Here and in what follows, ‖·‖ stands for the operator norm of a matrix and for the Euclidean norm of a vector, respectively. The latter question is also central in mathematical statistics, where one is interested in estimating the underlying covariance structure using the sample covariance matrix. One of the usual assumptions made when analyzing (1) is that the rows are isotropic and zero mean; that is, E X_1 = 0 and E X_1 X_1^⊤ = I_d, where in what follows I_d stands for the d by d identity matrix. The non-isotropic case can usually be reduced to the isotropic one using a linear transformation. However, the problem is that in this case the bound on (1) will depend on the dimension d, whereas in many cases one expects that a dimension-free deviation bound is possible. The search for dimension-free bounds for sums of independent random matrices is motivated mainly by applications in statistics and data science, where it is usually assumed that the data lives on a low-dimensional manifold. Before providing our first result, we recall that for a random variable X and α ≥ 1, its ψ_α norm is defined as follows:

‖X‖_{ψ_α} = inf{ t > 0 : E exp(|X|^α / t^α) ≤ 2 }.

Using the standard convention, we say that ψ_2 is the sub-Gaussian norm and ψ_1 is the sub-exponential norm. We say that X is a sub-Gaussian random vector in R^d if sup_{θ ∈ S^{d−1}} ‖⟨X, θ⟩‖_{ψ_2} is finite. A zero mean random vector X is isotropic if E X X^⊤ = I_d. Here and in what follows, S^{d−1} denotes the corresponding unit sphere and ⟨·,·⟩ is the standard inner product in R^d. One of the central quantities appearing in this paper is the effective rank.
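
Before turning to the effective rank, here is a small self-contained sketch (not from the paper; the use of a standard Gaussian is an arbitrary illustrative choice) that makes the ψ_α definition concrete: for a standard Gaussian X one has E exp(X²/t²) = (1 − 2/t²)^{−1/2} for t > √2, so the ψ_2 norm, the smallest t for which this quantity is at most 2, can be found by bisection and equals √(8/3).

```python
import math

def mgf_of_square(t: float) -> float:
    """E exp(X^2 / t^2) for a standard Gaussian X (finite only for t > sqrt(2))."""
    return (1.0 - 2.0 / t ** 2) ** -0.5

# Bisection for the smallest admissible t, i.e. the psi_2 norm of X ~ N(0, 1).
lo, hi = math.sqrt(2) + 1e-12, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mgf_of_square(mid) <= 2.0:
        hi = mid
    else:
        lo = mid

print(hi, math.sqrt(8 / 3))   # both approximately 1.63299
```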

Definition 1.

For a positive semi-definite matrix Σ, define its effective rank as r(Σ) = Tr(Σ)/‖Σ‖.

The effective rank is always smaller than the matrix rank of Σ and, in particular, smaller than its dimension d. We also have r(Σ) ≥ 1. Our first result is a general upper bound for sums of independent positive semi-definite d by d matrices satisfying a sub-exponential norm equivalence assumption. This generalizes the question of upper bounding (1), since we neither assume that the matrix is of rank one nor that the covariance matrix is the identity.
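
As a quick illustration (not from the paper; the spiked spectrum below is an arbitrary choice), the following sketch computes the effective rank Tr(Σ)/‖Σ‖ and contrasts it with the usual matrix rank: a covariance with one strong direction and many weak ones has full rank but effective rank close to one.

```python
import numpy as np

def effective_rank(sigma: np.ndarray) -> float:
    """r(Sigma) = trace(Sigma) / operator norm of Sigma."""
    return np.trace(sigma) / np.linalg.norm(sigma, ord=2)

# A spiked covariance: one strong direction plus many weak ones.
d = 50
sigma = np.diag([10.0] + [0.1] * (d - 1))

print(np.linalg.matrix_rank(sigma))   # 50: full rank
print(effective_rank(sigma))          # (10 + 0.1 * 49) / 10 = 1.49: nearly rank one
```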

Theorem 1.

Assume that Y_1, …, Y_n are independent copies of a d by d positive semi-definite symmetric random matrix Y with mean Σ = E Y. Let Y satisfy, for some κ ≥ 1,

‖⟨Y θ, θ⟩‖_{ψ_1} ≤ κ ⟨Σ θ, θ⟩    (2)

for all θ ∈ S^{d−1}. Then, for any δ ∈ (0, 1), with probability at least 1 − δ, it holds that

(3)

whenever .

Remark 1.

In the theorem above, we presented explicit constants. We place them to emphasize that, in contrast with existing dimension-free bounds, these constants are easy to obtain with the approach we follow. At the same time, little effort was made to get their optimal values.

Remark 2.

In Section 3.2 we show that the same dimension-free bound holds under a weaker assumption (allowing heavy-tailed distributions), namely , but only for the lower tails of (3). This complements several known dimension-dependent lower tail bounds.

The norm equivalence assumption (2) is quite standard in the literature. As a matter of fact, Theorem 1 recovers one of the central results in high-dimensional statistics, as the following example shows.

Example 1 (The sample covariance matrix in the sub-Gaussian case [22]).

The most natural application of Theorem 1 is when Y = X X^⊤ and X is a zero mean sub-Gaussian random vector with covariance matrix Σ. That is, there is a constant such that for any θ ∈ S^{d−1}, it holds that

(4)

Using this line, we have for any ,

which is sufficient for Theorem 1. This gives that, with probability at least , provided that , it holds that

(5)

where we used that for a symmetric d by d matrix B it holds that ‖B‖ = sup_{θ ∈ S^{d−1}} |⟨Bθ, θ⟩|.
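
The scaling in (5) is easy to probe numerically. The following Monte Carlo sketch (illustrative only and not part of the paper; the Gaussian data, the spectrum and the sample size are arbitrary choices) compares the operator-norm deviation of a sample covariance matrix with the leading term ‖Σ‖√(r(Σ)/n).

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 200, 2000
eigenvalues = 1.0 / np.arange(1, d + 1) ** 2      # fast spectral decay: r(Sigma) << d
op_norm = eigenvalues.max()
eff_rank = eigenvalues.sum() / op_norm

# Gaussian sample with covariance Sigma = diag(eigenvalues).
X = rng.standard_normal((n, d)) * np.sqrt(eigenvalues)
sigma_hat = X.T @ X / n

deviation = np.linalg.norm(sigma_hat - np.diag(eigenvalues), ord=2)
leading_term = op_norm * np.sqrt(eff_rank / n)

print(f"||Sigma_hat - Sigma||      = {deviation:.4f}")
print(f"||Sigma|| sqrt(r(Sigma)/n) = {leading_term:.4f}")
```

Both printed quantities are of the same order and are much smaller than √(d/n), illustrating the dimension-free behaviour discussed above.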

Recently, much attention has been paid to dimension-free bounds for the sample covariance matrix formed by sub-Gaussian random vectors. Although the dimension-dependent version of (5) (that is, where r(Σ) is replaced by d) follows from a simple discretization argument, the known approaches to obtaining the dimension-free bounds are quite technical and deserve a separate discussion:

  • The bound (5) is essentially the result of Koltchinskii and Lounici [22, Theorem 9] for the sample covariance matrix (for the sake of technical simplicity, we work with finite-dimensional vectors, whereas the results in [22] are formulated for a general Hilbert space). Their proof is based on deep probabilistic results: the generic chaining tail bounds for quadratic processes and Talagrand’s majorizing measures theorem. Because of this, it is hard to provide any explicit constants using their approach or to extend their bound to the setup of Theorem 1. The dependence on the confidence level δ is not explicit in [22].

  • Using the matrix deviation inequality of Liaw, Mehrabian, Plan, and Vershynin [27], Vershynin [47] gives an alternative proof of the bound of Koltchinskii and Lounici, but with the term instead of in (5). The dependence on in [27] has been recently improved in [18]. However, this improved (and optimal for some problems) result only leads to term in (5).

  • Van Handel [45] gives an in-expectation version of (5) in the special case where X is a Gaussian random vector. Despite not using Talagrand’s majorizing measures theorem, this analysis is based on a Gaussian comparison theorem and does not cover the sub-Gaussian case. Note that in the Gaussian case, the in-expectation bound for the sample covariance matrix can be converted into an optimal high probability bound using one of the special concentration inequalities provided in [22, 1, 20].

Our elementary approach, based on the variational inequality and described in detail in Section 2, bypasses several technical steps appearing in the literature. Informally speaking, we provide a smoothed version of the ε-net argument that allows us to properly capture the complexity of elliptic indexing sets without resorting to generic chaining. This extension will be key to our multilinear results, where the above-mentioned tools are hard to apply.

Note that even though Example 1 is sharp in the rank one case (see the lower bound in [22]), the result of Theorem 1, due to its generality, can be suboptimal in other cases. For example, let the matrix be diagonal, with all of its diagonal elements equal to the same copy of the absolute value of a standard Gaussian random variable. In this case, Theorem 1 scales as , whereas the correct order is . We also remark that, at least in the rank one case, the bound of Theorem 1 is out of the scope of the so-called matrix concentration inequalities, since they provide additional logarithmic factors and suboptimal tails (see some related bounds in [44, 47, 20, 28]).

Motivated by the recent interest in random tensors [48, 15, 40, 7, 12], we show how our arguments can be extended to provide a multilinear extension of Theorem 1. That is, we are considering sums of independent random tensors of order higher than one and want to prove a bound similar to (5). Let us introduce this setup. Consider the simple (rank one) random symmetric tensor

where X is a zero mean sub-Gaussian vector in R^d and X_1, …, X_n are its independent copies. We are interested in studying

(6)

where ‖·‖ stands for the operator norm of a symmetric p-linear form. Here we used that for symmetric forms the corresponding supremum is maximized by a single vector θ ∈ S^{d−1} (see e.g., [36, Section 2.3]).
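
To get a feel for the quantity in (6), here is an illustrative sketch (not the procedure analyzed in the paper): it approximates the supremum over the sphere by maximizing over a large number of random unit directions, for Gaussian data and tensor order p = 4, where the population moment E⟨X, θ⟩⁴ = 3⟨Σθ, θ⟩² is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

d, n, p = 20, 2000, 4                        # p = tensor order (even, for a closed form)
eigenvalues = 1.0 / np.arange(1, d + 1)
X = rng.standard_normal((n, d)) * np.sqrt(eigenvalues)   # covariance diag(eigenvalues)

# Crude proxy for the supremum over the sphere in (6): maximize over many
# random unit directions instead of the whole sphere.
m = 2000
theta = rng.standard_normal((m, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

empirical = np.mean((X @ theta.T) ** p, axis=0)      # (1/n) sum_i <X_i, theta>^p
quad_form = (theta ** 2) @ eigenvalues               # <Sigma theta, theta> for diagonal Sigma
population = 3.0 * quad_form ** 2                    # E <X, theta>^4 for Gaussian X

print("max deviation over sampled directions:",
      np.max(np.abs(empirical - population)))
```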

The question of upper bounding (6) (where the expression is usually replaced by its absolute value in the right-hand side, and non-integer values of p are allowed) is well studied [13, 16, 31, 3, 46, 32]. The results are usually of the following form: assuming that X_1, …, X_n are i.i.d. copies of an isotropic vector X satisfying certain regularity assumptions, one is interested in finding the smallest sample size n such that, with high probability,

The general form of the assumption (see [16, 46] and, in particular, [3, Theorem 4.2]) required to achieve this precision for some regular families of distributions is

(7)

where depends only on . Although the condition is known to be optimal when is a constant [46], the dependence on is either suboptimal or not explicit in the existing results. In fact, a recent result of Mendelson [32] suggests that using a specific robust estimation procedure, it is possible to approximate the moments of the marginals for any with scaling as . At the same time, the inequality (7) becomes vacuous in this regime whenever . Before we proceed, recall the following definition.

Definition 2.

The measure μ in R^d is log-concave if, for any measurable subsets A, B ⊆ R^d and any λ ∈ [0, 1],

μ(λA + (1 − λ)B) ≥ μ(A)^λ μ(B)^{1−λ},

whenever the set λA + (1 − λ)B is measurable.

Our next result shows that provided that the sample size is large enough, one can approximate the -th integer moment of the marginals using their empirical counterparts with scaling as . This is the best possible approximation rate when and is a multivariate Gaussian random vector. Moreover, because of (7) this approximation rate was not previously achieved even in the isotropic case.

Theorem 2.

Let p be an integer. Assume that X_1, …, X_n are independent copies of a zero mean vector X that is either sub-Gaussian (4) or log-concave. There exist a constant depending only on p and an absolute constant such that the following holds. Assume that

(8)

Then, with probability at least , it holds that

where in the sub-Gaussian case and in the log-concave case.

Moreover, in the sub-Gaussian case if , then the same bound holds with probability at least .

To simplify our proofs, we focus only on the tensor case; in particular, we consider only integer values of p. In Theorem 2 we require that either or . These assumptions can be dropped by slightly inflating our upper bound (see also [3, Remark 4.3]). It is likely that in the sub-Gaussian case one can extend our arguments, namely the decoupling-chaining argument discussed below, so that the assertion holds with probability whenever . Indeed, we know by (5) that this is the case at least when . We preferred a shorter proof instead of a more accurate estimate of the tail.

In the log-concave case when , Theorem 2 complements the renowned result of Adamczak, Litvak, Pajor and Tomczak-Jaegermann [3]. The main advantage of our result is the explicit dependence on the effective rank, similar to the sample covariance bound of Koltchinskii and Lounici in the Gaussian case [22]. Our next result sharpens the tail estimate in this specific case and coincides with the best known bound in the isotropic case.

Theorem 3.

Assume that X_1, …, X_n are independent copies of a zero-mean, log-concave random vector X with covariance Σ. There are absolute constants such that the following holds. We have, with probability at least ,

whenever .

In both proofs we combine the variational inequality approach with the decoupling-chaining argument developed in [3, 42]. The latter argument is adapted to the general non-isotropic case.

There is a general version of Theorem 3, which follows from our proof with minimal changes. For the reader’s convenience we also present an explicit tail bound.

Corollary 1.

Assume that are independent copies of a zero-mean random vector with covariance such that for some and all , it holds that

There are absolute constants such that the following holds. For any we have, with probability at least ,

whenever .

In Section 3 we provide two additional results: a bound on the deviation of the norm of a sub-exponential random vector and a lower tail version of Theorem 1. Finally, as a part of the proof of Theorem 2, we provide a simple proof of the bound by Hsu, Kakade and Zhang [17] on the deviation of the norm of a sub-Gaussian random vector.

2 An approach based on the variational equality

Our approach will be based on the following duality relation (see [6, Corollary 4.14]): for a probability space (Θ, π) and any measurable function h on Θ such that E_{θ∼π} exp(h(θ)) is finite, it holds that

log E_{θ∼π} exp(h(θ)) = sup_ρ ( E_{θ∼ρ} h(θ) − KL(ρ, π) ),    (9)

where the supremum is taken with respect to all measures ρ absolutely continuous with respect to π, and KL(ρ, π) denotes the Kullback-Leibler divergence between ρ and π. The equality (9) is used in the proof of the additivity of entropy [26, Proposition 5.6] and in the transportation method for proving concentration inequalities [6, Chapter 8]. A useful corollary of the variational equality is the following lemma (see e.g., [10, Proposition 2.1] and discussions therein).

Lemma 1.

Assume that X, X_1, …, X_n are i.i.d. random variables defined on some measurable space. Assume also that we are given a parameter space Θ. Let f be such that almost surely. Let π be a distribution (called the prior) on Θ and let ρ be any distribution (called the posterior) on Θ such that ρ is absolutely continuous with respect to π. Then, with probability at least 1 − δ, simultaneously for all such ρ we have

where θ is distributed according to ρ.
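
The display in the statement above did not survive extraction. Based on the surrounding prose and the proof sketch below, the inequality should take the following standard PAC-Bayesian form; this is a reconstruction, and the exact integrability condition and constants should be checked against the published statement.

```latex
% Reconstructed form of the inequality in Lemma 1 (to be verified against the original):
% with probability at least 1 - \delta, simultaneously for all posteriors \rho \ll \pi,
\mathbb{E}_{\theta \sim \rho} \sum_{i=1}^{n} f(\theta, X_i)
  \;\le\; n\, \mathbb{E}_{\theta \sim \rho} \log \mathbb{E} \exp\bigl(f(\theta, X)\bigr)
  \;+\; \mathrm{KL}(\rho, \pi) \;+\; \log\tfrac{1}{\delta}.
```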

Proof.

We sketch the proof for the sake of completeness. Let in (9) be equal to . Let denote the expectation with respect to the i.i.d. sample . Using successively (9), Fubini’s theorem and independence of , we have

By Markov’s inequality for any random variable the identity implies that , with probability at least . The claim follows by taking

Remark 3.

In Lemma 1 we assumed that for all . However, this does not imply integrability with respect to ρ. If this is not the case, one can take the cases where the corresponding quantity is infinite into account by convention, so that the inequality of Lemma 1 still holds. For more details, see [9, Appendix A].
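
Since the whole approach rests on (9), a quick discrete sanity check may be useful (illustrative only; the atoms, weights and function below are arbitrary choices): the supremum in (9) is attained by the Gibbs measure dρ* ∝ exp(h) dπ, and any other posterior gives a smaller value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete check of (9): log E_pi exp(h) = sup_{rho << pi} ( E_rho h - KL(rho, pi) ).
k = 10
pi = rng.dirichlet(np.ones(k))             # base (prior) measure pi on k atoms
h = rng.standard_normal(k)                 # an arbitrary function h

lhs = np.log(np.sum(pi * np.exp(h)))

rho_star = pi * np.exp(h)
rho_star /= rho_star.sum()                 # Gibbs measure attaining the supremum
kl_star = np.sum(rho_star * np.log(rho_star / pi))
rhs_at_gibbs = np.sum(rho_star * h) - kl_star

rho = rng.dirichlet(np.ones(k))            # any other posterior
rhs_other = np.sum(rho * h) - np.sum(rho * np.log(rho / pi))

print(lhs, rhs_at_gibbs)                   # equal up to floating point error
print(rhs_other <= lhs)                    # True
```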

Our analysis is inspired by the applications of (9) and Lemma 1 in the works of Catoni and co-authors [5, 8, 9, 10] on robust mean estimation, as well as by the work of Oliveira [37] on the lower tails of sample covariance matrices under minimal assumptions. This approach is usually called the PAC-Bayesian method in the literature. In robust mean estimation, one makes minimal distributional assumptions (for example, by considering heavy-tailed distributions), aiming to estimate the mean of a random variable/vector/matrix using estimators that necessarily differ from the sample mean (see [8, 34, 30, 10, 33, 11, 38, 19, 32] and the recent survey [29]). Our aim is somewhat different: we work with sums of independent random matrices and multilinear forms. It is important to note that, except for the recent works of Catoni and Giulini [14, 10] on robust mean estimation, statistical guarantees based on (9) are dimension-dependent. Moreover, in those applications it is always enough to use Gaussian prior and posterior distributions with covariance matrices proportional to the identity matrix. In contrast, our analysis requires a more subtle choice of π and ρ in Lemma 1: we consider truncated multivariate Gaussian distributions whose covariance is not proportional to the identity matrix.

2.1 Motivating examples: matrices with isotropic sub-Gaussian rows and the Gaussian complexity of ellipsoids

To motivate (and illustrate) the application of the variational equality (9) in the context of high-dimensional probability, we first show how Lemma 1 can be used to recover the standard bound on the largest and smallest singular values of an n by d random matrix A having independent, mean zero, isotropic (Σ = I_d) sub-Gaussian rows. In this case, (4) can be rewritten as

In view of [47, Lemma 4.1.5], it is enough to show the following statement.

Proposition 1.

([47, Theorem 4.6.1]) Let A be an n by d random matrix whose rows are independent, mean zero, sub-Gaussian isotropic random vectors. We have for any , with probability at least ,

whenever .
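
Proposition 1 corresponds to the standard fact that the extreme singular values of A concentrate in a window of width of order √d (plus a confidence term) around √n. The following simulation (an illustrative sketch with Gaussian rows and arbitrary n and d; the constants of Proposition 1 are not reproduced) shows this behaviour, together with the equivalent statement for the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 4000, 100
A = rng.standard_normal((n, d))          # independent, mean zero, isotropic Gaussian rows

s = np.linalg.svd(A, compute_uv=False)
print(f"s_min = {s.min():.1f}, s_max = {s.max():.1f}")
print(f"sqrt(n) -/+ sqrt(d): {np.sqrt(n) - np.sqrt(d):.1f}, {np.sqrt(n) + np.sqrt(d):.1f}")

# The equivalent statement for the sample covariance of the rows:
deviation = np.linalg.norm(A.T @ A / n - np.eye(d), ord=2)
print(f"||A^T A / n - I|| = {deviation:.3f}   vs   sqrt(d/n) = {np.sqrt(d / n):.3f}")
```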

The standard way of proving Proposition 1 uses an ε-net argument combined with Bernstein's inequality in terms of the ψ_1 norm and the union bound. We demonstrate that if the prior and the posterior are correctly chosen, then Lemma 1 recovers the same bound without directly exploiting a discretization argument.

Remark 4.

The condition does not appear in [47, Theorem 4.6.1]. As a result, that bound contains an additional additive term scaling as . In our regime, when , this term is naturally dominated by . In some sense, we only capture the sub-Gaussian regime in the deviation bound. This regime is arguably the most interesting when considering statistical estimation problems.

Our analysis requires the following standard result. Since we need a version with an explicit constant, we reproduce these lines for the sake of completeness.

Lemma 2.

Let be a zero mean random variable. Then for any such ,

Proof.

First, by Markov's inequality, for any , it holds that

In the following lines, we assume without loss of generality that . We have for ,

(10)

Finally, when by Taylor’s expansion and since , we have

The claim follows. ∎

Proof.

(of Proposition 1) Fix . Our aim is to choose and . Let

where is a unit ball in . Choose to be a product of two uniform measures each defined on . For let

be a product of uniform distributions on the balls

and . Observe that both balls belong to . Because of this, if is distributed according to , we have . By the additivity of the Kullback-Leibler divergence for product measures and the formula for the volume of the d-dimensional ball, we have

where denotes the volume of the set . Fix and consider the random variable , where is distributed according to . We want to plug this random variable into Lemma 1. Observe that conditionally on , we have, using ,

where the last inequality follows from the fact that almost surely. Conditionally on , combining the triangle and Jensen’s inequalities, we have

Further, since we have by Lemma 2, conditionally on ,

whenever . Therefore, since and Lemma 1 gives that simultaneously for all , with probability at least ,

(11)

We choose to guarantee that . Then, taking , we require . Simplifying (11) for this choice of parameters, we prove the claim. ∎

Another motivating fact is that (9) correctly reflects the Gaussian complexity of the ellipsoid. It is well-known that for the ellipsoids, the Dudley integral argument does not give an optimal bound, while the generic chaining does (see [42, Chapter 2.5]). Although one can instead directly use the Cauchy-Schwarz inequality, it is easy to show that the variational equality (9) captures the same bound.

Example 2 (The Gaussian complexity of ellipsoids via the variational equality).

Let g be a standard normal random vector in R^d. Let Σ be a positive semi-definite d by d matrix. It holds that

Proof.

Set and let . Let the prior distribution be a multivariate Gaussian distribution with mean zero and covariance . For let the distribution be a multivariate Gaussian distribution with mean and covariance . By the standard formula, we have

Let be distributed according to . By Jensen’s inequality for any ,

By the line of the proof of Lemma 1, we have

Since is a standard normal random vector, it holds that

Combining previous inequalities and simplifying, we have

The claim follows. ∎

Observe that the proof explicitly uses a bound on the expected squared norm of a multivariate normal vector (not for the norm of though), which is closer to a more algebraic approach based on the Cauchy-Schwarz inequality, whereas the generic chaining is a geometric approach; we refer to [42, Chapter 2.5] for a detailed discussion of the Gaussian complexity of ellipsoids.
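
Example 2 can also be checked numerically: the Gaussian complexity of the ellipsoid Σ^{1/2}B_2 equals E‖Σ^{1/2}g‖, which is at most √(Tr Σ) by Jensen's inequality. A minimal Monte Carlo sketch (the spectrum below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)

d = 100
eigenvalues = 1.0 / np.arange(1, d + 1)          # spectrum of Sigma
sqrt_eigs = np.sqrt(eigenvalues)

# E sup_{||t|| <= 1} <g, Sigma^{1/2} t> = E ||Sigma^{1/2} g|| <= sqrt(trace(Sigma)).
g = rng.standard_normal((20000, d))
complexity = np.mean(np.linalg.norm(g * sqrt_eigs, axis=1))

print(f"Monte Carlo estimate of E ||Sigma^(1/2) g||: {complexity:.3f}")
print(f"sqrt(trace(Sigma))                         : {np.sqrt(eigenvalues.sum()):.3f}")
```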

2.2 Proof of Theorem 1

In view of Proposition 1 and Example 2, a natural idea is to use uniform distributions for π and ρ on ellipsoids induced by the structure of the matrix Σ. It appears that working with ellipsoids directly is quite tedious. To avoid these technical problems, we work with the non-isotropic truncated Gaussian distribution. Throughout the proof, we assume without loss of generality that Σ is invertible. If this is not the case, the relevant distribution lives almost surely in a lower-dimensional subspace. We can project onto this subspace and continue the proof without changes. Fix . Let

and choose the prior distribution on as the product of two multivariate Gaussian distributions in both with mean zero and covariance matrix . For let the posterior distribution be defined as follows. For consider the density function in given by

(12)

where is a normalization constant. That is, the distribution defined by

is a multivariate normal distribution restricted to the ball

. Our distribution on is now defined as a product of two distributions given by and respectively. Observe that since is symmetric around (that is, for any , we have ), we have for distributed according to ,

(13)

where and denote the marginals of .

Let us now compute the Kullback-Leibler divergence between and . Let denote the density function of a multivariate Gaussian distribution with mean zero and covariance . By the additivity of the Kullback-Leibler divergence for product measures, we have

Both terms are now analyzed similarly. For distributed according to ,

where in the last line we used . Let be a random vector having a multivariate Gaussian distribution with mean zero and covariance . By (12) and using the translation , we have . By Markov's inequality, we have . We choose

and get . Therefore, we have . Finally, for this choice of ,

(14)

For we want to plug the function into Lemma 1, where is distributed according to . By (13) we have

(15)

It is only left to compute . Conditionally on , we have, as in the proof of Proposition 1,

where both the norm and the expectation are considered with respect to the distribution of , and the second line uses the Cauchy-Schwarz inequality. Taking again the expectation with respect to only, we have by Lemma 2

provided that . Observe that by our choice of , we have almost surely

Let us choose . Thus, by (14), (15) and Lemma 1 we have for any fixed such that simultaneously for all ,

where we used . We choose and finish the proof. ∎

2.3 Proofs of Theorem 2 and Theorem 3

We first present some auxiliary results, and then we prove Theorem 3 and Theorem 2. The proof technique combines the analysis of Theorem 1 with a careful truncation argument. We also use the decoupling-chaining argument to control the large components in the sums. In the last part, we mainly adapt previously known techniques.

We need the following result, which is similar to the deviation inequality appearing in [17]. As above, we provide a simple proof based on the variational equality (9).

Lemma 3.

Assume that X is a zero mean sub-Gaussian random vector (4). Then, with probability at least ,

(16)
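
For intuition about (16), the following simulation (illustrative only; the Gaussian data, the spectrum, and the constant 2 under the square root are choices made here rather than the constants of Lemma 3) compares a high quantile of ‖X‖ with √(Tr Σ) + √(2‖Σ‖ log(1/δ)).

```python
import numpy as np

rng = np.random.default_rng(5)

d = 200
eigenvalues = 1.0 / np.arange(1, d + 1)          # spectrum of Sigma
tr, op = eigenvalues.sum(), eigenvalues.max()

# Gaussian X with covariance Sigma = diag(eigenvalues).
X = rng.standard_normal((50000, d)) * np.sqrt(eigenvalues)
norms = np.linalg.norm(X, axis=1)

delta = 0.01
quantile = np.quantile(norms, 1 - delta)
bound = np.sqrt(tr) + np.sqrt(2 * op * np.log(1 / delta))
print(f"empirical {1 - delta:.0%} quantile of ||X||           : {quantile:.3f}")
print(f"sqrt(Tr Sigma) + sqrt(2 ||Sigma|| log(1/delta)): {bound:.3f}")
```
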
Remark 5.

We will be frequently using the following relaxation of the bounds (16):

Proof.

Observe that . Thus, we upper bound uniformly over the sphere. Set . Let the prior distribution be a multivariate Gaussian distribution with mean zero and covariance . For let be a multivariate Gaussian distribution with mean and covariance . By the standard formula, we have

Our function is , where θ is distributed according to ρ. To apply Lemma 1 (with ) we only need to compute . Conditionally on θ, by the sub-Gaussian assumption, we have

(17)

where to get the explicit constant, one should keep track of the constant factors in the implications of [47, Proposition 2.5.2]. We have

Therefore, for any , simultaneously for all , we have, with probability at least ,

Choosing