Hanson-Wright inequality in Hilbert spaces with application to K-means clustering for non-Euclidean data

10/26/2018 ∙ by Xiaohui Chen, et al. ∙ University of Illinois at Urbana-Champaign

We derive a dimension-free Hanson-Wright inequality for quadratic forms of independent sub-gaussian random variables in a separable Hilbert space. Our inequality is an infinite-dimensional generalization of the classical Hanson-Wright inequality for finite-dimensional Euclidean random vectors. We illustrate an application to the generalized K-means clustering problem for non-Euclidean data. Specifically, we establish the exponential rate of convergence for a semidefinite relaxation of the generalized K-means, which, together with a simple rounding algorithm, implies exact recovery of the true clustering structure.

1. Introduction

The Hanson-Wright inequality is a fundamental tool for studying the concentration phenomenon for quadratic forms in sub-gaussian random variables [11, 31]. Recently, it has triggered a wide range of statistical applications such as semidefinite programming (SDP) relaxations for $K$-means clustering [21, 10] and Gaussian approximation bounds for high-dimensional $U$-statistics of order two [6]. The classical form of the Hanson-Wright inequality bounds the tail probability of a quadratic form in a finite-dimensional Euclidean random vector. Below is a version that is frequently cited in the literature (cf. Theorem 1.1 in [22]).

Theorem 1.1 (Hanson-Wright inequality for quadratic forms of independent sub-gaussian random variables in $\mathbb{R}^n$).

Let $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ be a random vector with independent components such that $\mathbb{E}[x_i] = 0$ and $\|x_i\|_{\psi_2} \le K$ for all $i$. Let $A$ be an $n \times n$ matrix. Then there exists a universal constant $c > 0$ such that for every $t \ge 0$,

$$\mathbb{P}\Big( \big| x^\top A x - \mathbb{E}[x^\top A x] \big| > t \Big) \le 2 \exp\Big[ -c \min\Big( \frac{t^2}{K^4 \|A\|_{\mathrm{HS}}^2}, \frac{t}{K^2 \|A\|_{\mathrm{op}}} \Big) \Big], \qquad (1)$$

where $\|A\|_{\mathrm{HS}} = \big(\sum_{i,j} a_{ij}^2\big)^{1/2}$ is the Hilbert-Schmidt (i.e., Frobenius) norm of $A$ and $\|A\|_{\mathrm{op}}$ is the operator (i.e., spectral) norm of $A$.
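The following small simulation is a quick numerical illustration of (1); it is our own sketch rather than part of the paper, and the Rademacher design, the matrix size, and the omission of the unknown universal constant $c$ and of the sub-gaussian constant $K$ are arbitrary choices made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 50, 50_000

    # Arbitrary fixed weighting matrix and Rademacher data (sub-gaussian with K of order 1).
    A = rng.standard_normal((n, n))
    hs = np.linalg.norm(A, "fro")   # Hilbert-Schmidt (Frobenius) norm of A
    op = np.linalg.norm(A, 2)       # operator (spectral) norm of A

    x = rng.choice([-1.0, 1.0], size=(reps, n))
    quad = np.einsum("ri,ij,rj->r", x, A, x)   # x^T A x, one value per replication
    deviation = np.abs(quad - np.trace(A))     # E[x^T A x] = tr(A) for Rademacher x

    for t in (50.0, 100.0, 200.0, 400.0):
        exponent = min(t**2 / hs**2, t / op)   # rate appearing in (1), without c and K
        print(f"t={t:5.0f}  empirical tail={np.mean(deviation > t):.2e}  "
              f"HW exponent={exponent:.2f}")
    # The empirical tail decays roughly like exp(-c * exponent) for some universal c,
    # which is the qualitative behavior predicted by (1).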

There are several variants of the finite-dimensional Hanson-Wright inequality. Sharp upper and lower tail inequalities for quadratic forms of independent Gaussian random variables are derived in [15]. [20] and [4] derive Hanson-Wright inequalities for zero-diagonal matrices with independent Bernoulli and centered sub-gaussian random variables, respectively. [13] establishes an upper tail inequality for positive semidefinite quadratic forms in a sub-gaussian random vector with dependent components. [29] proves a dimension-dependent concentration inequality for a centered random vector under the convex concentration property, and [1] further improves the inequality of [29] by removing the dimension dependence.

In this paper, we first derive an infinite-dimensional analog of the Hanson-Wright inequality (1) for sub-gaussian random variables taking values in a Hilbert space, which can be seen as a unified generalization of the aforementioned finite-dimensional results. Our motivation for deriving the dimension-free Hanson-Wright inequality stems from the generalized $K$-means clustering problem for non-Euclidean data with non-linear features, which covers functional data clustering and kernel clustering as special cases. It is well known that (classical) Euclidean distance based $K$-means clustering is computationally NP-hard. Various SDP relaxations in the literature (cf. [18, 16, 7, 21, 10]) aim to provide exact and approximate recovery of the true clustering structure. However, it remains a challenging task to provide strong statistical guarantees for computationally tractable (i.e., polynomial-time) algorithms that cluster non-Euclidean data taking values in a general Hilbert space with non-linear features. As we shall see in Section 3, the Hilbert space version of the Hanson-Wright inequality offers a powerful tool to establish the exponential rate of convergence for an SDP relaxation of the generalized $K$-means. This partial recovery bound implies exact recovery of the generalized $K$-means clustering via a simple rounding algorithm. Thus our results settle a conjecture of [24] in the kernel clustering setting, where only a heuristic greedy algorithm is provided.
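To fix ideas, the sketch below shows one standard SDP relaxation of kernel (generalized) $K$-means of the Peng-Wei type, written with cvxpy. It is only an illustrative formulation under our own naming (the Gram matrix K_mat and the number of clusters k are hypothetical inputs); it is not claimed to coincide with the relaxation (13) analyzed in Section 3.

    import cvxpy as cp
    import numpy as np

    def kmeans_sdp(K_mat: np.ndarray, k: int) -> np.ndarray:
        """Peng-Wei-type SDP relaxation of (kernel) K-means.

        K_mat : n x n Gram/kernel matrix of pairwise inner products <X_i, X_j>.
        k     : number of clusters.
        Returns the optimal membership-like matrix Z (n x n).
        """
        n = K_mat.shape[0]
        Z = cp.Variable((n, n), PSD=True)        # symmetric positive semidefinite
        constraints = [
            Z >= 0,                              # entrywise non-negative
            cp.sum(Z, axis=1) == np.ones(n),     # each row sums to one
            cp.trace(Z) == k,                    # trace equals the number of clusters
        ]
        problem = cp.Problem(cp.Maximize(cp.trace(K_mat @ Z)), constraints)
        problem.solve()                          # requires an SDP-capable solver (e.g. SCS)
        return Z.value

A clustering can then be read off the solution Z by a rounding step, for instance spectral rounding of its leading eigenvectors; this is in the spirit of the simple rounding algorithm mentioned above, although the precise rounding procedure used in Section 3 may differ.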

2. Hanson-Wright inequality in Hilbert spaces

To state the Hanson-Wright inequality in a general Hilbert space, we first need to properly specify the sub-gaussian random variables therein.

2.1. Sub-gaussian random variables in Hilbert spaces

Let $\mathbb{H}$ be a real separable Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$, and let $\mathcal{B}(\mathbb{H})$ be the class of bounded linear operators from $\mathbb{H}$ to $\mathbb{H}$. If an operator $\Sigma \in \mathcal{B}(\mathbb{H})$ is positive definite (i.e., it is self-adjoint and $\langle \Sigma u, u \rangle > 0$ for all $u \neq 0$), then there is a unique positive definite (and thus self-adjoint) square root operator $\Sigma^{1/2}$ satisfying $\Sigma^{1/2} \Sigma^{1/2} = \Sigma$ (cf. Theorem 3.4.3 in [12]).

Definition 2.1 (Trace class of linear operators on a separable Hilbert space).

Let $S \in \mathcal{B}(\mathbb{H})$. Then $S$ is trace class if

$$\|S\|_{\mathrm{tr}} := \sum_{k=1}^{\infty} \big\langle (S^* S)^{1/2} e_k, e_k \big\rangle < \infty,$$

where $\{e_k\}_{k \ge 1}$ is a complete orthonormal system (CONS) of $\mathbb{H}$. In this case, $\|S\|_{\mathrm{tr}}$ is the trace norm of $S$.

Note that the trace norm does not depend on the choice of the CONS. A self-adjoint and positive definite trace class linear operator is compact, and it plays a role similar to that of a covariance matrix, for which the trace norm is simply the trace. In particular, if $\Sigma$ is positive definite and trace class, then $\|\Sigma\|_{\mathrm{tr}} = \mathrm{tr}(\Sigma) = \sum_{k=1}^{\infty} \langle \Sigma e_k, e_k \rangle$. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.
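The basis-independence of the trace is easy to check numerically. The small sketch below is our own illustration, using a finite-dimensional covariance matrix as a stand-in for a trace class operator and a random orthogonal matrix whose columns play the role of a second CONS.

    import numpy as np

    rng = np.random.default_rng(1)
    p = 6

    # A positive definite "covariance" matrix as a finite-dimensional surrogate.
    B = rng.standard_normal((p, p))
    Sigma = B @ B.T + np.eye(p)

    # Trace computed in the standard basis.
    tr_standard = sum(Sigma[k, k] for k in range(p))

    # Trace computed in a random orthonormal basis (columns of Q).
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    tr_rotated = sum(Q[:, k] @ Sigma @ Q[:, k] for k in range(p))

    print(np.isclose(tr_standard, tr_rotated))  # True: the trace is basis-independent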

Definition 2.2 (Hilbert space valued sub-gaussian random variable).

Let $X$ be a random variable in $\mathbb{H}$ and let $\Sigma$ be a positive definite trace class linear operator on $\mathbb{H}$. Then $X$ is sub-gaussian with respect to $\Sigma$ (denoted as $X \sim \mathrm{subgaussian}(\Sigma)$) if there exists an $\alpha > 0$ such that for all $u \in \mathbb{H}$,

$$\mathbb{E}\big[\exp\big(\langle u, X - \mathbb{E}X \rangle\big)\big] \le \exp\Big(\frac{\alpha^2}{2} \langle \Sigma u, u \rangle\Big), \qquad (2)$$

where the expectation is defined as a Bochner integral (cf. Chapter 2.6 in [12]). Moreover, if $X \sim \mathrm{subgaussian}(\Sigma)$ with mean $\mu$, then the $\psi_2$ (or sub-gaussian) norm of $X$ with respect to $\Sigma$ is defined as

$$\|X\|_{\psi_2(\Sigma)} := \inf\Big\{ \alpha > 0 : \mathbb{E}\big[\exp\big(\langle u, X - \mu \rangle\big)\big] \le \exp\Big(\frac{\alpha^2}{2} \langle \Sigma u, u \rangle\Big) \ \text{for all } u \in \mathbb{H} \Big\}.$$

Note that Definition 2.2 corresponds to the $\Sigma$-sub-gaussianity in [2], and it is an infinite-dimensional analog of sub-gaussian random vectors in $\mathbb{R}^p$ (see for example [28] and [13]). Unsurprisingly, Gaussian random variables in $\mathbb{H}$ are a special case of sub-gaussian random variables in $\mathbb{H}$.
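For orientation, consider the finite-dimensional special case $\mathbb{H} = \mathbb{R}^p$ with $\Sigma = \sigma^2 I_p$, spelled out here only as an illustration of the condition displayed in (2). In this case (2) reduces to

$$\mathbb{E}\big[\exp\big(\langle u, X - \mathbb{E}X \rangle\big)\big] \le \exp\Big(\frac{\alpha^2 \sigma^2}{2} \|u\|_2^2\Big) \qquad \text{for all } u \in \mathbb{R}^p,$$

which is the usual moment generating function characterization of a sub-gaussian random vector with parameter $\alpha \sigma$ (cf. [28]).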

Definition 2.3 (Hilbert space valued Gaussian random variable).

A random variable $X$ in $\mathbb{H}$ is Gaussian with respect to $\Sigma$ and with mean $\mu$ (denoted as $X \sim \mathrm{N}(\mu, \Sigma)$) if for all $u \in \mathbb{H}$,

$$\mathbb{E}\big[\exp\big(\langle u, X - \mu \rangle\big)\big] = \exp\Big(\frac{1}{2} \langle \Sigma u, u \rangle\Big). \qquad (3)$$
Lemma 2.4.

If $X \sim \mathrm{N}(\mu, \Sigma)$, then $X \sim \mathrm{subgaussian}(\Sigma)$ and $\|X\|_{\psi_2(\Sigma)} = 1$, where $\Sigma$ is the covariance operator of $X$. More generally, if $X \sim \mathrm{subgaussian}(\Sigma)$ with mean $\mu$, then $\mathrm{Cov}(X) \preceq \|X\|_{\psi_2(\Sigma)}^2 \Sigma$, i.e., $\|X\|_{\psi_2(\Sigma)}^2 \Sigma - \mathrm{Cov}(X)$ is positive semidefinite.

Notation. We shall use $C, C_1, C_2, c, c_1, c_2, \dots$ to denote positive and finite universal constants, whose values may vary from place to place. For $a, b \in \mathbb{R}$, denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For $T \in \mathcal{B}(\mathbb{H})$, the operator norm of $T$, denoted $\|T\|_{\mathrm{op}}$, is defined as the square root of the largest eigenvalue of $T^* T$. If $\sum_{k=1}^{\infty} \|T e_k\|^2 < \infty$, then $T$ is a Hilbert-Schmidt (HS) operator and $\|T\|_{\mathrm{HS}} = \big(\sum_{k=1}^{\infty} \|T e_k\|^2\big)^{1/2}$. For a matrix $A = (a_{ij})$, $\|A\|_{\mathrm{F}} = \|A\|_{\mathrm{HS}} = \big(\sum_{i,j} a_{ij}^2\big)^{1/2}$ denotes its Frobenius (Hilbert-Schmidt) norm and $\|A\|_{\mathrm{op}}$ its spectral norm.
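As a small numerical reminder of how the two matrix norms compare (our own illustration with an arbitrary matrix), recall that $\|A\|_{\mathrm{op}} \le \|A\|_{\mathrm{F}} \le \sqrt{\mathrm{rank}(A)}\, \|A\|_{\mathrm{op}}$:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 5))

    op = np.linalg.norm(A, 2)        # operator (spectral) norm: largest singular value
    fro = np.linalg.norm(A, "fro")   # Hilbert-Schmidt / Frobenius norm
    rank = np.linalg.matrix_rank(A)

    # The two norms sandwich each other up to a sqrt(rank) factor.
    print(op <= fro <= np.sqrt(rank) * op)  # True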

2.2. Hanson-Wright inequality in Hilbert spaces

Throughout Section 2.2, we assume that $\mathbb{H}$ is a real separable Hilbert space and $\Sigma$ is a positive definite trace class operator on $\mathbb{H}$. First, we present a Hanson-Wright inequality with zero diagonal in Proposition 2.5.

Proposition 2.5 (Hanson-Wright inequality for quadratic forms of sub-gaussian random variables in Hilbert spaces: zero diagonal).

Let $X_1, \dots, X_n$ be a sequence of independent centered random variables in $\mathbb{H}$ with $X_i \sim \mathrm{subgaussian}(\Sigma)$, and let $\alpha = \max_{1 \le i \le n} \|X_i\|_{\psi_2(\Sigma)}$. Let $A = (a_{ij})$ be an $n \times n$ matrix with zero diagonal. Then there exists a universal constant $C > 0$ such that for any $t > 0$,

(4)

where .

Proposition 2.5 is a dimension-free version of the Hanson-Wright inequality with a zero-diagonal weighting matrix for independent sub-gaussian random variables in [22]. Specifically, Theorem 1.1 (i.e., Theorem 1.1 in [22]) is a special case of Proposition 2.5 with $\mathbb{H} = \mathbb{R}$ and $\Sigma = 1$; in this case, we may take $\alpha$ to be a constant multiple of $K$, so that (1) is recovered up to universal constants. Different from Theorem 1.1, Proposition 2.5 is also able to capture the dependence encoded in $\Sigma$ for general Hilbert spaces, thus covering certain quadratic forms in a finite-dimensional sub-gaussian random vector with dependent components. Our next result, Theorem 2.6 below, is an upper tail inequality (i.e., a one-sided Hanson-Wright inequality) with non-negative diagonal weights.

Theorem 2.6 (Upper tail inequality for quadratic forms of sub-gaussian random variables in Hilbert spaces: non-negative diagonal).

Let $X_1, \dots, X_n$ be a sequence of independent centered random variables in $\mathbb{H}$ with $X_i \sim \mathrm{subgaussian}(\Sigma)$, and let $\alpha = \max_{1 \le i \le n} \|X_i\|_{\psi_2(\Sigma)}$. Let $A = (a_{ij})$ be an $n \times n$ matrix such that $a_{ii} \ge 0$ for all $i = 1, \dots, n$. Then there exists a universal constant $C > 0$ such that for any $t > 0$,

(5)

where .

Both Proposition 2.5 and Theorem 2.6 allow $X_1, \dots, X_n$ to have different covariance operators $\mathrm{Cov}(X_i)$, provided that each $X_i$ is sub-gaussian with respect to the same $\Sigma$ (cf. Lemma 2.4).

Remark 2.7 (Connections to the existing upper tail inequality in finite-dimensional Euclidean spaces).

For non-negative diagonal weights, Theorem 2.6 is an infinite-dimensional (and thus dimension-free) generalization of the tail inequality for quadratic forms of a sub-gaussian random vector with dependent components in [13]. In particular, if $x$ is a centered sub-gaussian random vector in $\mathbb{R}^p$ (i.e., there exists a $\sigma > 0$ such that $\mathbb{E}[\exp(\langle u, x \rangle)] \le \exp(\sigma^2 \|u\|_2^2 / 2)$ for all $u \in \mathbb{R}^p$), then Theorem 2.1 in [13] states that: for any positive semidefinite matrix $P$ and $t > 0$,

$$\mathbb{P}\Big( x^\top P x > \sigma^2 \big( \mathrm{tr}(P) + 2 \sqrt{\mathrm{tr}(P^2)\, t} + 2 \|P\|_{\mathrm{op}}\, t \big) \Big) \le e^{-t}.$$

The last inequality is a special case (up to a universal constant) of (5) with $n = 1$, $A = 1$, $X_1 = P^{1/2} x$, $\Sigma = P$, and $\alpha = \sigma$. In addition, we note that the positive semidefinite condition is not needed in our Theorem 2.6. Instead, only a weaker condition on the non-negativity of the diagonal entries of the weighting matrix is required.

There are two limitations of Theorem 2.6. First, the quadratic form is typically not centered at its expectation in (5). For the generalized $K$-means application in Section 3, this means that consistency of solutions of the SDP relaxation (13) cannot be attained unless this centering bias tends to zero. Second, the non-negativity condition on the diagonal weights in Theorem 2.6 is not entirely innocuous for obtaining a concentration inequality (i.e., a two-sided Hanson-Wright inequality). Without imposing additional assumptions, we cannot expect a lower tail bound for sub-gaussian random variables even in the finite-dimensional case [1]. To simultaneously fix these two issues and obtain a concentration inequality for the quadratic form, we impose the following Bernstein-type condition on the squared norms, in addition to the assumption that $X_1, \dots, X_n$ are independent with mean zero.

Assumption 2.8 (Bernstein condition on the squared norm).

There exists a universal constant such that

(6)

where $\mathrm{Cov}(X_i)$ denotes the covariance operator of $X_i$.
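For orientation, recall the classical Bernstein moment condition for a real-valued random variable $Y$ with variance proxy $v$: there exists $b > 0$ such that

$$\mathbb{E}\big[\,|Y - \mathbb{E}Y|^k\,\big] \le \frac{k!}{2}\, v\, b^{k-2} \qquad \text{for all integers } k \ge 2.$$

This is only the standard scalar template; the precise form imposed in (6) is stated in terms of the covariance operators $\mathrm{Cov}(X_i)$ and $\Sigma$. Assumption 2.8 plays the analogous role for the squared norms $\|X_i\|^2$, guaranteeing sub-exponential tail behavior of $\|X_i\|^2$ around its mean (see Remark 2.9).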

Remark 2.9 (Comments on Assumption 2.8).

Assumption 2.8 is a mild condition on the sub-exponential tail behavior of $\|X_i\|^2$. In some cases, (6) is an automatic consequence of the sub-gaussianity (2). In particular, if $X_i = \Gamma z_i$, where $z_i$ has independent components with bounded sub-gaussian norms and $\Gamma$ is a linear map, then

Such a linear transformation of an independent random vector with sub-gaussian components is a popular statistical model for $K$-means clustering [10, 21]. For the general Hilbert space $\mathbb{H}$, it is easy to verify that a Gaussian random variable in $\mathbb{H}$ satisfies (6). Comparing with the “centering” term in (5), we shall see that the correct centering terms in (6), together with the associated parameters, are crucial to yield a concentration inequality for the quadratic form. By Lemma 2.4, the covariance operator of a sub-gaussian random variable is dominated by a multiple of $\Sigma$ but need not coincide with it; in fact, even in $\mathbb{R}$, it is easy to construct a sub-gaussian random variable whose variance is strictly smaller than the proxy $\Sigma$ (cf. Examples 4.1 and 4.2 in [6]). In particular, here we give a counterexample in $\mathbb{R}$ (so that $\Sigma$ is a scalar). Let

follow a mixture of Gaussian distributions

, where and . Then we have and , where for some sufficiently large constant . Thus if as , then and

where . Hence is a sub-gaussian random variable satisfying Assumption 2.8 and , provided that as .

Now we are ready to state the Hanson-Wright inequality for the general case.

Theorem 2.10 (Hanson-Wright inequality for quadratic forms of sub-gaussian random variables in Hilbert spaces: general version).

Let $X_1, \dots, X_n$ be a sequence of independent centered random variables in $\mathbb{H}$ with $X_i \sim \mathrm{subgaussian}(\Sigma)$, and let $\alpha = \max_{1 \le i \le n} \|X_i\|_{\psi_2(\Sigma)}$. Let $A = (a_{ij})$ be an $n \times n$ matrix. If in addition Assumption 2.8 holds, then there exists a universal constant $C > 0$ such that for any $t > 0$,

(7)

where .

[29] and [1] derive Hanson-Wright inequalities under the convex concentration property of a finite-dimensional random vector, which is difficult to verify in general. In contrast, our Theorem 2.10 holds under more transparent conditions (i.e., the sub-gaussian and Bernstein-type assumptions). Note that Theorem 2.10 can be seen as a unified generalization of the finite-dimensional Hanson-Wright inequality to Hilbert spaces for both independent sub-gaussian random variables in [22] and a sub-gaussian random vector with dependent components in [13].

2.3. Proof of the main results in Section 2.2

In this section, we prove Proposition 2.5 and Theorems 2.6 and 2.10. Before proceeding to the rigorous proofs, we would like to mention that, although our general proof strategy is based on that of Theorem 1.1 in [22], two innovative ingredients are needed to accommodate the Hilbert space structure.

First, we diagonalize the operator $\Sigma$ (together with the decoupling) in order to perform the calculations in a space isometric to $\mathbb{H}$, where linear operators can be conveniently represented by (infinite-dimensional) matrices. This turns out to be the crux for obtaining the trade-off between the Hilbert-Schmidt type and operator norm type terms in the tail probability bound for the off-diagonal sum.

Second, we derive a sharp upper tail probability bound for the non-negatively weighted diagonal sum of squared norms of independent sub-gaussian random variables in $\mathbb{H}$ (cf. Lemma 4.2). If we simply apply Bernstein’s inequality (cf. Theorem 2.8.1 in [28]) to the real-valued sub-exponential random variables $\|X_i\|^2$ (cf. Lemma 4.4), then the diagonal sum has the following probability bound: for any $t > 0$,

(8)

Note that the right-hand side of (8) is controlled by the single parameter $\mathrm{tr}(\Sigma)$, which is strictly less sharp than (4) since $\|\Sigma\|_{\mathrm{HS}} \le \mathrm{tr}(\Sigma)$ and $\|\Sigma\|_{\mathrm{op}} \le \mathrm{tr}(\Sigma)$. For instance, if $\mathbb{H} = \mathbb{R}^p$, then $\Sigma$ is often the covariance matrix of the data. In the special case $\Sigma = \sigma^2 I_p$, we have $\mathrm{tr}(\Sigma) = \sigma^2 p$, $\|\Sigma\|_{\mathrm{HS}} = \sigma^2 \sqrt{p}$, and $\|\Sigma\|_{\mathrm{op}} = \sigma^2$. Therefore, a direct application of the diagonal sum bound (8) does not yield the probability bound in Proposition 2.5. In particular, for the generalized $K$-means clustering problem, this would imply that a much more restrictive lower bound condition on the signal-to-noise ratio is required for exact recovery of the true clustering structure for high-dimensional data (more details can be found in the discussion after Theorem 3.3).
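For reference, the version of Bernstein’s inequality for sums of independent, mean-zero, sub-exponential real random variables $Y_1, \dots, Y_n$ that is invoked here (cf. Theorem 2.8.1 in [28]) reads, for a universal constant $c > 0$ and every $t \ge 0$,

$$\mathbb{P}\Big( \Big| \sum_{i=1}^{n} Y_i \Big| \ge t \Big) \le 2 \exp\Big[ -c \min\Big( \frac{t^2}{\sum_{i=1}^{n} \|Y_i\|_{\psi_1}^2}, \frac{t}{\max_{1 \le i \le n} \|Y_i\|_{\psi_1}} \Big) \Big],$$

where $\|\cdot\|_{\psi_1}$ denotes the sub-exponential (Orlicz) norm; applied to the centered diagonal summands, this is the route that leads to (8).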

Proof of Proposition 2.5.

By Markov’s inequality, we have for any $\lambda > 0$ and $t > 0$,

Step 1: decoupling. Let $\varepsilon_1, \dots, \varepsilon_n$ be i.i.d. Rademacher random variables (i.e., $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$) that are independent of $X_1, \dots, X_n$. Since

we obtain a decoupled form of the quadratic form, where $\mathbb{E}_{\varepsilon}$ denotes the expectation taken with respect to the random variables $\varepsilon_1, \dots, \varepsilon_n$ (expectations with respect to other collections of random variables are defined similarly below). By Jensen’s inequality, we get

Let . Then we can write

Taking the expectation with respect to one group of the $X_i$’s (i.e., conditioning on $\varepsilon_1, \dots, \varepsilon_n$ and the remaining $X_i$’s), it follows from the assumption that $X_1, \dots, X_n$ are independent with mean zero that

where . Thus we get
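One standard way to organize this decoupling step (a sketch of the usual device, cf. [22]; the exact constants and conditioning used in the argument above may differ) is to set $\delta_i = (1 + \varepsilon_i)/2$, which are i.i.d. Bernoulli(1/2) variables, and to use that for $i \ne j$,

$$\mathbb{E}_{\varepsilon}\big[\delta_i (1 - \delta_j)\big] = \frac{1}{4}, \qquad \text{so that} \qquad \sum_{i \ne j} a_{ij} \langle X_i, X_j \rangle = 4\, \mathbb{E}_{\varepsilon}\Big[ \sum_{i \ne j} a_{ij}\, \delta_i (1 - \delta_j)\, \langle X_i, X_j \rangle \Big].$$

Jensen’s inequality then allows the expectation $\mathbb{E}_{\varepsilon}$ to be pulled outside the exponential, which is how the random splitting of the index set enters the moment generating function bound.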

Step 2: reduction to Gaussian random variables. For $i = 1, \dots, n$, let $Y_i$ be independent Gaussian random variables in $\mathbb{H}$ that are independent of $X_1, \dots, X_n$ and $\varepsilon_1, \dots, \varepsilon_n$. Define

Then, by the definition of Gaussian random variables in $\mathbb{H}$, we have

So it follows that

Since , we have

which implies that

(9)

where .
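The comparison underlying this step is the following elementary bound, written here as a sketch in the notation of Definitions 2.2 and 2.3: if $X \sim \mathrm{subgaussian}(\Sigma)$ is centered with $\|X\|_{\psi_2(\Sigma)} \le \alpha$ and $G \sim \mathrm{N}(0, \alpha^2 \Sigma)$, then for every fixed $u \in \mathbb{H}$,

$$\mathbb{E}\big[\exp\big(\langle u, X \rangle\big)\big] \le \exp\Big(\frac{\alpha^2}{2} \langle \Sigma u, u \rangle\Big) = \mathbb{E}\big[\exp\big(\langle u, G \rangle\big)\big],$$

which is what allows the sub-gaussian variables to be replaced by Gaussian ones inside the conditional moment generating functions.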

Step 3: diagonalization. Since $\Sigma$ is trace class (thus compact) and positive definite, it follows from Theorem 4.2.4 in [12] that the eigendecomposition of $\Sigma$ is given by

$$\Sigma = \sum_{k \ge 1} \lambda_k\, (\varphi_k \otimes \varphi_k),$$

where $\lambda_1 \ge \lambda_2 \ge \cdots > 0$ are the eigenvalues of $\Sigma$ and $\{\varphi_k\}_{k \ge 1}$ are the corresponding eigenfunctions forming a CONS of $\overline{\mathrm{Im}(\Sigma)}$; namely $\Sigma \varphi_k = \lambda_k \varphi_k$ for every $k \ge 1$. Here, $\otimes$ denotes the tensor product and $\overline{\mathrm{Im}(\Sigma)}$ denotes the closure of the image of $\Sigma$. In addition, there exists a unique positive definite square root operator $\Sigma^{1/2}$ such that $\Sigma = \Sigma^{1/2} \Sigma^{1/2}$ (cf. Theorem 3.4.3 in [12]). Then we have

where the last step follows from Parseval’s identity. Note that

Thus for any ,

which implies that , are independent random variables. Now let , where for . Then , where with and . Note that

Thus is an matrix of all zeros if , and .
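Concretely, since $\Sigma$ is positive definite, $\overline{\mathrm{Im}(\Sigma)} = \mathbb{H}$ and $\{\varphi_k\}_{k \ge 1}$ is a CONS of the whole space, so Parseval’s identity takes the form (a standard restatement in the notation above, not a new ingredient)

$$\langle x, y \rangle = \sum_{k \ge 1} \langle x, \varphi_k \rangle\, \langle y, \varphi_k \rangle \qquad \text{for all } x, y \in \mathbb{H}.$$

Expanding each random element in the eigenbasis $\{\varphi_k\}$ and using the independence of the coordinates noted above is what reduces the Hilbert space quadratic form to an (infinite) matrix quadratic form that can be handled coordinatewise.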

Step 4: bound the eigenvalues. Let be the restriction matrix such that if and otherwise. Let further and , where with and are i.i.d. standard Gaussian random variables in . By the rotational invariance of Gaussian distributions, we have

where are the eigenvalues of . So it follows that

where

In addition, we also have

Invoking (9), we get

Since the squares of the i.i.d. standard Gaussian random variables introduced above are i.i.d. $\chi_1^2$ random variables with moment generating function $\mathbb{E}[e^{s \chi_1^2}] = (1 - 2s)^{-1/2}$ for $s < 1/2$, we have

Using for , we get that if , then

Note that the last inequality is uniform in . Taking expectation with respect to , we obtain that