Hanson-Wright inequality in Hilbert spaces with application to K-means clustering for non-Euclidean data

We derive a dimension-free Hanson-Wright inequality for quadratic forms of independent sub-gaussian random variables in a separable Hilbert space. Our inequality is an infinite-dimensional generalization of the classical Hanson-Wright inequality for finite-dimensional Euclidean random vectors. We illustrate an application to the generalized K-means clustering problem for non-Euclidean data. Specifically, we establish the exponential rate of convergence for a semidefinite relaxation of the generalized K-means, which together with a simple rounding algorithm implies the exact recovery of the true clustering structure.


1. Introduction

The Hanson-Wright inequality is a fundamental tool for studying the concentration phenomenon for quadratic forms in sub-gaussian random variables [11, 31]. Recently, it has triggered a wide range of statistical applications such as semidefinite programming (SDP) relaxations for K-means clustering [21, 10] and Gaussian approximation bounds for high-dimensional U-statistics (of order two) [6]. The classical form of the Hanson-Wright inequality bounds the tail probability of a quadratic form in a finite-dimensional random vector in a Euclidean space. Below is a version that is frequently cited in the literature (cf. Theorem 1.1 in [22]).

Theorem 1.1 (Hanson-Wright inequality for quadratic forms of independent sub-gaussian random variables in $\mathbb{R}^n$).

Let $X = (X_1,\dots,X_n) \in \mathbb{R}^n$ be a random vector with independent components $X_i$ such that $\mathbb{E}[X_i] = 0$ and $\|X_i\|_{\psi_2} \leqslant L$. Let $A = (a_{ij})$ be an $n \times n$ matrix. Then there exists a universal constant $C > 0$ such that for every $t \geqslant 0$,

\[
\mathbb{P}\big(|X^T A X - \mathbb{E}[X^T A X]| \geqslant t\big) \leqslant 2\exp\left[-C\min\left(\frac{t^2}{L^4\|A\|_{\mathrm{HS}}^2},\ \frac{t}{L^2\|A\|_{\mathrm{op}}}\right)\right], \tag{1}
\]

where $\|A\|_{\mathrm{HS}}$ is the Hilbert-Schmidt (i.e., Frobenius) norm of $A$ and $\|A\|_{\mathrm{op}}$ is the operator (i.e., spectral) norm of $A$.
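As a quick numerical illustration of the two matrix norms entering (1) (our sketch, not part of the paper; assumes numpy), one can simulate the quadratic form for standard Gaussian components, for which $\mathbb{E}[X^TAX] = \mathrm{tr}(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n))

# The two norms appearing in the tail bound (1).
hs_norm = np.linalg.norm(A, "fro")   # Hilbert-Schmidt (Frobenius) norm
op_norm = np.linalg.norm(A, 2)       # operator (spectral) norm

# Monte Carlo: quadratic forms X^T A X for X with i.i.d. N(0,1) components.
X = rng.standard_normal((100_000, n))
q = np.einsum("bi,ij,bj->b", X, A, X)

print(q.mean(), np.trace(A))         # E[X^T A X] = tr(A)
print(q.std(), hs_norm)              # fluctuations are on the scale of ||A||_HS
```

The Hilbert-Schmidt norm governs the Gaussian-like part of the tail in (1), while the operator norm governs the exponential part.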

There are several variants of the finite-dimensional Hanson-Wright inequality. Sharp upper and lower tail inequalities for quadratic forms in independent Gaussian random variables are derived in [15]. [20] and [4] derive Hanson-Wright inequalities for zero-diagonal matrices with independent Bernoulli and centered sub-gaussian random variables, respectively. [13] establishes an upper tail inequality for positive semidefinite quadratic forms in a sub-gaussian random vector with dependent components. [29] proves a dimension-dependent concentration inequality for a centered random vector satisfying the convex concentration property. [1] further improves the inequality of [29] by removing the dimension dependence.

In this paper, we first derive an infinite-dimensional analog of the Hanson-Wright inequality (1) for sub-gaussian random variables taking values in a Hilbert space, which can be seen as a unified generalization of the aforementioned finite-dimensional results. Our motivation for deriving the dimension-free Hanson-Wright inequality stems from the generalized K-means clustering problem for non-Euclidean data with non-linear features, which covers functional data clustering and kernel clustering as special cases. It is well known that (classical) Euclidean distance based K-means clustering is computationally NP-hard. Various SDP relaxations in the literature (cf. [18, 16, 7, 21, 10]) aim to provide exact and approximate recovery of the true clustering structure. However, it remains a challenging task to provide strong statistical guarantees for computationally tractable (i.e., polynomial-time) algorithms that cluster non-Euclidean data taking values in a general Hilbert space with non-linear features. As we shall see in Section 3, the Hilbert space version of the Hanson-Wright inequality offers a powerful tool to establish the exponential rate of convergence for an SDP relaxation of the generalized K-means. This partial recovery bound implies exact recovery of the generalized K-means clustering via a simple rounding algorithm. Thus our results settle a conjecture of [24] in the kernel clustering setting, where only a heuristic greedy algorithm was provided.

2. Hanson-Wright inequality in Hilbert spaces

To state the Hanson-Wright inequality in a general Hilbert space, we first need to properly specify the sub-gaussian random variables therein.

2.1. Sub-gaussian random variables in Hilbert spaces

Let $H$ be a real separable Hilbert space and $B(H)$ be the class of bounded linear operators $H \to H$. If the operator $\Sigma \in B(H)$ is positive definite (i.e., it is self-adjoint and $\langle\Sigma x, x\rangle \geqslant 0$ for all $x \in H$), then there is a unique positive definite (and thus self-adjoint) square root operator $\Sigma^{1/2}$ satisfying $\Sigma = \Sigma^{1/2}\Sigma^{1/2}$ (cf. Theorem 3.4.3 in [12]).

Definition 2.1 (Trace class of linear operators on a separable Hilbert space).

Let $\Sigma \in B(H)$. Then $\Sigma$ is trace class if

\[
\|\Sigma\|_{\mathrm{tr}} := \sum_{j=1}^\infty\big\langle(\Sigma^*\Sigma)^{1/2}e_j,\, e_j\big\rangle < \infty,
\]

where $\{e_j\}_{j=1}^\infty$ is a complete orthonormal system (CONS) of $H$. In this case, $\|\Sigma\|_{\mathrm{tr}}$ is the trace norm of $\Sigma$.

Note that the trace norm does not depend on the choice of the CONS. A self-adjoint and positive definite trace class linear operator is compact, and it plays a role similar to that of a covariance matrix, whose trace norm is simply its trace. In particular, if $\Sigma$ is positive definite trace class, then $\|\Sigma\|_{\mathrm{tr}} = \mathrm{tr}(\Sigma)$. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.
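In finite dimensions the trace norm of Definition 2.1 is the sum of singular values, which for a positive semidefinite operator equals the ordinary trace. A small numerical check (our illustration, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T                      # self-adjoint, positive semidefinite

# ||Sigma||_tr = sum_j <(Sigma* Sigma)^{1/2} e_j, e_j> = sum of singular values.
tr_norm = np.linalg.svd(Sigma, compute_uv=False).sum()

# For a positive semidefinite operator this equals the ordinary trace.
print(tr_norm, np.trace(Sigma))
```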

Definition 2.2 (Hilbert space valued sub-gaussian random variable).

Let $Z$ be a random variable in $H$ and $\Gamma : H \to H$ be a positive definite trace class linear operator. Then $Z$ is sub-gaussian with respect to $\Gamma$ (denoted as $Z \sim \mathrm{subG}(\Gamma)$) if there exists an $\alpha \geqslant 0$ such that for all $z \in H$,

\[
\mathbb{E}\big[e^{\langle z,\, Z-\mathbb{E}[Z]\rangle}\big] \leqslant e^{\alpha^2\langle\Gamma z, z\rangle/2}, \tag{2}
\]

where the expectation is defined as a Bochner integral (cf. Chapter 2.6 in [12]). Moreover, if $Z \sim \mathrm{subG}(\Gamma)$ with mean $\mu$, then the $\psi_2$ (or sub-gaussian) norm of $Z$ with respect to $\Gamma$ is defined as

\[
\|Z\|_{\psi_2,\Gamma} = \inf\big\{\alpha \geqslant 0 : \mathbb{E}\big[e^{\langle z, Z-\mu\rangle}\big] \leqslant e^{\alpha^2\langle\Gamma z,z\rangle/2}\ \ \forall z \in H\big\}.
\]

Note that Definition 2.2 corresponds to the $\Gamma$-sub-gaussianity in [2], and it is an infinite-dimensional analog of sub-gaussian random vectors in $\mathbb{R}^n$ (see, for example, [28] and [13]). Unsurprisingly, Gaussian random variables in $H$ are a special case of sub-gaussian random variables in $H$.

Definition 2.3 (Hilbert space valued Gaussian random variable).

A random variable $Z$ in $H$ is Gaussian with respect to $\Gamma$ and with mean $\mu$ (denoted as $Z \sim N(\mu, \Gamma)$) if for all $z \in H$,

\[
\mathbb{E}\big[e^{\langle z, Z-\mu\rangle}\big] = e^{\langle\Gamma z, z\rangle/2}. \tag{3}
\]

Lemma 2.4.

If $Z \sim N(\mu, \Gamma)$, then $\mathbb{E}[Z] = \mu$ and $\Gamma = \Sigma$, where $\Sigma$ is the covariance operator of $Z$. More generally, if $Z \sim \mathrm{subG}(\Gamma)$ with mean $\mu$ and $L = \|Z\|_{\psi_2,\Gamma}$, then $L^2\Gamma \succeq \Sigma$, i.e., $L^2\Gamma - \Sigma$ is positive semidefinite.
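The defining identity (3) can be checked by simulation in the finite-dimensional case $H = \mathbb{R}^d$ (our illustrative sketch, assuming numpy; the test direction $z$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
B = rng.standard_normal((d, d))
Gamma = B @ B.T / d                   # covariance operator on R^d
mu = np.array([1.0, -2.0, 0.5])

Z = rng.multivariate_normal(mu, Gamma, size=400_000)
z = np.array([0.3, -0.2, 0.1])        # a fixed test direction

mgf = np.exp((Z - mu) @ z).mean()     # E exp<z, Z - mu>
target = np.exp(0.5 * z @ Gamma @ z)  # exp(<Gamma z, z>/2)
print(mgf, target)
```

Consistent with Lemma 2.4, the empirical covariance of the sample also matches $\Gamma$.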

Notation. We shall use $C, C_1, C_2, \dots$ to denote positive and finite universal constants, whose values may vary from place to place. For $a, b \in \mathbb{R}$, denote $a \vee b = \max(a,b)$ and $a \wedge b = \min(a,b)$. For $A \in B(H)$, the operator norm $\|A\|_{\mathrm{op}}$ of $A$ is defined as the square root of the largest eigenvalue of $A^*A$. If $\sum_{j=1}^\infty\|Ae_j\|^2 < \infty$ for a CONS $\{e_j\}$ of $H$, then $A$ is a Hilbert-Schmidt (HS) operator and $\|A\|_{\mathrm{HS}} = (\sum_{j=1}^\infty\|Ae_j\|^2)^{1/2}$. For a matrix $A = (a_{ij})_{1\leqslant i,j\leqslant n}$, $\|A\|_{\mathrm{HS}} = (\sum_{i,j=1}^n a_{ij}^2)^{1/2}$.

2.2. Hanson-Wright inequality in Hilbert spaces

Throughout Section 2.2, we assume that $H$ is a real separable Hilbert space and $\Gamma : H \to H$ is a positive definite trace class linear operator. First, we present a Hanson-Wright inequality with zero diagonal in Proposition 2.5.

Proposition 2.5 (Hanson-Wright inequality for quadratic forms of sub-gaussian random variables in Hilbert spaces: zero diagonal).

Let $X_1, \dots, X_n$ be a sequence of independent centered random variables in $H$ such that $X_i \sim \mathrm{subG}(\Gamma)$. Let $A = (a_{ij})_{1\leqslant i,j\leqslant n}$ be an $n \times n$ matrix and $S = \sum_{i\neq j} a_{ij}\langle X_i, X_j\rangle$. Then there exists a universal constant $C > 0$ such that for any $t \geqslant 0$,

\[
\mathbb{P}(S \geqslant t) \leqslant \exp\left[-C\min\left(\frac{t^2}{L^4\|\Gamma\|_{\mathrm{HS}}^2\|A\|_{\mathrm{HS}}^2},\ \frac{t}{L^2\|\Gamma\|_{\mathrm{op}}\|A\|_{\mathrm{op}}}\right)\right], \tag{4}
\]

where $L = \max_{1\leqslant i\leqslant n}\|X_i\|_{\psi_2,\Gamma}$.

Proposition 2.5 is a dimension-free version of the Hanson-Wright inequality with a zero-diagonal weighting matrix for independent sub-gaussian random variables in [22]. Specifically, the off-diagonal part of Theorem 1.1 (i.e., Theorem 1.1 in [22]) is a special case of Proposition 2.5 with $H = \mathbb{R}$ and $\Gamma = 1$, in which case $\|\Gamma\|_{\mathrm{HS}} = \|\Gamma\|_{\mathrm{op}} = 1$ and $\|X_i\|_{\psi_2,\Gamma}$ reduces to the usual sub-gaussian norm $\|X_i\|_{\psi_2}$. Different from Theorem 1.1, Proposition 2.5 is also able to capture the dependence encoded in $\Gamma$ for general Hilbert spaces, thus covering certain quadratic forms in a finite-dimensional sub-gaussian random vector with dependent components. Our next result is an upper tail inequality (i.e., a one-sided Hanson-Wright inequality) with non-negative diagonal weights, stated in Theorem 2.6 below.

Theorem 2.6 (Upper tail inequality for quadratic forms of sub-gaussian random variables in Hilbert spaces: non-negative diagonal).

Let $X_1, \dots, X_n$ be a sequence of independent centered random variables in $H$ such that $X_i \sim \mathrm{subG}(\Gamma)$ with $L_i = \|X_i\|_{\psi_2,\Gamma}$. Let $A = (a_{ij})_{1\leqslant i,j\leqslant n}$ be an $n \times n$ matrix such that $a_{ii} \geqslant 0$ for all $i = 1,\dots,n$, and $Q = \sum_{i,j=1}^n a_{ij}\langle X_i, X_j\rangle$. Then there exists a universal constant $C > 0$ such that for any $t \geqslant 0$,

\[
\mathbb{P}\Big(Q \geqslant \sum_{i=1}^n a_{ii}L_i^2\|\Gamma\|_{\mathrm{tr}} + t\Big) \leqslant 2\exp\left[-C\min\left(\frac{t^2}{L^4\|\Gamma\|_{\mathrm{HS}}^2\|A\|_{\mathrm{HS}}^2},\ \frac{t}{L^2\|\Gamma\|_{\mathrm{op}}\|A\|_{\mathrm{op}}}\right)\right], \tag{5}
\]

where $L = \max_{1\leqslant i\leqslant n} L_i$.

Both Proposition 2.5 and Theorem 2.6 allow $X_1, \dots, X_n$ to have different covariance operators $\Sigma_i = \mathrm{Cov}(X_i)$, provided that $\Sigma_i \preceq L_i^2\,\Gamma$ (cf. Lemma 2.4).

Remark 2.7 (Connections to the existing upper tail inequality in finite-dimensional Euclidean spaces).

For non-negative diagonal weights, Theorem 2.6 is an infinite-dimensional (and thus dimension-free) generalization of the tail inequality for quadratic forms in a sub-gaussian random vector with dependent components in [13]. In particular, if $X$ is a centered sub-gaussian random vector in $\mathbb{R}^p$ (i.e., there exists a $\sigma \geqslant 0$ such that $\mathbb{E}[e^{\langle z,X\rangle}] \leqslant e^{\sigma^2\|z\|^2/2}$ for all $z \in \mathbb{R}^p$), then Theorem 2.1 in [13] states that: for any positive semidefinite matrix $\Gamma$ and $t \geqslant 0$,

\[
\mathbb{P}\Big(X^T\Gamma X \geqslant \sigma^2\big(\|\Gamma\|_{\mathrm{tr}} + 2\|\Gamma\|_{\mathrm{HS}}\sqrt{t} + 2\|\Gamma\|_{\mathrm{op}}\,t\big)\Big) \leqslant e^{-t}.
\]

The last inequality is a special case (up to a universal constant) of (5) with $n = 1$, $a_{11} = 1$, $X_1 = \Gamma^{1/2}X$, $L_1 = \sigma$, and $H = \mathbb{R}^p$. In addition, we note that positive semidefiniteness of the weighting matrix is not needed in our Theorem 2.6. Instead, only a weaker condition on the non-negativity of the diagonal entries of the weighting matrix is required.
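For intuition, the displayed bound from [13] can be checked by simulation in a small Gaussian case with $\sigma = 1$ (our illustrative check, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, t = 10, 1.0
Gamma = np.diag(np.linspace(0.1, 1.0, d))   # a positive semidefinite matrix

tr = np.trace(Gamma)
hs = np.linalg.norm(Gamma, "fro")
op = np.linalg.norm(Gamma, 2)
threshold = tr + 2 * hs * np.sqrt(t) + 2 * op * t   # sigma = 1 for X ~ N(0, I_d)

X = rng.standard_normal((200_000, d))
q = np.einsum("bi,ij,bj->b", X, Gamma, X)
tail = (q >= threshold).mean()
print(tail, np.exp(-t))   # empirical tail should not exceed e^{-t}
```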

There are two limitations of Theorem 2.6. First, the centering term $\sum_{i=1}^n a_{ii}L_i^2\|\Gamma\|_{\mathrm{tr}}$ in (5) is typically not equal to $\mathbb{E}[Q]$. For the generalized K-means application in Section 3, this means that consistency of solutions of the SDP relaxation (13) cannot be attained unless the gap $\sum_{i=1}^n a_{ii}L_i^2\|\Gamma\|_{\mathrm{tr}} - \mathbb{E}[Q]$ tends to $0$. Second, the non-negativity condition on the diagonal weights in Theorem 2.6 is not entirely innocuous for obtaining a concentration inequality for $Q$ (i.e., a two-sided Hanson-Wright inequality). Without imposing additional assumptions, we cannot expect a lower tail bound for sub-gaussian random variables even in $\mathbb{R}$ [1]. To simultaneously fix these two issues and obtain a concentration inequality for $Q$, we impose the following Bernstein-type condition on the squared norm, in addition to the assumption that $X_1, \dots, X_n$ are independent with mean zero.

Assumption 2.8 (Bernstein condition on the squared norm).

There exists a universal constant $C > 0$ such that

\[
\mathbb{E}\big|\|X_i\|^2 - \mathbb{E}\|X_i\|^2\big|^k \leqslant C\,k!\,L_i^{2(k-2)}\|\Gamma\|_{\mathrm{op}}^{k-2}\|\Sigma_i\|_{\mathrm{HS}}^2 \qquad \forall k = 3, 4, \dots, \tag{6}
\]

where $\Sigma_i$ is the covariance operator of $X_i$.

Remark 2.9 (Comments on Assumption 2.8).

Since $X_i \sim \mathrm{subG}(\Gamma)$, Assumption 2.8 is a mild condition on the sub-exponential tail behavior of $\|X_i\|^2$. When $\|\Sigma_i\|_{\mathrm{HS}}$ is of the same order as $L_i^2\|\Gamma\|_{\mathrm{HS}}$, (6) is an automatic consequence of the sub-gaussianity (2). For $H = \mathbb{R}^p$, if $X = \Sigma^{1/2}Z$, where $Z$ has independent components with bounded sub-gaussian norms, then

\[
\mathbb{E}\big[\|X\|^2 - \mathbb{E}\|X\|^2\big]^2 = \mathbb{E}\big[Z^T\Sigma Z - \mathrm{tr}(\Sigma)\big]^2 \lesssim \|\Sigma\|_{\mathrm{HS}}^2.
\]
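For standard Gaussian $Z$ the displayed second moment is in fact exact: $\mathrm{Var}(Z^T\Sigma Z) = 2\|\Sigma\|_{\mathrm{HS}}^2$ for symmetric $\Sigma$. A small Monte Carlo confirmation (our sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
B = rng.standard_normal((d, d))
Sigma = B @ B.T                       # covariance of X = Sigma^{1/2} Z

Z = rng.standard_normal((500_000, d))
q = np.einsum("bi,ij,bj->b", Z, Sigma, Z)

second_moment = ((q - np.trace(Sigma)) ** 2).mean()   # E[Z^T Sigma Z - tr(Sigma)]^2
hs_sq = np.linalg.norm(Sigma, "fro") ** 2
print(second_moment, 2 * hs_sq)       # equality holds for Gaussian Z
```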

Such a linear transformation of an independent random vector in $\mathbb{R}^p$ with sub-gaussian components is a popular statistical model for K-means clustering [10, 21]. For a general Hilbert space $H$, it is easy to verify that a Gaussian random variable in $H$ satisfies (6). Comparing with the “centering” term in (5), we shall see that the correct centering terms $\mathbb{E}\|X_i\|^2$ in (6), together with the parameters $\|\Sigma_i\|_{\mathrm{HS}}$, are crucial to yield a concentration inequality for $Q$. By Lemma 2.4, we know that $\|\Sigma_i\|_{\mathrm{HS}} \leqslant L_i^2\|\Gamma\|_{\mathrm{HS}}$ for any $i$. In fact, even in $\mathbb{R}$, it is easy to construct a random variable whose variance is of strictly smaller order than its squared sub-gaussian norm (cf. Examples 4.1 and 4.2 in [6]). In particular, here we give a counterexample in $\mathbb{R}$ (so that $\Gamma$ is a scalar). Let $Y_n$ follow a mixture of Gaussian distributions $(1 - a_n^{-4})\,N(0,1) + a_n^{-4}\,N(0, a_n^2)$ with $a_n \geqslant 1$, and let $Z \sim N(0,1)$. Then we have $\sigma_n^2 := \mathrm{Var}(Y_n) \asymp 1$ and $\|Y_n\|_{\psi_2}^2 \leqslant \gamma_n^2$, where $\gamma_n^2 = C a_n^2$ for some sufficiently large constant $C$. Thus if $a_n \to \infty$ as $n \to \infty$, then $\sigma_n^2 \ll \gamma_n^2$ and

\[
\mathbb{E}\big|Y_n^2 - \mathbb{E}Y_n^2\big|^k \lesssim a_n^{2k-4}\,\mathbb{E}|Z|^{2k} = a_n^{2k-4}(2k-1)!! \leqslant 4\,k!\,(2a_n^2)^{k-2} \lesssim k!\,(\gamma_n^2)^{k-2}(\sigma_n^2)^2,
\]

where $(2k-1)!! = (2k-1)(2k-3)\cdots 3\cdot 1$. Hence $Y_n$ is a sub-gaussian random variable satisfying Assumption 2.8 with $\sigma_n^2 = o(\gamma_n^2)$, provided that $a_n \to \infty$ as $n \to \infty$.

Now we are ready to state the Hanson-Wright inequality for the general case.

Theorem 2.10 (Hanson-Wright inequality for quadratic forms of sub-gaussian random variables in Hilbert spaces: general version).

Let $X_1, \dots, X_n$ be a sequence of independent centered random variables in $H$ such that $X_i \sim \mathrm{subG}(\Gamma)$. Let $A = (a_{ij})_{1\leqslant i,j\leqslant n}$ be an $n \times n$ matrix and $Q = \sum_{i,j=1}^n a_{ij}\langle X_i, X_j\rangle$. If in addition Assumption 2.8 holds, then there exists a universal constant $C > 0$ such that for any $t \geqslant 0$,

\[
\mathbb{P}\big(|Q - \mathbb{E}[Q]| \geqslant t\big) \leqslant 2\exp\left[-C\min\left(\frac{t^2}{L^4\|\Gamma\|_{\mathrm{HS}}^2\|A\|_{\mathrm{HS}}^2},\ \frac{t}{L^2\|\Gamma\|_{\mathrm{op}}\|A\|_{\mathrm{op}}}\right)\right], \tag{7}
\]

where $L = \max_{1\leqslant i\leqslant n}\|X_i\|_{\psi_2,\Gamma}$.

[29] and [1] derive Hanson-Wright inequalities under the convex concentration property of a finite-dimensional random vector, which is difficult to verify in general. In contrast, our Theorem 2.10 holds under more transparent conditions (i.e., the sub-gaussian and Bernstein-type assumptions). Note that Theorem 2.10 can be seen as a unified generalization to Hilbert spaces of the finite-dimensional Hanson-Wright inequality, both for independent sub-gaussian random variables [22] and for a sub-gaussian random vector with dependent components [13].

2.3. Proof of the main results in Section 2.2

In this section, we prove Proposition 2.5 and Theorems 2.6 and 2.10. Before proceeding to the rigorous proofs, we would like to mention that, although our general proof strategy follows that of Theorem 1.1 in [22], two new ingredients are needed to accommodate the Hilbert space structure.

First, we diagonalize the operator $\Gamma$ (together with the decoupling) in order to perform the calculations in a space isometric to $H$, where linear operators can be conveniently represented by (infinite-dimensional) matrices. This turns out to be the crux of obtaining the trade-off between $\|\Gamma\|_{\mathrm{HS}}\|A\|_{\mathrm{HS}}$ and $\|\Gamma\|_{\mathrm{op}}\|A\|_{\mathrm{op}}$ in the tail probability bound for the off-diagonal sum in $Q$.

Second, we derive a sharp upper tail probability bound for the non-negatively weighted diagonal sum of squared norms of independent sub-gaussian random variables in $H$ (cf. Lemma 4.2). If we simply apply Bernstein’s inequality (cf. Theorem 2.8.1 in [28]) to the real-valued sub-exponential random variables $\|X_i\|^2 - \mathbb{E}\|X_i\|^2$ (cf. Lemma 4.4), then the diagonal sum in $Q$ obeys the following probability bound: for any $t \geqslant 0$,

\[
\mathbb{P}\Big(\Big|\sum_{i=1}^n a_{ii}\big(\|X_i\|^2 - \mathbb{E}\|X_i\|^2\big)\Big| \geqslant t\Big) \leqslant 2\exp\left[-C\min\left(\frac{t^2}{L^4\|\Gamma\|_{\mathrm{tr}}^2\sum_{i=1}^n a_{ii}^2},\ \frac{t}{L^2\|\Gamma\|_{\mathrm{tr}}\max_{1\leqslant i\leqslant n}|a_{ii}|}\right)\right]. \tag{8}
\]

Note that the right-hand side of (8) is controlled by the single parameter $\|\Gamma\|_{\mathrm{tr}}$, which is strictly less sharp than (4) since $\|\Gamma\|_{\mathrm{HS}} \leqslant \|\Gamma\|_{\mathrm{tr}}$ and $\|\Gamma\|_{\mathrm{op}} \leqslant \|\Gamma\|_{\mathrm{tr}}$. For instance, if $H = \mathbb{R}^p$, then $\Gamma$ is often taken as the covariance matrix of the $X_i$. In the special case $\Gamma = \mathrm{I}_p$, we have $\|\Gamma\|_{\mathrm{tr}} = p$, $\|\Gamma\|_{\mathrm{HS}} = \sqrt{p}$, and $\|\Gamma\|_{\mathrm{op}} = 1$. Therefore, a direct application of the diagonal sum bound (8) does not yield the probability bound in Theorem 2.6. In particular, for the generalized K-means clustering problem, this would imply that a much more restrictive lower bound condition on the signal-to-noise ratio is required for exact recovery of the true clustering structure for high-dimensional data (more details can be found in the discussion after Theorem 3.3).
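The norm separation just described is easy to see numerically: in the isotropic case $\Gamma = \mathrm{I}_p$ the three norms scale as $p$, $\sqrt{p}$, and $1$ (a quick check, assuming numpy):

```python
import numpy as np

p = 100
Gamma = np.eye(p)

tr_norm = np.trace(Gamma)                 # ||Gamma||_tr = p
hs_norm = np.linalg.norm(Gamma, "fro")    # ||Gamma||_HS = sqrt(p)
op_norm = np.linalg.norm(Gamma, 2)        # ||Gamma||_op = 1
print(tr_norm, hs_norm, op_norm)          # → 100.0 10.0 1.0
```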

Proof of Proposition 2.5.

By Markov’s inequality, we have for any $\lambda > 0$ and $t \geqslant 0$,

\[
\mathbb{P}(S \geqslant t) \leqslant e^{-\lambda t}\,\mathbb{E}\big[e^{\lambda S}\big].
\]

Step 1: decoupling. Let $\delta_1, \dots, \delta_n$ be i.i.d. Bernoulli random variables with $\mathbb{P}(\delta_i = 0) = \mathbb{P}(\delta_i = 1) = 1/2$ that are independent of $X_1, \dots, X_n$. Since

\[
\mathbb{E}[\delta_i(1-\delta_j)] = \begin{cases} 0 & \text{if } i = j, \\ 1/4 & \text{if } i \neq j, \end{cases}
\]

we have $S = 4\,\mathbb{E}_\delta[S_\delta]$, where $S_\delta = \sum_{i\neq j}\delta_i(1-\delta_j)\,a_{ij}\langle X_i, X_j\rangle$ and $\mathbb{E}_\delta$ is the expectation taken with respect to the random variables $\delta_1,\dots,\delta_n$. Below, $\mathbb{E}_X$ is similarly defined. By Jensen’s inequality, we get

\[
\mathbb{E}\big[e^{\lambda S}\big] \leqslant \mathbb{E}_{X,\delta}\big[e^{4\lambda S_\delta}\big].
\]

Let $\Lambda_\delta = \{i \in [n] : \delta_i = 1\}$. Then we can write

\[
S_\delta = \sum_{i\in\Lambda_\delta}\sum_{j\in\Lambda_\delta^c} a_{ij}\langle X_i, X_j\rangle = \sum_{j\in\Lambda_\delta^c}\Big\langle\sum_{i\in\Lambda_\delta}a_{ij}X_i,\; X_j\Big\rangle.
\]

Taking the expectation with respect to $(X_j)_{j\in\Lambda_\delta^c}$ (i.e., conditioning on $\delta$ and $(X_i)_{i\in\Lambda_\delta}$), it follows from the assumption that $X_1,\dots,X_n$ are independent with mean zero that

\[
\mathbb{E}_{(X_j)_{j\in\Lambda_\delta^c}}\big[e^{4\lambda S_\delta}\big] \leqslant e^{8\lambda^2\sigma_\delta^2},
\]

where $\sigma_\delta^2 = \sum_{j\in\Lambda_\delta^c} L_j^2\big\langle\Gamma\big(\sum_{i\in\Lambda_\delta}a_{ij}X_i\big),\,\sum_{i\in\Lambda_\delta}a_{ij}X_i\big\rangle$. Thus we get

\[
\mathbb{E}_X\big[e^{4\lambda S_\delta}\big] \leqslant \mathbb{E}_X\big[e^{8\lambda^2\sigma_\delta^2}\big].
\]
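The decoupling identity $S = 4\,\mathbb{E}_\delta[S_\delta]$ behind Step 1 can be checked exhaustively for small $n$ (an illustrative sketch, assuming numpy; the Gram matrix below stands in for the inner products $\langle X_i, X_j\rangle$):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n = 6
A = rng.standard_normal((n, n))
Xmat = rng.standard_normal((n, 3))
C = Xmat @ Xmat.T                      # Gram matrix: C[i, j] = <X_i, X_j>

off = ~np.eye(n, dtype=bool)
S = (A * C)[off].sum()                 # S = sum_{i != j} a_ij <X_i, X_j>

# Average S_delta over all 2^n patterns of the Bernoulli(1/2) variables delta.
total = 0.0
for bits in itertools.product([0, 1], repeat=n):
    delta = np.array(bits)
    w = np.outer(delta, 1 - delta)     # weight delta_i (1 - delta_j)
    total += (A * C * w)[off].sum()

S_delta_mean = total / 2 ** n
print(S_delta_mean, S / 4)             # decoupling: E_delta[S_delta] = S / 4
```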

Step 2: reduction to Gaussian random variables. For $j = 1, \dots, n$, let $g_j \sim N(0, 16L_j^2\,\Gamma)$ be independent Gaussian random variables in $H$ that are independent of $X_1,\dots,X_n$ and $\delta_1,\dots,\delta_n$. Define

\[
T := \sum_{j\in\Lambda_\delta^c}\Big\langle g_j,\; \sum_{i\in\Lambda_\delta}a_{ij}X_i\Big\rangle.
\]

Then, by the definition of Gaussian random variables in $H$, we have

\[
\mathbb{E}_g\big[e^{\lambda T}\big] = \prod_{j\in\Lambda_\delta^c}\mathbb{E}_g\Big[e^{\langle g_j,\,\lambda\sum_{i\in\Lambda_\delta}a_{ij}X_i\rangle}\Big] = \exp\Big(8\lambda^2\sum_{j\in\Lambda_\delta^c}L_j^2\Big\langle\Gamma\Big(\sum_{i\in\Lambda_\delta}a_{ij}X_i\Big),\;\sum_{i\in\Lambda_\delta}a_{ij}X_i\Big\rangle\Big) = \exp\big(8\lambda^2\sigma_\delta^2\big).
\]

So it follows that

\[
\mathbb{E}_X\big[e^{4\lambda S_\delta}\big] \leqslant \mathbb{E}_{X,g}\big[e^{\lambda T}\big].
\]

Since $T = \sum_{i\in\Lambda_\delta}\big\langle X_i,\,\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\big\rangle$ and the $X_i \sim \mathrm{subG}(\Gamma)$ are independent, we have

\[
\mathbb{E}_{(X_i)_{i\in\Lambda_\delta}}\big[e^{\lambda T}\big] \leqslant \exp\Big(\frac{\lambda^2}{2}\sum_{i\in\Lambda_\delta}L_i^2\Big\langle\Gamma\Big(\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\Big),\;\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\Big\rangle\Big),
\]

which implies that

\[
\mathbb{E}_X\big[e^{4\lambda S_\delta}\big] \leqslant \mathbb{E}_g\big[\exp\big(\lambda^2\tau_\delta^2/2\big)\big], \tag{9}
\]

where $\tau_\delta^2 = \sum_{i\in\Lambda_\delta}L_i^2\big\langle\Gamma\big(\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\big),\,\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\big\rangle$.

Step 3: diagonalization. Since $\Gamma$ is trace class (thus compact) and positive definite, it follows from Theorem 4.2.4 in [12] that the eigendecomposition of $\Gamma$ is given by

\[
\Gamma = \sum_{k=1}^\infty \gamma_k\,(e_k \otimes e_k),
\]

where $\gamma_1 \geqslant \gamma_2 \geqslant \cdots > 0$ are the eigenvalues of $\Gamma$ and $\{e_k\}_{k=1}^\infty$ are eigenfunctions forming a CONS of $\overline{\mathrm{Im}(\Gamma)}$; namely $\Gamma e_k = \gamma_k e_k$ for every $k$. Here, $\otimes$ denotes the tensor product and $\overline{\mathrm{Im}(\Gamma)}$ denotes the closure of the image of $\Gamma$. In addition, there exists a unique positive definite square root operator $\Gamma^{1/2}$ such that $\Gamma = \Gamma^{1/2}\Gamma^{1/2}$ (cf. Theorem 3.4.3 in [12]). Then we have $\Gamma^{1/2} = \sum_{k=1}^\infty\gamma_k^{1/2}(e_k\otimes e_k)$ and

\[
\begin{aligned}
\tau_\delta^2 &= \sum_{i\in\Lambda_\delta}L_i^2\Big\langle\Gamma^{1/2}\Big(\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\Big),\;\Gamma^{1/2}\Big(\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\Big)\Big\rangle
= \sum_{i\in\Lambda_\delta}L_i^2\Big\|\Gamma^{1/2}\Big(\sum_{j\in\Lambda_\delta^c}a_{ij}g_j\Big)\Big\|^2 \\
&= \sum_{i\in\Lambda_\delta}L_i^2\Big\|\sum_{j\in\Lambda_\delta^c}a_{ij}\,\Gamma^{1/2}g_j\Big\|^2
= \sum_{i\in\Lambda_\delta}L_i^2\Big\|\sum_{k=1}^\infty\gamma_k^{1/2}\Big(\sum_{j\in\Lambda_\delta^c}a_{ij}\langle g_j, e_k\rangle\Big)e_k\Big\|^2 \\
&= \sum_{k=1}^\infty\gamma_k\sum_{i\in\Lambda_\delta}\Big(\sum_{j\in\Lambda_\delta^c}L_i\,a_{ij}\langle g_j, e_k\rangle\Big)^2,
\end{aligned}
\]

where the last step follows from Parseval’s identity. Note that

\[
\|\Gamma^{1/2}e_k\|^2 = \langle\Gamma e_k, e_k\rangle = \langle\gamma_k e_k, e_k\rangle = \gamma_k.
\]

Thus for any $\lambda \in \mathbb{R}$,

\[
\mathbb{E}\,e^{\lambda\langle g_j, e_k\rangle} = e^{8L_j^2\lambda^2\langle\Gamma e_k, e_k\rangle} = e^{8L_j^2\lambda^2\|\Gamma^{1/2}e_k\|^2} = e^{8L_j^2\lambda^2\gamma_k},
\]

which implies that $\langle g_j, e_k\rangle \sim N(0, 16L_j^2\gamma_k)$, $k = 1, 2, \dots$, are independent random variables. Now let $f = (f_1, f_2, \dots)$, where $f_k = \gamma_k^{1/2}G_{\cdot k}$ for $G_{\cdot k} = (G_{1k},\dots,G_{nk})^T$ with $G_{jk} = \langle g_j, e_k\rangle$. Then $\tau_\delta^2 = \sum_{k=1}^\infty\sum_{i\in\Lambda_\delta}\big(\sum_{j\in\Lambda_\delta^c}L_i\,a_{ij}f_{jk}\big)^2$, where $\tilde\Gamma := \mathrm{Cov}(f)$ with blocks $\tilde\Gamma_{km} = \gamma_k^{1/2}\gamma_m^{1/2}\,\mathbb{E}[G_{\cdot k}G_{\cdot m}^T]$. Note that

\[
\mathbb{E}[G_{jk}G_{jm}] = \mathbb{E}\big[\langle g_j, e_k\rangle\langle g_j, e_m\rangle\big] = \big\langle(\mathbb{E}[g_j\otimes g_j])e_k,\, e_m\big\rangle = 16L_j^2\langle\Gamma e_k, e_m\rangle = 16L_j^2\langle\gamma_k e_k, e_m\rangle = 16L_j^2\gamma_k\,\mathbf{1}(k=m).
\]

Thus $\tilde\Gamma_{km}$ is an $n\times n$ matrix of all zeros if $k \neq m$, and $\tilde\Gamma_{kk} = 16\gamma_k^2\,\mathrm{diag}(L_1^2,\dots,L_n^2)$.
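The diagonalization identities of Step 3 are finite-dimensional linear algebra and can be spot-checked numerically (our sketch, assuming numpy): $\|\Gamma^{1/2}e_k\|^2 = \gamma_k$ and the Parseval step $\|\Gamma^{1/2}v\|^2 = \sum_k\gamma_k\langle v, e_k\rangle^2$:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
B = rng.standard_normal((d, d))
Gamma = B @ B.T                        # positive definite "covariance operator"

gam, E = np.linalg.eigh(Gamma)         # eigenvalues gam[k], eigenvectors E[:, k]
sqrt_Gamma = E @ np.diag(np.sqrt(gam)) @ E.T   # unique p.d. square root

# ||Gamma^{1/2} e_k||^2 = <Gamma e_k, e_k> = gam_k
norms_sq = np.sum((sqrt_Gamma @ E) ** 2, axis=0)

# Parseval: ||Gamma^{1/2} v||^2 = sum_k gam_k <v, e_k>^2
v = rng.standard_normal(d)
lhs = np.sum((sqrt_Gamma @ v) ** 2)
rhs = np.sum(gam * (E.T @ v) ** 2)
print(norms_sq, gam)
print(lhs, rhs)
```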

Step 4: bound the eigenvalues. Let $P_\delta$ be the $n\times n$ restriction matrix such that $(P_\delta)_{ii} = 1$ if $i \in \Lambda_\delta$ and $(P_\delta)_{ij} = 0$ otherwise. Let further $\tilde A = \mathrm{diag}(L_1,\dots,L_n)\,A$ and $R_\delta = \mathrm{diag}\big(P_\delta\tilde A(I_n - P_\delta),\, P_\delta\tilde A(I_n - P_\delta),\,\dots\big)$, so that $f \overset{d}{=} \tilde\Gamma^{1/2}Z$, where $Z = (Z_1, Z_2, \dots)$ and the $Z_k$ are i.i.d. standard Gaussian random variables in $\mathbb{R}$. By the rotational invariance of Gaussian distributions, we have

\[
\tau_\delta^2 = \|R_\delta f\|^2 \overset{d}{=} \big\|R_\delta\tilde\Gamma^{1/2}Z\big\|^2 = Z^T\tilde\Gamma^{1/2}R_\delta^T R_\delta\tilde\Gamma^{1/2}Z \overset{d}{=} \sum_{k=1}^\infty s_k^2 Z_k^2,
\]

where $s_k^2$ are the eigenvalues of $\tilde\Gamma^{1/2}R_\delta^T R_\delta\tilde\Gamma^{1/2}$. So it follows that

\[
\max_k s_k^2 \leqslant \|R_\delta\|_{\mathrm{op}}^2\|\tilde\Gamma\|_{\mathrm{op}} \leqslant \|\tilde A\|_{\mathrm{op}}^2\|\tilde\Gamma\|_{\mathrm{op}} \leqslant L^2\|A\|_{\mathrm{op}}^2\|\tilde\Gamma\|_{\mathrm{op}},
\]

where

\[
\|\tilde\Gamma\|_{\mathrm{op}} \leqslant 16\Big(\max_{1\leqslant j\leqslant n}\|X_j\|_{\psi_2,\Gamma}^2\Big)\Big(\max_k\gamma_k^2\Big) \leqslant 16L^2\|\Gamma\|_{\mathrm{op}}^2.
\]

In addition,

\[
\sum_k s_k^2 = \mathrm{tr}\big(\tilde\Gamma^{1/2}R_\delta^T R_\delta\tilde\Gamma^{1/2}\big) = \mathrm{tr}\big(R_\delta\tilde\Gamma R_\delta^T\big) = \sum_{k=1}^\infty\mathrm{tr}\big([P_\delta\tilde A(I_n - P_\delta)]\,\tilde\Gamma_{kk}\,[P_\delta\tilde A(I_n - P_\delta)]^T\big) \leqslant \sum_{k=1}^\infty 16L^2\gamma_k^2\big\|P_\delta\tilde A(I_n - P_\delta)\big\|_{\mathrm{HS}}^2 \leqslant \sum_{k=1}^\infty 16L^2\gamma_k^2\|\tilde A\|_{\mathrm{HS}}^2 \leqslant 16L^4\|\Gamma\|_{\mathrm{HS}}^2\|A\|_{\mathrm{HS}}^2.
\]

Invoking (9), we get

\[
\mathbb{E}_X\big[e^{4\lambda S_\delta}\big] \leqslant \prod_{k=1}^\infty\mathbb{E}_Z\big[\exp\big(\lambda^2 s_k^2 Z_k^2/2\big)\big].
\]

Since the $Z_k$ are i.i.d. $N(0,1)$ random variables with the moment generating function $\mathbb{E}[e^{sZ_k^2}] = (1-2s)^{-1/2}$ for $s < 1/2$, we have

\[
\mathbb{E}_X\big[e^{4\lambda S_\delta}\big] \leqslant \prod_{k=1}^\infty\frac{1}{\sqrt{1-\lambda^2 s_k^2}}, \qquad\text{if } \max_k\,\lambda^2 s_k^2 < 1.
\]

Using $(1-x)^{-1/2} \leqslant e^x$ for $x \in [0, 1/2]$, we get that if $\max_k\lambda^2 s_k^2 \leqslant 1/2$, then

\[
\mathbb{E}_X\big[e^{4\lambda S_\delta}\big] \leqslant \exp\Big(\lambda^2\sum_{k=1}^\infty s_k^2\Big) \leqslant \exp\big(16\lambda^2 L^4\|\Gamma\|_{\mathrm{HS}}^2\|A\|_{\mathrm{HS}}^2\big).
\]

Note that the last inequality is uniform in $\delta$. Taking the expectation with respect to $\delta$, we obtain that

\[
\mathbb{E}_X\big[e^{\lambda S}\big] \leqslant \mathbb{E}_{X,\delta}\big[e^{4\lambda S_\delta}\big] \leqslant \exp\big(16\lambda^2 L^4\|\Gamma\|_{\mathrm{HS}}^2\|A\|_{\mathrm{HS}}^2\big).
\]
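Two scalar facts used in the last steps can be verified directly (our check, not from the paper): the $\chi_1^2$ moment generating function $\mathbb{E}[e^{sZ^2}] = (1-2s)^{-1/2}$ and the elementary bound $(1-x)^{-1/2} \leqslant e^x$ on $[0, 1/2]$:

```python
import numpy as np

# Elementary bound used to linearize the product over k:
# (1 - x)^{-1/2} <= e^x for all x in [0, 1/2].
x = np.linspace(0.0, 0.5, 1001)
bound_holds = np.all((1.0 - x) ** -0.5 <= np.exp(x) + 1e-12)
print(bound_holds)

# chi-square MGF: E[exp(s Z^2)] = (1 - 2s)^{-1/2} for s < 1/2.
rng = np.random.default_rng(7)
s = 0.2
Z = rng.standard_normal(1_000_000)
mgf = np.exp(s * Z ** 2).mean()
print(mgf, (1.0 - 2.0 * s) ** -0.5)
```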