Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: A New Take on Robust Interpolation

We study the supervised clustering problem under the two-component anisotropic Gaussian mixture model in high dimensions and in the non-asymptotic setting. We first derive a lower bound and a matching upper bound for the minimax risk of clustering in this framework. We also show that in the high-dimensional regime, the linear discriminant analysis (LDA) classifier turns out to be sub-optimal in the minimax sense. Next, we characterize precisely the risk of ℓ_2-regularized supervised least squares classifiers. We deduce that the interpolating solution may outperform the regularized classifier under mild assumptions on the covariance structure of the noise. Our analysis also shows that interpolation can be robust to corruption in the covariance of the noise when the signal is aligned with the "clean" part of the covariance, for a properly defined notion of alignment. To the best of our knowledge, this peculiar phenomenon has not yet been investigated in the rapidly growing literature on interpolation. We conclude that interpolation is not only benign but can also be optimal and, in some cases, robust.

1 Introduction

The topic of overparametrization has gained tremendous attention in the recent literature on problems in high-dimensional statistics. It was long believed that regularization yields the best generalization performance. Recently, it was discovered that estimators that interpolate the training data can also achieve good generalization error when the number of covariates exceeds the sample size. This phenomenon, termed "benign overfitting" in [2], has been extensively investigated in the regression setting. In this work, we study the problem of clustering. In particular, we derive bounds for the generalization error under different sets of assumptions. The model we consider is a binary sub-Gaussian mixture model with unknown, anisotropic noise.

1.1 Statement of the problem

Consider the simple two-component Gaussian mixture model, where we observe, for all indices, pairs of observations and labels such that

(1)

where is a center vector, a label vector and a random matrix with i.i.d. columns that are sub-Gaussian. More precisely, given the spectral decomposition , we assume that

where has components that are independent and -sub-Gaussian; that is, for all ,

where denotes the Euclidean norm. Moreover, we will always assume that has full rank. We mostly focus on the supervised setting, where we are given a classifier based on the training data and a new independent observation such that is a Rademacher random variable taking values ±1 with probability 1/2 each. We want to analyze the generalization error defined as

both in probability and expectation, where is the probability under the model described above. Observe that

When there is no ambiguity, we will omit the subscript from . In particular, we want to analyze the minimax risk

and study the necessary and sufficient conditions on for consistent clustering, i.e. conditions on such that

The case when was investigated in [12]. In particular, it was shown that

When is known and the noise is normal, it is easy to see that follows the isotropic Gaussian mixture model, where the signal vector is given by . Following reasoning similar to [12], we can show in the general case that

where . In particular, the condition is necessary and sufficient for consistency under the norm . Moreover, the minimax optimal classifier is the Linear Discriminant Analysis (LDA) classifier given by

The LDA estimator and its adaptive variants have been studied in several recent works, including [15, 8, 7] and [5]. All of these papers consider anisotropic mixtures in the low-dimensional case. In this work we are interested in the case of unknown in high dimensions (i.e. ), where estimation of both and becomes challenging.
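To make the setting concrete, here is a minimal simulation sketch of a two-component anisotropic Gaussian mixture, assuming the standard parametrization in which each observation equals its label times the center vector plus centered Gaussian noise with covariance Sigma. The names (n, p, mu, Sigma) and the particular spectrum are illustrative choices, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                          # overparametrized regime: p > n
mu = np.zeros(p)
mu[0] = 3.0                             # center (signal) vector
eigvals = 1.0 / np.arange(1, p + 1)     # a hypothetical anisotropic spectrum
Sigma_sqrt = np.diag(np.sqrt(eigvals))  # Sigma = diag(eigvals)

eta = rng.choice([-1, 1], size=n)                 # Rademacher labels
E = rng.standard_normal((n, p)) @ Sigma_sqrt      # rows ~ N(0, Sigma)
Y = eta[:, None] * mu + E                         # training observations (one per row)

# a new independent observation from the same mixture
eta_new = rng.choice([-1, 1])
y_new = eta_new * mu + rng.standard_normal(p) @ Sigma_sqrt
```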

We are also interested in the supervised risk of certain classifiers of the form

where describes a separating half-space. More precisely, we will focus on the minimum ℓ_2-norm solutions of the optimization problems corresponding either to Ordinary Least Squares (OLS) or to the Support Vector Machine (SVM). The hard-margin SVM classifier is the solution to the problem

(2)

Similarly, is defined as the solution to

(3)

While the SVM is more commonly used for classification, both approaches have attracted a lot of attention recently, in part due to the discovery [9] of the phenomenon known as "Proliferation of Support Vectors" (SVP). Specifically, SVP corresponds to the situation where every training point is a support vector, so that the SVM and minimum-norm OLS solutions coincide; it typically occurs in the high-dimensional setting.
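As a rough illustration of the two programs (2)-(3), the sketch below computes the minimum ℓ_2-norm least squares solution via the pseudoinverse and approximates the hard-margin SVM by a linear SVM with a very large penalty parameter; the large-C surrogate and the variable names are assumptions made for illustration, reusing Y and eta from the sketch above.

```python
import numpy as np
from sklearn.svm import LinearSVC

# minimum ell_2-norm solution of the least squares problem Y w = eta
w_ols = np.linalg.pinv(Y) @ eta

# hard-margin SVM approximated by a soft-margin linear SVM with a very large C
svm = LinearSVC(C=1e6, loss="hinge", fit_intercept=False, max_iter=200_000)
svm.fit(Y, eta)
w_svm = svm.coef_.ravel()

# both induce linear classifiers x -> sign(<w, x>)
print(np.all(np.sign(Y @ w_ols) == eta))  # the min-norm OLS solution fits every training label
```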

1.2 Notation and Definitions

For , we denote its Euclidean norm by and its Mahalanobis norm corresponding to a positive definite matrix by . For a symmetric matrix with eigenvalues , we define the spectral norm of via and the Frobenius norm by . For a positive semidefinite (PSD) matrix , we define its effective rank by and its -effective rank by . Moreover, we define the orthogonal projectors , where is the non-increasing sequence of eigenvalues of . For given sequences and , we say that (resp. ) if for some , (resp. ) for all integers large enough. We will also write if as goes to .

1.3 Related work

Recent papers, for instance [3] and [4], suggest that interpolating solutions (specifically, solutions that fit the training data perfectly) can achieve optimal rates for the problems of nonparametric regression and -nearest neighbour clustering, respectively. Termed "benign overfitting" by [2], this phenomenon has been studied analytically in the framework of linear regression with isotropic noise. It was later extended to the anisotropic case in [16], where the authors proposed the fruitful idea of "alignment" and "misalignment" between the signal and the covariance matrix of the noise.

There are a number of recent works exploring the subject of overfitting in regression, while our focus is on the derivation of tight bounds for the misclassification rate of clustering in the framework of anisotropic sub-Gaussian mixture models. In moderate dimensions (), this problem has been studied extensively, most recently in [15, 8] and [7], and efficient approaches such as perturbed gradient descent and modifications of Lloyd's algorithm have been proposed. When overparametrization is possible (corresponding to the case ), support vector proliferation [11] has emerged as an important aspect of the analysis of interpolating SVM classifiers. In particular, papers including [1, 14] establish sufficient conditions for consistency of SVM interpolation. On the other hand, the work [10] approaches the topic in a more general setting via analyzing properties of Reproducing Kernel Hilbert Spaces (RKHS). The recent papers [6] and [14] are closely aligned with our work, but only consider the case when and no regularization is present. We summarize their key findings below.

                   Wang and Thrampoulidis [14]     Cao et al. [6]
SVP conditions:     and                             and
Error bounds:

1.4 Contribution

Our main contributions are summarized below and compared with the previous state of the art.

  • First, we derive the minimax bounds for the generalization risk of clustering in the anisotropic sub-Gaussian model, and show that the averaging classifier, defined in Section 2, is adaptive and minimax optimal.

  • We study the risk of regularized least squares classifiers and show that, under mild assumptions, the interpolating solution is minimax optimal, leading to a better bound than in previous works. We also expose interesting cases where the interpolating solution may outperform regularized classifiers, under mild assumptions on the covariance structure of the noise.

  • Next, we show that the SVP phenomenon occurs under the mild conditions: and . As a consequence, we derive risk bounds for the hard-margin SVM classifier. The aforementioned conditions are strictly better than previously known ones, as the latter require that .

  • Finally, we propose a framework where the covariance of the noise can be corrupted, and show that interpolation leads to robust minimax optimal classifiers over a large class of signal vectors in this case, while both the averaging and the LDA estimators fail. Hence interpolation is not only benign, but can also be optimal and robust.

2 Minimax clustering: the supervised case

In this section we consider the supervised clustering problem with Gaussian noise and unknown covariance . Our upper bounds remain valid for sub-Gaussian noise, but we restrict attention to the Gaussian case and focus on other important aspects of the problem. We first state a lower bound.

Theorem 2.1.

Let . Then

for some absolute constants and , where the infimum is taken over all measurable classifiers.

Notice that the inequality is stated in terms of the Euclidean norm of , and not the Mahalanobis norm, which would lead to a different lower bound. This particular choice is motivated by the fact that we seek bounds that hold adaptively over large classes of , which is typically unknown in practical applications. The proof of the lower bound is inspired by the argument in [12], which only holds for isotropic noise. As for the upper bound, we show that the "averaging" linear classifier defined as

is minimax optimal. In fact, we will prove the following inequality.

Theorem 2.2.

Let . For any covariance matrix we have, with probability at least , that

for some absolute constants . Moreover,

Theorem 2.2 provides an upper bound matching Theorem 2.1. In particular, it implies that for consistency under the Euclidean norm, it suffices to assume that . Among other conclusions, it shows that, from a minimax perspective, the averaging classifier outperforms the LDA classifier. This fact is stated explicitly next.

Proposition 2.1.

Let . Then for any

for some .

Observe that in the worst-case scenario, the vector is chosen so that , whence is sub-optimal. This phenomenon is only possible in high dimensions. It is in fact easy to see that when , LDA outperforms the averaging classifier for any given vector whenever consistency is possible. Indeed, in this case

Moreover it is always true that

Hence, whenever consistency is possible and when , we have

Notice that the last statement is stronger than a minimax comparison.
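For concreteness, here is a hedged sketch of the two linear rules compared in this section, assuming the usual forms: the averaging classifier uses the label-weighted mean of the observations as its direction, and the (oracle) LDA classifier premultiplies that direction by the inverse covariance. The exact formulas are missing from the text above, so these forms are assumptions for illustration.

```python
import numpy as np

def averaging_direction(Y, eta):
    """Label-weighted average of the observations: (1/n) * sum_i eta_i * Y_i."""
    return (eta[:, None] * Y).mean(axis=0)

def lda_direction(Y, eta, Sigma):
    """Oracle LDA direction: Sigma^{-1} times the averaging direction."""
    return np.linalg.solve(Sigma, averaging_direction(Y, eta))

def classify(w, x):
    """Linear rule induced by a direction w."""
    return np.sign(x @ w)
```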

3 Interpolation vs Regularization for sub-Gaussian mixtures

In this section, we study the risk of regularized OLS estimators. While it is more common to study the SVM for classification, recent works [1, 9] have shown that in high dimensions (specifically, ), the SVM and OLS solutions coincide under mild conditions. This phenomenon is known in the literature as "proliferation of support vectors". One of its implications is that in high dimensions, it suffices to study the properties of the least squares estimator and then demonstrate that it coincides with the hard-margin SVM. For the rest of this section, our goal is to study the risk of the family of supervised estimators defined as solutions to the problem

Observe that the case and leads to interpolation, specifically to the minimum ℓ_2-norm interpolating solution. For each , the corresponding estimator is proportional to

Each estimator leads to a linear classifier defined via

In what follows, we derive upper bounds for the risk . We also provide sufficient conditions on the matrix ensuring that the interpolating classifier (corresponding to ) achieves at least the same performance as the oracle (i.e., the averaging classifier, which is minimax optimal).

Notice that as goes to , we recover the averaging classifier. We emphasize here that if we replace the regularization term by (not adaptive to ), then our claims may no longer be true.

As a side note, we suspect that the excess risk of estimating itself gets smaller for large values of although this is not necessarily the case for the classification risk. This suggests that the excess risk of estimation may not be the right metric to evaluate classification performance.
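The following sketch illustrates one plausible form of the regularized least squares family discussed above, namely a ridge-type estimator whose penalty scales with the sample size; lambda = 0 gives the minimum ℓ_2-norm interpolator, and the estimator's direction approaches that of the averaging classifier as lambda grows. The exact normalization used in the paper is not shown in the text, so this parametrization is an assumption.

```python
import numpy as np

def ridge_direction(Y, eta, lam):
    """Ridge-type direction: (Y^T Y + lam * n * I)^{-1} Y^T eta; lam = 0 gives the min-norm interpolator."""
    n, p = Y.shape
    if lam == 0.0:
        return np.linalg.pinv(Y) @ eta
    return np.linalg.solve(Y.T @ Y + lam * n * np.eye(p), Y.T @ eta)

# the induced classifier depends only on the direction of the estimator
predict = lambda w, x: np.sign(x @ w)
```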

In order to state our main result, we will need the following quantity. For a given covariance matrix , we define the function via

for some absolute constant (which can be chosen large enough), and if the above set is empty. The reader may observe that is decreasing in and that if . In what follows, we will require to be smaller than . For , this means that we do not allow more than a fraction of all eigenvalues to be much larger than the remainder of the spectrum. Such conditions encompass many covariance matrices of interest, and in particular cover the case where can be viewed as a low-rank perturbation/corruption component. In what follows, will stand for .

Theorem 3.1.

Let . Assume that and that . Then for some constants we have with probability that

where .

The condition connecting with can be understood as follows. One may think of the eigenvectors corresponding to the largest eigenvalues of as "outliers", or directions dramatically affecting the rest of the spectrum. Our condition prohibits from concentrating too much of its mass in the subspace spanned by these eigenvectors. For instance, if is a random vector that is well spread out on the sphere (spherically distributed), then . The same remark also holds if we fix and think of the range of as a random subspace, making a random vector. Hence the latter condition simply means that the vector is only allowed to be aligned with the "clean" part of the covariance .
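As a numeric illustration of this alignment notion, the snippet below measures the fraction of the signal's energy that falls in the span of the top-k eigenvectors of the covariance; the alignment condition in the theorem caps this kind of quantity, although the exact constant and norm used there are not reproduced here.

```python
import numpy as np

def top_k_mass_fraction(mu, Sigma, k):
    """Fraction of ||mu||^2 lying in the span of the k leading eigenvectors of Sigma."""
    _, vecs = np.linalg.eigh(Sigma)   # eigenvectors, eigenvalues in ascending order
    top = vecs[:, -k:]                # k leading eigenvectors
    return np.linalg.norm(top.T @ mu) ** 2 / np.linalg.norm(mu) ** 2
```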

When (or equivalently ), we recover the bound obtained in [6] by taking . Notice that in this case the alignment condition is always satisfied, as . Our result is stronger, since we show that the bound holds with probability , while in [6] the authors only prove that the same bound holds with probability . Moreover, under the mild condition and taking , we show that

with probability at least . Therefore, taking the limit as leads to the same bound as the averaging oracle in this case.

In the case of moderate values of , it is worth noticing that is a decreasing function of . As a consequence, is an increasing function of such that

The latter quantity can be seen as a truncated trace of , where the truncation removes the largest eigenvalues of . Unlike the previous works [6, 14], our result is more general, since we allow to be non-zero. The bound in Theorem 3.1 in particular gets smaller as goes to , which suggests that interpolation may outperform regularization in some cases, especially when a finite number of eigenvalues are much larger than the rest of the spectrum.

In the case where can be large, we get the following result without any further assumptions on . Let us define

Proposition 3.1.

Let . Assume that . Then for some absolute constants

This result suggests that under a mild condition on the covariance of the noise, interpolation is not only benign but also minimax optimal on the set . This also means that interpolation behaves better for classification than for regression, since it does not suffer from the bias term that often leads to poor worst-case performance [2].

Recall that all our results hold under the condition . One may wonder what happens if is much larger than ; numerical experiments suggest that interpolation indeed behaves poorly in this case.

4 Proliferation of support vectors in high dimensions under the sub-Gaussian mixture model

In this section, we provide sufficient conditions for proliferation of support vectors. Based on the results in [9], and , as defined in (2)-(3), coincide if and only if

where is the Euclidean canonical basis. In the remainder of the section we denote . The main result is stated next.

Theorem 4.1.

Assume that , and for some absolute constant . Then with probability at least ,

As a consequence, attains the same performance as under the conditions of Theorem 4.1.

When (i.e. ), the sufficient conditions read as

  • ;

  • .

The first condition (which is signal-dependent) is also required in the prior works [6, 14]. As for the "dimension-dependent" second condition, it is much milder than those proposed in both previous papers on the topic. To compare these results, consider the case , where our condition reads as , while the earlier analogues require . Our result also suggests that suffices for proliferation under the sub-Gaussian mixture model, which confirms the general conjecture stated in [9].
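A quick empirical way to check whether SVP holds on a given sample is to fit both classifiers and compare their directions, since under SVP the hard-margin SVM and the minimum-norm least squares solutions coincide up to scaling. The sketch below does exactly that (using a large-C linear SVM as a hard-margin surrogate); it does not implement the deterministic characterization from [9].

```python
import numpy as np
from sklearn.svm import LinearSVC

def svp_holds(Y, eta, tol=1e-3):
    """Return True if the SVM and min-norm OLS directions are (numerically) parallel."""
    w_ols = np.linalg.pinv(Y) @ eta
    svm = LinearSVC(C=1e8, loss="hinge", fit_intercept=False, max_iter=500_000)
    w_svm = svm.fit(Y, eta).coef_.ravel()
    cosine = w_ols @ w_svm / (np.linalg.norm(w_ols) * np.linalg.norm(w_svm))
    return cosine > 1 - tol
```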

5 Application to robust supervised clustering

In this section, we present a framework for robust clustering in the sub-Gaussian mixture model. In the rest of this section we consider the case of Gaussian noise with identity covariance . We assume that the training set has a different covariance than the new observation to be classified, due to the action of a malicious adversary. More precisely, given the vector of observations , the adversary can corrupt the sample as follows: she chooses up to eigenvalues of the covariance matrix and positive scalars , and then adds i.i.d. random noise to such that the new observations become

where are i.i.d. standard normal vectors that are also independent from , and where is the canonical Euclidean basis of and is the set of indices corresponding to the corrupted eigenvalues. Observe that the covariance of the noise is now given by . In what follows, denotes the projection . Our goal is to show that under minimal assumptions, the interpolating estimator is still minimax optimal while both the averaging estimator and the LDA one fail to perform well.
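The corruption mechanism described above can be simulated as follows: a set of coordinates (eigen-directions of the identity covariance) is chosen at random and large independent Gaussian noise is added to the training observations along those coordinates, while test observations keep isotropic noise. The function name and the uniform corruption scale are illustrative assumptions.

```python
import numpy as np

def corrupt_training_sample(Y, s, scale, rng):
    """Add N(0, scale^2) noise to s randomly chosen coordinates of every training point."""
    n, p = Y.shape
    corrupted = rng.choice(p, size=s, replace=False)   # indices of corrupted eigen-directions
    Y_corr = Y.copy()
    Y_corr[:, corrupted] += scale * rng.standard_normal((n, s))
    return Y_corr, corrupted                           # test data are left untouched
```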

Theorem 5.1.

Assume that and that . Then

One implication of this theorem is that is minimax optimal on the set and robust with respect to the corruption , for moderate values of . When and the direction of is not too closely aligned with the eigenvectors corresponding to the corrupted part of the spectrum, mimics the performance of the averaging estimator in the absence of outliers. In order to compare the bound stated above to the estimates available for other estimators, we rely on the next proposition.

Proposition 5.1.

Assume that the noise is Gaussian and that satisfies where . Then for any such that ,

for some absolute constant .

In summary, when , there exists a regime (where ) such that interpolation is minimax optimal over a large class of vectors , under the contamination model we presented, while both and can fail with constant non-zero probability.

6 Numerical experiments

The goal of this section is to compare the performance of several estimators for different values of . The case corresponds to interpolation, while recovers the averaging classifier . In all our simulations we consider only Gaussian noise and a spherically distributed signal vector.

Figure 1: Comparison of the generalization error of four classifiers. The left plot corresponds to the case of large effective rank while the right one corresponds to medium effective rank.

Our simulation setup is defined as follows. We choose and to allow for overparametrization. We compare four estimators: the interpolating classifier , the classifiers for and , and the averaging estimator . For each value of , the simulation was repeated times; finally, we plot the empirical generalization error.

For our first experiment (Figure 1), we compare performance of the four classifiers in two cases:

  • The case of large effective rank where we choose to be a diagonal matrix with for all . This case corresponds to ;

  • The case of medium effective rank where we choose to be a diagonal matrix with and . This case corresponds to .

Consistent with the theoretical predictions, our simulations suggest that all classifiers have similar performance in the regime of large effective rank. Interestingly, interpolation seems to perform best when the effective rank is smaller than . This confirms our observation that interpolation can be superior to regularization in some cases.
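For reference, here is one way to realize the two covariance regimes of the first experiment: a flat spectrum for the large-effective-rank case and a spiked spectrum for the medium-effective-rank case. The exact eigenvalue profiles used in the paper are not specified in the text, so these diagonal choices are stand-ins.

```python
import numpy as np

p = 1000
Sigma_large_er = np.eye(p)                                             # flat spectrum, effective rank p
Sigma_medium_er = np.diag(np.r_[50.0 * np.ones(10), np.ones(p - 10)])  # a few dominant eigenvalues
```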

As for our second experiment (Figure 2), we choose and we corrupt the training sample, as explained in Section 5, by randomly setting diagonal entries of the covariance to . Recall that this modification only affects the training sample, while the test sample has isotropic noise.

Figure 2: Comparison of the performance of four classifiers under corruption.

In this case, the averaging classifier fails to predict new labels correctly, while the interpolating classifier is still able to classify well despite the corruption. Observe also that the regularized classifiers (corresponding to small regularization parameters) perform similarly to the interpolating classifier. This occurs simply because for all values of much smaller than the magnitude of the corruption, where is the corrupted covariance matrix. We conclude by reaffirming the claim that interpolation is not only harmless in high dimensions, but can outperform regularization and, in some cases, exhibits an unexpected robustness.

References

Appendix A Proofs of minimax clustering

A.1 Proof of Theorem 2.1

Fix and let be a diagonal PSD matrix such that and . Then

We can simply focus on showing the result for the above diagonal matrix . We will use the fact that

for any prior on such that . The quantity stands for the expectation corresponding to following model (1). The proof is decomposed into two steps.

  • A dimension-independent lower bound: Let be a fixed vector in such that . We place independent Rademacher priors on each for . It follows that

    (4)

    where as we average over . The last inequality holds in view of Jensen's inequality and the independence between and . We define, for , the density of the observation conditionally on the value of . Now, using the Neyman-Pearson lemma and the explicit form of , we get that the selector given by

    is optimal as it achieves the minimum of the RHS of (4).

    To show this, we remind the reader that the distribution of , conditionally on , is given by . Hence

    It follows that

    By the Neyman-Pearson lemma, we can now conclude that

    Plugging this value into (4), we further get that

    It is straightforward to see that

    for some , where is the standard normal cumulative distribution function. The last inequality holds for any as long as . The worst case is attained when is collinear with the top eigenvector of , since . Hence we get the lower bound

    In order to conclude we only need to derive the other lower bound

    For the rest of the proof we focus only on the case where ; otherwise the dimension-independent lower bound dominates.

  • A dimension-dependent lower bound:

    Since is diagonal, we write where . According to Theorem in [12] we have

    for any prior on . The second term in the above lower bound accounts for the constraint on . In what follows, we fix and choose to be a product prior on (,) such that is a Rademacher random variable and is an independent random vector such that , where is a diagonal matrix such that . Using the Hanson-Wright inequality [13], we deduce that

    for some . Since , we only need to show that, for large enough, we have

    for some . We define, for , to be the density of the observation given . Using Neyman-Pearson lemma, we get that

    minimizes over all functions of with values in . Using the independence of the rows of , we have

    where is the -th row of the matrix and (here is the binary vector such that for all