The topic of overparametrization has gained tremendous attention in the recent literature on high-dimensional statistics. It was previously widely believed that regularization yields the best generalization power. Recently, however, it was discovered that estimators that interpolate the training data can also achieve good generalization error bounds when the number of covariates exceeds the sample size. This phenomenon, termed “benign overfitting”, has been extensively investigated in the regression setting. In this work, we study the problem of clustering. In particular, we derive bounds for the generalization error under different sets of assumptions. The model we consider is a binary sub-Gaussian mixture model with unknown, anisotropic noise.
1.1 Statement of the problem
Consider the simple two-component Gaussian mixture model, where we observe, for all , the pairs such that
is a center vector, a label vector, and
a random matrix with i.i.d. columns that are sub-Gaussian. More precisely, given the spectral decomposition , we assume that
where has components that are independent and -sub-Gaussian; that is, for all ,
where denotes the Euclidean norm. Moreover, we will always assume that has full rank. We mostly focus on the supervised setting, where we are given a classifier based on the training data and a new independent observation such that
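For concreteness, the data-generating process can be simulated in the Gaussian special case as follows; this is only an illustrative sketch, and the function name `sample_mixture` as well as the chosen dimensions are ours.

```python
import numpy as np

def sample_mixture(n, mu, Sigma, rng):
    """Draw n observations X_i = eta_i * mu + eps_i, where eta_i is a
    Rademacher label and eps_i ~ N(0, Sigma) is anisotropic noise."""
    d = mu.shape[0]
    eta = rng.choice([-1.0, 1.0], size=n)                      # hidden labels
    eps = rng.multivariate_normal(np.zeros(d), Sigma, size=n)  # noise matrix
    return eta[:, None] * mu + eps, eta

rng = np.random.default_rng(0)
d, n = 5, 200
mu = np.ones(d)                               # center vector
Sigma = np.diag(np.linspace(1.0, 2.0, d))     # anisotropic covariance
X, eta = sample_mixture(n, mu, Sigma, rng)    # X has shape (n, d)
```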
is a Rademacher random variable taking values
with probability each. We want to analyze the generalization error defined as
both in probability and expectation, where is the probability under the model described above. Observe that
When there is no ambiguity, we will omit the subscript from . In particular, we want to analyze the minimax risk
and study the necessary and sufficient conditions on for consistent clustering, i.e. conditions on such that
The case when was investigated in . In particular, it was shown that
When is known and the noise is normal, it is easy to see that follows the isotropic Gaussian Mixture Model, where the signal vector is given by . Following reasoning similar to , we can show in the general case that
where . In particular, the condition is necessary and sufficient for consistency under the norm . Moreover, the minimax optimal classifier is the Linear Discriminant Analysis (LDA) classifier given by
The LDA estimator and its adaptive variants have recently been studied in several works, including [15, 8, 7] and . All these papers consider anisotropic mixtures in the low-dimensional case. In this work we are interested in the case of unknown in high dimensions (i.e. ), where estimation of both and becomes challenging.
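In the Gaussian case with known covariance, the LDA rule classifies a new point by the sign of its inner product with the direction obtained by applying the inverse covariance to the signal. A minimal numpy sketch (the helper name `lda_classify` is ours):

```python
import numpy as np

def lda_classify(x, mu, Sigma):
    """Oracle LDA rule for the symmetric two-component mixture:
    predict the sign of <x, Sigma^{-1} mu>."""
    w = np.linalg.solve(Sigma, mu)   # direction Sigma^{-1} mu
    return np.sign(x @ w)

mu = np.array([1.0, 0.0])
Sigma = np.diag([1.0, 4.0])
print(lda_classify(np.array([0.5, -3.0]), mu, Sigma))   # prints 1.0
```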
We are also interested in the supervised risk of certain classifiers of the form
where describes a separating half-space. More precisely, we will focus on the minimum-norm solution, defined as the solution to the problem
Similarly, is defined as the solution to
While SVM is more commonly used for classification, both approaches have attracted a lot of attention recently, in part due to the discovery of the phenomenon known as “Proliferation of Support Vectors” (SVP). Specifically, SVP corresponds to the situation when , and typically occurs in the high-dimensional setting.
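SVP is easy to observe numerically: when the dimension greatly exceeds the sample size, the minimum-norm least squares solution fits every label exactly, so all margin constraints are active, and when SVP holds the hard-margin SVM solution coincides with it. A sketch under these assumptions (dimensions and seed are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 500                       # heavily overparametrized: d >> n
mu = np.zeros(d); mu[0] = 3.0
eta = rng.choice([-1.0, 1.0], size=n)
X = eta[:, None] * mu + rng.standard_normal((n, d))

# Minimum-norm least squares solution: w = X^+ eta.
w = np.linalg.pinv(X) @ eta
# Exact interpolation: eta_i * <x_i, w> = 1 for every i, i.e. every
# training point lies on the margin of the (rescaled) classifier.
print(np.allclose(X @ w, eta))       # prints True
```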
1.2 Notation and Definitions
For , we denote its Euclidean norm by and its Mahalanobis norm corresponding to a positive definite matrix by . For a symmetric matrix
with eigenvalues, we define the spectral norm of via and the Frobenius norm by . For a positive semidefinite (PSD) matrix , we define its effective rank by and its -effective rank by . Moreover, we define the orthogonal projectors , where is the non-increasing sequence of eigenvalues of . For given sequences and , we say that (resp. ) if for some , (resp. ) for all integers large enough. We will also write if as goes to .
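The effective rank defined above is straightforward to compute; a short sketch (the helper name `effective_rank` is ours):

```python
import numpy as np

def effective_rank(Sigma):
    """Effective rank r(Sigma) = trace(Sigma) / largest eigenvalue of Sigma."""
    eigvals = np.linalg.eigvalsh(Sigma)   # eigenvalues of a symmetric matrix
    return eigvals.sum() / eigvals.max()

print(effective_rank(np.eye(4)))                    # prints 4.0
print(effective_rank(np.diag([10., 1., 1., 1.])))   # prints 1.3
```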
1.3 Related work
Recent papers, for instance  and , suggest that interpolating solutions (specifically, solutions that fit the training data perfectly) can achieve optimal rates for the problems of nonparametric regression and -nearest neighbour clustering, respectively. Termed “benign overfitting” by , this phenomenon has been studied analytically in the framework of linear regression with isotropic noise. It was later extended to the anisotropic case in , where the authors proposed the fruitful idea of “alignment” and “misalignment” between the signal and the covariance matrix of the noise.
There are a number of recent works exploring the subject of overfitting in regression, while our focus is on the derivation of tight bounds on the misclassification rate for clustering in the framework of anisotropic sub-Gaussian mixture models. In moderate dimensions (), this problem has been studied extensively, as in the most recent works [15, 8] and , and efficient approaches such as perturbed gradient descent and modifications of Lloyd’s algorithm have been proposed. When overparametrization is possible (corresponding to the case ), support vector proliferation  has emerged as an important aspect of the analysis of interpolating SVM classifiers. In particular, papers including [1, 14] establish sufficient conditions for consistency of SVM interpolation. On the other hand, the work  approaches the topic in a more general setting via analyzing properties of Reproducing Kernel Hilbert Spaces (RKHS). Recent papers  and  are closely aligned with our work, but only consider the case when and no regularization is present. We summarize their key findings below.
Our main contributions are summarized below and compared with the previous state of the art.
First, we derive the minimax bounds for the generalization risk of clustering in the anisotropic sub-Gaussian model, and show that the averaging classifier, defined in Section 2, is adaptive and minimax optimal.
We study the risk of the regularized least squares classifiers and show that, under mild assumptions, the interpolating solution is minimax optimal, leading to a better bound than the previous works. We also expose interesting cases where the interpolating solution may outperform regularized classifiers, under mild assumptions on the covariance structure of the noise.
Next, we show that the SVP phenomenon occurs under the mild conditions and . As a consequence, we derive risk bounds for the hard-margin SVM classifier. The aforementioned conditions are strictly weaker than previously known ones, as the latter require that .
Finally, we propose a framework where the covariance of the noise can be corrupted, and show that interpolation leads to robust minimax optimal classifiers over a large class of signal vectors in this case, while both the averaging and the LDA estimators fail. Hence interpolation is not only benign, but can also be optimal and robust.
2 Minimax clustering: the supervised case
In this section we consider the supervised clustering problem with Gaussian noise with unknown covariance . Our upper bounds remain valid for sub-Gaussian noise, but we restrict attention to the Gaussian case in order to focus on other important aspects of the problem. We first state a lower bound.
Let . Then
for some absolute constants and where the infimum is taken over all measurable classifiers.
Notice that the inequality is stated in terms of the Euclidean norm of , and not the Mahalanobis norm, which would lead to a different lower bound. This particular choice is motivated by the fact that we are seeking bounds that hold adaptively for large classes of , which is typically unknown in practical applications. The proof of the lower bound is inspired by the argument in , which only holds for isotropic noise. As for the upper bound, we show that the “averaging” linear classifier defined as
is minimax optimal. In fact, we will prove the following inequality.
Let . For any covariance matrix we have, with probability at least , that
for some absolute constants . Moreover,
Theorem 2.2 provides a matching upper bound to Theorem 2.1. In particular, it implies that for consistency under the Euclidean norm, it suffices to assume that . Among other conclusions, it yields that from a minimax perspective, the averaging classifier outperforms the LDA one. This fact is stated explicitly next.
Let . Then for any
for some .
Observe that for the worst case scenario, the vector is chosen so that , whence is sub-optimal. This phenomenon is only possible in high dimensions. It is in fact easy to see that when , LDA outperforms the averaging classifier for any given vector whenever consistency is possible. Indeed, in this case
Moreover it is always true that
Hence, whenever consistency is possible and when , we have
Notice that the last statement is stronger than a minimax comparison.
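The averaging classifier studied in this section is a one-liner: it averages the label-signed observations and classifies a new point by the sign of the resulting inner product. A sketch under Gaussian noise (the helper name, dimensions, and seed are our illustrative choices):

```python
import numpy as np

def averaging_classifier(X, y):
    """Averaging estimator: hat_mu = (1/n) sum_i y_i X_i;
    classify a new point x by the sign of <x, hat_mu>."""
    hat_mu = X.T @ y / len(y)
    return lambda x: np.sign(x @ hat_mu)

rng = np.random.default_rng(2)
n, d = 100, 10
mu = np.full(d, 2.0)
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d))   # isotropic Gaussian noise

f = averaging_classifier(X, y)
x_new = mu + rng.standard_normal(d)   # fresh observation with label +1
print(f(x_new))
```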
3 Interpolation vs Regularization for sub-Gaussian mixtures
In this section, we study the risk of the regularized OLS estimators. While it is more common to study SVM for classification, recent works [1, 9] have shown that in high dimensions (specifically, ), SVM and OLS solutions coincide under mild conditions. This phenomenon is known in the literature as “proliferation of support vectors”. One of its implications is that in high dimensions, it is sufficient to study the properties of the least squares estimator and then demonstrate that it coincides with the hard-margin SVM. For the rest of this section, our goal is to study the risk of the family of supervised estimators defined as solutions to the problem
Observe that the case and leads to interpolation, specifically to the minimum -norm interpolating solution. For each , the corresponding estimator is proportional to
Each estimator leads to a linear classifier defined via
In what follows, we will derive upper bounds for the risk . We will also provide sufficient conditions for the matrix ensuring that the interpolating classifier (corresponding to ) achieves at least the same performance as the oracle (or the averaging classifier that is minimax optimal).
Notice that as goes to , we recover the averaging classifier. We emphasize here that if we replace the regularization term by (not adaptive to ), then our claims may no longer be true.
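To make the family concrete, the sketch below computes the regularized solution for a penalized least squares objective (the explicit form ||Xw - y||^2 + lam * ||w||^2 and the helper name `ridge_direction` are our assumptions, since the displayed objective is not reproduced here): lam = 0 with d > n gives the minimum-norm interpolator, while letting lam grow aligns the solution with the averaging direction X^T y.

```python
import numpy as np

def ridge_direction(X, y, lam):
    """Solution of min_w ||X w - y||^2 + lam * ||w||^2 (our assumed form);
    lam = 0 with d > n returns the minimum-norm interpolator instead."""
    n, d = X.shape
    if lam == 0.0:
        return np.linalg.pinv(X) @ y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 12))      # d > n
y = rng.choice([-1.0, 1.0], size=5)

# For very large lam the solution is nearly proportional to X^T y,
# i.e. the averaging direction.
w_big = ridge_direction(X, y, 1e8)
avg = X.T @ y
cos = w_big @ avg / (np.linalg.norm(w_big) * np.linalg.norm(avg))
print(round(cos, 6))                  # prints 1.0
```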
As a side note, we suspect that the excess risk of estimating itself gets smaller for large values of , although this is not necessarily the case for the classification risk. This suggests that the excess risk of estimation may not be the right metric to evaluate classification performance.
In order to state our main result, we will need the following quantity. For a given covariance matrix , we define the function via
for some absolute constant (that can be chosen to be large enough), and if the above set is empty. The reader may observe that is decreasing in and that if . In what follows, we will require to be smaller than . For , this means that we are not allowing more than a fraction of all eigenvalues to be much larger than the remainder of the spectrum. Such conditions encompass many covariance matrices of interest, and in particular cover the case where can be viewed as a low-rank perturbation/corruption component. In what follows, will stand for .
Let . Assume that and that . Then for some constants we have with probability that
The condition connecting with can be understood as follows. One may think of the eigenvectors corresponding to largest eigenvalues of
as “outliers”, or directions that dramatically affect the rest of the spectrum. Our condition prohibits from concentrating too much of its mass in the subspace spanned by the latter eigenvectors. For instance, if is a random vector that is well spread out on the sphere (spherically distributed), then . The same remark also holds if we fix and think of the range of as a random subspace, making a random vector. Hence the latter condition simply means that the vector is only allowed to be aligned with the “clean” part of the covariance .
When (or equivalently ), we recover the bound obtained in  by taking . Notice that in this case the alignment condition is always satisfied, as . Our result is stronger, since we show that the bound holds with probability , while in  the authors only prove that the same bound holds with probability . Moreover, under the mild condition and by taking , we show that
with probability at least . Therefore, taking the limit as leads to the same bound as the averaging oracle in this case.
In the case of moderate values of , it is worth noticing that is a decreasing function of . As a consequence, is an increasing function of such that
The latter quantity could be seen as a truncated trace of , where truncation is applied to the largest eigenvalues of . Unlike the previous works [6, 14], our result is more general since we allow to be non-zero. The bound in Theorem 3.1 in particular gets smaller as goes to , which suggests that interpolation may outperform regularization in some cases, especially in the case where a finite number of eigenvalues are much larger than the rest of the spectrum.
In the case where can be large we get the following result, without any further assumptions on . Let us define
Let . Assume that . Then for some absolute constants
This result suggests that under a mild condition on the covariance of the noise, interpolation is not only benign but also minimax optimal on the set . It also means that interpolation is better suited to classification than to regression, since it does not suffer from a bias term, which often leads to poor worst-case performance ().
Recall that all our results hold under the condition . We may wonder here what happens if is much larger than , and numerical experiments suggest that interpolation indeed behaves poorly in this case.
4 Proliferation of support vectors in high dimensions under the sub-Gaussian mixture model
where is the Euclidean canonical basis. In the remainder of the section we denote . The main result is stated next.
Assume that , and for some absolute constant . Then with probability at least ,
As a consequence, attains the same performance as under the conditions of Theorem 4.1.
When (i.e. ), the sufficient conditions read as
The first condition (that is signal-dependent) is also required in prior works [6, 14]. As for the “dimension-dependent” second condition, it is much milder than the one proposed in both previous papers on the topic. To compare these results, consider the case , where our condition reads as while the earlier analogues require . Our result also suggests that suffices for proliferation under the sub-Gaussian mixture model, which confirms the general conjecture stated in .
5 Application to robust supervised clustering
In this section, we present a framework for robust clustering in the sub-Gaussian mixture model. In the rest of this section we consider the case of Gaussian noise and identity covariance . We will assume that the training set has a covariance different from that of the new observation to be classified, due to the action of a malicious adversary. More precisely, given the vector of observations , the adversary can corrupt the sample as follows: she chooses up to eigenvalues of the covariance matrix and positive scalars , and then adds i.i.d. random noise to such that the new observations become
where are i.i.d. standard normal vectors that are also independent from , and where is the canonical Euclidean basis of and is the set of indices corresponding to the corrupted eigenvalues. Observe that the covariance of the noise is now given by . In what follows, denotes the projection . Our goal is to show that under minimal assumptions, the interpolating estimator is still minimax optimal while both the averaging estimator and the LDA one fail to perform well.
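The corruption mechanism can be sketched as follows (identity base covariance; the corrupted index set, magnitude, and dimensions are chosen by us for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, s = 50, 30, 3
mu = np.full(d, 1.0)
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d))    # isotropic training noise

# The adversary picks s coordinate directions and adds independent extra
# Gaussian noise there, inflating those eigenvalues of the training
# covariance from 1 to 1 + gamma; the test point keeps identity covariance.
J = rng.choice(d, size=s, replace=False)
gamma = 100.0
X_corrupt = X.copy()
X_corrupt[:, J] += np.sqrt(gamma) * rng.standard_normal((n, s))
```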
Assume that and that . Then
One implication of this theorem is the fact that is minimax optimal on the set and robust with respect to the corruption , for moderate values of . When and the direction of is not too closely aligned with the eigenvectors corresponding to the corrupted part of the spectrum, then mimics the performance of the averaging estimator in the absence of outliers. In order to compare the bound stated above to estimates available for other estimators, we rely on the next proposition.
Assume that the noise is Gaussian and that satisfies where . Then for any such that ,
for some absolute constant .
In summary, when , there exists a regime (where ) such that interpolation is minimax optimal over a large class of vectors , under the contamination model we presented, while both and can fail with constant non-zero probability.
6 Numerical experiments
The goal of this section is to compare the performance of several estimators for different values of . The case corresponds to interpolation, while recovers the averaging classifier . In all our simulations we will only consider Gaussian noise and spherically distributed.
Our simulation setup is defined as follows. We choose and to allow for overparametrization. We compare 4 estimators: the interpolating classifier , classifiers for and , as well as the averaging estimator . For each value of , the simulation was repeated times, and we plot the empirical generalization error.
For our first experiment (Figure 1), we compare performance of the four classifiers in two cases:
The case of large effective rank where we choose to be a diagonal matrix with for all . This case corresponds to ;
The case of medium effective rank where we choose to be a diagonal matrix with and . This case corresponds to .
Consistent with the theoretical predictions, our simulations suggest that all classifiers have similar performance in the regime of large effective rank. Interestingly, interpolation seems to perform best when the effective rank is smaller than . This confirms our observation that interpolation can be superior to regularization in some cases.
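A stripped-down version of this comparison can be run with the sketch below; the spiked diagonal covariance, signal support, dimensions, and seed are our illustrative choices rather than the exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, n_test = 50, 400, 2000

lam = np.ones(d)
lam[:5] = 50.0      # a few large eigenvalues: moderate effective rank
mu = np.zeros(d)
mu[20:40] = 1.0     # signal supported away from the spiked directions

def sample(m):
    """Draw m labeled observations with diagonal covariance lam."""
    y = rng.choice([-1.0, 1.0], size=m)
    X = y[:, None] * mu + rng.standard_normal((m, d)) * np.sqrt(lam)
    return X, y

X, y = sample(n)
w_interp = np.linalg.pinv(X) @ y    # minimum-norm interpolating classifier
w_avg = X.T @ y / n                 # averaging classifier

X_t, y_t = sample(n_test)
for name, w in [("interpolation", w_interp), ("averaging", w_avg)]:
    print(name, np.mean(np.sign(X_t @ w) != y_t))   # empirical error
```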
As for our second experiment (Figure 2), we choose and we corrupt the training sample, as explained in Section 5, by randomly setting diagonal entries of the covariance to . Remember that this modification only impacts the training sample, while the test sample has isotropic noise.
In this case, the averaging classifier fails to predict new labels correctly, while the interpolating classifier is still able to classify well despite the corruption. Observe also that regularized classifiers (corresponding to small regularization parameters) perform similarly to the interpolating classifier. This occurs simply because for all values of much smaller than the magnitude of the corruption, where is the corrupted covariance matrix. We conclude by reaffirming the claim that interpolation is not only harmless in high dimensions, but can outperform regularization and in some cases exhibits an unexpected feature of robustness.
- Ardeshir, N., Sanford, C. and Hsu, D. (2021). Support Vector Machines and Linear Regression Coincide with Very High-dimensional Features.
- Bartlett, P. L., Long, P. M., Lugosi, G. and Tsigler, A. (2020). Benign Overfitting in Linear Regression. Proceedings of the National Academy of Sciences.
- Belkin, M., Hsu, D. and Mitra, P. (2018). Overfitting or Perfect Fitting? Risk Bounds for Classification and Regression Rules that Interpolate. arXiv preprint arXiv:1806.05161.
- Belkin, M., Rakhlin, A. and Tsybakov, A. B. (2019). Does Data Interpolation Contradict Statistical Optimality? In The 22nd International Conference on Artificial Intelligence and Statistics 1611–1619. PMLR.
- Cai, T. T. and Zhang, L. (2018). High-dimensional Linear Discriminant Analysis: Optimality, Adaptive Algorithm, and Missing Data.
- Cao, Y., Gu, Q. and Belkin, M. (2021). Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures.
- Chen, X. and Zhang, A. Y. (2021). Optimal Clustering in Anisotropic Gaussian Mixture Models. arXiv preprint arXiv:2101.05402.
- Davis, D., Diaz, M. and Wang, K. (2021). Clustering a Mixture of Gaussians with Unknown Covariance.
- Hsu, D., Muthukumar, V. and Xu, J. (2020). On the Proliferation of Support Vectors in High Dimensions.
- Liang, T. and Recht, B. (2021). Interpolating Classifiers Make Few Mistakes.
- Muthukumar, V., Narang, A., Subramanian, V., Belkin, M., Hsu, D. and Sahai, A. (2020). Classification vs Regression in Overparameterized Regimes: Does the Loss Function Matter?
- Ndaoud, M. (2018). Sharp Optimal Recovery in the Two Component Gaussian Mixture Model. arXiv preprint arXiv:1812.08078.
- Rudelson, M. and Vershynin, R. (2013). Hanson-Wright Inequality and Sub-Gaussian Concentration.
- Wang, K. and Thrampoulidis, C. (2020). Benign Overfitting in Binary Classification of Gaussian Mixtures. arXiv preprint arXiv:2011.09148.
- Wang, K., Yan, Y. and Diaz, M. (2020). Efficient Clustering for Stretched Mixtures: Landscape and Optimality. arXiv preprint arXiv:2003.09960.
- Wu, D. and Xu, J. (2020). On the Optimal Weighted Regularization in Overparameterized Linear Regression. arXiv preprint arXiv:2006.05800.
Appendix A Proofs of minimax clustering
A.1 Proof of Theorem 2.1
Fix and let be a diagonal PSD matrix such that and . Then
We can simply focus on showing the result for the above diagonal matrix . We will use the fact that
for any prior on such that . The quantity stands for the expectation corresponding to following model (1). The proof is decomposed into two steps.
A dimension-independent lower bound: Let be a fixed vector in such that . We place independent Rademacher priors on each for . It follows that
where as we average over . The last inequality holds in view of Jensen’s inequality and the independence between and . We define, for , the density of the observation conditionally on the value of . Now, using the Neyman–Pearson lemma and the explicit form of , we get that the selector given by
is optimal as it achieves the minimum of the RHS of (4).
To show that, we remind the reader that the distribution of , conditionally on , is given by . Hence
It follows that
By the Neyman–Pearson lemma, we can now conclude that
Plugging this value in (4), we get further that
It is straightforward to see that
for some , where is the standard normal cumulative distribution function. The last inequality holds for any as long as . The worst case is reached for being collinear with the top eigenvector of , since . Hence we get the lower bound
In order to conclude we only need to derive the other lower bound
For the rest of the proof we only focus on the case where , otherwise the dimension independent lower bound dominates.
A dimension-dependent lower bound:
Since is diagonal, we write where . According to Theorem in  we have
for any prior on . The second term in the above lower bound accounts for the constraint on . In what follows, we fix and choose to be a product prior on (,) such that is a Rademacher random variable and is an independent random vector such that , where is a diagonal matrix such that . Using the Hanson-Wright inequality , we deduce that
for some . Since , we only need to show that, for large enough, we have
for some . We define, for , to be the density of the observation given . Using the Neyman–Pearson lemma, we get that
minimizes over all functions of with values in . Using the independence of the rows of , we have
where is the -th row of the matrix and (here is the binary vector such that for all