High-dimensional datasets encountered in modern science often exhibit nonlinear lower-dimensional structures. Efforts to uncover such low-dimensional structures revolve around the discovery of meaningful measures of pairwise discrepancy between data points [22, 16, 3, 4]. In the so-called metric design problem, the data analyst aims to find a useful metric representing the relationship between data points embedded in high-dimensional space. In this paper, we study the Mahalanobis Distance (MD), a popular, and arguably the first, method for metric design [12, 14]. MD was originally proposed in 1936 with the classical low-dimensional setting in mind, namely, for the case where the dataset size is much larger than the ambient dimension.
Interestingly, due to its useful statistical and invariance properties, MD became the basis of several geometric data analysis [28, 23, 26] and manifold learning [17, 20] techniques aimed specifically at the high-dimensional regime. It was recently shown that MD is also implicitly used in the seminal Locally Linear Embedding algorithm, when the barycenter step is properly expressed [25, (2.17)].
As the number of dimensions in typical data analysis applications continues to grow, it becomes increasingly crucial to understand the behavior of MD, as well as other metric design algorithms, in the high-dimensional regime. At first glance, it might seem that this regime poses little more than a computational inconvenience for metric design using MD. Indeed, it is easy to show that in the absence of measurement noise, MD cares little about the increase in the ambient dimension.
This paper calls attention to the following key observation. In the high-dimensional regime, in the presence of ambient measurement noise, a new phenomenon emerges, which introduces nontrivial effects on the performance of MD. Depending on the noise level, in the high-dimensional regime, MD may be adversely affected or may even fail completely. Clearly, the assumption of measurement noise cannot realistically be excluded, and yet, to the best of our knowledge, this phenomenon has not previously been fully studied. A first step in this direction was taken in , with the calculation of the distribution of MD under specific assumptions.
Let us describe this key phenomenon informally at first. The computation of MD involves estimation of the inverse covariance matrix, or precision matrix, corresponding to the data at hand. Classically, the estimation relies on the sample covariance, which is inverted using the Moore-Penrose pseudo-inverse. It is well-known that, in the regime
, the sample covariance matrix is a poor estimator of the underlying covariance matrix. Indeed, advances in random matrix theory from the last decade imply that the eigenvalues and eigenvectors of the sample covariance matrix are inconsistent; namely, they do not converge to the corresponding eigenvalues and eigenvectors of the underlying covariance matrix [15, 7]. Importantly, the inconsistencies in the small eigenvalues, which already lead to inaccurate covariance estimation, are magnified immensely when applying the Moore-Penrose pseudo-inverse.
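To make this inconsistency concrete, here is a minimal numerical sketch (our own illustration, not part of the original analysis): for pure white noise with identity population covariance and dimension comparable to the sample size, the sample covariance eigenvalues spread over the entire Marchenko-Pastur bulk instead of concentrating near 1, so the pseudo-inverse wildly amplifies the smallest ones.

```python
import numpy as np

def sample_cov_extreme_eigs(n, p, seed=0):
    """Largest and smallest eigenvalues of the sample covariance of
    pure white noise (population covariance = identity)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    S = X.T @ X / n               # p-by-p sample covariance
    eigs = np.linalg.eigvalsh(S)
    return eigs[-1], eigs[0]

# With p/n = 1, the Marchenko-Pastur law predicts eigenvalues spread
# over [0, (1 + sqrt(gamma))^2] = [0, 4], even though every population
# eigenvalue equals 1 -- so pinv(S) hugely amplifies the tiny ones.
top, bottom = sample_cov_extreme_eigs(n=500, p=500)
```

Every population eigenvalue is 1, yet the sample spectrum stretches from nearly 0 to nearly 4; inverting the near-zero eigenvalues is what destroys the classical MD estimator.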
In this paper, we study this problem and propose a remedy. By relying on existing formal results in covariance matrix estimation, we measure the sensitivity of MD to measurement noise. Under the assumption that the data lie on a low-dimensional linear subspace or manifold in the ambient space and that the measurement noise is white Gaussian, we are able to determine the exact asymptotic signal-to-noise ratio at which MD fails, and to quantify its performance otherwise. In addition, it has been known since the 1970s  that by shrinking the sample covariance eigenvalues one can significantly mitigate the noise effects and improve the covariance estimation in high dimensions. We formulate the classical MD as a particular choice of shrinkage estimator for the eigenvalues of the sample covariance. Building on recent results in high-dimensional covariance estimation, including the general theory in  and a special case with an application in random tomography [18, Section 4.4], we find an asymptotically optimal shrinker
, which improves on the classical implementation of MD whenever MD is computed from noisy high-dimensional data. We show that, under a suitable choice of loss function for the estimation of MD, our shrinker is the unique asymptotically optimal shrinker; the improvement in asymptotic loss it offers over the classical MD is calculated exactly.
While the present paper focuses on MD, we posit that the same phenomenon holds much more broadly and in fact affects several widely-used manifold learning and metric learning algorithms. In this regard, the present paper seeks to highlight the fact that manifold learning and metric learning algorithms will not perform as predicted by the noiseless theory in high dimensions, and may fail completely beyond a certain noise level.
2. Problem Setup
2.1. Signal Model
Consider a point cloud in supported on a -dimensional linear subspace, where . For example, we may assume that we are sampling locally from a -dimensional embedded submanifold of , in which case the linear subspace is a specific tangent space to the manifold. Adopting the standard Gaussian model, we assume that the point cloud is sampled independently and identically distributed (i.i.d.) from
where is the population covariance matrix and its rank is equal to . The MD between an arbitrary point and the underlying signal distribution is defined by
where denotes the Moore-Penrose pseudo-inverse. Note that since is positive semi-definite, by the Cholesky decomposition, we have , where . Hence
which indicates that geometrically, MD evaluates the relationship between and, where and , where denotes the group of -by- orthogonal matrices. We know that its population covariance matrix is
and its population mean is also rotated and rescaled to . If , then the MD between and is
demonstrating the invariance.
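The definition and the invariance can be checked numerically. The following sketch (our own illustration; the function name is ours, and we use the squared distance for simplicity) computes MD with the Moore-Penrose pseudo-inverse, so rank-deficient low-dimensional covariances are handled, and verifies that rotating both the point and the distribution leaves MD unchanged:

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Squared Mahalanobis distance using the Moore-Penrose
    pseudo-inverse, so rank-deficient covariances are allowed."""
    d = x - mu
    return float(d @ np.linalg.pinv(Sigma) @ d)

rng = np.random.default_rng(1)
p, r = 6, 2
A = rng.standard_normal((p, r))
Sigma = A @ A.T                       # rank-r population covariance
mu = rng.standard_normal(p)
x = mu + A @ rng.standard_normal(r)   # a point on the r-dim subspace

# Rotation invariance: apply a random orthogonal map O to everything.
O, _ = np.linalg.qr(rng.standard_normal((p, p)))
d0 = mahalanobis(x, mu, Sigma)
d1 = mahalanobis(O @ x, O @ mu, O @ Sigma @ O.T)
```

Since pinv(O Sigma O^T) = O pinv(Sigma) O^T for orthogonal O, the two distances agree up to floating-point error.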
Suppose that the samples of , which we refer to as the signal, are not directly observable. Instead, the observed data consist of samples from the random variable
where is Gaussian measurement noise, which we assume for simplicity to be white, independent of , and . Assume that is a sample of data points. Since is unknown, the quantity , or in (1), must be estimated from the (noisy) data. For simplicity of exposition, we assume that and are known; these assumptions can easily be removed.
2.2. A Class of Estimators and a Loss Function
For any -by- matrix estimated from , consider the estimator for MD
In order to quantitatively measure the performance of any MD estimator , it is useful to introduce a loss function. For any estimator of the form (4), the absolute value of the estimation error with respect to the true value (1) is
As the test vector is arbitrary, it is natural to consider the worst case, and define the loss of at the (unknown) underlying low-dimensional covariance by
where is the matrix operator norm. To keep the notation light, the dependence of on and as well as the dependence of on the sample are implicit.
2.3. Shrinkage Estimators
Consider matrices of the form , where and is the sample covariance. We call the shrinkage estimator of with . A typical example is the classical MD estimator, which is a shrinkage estimator with , where
From , in the traditional setup where the dimension is fixed and , the classical MD estimator achieves zero loss asymptotically.
Let be fixed independently of . Then
Since it is well known that as , substituting with in (2.2) and taking the limit completes the proof. ∎
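This fixed-dimension consistency is easy to observe numerically. The sketch below (our own sanity check, in the noiseless case, with an arbitrarily chosen rank-2 covariance in dimension 5) shows the operator-norm error between the pseudo-inverted sample covariance and the true pseudo-inverse shrinking as the sample size grows:

```python
import numpy as np

def classical_md_error(n, seed=0):
    """Fixed p: operator-norm error ||pinv(S_n) - pinv(Sigma)|| for
    noiseless samples on a 2-dim subspace of R^5, Sigma = diag(2,1,0,0,0)."""
    rng = np.random.default_rng(seed)
    p = 5
    ell = np.array([2.0, 1.0])
    Sigma = np.zeros((p, p))
    Sigma[:2, :2] = np.diag(ell)
    Z = np.zeros((n, p))
    Z[:, :2] = rng.standard_normal((n, 2)) * np.sqrt(ell)  # samples on the subspace
    S = Z.T @ Z / n
    return np.linalg.norm(np.linalg.pinv(S) - np.linalg.pinv(Sigma), 2)

err_small = classical_md_error(n=100)
err_large = classical_md_error(n=100000)
```

With the dimension held fixed, the error decays at the usual root-n rate, which is exactly the classical regime in which the pseudo-inverse-based MD is reliable.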
When grows with , such that with , the situation is quite different. It is known that, in this situation, the sample covariance matrix is an inconsistent estimate of the population covariance matrix , and Theorem 1 might not hold; that is, the classical MD might not be optimal. The following questions naturally arise when :
Is there an optimal shrinkage estimator with respect to the loss ?
How does the optimal loss compare with the loss ?
In the sequel, we attempt to answer these questions.
3. Optimal Shrinkage for Mahalanobis Distance
To formally capture the low-dimensional structure assumption, consider the following model, based on Johnstone's Spiked Covariance Model . Without loss of generality, we set the noise level and will discuss the general case subsequently.
Assumption 1 (Asymptotic()).
The number of variables grows with the number of observations , such that as , for .
Assumption 2 (Spiked model).
Suppose with the eigendecomposition:
where , is a matrix whose diagonal consists of the spikes , which are fixed and independent of and , and whose off-diagonal elements are zero. For completeness, denote . Note that we assume that all spikes are simple. When , we are in the null case.
Denote the eigendecomposition of as
where are the empirical eigenvalues and is the matrix, whose columns are the empirical eigenvectors , . Under Assumption 1 and Assumption 2, results collected from [13, 1, 2, 15, 6] imply three important facts about the sample covariance matrix .
Top eigenvector inconsistency. Suppose Assumption 1 and Assumption 2 hold. Let and be the cosine and sine of the angle between the -th population eigenvector and the -th empirical eigenvector, after properly flipping the sign of each empirical eigenvector. Note that there exists a sequence of so that converges almost surely (a.s.) to . In the following we assume that the empirical eigenvectors have been properly sign-flipped. It is known that when , and , where
are defined on .
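For reference, the standard spiked-model asymptotics being invoked here can be written as follows, under the convention (an assumption on our part, matching the unit-noise model above, in which the observed population covariance is the rank-deficient signal covariance plus the identity and $\ell$ denotes a signal spike) used in the cited literature [15, 2, 6]:

```latex
\lambda(\ell) \;=\; (\ell + 1)\Bigl(1 + \frac{\gamma}{\ell}\Bigr),
\qquad
c(\ell)^2 \;=\; \frac{1 - \gamma/\ell^2}{1 + \gamma/\ell},
\qquad
s(\ell)^2 \;=\; 1 - c(\ell)^2,
\qquad \ell > \sqrt{\gamma},
```

while for $\ell \le \sqrt{\gamma}$ the corresponding sample eigenvalue converges to the bulk edge $(1+\sqrt{\gamma})^2$ and the cosine converges to $0$, so the spike is asymptotically undetectable.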
The above three properties imply that the classical MD may not be the best estimator. Inspired by , we may “correct” the biased eigenvalues to improve the estimation. Denote the asymptotic loss function by
assuming the limit exists.
To find a shrinkage estimator that minimizes , it is natural to construct the estimator by recovering the spikes using the biased eigenvalues . From the inversion of (10), recalling that , we can define
when , and consider the shrinkage function
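Assuming the standard spike-forward map $\lambda(\ell) = (\ell+1)(1+\gamma/\ell)$ (our reconstruction of the convention, since the displayed formulas did not survive extraction), the inversion referred to here solves the quadratic $\ell^2 + (1+\gamma-\lambda)\ell + \gamma = 0$ and reads:

```latex
\ell(\lambda) \;=\; \frac{(\lambda - 1 - \gamma) + \sqrt{(\lambda - 1 - \gamma)^2 - 4\gamma}}{2},
\qquad \lambda > (1 + \sqrt{\gamma})^2,
```

which returns $\ell = \sqrt{\gamma}$ exactly at the bulk edge, consistent with the detectability threshold above.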
However, since true (population) eigenvectors from and empirical eigenvectors from are not collinear , it is reasonable to expect the existence of an optimal shrinkage function satisfying
for any spikes . Below we show that is in fact the optimal shrinkage.
3.1. Derivation of the Optimal Shrinker when
A function is called a shrinker if it is continuous when , and when .
Note that this shrinker is a bulk shrinker considered in [6, Definition 3]. Based on the assumption of a shrinker , the associated shrinkage estimator converges almost surely, that is
Thus, the sequence of loss functions also almost surely converges as
As a result, the limit in (13) exists when is a shrinker, and we have the following theorem which in turn establishes the optimal shrinker.
Theorem 2 (Characterization of the asymptotic loss).
where is given by
Based on the property of “simultaneous block-diagonalization” for and in [6, Section 2], the properties of “orthogonal invariance” and “max-decomposability” for the operator norm in [6, Section 3], and the convergence of and in (11) and (17), we have
when and otherwise, and
When , the loss converges a.s. to , where
Now we evaluate for different .
When , denote the eigenvalues of as and . If we have , and hence ; otherwise, we have , and hence . For , since , we have
which equals since by the definition of shrinkage function. Thus, . Finally, for , is a zero matrix, and thus . This concludes the proof. ∎
Figure 1 illustrates the obtained asymptotic loss for several values as a function of the spike strength in a single spike model. It is clear that for each , there is a transition at .
We note the following interesting phenomenon stemming from Theorem 2. If there exists a nontrivial spike that is weak enough so that , then is dominated by . Consequently, in this regime, we cannot "rescue" this spike, and the associated signal is lost in the noise.
An immediate consequence of Theorem 2 is that is an optimal shrinker.
Note that this result coincides with the findings reported in . Precisely, it is shown in [6, (1.12)] that for the operator norm, (14) is the optimal shrinkage for the covariance matrix and precision matrix. In this corollary, we show that for the Mahalanobis distance, which is related to the precision matrix, the optimal estimator is also achieved by the optimal shrinkage, taking into account.
Based on Theorem 2, the optimal shrinker leads to . By the max-min inequality, we have
Note that for any given shrinker , we have that maximizes . By the same argument in , if we could solve for any , we find the optimal shrinkage. To simplify the notation, we abbreviate by .
For and , we have . By a direct calculation, we get
For and , we have , and similarly by taking the derivative of (20) we have
As a result, the partial derivative of the loss function is decreasing when and increasing when with a discontinuity at while the loss function is continuous. Thus, the loss function reaches the minimum when . These facts imply that when . Furthermore, by substituting with in (20) or (21), we get . By definition, when . Thus, for , , and for , . Finally, it is clear that is continuous when , and when . We thus conclude that is the optimal shrinker. ∎
Figure 2 illustrates the obtained optimal shrinker overlaid with the classical shrinker, for and . Clearly, compared with the classical shrinker, the obtained optimal shrinker truncates the eigenvalues more aggressively.
3.2. Derivation of the Optimal Shrinker when
To handle the general case when , we first rescale the data and model by setting and , and consider the following shrinker defined on :
Note that since plays the role of estimating the precision matrix, we normalize it back by dividing by . The shrinkage estimator for becomes , and the general optimal shrinker becomes
and the associated loss is
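The rescaling step can be sketched as a one-line transformation (our own helper, under the assumption that the normalization constant is the noise variance; eta denotes any shrinker derived for the unit-noise case):

```python
def rescaled_shrinker(eta, sigma):
    """Lift a shrinker defined for unit noise to noise level sigma:
    map each sample eigenvalue into unit-noise coordinates, apply eta
    there, and divide by sigma^2 because the shrinker is estimating
    (eigenvalues of) a precision matrix."""
    return lambda lam: eta(lam / sigma**2) / sigma**2

# Sanity check with the classical (pseudo-inverse) shrinker 1/lambda,
# which is scale-equivariant and hence unchanged by the lifting:
eta_classical = lambda lam: 1.0 / lam if lam > 0 else 0.0
eta_sigma = rescaled_shrinker(eta_classical, sigma=2.0)
```

For the classical shrinker the two normalizations cancel, which is a convenient consistency check on the convention.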
4. Simulation Study
To numerically compare the optimal shrinker and the classical shrinker, we set and consider the number of samples so that . For simplicity, we set for . We consider . Suppose , are sampled i.i.d. from the random vector , where is the unit vector with -th entry , for , and is independent of when . The noisy data is simulated by , where is the noise matrix, is independent of and is randomly sampled from . In the simulation, we take . For each , we repeat the experiment
times and report the mean and variance of the loss.
Figure 3 shows the log loss of the optimal and classical shrinkers when and . We observe that the loss using the classical shrinker is significantly larger. This stems from the fact that in the high-dimensional regime, there are eigenvalues greater than that are not associated with the signal. When applying the classical shrinker (6) (the Moore-Penrose pseudo-inverse), these irrelevant eigenvalues contribute significantly, leading to a high loss. In contrast, the optimal shrinker is much more 'selective' (as illustrated in Figure 2), associating larger eigenvalues with the noise, thereby increasing the robustness of the estimator.
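The qualitative comparison can be reproduced in miniature with the following sketch (our own code, not the paper's simulation: it uses the standard spike-inversion formula as a debiasing shrinker in a single-spike, unit-noise model, under an assumed convention for the spike-forward map, and adds a small finite-sample safety margin at the bulk edge):

```python
import numpy as np

def spike_inverse(lam, gamma, margin=0.1):
    """Invert the (assumed) spike map lam = (l+1)(1+gamma/l) for sample
    eigenvalues safely above the bulk edge (1+sqrt(gamma))^2; the margin
    guards against bulk eigenvalues fluctuating just past the edge.
    Returns 0 below the threshold."""
    edge = (1.0 + np.sqrt(gamma)) ** 2
    if lam <= edge + margin:
        return 0.0
    b = lam - 1.0 - gamma
    return 0.5 * (b + np.sqrt(b * b - 4.0 * gamma))

def losses(n=2000, gamma=0.5, ell=10.0, seed=0):
    """Operator-norm loss ||W - pinv(Sigma)|| for the classical
    pseudo-inverse versus the debiased shrinker 1/l(lam)."""
    rng = np.random.default_rng(seed)
    p = int(gamma * n)
    u = np.zeros(p)
    u[0] = 1.0
    Sigma_pinv = np.outer(u, u) / ell                    # pinv of ell * u u^T
    Z = np.sqrt(ell) * rng.standard_normal((n, 1)) * u   # rank-1 signal
    X = Z + rng.standard_normal((n, p))                  # unit white noise
    S = X.T @ X / n
    vals, vecs = np.linalg.eigh(S)
    W_classical = vecs @ np.diag(1.0 / vals) @ vecs.T    # S invertible: p < n
    eta = np.zeros(p)
    for i, lam in enumerate(vals):
        l_hat = spike_inverse(lam, gamma)
        if l_hat > 0.0:
            eta[i] = 1.0 / l_hat
    W_debiased = vecs @ np.diag(eta) @ vecs.T
    loss = lambda W: np.linalg.norm(W - Sigma_pinv, 2)
    return loss(W_classical), loss(W_debiased)

loss_classical, loss_debiased = losses()
```

The pseudo-inverse loss is dominated by the reciprocal of the smallest bulk eigenvalue, while the debiased shrinker keeps only the detected spike and incurs a small loss driven mainly by the eigenvector inconsistency.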
Our main motivation for considering MD in the high-dimensional regime comes from manifold learning. In manifold learning, point clouds with possible nonlinear structures are modeled by manifolds; that is, data points in are assumed to lie on or close to a -dimensional smooth manifold . This mathematical definition can be understood intuitively – the set of data points in a small region can be well approximated by a -dimensional affine subspace of . The dimension of the manifold is usually fixed when we sample more and more data, representing intrinsic properties of the data, whereas may vary, depending, for example, on the specific observation modality or the advance of the sampling technology. Traditionally, the goal is to learn the structure of the manifold from the data points, and in turn, use the learned structure to embed the high-dimensional data in low dimension, facilitating a compact representation of the 'essence' of the data. In a recent line of work [17, 20], a variant of MD was proposed and used to reveal hidden intrinsic manifolds underlying high-dimensional, possibly multimodal, observations. The main purpose of MD in this hidden manifold setup is handling possible deformations caused by the observation/sampling process, which is equivalent to estimating the precision matrix locally on the manifold. Studying this case and examining the benefits stemming from the incorporation of the proposed optimal shrinker is beyond the scope of this paper and is left for future work.
Here we only test the performance of the proposed optimal shrinker in a simplified setup, where the high-dimensional data lie on a lower-dimensional manifold. Consider the model , where is sampled from a curved manifold with one chart embedded in that can be parametrized by , , and is the noise level. The ambient dimension is fixed at . For each , samples are taken by uniform sampling from . The normalized loss of MD is computed by
where is an arbitrary point on the manifold. We examine two cases, where and . Each case was repeated 500 times, with the mean and standard deviation reported. In Table 1, we compare the performance of the optimal shrinker estimator with the performance of the classical estimator . We observe that the optimal shrinker outperforms the classical estimator in this well-controlled manifold setup.
| 18.78 ± 1.16 | 0.78 ± 0.54 | 55.98 ± 2.09 | 1.32 ± 0.93 |
| 23.72 ± 1.19 | 1.41 ± 0.87 | 59.42 ± 2.10 | 2.59 ± 1.74 |
| 33.54 ± 1.53 | 2.18 ± 1.31 | 64.04 ± 1.69 | 5.19 ± 2.99 |
| 26.86 ± 2.70 | 2.41 ± 1.55 | 60.88 ± 4.22 | 4.06 ± 2.54 |
| 42.66 ± 3.43 | 4.78 ± 2.65 | 69.06 ± 3.43 | 10.65 ± 5.38 |
| 58.59 ± 3.24 | 9.84 ± 4.63 | 77.24 ± 2.81 | 31.52 ± 17.56 |
| 34.70 ± 4.89 | 4.05 ± 2.35 | 64.11 ± 5.57 | 8.39 ± 3.91 |
| 54.72 ± 4.46 | 10.62 ± 4.69 | 75.54 ± 3.63 | 23.97 ± 12.49 |
| 69.65 ± 3.97 | 21.35 ± 7.94 | 83.30 ± 2.78 | 62.99 ± 19.34 |
We proposed a new estimator for MD based on precision matrix shrinkage. For an appropriate loss function, we show that the proposed estimator is asymptotically optimal and outperforms the classical implementation of MD using the Moore-Penrose pseudo-inverse of the sample covariance. Importantly, the proposed estimator is particularly beneficial when the data are noisy and high-dimensional, a case in which the classical MD estimator might fail completely. Consequently, we believe that the new estimator may be useful in modern data analysis applications involving, for example, local principal component analysis, metric design, and manifold learning.
In this work, we focused on the case in which the intrinsic dimensionality of the data (the rank of the covariance matrix) is unknown, and therefore, it was not explicitly used in the estimation. Yet, in many scenarios, this dimension is known. In this case, it could be beneficial to consider a direct truncation and use only the top eigen-pairs for the estimation of the precision matrix. Note that in the particular manifold setup, this case is essentially different from the rank-aware shrinker discussed in ; in the manifold setup, the rank of the covariance matrix associated with points residing inside a small neighborhood of any point on the manifold could be much larger than . The benefit of this truncation approach has been shown in several applications [21, 24, 27, 11]. However, the rank of the signal, or the dimension of the manifold, is often not available in practice, and estimating it is highly challenging [9, 10]. We leave the above-mentioned challenges in applying the established theory to manifold learning to future work.
MG has been supported by H-CSRC Security Research Center and Israeli Science Foundation grant no. 1523/16. MG and RT were supported by a grant from the Tel-Aviv University ICRC Research Center.
-  J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.
-  J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
-  M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
-  R. R. Coifman and S. Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
-  D. Dai and T. Holgersson. High-dimensional CLTs for individual Mahalanobis distances. In Müjgan Tez and Dietrich von Rosen, editors, Trends and Perspectives in Linear Statistical Inference, pages 57–68, 2018.
-  D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model. The Annals of Statistics, 46(4):1742–1778, November 2018.
-  I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327, 2001.
-  I. M. Johnstone. High Dimensional Statistical Inference and Random Matrices. In Proc. Int. Congr. Math., pages 307–333, 2006.
-  S. Kritchman and B. Nadler. Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. Trans. Sig. Proc., 57(10):3930–3941, October 2009.
-  E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in neural information processing systems, pages 777–784, 2005.
-  G.-R. Liu, Y.-L. Lo, Y.-C. Sheu, and H.-T. Wu. Diffuse to fuse EEG spectra – intrinsic geometry of sleep dynamics for classification. arXiv preprint arXiv:1803.01710, 2018.
-  P. C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2:49–55, 1936.
-  V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.
-  G. J. McLachlan. Mahalanobis distance. Resonance, 4(6):20–26, 1999.
-  D. Paul. Asymptotics of Sample Eigenstructure for a Large Dimensional Spiked Covariance Model. Statistica Sinica, 17:1617–1642, 2007.
-  S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
-  A. Singer and R. R. Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008.
-  A. Singer and H.-T. Wu. Two-dimensional tomography from noisy projections taken at unknown random directions. SIAM J. Imaging Sciences, 6(1):136–175, 2012.
-  C. M. Stein. Lectures on the theory of estimation of many parameters. Journal of Soviet Mathematics, 74(5), 1986.
-  R. Talmon and R. R. Coifman. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proceedings of the National Academy of Sciences, 110(31):12535–12540, 2013.
-  R. Talmon, S. Mallat, H. Zaveri, and R. R. Coifman. Manifold learning for latent variable inference in dynamical systems. IEEE Transactions on Signal Processing, 63(15):3843–3856, 2015.
-  J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
-  K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, 2009.
-  H.-T. Wu, R. Talmon, and Y.-L. Lo. Assess sleep stage by modern signal processing techniques. IEEE Trans. Biomed. Engineering, 62(4):1159–1168, 2015.
-  H.-T. Wu and N. Wu. Think globally, fit locally under the manifold setup: Asymptotic analysis of locally linear embedding. Annals of Statistics, in press, 2018.
-  S. Xiang, F. Nie, and C. Zhang. Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognition, 41:3600–3612, 2008.
-  O. Yair, R. Talmon, R. R. Coifman, and I. G. Kevrekidis. Reconstruction of normal forms by learning informed observation geometries from data. Proceedings of the National Academy of Sciences, 114(38):E7865–E7874, 2017.
-  L. Yang and R. Jin. Distance metric learning: A comprehensive survey. Michigan State University, 2(2):4, 2006.