Optimal Recovery of Mahalanobis Distance in High Dimension

April 19, 2019 · Matan Gavish et al. · Technion, Duke University, Hebrew University of Jerusalem

In this paper, we study the problem of Mahalanobis distance (MD) estimation from high-dimensional noisy data. By relying on recent transformative results in covariance matrix estimation, we demonstrate the sensitivity of MD to measurement noise, determining the exact asymptotic signal-to-noise ratio at which MD fails, and quantifying its performance otherwise. In addition, for an appropriate loss function, we propose an asymptotically optimal shrinker, which is shown to be beneficial over the classical implementation of the MD, both analytically and in simulations.


1. Introduction

High-dimensional datasets encountered in modern science often exhibit nonlinear lower-dimensional structures. Efforts to uncover such low-dimensional structures revolve around the discovery of meaningful measures of pairwise discrepancy between data points [22, 16, 3, 4]. In the so-called metric design problem, the data analyst aims to find a useful metric representing the relationship between data points embedded in high-dimensional space. In this paper, we study the Mahalanobis Distance (MD) – a popular, and arguably the first, method for metric design [12, 14]. MD was originally proposed in 1936 with the classical low-dimensional setting in mind, namely, for the case where the ambient dimension is much smaller than the dataset size.

Interestingly, due to its useful statistical and invariance properties, MD became the basis of several geometric data analysis [28, 23, 26] and manifold learning [17, 20] techniques aimed specifically at the high-dimensional regime. It was recently shown that MD is also implicitly used in the seminal Locally Linear Embedding algorithm [16], when the barycenter step is properly expressed [25, (2.17)].

As the number of dimensions in typical data analysis applications continues to grow, it becomes increasingly crucial to understand the behavior of MD, as well as other metric design algorithms, in the high-dimensional regime. At first glance, it might seem that this regime poses little more than a computational inconvenience for metric design using MD. Indeed, it is easy to show that in the absence of measurement noise, MD is hardly affected by an increase in the ambient dimension.

This paper calls attention to the following key observation. In the high-dimensional regime, in the presence of ambient measurement noise, a new phenomenon emerges, which introduces nontrivial effects on the performance of MD. Depending on the noise level, MD may be adversely affected or even fail completely. Clearly, the assumption of measurement noise cannot be realistically excluded, and yet, to the best of our knowledge, this phenomenon has not been fully studied before. A first step in this direction was taken in [5], with the calculation of the distribution of MD under specific assumptions.

Let us describe this key phenomenon informally at first. The computation of MD involves estimation of the inverse covariance matrix, or precision matrix, corresponding to the data at hand. Classically, the estimation relies on the sample covariance, which is inverted using the Moore-Penrose pseudo-inverse. It is well known that, in the regime where the dimension is comparable to the number of samples, the sample covariance matrix is a poor estimator of the underlying covariance matrix. Indeed, advances in random matrix theory from the last decade imply that the eigenvalues and eigenvectors of the sample covariance matrix are inconsistent, namely, they do not converge to the corresponding eigenvalues and eigenvectors of the underlying covariance matrix [15, 7]. Importantly, such inconsistencies in the small eigenvalues, which already lead to inaccurate covariance estimation, are greatly amplified when the Moore-Penrose pseudo-inverse is applied.

In this paper, we study this problem and propose a remedy. By relying on existing formal results in covariance matrix estimation, we measure the sensitivity of MD to measurement noise. Under the assumption that the data lie on a low-dimensional linear subspace or manifold in the ambient space and that the measurement noise is white Gaussian, we determine the exact asymptotic signal-to-noise ratio at which MD fails, and quantify its performance otherwise. In addition, it has been known since the 1970s [19] that by shrinking the sample covariance eigenvalues one can significantly mitigate the noise effects and improve covariance estimation in high dimensions. We formulate the classical MD as a particular choice of shrinkage estimator for the eigenvalues of the sample covariance. Building on recent results in high-dimensional covariance estimation, including the general theory in [6] and a special case with application in random tomography [18, Section 4.4], we find an asymptotically optimal shrinker, which outperforms the classical implementation of MD whenever MD is computed from noisy high-dimensional data. We show that under a suitable choice of loss function for the estimation of MD, our shrinker is the unique asymptotically optimal shrinker; the improvement in asymptotic loss it offers over the classical MD is calculated exactly.

While the present paper focuses on MD, we posit that the same phenomenon holds much more broadly and in fact affects several widely-used manifold learning and metric learning algorithms. In this regard, the present paper seeks to highlight the fact that manifold learning and metric learning algorithms will not perform as predicted by the noiseless theory in high dimensions, and may fail completely beyond a certain noise level.

2. Problem Setup

2.1. Signal Model

Consider a point cloud in the ambient space R^p, supported on a d-dimensional linear subspace, where d is much smaller than p. For example, we may assume that we are sampling locally from a d-dimensional embedded submanifold of R^p, in which case the linear subspace is a specific tangent space to the manifold. Assuming the standard Gaussian model, the point cloud is sampled independently and identically (i.i.d.) from a Gaussian random vector X with mean m and population covariance matrix Σ, where the rank of Σ is equal to d. The MD between an arbitrary point x and the underlying signal distribution is defined by

d_\Sigma(x) = \sqrt{(x - m)^\top \Sigma^\dagger (x - m)},    (1)

where Σ† denotes the Moore-Penrose pseudo-inverse. Note that since Σ is positive semi-definite, by the Cholesky decomposition we can write Σ† = WᵀW for some matrix W. Hence

d_\Sigma(x) = \| W (x - m) \|,

which indicates that, geometrically, MD evaluates the relationship between x and the signal distribution

through a proper linear transform. A primary merit of MD for manifold learning stems from its invariance to rotation and rescaling. To see this property, consider the random variable Y = aUX, where a > 0 and U ∈ O(p), the group of p-by-p orthogonal matrices. Its population covariance matrix is

\Sigma_Y = a^2 U \Sigma U^\top,    (2)

and its population mean is also rotated and rescaled, to aUm. If y = aUx, then the MD between y and the distribution of Y is

d_{\Sigma_Y}(y) = d_\Sigma(x),    (3)

demonstrating the invariance.
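This invariance is easy to verify numerically. The following minimal sketch (dimensions, variable names, and the random model are ours, chosen only for illustration) computes the MD in (1) via the Moore-Penrose pseudo-inverse and checks that it is unchanged under rotation and rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 50, 3                               # illustrative ambient and intrinsic dimensions

# Rank-d population covariance and a population mean
B = rng.standard_normal((p, d))
Sigma = B @ B.T                            # positive semi-definite, rank d
m = rng.standard_normal(p)

def mahalanobis(x, mean, cov):
    """MD of x with respect to N(mean, cov), using the Moore-Penrose pseudo-inverse as in (1)."""
    diff = x - mean
    return np.sqrt(diff @ np.linalg.pinv(cov) @ diff)

x = m + B @ rng.standard_normal(d)         # a test point on the d-dimensional subspace

# Rotate and rescale: y = a * U * x, with U orthogonal and a > 0
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
a = 2.7
y, m_y, Sigma_y = a * U @ x, a * U @ m, a**2 * U @ Sigma @ U.T

print(mahalanobis(x, m, Sigma))            # the two printed values agree,
print(mahalanobis(y, m_y, Sigma_y))        # demonstrating the invariance in (2)-(3)
```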

Suppose that the samples of X, which we refer to as the signal, are not directly observable. Instead, the observed data consist of samples from the random variable Z = X + ξ, where ξ is Gaussian measurement noise, which we assume for simplicity to be white with variance σ², i.e., ξ ∼ N(0, σ² I_p), and independent of X. Assume that z_1, …, z_n is a sample of n data points. Since Σ is unknown, the quantity Σ†, and hence the MD in (1), must be estimated from the (noisy) data. For simplicity of exposition, we assume that m and σ are known; these assumptions can easily be removed.

2.2. A Class of Estimators and a Loss Function

For any p-by-p matrix A estimated from the noisy sample z_1, …, z_n, consider the following estimator for the MD:

\hat d_A(x) = \sqrt{(x - m)^\top A (x - m)}.    (4)

In order to quantitatively measure the performance of any MD estimator, it is useful to introduce a loss function. For any estimator of the form (4), the absolute value of the estimation error of the squared distance with respect to the true value (1) is

| \hat d_A(x)^2 - d_\Sigma(x)^2 | = | (x - m)^\top (A - \Sigma^\dagger) (x - m) |.

As the test vector x is arbitrary, it is natural to consider the worst case over unit-norm deviations x − m, and define the loss of A at the (unknown) underlying low-dimensional covariance Σ by

L(A, \Sigma) = \| A - \Sigma^\dagger \|_{op},    (5)

where ‖·‖_op is the matrix operator norm. To keep the notation light, the dependence of the loss on n and p, as well as the dependence of A on the sample, are implicit.
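The loss in (5) is straightforward to evaluate numerically; a minimal sketch (the helper name is ours):

```python
import numpy as np

def md_loss(A, Sigma):
    """Worst-case error of the quadratic form of A over unit-norm test directions,
    measured against the true precision matrix pinv(Sigma), as in (5)."""
    return np.linalg.norm(A - np.linalg.pinv(Sigma), ord=2)   # spectral (operator) norm
```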

2.3. Shrinkage Estimators

Consider matrices of the form η(S), where η : [0, ∞) → [0, ∞) is a scalar function applied to the eigenvalues of the sample covariance S of the noisy data; that is, if S = ∑_i λ_i u_i u_iᵀ is the eigendecomposition of S, then η(S) = ∑_i η(λ_i) u_i u_iᵀ. We call η(S) the shrinkage estimator of Σ† associated with the shrinker η. A typical example is the classical MD estimator, which is a shrinkage estimator with the shrinker η_c, where

\eta_c(\lambda) = \begin{cases} 1/\lambda, & \lambda > 0, \\ 0, & \lambda = 0, \end{cases}    (6)

so that η_c(S) is the Moore-Penrose pseudo-inverse of S. From [12], in the traditional setup where the dimension p is fixed and n → ∞, the classical MD estimator attains zero loss asymptotically.

Theorem 1.

Let the dimension p be fixed independently of n. Then the loss of the classical MD estimator converges to zero as n → ∞.

Proof.

Since it is well known that the sample covariance converges almost surely to its population counterpart as n → ∞ with p fixed, substituting this limit into the loss defined in Section 2.2 and taking the limit completes the proof. ∎
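In code, a shrinkage estimator applies a scalar function to each eigenvalue of the sample covariance while keeping the sample eigenvectors. A minimal sketch of this construction, together with the classical shrinker (6) (helper names are ours):

```python
import numpy as np

def shrinkage_estimator(S, eta):
    """Apply the scalar shrinker eta to the eigenvalues of the sample covariance S,
    keeping the sample eigenvectors, and return the resulting precision-matrix estimate."""
    lam, V = np.linalg.eigh(S)
    return (V * eta(lam)) @ V.T            # V diag(eta(lam)) V^T

def eta_classical(lam, tol=1e-12):
    """Classical shrinker (6): invert the nonzero eigenvalues and zero out the rest,
    which reproduces the Moore-Penrose pseudo-inverse of S."""
    return np.where(lam > tol, 1.0 / np.maximum(lam, tol), 0.0)

# Usage: shrinkage_estimator(S, eta_classical) recovers pinv(S).
```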

When p grows with n such that p/n → β for some constant β > 0, the situation is quite different. It is known that, in this situation, the sample covariance matrix is an inconsistent estimate of the population covariance matrix [8], and Theorem 1 might not hold; that is, the classical MD might not be optimal. The following questions naturally arise in this regime:

  1. Is there an optimal shrinkage estimator with respect to the loss (5)?

  2. How does the optimal loss compare with the loss attained by the classical MD estimator?

In the sequel, we attempt to answer these questions.

3. Optimal Shrinkage for Mahalanobis Distance

To formally capture the low-dimensional structure assumption, consider the following model, based on Johnstone’s Spiked Covariance Model [7]. Without loss of generality, we set the noise level to σ = 1 and discuss the general case subsequently.

Assumption 1 (Asymptotic regime).

The number of variables p grows with the number of observations n, such that p/n → β as n → ∞, for a fixed β > 0.

Assumption 2 (Spiked model).

Suppose the population covariance Σ admits the eigendecomposition

\Sigma = V \, \mathrm{diag}(\ell_1, \ldots, \ell_d, 0, \ldots, 0) \, V^\top,    (7)

where V = [v_1, …, v_p] is a p-by-p orthogonal matrix and the spikes ℓ_1 > ℓ_2 > … > ℓ_d > 0 are fixed and independent of n and p; the remaining diagonal entries are set to zero. For completeness, denote ℓ_{d+1} = … = ℓ_p = 0. Note that we assume that all spikes are simple. When d = 0, we have the null case.

Denote the eigendecomposition of the sample covariance S of the noisy data by

S = U \Lambda U^\top,    (8)

where Λ = diag(λ_1, …, λ_p) contains the empirical eigenvalues λ_1 ≥ … ≥ λ_p ≥ 0 and U is the p-by-p orthogonal matrix whose columns are the empirical eigenvectors u_1, …, u_p. Under Assumption 1 and Assumption 2, results collected from [13, 1, 2, 15, 6] imply three important facts about the sample covariance matrix S.

  1. Eigenvalue spread. Suppose Assumption 1 holds and consider the null case where d = 0. As n → ∞, the empirical distribution of the eigenvalues of S converges to a continuous distribution called the “Marcenko-Pastur” law [13], with density

     f_\beta(x) = \frac{\sqrt{(\beta_+ - x)(x - \beta_-)}}{2 \pi \beta x}, \qquad x \in [\beta_-, \beta_+],    (9)

     where β_± = (1 ± √β)² are the limiting bulk edges.

  2. Top eigenvalue bias. Suppose Assumption 1 and Assumption 2 hold. For each i = 1, …, d with ℓ_i > √β, the empirical eigenvalue λ_i converges almost surely to λ(ℓ_i) as n → ∞, where

     \lambda(\ell) = (1 + \ell)\Big(1 + \frac{\beta}{\ell}\Big)    (10)

     is defined on ℓ > √β and takes values above β_+. For ℓ_i ≤ √β, the corresponding empirical eigenvalues stay inside the bulk and follow the Marcenko-Pastur law (9).

  3. Top eigenvector inconsistency. Suppose Assumption 1 and Assumption 2 hold. Let c_i and s_i be the cosine and sine of the angle between the i-th population eigenvector v_i and the i-th empirical eigenvector u_i, after properly flipping the sign of each empirical eigenvector. Note that there exists a sequence of sign flips so that the cosine converges almost surely (a.s.). In the following we assume that the empirical eigenvectors have been properly rotated. It is known that when ℓ_i > √β, c_i → c(ℓ_i) and s_i → s(ℓ_i) a.s., where

     c(\ell) = \sqrt{\frac{1 - \beta/\ell^2}{1 + \beta/\ell}}    (11)

     and

     s(\ell) = \sqrt{1 - c(\ell)^2}    (12)

     are defined on ℓ > √β. (A simulation sketch illustrating these three facts follows this list.)
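The three facts above can be checked by a short simulation. The sketch below (sample sizes and the spike value are illustrative) draws noisy observations from a single-spike model with unit noise and compares the empirical quantities with the limits (9)-(12).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 500, 1000
beta = p / n                               # aspect ratio of Assumption 1
ell = 3.0                                  # one spike, above the threshold sqrt(beta)

# Noisy observations: a rank-one signal along v plus white noise of unit variance
v = np.zeros(p); v[0] = 1.0
Z = np.sqrt(ell) * np.outer(rng.standard_normal(n), v) + rng.standard_normal((n, p))
S = Z.T @ Z / n                            # sample covariance (the mean is known to be zero)

lam, V = np.linalg.eigh(S)
lam, V = lam[::-1], V[:, ::-1]             # sort eigenvalues in decreasing order

# Fact 1: the bulk stays inside the Marcenko-Pastur edge beta_+ = (1 + sqrt(beta))^2
print((1 + np.sqrt(beta))**2, lam[1])
# Fact 2: the top eigenvalue is biased upward to lambda(ell) = (1 + ell)(1 + beta/ell), not 1 + ell
print((1 + ell) * (1 + beta / ell), lam[0])
# Fact 3: the top eigenvector is not collinear with the population one; cf. c(ell) in (11)
print((1 - beta / ell**2) / (1 + beta / ell), (V[:, 0] @ v)**2)
```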

The above three properties imply that the classical MD may not be the best estimator. Inspired by [6], we may “correct” the biased eigenvalues to improve the estimation. Denote the asymptotic loss function associated with a shrinker η by

L_\infty(\eta) = \lim_{n \to \infty} L(\eta(S), \Sigma),    (13)

assuming the limit exists.

To find a shrinkage estimator that minimizes the asymptotic loss (13), it is natural to construct the estimator by recovering the spikes from the biased eigenvalues. From the inversion of (10), recalling that the nonzero eigenvalues of Σ† are 1/ℓ_1, …, 1/ℓ_d, we can define

\ell(\lambda) = \frac{(\lambda - 1 - \beta) + \sqrt{(\lambda - 1 - \beta)^2 - 4\beta}}{2}    (14)

when λ > β_+, and consider the shrinkage function

\eta_0(\lambda) = \begin{cases} 1/\ell(\lambda), & \lambda > \beta_+, \\ 0, & \text{otherwise}. \end{cases}    (15)

However, since the true (population) eigenvectors from V and the empirical eigenvectors from U are not collinear [6], it is reasonable to expect the existence of an optimal shrinkage function η* satisfying

L_\infty(\eta^*) \le L_\infty(\eta) \quad \text{for every shrinker } \eta,

for any spikes ℓ_1, …, ℓ_d. Below we show that η_0 is in fact the optimal shrinker.
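A minimal sketch of the debiasing map (14) and the candidate shrinker (15): the forward map (10) is a quadratic relation in ℓ, so it is inverted by the quadratic formula above the bulk edge (helper names are ours).

```python
import numpy as np

def ell_of_lambda(lam, beta):
    """Invert the forward map lambda(ell) = (1 + ell)(1 + beta/ell) for eigenvalues
    above the bulk edge beta_+ = (1 + sqrt(beta))^2, as in (14); returns 0 inside the bulk."""
    edge = (1 + np.sqrt(beta))**2
    t = lam - 1 - beta
    disc = np.sqrt(np.maximum(t**2 - 4 * beta, 0.0))
    return np.where(lam > edge, 0.5 * (t + disc), 0.0)

def eta_debias(lam, beta):
    """Candidate shrinker (15): plug the debiased spike into the precision matrix,
    i.e. assign 1/ell(lambda) above the bulk edge and 0 inside the bulk."""
    ell = ell_of_lambda(lam, beta)
    return np.where(ell > 0, 1.0 / np.maximum(ell, 1e-12), 0.0)
```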

3.1. Derivation of the Optimal Shrinker when σ = 1

Definition 1.

A function η : [0, ∞) → [0, ∞) is called a shrinker if it is continuous for λ > β_+ and η(λ) = 0 for λ ≤ β_+.

Note that such a shrinker is a bulk shrinker as considered in [6, Definition 3]. For a shrinker η, the associated shrinkage estimator η(S) converges almost surely, that is,

(16)

Thus, the sequence of loss functions also converges almost surely,

(17)

As a result, the limit in (13) exists when η is a shrinker, and we have the following theorem, which in turn establishes the optimal shrinker.

Theorem 2 (Characterization of the asymptotic loss).

Suppose σ = 1. Consider the spiked covariance model satisfying Assumption 1 and Assumption 2 and a shrinker η. We have, almost surely,

(18)

where the limiting loss is given by

(19)

where

(20)
(21)
Proof.

Based on the property of “simultaneous block-diagonalization” for Σ† and η(S) in [6, Section 2], the properties of “orthogonal invariance” and “max-decomposability” for the operator norm in [6, Section 3], and the convergence results in (11) and (17), the loss decomposes asymptotically into the maximum of the operator norms of one small block per spike, together with a bulk term.

For each spike with ℓ_i > √β, the corresponding block is a 2-by-2 matrix whose entries involve η(λ(ℓ_i)), 1/ℓ_i, c(ℓ_i), and s(ℓ_i); its operator norm is obtained by a direct calculation of its two eigenvalues, distinguishing the cases according to their signs. For each spike with 0 < ℓ_i ≤ √β, the shrinker vanishes on the corresponding empirical eigenvalue, which lies inside the bulk, so the contribution to the loss is 1/ℓ_i. Finally, on the orthogonal complement of the spikes, both Σ† and the limit of η(S) vanish, and the corresponding block is a zero matrix, contributing nothing. Taking the maximum over these contributions yields (18)–(21) and concludes the proof. ∎

Figure 1. The asymptotic loss for several values of β as a function of the spike strength ℓ in a single-spike model, i.e., d = 1. The noise level is fixed and set to σ = 1.

Figure 1 illustrates the obtained asymptotic loss for several values of β as a function of the spike strength ℓ in a single-spike model. It is clear that for each β, there is a transition at ℓ = √β.

We note the following interesting phenomenon stemming from Theorem 2. If there exists a nontrivial spike ℓ_i that is weak enough so that ℓ_i ≤ √β, then the asymptotic loss is dominated by the term 1/ℓ_i. Consequently, in this large-p, large-n regime, no shrinker can “rescue” this spike, and the associated signal is lost in the noise.
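For a general noise level σ (treated in Section 3.2), the analogous condition is that a spike of strength ℓ is recoverable only when ℓ/σ² exceeds √β; treating this standard detectability threshold for the spiked model as an assumption, a small worked computation gives the largest tolerable noise level (cf. the vertical line in Figure 3).

```python
import numpy as np

# Assumed detectability condition: ell / sigma^2 > sqrt(beta).
beta, ell = 0.5, 2.0                      # illustrative aspect ratio and spike strength
sigma_max = (ell**2 / beta) ** 0.25       # largest noise level at which the spike is still recoverable
print(sigma_max)                          # about 1.68; beyond this the spike is absorbed into the bulk
```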

An immediate consequence of Theorem 2 is that η_0 defined in (15) is an optimal shrinker.

Corollary 1.

Suppose σ = 1 and that Assumption 1 and Assumption 2 hold. Define the asymptotically optimal shrinkage function as

\eta^* = \arg\min_{\eta} L_\infty(\eta),    (22)

where the argmin is evaluated over the set of all possible shrinkers. Then η* is unique and equals η_0 given in (15). Moreover, its associated loss is

(23)

where

(24)

Note that this result coincides with the findings reported in [6]. Precisely, it is shown in [6, (1.12)] that, for the operator norm, (14) is the optimal shrinkage for the covariance matrix and the precision matrix. In this corollary, we show that for the Mahalanobis distance, which is closely related to the precision matrix, the optimal estimator is also achieved by this optimal shrinkage.

Proof.

Based on Theorem 2, the optimal shrinker minimizes the asymptotic loss (18). By the max-min inequality, we have

(25)

Note that for any given shrinker η, the asymptotic loss is attained by the spike that maximizes the per-spike contribution. By the same argument as in [6], if we can minimize the per-spike contribution for every spike strength separately, we obtain the optimal shrinker.

For a spike above the threshold, taking the derivative of (20) and (21) with respect to the value assigned by the shrinker shows that the loss is decreasing below a unique critical value and increasing above it, with a discontinuity of the derivative at that value while the loss itself is continuous; thus the loss reaches its minimum there. Substituting this value back into (20) or (21) identifies the minimizer as η*(λ) = 1/ℓ(λ) for λ > β_+; by the definition of a shrinker, η*(λ) = 0 for λ ≤ β_+. Finally, η* is continuous for λ > β_+ and vanishes for λ ≤ β_+, so it is a valid shrinker and coincides with η_0. We thus conclude that η_0 is the optimal shrinker. ∎

Figure 2. The obtained optimal shrinker with the classical shrinker overlaid, for fixed β and σ. To enhance the visualization, the y-axis of the figure is truncated.

Figure 2 illustrates the obtained optimal shrinker with the classical shrinker overlaid. Clearly, compared with the classical shrinker, the obtained optimal shrinker truncates the eigenvalues more aggressively: every eigenvalue inside the Marcenko-Pastur bulk is mapped to zero.

3.2. Derivation of the Optimal Shrinker when σ ≠ 1

To handle the general case when σ ≠ 1, we first rescale the data and the model by dividing the observations by σ, so that the rescaled model has unit noise level and spikes ℓ_i/σ², and consider the shrinker of Section 3.1 applied to the rescaled eigenvalues:

(26)

Note that since the shrinker plays the role of estimating the precision matrix, we normalize it back by dividing by σ². The shrinkage estimator for Σ† is obtained by applying this rescaled shrinker to the eigenvalues of the sample covariance of the original data, and the general optimal shrinker becomes

(27)

and the associated loss equals the loss of the unit-noise problem, with spikes ℓ_i/σ², divided by σ².
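A minimal sketch of this rescaling, reusing the eta_debias helper sketched in Section 3: the eigenvalues are first normalized by the noise variance, the unit-noise shrinker is applied, and the result is divided by σ², since the shrinker estimates a precision matrix.

```python
def eta_general(lam, beta, sigma):
    """General-noise shrinker, cf. (26)-(27): normalize eigenvalues by the noise variance,
    apply the unit-noise shrinker, then divide by sigma^2 (precision-matrix scaling)."""
    return eta_debias(lam / sigma**2, beta) / sigma**2
```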

4. Simulation Study

To numerically compare the optimal shrinker and the classical shrinker, we fix the ambient dimension p and choose the number of samples n so that the aspect ratio β = p/n is fixed. For simplicity, we set all the spikes to a common strength. The signal samples are drawn i.i.d. from a random vector supported on the subspace spanned by a few standard unit vectors, with independent coordinates. The noisy data are simulated by adding a white Gaussian noise matrix, independent of the signal, with standard deviation σ, for a range of noise levels σ. For each σ, we repeat the experiment a fixed number of times and report the mean and variance of the loss (5).
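A condensed sketch of an experiment of this kind, reusing the helpers from the earlier sketches (md_loss, shrinkage_estimator, eta_classical, and eta_general); the dimensions, spike strengths, and noise level below are illustrative only and are not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
p, beta = 500, 0.5
n = int(p / beta)
d, spikes = 3, np.array([10.0, 5.0, 2.0])        # illustrative rank and spike strengths
sigma = 1.0                                      # illustrative noise level

# Rank-d signal covariance with the spikes on the first d coordinates
Sigma = np.zeros((p, p))
Sigma[:d, :d] = np.diag(spikes)

X = rng.standard_normal((n, d)) * np.sqrt(spikes)                  # signal coordinates
Z = np.hstack([X, np.zeros((n, p - d))]) + sigma * rng.standard_normal((n, p))
S = Z.T @ Z / n                                  # sample covariance of the noisy data

A_classical = shrinkage_estimator(S, eta_classical)
A_optimal = shrinkage_estimator(S, lambda lam: eta_general(lam, beta, sigma))

print("classical loss:", md_loss(A_classical, Sigma))
print("optimal loss:  ", md_loss(A_optimal, Sigma))
```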

Figure 3 shows the log loss of the optimal and classical shrinkers in two representative settings. We observe that the loss using the classical shrinker is significantly larger. This stems from the fact that in the large-p, large-n regime, there are empirical eigenvalues that are not associated with the signal yet are bounded away from zero. When applying the classical shrinker (6) (the Moore-Penrose pseudo-inverse), these irrelevant eigenvalues contribute significantly, leading to a high loss. Conversely, the optimal shrinker is much more ‘selective’ (as illustrated in Figure 2), associating even relatively large eigenvalues, those inside the bulk, with the noise, thereby increasing the robustness of the estimator.

Figure 3. The performance of the optimal shrinker and the classical shrinker in two settings (top and bottom). The black curve represents the median of the difference between the loss of the classical shrinker and the theoretical optimal loss, presented in log scale. The blue curve represents the median of the difference between the loss of the optimal shrinker and the theoretical optimal loss, presented in log scale. The error bars depict the interquartile range for each shrinker (in log scale). The vertical blue line indicates the tolerable noise level for the given aspect ratio and signal strength.

Our main motivation for considering MD in the high-dimensional regime comes from manifold learning. In manifold learning, point clouds with possible nonlinear structures are modeled by manifolds; that is, the data points in the ambient space are assumed to lie on or close to a low-dimensional smooth manifold. This mathematical definition can be understood intuitively – the set of data points in a small region can be well approximated by a low-dimensional affine subspace of the ambient space. The dimension of the manifold is usually fixed when we sample more and more data, representing intrinsic properties of the data, whereas the ambient dimension may vary, depending, for example, on the specific observation modality or advances in the sampling technology. Traditionally, the goal is to learn the structure of the manifold from the data points, and in turn, use the learned structure to embed the high-dimensional data in a low-dimensional space, facilitating a compact representation of the ‘essence’ of the data. In a recent line of work [17, 20], a variant of MD was proposed and used to reveal hidden intrinsic manifolds underlying high-dimensional, possibly multimodal, observations. The main purpose of MD in this hidden manifold setup is handling possible deformations caused by the observation/sampling process, which is equivalent to estimating the precision matrix locally on the manifold. Studying this case and examining the benefits stemming from the incorporation of the proposed optimal shrinker is beyond the scope of this paper and is left for future work.

Here we only test the performance of the proposed optimal shrinker in a simplified setup, where the high-dimensional data lie on a lower-dimensional manifold. Consider the model in which the clean samples are drawn from a curvy one-chart manifold embedded in the ambient space, parametrized over an interval, and contaminated by white Gaussian noise with noise level σ. The ambient dimension is fixed, and for each sample size, points are drawn uniformly from the parametrization interval. The normalized loss of MD is computed with respect to an arbitrary point on the manifold. We examine two noise levels. Each case was repeated 500 times, with the mean and standard deviation reported. In Table 1, we compare the performance of the optimal shrinker estimator with the performance of the classical estimator. We observe that the optimal shrinker outperforms the classical estimator in this well-controlled manifold setup.

18.78 ± 1.16 | 0.78 ± 0.54 | 55.98 ± 2.09 | 1.32 ± 0.93
23.72 ± 1.19 | 1.41 ± 0.87 | 59.42 ± 2.10 | 2.59 ± 1.74
33.54 ± 1.53 | 2.18 ± 1.31 | 64.04 ± 1.69 | 5.19 ± 2.99
26.86 ± 2.70 | 2.41 ± 1.55 | 60.88 ± 4.22 | 4.06 ± 2.54
42.66 ± 3.43 | 4.78 ± 2.65 | 69.06 ± 3.43 | 10.65 ± 5.38
58.59 ± 3.24 | 9.84 ± 4.63 | 77.24 ± 2.81 | 31.52 ± 17.56
34.70 ± 4.89 | 4.05 ± 2.35 | 64.11 ± 5.57 | 8.39 ± 3.91
54.72 ± 4.46 | 10.62 ± 4.69 | 75.54 ± 3.63 | 23.97 ± 12.49
69.65 ± 3.97 | 21.35 ± 7.94 | 83.30 ± 2.78 | 62.99 ± 19.34
Table 1. The normalized loss of the optimal shrinker estimator and the classical estimator in the manifold setup. The mean and the standard deviation over 500 realizations are reported.

5. Conclusions

We proposed a new estimator for MD based on precision matrix shrinkage. For an appropriate loss function, we showed that the proposed estimator is asymptotically optimal and outperforms the classical implementation of MD using the Moore-Penrose pseudo-inverse of the sample covariance. Importantly, the proposed estimator is particularly beneficial when the data are noisy and high-dimensional, a case in which the classical MD estimator might completely fail. Consequently, we believe that the new estimator may be useful in modern data analysis applications, involving, for example, local principal component analysis, metric design, and manifold learning.

In this work, we focused on the case in which the intrinsic dimensionality of the data (the rank of the covariance matrix) is unknown, and therefore, it was not explicitly used in the estimation. Yet, in many scenarios, this dimension is known. In this case, it could be beneficial to consider a direct truncation and use only the top d eigen-pairs for the estimation of the precision matrix. Note that in the particular manifold setup, this case is essentially different from the rank-aware shrinker discussed in [6]; in the manifold setup, the rank of the covariance matrix associated with points residing inside a small neighborhood of any point on the manifold could be much larger than the intrinsic dimension of the manifold. The benefit of this truncation approach has been shown in several applications [21, 24, 27, 11]. However, knowing the rank of the signal, or the dimension of a manifold, is often not realistic, and estimating it is highly challenging [9, 10]. We leave the above-mentioned challenges in applying the established theory to manifold learning to future work.

Acknowledgements

MG has been supported by H-CSRC Security Research Center and Israeli Science Foundation grant no. 1523/16. MG and RT were supported by a grant from the Tel-Aviv University ICRC Research Center.

References

  • [1] J. Baik, G. Ben Arous, and S. Peche. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.
  • [2] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
  • [3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
  • [4] R. R. Coifman and S. Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
  • [5] D. Dai and T. Holgersson. High-dimensional CLTs for individual mahalanobis distances. In Müjgan Tez and Dietrich von Rosen, editors, Trends and Perspectives in Linear Statistical Inference, pages 57–68, 2018.
  • [6] D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model. The Annals of Statistics, 46(4):1742–1778, November 2018.
  • [7] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327, 2001.
  • [8] I. M. Johnstone. High Dimensional Statistical Inference and Random Matrices. In Proc. Int. Congr. Math., pages 307–333, 2006.
  • [9] S. Kritchman and B. Nadler. Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. Trans. Sig. Proc., 57(10):3930–3941, October 2009.
  • [10] E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in neural information processing systems, pages 777–784, 2005.
  • [11] G.-R. Liu, Y.-L. Lo, Y.-C. Sheu, and H.-T. Wu. Diffuse to fuse eeg spectra–intrinsic geometry of sleep dynamics for classification. arXiv preprint arXiv:1803.01710, 2018.
  • [12] P. C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2:49–55, 1936.
  • [13] V. A. Marcenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.
  • [14] G. J. McLachlan. Mahalanobis distance. Resonance, 4(6):20–26, 1999.
  • [15] D. Paul. Asymptotics of Sample Eigenstructure for a Large Dimensional Spiked Covariance Model. Statistica Sinica, 17:1617–1642, 2007.
  • [16] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [17] A. Singer and R. R. Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008.
  • [18] A. Singer and H.-T. Wu. Two-dimensional tomography from noisy projections taken at unknown random directions. SIAM J. Imaging Sciences, 6(1):136–175, 2012.
  • [19] C. M. Stein. Lectures on the theory of estimation of many parameters. Journal of Soviet Mathematics, 74(5), 1986.
  • [20] R. Talmon and R. R. Coifman. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proceedings of the National Academy of Sciences, 110(31):12535–12540, 2013.
  • [21] R. Talmon, S. Mallat, H. Zaveri, and R. R. Coifman. Manifold learning for latent variable inference in dynamical systems. IEEE Transactions on Signal Processing, 63(15):3843–3856, 2015.
  • [22] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • [23] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, 2009.
  • [24] H.-T. Wu, R. Talmon, and Y.-L. Lo. Assess sleep stage by modern signal processing techniques. IEEE Trans. Biomed. Engineering, 62(4):1159–1168, 2015.
  • [25] H.-T. Wu and N. Wu. Think globally, fit locally under the Manifold Setup: Asymptotic Analysis of Locally Linear Embedding. Annals of Statistics, in press, 2018.
  • [26] S. Xiang, F. Nie, and C. Zhang. Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognition, 41:3600–3612, 2008.
  • [27] O. Yair, R. Talmon, R. R. Coifman, and I. G. Kevrekidis. Reconstruction of normal forms by learning informed observation geometries from data. Proceedings of the National Academy of Sciences, 114(38):E7865–E7874, 2017.
  • [28] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Michigan State University, 2(2):4, 2006.