Mahalanobis Distance Informed by Clustering

08/13/2017, by Almog Lahav et al., Technion and Yale University

A fundamental question in data analysis, machine learning and signal processing is how to compare between data points. The choice of the distance metric is specifically challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g., the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real data of gene expression for lung adenocarcinomas (lung cancer). By using the proposed metric we found a partition of subjects into risk groups with a good separation between their Kaplan-Meier survival plots.


1 Introduction

Recent technological advances give rise to the collection and storage of massive complex data sets. Consider a data matrix , whose columns consist of high-dimensional samples (of dimension ) and whose rows consist of coordinates or features. We denote the columns (samples) in by , and the rows (coordinates) in by .

Many analysis techniques, such as kernel methods [13, 22, 14] and manifold learning [2, 8, 21, 26], are based on comparisons between samples. Such analysis can be applied to the data stored in the matrix , either by measuring similarities between columns or between rows. Furthermore, for high-dimensional data sets, this can be done in both dimensions, columns and rows.

An example of analysis that is based on distances in both dimensions was presented by Brazma et al. [3] in the context of gene expression. In this case, the columns represent subjects and the rows are their gene expressions. By comparing rows, we may recover co-regulated or functionally associated genes. In addition, similarity between a pair of columns indicates subjects with similar gene expression profiles. The two comparisons (of the rows and of the columns) are typically carried out independently; yet, often, and in particular in this example, the columns and rows may be coupled, e.g., a group of genes is co-expressed only within a subset of subjects which belong to a specific population group.

Existing approaches for simultaneous analysis of samples and coordinates are based on bi-clustering [17, 4, 7]. Chi et al. proposed in [7] a convex formulation of the bi-clustering problem based on the assumption that the matrix exhibits a latent checkerboard structure. Their solution attempts to approximate this structure by alternating between convex clustering of rows and convex clustering of columns, until convergence to a global minimum. However, this method, as well as other bi-clustering techniques, is restricted by the checkerboard structure assumption. Even if this assumption is indeed satisfied, the outcome represents the data in a very specific manner: to which 2D cluster each entry belongs.

Manifold learning methods that take into account the coupling between dimensions (e.g., rows and columns) are presented in [1, 19]. These methods describe algorithms for the construction of various data structures at multiple scales. Yet, they merely present constructive procedures lacking definitive (optimization) criteria.

In this work we present a framework for analysis of high-dimensional data sets, where the metric on the samples (columns) is informed by an organization of the coordinates (rows). With the new informed metric we can organize the samples, as well as perform other analysis tasks, such as classification, data representation, and dimensionality reduction. We show that in the case where the rows have a clustering structure, i.e., the k-means cost function takes on values which are low enough, it assists us in the estimation of the (local or global) principal directions of the columns. Specifically, we show that clusters of rows can be exploited for the computation of the Mahalanobis distance [18] between columns, which is based on estimation of the inverse covariance matrix. One of the important properties of the Mahalanobis distance is its invariance to linear transformations. Assuming the observable columns are a linear function, , of a set of hidden variables, this property allows the recovery of the Euclidean distances in the hidden space. Based on a local version of the Mahalanobis distance proposed in [24], our approach is extended to the nonlinear case, where the Euclidean distances in the hidden space can be recovered even when the transformation is nonlinear.

Our framework is especially beneficial when a limited number of columns is available for the computation of the Mahalanobis distance: either the whole data set is small compared to the rank of the inverse covariance matrix, or good locality requires a small neighborhood and therefore fewer samples are available for the estimation of the local inverse covariance.

We illustrate the advantage of this framework with a synthetic example, where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. The global Mahalanobis distance based on the proposed principal directions allows the separation between two classes of samples, while the conventional Mahalanobis distance fails.

The framework was applied to real gene expression data of lung adenocarcinomas (lung cancer). In this setup, the patients were defined as the samples, and the genes with the highest variance as the coordinates. We built a metric on the subjects that is informed by clusters of genes. By using this metric we found a partition of the subjects into risk groups with a good separation between their Kaplan-Meier survival plots. Based on the p-value measure for this partition, our method achieves better results than other data-driven algorithms [15], and it is comparable to competing supervised algorithms [23], some of which are based on additional clinical information.

The remainder of the paper is organized as follows. In Section 2 we review the mathematical background of our method. Sections 3 and 4 present the global and the local versions of the proposed metric, respectively. In Section 5 we examine the global distance on a synthetic data set. We next present the application of the proposed local Mahalanobis distance to gene expression data in Section 6. Section 7 concludes the paper.

2 Background

2.1 Global Mahalanobis Distance and PCA

Consider an unobservable -dimensional random vector with zero mean and covariance matrix defined on . An -dimensional accessible random vector , where , is given by a linear function of :

(1)

where is an unknown matrix of rank . The covariance matrix of is therefore given by:

(2)

Realizations of and are denoted by and , respectively. The above model represents many real-world problems in signal processing and data analysis where the data is driven by some latent process, and the observable samples (columns) are a function of a set of intrinsic hidden variables . The Mahalanobis distance between two samples in the hidden space, , is defined as follows:

(3)

One of the important properties of this distance is its invariance to any linear transformation of rank of the data samples in . The Mahalanobis distance between any two samples from the random vector , is defined by:

(4)

Note that here the Mahalanobis distance is written with the pseudo-inverse of since it is not full rank (see appendix A):

(5)
Theorem 2.1.

Let be two samples from the random vector . The Mahalanobis distance between them satisfies:

(6)
Proof.

Substituting (5) into (4):

(7)

A consequence of Theorem 2.1 is the statement in the next corollary.

Corollary 2.1.1.

If we assume that the coordinates of are uncorrelated with unit variance, i.e. , then

(8)

In other words, we can recover the Euclidean distance in the hidden space by computing the Mahalanobis distance in the observed space.

According to the model assumptions, the linear transformation is unknown and therefore the computation of requires an estimation of the covariance . A common estimator of is the sample covariance . Since the Mahalanobis distance requires the pseudo-inverse of the covariance matrix, we use Principal Component Analysis (PCA) to find the principal directions corresponding to the rank of the true covariance matrix . The Eigenvalue Decomposition (EVD) of the sample covariance is given by:

(9)

where the columns of are the principal directions of , and is a diagonal matrix consisting of the variance of each of the principal directions. Hence, the estimated Mahalanobis distance using PCA is given by:

(10)

Note that since the pseudo-inverse is positive semi-definite (and symmetric) it can be written as , where . Substituting it into (10) gives:

(11)

Therefore, the Mahalanobis distance between and is equivalent to the Euclidean distance between their projections on the subspace spanned by the principal directions.
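To make the PCA-based construction in (9)-(11) concrete, the following is a minimal numpy sketch (our own illustration, not the authors' code). It assumes the data is an n-by-N matrix whose columns are the samples and that d, the rank of the true covariance, is given.

```python
import numpy as np

def pca_mahalanobis(X, d):
    """Global Mahalanobis distance of (10), estimated via PCA.

    X : (n, N) data matrix whose columns are samples.
    d : assumed rank of the true covariance.
    Returns a function computing the distance between two columns.
    """
    Xc = X - X.mean(axis=1, keepdims=True)        # center the columns
    C = Xc @ Xc.T / Xc.shape[1]                   # sample covariance (n x n)
    w, V = np.linalg.eigh(C)                      # ascending eigenvalues
    U, s = V[:, -d:], w[-d:]                      # top-d principal directions
    C_pinv = U @ np.diag(1.0 / s) @ U.T           # rank-d pseudo-inverse

    def dist(x, y):
        diff = x - y
        return np.sqrt(diff @ C_pinv @ diff)
    return dist
```

Equivalently, as in (11), the same distance can be computed as the Euclidean distance between the projections U.T @ x and U.T @ y after whitening each projected coordinate by 1/sqrt(s).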

2.2 The Relation Between PCA and K-means Clustering

Following [10], we present the relation between k-means clustering and PCA. Here, we consider a set of samples which are the rows of the matrix . Applying k-means to gives rise to clusters of rows, which minimize the following cost function:

(12)

where is the set of indices of samples belonging to the th cluster, and is the centroid of the th cluster. The clustering minimizing (12) is denoted by , where the th column of is an indicator vector representing which rows belong to the th cluster:

(13)

and is the size of the th cluster. Recall that are the principal directions of used in the previous section; the relation between the matrix and is given by the following theorem.

Theorem 2.2.

Let be the principal directions of . The relation between and the clustering denoted by is given by:

(14)

where is a matrix of rank .

Proof.

We define the matrix:

(15)

where is an all-ones vector. According to Theorem 3.3 in [10], there exists an orthonormal transformation , which satisfies:

(16)

By removing the first column in we define:

(17)

which is a matrix of rank from the definition of , satisfying:

(18)

Theorem 2.2 implies that a linear transformation maps the clustering of the rows to the principal directions of the columns. Note that the first column of is:

(19)

since it maps the clusters to the first column of which is a constant vector:

(20)

Since we are interested only in the remaining columns of , which are the principal directions, we can ignore as presented in the proof of Theorem 2.2.
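As an illustration of how the row clustering enters the construction, the sketch below builds the clustering matrix of (13) from k-means labels. Since the explicit expression in (13) is not reproduced in the text, we assume the standard normalized indicator (each nonzero entry equals one over the square root of the cluster size), which is consistent with the statement in Section 3.1 that the columns of this matrix are orthonormal.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_indicator(R, k, random_state=0):
    """Normalized cluster-indicator matrix H (our assumed form of (13)).

    R : (n, p) matrix whose rows are the points to cluster (here, the rows
        of the data matrix).
    Returns H of shape (n, k) with H[i, j] = 1/sqrt(n_j) if row i belongs
    to cluster j and 0 otherwise, so that H.T @ H is the identity.
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(R)
    H = np.zeros((R.shape[0], k))
    for j in range(k):
        idx = np.where(labels == j)[0]
        H[idx, j] = 1.0 / np.sqrt(len(idx))
    return H
```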

3 Global Mahalanobis Distance With Clustering

3.1 Derivation

The Mahalanobis distance presented in Section 2.1 is defined with the pseudo-inverse of the covariance matrix . For a finite number of samples (columns) of dimension (the number of rows), the estimation of the principal directions becomes more difficult when the dimension increases. As a result, is less accurate, leading to an inaccurate Mahalanobis distance between samples (columns).

This conventional approach does not take into account the structure of the rows when the distance between the columns is computed. Our observation is that based on the relation between k-means and PCA, presented in Section 2.2, the clustering structure of the rows can be exploited for the estimation of .

To this end, we would like to estimate the transformation which maps clusters of rows to a set of new principal directions . We start with initial principal directions computed by PCA, i.e., the eigenvectors of the sample covariance. In addition, we have the k-means clusters, which are ordered in the matrix as in (13). Since the clusters do not overlap, the rank of is , and the pseudo-inverse exists. Multiplying (14) by gives the estimated transformation:

(21)

Plugging (21) into (14) yields:

(22)

which constitutes new principal directions by mapping onto an appropriate -dimensional subspace. According to (13), the columns of are orthonormal and therefore . Hence, is given by:

(23)

To gain further insight, we examine the explicit expression of the th entry of the matrix :

This implies that is derived by averaging the coordinates of the principal directions according to the clusters of the rows. Specifically, if the th row belongs to the cluster , the th principal direction is:

(24)

It further implies that if the entries of each principal direction are organized according to the clusters of the rows, the resulting principal directions are piecewise constant.
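A two-line numpy sketch of this averaging interpretation (our illustration): with U the PCA directions and H the normalized indicator matrix introduced above, the product H @ H.T replaces each entry of a principal direction by the mean over its row-cluster, so the result is piecewise constant on the clusters.

```python
import numpy as np

def cluster_average(U, H):
    """Informed principal directions as in (23): U_hat = H @ H.T @ U.

    Since (H @ H.T)[i, l] = 1/n_j when rows i and l share cluster j (and 0
    otherwise), each column of the result is the within-cluster average of
    the corresponding column of U, hence piecewise constant on the clusters.
    """
    return H @ (H.T @ U)
```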

The new principal directions can be used to establish a new estimate of the covariance matrix by:

(25)

Since the new principal directions are piecewise constant, exhibits a checkerboard structure. This property will be demonstrated in the sequel.

The Mahalanobis distance between columns, which is informed by the clustering structure of the rows, is given by:

(26)

where

(27)

3.2 Optimal Solution

The computation of the principal directions proposed in Section 3.1 is composed of two steps. The first step is finding the optimal principal directions according to the standard PCA optimization criterion, i.e., minimizing the reconstruction error (or maximizing the variance). The second step appears in (23), where the principal directions are mapped onto the -dimensional subspace defined by , which we denote by . This two-step procedure only guarantees that the obtained principal directions are in the subspace , yet there is no guarantee that they attain the minimal reconstruction error among all sets of vectors in .

To guarantee local optimality, we propose a new constrained PCA problem formulation. According to (23) and the definition of the matrix in (13), the subspace is spanned by the columns of the matrix , which are the indicator vectors of each cluster:

(28)

namely, . Let be the optimization criterion of PCA based on the reconstruction error:

(29)

We define the following constrained optimization problem:

(30)
subject to

where we require that the principal directions , which are the columns of , attain minimal reconstruction error and are in the subspace .

This problem can be solved iteratively by the gradient projection algorithm [5]:

for (31)

In the first stage of each iteration in (31) we take a step of size in the negative direction of the gradient of , which is explicitly given by:

(32)

In the second stage we project on . In [5], it was shown that this method is guaranteed to converge to a stationary point.

We conclude this section with three remarks. First, note that if we initialize (31) with the principal directions computed by standard PCA, the first stage does not change , since it is the minimum point of by definition, and therefore . In the second stage we get . Hence, the computation of derived in Section 3.1 is equivalent to the first iteration of the optimal solution in (31). Similar to the description after (24), our interpretation of (31) is that the optimal principal directions are obtained by averaging the coordinates of according to the clustering of the rows of after each step toward the minimum of .

Second, after the last iteration in (31), the obtained principal directions are not necessarily unit vectors. Therefore, the computation of the pseudo-inverse of the covariance matrix in (27) requires a normalization of each of the obtained principal directions . Based on , the entire construction procedure of the informed distance in (26) is summarized in Algorithm 1.

Third, we use an empirical stopping criterion; the iterative algorithm stops if the update of the principal directions (the gradient) becomes smaller than a threshold, which is determined empirically in each experiment.

input: Data matrix

  1. Start with rows . Apply k-means clustering and get

  2. Find principal directions, , by taking the eigenvectors of with the largest eigenvalues

  3. Start with . Compute new principal directions:

    for

    where:

  4. Normalize the principal directions

  5. Define , and the distance between columns is given by:

Algorithm 1 Global Mahalanobis Distance Informed by Clustering
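The following is an end-to-end sketch of Algorithm 1 in numpy/scikit-learn (our own illustration, not the authors' code). Since the explicit gradient expression and step size are not reproduced in the extracted text, the projected-gradient loop uses the gradient of the reconstruction error f(U) = ||Xc - U U^T Xc||_F^2 (up to a constant factor) with a heuristic step size, and the variances along the normalized informed directions are re-estimated from the sample covariance (our assumption for (27)).

```python
import numpy as np
from sklearn.cluster import KMeans

def informed_global_mahalanobis(X, k, d=None, n_iter=10, step=None):
    """Sketch of Algorithm 1: global Mahalanobis distance informed by clustering.

    X : (n, N) data matrix (rows = coordinates, columns = samples).
    k : number of k-means clusters of the rows; following Section 3,
        d = k - 1 principal directions are used by default.
    Returns the (N, N) matrix of pairwise informed distances.
    """
    n, N = X.shape
    d = k - 1 if d is None else d
    Xc = X - X.mean(axis=1, keepdims=True)
    C = Xc @ Xc.T / N

    # Step 1: cluster the rows and build the normalized indicator matrix H.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    H = np.zeros((n, k))
    for j in range(k):
        idx = np.where(labels == j)[0]
        H[idx, j] = 1.0 / np.sqrt(len(idx))
    P = H @ H.T                                  # projection onto span(H)

    # Step 2: initial principal directions from PCA.
    w, V = np.linalg.eigh(C)
    U = V[:, -d:]

    # Step 3: projected-gradient iterations on the reconstruction error
    # (assumed objective and step size). Starting from PCA, the gradient of
    # the first iteration vanishes and the update reduces to U <- H H^T U,
    # as noted after (31).
    if step is None:
        step = 1.0 / (4.0 * w[-1] + 1e-12)
    for _ in range(n_iter):
        G = -4 * C @ U + 2 * C @ U @ (U.T @ U) + 2 * U @ (U.T @ C @ U)
        U = P @ (U - step * G)

    # Step 4: normalize the directions; Step 5: informed distance.
    U = U / np.linalg.norm(U, axis=0)
    s = np.diag(U.T @ C @ U)                     # variances along the directions
    C_pinv = U @ np.diag(1.0 / s) @ U.T
    D = np.zeros((N, N))
    for i in range(N):
        diff = Xc - Xc[:, [i]]
        D[i] = np.sqrt(np.einsum('ij,jk,ki->i', diff.T, C_pinv, diff).clip(min=0))
    return D
```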

3.3 Example: Recovering Distances in a Hidden Space

The following example demonstrates the property of Mahalanobis distance presented in Section 2.1: the invariance to a linear transformation and the ability to recover distances in a hidden space. We assume that samples in a hidden space are distributed uniformly in the unit square . The observable data samples are given by a linear function of :

The matrix consists of three sub-matrices:

where . The rows of each of the sub-matrices are realizations of Gaussian random vectors with covariance and the following respective means:

Note that once realized, the matrix is kept fixed throughout the experiment. The structure of the matrix was chosen such that the coordinates of the observable samples belong to three different clusters. Figure 1 presents , which consists of realizations of . It can be observed that the rows of have a clear structure of three clusters.

Figure 1: The matrix . Each column of is a realization of the random variable . The rows have a clear structure of three clusters.

The conventional Mahalanobis distance and the distance proposed in Algorithm 1, and , are computed from the columns of and compared to the Euclidean distance in the hidden space . As can be seen in Fig. 2, both methods demonstrate accurate recovery of the Euclidean distance in . In particular, both attain a correlation of 0.99 with the Euclidean distance between the unobservable data samples.

Figure 2: Comparison between the Euclidean distance in the hidden space and the Mahalanobis distance computed by: (a) the conventional approach using PCA (), (b) the proposed approach (). Both obtain a correlation of 0.99 with the original distance.

To examine the recovered distances in a different way, we find a parametrization of using diffusion maps [9], an algorithm based on a kernel built from small distances. Following is a brief review of the main steps of diffusion maps. We compute an affinity matrix with the following Gaussian kernel:

(33)

Note that is any distance between the samples and . The affinity matrix is normalized by the diagonal matrix :

(34)

where:

(35)

The diffusion maps embedding of the point is defined by:

(36)

where and are the eigenvectors of corresponding to the largest eigenvalues and , respectively (ignoring the trivial eigenvalue ). It was proven in [9] that the operator:

(37)

is an approximation of a continuous Laplace operator defined in the unit square, when is distributed uniformly, as in this example. Therefore, the parametrization of in (36) is approximated by the first two eigenfunctions of the Laplace operator.
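As a reference for the review above, here is a compact numpy sketch of the diffusion-maps steps (33)-(36); it is our own illustration, operating on any precomputed distance matrix, and it uses the symmetric conjugate of the normalized kernel for a stable eigendecomposition.

```python
import numpy as np

def diffusion_maps(D, eps=None, n_components=2):
    """Diffusion-maps embedding from a pairwise distance matrix D (N x N).

    eps : Gaussian kernel scale of (33); if None, it is set from the median
          of the pairwise distances (a common heuristic, see Section 6).
    Returns the (N, n_components) embedding of (36), skipping the trivial
    constant eigenvector.
    """
    if eps is None:
        eps = np.median(D[D > 0]) ** 2
    W = np.exp(-D**2 / eps)                   # affinity matrix, eq. (33)
    deg = W.sum(axis=1)                       # degrees, eq. (35)
    S = W / np.sqrt(np.outer(deg, deg))       # symmetric conjugate of eq. (34)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(deg)[:, None]        # right eigenvectors of the normalized kernel
    return psi[:, 1:1 + n_components] * vals[1:1 + n_components]
```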

Figure 3(a) presents the data points in the original space colored by the first eigenvector. We compute the affinity matrix defined in (33) for the following distances: Euclidean, and . The embedding obtained by each of the distances is presented in Fig. 3(b), 3(c) and 3(d), where the points are colored according to . It can be observed that the embeddings computed by both and are accurate representations of the data in the hidden space, and the columns are mapped into a shape homeomorphic to the unit square. Conversely, using the Euclidean distance between the columns, as demonstrated in Fig. 3(b), gave rise to a different embedding, which does not represent the hidden space. This example demonstrates the ability to recover distances in a hidden space by computing the Mahalanobis distance, either with the conventional approach (using ) or with the one proposed in Algorithm 1 (using ). The advantage of the proposed distance will be presented in Section 5.

Figure 3: (a) Data points distributed uniformly in the hidden space, colored by the first eigenfunction . The embeddings of computed by the Euclidean distance, the Mahalanobis distance and the informed distance are presented in (b), (c) and (d), respectively.

4 Local Mahalanobis Distance With Clustering

4.1 Local Mahalanobis Distance

The advantage of using the Mahalanobis distance holds as long as the data arises from the model (1), where the random vector is a linear function of a hidden variable . In this section, based on [24], we present an extension to the nonlinear case, i.e.:

(38)

where is a non-linear function. For simplicity, in this section, we assume that . This assumption is common practice, for example, in probabilistic PCA [28]. Under this assumption, the Mahalanobis distance between points in the hidden space defined in (3) is Euclidean.

To deal with the nonlinearity introduced by , an extension of the Mahalanobis distance in (4) was presented in [24]. Since is a nonlinear function, the covariance of the random variable around each point is different. We denote the local covariance around the column by . To estimate the local covariance we define the matrix which consists of all the neighbors of the column , namely, all the columns in such that for some . Let be the principal directions of (obtained from the EVD of ), and let be a diagonal matrix containing the variance along each of the principal directions. The estimated local covariance is given by:

(39)

For any two columns and , a modified (local) Mahalanobis distance is defined by:

(40)

If is invertible, by denoting and the two hidden samples in corresponding to and , namely, and , the Euclidean distance between and is given by:

(41)

where is the Jacobian of at , . It was proven in [24] that under the assumption that is a smooth map between two smooth manifolds:

(42)

the local covariance is an approximation of . Therefore, by substituting in (41), the Euclidean distance between points and in the hidden space is given by:

(43)

If the dimension of the manifold is , the data samples close to approximately lie on a -dimensional tangent space . Hence, is not a full rank matrix, so that the inverse of the local covariance matrix in (43) is replaced by its pseudo-inverse:

(44)

Using the pseudo-inverse of the estimated covariance in (39), we recast (44) as:

(45)

In other words, the local Mahalanobis distance between two columns and is an approximation of the Euclidean distance between the hidden points and in the hidden space . This result implies that although the data is generated from a nonlinear transformation of the hidden samples, the distance between these samples can be computed by the modified Mahalanobis distance using the pseudo-inverse of the local covariance matrices. Based on this distance we can, for example, find a representation of the samples in the hidden space, as has been done in [24, 25].
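A minimal sketch of the local construction (our illustration; the exact form of (40) is not reproduced in the extracted text, so we assume the symmetrized version of [24], in which the local pseudo-inverse covariances at the two points are averaged). Euclidean nearest neighbors are used to form the neighborhood matrix, and d denotes the assumed manifold dimension.

```python
import numpy as np

def local_pinv_cov(X, i, n_neighbors, d):
    """Pseudo-inverse of the local covariance around column i, as in (39)."""
    dists = np.linalg.norm(X - X[:, [i]], axis=0)
    Xi = X[:, np.argsort(dists)[:n_neighbors]]     # neighborhood of column i
    Xi = Xi - Xi.mean(axis=1, keepdims=True)
    C = Xi @ Xi.T / Xi.shape[1]
    w, V = np.linalg.eigh(C)
    U, s = V[:, -d:], w[-d:]                       # top-d local directions
    return U @ np.diag(1.0 / s) @ U.T

def local_mahalanobis(X, i, j, n_neighbors=50, d=2):
    """Local Mahalanobis distance between columns i and j (assumed form of (40))."""
    Ci = local_pinv_cov(X, i, n_neighbors, d)
    Cj = local_pinv_cov(X, j, n_neighbors, d)
    diff = X[:, i] - X[:, j]
    return np.sqrt(0.5 * diff @ (Ci + Cj) @ diff)
```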

4.2 Informed Distance Derivation

In the linear case presented in Section 2.1, all the available samples are assumed to be realizations from the same statistical model. Conversely, in the nonlinear case, each sample may correspond to a different statistical model (each column corresponds to a different covariance matrix ). Consequently, we consider neighborhoods of similar columns as coming from the same (local) statistical model and use them in order to estimate the local covariance matrices. This implies the following trade-off: on the one hand, when we use larger neighborhoods, more columns are available for the covariance estimation, thereby yielding a more accurate covariance estimation at the expense of locality. On the other hand, when we use smaller neighborhoods, fewer columns are available for the estimation, so that the covariance estimation is more local, but suffers from a larger error.

This problem can be seen as the well-known bias-variance trade-off and is illustrated in Fig. 4. The covariance matrix required for the computation of the distance is estimated using neighborhoods of different sizes: in Fig. 4(a), the choice of a small neighborhood respects the manifold but contains a small number of points. In Fig. 4(b), due to the manifold curvature, a large neighborhood with more samples does not respect the manifold structure.

Figure 4: An illustration of the manifold , demonstrating the trade-off between locality (a) and the large number of columns required for the estimation of a local inverse covariance (b).

Considering the above trade-off, we propose a different construction of the local Mahalanobis distance based on the concept of the informed distance presented in Section 3.

Let be the clustering matrix obtained by applying k-means to the rows of the matrix defined in Section 4.1. Similarly to (31), we start with , and compute the new principal directions by:

for (46)

Using (46), the pseudo-inverse of the local covariance is given by:

(47)

By replacing with we rewrite (40) and obtain the informed local Mahalanobis distance:

(48)

The construction procedure of is summarized in Algorithm 2.

input: Data matrix

  1. For each column compute the local inverse covariance:

    1. Create the matrix whose columns are the nearest neighbors of

    2. Apply k-means clustering to the rows of , and get

    3. Find principal directions of , , by taking the eigenvectors of with the largest eigenvalues

    4. Compute new principal directions:

      for

      where:

    5. Normalize the principal directions

    6. Compute the local pseudo-inverse covariance of the point :

  2. The distance between two points:

Algorithm 2 Local Mahalanobis Distance Informed by Clustering
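A compact sketch of Algorithm 2 (our illustration), keeping only the first projected-gradient iteration U <- H H^T U for each neighborhood; the neighborhood rule (Euclidean nearest neighbors), the default parameter values, and the symmetrized distance in the last function are assumptions, since the corresponding expressions are not reproduced in the extracted text.

```python
import numpy as np
from sklearn.cluster import KMeans

def informed_local_pinv_cov(X, i, n_neighbors, k, d):
    """Steps 1.1-1.6 of Algorithm 2 for a single column i."""
    # 1.1: matrix of nearest neighbors of column i.
    dists = np.linalg.norm(X - X[:, [i]], axis=0)
    Xi = X[:, np.argsort(dists)[:n_neighbors]]
    Xi = Xi - Xi.mean(axis=1, keepdims=True)
    # 1.2: k-means clustering of the rows of the local matrix.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Xi)
    H = np.zeros((Xi.shape[0], k))
    for j in range(k):
        idx = np.where(labels == j)[0]
        H[idx, j] = 1.0 / np.sqrt(len(idx))
    # 1.3: local PCA directions.
    C = Xi @ Xi.T / Xi.shape[1]
    w, V = np.linalg.eigh(C)
    U = V[:, -d:]
    # 1.4-1.5: informed directions (within-cluster averaging) and normalization.
    U = H @ (H.T @ U)
    U = U / np.linalg.norm(U, axis=0)
    # 1.6: local pseudo-inverse covariance.
    s = np.diag(U.T @ C @ U)
    return U @ np.diag(1.0 / s) @ U.T

def informed_local_mahalanobis(X, i, j, n_neighbors=30, k=10, d=9):
    """Step 2: informed local distance between columns i and j."""
    Ci = informed_local_pinv_cov(X, i, n_neighbors, k, d)
    Cj = informed_local_pinv_cov(X, j, n_neighbors, k, d)
    diff = X[:, i] - X[:, j]
    return np.sqrt(0.5 * diff @ (Ci + Cj) @ diff)
```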

In the example presented in Section 3.3 we have shown an equivalence between the global Mahalanobis distance, , and the one informed by clustering, . However, when there is a constraint on the number of columns, as required for locality in the local Mahalanobis distance, the estimated pseudo-inverse of the local covariance is less accurate. For a covariance of rank , estimated by columns, the estimation error increases with . However, these two parameters do not necessarily degrade the performance of the clustering of the rows: k-means can be applied successfully to rows even for a large (or ), and regardless of the dimension of the rows. Once the clustering of the rows is computed, we utilize it to improve the estimation of the local pseudo-inverse covariance. Therefore, we expect our approach to be beneficial whenever the number of columns available for the estimation of the high-rank pseudo-inverse covariance is limited, provided that the rows exhibit a typical clustering structure. This will be demonstrated in the next section.

5 Synthetic Toy Problem

5.1 Description

To examine our approach we generate synthetic data according to the problem setting described previously: the samples are in a high-dimensional space, and the coordinates have a distinct structure of clusters. Consider high-dimensional data stored in the matrix , where the columns and rows represent, for example, subjects and their properties, respectively. We assume that the columns (e.g., subjects) can belong to two types (e.g., two types of population), namely, TypeA and TypeB, with the same probability . The rows (e.g., properties) are divided into clusters (e.g., demographic properties, physiological properties, etc.), , such that any two properties are correlated if and only if they belong to the same cluster. Given the type of the column and the cluster of the row, the distribution of the elements of the data is given by:

(49)

The covariance matrix of the columns is given by a block diagonal matrix:

where contains the variance and the covariance elements between the properties in cluster :

Here represents the size of cluster . Let and be the conditional expected value of a column given its type, i.e. and . Using the law of total expectation, the covariance matrix of all columns is given by:

(50)
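For reproducibility of the qualitative setup (not the exact experiment), the generator below produces data with the structure described above: two column types and clusters of correlated rows. All numeric values (number of rows, number of clusters, within-cluster correlation, type-dependent mean shifts) are placeholders chosen by us, since the parameters of (49) are not reproduced in the extracted text.

```python
import numpy as np

def generate_synthetic(N, n=1500, k=10, rho=0.8, delta=1.0, seed=0):
    """Synthetic data with the qualitative structure of Section 5.1.

    Columns belong to TypeA or TypeB with equal probability; rows are split
    into k clusters, and two rows are correlated iff they share a cluster.
    All parameter values are placeholders (assumptions).
    """
    rng = np.random.default_rng(seed)
    types = rng.integers(0, 2, size=N)                  # 0 = TypeA, 1 = TypeB
    X = np.zeros((n, N))
    row = 0
    for c in range(k):
        size = n // k
        # Constant within-cluster correlation rho.
        cov = rho * np.ones((size, size)) + (1 - rho) * np.eye(size)
        L = np.linalg.cholesky(cov)
        # Placeholder type- and cluster-dependent mean shift.
        shift = delta * ((-1) ** c) * (2 * types - 1)
        X[row:row + size] = L @ rng.standard_normal((size, N)) + shift
        row += size
    return X, types
```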

Figure 5(a) presents a data set of samples (columns) generated according to the above model. For illustration purposes, in Fig. 5(b) the columns and rows are ordered according to their type and cluster, respectively. Note that the proposed approach does not take into account the specific order of the columns and rows of the given data set.

Figure 5: Data set of samples generated according to the model described in Section 5. Entry represents the th property of the th subject. The sample represents all the properties of the th subject. (a) The matrix where the order of columns and rows is random. (b) Columns and rows are organized according to types and clusters, respectively.

5.2 Results

Figure 6: Principal directions estimation error as a function of the number of subjects , computed with: (a) the Frobenius norm , (b) the 2-norm . The yellow line shows the PCA error. The blue and red lines show the error of the proposed principal directions using k-means clustering and a known clustering solution, respectively.

We compare between the principal directions obtained by our method in (31) and the principal directions obtained by PCA. To isolate the error obtained solely by the clustering assumption, we also compute the principal directions using a known clustering solution.

Denote the estimated principal directions by . The objective measure used to evaluate the obtained principal directions is the empirical mean error with respect to the first eigenvectors of the true covariance calculated analytically from the model (50). The empirical mean error is computed with the Frobenius norm:

and with the 2-norm:

where is the empirical mean over the realizations of the entire data set. Note that this criterion is invariant to any orthogonal transformation (rotation) of . We note that in Section 3 the number of principal directions is the number of the k-means clusters minus one. Here, to complete the number of principal directions to the rank of the true covariance , we compute the th principal direction, , in the same way as in (31).
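Since the explicit error formulas are not reproduced in the extracted text, the sketch below shows one standard rotation-invariant choice that matches the description: comparing the projection matrices spanned by the true and estimated principal directions, with both the Frobenius norm and the 2-norm.

```python
import numpy as np

def subspace_errors(U_true, U_est):
    """Rotation-invariant errors between two sets of principal directions.

    Comparing the projectors U U^T makes the measure invariant to any
    orthogonal transformation (rotation) of the directions.
    """
    P_true = U_true @ U_true.T
    P_est = U_est @ U_est.T
    diff = P_true - P_est
    return np.linalg.norm(diff, ord='fro'), np.linalg.norm(diff, ord=2)
```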

We consider data with rows, divided into clusters. For different sizes of , we generate samples (columns) according to (5.1) and repeat the experiment times. The error averaged over the experiments is presented in Fig. 6 as a function of . We observe that the error of based on clustering is lower than the error of PCA, especially for small values of . While a small number of columns results in a significant estimation error of the principal directions, the rows can be clustered more easily. By exploiting these clusters, the proposed principal directions indeed yield a better estimation. As increases, the difference between the errors obtained by our method in (31) and by PCA decreases. This happens for two reasons. First, the PCA-based estimation improves when more samples are available for the sample covariance estimation . Second, as the dimension of the rows increases, finding the optimal solution of k-means clustering becomes more difficult. Conversely, we observe that for values of smaller than , the known clustering has a negligible contribution compared to the estimated clustering (since the dimension of the rows is small). We note that errors computed with different norms (e.g., ) yield comparable results and trends.

As was discussed in Section 3, an intuitive explanation for the definition of is the averaging of the principal directions obtained by PCA according to the clustering of the rows found by k-means. To illustrate this property, we present the true covariance matrix of the columns (ordered according to the clustering of the rows) in Fig. 7(a) and the estimated covariance matrices obtained by PCA and by our method in Fig. 7(b) and Fig. 7(c), respectively.

Figure 7: True and estimated covariance matrices of subjects with zoom on properties 750-900: (a) the true covariance , (b) the covariance obtained by PCA, (c) the covariance estimated by the proposed approach, (d) cross section of the covariance matrices in (a), (b) and (c) at property j=900. Covariance matrices were estimated using samples.

As can be seen, the true covariance takes the shape of “stairs”, which is induced by the model. The estimation error of the covariance obtained by PCA (Fig. 7(b)) is manifested as noise mounted on each stair. The contribution of the clustering is evident in Fig. 7(c), where we observe that when the clusters are taken into account, each stair is denoised by averaging all the entries belonging to this stair, resulting in a structure that is visually more similar to the true covariance depicted in Fig. 7(a). The smoothing effect on each stair is also illustrated by comparing one of the rows of the covariance (the th property), as presented in Fig. 7(d).

Figure 8: Embedding of the columns colored according to their type (TypeA, TypeB): (a) embedding obtained by the conventional Mahalanobis distance , (b) embedding obtained by the proposed distance .

Recall that the covariance matrices and (which are estimated by PCA (9) and by our approach (25)) can be used to define the respective Mahalanobis distances and . To evaluate the benefit of our method in terms of the metric, we compute the diffusion maps embedding of the columns as described in Section 3.3 twice, separately based on and .

The embedding (of the columns) obtained by each of the distances is presented in Fig. 8, where each embedded column is colored according to its type. It can be clearly observed in Fig. 8(b) that when the diffusion maps embedding is based on the proposed distance , the columns in the embedded space are well separated according to their type. Conversely, as presented in Fig. 8(a), when using , the two point clouds representing the two types are mixed.

Note that the proposed metric can be incorporated not only into the diffusion maps algorithm, but also into other dimensionality reduction algorithms, such as t-SNE [16].

6 Application to Gene Expression Data

To demonstrate the applicability of our approach we apply it to real data, and more specifically, to gene expression data. DNA microarray technology allows measuring the expression levels of tens of thousands of genes simultaneously. This massive genomic data, which contains the gene profiles of different subjects, is used for a variety of analysis tasks, such as predicting gene function classes, cancer classification, predicting putative regulatory signals in the genome sequences [3], identifying prognostic gene signatures [15, 23], and unsupervised discrimination of cell types [12].

We apply our method to a gene expression data set for the purpose of analyzing lung cancer subjects. We show that the proposed method can accomplish an analysis task that was previously approached by several prognostic models [15, 23]: clustering and ordering subjects with respect to their (unknown) survival rate. Here, this is accomplished through the construction of our informed metric between the subjects.

The gene profile of a subject in a typical data set may contain tens of thousands of genes. Therefore, the application of a metric such as the local Mahalanobis distance [24], which is based on a naïve estimation of the local covariance from a limited number of high-dimensional samples (e.g., the sample covariance, which is typically used in PCA applications), often yields poor performance. We will show that using the metric proposed in Algorithm 2, which is informed by clusters of genes, results in improved performance.

The setting of gene expression data can be formulated similarly to the problem setting considered in this paper. Consider a set of subjects for which gene expression profiles composed of genes were collected. The data is organized in a matrix corresponding to the problem setting defined in Section 1.

We analyze a data set termed CAN/DF (Dana-Farber Cancer Institute), taken from [23], consisting of gene expression profiles collected from lung adenocarcinomas (cancer subjects), which was previously analyzed by Shedden et al. [23]. The data set consists of subjects, for which profiles of 22,283 gene expressions were collected. Following common practice, we use only genes with high variance; specifically, we take the genes with the highest variance. The data is organized in a matrix , such that the th column corresponds to the th subject, and the th row corresponds to the th gene. In order to illustrate the clustering structure embodied in the genes, k-means is applied to the rows of , which are then ordered according to the clusters. The heat map of the matrix is presented in Fig. 9(a) and the reordered matrix is presented in Fig. 9(b). Indeed, in Fig. 9(b), we can observe a clear structure of gene clusters.
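The preprocessing described above can be sketched as follows (our illustration; the number of retained genes and the number of clusters are placeholders, since the exact values are not reproduced in the extracted text).

```python
import numpy as np
from sklearn.cluster import KMeans

def top_variance_rows(X, n_keep=1000):
    """Keep the rows (genes) with the highest variance."""
    order = np.argsort(X.var(axis=1))[::-1]
    return X[order[:n_keep]]

def reorder_rows_by_clusters(X, k=10):
    """Reorder the rows according to their k-means cluster, as in Fig. 9(b)."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    return X[np.argsort(labels)], labels
```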

Figure 9: Heat maps of the expression of the genes with the highest variance: (a) genes are ordered according to their variance, (b) genes are ordered according to clusters obtained by k-means.

We calculate the local Mahalanobis distance between the columns of in two ways: first, by applying local PCA as in [24] (based on the sample covariance), which will be denoted by , and second, by Algorithm 2, which will be denoted by . The number of principal directions used for the computation of the local inverse covariance of each column can be estimated from the eigenvalue decay. Here, in both methods, we use principal directions. The number of gene clusters used in Algorithm 2 is . This particular value of was determined empirically, since it yields a good separation into clusters. We note that, alternatively, can be set using standard algorithms designed for this purpose, such as [27, 20]. In addition, in the local PCA we could use a different value of for each neighborhood. We set the neighborhood size to be samples. This choice is explained in the sequel.

The obtained metrics are objectively evaluated in the context of clustering the subjects (columns). Specifically, we use the following algorithm. (i) Compute the affinity matrix with a Gaussian kernel based on the competing metrics:

Common practice is to set the kernel scale to be of the order of the median of all pairwise distances
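A short sketch of this step (our illustration; the exact scaling constant is an assumption): the kernel scale is set from the median of the pairwise distances before forming the Gaussian affinity matrix.

```python
import numpy as np

def gaussian_affinity(D, scale=1.0):
    """Gaussian-kernel affinity matrix from a pairwise distance matrix D."""
    eps = scale * np.median(D[D > 0])     # kernel scale ~ median pairwise distance
    return np.exp(-D**2 / eps**2)
```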