Hybrid Subspace Learning for High-Dimensional Data

08/05/2018 ∙ by Micol Marchetti-Bowick, et al. ∙ Google Carnegie Mellon University 0

The high-dimensional data setting, in which p >> n, is a challenging statistical paradigm that appears in many real-world problems. In this setting, learning a compact, low-dimensional representation of the data can substantially help distinguish signal from noise. One way to achieve this goal is to perform subspace learning to estimate a small set of latent features that capture the majority of the variance in the original data. Most existing subspace learning models, such as PCA, assume that the data can be fully represented by its embedding in one or more latent subspaces. However, in this work, we argue that this assumption is not suitable for many high-dimensional datasets; often only some variables can easily be projected to a low-dimensional space. We propose a hybrid dimensionality reduction technique in which some features are mapped to a low-dimensional subspace while others remain in the original space. Our model leads to more accurate estimation of the latent space and lower reconstruction error. We present a simple optimization procedure for the resulting biconvex problem and show synthetic data results that demonstrate the advantages of our approach over existing methods. Finally, we demonstrate the effectiveness of this method for extracting meaningful features from both gene expression and video background subtraction datasets.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

High-dimensional datasets, in which the number of features is much larger than the sample size , appear in a broad variety of domains. Such datasets are particularly common in computational biology [1]

, where high-throughput experiments abound but collecting data from a large number of individuals is costly and impractical. In this setting, many traditional machine learning algorithms lack sufficient statistical power to distinguish signal from noise, a problem that is generally known as the curse of dimensionality 

[2].

One way to alleviate this problem is to perform dimensionality reduction, either by choosing a subset of the original features or by learning a new set of features. In this work, we focus on the class of subspace learning methods, whose goal is to find a linear transformation that projects the high-dimensional data points onto a nearby low-dimensional subspace. This corresponds to learning a latent space representation of the data that captures the majority of information from the original features.

(a) Data with fully low-rank structure.
(b) Data with hybrid structure.
Fig. 1: Toy datasets that illustrate the difference between fully low-rank data and hybrid data.

The most popular subspace learning method is principal component analysis (PCA) 

[3], which learns a compact set of linearly uncorrelated features that represent the directions of maximal variance in the original data. Since PCA was first introduced, many variants have been developed. For example, Sparse PCA [4] uses an elastic net penalty [5] to encourage element-wise sparsity in the projection matrix, resulting in more interpretable latent features. Another method, Robust PCA [6], learns a decomposition of the data into the sum of a low-rank component and a sparse component, which leads to increased stability in the presence of noise. Finally, there are approaches that propose richer models for the underlying latent representation of the data, involving multiple subspaces rather than just one [7].

A significant limitation of nearly all existing subspace learning methods is their assumption that the data, except for noise terms, can be fully represented by an embedding in a few low-dimensional subspaces. While this may hold true in some restricted settings, we contend that in most high-dimensional, real-world datasets, only a subset of the features exhibit low-rank structure, while the remainder are best represented in the original feature space. Specifically, since the low-rank features will be highly intercorrelated, they can be accurately represented as the linear combination of a small set of latent features. However, if there are raw features that are largely uncorrelated with the others, it’s clear that including them in the latent space model would require adding one new dimension for each such feature. We therefore argue that these features, which we describe as exhibiting high-dimensional rather than low-rank structure, should be excluded from the low-dimensional subspace representation.

We illustrate this intuition with an example. Figure 1 shows two toy datasets that each lie on a different 2D plane in 3D space. In the left plot, all three of the raw dimensions exhibit low-rank structure because they are all correlated. However, in the right plot, the vertical axis is completely uncorrelated with and , which causes the 2D subspace on which the data points lie to be axis-aligned with . We say that this data exhibits hybrid structure because only two out of the three features are truly low-rank.

In this simple example, PCA easily succeeds on both of the datasets shown in Figures 1(a) and 1(b). However, in a high-dimensional and noisy setting, the data may not lie exactly on a low-rank subspace. In this case, we can boost the signal to noise ratio in the data by identifying a sparse set of high-dimensional features that do not contribute to the low-rank structure of the dataset and eliminating them from the low-rank projection. This is the core motivation for our approach.

In this work, we introduce a new method called hybrid subspace learning that estimates a latent representation of the data in which some features are mapped to a low-rank subspace but others remain in the original high-dimensional feature space. To enforce this structure, we propose a novel regularization scheme that encourages each variable to choose between participating in the low-rank or high-dimensional component of the model. The resulting problem is biconvex, and we propose an efficient alternating minimization scheme using proximal gradient descent. We further describe a warm start procedure that allows us to learn a series of increasingly penalized models while avoiding many local optima.

The goal of this method is to perform dimensionality reduction for high-dimensional datasets in a way that allows flexibility in the proportion of low-rank vs. high-dimensional structure that is present in the data, and is also robust to noise. This allows us to learn a compact representation of the data using both feature combination and feature selection. As with other dimensionality reduction models, the representation that we infer can be used in various downstream tasks such as clustering and classification. Finally, we also go one step beyond dimensionality reduction by identifying a sparse set of features that stand out from the rest. Importantly, we do not assume that these features are purely noise; instead, we demonstrate that they are likely to have unique roles or functions, and therefore can provide new domain-specific insights or discoveries.

The remainder of this paper is organized as follows. In Section 2, we motivate our approach by demonstrating that certain properties of several real-world datasets naturally hint at a hybrid model. We then describe our model and optimization procedure in Sections 3 and 4. We evaluate our method on synthetic data in Section 5 and on two real-world datasets in Section 6.2. Finally we conclude in Section 7 by discussing the implications of our findings.

Notation

: We use lowercase bold symbols for vectors

and uppercase bold symbols for matrices . The element of is denoted , the row and column of are denoted and , respectively, and denotes a diagonal matrix s.t. . We use for the element-wise norm of a vector or matrix, for the norm of a vector, for the Frobenius norm of a matrix, and to denote an column-wise block norm of a matrix s.t. .

2 Motivation

In this section, we demonstrate via a series of simulations that hybrid structure causes the singular value spectrum of a dataset to become “long-tailed” i.e. to have a distribution in which a large amount of the probability mass is far from the mean. We then show that many real-world biological datasets possess long-tailed singular value spectra, which implies that it is not appropriate to attempt to capture all of the variables with a low-dimensional representation.

Consider a dataset with samples and features. The top row of Figure 2 shows the singular value spectra of five real datasets that consist of measurements taken from tumor samples of cancer patients. In all of these datasets, the top singular values are large but then decay very quickly. However, instead of going directly to zero, the spectrum has a long tail. This indicates the presence of structure in the data that does not fit into a low-rank space. As a result, if we ignored the tail by projecting the data to a low-rank subspace, it is likely that we would only capture a very coarse-grained representation of the data.

(a) real data 1
(b) real data 2
(c) real data 3
(d) real data 4
(e) real data 5
(f)
(g)
(h)
(i)
(j)
Fig. 2: Singular value spectra of real and synthetic datasets; (a)-(e) five real biological datasets collected from tumor samples of 163 leukemia patients; (f) synthetic data with pure low-rank structure; (g) synthetic data with pure high-dimensional structure; (h)-(j) synthetic data with hybrid structure.

We compare these real datasets with several simulated datasets to demonstrate how certain underlying modeling assumptions affect the singular value spectrum of the data. We generate synthetic data as follows. Let be an matrix with full column rank, be a matrix with full row rank, and be an matrix whose elements are independent. Define a probability vector that specifies the likelihood that each feature participates in only a low-rank (low-r) component, only a high-dimensional (high-d) component, or both, respectively. For simplicity, we consider only the case of for now. For each variable , we draw . If , we set . If , we set .

For our simulations, we use , , , and close to , and plot the spectra of five synthetic datasets generated for multiple values of in the bottom row of Figure 2. In panel (f), we set such that is rank with some random noise. In this case the singular value spectrum drops sharply after , but the tail that appears in the real data is missing. While it is possible that the tail could only contain noise, we postulate that it contains some important information that is ignored by subspace learning methods that focus purely on low-rank structure. In panel (g), we set such that has rank . In this case, the singular value spectrum of decays slowly, again unlike the real data. This implies that methods that use the full data matrix without alteration are not exploiting its intrinsic structure.

Panels (h)-(j) display three “hybrid” settings of . The spectra of these datasets exhibit a structure that is much more similar to the real data, with a few large singular values, and a tail that decays slowly. In these cases, This is the motivation for our hybrid approach that can model both the head and tail of the singular value spectrum.

3 Model

Given a dataset , traditional subspace learning aims to solve the following problem:

(1)

where is a -dimensional representation of each point and is a transformation that maps the latent space to the observed feature space. The above model, which is equivalent to PCA when the columns of are constrained to be orthogonal, implicitly assumes that all of the information in can be captured by its embedding in a low-rank subspace. However, as previously discussed, this assumption is inappropriate for high-dimensional data with a long-tailed singular value spectrum.

To overcome this limitation, we propose a new, flexible model for subspace learning that allows each feature in to choose between participating in a low-rank representation, , or a high-dimensional representation, . With this formulation, the goal is to have the low-r component capture the head of the singular value spectrum while the high-d component captures the tail. This leads naturally to the following problem:

s.t.
(2)

Here, is the low-rank component (as before) and is the high-dimensional component. Furthermore, is a vector of indicator variables, each of which dictates whether or not a particular feature participates in the high-d component. We apply an norm regularizer to restrict the total number of features that are captured by the high-d component. Finally, we constrain the problem such that each feature belongs to exactly one component.

However, this problem is intractable for two reasons. First, the penalty is highly nonconvex and difficult to optimize. Secondly, since and are coupled in the constraint, they cannot be optimized jointly. Performing alternating minimization on (2) would yield degenerate solutions, since initializing to non-zero would always force to be zero and vice-versa. We therefore propose the following relaxation:

s.t. (3)

We make two changes in order to arrive at (3). First, as is common in the sparsity literature, we relax to , and replace the penalty on with an penalty. Second, and more unique to our problem, we replace the hard constraint on and in (2) with a structured sparse regularizer that encourages each feature to participate in either the low-r component () or the high-d component (), but not both. This is achieved with an norm penalty over and of the form . Notice that sparsifying either the th element of or the th column of will completely zero out the th term of the penalty. This regularization scheme therefore encourages mutually exclusive sparsity over the columns of and the elements of . Furthermore, once the th term of the penalty is zero, there is no longer any shrinkage applied to the th column of , which yields a better estimate of the model parameters and eliminates the need for refitting the low-rank model after the high-d features have been identified.

As tends to , the model shown in (3) will enforce the hard constraint in (2). Conveniently, as we will see in the next section, this relaxation also permits us to develop a much more effective optimization procedure that is less likely to be trapped in local optima. At the same time, the new model is more flexible than (2) in that it can allow some overlap between and at the cost of having an additional tuning parameter.

Our approach, hybrid subspace learning (HSL), is closely related to Robust PCA (RPCA) [6] and its variants, which learn a decomposition of the data into the sum of a low-rank component and a sparse component . In particular, while RPCA encourages element-wise sparsity in

, Outlier Pursuit (OP) 

[8] is a more structured approach that encourages row-wise sparsity in in order to identify points in the dataset that are outliers, and allow them to be ignored by the low-rank representation . The OP algorithm can just as easily be applied to a transposed data matrix to identify features that are “outliers” because they can’t easily be embedded in a low-rank subspace. Although this is conceptually very similar to the core idea of HSL, there are several key differences.

First, and most importantly, HSL also strictly enforces sparsity in the projection matrix , which causes some features to be completely excluded from the low-rank representation. In OP, although can be made column-wise sparse, there is nothing to prevent the features that participate in from also participating in . Second, we learn an exact rank low-rank representation, whereas OP aims to minimize the nuclear norm of . Finally, HSL also has connections to Sparse PCA (SPCA) [4], which learns a rank decomposition of given by , where is encouraged to be element-wise sparse.

4 Optimization

Our optimization objective consists of a differentiable, biconvex loss function,

and two non-smooth, biconvex regularizers,

The objective is jointly convex in when and are fixed, and is jointly convex in {} when and are fixed. We implement an alternating minimization scheme to solve this problem, in which we iteratively optimize each convex sub-problem until the complete objective converges. Since the objective function of each sub-problem consists of a smooth, convex loss function plus a non-smooth, convex regularizer, we can leverage well-known tools to optimize functions of this form. Specifically, we apply proximal gradient descent, which projects the gradient step back onto the solution space at each iteration. The complete optimization procedure is outlined in Algorithm 1. In practice, we employ accelerated proximal gradient descent with line search to achieve a convergence rate of  [9]. We also find that 25-50 outer iterations is typically sufficient to reach convergence.

The projection and proximal operators used on lines 8, 10, 16, and 18 of Algorithm 1 are defined as:

(4)
(5)
(6)

These are applied column-wise or element-wise when given matrix arguments in place of vectors or vector arguments in place of scalars, respectively. We also use to denote the element-wise absolute value of , and to denote the column-wise norm of .

1:inputs: data matrix ; regularization parameters , ; step size ; initial values , , ,
2:initialize , , , using provided initial values
3:repeat
4:     fix ,
5:     initialize ,
6:     repeat Optimize
7:         
8:          Eq. (4)
9:         
10:          Eq. (5)
11:     until convergence
12:     fix ,
13:     initialize ,
14:     repeat Optimize
15:         
16:          Eq. (4)
17:         
18:          Eq. (6)
19:     until convergence
20:until convergence
21:outputs: estimates , , ,
Algorithm 1 Proximal Gradient Descent for HSL
1:inputs: data matrix ; regularization parameter ; increment size ; step size
2:randomly initialize , , ,
3:initialize
4:while  do
5:     (, , , )
6:       ProxGD-HSL (, , , , , , , )
7:     update
8:end while
9:outputs: final estimates of , , ,
Algorithm 2 Warm Starts for HSL

Although this optimization procedure is quite efficient, the algorithm can easily get trapped in local optima. The joint regularization term compounds the problem by increasing the sensitivity of the algorithm to initialization, especially when the value of is very high. However, when is small, these local optima are substantially reduced. Therefore, to circumvent this problem, we fit our model to data by incrementally increasing the value of from 0 to , while using warm starts to initialize the estimate of each successive model.111This is based on ideas by [10] who proposed warm starts for a non-convex sparse regularizer. In the next section, we demonstrate empirically that using warm starts in place of cold starts leads to significant performance gains. The warm starts procedure is shown in Algorithm 2, where we define as the smallest value of that yields .

5 Synthetic Data Experiments

Fig. 3: Results comparing the performance of our hybrid model against four baselines on synthetic data.

In order to quantitatively evaluate our approach, we begin by performing a series of experiments on synthetic data. We compare HSL against four baseline methods that perform different variants of subspace learning: PCA, Sparse PCA [4], Robust PCA [6], and Outlier Pursuit [8]. Note that we apply Outlier Pursuit to the transposed data matrix, .

5.1 Data Generation

Given raw feature space dimensionality , latent space dimensionality , and sample size , we first generate low-rank features and high-dimensional features . We then generate coefficients for the low-r component by drawing uniform random values in and similarly generate coefficients for the high-d component by drawing uniformly at random from .

Next, given a probability vector whose elements denote the likelihood that a feature will participate in only the low-r component (), only the high-d component (), or both (), we incorporate sparsity by setting randomly chosen columns of and elements of to zero according to the proportions specified in . Finally we generate the data according to , where is i.i.d. Gaussian noise.

5.2 Experimental Results

Fig. 4:

Results of a phase transition experiment with varying

and .
Fig. 5: Precision-recall curves for SPCA, RPCA, OP, and HSL calculated by varying parameter values over a broad range and evaluating recovery of the true set of high-dimensional features.
Fig. 6: (a) Convergence of HSL. (b) Final objective value after running HSL with each value of using warm and cold starts. (c) F1 score after running HSL with each value of using warm and cold starts. (d) Final objective value of HSL averaged over multiple simulations with warm and cold starts.

We compare the performance of our method against the baselines on three tasks: recovery of the true low-rank subspace, recovery of the true high-dimensional component, and identification of the set of true high-dimensional features. For each method, we evaluate recovery of the subspace based on the estimate of the low-rank matrix , which is directly output by RPCA and OP, and is calculated according to for SPCA and HSL. Specifically, we compare the operator that projects -dimensional points to the true -dimensional subspace to the projection operator for the -dimensional subspace closest to for each method. Similarly, we evaluate recovery of the high-dimensional component based on the estimate of the sparse matrix , which is again directly estimated by RPCA and OP, and is calculated according to for HSL.222PCA and SPCA do not produce a high-dimensional component, so we do not evaluate this metric for those two methods. We measure recovery of the true

using the Frobenius norm distance, and we measure the recovery of the high-dimensional feature set using the F1 score. Since parameter selection is a challenging task in unsupervised learning, each method is run with the ground truth value of

, and tuning parameters are chosen by picking the values that yield the best recovery of the low-rank subspace. We believe this provides a fair comparison of all methods.

In our first set of experiments, we choose default parameters , , , , , and then vary certain parameters across a range of values in order to evaluate the performance of our model under a broad variety of settings. In particular, we vary (a) the noise , (b) the dimensionality of the latent and feature space , and (c) the proportion of low-r and high-d participation ( v. ) with no overlap. In all cases, we run HSL with to ensure no overlap between the low-r and high-d components. The results of these experiments are shown in Figure 3. The top row shows the recovery error for the low-rank component , and the bottom row shows the recovery error for the high-dimensional component

. Each point represents the mean value over 10 random datasets, and the error bars show the standard error over these trials. The results demonstrate that HSL significantly outperforms all baselines in nearly all conditions.

Next we perform an experiment to evaluate the phase transition of our model in the zero noise case relative to RPCA and OP. Here we use , , , and we vary the low-rank dimensionality and the number of features in the high-dimensional component, denoted by . For each parameter setting, we run 10 trials and report (a) the average recovery error on , (b) the average selection error on ( F1 score), and (c) the average number of successes, where we define success as exactly recovering the true subspace (error ) and identifying the correct set of high-dimensional features (). The phase transition diagrams are shown in Figure 4. In the top two panels, a higher error is denoted by a darker color. HSL achieves low error in the majority of cases, whereas RPCA and OP have significantly higher error even when and are both small. In the bottom panel, white indicates success and black indicates failure. This figure shows that only HSL succeeds on both tasks (recovery of and ) when .

We also show a comparison of the precision-recall curves for the recovery of the high-dimensional features obtained by varying the parameter values for SPCA, RPCA, OP, and HSL over a broad range in Figure 5. The left panel shows the PR curve generated using the standard data generation approach that we previously described. Although HSL achieves a very high AUC, several other methods perform just as well. In order to make this task more challenging, we generated a second type of dataset in which the average variance of the high-dimensional features is about half the average variance of the low-rank features, making them harder to identify. The right panel shows the PR curve generated from this data. In both cases, the PR curves are averaged over 20 simulations.

Finally, we perform an empirical analysis of the effects of using cold starts versus warm starts to optimize our model. Given a dataset and a fixed value of , we train our model in one of two ways. Using cold starts, we simply test a series of successive values of , randomly initializing the model each time, until we hit . Using warm starts, we start with and increase its value incrementally, each time initializing the model with the estimate obtained on the previous value of , and again stop when we reach . As previously stated, is not fixed a priori, but is chosen to be the smallest value that yields zero overlap between the low-r and high-d components. We compare the final objective value and F1 score obtained after optimizing our objective with each value of . The results are shown in Figure 6, and illustrate that using warm starts helps avoid local optima and leads to increased stability. Figure 6 also shows that HSL with warm starts exhibits good convergence properties.

6 Real Data Experiments

In this section, we apply hybrid subspace learning in two different domains in order to showcase its capabilities and performance on real-world datasets.

6.1 Background Subtraction in Videos

Fig. 7: Results on one frame of the escalator dataset.

As an illustrative example, we begin by applying HSL to the problem of background subtraction in videos. In contrast to most traditional methods for background subtraction, which aim to distinguish between the foreground and background pixels in each frame, our results demonstrate that HSL is useful for identifying locations of consistent but irregular movement in videos.

In this experiment, we applied HSL along with RPCA and OP to a video of three escalators in a subway station. In the clip, one of the escalators has considerable traffic while the other two are largely unused. However, all three escalators are running. Using , HSL assigns nearly all of the background pixels to the low-r component, including those for the two moving but empty escalators, and assigns the remaining pixels, most of which correspond to locations with foreground movement, to the high-dimensional component. In this case, the two empty escalators exhibit regular movement, whereas the third escalator exhibits irregular movement from the pedestrian activity.

The results on a single frame of the video are shown in Figure 7

and compared with the results of applying RPCA and OP to the same data. Note that hyperparameters for each method were chosen to yield the same fraction of features assigned to the low-r versus high-d components (80% vs. 20%, respectively). The results of all three methods on the full video sequence are also available.

333See https://youtu.be/Ke0AZUn4TdM. In this example, only HSL is able to assign all pixels containing information about the moving people in the video to the high-dimensional component. Furthermore, the sparse map produced by HSL corresponds much more closely to the regions with consistent foreground movement in the videos, namely the locations that people move through.

6.2 Genomic Analysis of Cancer

(a) Raw Data, p-value=0.18
(b) PCA, p-value=0.16
(c) Robust PCA, p-value=0.15
(d) Outlier Pursuit, p-value=0.16
(e) HSL, p-value=0.04
(f) Mean (std.) p-values of differential survival functions over 100 initializations of the clustering method.
Fig. 8:

Survival analysis of breast cancer gene expression. (a-e) Representative Kaplan-Meier survival function estimates and corresponding p-values. (f) Distribution of p-values of differential survival functions over 100 clustering initializations.

Tumor Type PCA Robust PCA Outlier Pursuit HSL
Breast
Colon
GBM
Kidney
Lung
TABLE I: Reconstruction Errors of the Low-Rank Component of miRNA Data
Tumor Type Raw Data PCA Robust PCA Outlier Pursuit HSL
Breast
Colon
GBM
Kidney
Lung
TABLE II: Silhouette Scores for Clusters Produced by -Means

center Data Type Gene Ontology Code Gene Ontology Term Selected Oncogenes GO:0032633 Interleukin-4 production LEF1, CD83 Tumor GO:0017111 Nucleoside-Triphosphatase Activity TCIRG1, RAB31, ATP6V1C1, ATP6V1G3 GO:0005515 Protein Bnding NTRK3, HSPA1A, CCR5, ITGA2 + 10 more GO:0034472 snRNA 3’-end Processing None Control GO:0005006 Epidermal Growth Factor Receptor Activity ERRFI1, PSEN1 GO:0060478 Acrosomal vesical exocytosis None

TABLE III: Differential Enrichment of the Features Assigned to High-Dimensional Components

Next we apply HSL to biomedical data, and provide both qualitative and quantitative results to illustrate its performance. A common biological case in which is that of microarray data, in which the number of features measured typically far exceeds the number of patients for whom data is available. Here, we study the effectiveness of applying subspace learning methods to microarray data taken from cancer patients. We show that our approach outperforms several baselines on this data. Specifically, HSL produces subspace embeddings that achieve lower reconstruction error and lead to better performance on downstream tasks than competing methods. Finally, we demonstrate that HSL can also be used as a feature selection algorithm, since the features assigned to the high-dimensional component reflect biological characteristics of the original data.

To conduct our experiments, we used two datasets from TCGA444The Cancer Genome Atlas, http://cancergenome.nih.gov/.. The first dataset contains miRNA expression levels for five types of cancer – breast, glioblastoma multiforme (GBM), colon, kidney, and lung. Within each cancer type, we have data for 106, 93, 216, 123, and 107 samples, respectively, and each sample has 354 miRNA features. We use this dataset to evaluate how well the low-rank embedding of HSL captures the original data and its characteristics. The second dataset contains gene expression data for breast cancer patients and matching control samples. It contains 13794 mRNA features for 106 samples. We used this to analyze the high-dimensional component of HSL and to determine whether the information contained in the HSL estimate could sufficiently differentiate between cancer and control samples. EDIT HERE

For each dataset, the number of latent dimensions was chosen by manually inspecting the singular value spectrum. This value was determined to be for the miRNA datasets and for the gene expression dataset. In all experiments, we selected hyperparameter values as follows. For RPCA, the value of was set to , which can optimally recover the low-rank structure under standard assumptions [8]. In keeping with our synthetic experiments, OP was run on the transposed data matrix. The value of for OP was chosen to produce a low-rank component with rank equal to . For HSL, parameters were selected by performing a grid search and selecting the combination of parameters that minimized the AIC score.

In our first experiment, we evaluated the quality of the low-r components estimated for each miRNA dataset. To do this, we measured the reconstruction errors of the low-r embeddings produced by each method. Reconstruction errors, calculated as the Euclidean distance between the original data and the estimated low-r component , are shown in Table I. We see that HSL performs at least comparably, and frequently outperforms, all baseline methods on all datasets.

Next, we hypothesized that the low-r component of the HSL embedding may be more biologically informative than those estimated by traditional subspace learning methods. To study this, we used the estimated low-rank embeddings from each method to cluster the samples within each cancer type into subtypes. Since we do not have ground truth information about the subtypes, we evaluated the quality of the clusters by their silhouette scores, which provide a measure of how well the samples fit into their respective clusters. We performed -means clustering using clusters for breast[11], GBM[12], and colon[13] cancers and clusters for kidney[14] and lung[15]

cancers, where the number of clusters is based on the number of experimentally identified subtypes. The mean and standard deviation of the silhouette scores over 100 initializations of the clustering algorithm are shown in Table

II

. From these results, we see that the features extracted from the low-r component of the hybrid model yield more coherent clusters than features extracted from baseline methods.

Since our hybrid model does not encode all the features of the original data in the low-rank subspace, using these features alone would not necessarily be expected to boost performance on downstream tasks. Furthermore, the features assigned to the high-d component of the model likely correspond to genes that display uncommon activity patterns, which is why they cannot be easily represented by the same low-rank structure as the other genes. Based on this reasoning, we hypothesized that, rather than being unimportant, some of these genes may actually have very important biological functions. This is particularly likely in the case of cancer data, since genes that are mutated in cancerous cells display highly aberrant activity that disrupts their normal correlations with other genes.

To test this hypothesis, we investigated whether genes assigned to the high-d component in HSL are enriched for oncogenes when the model is run on cancerous samples but not enriched for oncogenes when it is run on samples of healthy tissue. For this experiment, we used the breast cancer gene expression data with matching control samples. After estimating the latent subspaces, we identified gene ontology (GO) terms by performing an enrichment analysis[16] of the features comprising the high-d component, and identified known oncogenes[17] in the subsets. For both cancer and control samples, the three GO terms with the lowest p-value for each dataset, and their contained oncogenes, are shown in Table III.

From these results, we see that HSL identifies a significant number of oncogenes when trained on tumor samples but selects non-oncogenic genes when trained on the healthy control samples. Notably, the high-d component estimated from the breast cancer tumor dataset selected features involved in the regulation of Interleukin-4, an enzyme that is known to be key in the growth of human breast cancer tumors [18]. In contrast, the high-d component learned from a control group did not include those features, instead assigning them to the low-rank space. In addition, the high-d component for the cancerous samples is enriched for the GO term “nucleoside-triphosphatase activity”, which includes both ATPase and GTPase activity. These processes are involved in regulation of the cell metabolism, a central mechanism in tumor growth[19]. Once again, the hybrid model assigned these features to the low-r component for non-cancerous samples. As the two datasets share the same set of features, the differential enrichment of oncogenes in the high-d component suggests that the assignment of features to either high-d or low-r component reflects characteristics of the original data.

Finally, we studied whether the subspaces estimated by HSL are more useful for downstream analysis than those of competing methods. To do this, we clustered the low-rank embeddings estimated from gene expression levels of both tumor and control samples into two groups using -means. As seen in Figure 8, clusters formed in the subspace estimated by HSL have more differential survival patterns than clusters formed in the subspaces estimated by traditional methods. While the survival effect size is not large, HSL is the only dimensionality reduction technique that retains enough information to produce any survival curves that are different at a significance level of . This indicates that the subspace estimated by HSL is not only efficient, but also retains information for downstream analysis.

7 Conclusion

In this work, we present a new subspace learning model. Our approach employs a novel regularization scheme to estimate a partial low-rank latent space embedding of a high-dimensional dataset, and simultaneously identifies features that do cannot easily be embedded in a low-rank space. This model addresses a critical gap in the existing literature on subspace learning, in which it is usually assumed that the high-dimensional data can be completely captured by a low-rank approximation, modulo some noise.

By comparing the singular value decompositions of real and synthetic datasets, we demonstrate that this assumption is not fulfilled in many real datasets. We therefore argue that our model is more appropriate for subspace learning on high-dimensional datasets that have a long-tailed singular value spectrum. Through applications to synthetic data, a video background subtraction task, and real gene expression data, we demonstrate that hybrid subspace learning can effectively learn a low-rank latent structure while assigning meaningful features to the high-dimensional component.

References

  • [1] V. Marx, “Biology: The big challenges of big data,” Nature, vol. 498, no. 7453, pp. 255–260, 2013.
  • [2] G. P. Hughes, “On the mean accuracy of statistical pattern recognizers,” Information Theory, IEEE Transactions on, vol. 14, no. 1, pp. 55–63, 1968.
  • [3] I. Jolliffe, Principal component analysis.   Wiley Online Library, 2002.
  • [4] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of computational and graphical statistics, vol. 15, no. 2, pp. 265–286, 2006.
  • [5] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
  • [6] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
  • [7] R. T. Dorsam and J. S. Gutkind, “G-protein-coupled receptors and cancer,” Nature reviews cancer, vol. 7, no. 2, pp. 79–94, 2007.
  • [8] H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” in Advances in Neural Information Processing Systems, 2010, pp. 2496–2504.
  • [9] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • [10] R. Mazumder, J. H. Friedman, and T. Hastie, “Sparsenet: Coordinate descent with nonconvex penalties,” Journal of the American Statistical Association, vol. 106, no. 495, 2011.
  • [11] K. Voduc, M. Cheang, S. Tyldesley, K. Gelmon, T. Nielsen, and H. Kennecke, “Breast cancer subtypes and the risk of local and regional relapse.” J. Clinical Oncology, vol. 28, no. 10, pp. 1684–91, 2010.
  • [12] R. Verhaak et al, “Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1.” Cancer Cell, vol. 17, no. 1, pp. 98–110, 2010.
  • [13] J. Guinney et al, “The consensus molecular subtypes of colorectal cancer,” Nature Medicine, vol. advance online publication, Oct 2015.
  • [14] S. R. Prasad, P. A. Humphrey, J. R. Catena, V. R. Narra, J. R. Srigley, A. D. Cortez, N. C. Dalrymple, and K. N. Chintapalli, “Common and uncommon histologic subtypes of renal cell carcinoma: Imaging spectrum with pathologic correlation,” RadioGraphics, vol. 26, no. 6, pp. 1795–1806, 2006, pMID: 17102051. [Online]. Available: http://dx.doi.org/10.1148/rg.266065010
  • [15] L. West, S. J. Vidwans, N. P. Campbell, J. Shrager, G. R. Simon, R. Bueno, P. A. Dennis, G. A. Otterson, and R. Salgia, “A novel classification of lung cancer into molecular subtypes,” PLoS ONE, vol. 7, no. 2, pp. 1–11, 02 2012. [Online]. Available: http://dx.doi.org/10.1371/journal.pone.0031906
  • [16] E. Eden, R. Navon, I. Steinfeld, D. Lipson, and Z. Yakhini, “Gorilla: a tool for discovery and visualization of enriched go terms in ranked gene lists,” BMC Bioinformatics, vol. 10, no. 48, 2009.
  • [17] G. F. Berriz, O. D. King, B. Bryant, C. Sander, and F. P. Roth, “Characterizing gene sets with funcassociate,” Bioinformatics, vol. 19, no. 18, pp. 2502–2504, 2003.
  • [18] N. S and T. M, “Interleukin-4 and breast cancer,” BMC Bioinformatics, vol. 7, no. 3, pp. 181–6, 2000.
  • [19] R. A. Cairns, I. S. Harris, and T. W. Mak, “Regulation of cancer cell metabolism,” Nature Reviews Cancer, vol. 11, no. 2, pp. 85–95, 2011.