Best Subset Selection in Reduced Rank Regression

Canhong Wen, et al. · December 13, 2019

Reduced rank regression is widely used for modeling the relationship, and uncovering the structure, between multivariate responses and multivariate predictors in genetics. The problem is especially challenging when the predictors are high-dimensional, in which case subset selection is considered to reduce model complexity and enhance model interpretability. We propose a novel selection scheme that directly identifies the best subset of predictors via a primal-dual formulation. Based on it, we develop a computationally efficient algorithm that scales to high-dimensional data with guaranteed convergence. We show that the estimator from the proposed algorithm enjoys nice sampling properties, including consistency in estimation, rank and sparsity selection, under mild regularity conditions. In practice, the new estimator achieves competitive numerical performance under a variety of simulation settings while allowing significantly faster computation. The effectiveness of the proposed method is also demonstrated on an ovarian cancer genetic dataset.


1 Introduction

Suppose we observe the centered dataset $\{(y_i, x_i)\}_{i=1}^{n}$, where the response $y_i\in\mathbb{R}^{q}$ is generated via the multivariate response linear regression model $y_i=C^{T}x_i+e_i$; here $C\in\mathbb{R}^{p\times q}$ is the underlying yet unknown coefficient matrix and $e_i\in\mathbb{R}^{q}$ is a zero-mean noise vector. (Both the responses and the predictors are centered for simplicity; otherwise, one can include an intercept term in the model.) In the literature, $y_i$ is regarded as the response to the predictor $x_i\in\mathbb{R}^{p}$, related through the structure $C$ that we expect to uncover. Equivalently, we can write the model in matrix form as

$Y = XC + E, \qquad (1)$

where $Y=(y_1,\dots,y_n)^{T}$, $X=(x_1,\dots,x_n)^{T}$ and $E=(e_1,\dots,e_n)^{T}$ concatenate the responses, predictors and noises, respectively. We further assume that each column of $X$, namely each variable with $n$ samples, is normalized to a common $\ell_2$ norm.
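To make the preprocessing concrete, the following sketch (our own illustration, not code from the paper) centers $Y$ and $X$ and rescales each column of $X$ to a common norm; the choice of $\sqrt{n}$ as the default common norm is an assumption for illustration.

```python
import numpy as np

def center_and_normalize(Y, X, target_norm=None):
    """Center responses/predictors and rescale each column of X to a common norm.

    The default common norm sqrt(n) is an illustrative choice.
    """
    n = X.shape[0]
    if target_norm is None:
        target_norm = np.sqrt(n)
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    scales = np.linalg.norm(Xc, axis=0)
    Xc = Xc / scales * target_norm      # each column now has norm target_norm
    return Yc, Xc
```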

Recently, as high-dimensional data become increasingly common, model (1) calls for (i) a simple structure of the coefficient matrix $C$, and (ii) a small number of important variables that affect the responses most. Reduced rank regression (RRR) (Anderson, 1951; Izenman, 2008), which restricts the coefficient matrix to be low rank, provides a simple yet interpretable underlying structure and has been widely used in econometrics and genetics; see, for instance, Velu and Reinsel (1998), Vounou et al. (2010), and Ma et al. (2014). It works by solving the rank-constrained least squares problem

$\min_{C}\ \|Y - XC\|_{F}^{2} \quad \text{subject to}\ \operatorname{rank}(C)\le r, \qquad (2)$

where $\|\cdot\|_{F}$ is the Frobenius norm, $\operatorname{rank}(C)$ denotes the rank of $C$, and $r$ is a positive integer. Meanwhile, as the number of variables is relatively large in high-dimensional data, we aim at detecting a small (or sparse) subset of important predictors and thus obtaining a parsimonious fit to the data at hand. For example, in genomics, researchers are interested in finding out which microRNAs play a key role in the regulatory relationship between cancer-related gene expressions and hundreds of microRNAs (Ma et al., 2014).
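For reference, below is a minimal sketch of the classical closed-form solution of the RRR problem (2): an unrestricted least squares fit followed by a rank-$r$ truncation of the fitted values. The function name `rrr` and its interface are ours, not the paper's.

```python
import numpy as np

def rrr(Y, X, r):
    """Classical reduced rank regression: rank-r solution of min ||Y - XC||_F^2."""
    C_ols = np.linalg.pinv(X.T @ X) @ (X.T @ Y)   # unrestricted least squares fit
    fitted = X @ C_ols                            # = P_X Y, projection of Y onto col(X)
    # Leading right singular vectors of the fitted values give the rank-r right factor.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V_r = Vt[:r].T                                # q x r
    return C_ols @ V_r @ V_r.T                    # rank-r coefficient matrix
```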

To meet these two demands, we focus on the scenario in which the true coefficient matrix $C^{*}$ is both low-rank and row-sparse; that is, its rank and its number of nonzero rows are both small relative to the problem dimensions. In order to recover $C^{*}$, we consider the following constrained least squares problem

$\min_{C}\ \|Y - XC\|_{F}^{2} \quad \text{subject to}\ \operatorname{rank}(C)\le r,\ \|C\|_{2,0}\le s, \qquad (3)$

where $\|C\|_{2,0}$ denotes the number of nonzero rows of $C$, and $r$ and $s$ are prespecified positive integers controlling the rank and the row sparsity, respectively. The minimizer of (3), denoted by $\hat C$, is called the multivariate response best subset selection (MrBeSS) estimator with rank $r$ and sparsity $s$.

The only difference between the MrBeSS minimization (3) and the RRR minimization (2) is the row-sparsity constraint. Thus a direct approach to the MrBeSS estimator is to select $s$ candidate rows of $C$ that may be nonzero, and then apply the RRR method to the corresponding selected data, which includes the $j$-th column of $X$ if and only if the $j$-th row of $C$ is a candidate. However, exhaustively searching over all possible choices of candidates and performing RRR each time is NP-hard and impractical.
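To see why the exhaustive approach is impractical, the following sketch enumerates all $\binom{p}{s}$ row subsets and fits RRR on each, reusing the hypothetical `rrr` helper above; it is feasible only for very small $p$.

```python
from itertools import combinations
import numpy as np

def best_subset_rrr_bruteforce(Y, X, r, s):
    """Exhaustive best subset selection for RRR; exponential in p, for tiny examples only."""
    n, p = X.shape
    best_loss, best_C = np.inf, None
    for subset in combinations(range(p), s):
        C_sub = rrr(Y, X[:, subset], r)           # RRR on the selected columns only
        C = np.zeros((p, Y.shape[1]))
        C[list(subset), :] = C_sub
        loss = np.linalg.norm(Y - X @ C, 'fro') ** 2
        if loss < best_loss:
            best_loss, best_C = loss, C
    return best_C
```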

To overcome the computational difficulties caused by the non-convex constraint that strictly gives the best subset selection result, computationally friendlier regularization methods have been proposed as surrogates for problem (3), among which convex relaxation is a popular choice. For example, the non-convex $\|C\|_{2,0}$ is replaced by the convex $\|C\|_{2,1}$, the sum of the row-wise $\ell_2$ norms. Replacing the sparsity constraint by a Lagrangian term then gives the penalized optimization problem

$\min_{C}\ \|Y - XC\|_{F}^{2} + \lambda\sum_{j=1}^{p}\|C_{j\cdot}\|_{2} \quad \text{subject to}\ \operatorname{rank}(C)\le r, \qquad (4)$

with penalty parameter $\lambda>0$. The minimizer of (4) is called the rank constrained group Lasso (RCGL) in Bunea et al. (2012), since the penalty term can be viewed as a group Lasso penalty (Yuan and Lin, 2006) obtained by vectorizing $C$ and treating the entries in the same row as a group. Consequently, several adaptively weighted variants of the group Lasso (Chen and Huang, 2012; She, 2017) can also be applied to select important predictors in RRR. For example, instead of assigning equal weights to all the row norms, the SRRR estimator in Chen and Huang (2012) minimizes a variant of (4) in which the row-wise weights are determined by an adaptive weighting strategy (Zou, 2006; Wang and Leng, 2008). Besides, rather than directly imposing group sparsity on $C$, Chen et al. (2012) and Mishra et al. (2017) suggested decomposing the coefficient matrix via the singular value decomposition and promoting group sparsity on the corresponding singular vectors. Similarly, Ma et al. (2014) applied a hard thresholding operator to the singular vectors in order to eliminate singular vectors with small singular values.

In spite of their favorable computational properties, regularization methods based on continuous penalty relaxation have several shortcomings. The gap between the relaxed penalty and the original one is a main concern addressed by the joint rank and row selection (JRRS) estimator proposed by Bunea et al. (2012). The JRRS estimator provides an adaptive correction, which acts analogously to information criteria such as AIC (Akaike, 1974) and BIC (Schwarz, 1978) in the case of univariate response regression. Specifically, for a grid of candidate ranks, the authors consider solving the following regularized optimization problem

(5)

with a tuning parameter that must be specified in advance. Nonetheless, this approach requires complete enumeration of the model space and should be used together with an efficient subset selection procedure. Moreover, the tuning parameter is hard to determine beforehand, which limits the effectiveness of the JRRS estimator.

Meanwhile, on the theoretical side, certain sufficient regularity conditions on the data must be satisfied to guarantee that a relaxed model attains good predictive accuracy; see, for example, the restricted eigenvalue condition (Bunea et al., 2012) or the incoherence condition (She, 2017). As soon as these conditions are violated, the above group-Lasso-based methods become suboptimal and may bring in a large number of irrelevant rows, including noise predictors. The adaptively weighted version of the group Lasso penalty may help improve the prediction performance, yet it only works for a fixed number of predictors $p$ and tends to under-select relevant variables in high dimensions (Chen and Huang, 2012). In contrast, methods based on the group $\ell_0$ penalty are shown to achieve the optimal rate for prediction error under no restrictions on the design matrix (Bunea et al., 2012; She, 2017). Recently, She and Chen (2017) reported that the performance of the group Lasso penalty is substantially worse and less stable than that of the nonconvex group $\ell_0$ penalty for outlier detection in RRR. Therefore, to achieve better subset selection and parameter estimation in RRR, it is desirable to design an efficient algorithm directly based on the group $\ell_0$ penalty.

In this paper, we propose a novel framework via which the best subset selection problem (3) can be solved directly within a reasonable time frame. Motivated by the primal-dual formulation of the optimizer, which is the main ingredient of our proposal, we develop a new computationally efficient algorithm with guaranteed convergence. Note that the special case of univariate response regression has been thoroughly studied in Wen et al. (2017); we provide a more general point of view here. Theoretically, we demonstrate that our proposed MrBeSS algorithm successfully recovers the true model with acceptable accuracy. Further, we show significant numerical outperformance of MrBeSS under a variety of simulation settings when compared with state-of-the-art methods. We also demonstrate that the new estimator is efficient and stable on real data through an example of microRNA and gene expression association.

The rest of the paper is organized as follows. Section 2 presents the primal dual characterization of the best subset selection model. We mainly develop our algorithms in Section 3, including solving a MrBeSS estimator with fixed parameters, an initialization strategy, and an adaptive parameter tuning algorithm. In Section 4, we analyze the theoretical performance of the proposed algorithms. The competitive numerical performance of our proposal is demonstrated in Section 5 through both simulated and real data examples. Section 6 discusses interesting related problems for future research. The detailed proofs of the theorems are presented in the Supplementary Material.

Notations

Conventionally, a vector is expressed in column form, that is, $a=(a_1,\dots,a_m)^{T}$. For a matrix $A$ with $j$-th row $A_{j\cdot}$, we denote its $k$-th column as $A_{\cdot k}$, and more generally $A_{J\cdot}$ (resp. $A_{\cdot J}$) represents the sub-matrix concatenating the rows (resp. columns) of $A$ indexed by the set $J$. The rank of $A$ is denoted $\operatorname{rank}(A)$, while its trace is denoted $\operatorname{tr}(A)$. For a symmetric matrix $A$, let $\lambda_{k}(A)$ and $u_{k}(A)$ be its $k$-th largest eigenvalue and the corresponding eigenvector, respectively. As an eigenvector has at least two options in opposite directions, we will specify one direction case by case. Let $\|A\|_{F}$ and $\|A\|_{2}$ be the Frobenius norm and the $\ell_2$ (spectral) norm of $A$, respectively. Define $\|A\|_{2,0}$ as the number of nonzero rows of $A$, which can be convexly relaxed to the $\ell_{2,1}$ norm of $A$, denoted $\|A\|_{2,1}=\sum_{j}\|A_{j\cdot}\|_{2}$.

As our theoretical results focus on the case where the number of samples $n$ is sufficiently large, we will use the following notation to compare the growth order of sequences. For any two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n\lesssim b_n$ or $a_n\asymp b_n$ if $a_n$ grows no faster than, or in the same order as, $b_n$, respectively. That is,

$a_n\lesssim b_n \iff \limsup_{n\to\infty} a_n/b_n<\infty, \qquad a_n\asymp b_n \iff 0<\liminf_{n\to\infty} a_n/b_n\le\limsup_{n\to\infty} a_n/b_n<\infty.$

Further, we analogously write $a_n\gtrsim b_n$ if $b_n\lesssim a_n$, $a_n=o(b_n)$ if $a_n/b_n\to 0$, and $a_n=\omega(b_n)$ if $b_n=o(a_n)$.

2 Primal Dual Formulation

We decompose $C$ into the product of two matrices, i.e., $C=BV^{T}$, with $B\in\mathbb{R}^{p\times r}$ and the orthogonal matrix $V\in\mathbb{R}^{q\times r}$ satisfying $V^{T}V=I_{r}$. Then (3) can be rewritten as

$\min_{B,V}\ \|Y - XBV^{T}\|_{F}^{2} \quad \text{subject to}\ \|B\|_{2,0}\le s,\ V^{T}V=I_{r}. \qquad (6)$

Therefore the two restrictions are separated, which may help reduce the computational difficulty. However, the solution to the optimization problem (6) is not unique. For example, suppose $(B,V)$ is a solution of (6); then $(\tilde B,\tilde V)$ is also a solution of (6) if there is an orthogonal matrix $Q\in\mathbb{R}^{r\times r}$ such that $\tilde B=BQ$ and $\tilde V=VQ$. Nonetheless, $\tilde B\tilde V^{T}=BQQ^{T}V^{T}=BV^{T}$ shows that both solutions give the same estimate of $C$.
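A quick numerical check of this non-uniqueness (our own illustration): rotating both factors by the same orthogonal matrix leaves the product $BV^{T}$, and hence the fit, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 8, 5, 2
B = rng.normal(size=(p, r))
V, _ = np.linalg.qr(rng.normal(size=(q, r)))      # orthonormal columns, V^T V = I_r
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))      # an r x r orthogonal matrix

C1 = B @ V.T
C2 = (B @ Q) @ (V @ Q).T                          # rotated factors give the same C
print(np.allclose(C1, C2))                        # True
```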

Next we characterize the primal-dual condition for a minimizer $(B^{\diamond},V^{\diamond})$ of (6), which motivates our MrBeSS algorithm. For the $j$-th row, we consider the unconstrained minimization problem obtained by fixing the other variables at their optimal values, namely,

$\min_{B_{j\cdot}}\ \Big\|Y-\sum_{k\ne j}X_{\cdot k}B^{\diamond}_{k\cdot}V^{\diamond T}-X_{\cdot j}B_{j\cdot}V^{\diamond T}\Big\|_{F}^{2}. \qquad (7)$

The explicit solution of (7) is $B_{j\cdot}=B^{\diamond}_{j\cdot}+\Gamma^{\diamond}_{j\cdot}$ with $\Gamma^{\diamond}_{j\cdot}=X_{\cdot j}^{T}(Y-XB^{\diamond}V^{\diamond T})V^{\diamond}/\|X_{\cdot j}\|_{2}^{2}$, where $X_{\cdot j}$ denotes the $j$-th column of $X$.

To figure out the connection between $B^{\diamond}_{j\cdot}$ and $\Gamma^{\diamond}_{j\cdot}$, note that the row-sparsity constraint in (6) forces some rows of $B^{\diamond}$ to be zero vectors. If $B^{\diamond}_{j\cdot}$ is not enforced to be zero, then the row-wise optimality of $B^{\diamond}$ gives that $B_{j\cdot}=B^{\diamond}_{j\cdot}$ and hence $\Gamma^{\diamond}_{j\cdot}=0$. From the above discussion, we observe the primal-dual condition of the optimal point.

Proposition 1 (Primal-dual condition).

If $(B^{\diamond},V^{\diamond})$ is a minimizer of (6) and $A$ denotes the index set of the nonzero rows of $B^{\diamond}$, then, row-wisely, $B^{\diamond}$ and its corresponding normalized residual $\Gamma^{\diamond}$, with $\Gamma^{\diamond}_{j\cdot}=X_{\cdot j}^{T}(Y-XB^{\diamond}V^{\diamond T})V^{\diamond}/\|X_{\cdot j}\|_{2}^{2}$, satisfy the following primal-dual condition:

$\Gamma^{\diamond}_{j\cdot}=0\ \text{for}\ j\in A, \qquad B^{\diamond}_{j\cdot}=0\ \text{for}\ j\notin A. \qquad (8)$

In consequence, we call $B_{j\cdot}$ and $\Gamma_{j\cdot}$ the $j$-th primal variable and dual variable, respectively.

For the decomposed MrBeSS problem (6), define the active set as the index set of the nonzero rows of $B$, i.e., $A=\{j:B_{j\cdot}\ne 0\}$, and the inactive set $I=A^{c}$ as its complement. Given the active set $A$ and the $V$-part of the minimizer, we can recover the primal variables as well as the dual according to (8). Specifically,

$B_{A\cdot}=(X_{\cdot A}^{T}X_{\cdot A})^{-1}X_{\cdot A}^{T}YV, \qquad \Gamma_{A\cdot}=0. \qquad (9)$

Since the support of the primal variables is complementary to that of the dual, we directly set

$B_{I\cdot}=0, \qquad \Gamma_{j\cdot}=X_{\cdot j}^{T}(Y-X_{\cdot A}B_{A\cdot}V^{T})V/\|X_{\cdot j}\|_{2}^{2}\ \text{for}\ j\in I. \qquad (10)$

This computation is the key to designing a scalable MrBeSS algorithm in the next section.
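The recovery in (9)-(10) amounts to a least squares fit on the active columns plus a normalized residual correlation on the inactive ones. The sketch below illustrates this; the exact scaling of the dual (here by the squared column norm) follows our reading of the normalized residual and should be checked against the paper. The helper name `primal_dual` is ours.

```python
import numpy as np

def primal_dual(Y, X, V, active):
    """Recover primal rows B and dual rows Gamma given V and an active set (a sketch)."""
    n, p = X.shape
    r = V.shape[1]
    B = np.zeros((p, r))
    Gamma = np.zeros((p, r))
    XA = X[:, active]
    # Primal: least squares on the active columns, with target Y V (V has orthonormal columns)
    B[active, :] = np.linalg.pinv(XA.T @ XA) @ (XA.T @ Y @ V)
    # Dual: column-normalized correlation of the residual with each inactive predictor
    resid = Y - XA @ B[active, :] @ V.T
    inactive = np.setdiff1d(np.arange(p), active)
    for j in inactive:
        Gamma[j, :] = X[:, j] @ resid @ V / (X[:, j] @ X[:, j])
    return B, Gamma
```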

3 Algorithm

In this section, we develop a new estimation algorithm for solving the MrBeSS problem, in which both dimension reduction and variable selection are taken into account. We first solve the basic MrBeSS estimator with fixed parameters, and then provide suggested initialization and parameter tuning strategies. Based on the primal-dual active set updating scheme, the proposed algorithm has closed-form updates and is computationally efficient, which enables it to be applied to high-dimensional data.

3.1 MrBeSS with Fixed Parameters

Given the rank $r$ and the row-sparsity $s$, the decomposed optimization problem (6) is now with respect to $B$ and $V$. This motivates us to solve it in a block-wise fashion, i.e., optimizing one variable while fixing the other. Specifically, we solve the following two sub-problems at the $k$-th iteration:

$B^{(k+1)}=\arg\min_{\|B\|_{2,0}\le s}\ \|Y-XB(V^{(k)})^{T}\|_{F}^{2}, \qquad (11)$
$V^{(k+1)}=\arg\min_{V^{T}V=I_{r}}\ \|Y-XB^{(k+1)}V^{T}\|_{F}^{2}. \qquad (12)$

When $B$ is fixed, the sub-problem (12) is an orthogonal Procrustes problem and has an explicit solution (Schönemann, 1966). In particular, if we perform the singular value decomposition $Y^{T}XB^{(k+1)}=\tilde U D\tilde W^{T}$, where $\tilde U\in\mathbb{R}^{q\times r}$, $D\in\mathbb{R}^{r\times r}$, and $\tilde W\in\mathbb{R}^{r\times r}$, then the solution is given by $V^{(k+1)}=\tilde U\tilde W^{T}$. The solution is unique provided that the $r$-th singular value of $Y^{T}XB^{(k+1)}$ is nonzero.
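For completeness, a minimal sketch of this Procrustes update: the SVD-based solution below maximizes $\operatorname{tr}(V^{T}Y^{T}XB)$ over matrices with orthonormal columns. The helper name is ours.

```python
import numpy as np

def procrustes_V(Y, X, B):
    """Solve min_{V: V^T V = I} ||Y - X B V^T||_F^2 via the SVD of Y^T X B (a sketch)."""
    W = Y.T @ X @ B                       # q x r
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt                         # q x r with orthonormal columns
```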

Given $V^{(k)}$, the sub-problem (11) can be treated as a group subset selection problem with each group representing one row of $B$. As discussed in Section 2, to obtain the optimal variables it suffices to know the optimal active set. Once the estimated active set has been settled as $A$, we can update the variables analogously to (9) and (10), namely,

$B_{A\cdot}=(X_{\cdot A}^{T}X_{\cdot A})^{-1}X_{\cdot A}^{T}YV^{(k)}, \qquad \Gamma_{A\cdot}=0, \qquad (13)$
$B_{I\cdot}=0, \qquad \Gamma_{j\cdot}=X_{\cdot j}^{T}\big(Y-X_{\cdot A}B_{A\cdot}(V^{(k)})^{T}\big)V^{(k)}/\|X_{\cdot j}\|_{2}^{2}\ \text{for}\ j\in I. \qquad (14)$

Therefore, the main ingredient at each iteration is to estimate the active set. To determine which rows are nonzero candidates, we consider the difference in the objective function of (7) when switching the $j$-th row from its unconstrained optimum $B_{j\cdot}+\Gamma_{j\cdot}$ to zero, given by

$\Delta_{j}=\big\|X_{\cdot j}(B_{j\cdot}+\Gamma_{j\cdot})V^{T}\big\|_{F}^{2} \qquad (15)$
$\;\;=\|X_{\cdot j}\|_{2}^{2}\,\|B_{j\cdot}+\Gamma_{j\cdot}\|_{2}^{2}, \qquad (16)$

which is called the $j$-th sacrifice hereafter. Intuitively, we may prefer enforcing those rows with the least sacrifices to be all zeros. To realize this, let $\pi$, a permutation of $\{1,\dots,p\}$, be the rank statistic of $(\Delta_{1},\dots,\Delta_{p})$ in decreasing order (if there are ties, we randomly assign distinct ranks so that $\pi$ is still a one-to-one mapping), and truncate the ordered sacrifice vector at position $s$; namely, $A=\{\pi(1),\dots,\pi(s)\}$ and $I=A^{c}$. As the sacrifices calculated each time depend heavily on a possibly inaccurate primal-dual pair, we repeat the process of finding the primal-dual variables and the least-sacrifice active set. The above discussion is summarized in Algorithm 1 as follows.

  Algorithm 1 Multivariate Response Best Subset Selection (MrBeSS) with fixed $r$ and $s$
  Input: Response matrix $Y$, predictor matrix $X$, rank $r$ and sparsity $s$.
Output: $\hat B$, $\hat V$ and $\hat C=\hat B\hat V^{T}$.

  1. Set $k=0$ and initialize $V^{(0)}$. (See Algorithm 2 for a suggested initialization.)

  2. While the value of the objective function in (6) has not converged do

    1. Given $V^{(k)}$, update $B^{(k+1)}$ as follows

      1. Set $m=0$ and take the current primal matrix $B^{(k)}$ as the starting value, with $j$-th row $B_{j\cdot}$. Let $A_{0}$ be the index set of its nonzero rows.

      2. While not converged do

        1. Determine the sacrifices $\Delta_{j}$ by (15)-(16) for $j=1,\dots,p$.

        2. Compute the rank statistic of the sacrifices. Determine the active set $A_{m+1}$ as the $s$ rows with the largest sacrifices and the inactive set $I_{m+1}$ as its complement.

        3. Update the primal and dual variables by (13) and (14) with $A=A_{m+1}$.

        4. Set $m=m+1$.

      3. Set $B^{(k+1)}$ to the converged primal matrix and record its active set.

    2. Given $B^{(k+1)}$, update $V^{(k+1)}$ as follows

      1. Let $M=Y^{T}XB^{(k+1)}$ and perform the SVD of $M$, i.e., $M=\tilde U D\tilde W^{T}$;

      2. Determine $V^{(k+1)}$ by $V^{(k+1)}=\tilde U\tilde W^{T}$.

    3. Set $k=k+1$.

  3. Set $\hat B=B^{(k)}$, $\hat V=V^{(k)}$ and $\hat C=\hat B\hat V^{T}$.

 
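The following compact sketch puts the pieces together in the spirit of Algorithm 1, reusing the hypothetical `primal_dual` and `procrustes_V` helpers introduced above. The warm start, the convergence checks, and the scaling inside the sacrifice are our illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def mrbess_fixed(Y, X, r, s, V0, max_outer=50, max_inner=50, tol=1e-8):
    """MrBeSS with fixed rank r and row-sparsity s (a sketch of Algorithm 1)."""
    n, p = X.shape
    V = V0[:, :r]
    # warm start: the s predictors most correlated with Y V (an illustrative choice)
    scores = np.linalg.norm(X.T @ Y @ V, axis=1)
    active = np.argsort(scores)[::-1][:s]
    prev_obj = np.inf
    for _ in range(max_outer):
        # B-step: primal-dual active set iterations for the group subset problem (11)
        for _ in range(max_inner):
            B, Gamma = primal_dual(Y, X, V, active)
            sacrifice = np.sum(X**2, axis=0) * np.sum((B + Gamma)**2, axis=1)
            new_active = np.argsort(sacrifice)[::-1][:s]     # keep the s largest sacrifices
            if set(new_active) == set(active):
                break
            active = new_active
        B, _ = primal_dual(Y, X, V, active)
        # V-step: orthogonal Procrustes update, sub-problem (12)
        V = procrustes_V(Y, X, B)
        obj = np.linalg.norm(Y - X @ B @ V.T, 'fro') ** 2
        if abs(prev_obj - obj) < tol * max(1.0, abs(prev_obj)):
            break
        prev_obj = obj
    return B, V, active
```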

3.2 Identification and Initialization

We first study the noiseless case $E=0$, which motivates us to specify an identified optimal solution as well as the suggested initialization strategy used in Step 1 of Algorithm 1. Similar to $C$, we can also decompose the true coefficient matrix as $C^{*}=B^{*}V^{*T}$, where $B^{*}$ and $V^{*}$ are unique up to right-multiplication by an orthogonal matrix. We observe the following property, which motivates a unique choice of the decomposition.

Proposition 2 (Characterization of noiseless decomposition).

For any decomposition $C^{*}=B^{*}V^{*T}$, there exists an orthogonal matrix $Q\in\mathbb{R}^{r^{*}\times r^{*}}$ such that $\bar V=V^{*}Q$ consists of the ordered eigenvectors associated with the nonzero eigenvalues of $(XC^{*})^{T}P_{X}(XC^{*})$, that is,

$\bar V=\big(u_{1}\big((XC^{*})^{T}P_{X}(XC^{*})\big),\dots,u_{r^{*}}\big((XC^{*})^{T}P_{X}(XC^{*})\big)\big), \qquad (17)$

where $P_{X}=X(X^{T}X)^{+}X^{T}$ is the projection matrix onto the column space of $X$, with $(X^{T}X)^{+}$ the Moore-Penrose pseudo-inverse. Further, $\bar V$ is unique if the first $r^{*}$ eigenvalues of this matrix are distinct.

It should be noted that we fix a specific direction for each eigenvector, e.g., its first nonzero entry is positive. Hereafter we regard $\bar V$ in (17) and its corresponding $\bar B=C^{*}\bar V$ as the identified decomposition of $C^{*}$.

Now we turn back to the practical noisy case $E\ne 0$. We expect our estimated $V$ to be as close as possible to $\bar V$. An intuition based on (17) is to initialize $V$ as the ordered leading eigenvectors of $Y^{T}Y$ or $Y^{T}P_{X}Y$, that is,

$V^{(0)}=\big(u_{1}(Y^{T}Y),\dots,u_{r}(Y^{T}Y)\big), \qquad (18)$
$V^{(0)}=\big(u_{1}(Y^{T}P_{X}Y),\dots,u_{r}(Y^{T}P_{X}Y)\big). \qquad (19)$

Note that although $Y^{T}Y=Y^{T}P_{X}Y$ in the noiseless case, in general the two matrices differ, and we provide a guideline for choosing which matrix to use for the eigenvalue decomposition. When $n>p$ and $X$ has full column rank, the projection matrix $P_{X}$ compresses the error onto a smaller $p$-dimensional space while not disturbing the information from $XC^{*}$. Therefore, we prefer applying (19) in the low-dimensional case. On the other hand, when $p\ge n$, $P_{X}$ is nearly the identity matrix, and is exactly the identity matrix when $X$ has full row rank. Thus $P_{X}$ loses its power of error compression yet burdens the computation. In this high-dimensional situation, we seek $V^{(0)}$ via (18). Our discussion above leads to the following algorithm.

  Algorithm 2 Initialization of $V^{(0)}$
  Input: Response matrix $Y$, predictor matrix $X$ and rank $r$.
Output: Initial matrix $V^{(0)}$.

  1. If $n>p$, compute $P_{X}=X(X^{T}X)^{+}X^{T}$ and let $M=Y^{T}P_{X}Y$; otherwise, let $M=Y^{T}Y$.

  2. Perform the eigen-decomposition of the matrix $M$.

  3. Compute $V^{(0)}=\big(u_{1}(M),\dots,u_{r}(M)\big)$.

 
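A sketch of this initialization, following the guideline above (projection-based when $n>p$, plain Gram matrix otherwise); the helper name `init_V` is ours.

```python
import numpy as np

def init_V(Y, X, r):
    """Initialize V as the leading eigenvectors of Y^T P_X Y (n > p) or Y^T Y (a sketch)."""
    n, p = X.shape
    if n > p:
        P = X @ np.linalg.pinv(X.T @ X) @ X.T     # projection onto the column space of X
        M = Y.T @ P @ Y
    else:
        M = Y.T @ Y
    eigvals, eigvecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :r]                # leading r eigenvectors
```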

3.3 Adaptive Parameter Tuning

There are two tuning parameters in problem (3) or (6), namely, the rank $r$ and the row-sparsity $s$. Given a grid of candidate values, we can select the pair that leads to the smallest average prediction error under $K$-fold cross validation (Friedman et al., 2001). Alternatively, various information criteria can be used due to their computational efficiency. Here we propose a novel generalized information criterion (GIC) defined by

(20)

where the first term is the normalized loss of the MrBeSS estimator of $C$ computed with fixed parameters $(r,s)$. When the response is univariate, i.e., $q=1$, the proposed GIC (20) reduces to the GIC studied in Fan and Tang (2013). It is shown in their paper (Fan and Tang, 2013, Corollary 1) that the corresponding penalty component is a valid choice for identifying the true model consistently.

The GIC trades off loss minimization against overfitting, and a smaller GIC represents a more balanced choice. Therefore, a naive approach is to minimize the GIC over all candidate parameter values. For each pair of tuning parameters $r$ and $s$, the MrBeSS estimator is computed via Algorithm 1 and the corresponding GIC value is calculated. However, a simultaneous search for the optimal tuning parameters over a two-dimensional grid is computationally expensive. To reduce the computational burden, we introduce a simplified yet efficient search strategy. Specifically, we first identify the GIC-minimal choice of sparsity over a sequence of values with the rank set large enough; then, based on the active set detected for that choice, we determine the GIC-minimal rank by fitting the RRR model to the restricted data $(Y, X_{\cdot\hat A})$, where $X_{\cdot\hat A}$ denotes the sub-matrix of $X$ including only the columns indexed by $\hat A$. The algorithm with the simplified search strategy is summarized in Algorithm 3. In this way, we obtain the tuning result after running Algorithm 1 $s_{\max}$ times and running RRR $r_{\max}$ times, rather than running the time-demanding Algorithm 1 $s_{\max}\times r_{\max}$ times, where $r_{\max}$ and $s_{\max}$ denote the maximum candidate rank and sparsity.

  Algorithm 3 Multivariate Response Best Subset Selection (MrBeSS)
  Input: Response matrix $Y$, predictor matrix $X$, maximum rank $r_{\max}$ and maximum sparsity $s_{\max}$.
Output: $\hat C$.

  1. For $s=1,\dots,s_{\max}$, do

    1. Run Algorithm 1 with rank $r_{\max}$ and sparsity $s$. Denote the output by $\hat C_{s}$ and its active set by $\hat A_{s}$.

    2. Compute the GIC value of $\hat C_{s}$.

  2. Find the sparsity $\hat s$ and its corresponding active set $\hat A_{\hat s}$ with the minimal GIC value in Step 1.

  3. For $r=1,\dots,r_{\max}$, do

    1. Fit the RRR model with rank $r$ to the data $(Y, X_{\cdot\hat A_{\hat s}})$ and denote the estimated coefficient matrix by $\hat C^{(r)}$.

    2. Compute the GIC value of $\hat C^{(r)}$.

  4. Find the optimal rank $\hat r$ with the minimal GIC value in Step 3.

  5. Determine the final estimated coefficient matrix $\hat C$ by $\hat C_{\hat A_{\hat s}\cdot}=\hat C^{(\hat r)}$ and $\hat C_{\hat A_{\hat s}^{c}\cdot}=0$.

 
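The two-stage search of Algorithm 3 can be sketched as follows. The GIC used here is a stand-in: the log of the normalized loss plus a complexity term scaled by $\log(\log n)\log p$, with `df` an assumed complexity measure; the paper's exact penalty may differ, so treat this as illustrative only. It reuses the hypothetical `mrbess_fixed`, `init_V`, and `rrr` helpers above.

```python
import numpy as np

def gic(Y, X, C, df):
    """A stand-in GIC: log normalized loss + df * log(log n) * log(p) / n."""
    n, p = X.shape
    q = Y.shape[1]
    loss = np.linalg.norm(Y - X @ C, 'fro') ** 2 / (n * q)
    return np.log(loss) + df * np.log(np.log(n)) * np.log(p) / n

def mrbess_tuned(Y, X, r_max, s_max):
    """Two-stage tuning in the spirit of Algorithm 3 (a sketch with an assumed GIC)."""
    n, p = X.shape
    q = Y.shape[1]
    # Stage 1: choose the sparsity with the rank fixed at r_max
    best = None
    for s in range(1, s_max + 1):
        B, V, active = mrbess_fixed(Y, X, r_max, s, init_V(Y, X, r_max))
        crit = gic(Y, X, B @ V.T, df=s * r_max)        # assumed complexity measure
        if best is None or crit < best[0]:
            best = (crit, s, active)
    _, s_hat, active = best
    # Stage 2: choose the rank by fitting RRR on the selected columns only
    best_r = None
    for r in range(1, r_max + 1):
        C_sub = rrr(Y, X[:, active], r)
        C = np.zeros((p, q))
        C[active, :] = C_sub
        crit = gic(Y, X, C, df=s_hat * r)
        if best_r is None or crit < best_r[0]:
            best_r = (crit, C)
    return best_r[1]
```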

4 Theoretical properties

In this section we present the theoretical analysis for the outputs of Algorithm 1 and Algorithm 3. The proofs of the results are provided in the Supplementary Material.

4.1 Preliminaries

Define the active set and the inactive set of the true underlying coefficient matrix $C^{*}$ as $A^{*}=\{j:C^{*}_{j\cdot}\ne 0\}$ and $I^{*}=(A^{*})^{c}$, respectively. We prepare the following technical conditions for the theoretical analysis.

  1. (Dimensions) As $n\to\infty$, the number of responses $q$ is fixed, while the dimension $p$ and the sparsity level are allowed to grow with $n$ at controlled rates.

  2. (Restricted Isometry Predictor) The predictor matrix $X$ is nearly orthogonal in the sense that there exist constants, independent of $n$, such that

    (21)
    (22)
  3. (Minimal-separated Response) Let $\lambda_{k}$ denote the $k$-th largest eigenvalue of the matrix in (17), $k=1,\dots,q$. Then the nonzero eigenvalues are separated as

    for some constant .

  4. (Response Intensity) for all .

  5. (Noise) The noise vectors $e_{1},\dots,e_{n}$ are independent $\sigma$-sub-Gaussian random vectors for some constant $\sigma>0$; that is, every linear marginal $a^{T}e_{i}$ with $\|a\|_{2}=1$ has sub-Gaussian norm at most $\sigma$ (see Vershynin (2018) for more details about the definition of sub-Gaussian distributions).

  6. (One-sided Stronger Restricted Isometry Predictor) There exist constants, independent of $n$, under which the inequalities of (C2) hold in a strengthened form on the lower-bound side.

Remark 1.

We now give some intuitive interpretations of the conditions.

  1. The order of $p$ in (C1) is a common high-dimensional setting, especially suitable for genetic datasets. The fixed-$q$ condition comes from the reality that increasing the number of experiments will not increase the number of output variables. We further impose an upper bound on the sparsity level for our scheme to work; this requirement is rather weak since the bound grows to infinity with $n$.

  2. From Weyl's theorem on eigenvalues, (21) in (C2) is equivalent to

    (23)

    which represents that any Gram matrix formed from a small subset of the columns of $X$ is almost an identity matrix. (22) further upper-bounds any sparse off-diagonal block of the Gram matrix, which means that any two distinct small subsets of variables in $X$ are designed to be mutually nearly independent. In summary, (C2) states that the variables in $X$ are almost isometric and uncorrelated. These two conditions can be regarded as weaker versions of the restricted isometry and restricted orthogonality conditions discussed in Candes and Tao (2005) and Geer and Bühlmann (2009).

  3. (C3) assumes that any two eigenvalues of the matrix in (17) are well distinguished from each other; then, by Proposition 2, $\bar V$ is uniquely determined, and its columns will not change their order under a minute disturbance.

  4. As $X$ is an $n$-by-$p$ matrix whose columns are normalized, it is natural to calibrate the signal strength with $n$. From (C2), the correlation between variables is weak, so the eigenvalues in (C3) can be regarded as an aggregation of information from the relevant predictors. Then (C4) describes how this information is distributed across the eigenvalues: each of them consists of a non-negligible share of the whole information, which gives (C4).

  5. (C6) makes stronger assumptions than the two equations in (C2) on the lower-bound side.

Proposition 3.

If (C2) and (C4) are satisfied, then we have for some constant .

We study a special case to show that the conditions are easy to satisfy.

Proposition 4 (Gaussian case).

Suppose each column of $X$ is independently and identically distributed (i.i.d.) according to the uniform distribution over the $n$-dimensional sphere of a fixed radius. Second, $C^{*}$ is specifically designed. Third, the noise vectors $e_{1},\dots,e_{n}$ are i.i.d. Gaussian. If the dimension condition (C1) holds, then for any fixed choice of the constants in the remaining conditions and any small probability $\delta>0$, there exists a large enough $n$ such that all the conditions hold with probability at least $1-\delta$.

4.2 Main results

We assume without loss of generality that our calculated eigenvectors always take the correct direction, so that the angles between the estimated and population vectors are no more than 90 degrees. Mathematically speaking, if $\hat v$ is an estimate of an eigenvector $v$, both computed via eigenvalue decomposition, then $\langle\hat v, v\rangle\ge 0$. We now present the error bounds for the estimator obtained via Algorithm 1.

Theorem 1 (Error bounds).

Denote by $(\hat B,\hat V)$ the output of Algorithm 1 with correctly specified row-sparsity and a rank no smaller than the true rank. Suppose that conditions (C1)-(C5) hold. Then, with high probability, we have

(24)

for certain constants. Consequently, when the above high-probability event happens,

(25)

where $\hat B$ and $\hat V$ are the estimators of $B^{*}$ and $V^{*}$, respectively.

Note that , which is bounded under (C1). Second, if the active set is not correctly chosen in , it can be shown that , whose proof is similar to Proposition 3. Then we have the following corollary.

Corollary 1.

With the settings in Theorem 1 and one additional condition, with high probability and for sufficiently large $n$, Algorithm 1 will find the true active set after a finite number of iterations that is bounded in terms of the model constants. Furthermore, the estimation error bounds become

(26)

Theorem 1 states that Algorithm 1 successfully approaches the true coefficient matrix when the row-sparsity is correctly chosen and the rank constraint is not too strict. In the next theorem, we demonstrate that Algorithm 3 consistently recovers the true rank $r^{*}$ and row-sparsity $s^{*}$.

Theorem 2.

Denote by $\hat C$ the output of Algorithm 3. Assume that conditions (C1)-(C6) hold. Then, with high probability and for sufficiently large $n$, Algorithm 3 selects the true rank and the true active set, i.e., $\hat r=r^{*}$ and $\hat A=A^{*}$.

We now present the error bounds for the estimator obtained via Algorithm 3.

Corollary 2.

Assume that conditions (C1)-(C6) hold. Then, with high probability and for sufficiently large $n$, Algorithm 3 finds the true active set, with the estimation error bounds in (24) and (25). If the conditions of Corollary 1 are further satisfied, then the error bounds become (26).

5 Simulation study