Suppose we observe the centered\footnote{Both $X$ and $Y$ are centered for simplicity; otherwise, one can include an intercept term in the model.} dataset $\{(x_i, y_i)\}_{i=1}^n$ with predictors $x_i \in \mathbb{R}^p$ and responses $y_i \in \mathbb{R}^q$, where $y_i$ is generated via the multivariate response linear regression model
$$ y_i = C^{*\top} x_i + \epsilon_i, \qquad i = 1, \ldots, n, $$
where $C^* \in \mathbb{R}^{p \times q}$ is the underlying yet unknown coefficient matrix and $\epsilon_i \in \mathbb{R}^q$ is a zero-mean noise vector. In the literature, $y_i$ represents the response to the predictor $x_i$ through the structure $C^*$ that we aim to recover. Equivalently, we can write the model in matrix form as
$$ Y = X C^* + E, \qquad (1) $$
where $Y \in \mathbb{R}^{n \times q}$, $X \in \mathbb{R}^{n \times p}$, and $E \in \mathbb{R}^{n \times q}$ concatenate the responses, predictors, and noises, respectively. We further assume that each column of $X$, namely each variable with $n$ samples, is normalized to a common $\ell_2$-norm.
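As a concrete illustration of this preprocessing, the centering and column-wise normalization might be sketched as follows (a minimal NumPy sketch; the exact normalizing constant is not reproduced here, so each column of $X$ is scaled to unit Euclidean norm as one common convention):

```python
import numpy as np

def preprocess(X, Y):
    """Center Y and X column-wise, then scale each column of X
    to unit Euclidean norm (one common normalization convention)."""
    Xc = X - X.mean(axis=0)          # center predictors
    Yc = Y - Y.mean(axis=0)          # center responses
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0          # guard against constant columns
    return Xc / norms, Yc
```

With this convention, inner products between columns of $X$ are correlations up to scaling, which simplifies the screening quantities used later.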
Recently, as high dimensional data become increasingly common, model (1) calls for (i) a simple structure of the coefficient matrix $C^*$, and (ii) a small set of important variables affecting the responses most. Reduced rank regression (RRR) (Anderson, 1951; Izenman, 2008), which restricts the coefficient matrix to be low rank, provides a simple yet interpretable underlying structure, and has been widely used in econometrics and genetics; see, for instance, Velu and Reinsel (1998), Vounou et al. (2010), and Ma et al. (2014). It works by solving the rank-constrained least squares problem
$$ \min_{C} \|Y - XC\|_F^2 \quad \text{subject to } \mathrm{rank}(C) \le r, \qquad (2) $$
where $\|\cdot\|_F$ is the Frobenius norm, $\mathrm{rank}(C)$ denotes the rank of $C$, and $r$ is a positive integer. Meanwhile, as the number of variables is relatively large in high dimensional data, we aim at detecting a small (or sparse) subset of important predictors and thus obtaining a parsimonious fit to the data in hand. For example, in genomics, researchers are interested in finding out which micro RNAs play a key role in the regulatory relationship between cancer related gene expressions and hundreds of micro RNAs (Ma et al., 2014).
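For readers who want to experiment, the rank-constrained least squares problem admits the classical closed-form solution: fit ordinary least squares, then project the fitted values onto their top-$r$ singular subspace. The sketch below is an illustration of that well-known construction, not the paper's code:

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Classical RRR: OLS fit, then project fitted values onto the
    subspace spanned by the top-r right singular vectors of X @ C_ols."""
    C_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)   # unconstrained fit
    _, _, Vt = np.linalg.svd(X @ C_ols, full_matrices=False)
    Vr = Vt[:r].T                                   # top-r directions in response space
    return C_ols @ Vr @ Vr.T                        # rank-<=r coefficient estimate
```

In the noiseless case with a full column rank design and a true rank-$r$ coefficient matrix, this recovers the coefficients exactly.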
To meet these two demands, we mainly focus on the scenario in which $C^*$ is low-rank and row-sparse. Specifically, the rank $r^* = \mathrm{rank}(C^*)$ and the number of nonzero rows $s^* = \|C^*\|_{2,0}$ are both small relative to $\min(p, q)$ and $p$, respectively. In order to recover $C^*$, we consider the following constrained least squares problem
$$ \min_{C} \|Y - XC\|_F^2 \quad \text{subject to } \mathrm{rank}(C) \le r, \ \|C\|_{2,0} \le s, \qquad (3) $$
where $r$ and $s$ are pre-specified rank and row-sparsity levels. The minimizer of (3), denoted $\hat C$, is called the multivariate response best subset selection (MrBeSS) estimator with rank $r$ and sparsity $s$.
The only difference between the MrBeSS minimization (3) and the RRR minimization (2) is the row-sparsity constraint. Thus a direct approach to the MrBeSS estimator is to select $s$ candidates out of the $p$ rows of the coefficient matrix that may be nonzero, and then apply the RRR method to the selected data, which includes the $j$-th column of $X$ if and only if the $j$-th row is a candidate. However, exhaustively searching over all possible choices of candidates and performing RRR each time is NP-hard and impractical.
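To get a feel for the scale of the exhaustive search, one can count the candidate row subsets directly; the dimensions below are hypothetical illustrations:

```python
from math import comb

# Number of size-s row subsets an exhaustive search must visit,
# each requiring a full RRR fit.
for p, s in [(20, 5), (100, 10), (1000, 10)]:
    print(f"p={p}, s={s}: {comb(p, s)} subsets")
```

Already at $p = 100$ and $s = 10$ the count exceeds $10^{13}$, so enumeration is hopeless at genomic scales.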
To overcome the computational difficulties caused by the non-convex constraint that exactly yields the best subset selection result, computationally friendlier optimization based regularization methods have been proposed as surrogates for problem (3), among which convex relaxation is a popular approach. For example, the non-convex constraint $\|C\|_{2,0} \le s$ is replaced by a constraint on the convex norm $\|C\|_{2,1} = \sum_{j=1}^p \|C_{j\cdot}\|_2$, the sum of the row-wise $\ell_2$-norms. Replacing the sparsity constraint by a Lagrangian term then gives the penalized optimization problem
$$ \min_{C} \|Y - XC\|_F^2 + \lambda \sum_{j=1}^p \|C_{j\cdot}\|_2 \quad \text{subject to } \mathrm{rank}(C) \le r, \qquad (4) $$
with penalty parameter $\lambda > 0$. The minimizer of (4) is called the rank constrained group Lasso (RCGL) in Bunea et al. (2012), since the penalty term can be viewed as a group Lasso penalty (Yuan and Lin, 2006) by vectorizing $C$ and treating entries in the same row as a group. Consequently, several adaptively weighted variants of the group Lasso (Chen and Huang, 2012; She, 2017) can also be applied to select important predictors in RRR. For example, instead of assigning equal weights to all the row-norms, the SRRR estimator in Chen and Huang (2012) minimizes the variant of (4) with row-wise weights determined by an adaptive weighting strategy (Zou, 2006; Wang and Leng, 2008). Besides, rather than directly imposing group sparsity on $C$, Chen et al. (2012) and Mishra et al. (2017) suggested decomposing the coefficient matrix
via singular value decomposition and promoting group sparsity in the corresponding singular vectors. Similarly, Ma et al. (2014) applied a hard thresholding operator to the singular vectors in order to eliminate singular vectors with small singular values.
In spite of their favorable computational properties, regularization methods with continuous penalty relaxation have several shortcomings. The gap between the relaxed penalty and the true one is a main concern addressed by the joint rank and row selection (JRRS) estimator proposed by Bunea et al. (2012). The JRRS estimator provides an adaptive correction, which acts analogously to information criteria such as AIC (Akaike, 1974) and BIC (Schwarz, 1978) in the case of univariate response regression. Specifically, with some candidate values of the tuning parameter, the authors consider solving the following regularized optimization problem
with a penalty parameter to be specified. Nonetheless, the approach requires complete enumeration of the model space and should be used together with an efficient subset selection procedure. Moreover, the penalty parameter is hard to determine beforehand, which limits the effectiveness of the JRRS estimator.
Meanwhile, on the theoretical side, certain sufficient regularity conditions on the data must be satisfied to guarantee that a relaxed model achieves good predictive accuracy; see, for example, the restricted eigenvalue condition (Bunea et al., 2012) or the incoherence condition (She, 2017). When these conditions are violated, the above group Lasso based methods become suboptimal and might bring in a large number of irrelevant rows, including noise predictors. The adaptively weighted version of the group Lasso penalty might help to improve the prediction performance, yet it only works for a fixed number of predictors and has a tendency to under-select relevant variables in higher dimensions (Chen and Huang, 2012). In contrast, group $\ell_0$ penalty based methods are shown to achieve the optimal rate for prediction error under no restrictions on the design matrix (Bunea et al., 2012; She, 2017). Recently, She and Chen (2017) showed that the performance of the group Lasso penalty is substantially worse and less stable than that of the nonconvex group $\ell_0$ penalty for outlier detection in RRR. Therefore, to achieve better subset selection and parameter estimation in RRR, it is desirable to design an efficient algorithm directly based on the group $\ell_0$ penalty.
In this paper, we propose a novel framework through which the best subset selection problem (3) can be directly solved within a reasonable time frame. Motivated by the primal-dual formulation of the optimizer, which is the main ingredient of our proposal, we develop a new computationally efficient algorithm with guaranteed convergence. Note that the special case of univariate response regression has been thoroughly studied in Wen et al. (2017), and we provide a more general point of view here. Theoretically, we demonstrate that our proposed MrBeSS algorithm successfully recovers the true model with acceptable accuracy. Further, we show that MrBeSS significantly outperforms state-of-the-art methods numerically under a variety of simulation settings. We also study how efficiently and stably the new estimator works on real-world data in an example of micro RNA and gene expression association data.
The rest of the paper is organized as follows. Section 2 presents the primal dual characterization of the best subset selection model. We mainly develop our algorithms in Section 3, including solving a MrBeSS estimator with fixed parameters, an initialization strategy, and an adaptive parameter tuning algorithm. In Section 4, we analyze the theoretical performance of the proposed algorithms. The competitive numerical performance of our proposal is demonstrated in Section 5 through both simulated and real data examples. Section 6 discusses interesting related problems for future research. The detailed proofs of the theorems are presented in the Supplementary Material.
Conventionally, a vector is expressed in column form. For a matrix $A$, we write $A_{j\cdot}$ for its $j$-th row and $A_{\cdot k}$ for its $k$-th column; more generally, sub-matrices are formed by concatenating the corresponding rows (resp. columns) over an index set. The rank of $A$ is denoted $\mathrm{rank}(A)$, while its trace is denoted $\mathrm{tr}(A)$. For a symmetric matrix, let $\lambda_k$ and $u_k$ be its $k$-th largest eigenvalue and the corresponding eigenvector, respectively. As an eigenvector has two options in opposite directions, we will specify one direction case by case. Let $\|A\|_F$ and $\|A\|_2$ be the Frobenius norm and $\ell_2$-norm of $A$, respectively. Define $\|A\|_{2,0}$ as the number of nonzero rows of $A$, which can be convexly relaxed to the $\ell_{2,1}$-norm $\|A\|_{2,1}$, the sum of the row-wise $\ell_2$-norms.
As our theoretical results focus on the regime where the number of samples $n$ is sufficiently large, we use the following notation to compare the growth order of sequences. For any two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ or $a_n \asymp b_n$ if $a_n$ grows no faster than, or in the same order as, $b_n$, respectively. That is,
$$ a_n \lesssim b_n \iff \limsup_{n \to \infty} a_n / b_n < \infty, \qquad a_n \asymp b_n \iff a_n \lesssim b_n \text{ and } b_n \lesssim a_n. $$
Further, we analogously write $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, $a_n = o(b_n)$ if $a_n / b_n \to 0$, and $a_n = \omega(b_n)$ if $b_n = o(a_n)$.
2 Primal Dual Formulation
We decompose $C$ into the product of two matrices, i.e., $C = BV^\top$, with $B \in \mathbb{R}^{p \times r}$ and $V \in \mathbb{R}^{q \times r}$ satisfying $V^\top V = I_r$. Then (3) can be rewritten as
$$ \min_{B, V} \|Y - XBV^\top\|_F^2 \quad \text{subject to } \|B\|_{2,0} \le s, \ V^\top V = I_r. \qquad (6) $$
The two restrictions are thereby separated, which may help reduce the computational difficulty. However, the solution to the optimization problem (6) is not unique. For example, suppose $(B, V)$ is a solution of (6); then $(BQ, VQ)$ is also a solution for any orthogonal matrix $Q \in \mathbb{R}^{r \times r}$, since $\|BQ\|_{2,0} = \|B\|_{2,0}$ and $(VQ)^\top (VQ) = I_r$. Nonetheless, $BQ(VQ)^\top = BV^\top$ shows that the two solutions give the same estimate of $C$.
Next we characterize the primal-dual condition that motivates our MrBeSS algorithm, where $(\hat B, \hat V)$ is a minimizer of (6). For the $j$-th row, we consider the unconstrained minimization problem given that the other variables are at their optimal values, namely,
The explicit solution of (7) can be found in closed form, where $X_{\cdot j}$ denotes the $j$-th column of $X$.
To figure out the connection between the primal and dual variables, note that the row-sparsity constraint in (6) forces $p - s$ rows of $B$ to be zero vectors. If a row is not enforced to be zero, then its row-wise optimality yields the corresponding stationarity identities. From the above discussion, we observe the primal-dual condition of the optimal point.
Proposition 1 (Primal-dual condition).
If $(\hat B, \hat V)$ is a minimizer of (6) and $\mathcal{A}$ denotes the index set of the nonzero rows of $\hat B$, then, row-wise, the primal variable and its corresponding normalized residual satisfy the following primal-dual condition:
Consequently, we call the two quantities the $j$-th primal variable and dual variable, respectively.
For the decomposed MrBeSS problem (6), define the active set $\mathcal{A}$ as the index set of the nonzero rows of $B$, i.e., $\mathcal{A} = \{j : B_{j\cdot} \neq 0\}$, and the inactive set $\mathcal{I}$ as the complement of $\mathcal{A}$. Given $V$, the active set, and the corresponding part of the minimizer, we can recover the primal variables as well as the dual variables according to (8). Specifically,
Since the support of the primal variables is complementary to that of the dual variables, we directly set
This computation is the key to designing a scalable MrBeSS algorithm in the next section.
In this section, we develop a new estimation algorithm for solving the MrBeSS problem, in which both dimension reduction and variable selection are taken into account. We first solve the basic parameter-fixed MrBeSS estimator, and then provide suggested initialization and parameter tuning strategies. Based on the primal-dual active set updating scheme, the proposed algorithm has closed-form updates and is computationally efficient, which enables it to be applied to high dimensional data.
3.1 MrBeSS with Fixed Parameters
Given the rank $r$ and the row-sparsity $s$, the decomposed optimization problem (6) is now with respect to $B$ and $V$. This motivates us to solve it by block-wise iteration, i.e., optimizing one variable while fixing the other. Specifically, we solve the following two sub-problems at the $t$-th iteration:
When $B$ is fixed, the sub-problem (12) is an orthogonal Procrustes problem and has an explicit solution (Schönemann, 1966). In particular, if we perform the singular value decomposition $Y^\top X B = U D W^\top$, where $U \in \mathbb{R}^{q \times r}$, $D \in \mathbb{R}^{r \times r}$, and $W \in \mathbb{R}^{r \times r}$, then the solution is given by $V = U W^\top$. The solution is unique provided the $r$-th singular value of $Y^\top X B$ is nonzero.
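The Procrustes update can be sketched as follows; the shapes assumed here ($B \in \mathbb{R}^{p \times r}$ fixed, $V \in \mathbb{R}^{q \times r}$ with orthonormal columns) follow the decomposition in Section 2, and the closed form comes from the thin SVD of $Y^\top X B$:

```python
import numpy as np

def update_V(X, Y, B):
    """Orthogonal Procrustes step: argmin_V ||Y - X B V^T||_F
    subject to V^T V = I, via the thin SVD of M = Y^T X B."""
    M = Y.T @ X @ B                       # q x r
    U, _, Wt = np.linalg.svd(M, full_matrices=False)
    return U @ Wt                         # q x r, orthonormal columns
```

By the Procrustes optimality, the returned $V$ attains an objective value no worse than any other orthonormal candidate.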
Given $V$, the sub-problem (11) can be treated as a group subset selection problem, with each group representing one row of $B$. As discussed in Section 2, to obtain the optimal variables, it suffices to know the optimal active set. Once the estimated active set has been determined, we can update the variables analogously to (9) and (10), namely,
Therefore, the main ingredient of each iteration is the estimation of the active set. To determine which rows are nonzero candidates, we consider the difference in the objective function of (7) when switching the $j$-th row to the zero vector, as given by
which is called the $j$-th sacrifice hereafter. Intuitively, we prefer forcing the rows with the least sacrifices to zero. To realize this, let the rank statistic\footnote{If there are ties, we randomly assign distinct ranks so that the mapping is still one-to-one.} of the sacrifices be a permutation of $\{1, \ldots, p\}$, and truncate the ordered sacrifice vector at position $s$: the $s$ rows with the largest sacrifices form the active set, and the remaining rows form the inactive set. As the sacrifices computed at each step depend on the possibly inaccurate primal-dual pair, we iterate between updating the primal-dual variables and the least-sacrifice active set. The above discussion is summarized in Algorithm 1 as follows.
Algorithm 1 Multivariate Response Best Subset Selection (MrBeSS) with fixed and
Input: Response matrix , predictor matrix , rank and sparsity .
Set $t = 0$ and initialize $V^{(0)}$. (See Algorithm 2 for a suggested initialization.)
While the value of the objective function in (6) has not converged, do
Given , update as follows
Let and perform SVD on , i.e., ;
Determine by .
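A rough, self-contained sketch of this alternating scheme is given below. It is not the paper's exact procedure: the sacrifice-based active set update is replaced by a simple row-norm screening of $X^\top Y V$ followed by a restricted least squares refit, which behaves similarly under a near-orthogonal design; the function name and update rules are illustrative assumptions.

```python
import numpy as np

def mrbess_fixed(X, Y, r, s, n_iter=50):
    """Simplified alternating sketch for fixed rank r and sparsity s.
    B-step: keep the s rows with the largest screening scores, refit
    least squares on the active columns (a stand-in for the paper's
    sacrifice-based update).  V-step: orthogonal Procrustes."""
    n, p = X.shape
    # initialize V with the top-r eigenvectors of Y^T Y
    _, V = np.linalg.eigh(Y.T @ Y)
    V = V[:, ::-1][:, :r]
    B = np.zeros((p, r))
    for _ in range(n_iter):
        # B-step: screen rows by the norms of X^T Y V, then refit
        G = X.T @ (Y @ V)                              # p x r score matrix
        active = np.argsort(-np.linalg.norm(G, axis=1))[:s]
        B = np.zeros((p, r))
        B[active], *_ = np.linalg.lstsq(X[:, active], Y @ V, rcond=None)
        # V-step: orthogonal Procrustes given B
        U, _, Wt = np.linalg.svd(Y.T @ X @ B, full_matrices=False)
        V = U @ Wt
    return B, V, np.sort(active)
```

In a noiseless experiment with a strong row-sparse, low-rank signal, this sketch recovers the active set and drives the residual to numerical zero, illustrating the fixed-point behavior of the alternating updates.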
3.2 Identification and Initialization
We first study the noiseless case $E = 0$, which motivates both the identified optimal solution and the suggested initialization strategy used in Step 1 of Algorithm 1. Similar to the decomposition of $C$, we can also decompose $C^*$ as $C^* = B^* V^{*\top}$, where $B^*$ and $V^*$ are unique up to right-multiplication by an orthogonal matrix. We observe the following property, which motivates the unique decomposition.
Proposition 2 (Characterization of noiseless decomposition).
For any decomposition of $C^*$, there exists an orthogonal matrix such that the rotated factor consists of the ordered eigenvectors associated with the nonzero eigenvalues, that is,
where the projection matrix onto the column space of $X$ is formed with the Moore-Penrose pseudo-inverse. Further, the factor is unique if the leading eigenvalues are distinct.
It should be noted that we take a specific direction for each eigenvector, e.g., its first nonzero entry is positive. Hereafter we regard the factor in (17) and its corresponding counterpart as the identified decomposition of $C^*$.
Now we turn back to the practical noisy case $E \neq 0$. We expect our estimate to be as close as possible to the truth. An intuition based on (17) is to initialize $V$ as the ordered leading eigenvectors of one of two candidate matrices, that is,
Note that although the two candidate matrices coincide in the noiseless case, in general they differ, and we provide a guideline for choosing which matrix to eigen-decompose. When $p < n$ and $X$ has full column rank, the projection matrix compresses the error onto a smaller $p$-dimensional space while not disturbing the information from the signal. Therefore, we prefer applying (19) in the low-dimensional case. On the other hand, when $p \ge n$, the projection matrix is nearly the identity, and is exactly the identity matrix in the full rank case. Thus it loses its power of error compression yet burdens the computation. In this high-dimensional situation, we use (18). The above discussion leads to the following algorithm.
Algorithm 2 Initialization of
Input: Response matrix , predictor matrix and rank .
Output: Initial matrix .
If $p < n$, compute the projection matrix and use the matrix in (19); otherwise, use the matrix in (18).
Perform an eigen-decomposition of the chosen matrix.
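A sketch of the initialization logic, under the plausible reading that the low-dimensional branch eigen-decomposes $Y^\top P_X Y$ (with $P_X$ the projection onto the column space of $X$) and the high-dimensional branch uses $Y^\top Y$; the sign-fixing convention follows the text:

```python
import numpy as np

def init_V(X, Y, r):
    """Initialize V as the top-r eigenvectors of Y^T P_X Y (when p < n,
    so the projection compresses the noise) or of Y^T Y (when p >= n,
    where the projection is close to the identity anyway).  Signs are
    fixed so the first nonzero entry of each eigenvector is positive."""
    n, p = X.shape
    if p < n:
        P = X @ np.linalg.pinv(X)             # projection onto col space of X
        M = Y.T @ P @ Y
    else:
        M = Y.T @ Y
    w, vecs = np.linalg.eigh(M)
    V = vecs[:, np.argsort(w)[::-1][:r]]      # top-r eigenvectors
    for k in range(V.shape[1]):               # fix one direction per column
        j = np.flatnonzero(np.abs(V[:, k]) > 1e-12)[0]
        if V[j, k] < 0:
            V[:, k] = -V[:, k]
    return V
```

In the noiseless case, the columns of the returned $V$ span the row space of the true coefficient matrix, matching the characterization in Proposition 2.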
3.3 Adaptive Parameter Tuning
There are two tuning parameters in problem (3) or (6), namely the rank $r$ and the row-sparsity $s$. Given a grid of candidates, we can select the pair that leads to the smallest average prediction error under $K$-fold cross validation (Friedman et al., 2001). Alternatively, various information criteria can be used due to their computational efficiency. Here we propose a novel generalized information criterion (GIC) defined by
where the first term is the normalized loss of the MrBeSS estimator with fixed parameters $(r, s)$. When the response is univariate, i.e., $q = 1$, our proposed GIC (20) reduces to the GIC studied in Fan and Tang (2013). It is shown there (Corollary 1 of Fan and Tang (2013)) that this penalty component is a valid choice for identifying the true model consistently.
GIC trades off loss minimization against overfitting, and a smaller GIC represents a more balanced choice. Therefore, a naive approach is to minimize GIC over all candidate parameter values. For each pair of tuning parameters $(r, s)$, the MrBeSS estimator is computed via Algorithm 1 and the corresponding GIC value is calculated. However, a simultaneous search for the optimal tuning parameters over a two-dimensional grid is computationally expensive. To reduce the computational burden, we introduce a simplified yet efficient search strategy. Specifically, we first identify a GIC-minimal choice of sparsity over a sequence of values with the rank set large enough; then, based on the detected active set, we determine the GIC-minimal rank by fitting the RRR model to the restricted data that only includes the active columns of $X$. The algorithm with the simplified search strategy is summarized in Algorithm 3. In this way, we obtain the result of parameter tuning after running Algorithm 1 once per sparsity candidate and running RRR once per rank candidate, rather than running the time-demanding Algorithm 1 over the whole two-dimensional grid.
Algorithm 3 Multivariate Response Best Subset Selection (MrBeSS)
Input: Response matrix, predictor matrix, maximum rank, and maximum sparsity.
For , do
Run Algorithm 1 with and . Denote the output by and .
Compute the GIC value .
Find and its corresponding with the minimal GIC value, i.e.,
For , do
Fit RRR model with rank to the data and denote the estimated coefficient matrix by .
Compute the GIC value .
Find the optimal rank with the minimal GIC value in Step 3, i.e.,
Determine the final estimated coefficient matrix by , .
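The two-stage search can be sketched as follows. This is an illustration only: the paper's exact GIC formula is not reproduced, so a generic criterion of the Fan and Tang (2013) flavor stands in for it, and a naive row-screening fit (`screen_fit`) stands in for Algorithm 1; the degrees-of-freedom term `s * r` is likewise an assumption.

```python
import numpy as np

def gic(X, Y, C, df):
    """Generic GIC: log of the normalized loss plus a Fan-Tang-style
    complexity penalty (an illustrative stand-in, not the paper's formula)."""
    n, q = Y.shape
    loss = np.linalg.norm(Y - X @ C) ** 2 / (n * q)
    return np.log(loss) + df * np.log(np.log(n)) * np.log(X.shape[1] + 1) / n

def rrr(X, Y, r):
    """Classical reduced rank regression (OLS fit projected to rank r)."""
    C_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ C_ols, full_matrices=False)
    return C_ols @ Vt[:r].T @ Vt[:r]

def screen_fit(X, Y, s):
    """Naive stand-in for Algorithm 1: keep the s rows of X^T Y with the
    largest norms, then refit least squares on those columns."""
    p, q = X.shape[1], Y.shape[1]
    active = np.sort(np.argsort(-np.linalg.norm(X.T @ Y, axis=1))[:s])
    C = np.zeros((p, q))
    C[active], *_ = np.linalg.lstsq(X[:, active], Y, rcond=None)
    return C, active

def tune(X, Y, s_grid, r_max):
    # stage 1: sparsity search with a generous rank bound
    fits = {s: screen_fit(X, Y, s) for s in s_grid}
    s_best = min(s_grid, key=lambda s: gic(X, Y, fits[s][0], s * r_max))
    active = fits[s_best][1]
    # stage 2: rank search via RRR restricted to the active columns
    Xa = X[:, active]
    r_best = min(range(1, r_max + 1),
                 key=lambda r: gic(Xa, Y, rrr(Xa, Y, r), s_best * r))
    C = np.zeros((X.shape[1], Y.shape[1]))
    C[active] = rrr(Xa, Y, r_best)
    return C, s_best, r_best
```

The point of the sketch is the cost structure: stage 1 costs one fit per sparsity candidate and stage 2 one RRR per rank candidate, instead of one fit per point of the two-dimensional grid.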
4 Theoretical properties
In this section we present the theoretical analysis for the outputs of Algorithm 1 and Algorithm 3. The proofs of the results are provided in the Supplementary Material.
Define the active set and inactive set of the true underlying coefficient matrix $C^*$ accordingly. We impose the following technical conditions for the theoretical analysis.
(Dimensions) As , is fixed, , and
(Restricted Isometry Predictor) The predictor matrix $X$ is nearly orthogonal in the sense that there exist constants, independent of $n$, such that
(Minimal-separated Response) Let denote the -th largest eigenvalue of , . Then the nonzero eigenvalues are separated as
for some constant .
(Response Intensity) for all .
(One-side Stronger Restricted Isometry Predictor) There exist constants, independent of $n$, such that
We now give intuitive interpretations of the conditions.
The dimension order in (C1) is a common high-dimensional setting, especially suitable for genetic datasets. The fixed-dimension condition on the response comes from the reality that increasing the number of experiments will not increase the number of output variables. We further impose an upper bound on the sparsity for our scheme to work, which is rather weak since the bound tends to infinity with $n$.
which represents that any such sub-matrix is almost an identity matrix. Condition (22) further upper-bounds any sparse off-diagonal block, which means that any two distinct small subsets of variables of $X$ are designed to be mutually nearly independent. In summary, (C2) suggests that the variables in $X$ are almost isometric and uncorrelated. These two conditions can be regarded as weaker versions of the restricted isometry and restricted orthogonality conditions discussed in Candes and Tao (2005) and Geer and Bühlmann (2009).
As $X$ is an $n$-by-$p$ matrix, sometimes normalized column-wise, it is natural to assume the stated scaling. From (C2), the correlation between variables is weak, so the quantity in (C4) can be regarded as an aggregation of information from the predictors. Then (C4) reflects the heterogeneity of each part of the information; specifically, each part consists of an approximately proportional share of the whole information, which gives (C4).
We study a special case to show that the conditions are easy to satisfy.
Proposition 4 (Gaussian case).
Suppose, first, that each column of $X$ is independently and identically distributed (i.i.d.) as the uniform distribution over the $n$-dimensional sphere with the stated radius. Second, the coefficient matrix is specifically designed. Third, each noise entry is Gaussian. If the dimension condition (C1) holds, then for any fixed choice of the constants and any small probability, there exists large enough $n$ such that all the conditions hold with probability at least the stated level.
4.2 Main results
We assume without loss of generality that our computed eigenvectors always take the correct direction, so that the angles between the estimated and population eigenvectors are no more than 90 degrees. Mathematically speaking, if an eigenvector is estimated via eigenvalue decomposition, then its inner product with the corresponding population eigenvector is nonnegative. We now present the error bounds for the estimator obtained via Algorithm 1.
Theorem 1 (Error bounds).
Under the foregoing settings, we further assume the additional condition stated. Then, with the stated probability and sufficiently large $n$, Algorithm 1 will find the true active set after a bounded number of iterations, depending on some constants. Furthermore, the estimation error bounds are
Theorem 1 states that Algorithm 1 successfully approaches the true coefficient matrix when the row-sparsity is correctly chosen and the rank constraint is not too strict. In the next theorem, we demonstrate that Algorithm 3 can consistently recover the true rank and row-sparsity.
Denote the output of Algorithm 3 as before. Assume (C1)-(C6) hold; then, with the stated probability and sufficiently large $n$, Algorithm 3 will select the true rank and the true active set.
We now present the error bounds for the estimator obtained via Algorithm 3.