1 Introduction
Suppose we observe the centered^{1}^{1}1Both and are centered for simplicity; otherwise, One can involve the intercept term to fit our model. dataset , where
is generated via the multivariate response linear regression model
, where is the underlying yet unknown coefficient matrix andis the zeromean noise vector. In literature,
represents the response to the predictor , through the structure that we expect to figure out. Equivalently, we can write the model in the matrix form as(1) 
where , and concatenates the responses, predictors and noises, respectively. We further assume that each column of , namely, each variable with samples, is normalized to be with norm .
Recently, as the high dimensional data become increasingly common, model (1) is in demand of (i) a simple structure of the coefficient , and (ii) a small amount of important variables affecting the responses most. The reduced rank regression (RRR) (Anderson, 1951; Izenman, 2008) restricting the coefficient matrix to be low rank provides a simple yet interpretable underlying structure, and has been wildly used in econometrics and genetics, see, for instance, Velu and Reinsel (1998), Vounou et al. (2010), and Ma et al. (2014). It works by solving the rankconstrained least squares problem
(2) 
where is the Frobenius norm, denotes the rank of , and is an positive integer. Meanwhile, as the number of variables are relatively large in high dimensional data, we aim at detecting a small (or sparse) subset of important predictors and thus obtaining a parsimonious fit to the data in hand. For example in genomics, researchers are interested in finding out which micro RNAs play key role in the regulatory relationship between cancer related gene expressions and hundreds of micro RNAs (Ma et al., 2014).
To meet with the two demands, we mainly focus on the scenario that is lowrank and rowsparse. In specific, and are both small with and . In order to recover , we consider the following constrained least squares problem
(3) 
where and . The minimizer of (3) denoted as is called the multivariate response best subset selection (MrBeSS) estimator with rank and sparsity .
The only difference between the MrBeSS minimization (3) and the RRR minimization (2) is the rowsparse constraint. Thus a direct approach to solve MrBeSS estimator is to select candidates out of the rows in that may not be zero, and then apply RRR method to the corresponding selected data that includes the th column of if and only if the th row of is a candidate. However, exhaustively searching over all possible choices of candidates and performing RRR at each time is NPhard and impractical.
To overcome the computational difficulties caused by the nonconvex constraint that strictly gives the best subset selection result, computationally friendlier optimization based regularization methods have been proposed as a surrogate for problem (3), among which the convex relaxation is a popular method. For example, the nonconvex is replaced by the convex , which represents the sum of the rownorms. Then replacing the sparse constraint as a Lagrangian term gives the penalized optimization problem
(4) 
where penalty parameter . The minimizer of (4) is called as rank constrained group Lasso (RCGL) in Bunea et al. (2012) since the term can be viewed as a group Lasso penalty (Yuan and Lin, 2006) by vectorizing and setting entries in the same row as a group. Consequently, several adaptive weighted variants of group Lasso (Chen and Huang, 2012; She, 2017) can also be applied to select important predictors in RRR. For example, instead of distributing equal weights to all the rownorms, the SRRR estimator in Chen and Huang (2012) minimizes the variant of (4) with rowwise weight being determined by an adaptive weighting strategy (Zou, 2006; Wang and Leng, 2008). Besides, rather than directly imposing group sparsity on , Chen et al. (2012) and Mishra et al. (2017) suggested to decompose the coefficient matrix
via singular value decomposition and prompt group sparsity to the corresponding singular vectors. Similarly,
Ma et al. (2014) applied the hard thresholding operator to the singular vectors in order to eliminate singular vectors with small singular values.In spite of its favorable computational properties, regularization methods with continuous penalty relaxation have several shortcomings. The gap between the relaxed penalty and the true one is a main concern in the joint rank and row selection (JRRS) estimator proposed by Bunea et al. (2012). JRRS estimator provides an adaptive correction, which acts analogously to the information criterion such as AIC (Akaike, 1974) and BIC (Schwarz and others, 1978) in cases of univariate response regression. Specifically speaking, with some candidates of , the authors consider solving the following regularized optimization problem
(5) 
where penalty parameter . Nonetheless, it needs complete enumeration of the model spaces and should be used together with an efficient subset selection procedure. Moreover, the parameter is hard to determine beforehand, which limits the effectiveness of the JRRS estimator.
Meanwhile, in the theoretical aspect, some sufficient regularity conditions on the data must be satisfied to guarantee a relaxed model approaching good predictive accuracy, see, for example, the restricted eigenvalue condition
(Bunea et al., 2012) or incoherence condition (She, 2017). Thus as soon as these conditions are violated, the above group Lasso based methods become suboptimal and might bring in a large number of irrelevant rows including noise predictors. The adaptive weighting version of group Lasso penalty might help to improve the prediction performance, yet it only works for fixed , the number of predictors and has the tendency of underselecting relevant variables when (Chen and Huang, 2012). In contrast, the group penalty based methods are shown to achieve the optimal rate for prediction error under no restrictions on the design matrix (Bunea et al., 2012; She, 2017). Recently, She and Chen (2017) discussed that the performance of using group Lasso penalty is substantially worse and less stable than those of using the nonconvex grouppenalty for outliers detection in RRR. Therefore, to achieve better performance of subset selection and parameter estimation in RRR, it is desirable to design an efficient algorithm directly based on the group
penalty.In this paper, we propose a novel framework via which the best subset selection problem (3) can be directly solved within a reasonable time frame. Motivated by the primal dual formulation of the optimizer, which is the main ingredient of our proposal, we develop a new computationally efficient algorithm with guaranteed convergence. Note that the special case of univariate response regression has been thoroughly studied in Wen et al. (2017) and we provide a more general point of view here. Theoretically, we demonstrate that our proposed MrBeSS algorithm successfully recovers the true model with acceptable accuracy. Further, we show significant numerical outperformance of MrBeSS under a variety of simulation settings when compared with stateoftheart methods . We also study how the new estimator works efficient and stable when applying to the real world in an example of micro RNA and gene expression association data.
The rest of the paper is organized as follows. Section 2 presents the primal dual characterization of the best subset selection model. We mainly develop our algorithms in Section 3, including solving a MrBeSS estimator with fixed parameters, an initialization strategy, and an adaptive parameter tuning algorithm. In Section 4, we analyze the theoretical performance of the proposed algorithms. The competitive numerical performance of our proposal is demonstrated in Section 5 through both simulated and real data examples. Section 6 discusses interesting related problems for future research. The detailed proofs of the theorems are presented in the Supplementary Material.
Notations
Conventionally, a vector is expressed in the column form, that is, . For a matrix , where is the th row of , we denote its th column as , and more generally (resp. ) represents the submatrix concatenating (resp. ) with . The rank of is denoted as , while its trace is denoted as . For a symmetric matrix let , be its
th largest eigenvalue and the corresponding eigenvector, respectively,
. As an eigenvector has at least two options in opposite directions, we will specify one direction case by case. Let and be the Frobenius norm and norm of , respectively. Define as the number of nonzero rows of , which can be convexly relaxed as the norm of , denoted as .As our theoretical results focus on the case when the number of samples is sufficiently large, we will use the following notations to compare the growth order of sequences. For any two positive sequences , denote or , if grows no faster or in the same order of , respectively. That is,
Further, we analogously say if , if , and if .
2 Primal Dual Formulation
We decompose into the product of two matrices, i.e., , with
. Then (3) can be rewritten as(6) 
Therefore the two restrictions are separated, which may help reduce the computational difficulty. However, the solution to the optimization problem (6) is not unique. For example, suppose is a solution of (6), then is also a solution of (6), if there is an orthogonal matrix such that and . Nonetheless, suggests that both the two solutions give the same estimation of .
Next we characterize the primal dual condition for that motivates our MrBeSS algorithm, where is a minimizer of (6). For the th row, we consider the unconstrained minimization problem given the other variables optimal, to say,
(7) 
We can find the explicit solution of (7) as with , where denotes the th column of .
To figure out the connection between and , note that the rowsparse constraint in (6) forces elements in to be zero vectors. If is not enforced to be zero, then the rowwise optimality of gives that and . From the above discussion, we observe the primaldual condition of the optimal point.
Proposition 1 (Primaldual condition).
If is a minimizer of (6) and denotes the index of the nonzero rows of , then rowwisely and its corresponding normalized residual satisfy the following primaldual condition:
(8) 
In consequence, we call , as the th primal variable and dual variable, respectively.
For the decomposed MrBeSS problem (6), define the active set as the index of nonzero rows of , i.e., , and the inactive set as the complement of . Given , the active set and the part of the minimizer, we can recover the primal variables as well as the dual according to (8). In specific,
(9) 
Since the support of the primal variables is complementary to that of , we directly set
(10) 
The computation is the key to design a scalable MrBeSS algorithm in the next section.
3 Algorithm
In this section, we develop a new estimation algorithm for solving the MrBeSS problem, where both dimension reduction and variable selection are taken into account. We first solve the basic parameterfixed MrBeSS estimator, and then provide suggested initialization and parameter tuning strategies. Based on the primaldual active set updating scheme, the proposed algorithm has closedform updates and is computationally efficient, which enables our algorithm to be applied in high dimensional data.
3.1 MrBeSS with Fixed Parameters
Given the rank and the rowsparsity , the decomposed optimization problem in (6) is now with respect to and . This motivates us to solve it in a blockwise iteration, i.e., optimizing one variable by fixing another variable. In specific, we solve the following two subproblems at the th iteration:
(11)  
(12) 
When is fixed, the subproblem (12) is an orthogonal Procrustes problem and has explicit solution (Schönemann, 1966). In particular, if we perform the singular value decomposition of as , where , , and , then the solution is given by . The solution is unique provided the th singular value of is nonzero.
Given , the subproblem (11) can be treated as a group subset selection problem with each group representing one row in . As discussed in Section 2, to obtain the optimal variables, it suffices to know the optimal active set. Once the estimated active set has been settled as , we can update the variables analogous to (9), (10), namely,
(13)  
(14) 
Therefore, the main ingredient at each iteration is to estimate the active set. To determine which rows are nonzero candidates, we consider the difference of objective function in (7) when switching the vector from to , as given by
(15)  
(16) 
which is called the th sacrifice hereafter. Intuitively, we may prefer enforcing those s with least sacrifices to all zeros. To realize this, let , which is a permutation of , be the rank statistic^{2}^{2}2If there’re knots, we will randomly specify different rank of them, such that is still a onetoone mapping. of , that is, , then truncate the ordered sacrifice vector at position . Namely, and . As the sacrifices calculated at each time highly depend on the inaccurate primaldual pair, we repeat the processes of finding the primaldual variables and leastsacrifice active set. The above discussion is summarized in Algorithm 1 as follows.
Algorithm 1 Multivariate Response Best Subset Selection (MrBeSS) with fixed and
Input: Response matrix , predictor matrix , rank and sparsity .
Output: .

Set and initialize . (See Algorithm 2 for a suggested initialization.)

While the value of objective function in (6) not converged do

Given , update as follows

Let and perform SVD on , i.e., ;

Determine by .


.

Set .
3.2 Identification and Initialization
We first study the noiseless case , which motivates us to specify the identical optimal solution as well as the suggested initialization strategy used in Step 1. in Algorithm 1. Similar to , we can also decompose as , where and are unique up to rightmultiplying an orthogonal matrix. We observe the following property, which motivates the unique decomposition .
Proposition 2 (Characterization of noiseless decomposition).
For any decomposition , there exists an orthogonal matrix , such that consists of the ordered nonzero eigenvectors of , that is,
(17) 
where is the projection matrix onto the column space of with the MoorePenrose pseudoinverse. Further, is unique if the first eigenvalues of are different with each other.
It should be noted that we take a specific direction of , e.g., its first nonzero entry is positive. Hereafter we regard in (17) and its correspoding as the identical decomposition of .
Now we turn back to the practical noisy case that . We expect that our estimated could be as close as possible to . An intuition according to (17) is to initialize as the order nonzero eigenvectors of or , that is,
(18)  
(19) 
Note that though in the noiseless case, in general , and we provide a guideline for choosing which matrix to perform eigenvalue decomposition. When and is with full rank, the projection matrix plays the role of compressing the error on a smaller dimensional space, while not disturbing the information from . Therefore, we prefer applying (19) in lowdimensional case. On the other hand, when ,
is nearly identical, and is exactly the identical matrix in the full rank case. Thus,
loses its power of errorcompression yet burdens the computation. In this highdimensional situation, we seek via (18). Our above discussion leads to the algorithm as follows.
Algorithm 2 Initialization of
Input: Response matrix , predictor matrix and rank .
Output: Initial matrix .

If , compute and let ; otherwise, let .

Perform eigendecomposition to the matrix .

Compute .
3.3 Adaptive Parameter Tuning
There are two tuning parameters in problem (3) or (6), namely, the rank and the rowsparsity . Given a grid of candidates , we can select the one that leads to the smallest average prediction error with fold cross validation (Friedman et al., 2001). Alternatively, various information criteria can been used due to their computational efficiency. Here we propose a novel generalized information criterion (GIC) defined by
(20) 
where is the normalized loss and is the MrBeSS estimator of with fixed parameters . When the response is univariate, i.e., , our proposed GIC (20) reduces to the GIC studied in Fan and Tang (2013). It is shown in their paper (Corollary 1, Fan and Tang (2013)) that the component is a valid choice for identifying the true model consistently.
GIC finds the tradeoff between loss minimization and overfitting, and smaller GIC represents a more balanced choice. Therefore, a naive way is to minimize GIC among all candidate parameter values, say, . For each pair of tuning parameters and , the MrBeSS estimator is computed via Algorithm 1 and the corresponding GIC value is calculated. However, simultaneous search for optimal tuning parameters and over a twodimensional grid is computationally expensive. To reduce the computational burden, we introduce a simplified yet efficient search strategy. In specific, we first identify a GICminimal choice of sparsity over a sequence of values with rank large enough, and then based on the detected active set of the parameter , we determine the GICminimal rank via fitting RRR model to the restricted data , where denotes the submatrix of only including the columns . The algorithm with the simplified search strategy is summarized in Algorithm 3. In this way, we can obtain the result of parameter tuning after running Algorithm 1 for time and running RRR for time, rather than running the timedemanding Algorithm 1 for time.
Algorithm 3 Multivariate Response Best Subset Selection (MrBeSS)
Input: Response matrix , predictor matrix , maximum number of rank and sparsity .
Output: .

For , do

Run Algorithm 1 with and . Denote the output by and .

Compute the GIC value .


Find and its corresponding with the minimal GIC value, i.e.,

For , do

Fit RRR model with rank to the data and denote the estimated coefficient matrix by .

Compute the GIC value .


Find the optimal with the minimal GIC value in 3., i.e.,

Determine the final estimated coefficient matrix by , .
4 Theoretical properties
In this section we present the theoretical analysis for the outputs of Algorithm 1 and Algorithm 3. The proofs of the results are provided in the Supplementary Material.
4.1 Preliminaries
Define the active set and inactive set of the true underlying coefficient matrix as and , respectively. We prepare the following technical conditions for the theoretical analysis.

(Dimensions) As , is fixed, , and

(Restricted Isometry Predictor) The predictor matrix is nearly orthogonal in the sense that there exists constants independent of , such that
(21) (22) 
(Minimalseparated Response) Let denote the th largest eigenvalue of , . Then the nonzero eigenvalues are separated as
for some constant .

(Response Intensity) for all .

(Noise) are subGaussian^{3}^{3}3We say a random vector is subGaussian, if its subGaussian norm is no more than See Vershynin (2018)
for more details about the definition of subGaussian distribution.
random vectors for some constant , that is, for any , we have 
(Oneside Stronger Restricted Isometry Predictor) There exists constants independent of , such that
Remark 1.
The conditions require some intuitive interpretations.

The order of in (C1) is a common highdimensional data setting, especially suitable for genetic datasets. The fixed condition comes from the reality that increasing experiments will not increase the number of output variables. We further give the constraint on the upperbound of sparsity for our scheme to work, which is rather weak as it goes infinity with respect to .

From Weyl’s theorem on eigenvalues, (21) in (C2) is equivalent to
(23) which represents that any is almost an identity matrix. (22) further upperbounds any sparse offdiagonal block of , which means that any two distinct small subsets of variables of are designed to be mutually nearly independent. In summary, (C2) suggests the variables in are almost isometric and uncorrelated. These two equations can be regarded as a weaker condition of the restricted isometry and restricted orthogonality conditions discussed in Candes and Tao (2005); Geer and Bühlmann (2009).

As is an by matrix and sometimes normalized to be for each column, it is natural to assume that . From (C2), the correlation between variables is weak, so can be regarded as aggregation of information from . Then (C4) suggests the heterogeneity of each part of information. In specific, consists of approximately of the whole information, which gives (C4).
We study a special case to show the conditions easy to be satisfied.
Proposition 4 (Gaussian case).
Suppose each column of
is identically and independently distributed (i.i.d.) as the uniform distribution over the
dimensional sphere with radius , which is simply denoted asSecond, is specifically designed. Third, each noise , . If the dimension condition (C1) holds, then for any fixed choice of the constants
and any small probability
, there exists large enough , such that all the conditions hold with probability at least .4.2 Main results
We assume without loss of generality that our calculated eigenvectors always take the correct direction, such that the angles between estimated and population vectors are no more than degrees. Mathematically speaking, if is an estimation of , which are computed via eigenvalue decomposition, then we have that Then we present the error bounds for the estimator obtained via Algorithm 1.
Theorem 1 (Error bounds).
Note that , which is bounded under (C1). Second, if the active set is not correctly chosen in , it can be shown that , whose proof is similar to Proposition 3. Then we have the following corollary.
Corollary 1.
With the settings in Theorem 1, we further assume that . Then with probability at least and sufficiently large , Algorithm 1 will find the true active set after iterations, where for some constants Furthermore, the estimation error bounds will be
(26) 
Theorem 1 states that Algorithm 1 successfully approaches the true coefficient matrix when the rowsparsity is correctly chosen and rank constraint is not too strict. In the next theorem, we demonstrate Algorithm 3 can consistently output the true rank and rowsparsity .
Theorem 2.
Denote as the output Algorithm 3. Assume (C1)(C6) hold with , then with probability at least and sufficiently large , Algorithm 3 will select the true rank and the active set, i.e., and .
We now present the error bounds for the estimator obtained via Algorithm 3.
5 Simulation study
In this section, we investigate the performance of MrBeSS on simulated data. We include five methods for comparison: (i) the rank constrained group Lasso (RCGL, c.f. (4)) by Bunea et al. (2012), (ii) the sparse reduced rank regression using adaptive group Lasso (SRRR) by Chen and Huang (2012), (iii) the thresholds SVD method (TSVD) by Ma et al. (2014), (iv) the iterative exclusive extraction algorithm (IEEA) in Chen et al. (2012), and (v) the sequential factor extraction via cosparse unitrank estimation (SeCURE) by Mishra et al. (