 # Sparse Multivariate Factor Regression

We consider the problem of multivariate regression in a setting where the relevant predictors may be shared among different responses. We propose an algorithm that decomposes the coefficient matrix into the product of a tall matrix and a wide matrix, with an elastic net penalty on the former and an $\ell_1$ penalty on the latter. The first matrix linearly transforms the predictors into a set of latent factors, and the second regresses the responses on these factors. Our algorithm simultaneously performs dimension reduction and coefficient estimation, and automatically estimates the number of latent factors from the data. Our formulation results in a non-convex optimization problem, which, despite its flexibility to impose effective low-dimensional structure, is difficult, or even impossible, to solve exactly in a reasonable time. We specify an optimization algorithm based on alternating minimization with three different sets of updates to solve this non-convex problem and provide theoretical results on its convergence and optimality. Finally, we demonstrate the effectiveness of our algorithm via experiments on simulated and real data.


## 1 Introduction

Multivariate regression analysis, also known as multiple-output regression, is concerned with modelling the relationships between a set of real-valued output vectors, known as responses, and a set of real-valued input vectors, known as predictors or features. The multivariate responses are measured over the same set of predictors and are often correlated. Hence, the goal of multivariate regression is to exploit these dependencies to learn a predictive model of responses based on an observed set of input vectors paired with corresponding outputs. Multiple-output regression can also be seen as an instance of the problem of multi-task learning, where each task is defined as predicting individual responses based on the same set of predictors. The multivariate regression problem is encountered in numerous fields including finance (Lee and Lee, 2010), computational biology (Monteiro, 1999), geostatistics (Ranhao et al., 2008), chemometrics (Wold et al., 2001), and neuroscience (Harrison et al., 2003).

In this paper, we are interested in multivariate regression tasks where it is reasonable to believe that the responses are related to factors, each of which is a sparse linear combination of the predictors. Our model further assumes that the relationships between the factors and the responses are sparse. This type of structure occurs in a number of applications and we provide two examples in later sections.

Given $p$-dimensional predictors $x_i \in \mathbb{R}^p$ and $q$-dimensional responses $y_i \in \mathbb{R}^q$ for the $i$-th sample, we assume there is a linear relationship between the inputs and outputs as follows:

$$y_i = D^T x_i + \epsilon_i, \quad i = 1, \ldots, N, \tag{1}$$

where $D \in \mathbb{R}^{p \times q}$ is the regression coefficient matrix and $\epsilon_i \in \mathbb{R}^q$ is the vector of errors for the $i$-th sample. We can combine these equations into a single matrix formula:

$$Y = XD + E, \tag{2}$$

where $X \in \mathbb{R}^{N \times p}$ denotes the matrix of predictors with $x_i^T$ as its $i$-th row, $Y \in \mathbb{R}^{N \times q}$ denotes the matrix of responses with $y_i^T$ as its $i$-th row, and $E \in \mathbb{R}^{N \times q}$ denotes the matrix of errors with $\epsilon_i^T$ as its $i$-th row. For $q = 1$, this multivariate linear regression model reduces to the well-known univariate linear regression model.

We assume that the columns of $X$ and $Y$ are centred, and hence the intercept terms are omitted, and that the columns of $X$ are normalized. We also assume that the error vectors for the $N$ samples are iid Gaussian random vectors with zero mean and covariance $\Sigma$, i.e. $\epsilon_i \sim \mathcal{N}(0, \Sigma)$.
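As a concrete illustration of the model, the following sketch (plain NumPy; iid $\mathcal{N}(0, 0.01)$ errors are used in place of a general covariance $\Sigma$ purely for simplicity) generates data from (2) and checks that the matrix form agrees with the per-sample form (1):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 20, 5, 3
X = rng.standard_normal((N, p))        # predictors, one row x_i^T per sample
D = rng.standard_normal((p, q))        # regression coefficient matrix
E = 0.1 * rng.standard_normal((N, q))  # errors (iid Gaussian, a simplification)
Y = X @ D + E                          # matrix form (2)

# Row-wise form (1): y_i = D^T x_i + eps_i for each sample
y0 = D.T @ X[0] + E[0]
assert np.allclose(y0, Y[0])
```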

In the absence of additional structure, many standard procedures for solving (2), such as linear regression and principal component analysis, are not consistent unless $p/N \to 0$. Thus, in a high-dimensional setting where $p$ is comparable to or greater than $N$, we need to impose some low-dimensional structure on the coefficient matrix. For instance, element-wise sparsity can be imposed by constraining the $\ell_1$ norm of the coefficient matrix, $\|D\|_{1,1}$ (Tibshirani, 2011; Wainwright, 2009). This regularization is equivalent to solving a separate univariate lasso regression for every response; thus we treat the $q$ tasks separately. Another way to introduce sparsity is to consider the mixed $\ell_1/\ell_\gamma$ norms ($\gamma > 1$). In this approach (sometimes called group lasso), the mixed norms impose a block-sparse structure where each row is either all zero or mostly zeros. Particular examples, among many other works, include results using the $\ell_1/\ell_\infty$ norm (Turlach et al., 2005; Zhang and Huang, 2008) and the $\ell_1/\ell_2$ norm (Yuan and Lin, 2006; Obozinski et al., 2011). Also, there are the so-called "dirty" models, which are superpositions of simpler low-dimensional structures such as element-wise and row-wise sparsity (Jalali et al., 2010; Peng et al., 2010) or sparsity and low rank (Chen et al., 2011; Chandrasekaran et al., 2011).

Another approach is to impose a constraint on the rank of the coefficient matrix. In this approach, instead of constraining the regression coefficients directly, we can apply penalty functions to the rank of $D$, its singular values, and/or its singular vectors (Pourahmadi, 2013; Chen and Huang, 2012; Chun and Keleş, 2010; Yuan et al., 2007; Reinsel and Velu, 1998). These algorithms belong to a broad family of dimension-reduction methods known as linear factor regression, where the responses are regressed on a set of factors obtained by a linear transformation of the predictors. The coefficient matrix is decomposed into two matrices, $D = AB$: matrix $A$ transforms the predictors into latent factors, and matrix $B$ determines the factor loadings.

### Our contributions

Here, we propose a novel algorithm which performs sparse multivariate factor regression (SMFR). We jointly estimate the matrices $A$ and $B$ by minimizing the mean-squared error $\|Y - XAB\|_F^2$ with an elastic net penalty on $A$ (which promotes grouping of correlated predictors and the interpretability of the factors) and an $\ell_1$ penalty on $B$ (which enhances the accuracy and interpretability of the predictions). We provide a formulation to estimate the number of effective latent factors, $m$. To the best of our knowledge, our work is the first to strive for low-dimensional structure by imposing sparsity on both the factoring and loading matrices as well as the grouping of correlated predictors. This can result in a set of interpretable factors and loadings with high predictive power; however, these benefits come at the cost of a non-convex objective function. Most current approaches for multivariate regression solve a convex problem (either through direct formulation or by relaxation of a non-convex problem) to impose low-dimensional structure on the coefficient matrix. Although non-convex formulations, such as the one introduced here, can yield very effective representations in the context of multivariate regression, there are few theoretical performance guarantees for optimization schemes solving such problems. We formulate our problem in Section 2. In Section 3, we propose an alternating minimization scheme with three sets of updates to solve our problem and provide theoretical guarantees for its convergence and optimality. We show that under mild conditions on the predictor matrix, every limit point of the minimization algorithm is a stationary point of the objective function, and that if the starting point is close enough to a local or global minimum, our algorithm converges to that point.
Through analysis of simulations on synthetic datasets in Section 5 and two real-world datasets in Section 6, we show that compared to other multivariate regression algorithms, our proposed algorithm can provide a more effective representation of the data, resulting in a higher predictive power.

### Related Methods

Many multivariate regression techniques impose a low-dimensional structure on the coefficient matrix. Element-wise sparsity, here denoted LASSO, is the most common approach, with cost function $\|Y - XD\|_F^2 + \lambda\|D\|_{1,1}$ (Tibshirani, 2011; Wainwright, 2009). An extension of LASSO to the multivariate case is row-wise sparsity with the mixed $\ell_1/\ell_2$ norm as the penalty: $\|Y - XD\|_F^2 + \lambda\|D\|_{1,2}$ (Yuan and Lin, 2006; Obozinski et al., 2011). Peng et al. proposed a method, called RemMap (Peng et al., 2010), which imposes both element-wise and row-wise sparsity and solves the following problem:

$$\min_D \; \|Y - XD\|_F^2 + \lambda_1\|D\|_{1,1} + \lambda_2\|D\|_{1,2}.$$

In an alternative approach, (Chun and Keleş, 2010) extended the partial least squares (PLS) framework by imposing an additional sparsity constraint and proposed Sparse PLS (SPLS).

Another common idea is to employ dimensionality-reduction techniques to find the underlying latent structure in the data. One of the most basic algorithms in this class is Reduced Rank Regression (RRR) (Reinsel and Velu, 1998), where the sum-of-squares error is minimized under the constraint that $\mathrm{rank}(D) \le r$ for some $r \le \min\{p, q\}$. It is easy to show that this problem admits a closed-form solution based on a singular value decomposition. However, similar to least squares, the solution of this problem without appropriate regularization exhibits poor predictive performance and is not suitable for high-dimensional settings. Another popular approach is to use the trace norm as the penalty function:

$$\min_D \; \|Y - XD\|_F^2 + \lambda\sum_{j=1}^{\min\{p,q\}}\sigma_j(D), \tag{3}$$

where $\sigma_j(D)$ denotes the $j$-th singular value of $D$. The trace norm regularization has been extensively studied in the literature (Yuan et al., 2007; Argyriou et al., 2008; Ji and Ye, 2009; Zhang et al., 2012). It imposes sparsity on the singular values of $D$ and therefore results in a low-dimensional solution (higher values of $\lambda$ correspond to solutions of lower rank).

Many papers study problems of the following form:

$$\min_D \; \|Y - XD\|_F^2 + g(D) \quad \text{s.t.} \quad \mathrm{rank}(D) \le r \le \min\{p, q\}, \tag{4}$$

where $g(\cdot)$ is a regularization function over $D$. For instance, in (Mukherjee and Zhu, 2011), a ridge penalty is proposed, $g(D) = \lambda\|D\|_F^2$. Often it is assumed that $D = AB$ (and thus $\mathrm{rank}(D) \le \min\{\mathrm{rank}(A), \mathrm{rank}(B)\} \le r$) and the problem is formulated in terms of $A$ and $B$. In (Kumar and Daume, 2012), the penalty combines an $\ell_1$ penalty on the loading matrix with a squared Frobenius penalty on the transformation matrix. An algorithm called Sparse Reduced Rank Regression (SRRR) is proposed in (Chen and Huang, 2012) and further studied in (Ma and Sun, 2014), where a row-wise $\ell_{1,2}$ penalty is applied to $A$ with an extra orthogonality constraint on $B$. In (Argyriou et al., 2008), a squared mixed-norm penalty is used, again with an orthogonality constraint on one of the factors. In these methods, dimension reduction is achieved by the constraint on $A$, which forces many of its rows to be zero, consequently cancelling the effects of the corresponding columns of $X$.

Our problem formulation differs in three important ways: (i) sparsity constraints are imposed on both $A$ and $B$; (ii) the elastic net penalty enables us to control the level of sparsity for each matrix separately and also provides the grouping of correlated predictors; and (iii) the number of factors is determined directly, without the need for cross-validation. We will discuss the second and third aspects in detail in the next section. The first difference has substantial consequences: when the coefficient matrix is decomposed into two matrices, the first matrix aggregates the input signals to form the latent factors, and the second matrix performs a multivariate regression on these factors. Imposing sparsity on $A$ enhances variable selection as well as the interpretability of the estimated factors. Also, as originally motivated by LASSO, we impose a sparsity constraint on $B$ in order to improve interpretability and prediction performance.

Imposing sparsity on both $A$ and $B$ means that for our model to make sense for a specific problem, the outputs should be related in a sparse way to a common set of factors that are derived as a sparse combination of the inputs. For example, in analyzing the stocks of the S&P 500 (the Standard and Poor's index of 500 large-cap US equities, http://ca.spindices.com/indices/equity/sp-500), it is well known that returns exhibit much stronger correlations for companies that belong to the same industry sector. The memberships of each sector are not always so clear, because a company may have several diverse activities that generate revenue. So if we are using just concurrent stock returns to try to predict those of other companies, it is reasonable to assume that factors representing industry sectors should appear (Chou et al., 2012). Since we do not expect the main sectors to overlap much, a company will not be present in many factors; so, it is reasonable to assume that $A$ is sparse. Moreover, most companies will only be predicted by one or two such factors, so it makes sense that $B$ is also sparse.

There is a connection between Sparse PCA (Zou et al., 2006) and our algorithm in the case where $Y$ is replaced by $X$ (i.e., $X$ is regressed on itself). We investigate this relationship in Section 4.

## 2 Problem Setup

In this work, we introduce a novel low-dimensional structure where we decompose $D$ into the product of two sparse matrices, $D = AB$, where $A \in \mathbb{R}^{p \times m}$, $B \in \mathbb{R}^{m \times q}$, and $m \le \min\{p, q\}$. This decomposition can be interpreted as first identifying a set of factors which are derived by a linear transformation of the predictors (through matrix $A$) and then identifying the transformed regression coefficient matrix $B$ to estimate the responses from these factors. We provide a framework to find $m$, the number of effective latent factors, as well as the transforming and regression matrices, $A$ and $B$. For a fixed $m$, define:

$$\hat{A}_m, \hat{B}_m = \underset{A_{p \times m},\, B_{m \times q}}{\arg\min} \; f(A, B), \tag{5}$$

where

$$f(A, B) = \frac{1}{2}\|Y - XAB\|_F^2 + \lambda_1\|A\|_{1,1} + \lambda_2\|B\|_{1,1} + \lambda_3\|A\|_F^2. \tag{6}$$

Then, we solve the following optimization problem:

$$\hat{m} = \max_{m \le r} \; m \quad \text{s.t.} \quad \mathrm{rank}(\hat{A}_m) = \mathrm{rank}(\hat{B}_m) = m, \tag{7}$$

where $r$ is a problem-specific bound on the number of factors. We then choose $\hat{A}_{\hat{m}}$ and $\hat{B}_{\hat{m}}$ as solutions. Thus, we find the maximum number of factors such that the solution of (5) has full-rank factor and loading matrices. In other words, we find the maximum $m$ such that the best possible regularized reconstruction of the responses, i.e., the solution of (5), results in a model where the factors (columns of $A$) and their contributions to the responses (rows of $B$) are linearly independent. To achieve this, we initialize $A$ to have $r$ columns and $B$ to have $r$ rows, i.e., $m = r$; we then solve the problem (5)–(6) and check the full-rank condition; if it is not satisfied, we set $m \leftarrow m - 1$ and repeat the process until we identify an $m$ that satisfies the rank condition.
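The search over $m$ can be sketched as follows. The inner solver is deliberately abstracted away: `solver` stands for any routine that returns $(\hat{A}_m, \hat{B}_m)$ for a given $m$. The `svd_solver` shown is a hypothetical stand-in (a rank-$m$ truncated SVD of a ridge fit, with no sparsity penalties), not the paper's regularized solver:

```python
import numpy as np

def has_full_rank(A, B, m, tol=1e-8):
    # Rank condition of (7): both the factor matrix A (p x m) and the
    # loading matrix B (m x q) must have rank exactly m.
    return (np.linalg.matrix_rank(A, tol=tol) == m
            and np.linalg.matrix_rank(B, tol=tol) == m)

def estimate_num_factors(X, Y, solver, r):
    # Start from the upper bound r and decrease m until the solution
    # has full-rank A and B.
    for m in range(r, 0, -1):
        A, B = solver(X, Y, m)
        if has_full_rank(A, B, m):
            return m, A, B
    raise ValueError("no m in 1..r satisfies the rank condition")

def svd_solver(X, Y, m):
    # Hypothetical stand-in for the solver of (5): rank-m truncated SVD
    # of a ridge least-squares coefficient matrix (no sparsity penalties).
    D = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ Y)
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :m] * s[:m], Vt[:m]
```

On noiseless data generated from a rank-2 coefficient matrix, this loop returns $m = 2$ with the stand-in solver; with the penalized solver of (5), the sparsity penalties shape $\hat{A}_m$ and $\hat{B}_m$ but the same rank test applies.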

In the remainder of this section, we first discuss the desirable properties resulting from the choice of our penalty function, and then explain the reasoning behind how we estimate .

### 2.1 Controlling Sparsity Levels in Matrices A and B Separately

Consider the case where $\lambda_3 = 0$. We have:

$$\begin{aligned} \min_{A,B} f(A, B) &= \min_{A,B} \tfrac{1}{2}\|Y - XAB\|_F^2 + \lambda_1\|A\|_{1,1} + \lambda_2\|B\|_{1,1} && (8) \\ &= \min_{A',B',c} \tfrac{1}{2}\|Y - XA'B'\|_F^2 + \lambda_1 c\|A'\|_{1,1} + \frac{\lambda_2}{c}\|B'\|_{1,1} && (9) \\ &= \min_{A',B'} \tfrac{1}{2}\|Y - XA'B'\|_F^2 + 2\sqrt{\lambda_1\lambda_2\|A'\|_{1,1}\|B'\|_{1,1}}, && (10) \end{aligned}$$

where $A = cA'$, $B = B'/c$, and $c > 0$. In problem (10), we do not have any separate control over the sparsity levels of $A'$ and $B'$. If $(A^*, B^*)$ is a solution to the last problem, then there is a $c$ for which $(cA^*, B^*/c)$ is a solution to the first problem. Therefore, we cannot control the sparsity levels of the two matrices by just including two $\ell_1$ norms. Incorporating the elastic net penalty for $A$ resolves this issue immediately, since the equivalence between the optimization problems (8) and (10) no longer holds.

### 2.2 Grouping of Correlated Features

In this section, we denote the $k$-th row of a matrix by the subscript $k\cdot$ (e.g., $B_{k\cdot}$) and its $i$-th column by the subscript $\cdot i$ (e.g., $X_{\cdot i}$). Recall that matrix $A$ has the role of combining the relevant features to form the latent factors, which are then used in the second layer by matrix $B$ for estimating the outputs. If there are two highly correlated features, we expect them to be grouped together in forming the factors. In other words, we expect them to be both present in a factor or both absent. Inspired by Theorem 1 in the original paper of Zou and Hastie on the elastic net (Zou and Hastie, 2005), we prove in this section that the elastic net penalty enforces the grouping of correlated features in forming the factors.

The columns of $X$ correspond to different features. We assume that all columns of $X$ are centred and normalized. Thus, the correlation between the $i$-th and the $j$-th features is $\rho_{ij} = (X_{\cdot i})^T X_{\cdot j}$.

###### Lemma 1.

Consider solving the following problem for a given $B$ and $\lambda_1, \lambda_3 > 0$:

$$\hat{A} = \arg\min_A f(A, B) = \arg\min_A \tfrac{1}{2}\|Y - XAB\|_F^2 + \lambda_1\|A\|_{1,1} + \lambda_3\|A\|_F^2. \tag{11}$$

Then, if $\hat{A}_{ik}\hat{A}_{jk} > 0$, we have:

$$\frac{2\lambda_3}{\|Y\|_F\,\|B_{k\cdot}\|_F}\,\big|\hat{A}_{ik} - \hat{A}_{jk}\big| \le \sqrt{2(1 - \rho_{ij})}. \tag{12}$$

This lemma says, for instance, that if the correlation between features $i$ and $j$ is very high (i.e., $\rho_{ij} \to 1$), then the difference between their corresponding weights in forming the $k$-th factor, $|\hat{A}_{ik} - \hat{A}_{jk}|$, will be very close to 0. If features $i$ and $j$ are negatively correlated, we can state the same lemma for $X_{\cdot i}$ and $-X_{\cdot j}$ and use $-\rho_{ij}$.

###### Proof.

The condition $\hat{A}_{ik}\hat{A}_{jk} > 0$ means that $\hat{A}_{ik} \ne 0$, $\hat{A}_{jk} \ne 0$, and $\mathrm{sign}(\hat{A}_{ik}) = \mathrm{sign}(\hat{A}_{jk})$. So, we have:

$$\begin{aligned} \frac{\partial f}{\partial A_{ik}}\Big|_{A_{ik}=\hat{A}_{ik}} &= (-X^TYB^T)_{ik} + (X^TX\hat{A}BB^T)_{ik} + \lambda_1\mathrm{sign}(\hat{A}_{ik}) + 2\lambda_3\hat{A}_{ik} && (13) \\ &= -(X_{\cdot i})^T(YB^T)_{\cdot k} + (X_{\cdot i})^T(X\hat{A}BB^T)_{\cdot k} + \lambda_1\mathrm{sign}(\hat{A}_{ik}) + 2\lambda_3\hat{A}_{ik} && (14) \\ \frac{\partial f}{\partial A_{jk}}\Big|_{A_{jk}=\hat{A}_{jk}} &= (-X^TYB^T)_{jk} + (X^TX\hat{A}BB^T)_{jk} + \lambda_1\mathrm{sign}(\hat{A}_{jk}) + 2\lambda_3\hat{A}_{jk} && (15) \\ &= -(X_{\cdot j})^T(YB^T)_{\cdot k} + (X_{\cdot j})^T(X\hat{A}BB^T)_{\cdot k} + \lambda_1\mathrm{sign}(\hat{A}_{jk}) + 2\lambda_3\hat{A}_{jk} && (16) \end{aligned}$$

Due to the optimality of $\hat{A}$, both derivatives are equal to zero. Subtracting (16) from (14) (the sign terms cancel since $\mathrm{sign}(\hat{A}_{ik}) = \mathrm{sign}(\hat{A}_{jk})$), we get:

$$2\lambda_3(\hat{A}_{ik} - \hat{A}_{jk}) = (X_{\cdot i} - X_{\cdot j})^T(Y - X\hat{A}B)(B_{k\cdot})^T. \tag{17}$$

Thus,

$$\begin{aligned} 2\lambda_3\big|\hat{A}_{ik} - \hat{A}_{jk}\big| &= \big|(X_{\cdot i} - X_{\cdot j})^T(Y - X\hat{A}B)(B_{k\cdot})^T\big| && (18) \\ &\le \|X_{\cdot i} - X_{\cdot j}\|_F\,\|Y - X\hat{A}B\|_F\,\|B_{k\cdot}\|_F && (19) \end{aligned}$$

Since the columns of $X$ are normalized, we have $\|X_{\cdot i} - X_{\cdot j}\|_F^2 = 2(1 - \rho_{ij})$. Also, by the definition of $\hat{A}$, we have:

$$f(\hat{A}, B) \le f(0, B) \;\Rightarrow\; \tfrac{1}{2}\|Y - X\hat{A}B\|_F^2 + \lambda_1\|\hat{A}\|_{1,1} + \lambda_3\|\hat{A}\|_F^2 \le \tfrac{1}{2}\|Y\|_F^2 \;\Rightarrow\; \|Y - X\hat{A}B\|_F \le \|Y\|_F \tag{20}$$

(the term $\lambda_2\|B\|_{1,1}$ appears on both sides and cancels).

Combining all these together gives the desired result. ∎

### 2.3 Estimating the number of effective factors

In choosing $m$, we want to avoid both overfitting (large $m$) and lack of sufficient learning power (small $m$). In general, we only require $m \le \min\{p, q\}$; however, in practical settings where $p$ and $q$ are very large, we impose an upper bound on $m$ to have a reasonable number of factors and avoid overfitting. This upper bound, denoted by $r$, is problem-specific and should be chosen by the practitioner. For instance, continuing our example of analysing the S&P 500 stocks, most economists identify 10-15 primary financial sectors, so a choice of $r$ slightly above this range is reasonable, since we expect the factors representing industries to appear but we allow the algorithm to find the optimal $m$ from the data. In order to have the maximum learning power, we find the maximum $m$ for which the solutions satisfy our rank conditions. This motivates starting with $m = r$ and decreasing it until the conditions hold (as opposed to starting with a small $m$ and increasing the dimension).

The full-rank conditions are employed to guarantee a good estimate of the number of "effective" factors. An effective factor explains some aspect of the response data but cannot be constructed as a linear combination of other factors (otherwise it is superfluous). We therefore require the estimated factors to be linearly independent. In addition, we require that the rows of $B$, which determine how the factors affect the responses, are linearly independent. If we do not have this latter independence, we could reduce the number of factors and still obtain the same coefficient matrix $D$, so at least one of the factors is superfluous. By enforcing that $\hat{A}_m$ and $\hat{B}_m$ are full rank, we make sure that the estimated factors are linearly independent in both senses, and thus $\hat{m}$ is a good estimate of the number of effective factors.

In estimating the number of factors, we differ fundamentally from the common approach in the literature. Setting aside the differences in the choice of regularization, most algorithms minimize a cost function as in (5) for a fixed $m$ without the rank condition in (7). They then use cross-validation to find the optimal $m$ (Yuan et al., 2007; Reinsel and Velu, 1998; Chen and Huang, 2012; Peng et al., 2010). The criterion for choosing $m$ is thus the cross-validation error. In contrast, we strive to find the largest $m$ such that the solutions of (5) have full rank and the estimated factors are linearly independent. Finding $m$ via cross-validation may result in non-full-rank solutions with linearly dependent factors. In that case, some factors can be expressed as linear combinations of other factors and can be viewed as redundant. Including redundant factors (via a non-full-rank $\hat{A}_m$) could help to improve the sparsity of $\hat{B}_m$, but our goal is to have the minimum necessary set of factors, not to have a sparse $\hat{B}_m$ at any cost. Later, with experiments on synthetic and real data, we show that our approach to choosing the number of factors results in better predictive performance as well as more interpretable factors compared to other techniques that apply cross-validation.

## 3 Optimization Technique and Theoretical Results

The optimization problem defined in (5)–(7) is not a convex problem, and it is difficult, if not impossible, to solve exactly (i.e., to find the global optimum) in polynomial time. Therefore, we have to employ heuristic algorithms, which may or may not converge to a stationary solution (Powell, 1973). In this section, we propose an alternating minimization algorithm with three different sets of updates, and provide theoretical results for each of them.

### 3.1 Optimization

For a fixed $m$, the objective function in (5) is biconvex in $A$ and $B$: it is not convex in general, but it is convex in either argument when the other is fixed. Let us define $C_i = (A_i, B_i)$. To solve (5) for a fixed $m$, we perform Algorithm 1 with an arbitrary, non-zero starting value $B_0$ (see Section 5 for a discussion of the choice of the starting point). The stopping criterion is related to the convergence of the value of the function $f$, not the convergence of its arguments. In our experiments, we assume $f$ has converged if $|f(C_i) - f(C_{i-1})|/f(C_i) < \tau$; i.e., the algorithm stops once the relative change in $f$ drops below the tolerance parameter $\tau$.
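Algorithm 1 itself is not reproduced in this excerpt; the following sketch shows only its outer structure: alternate the two block updates and stop on a small relative change in $f$. The update routines are passed in as callables (`update_B` and `update_A` are placeholder names; any of the three update types below fits this interface):

```python
import numpy as np

def alternating_minimize(f, update_B, update_A, A0, B0, tol=1e-5, max_iter=500):
    # Alternate the two block updates until the relative change in the
    # objective f falls below tol (the stopping rule described above).
    A, B = A0, B0
    f_prev = f(A, B)
    for _ in range(max_iter):
        B = update_B(A, B)   # minimize over B with A fixed
        A = update_A(A, B)   # minimize over A with B fixed
        f_cur = f(A, B)
        if abs(f_prev - f_cur) < tol * abs(f_cur):
            break
        f_prev = f_cur
    return A, B

# Toy usage: plain alternating least squares on f(A, B) = ||Y - AB||_F^2
# (no penalties, no X); it only exercises the loop and stopping rule.
rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 6))
A0, B0 = rng.standard_normal((10, 2)), rng.standard_normal((2, 6))
f = lambda A, B: float(np.sum((Y - A @ B) ** 2))
upd_B = lambda A, B: np.linalg.lstsq(A, Y, rcond=None)[0]
upd_A = lambda A, B: np.linalg.lstsq(B.T, Y.T, rcond=None)[0].T
A, B = alternating_minimize(f, upd_B, upd_A, A0, B0)
```

Because each toy update exactly minimizes its block, the objective is non-increasing, mirroring Proposition 1 below.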

We consider three different types of updates:

Basic updates:

$$\begin{aligned} B_{i+1} &\leftarrow \arg\min_B \tfrac{1}{2}\|Y - XA_iB\|_F^2 + \lambda_2\|B\|_{1,1} && (23) \\ A_{i+1} &\leftarrow \arg\min_A \tfrac{1}{2}\|Y - XAB_{i+1}\|_F^2 + \lambda_3\|A\|_F^2 + \lambda_1\|A\|_{1,1} && (24) \end{aligned}$$

Proximal updates:

$$\begin{aligned} B_{i+1} &\leftarrow \arg\min_B \tfrac{1}{2}\|Y - XA_iB\|_F^2 + \lambda_2\|B\|_{1,1} + \beta_i\|B - B_i\|_F^2 && (25) \\ A_{i+1} &\leftarrow \arg\min_A \tfrac{1}{2}\|Y - XAB_{i+1}\|_F^2 + \lambda_3\|A\|_F^2 + \lambda_1\|A\|_{1,1} + \alpha_i\|A - A_i\|_F^2 && (26) \end{aligned}$$

where the parameters $\alpha_i$ and $\beta_i$ are bounded away from zero and infinity; i.e., they are bounded from both sides.

Prox-linear updates:

$$\begin{aligned} B_{i+1} &\leftarrow \mathcal{S}_{\lambda_2/\beta_i}\big(\tilde{B}_i - g(A_i, \tilde{B}_i)/\beta_i\big) && (27) \\ A_{i+1} &\leftarrow \mathcal{S}_{\lambda_1/\alpha_i}\big(\tilde{A}_i - h(\tilde{A}_i, B_{i+1})/\alpha_i\big) && (28) \end{aligned}$$

where $\mathcal{S}_\kappa(\cdot)$ is the element-wise soft-thresholding function, $\mathcal{S}_\kappa(x) = \mathrm{sign}(x)\max(|x| - \kappa, 0)$, and:

$$\begin{aligned} g(A, B) &= \frac{\partial}{\partial B}\Big(\tfrac{1}{2}\|Y - XAB\|_F^2\Big) = -A^TX^TY + A^TX^TXAB && (29) \\ h(A, B) &= \frac{\partial}{\partial A}\Big(\tfrac{1}{2}\|Y - XAB\|_F^2 + \lambda_3\|A\|_F^2\Big) = -X^TYB^T + X^TXABB^T + 2\lambda_3A && (30) \end{aligned}$$

are the gradients of $f$ with respect to $B$ and $A$, respectively, without the $\ell_1$ penalties. Also, $\beta_i$ and $\alpha_i$ are multipliers that have to be greater than or equal to the Lipschitz constants of $g(A_i, \cdot)$ and $h(\cdot, B_{i+1})$, respectively. Since:

$$\begin{aligned} \|g(A_i, B) - g(A_i, B')\|_F &= \|A_i^TX^TXA_i(B - B')\|_F \le \|A_i^TX^TXA_i\|_F\,\|B - B'\|_F && (31) \\ \|h(A, B_{i+1}) - h(A', B_{i+1})\|_F &= \|X^TX(A - A')B_{i+1}B_{i+1}^T + 2\lambda_3(A - A')\|_F && (32) \\ &\le \big(\|X^TX\|_F\,\|B_{i+1}B_{i+1}^T\|_F + 2\lambda_3\big)\|A - A'\|_F, && (33) \end{aligned}$$

we can set

$$\begin{aligned} \beta_i &= \|A_i^TX^TXA_i\|_F && (34) \\ \alpha_i &= \|X^TX\|_F\,\|B_{i+1}B_{i+1}^T\|_F + 2\lambda_3 && (35) \end{aligned}$$

Finally:

$$\begin{aligned} \tilde{B}_i &= B_i + \omega^B_i(B_i - B_{i-1}) && (36) \\ \tilde{A}_i &= A_i + \omega^A_i(A_i - A_{i-1}) && (37) \\ \omega^A_i,\ \omega^B_i &\in [0, 1) && (38) \end{aligned}$$

are extrapolations which are used to accelerate the convergence. (See Section 4.3 of (Parikh and Boyd, 2014) for more details.) For our algorithm, we choose

$$\omega^A_i = \min\!\left(\frac{t_{i-1}-1}{t_i},\ \delta_\omega\frac{\sqrt{\alpha_{i-1}}}{\sqrt{\alpha_i}}\right), \quad \omega^B_i = \min\!\left(\frac{t_{i-1}-1}{t_i},\ \delta_\omega\frac{\sqrt{\beta_{i-1}}}{\sqrt{\beta_i}}\right), \quad t_i = \frac{1+\sqrt{1+4t_{i-1}^2}}{2}, \ t_0 = 1,$$

for some $\delta_\omega < 1$. The inequalities $\omega^A_i \le \delta_\omega\sqrt{\alpha_{i-1}/\alpha_i}$ and $\omega^B_i \le \delta_\omega\sqrt{\beta_{i-1}/\beta_i}$ are necessary for establishing convergence.

The prox-linear updates need an extra step. If $f(A_{i+1}, B_{i+1}) \ge f(A_i, B_i)$, i.e., there is no descent, the two optimization steps are repeated with no extrapolation ($\tilde{A}_i = A_i$, $\tilde{B}_i = B_i$). This is also necessary for convergence.
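One round of the prox-linear updates (27)–(28), with the Lipschitz-based step sizes (34)–(35), can be sketched as follows (plain NumPy; the extrapolation weight `w` is taken as a fixed constant rather than the $t_i$ schedule, and the no-descent restart rule appears in the usage loop):

```python
import numpy as np

def soft_threshold(V, kappa):
    # Element-wise soft-thresholding S_kappa(v) = sign(v) * max(|v| - kappa, 0)
    return np.sign(V) * np.maximum(np.abs(V) - kappa, 0.0)

def objective(Y, X, A, B, l1, l2, l3):
    # f(A, B) of (6)
    R = Y - X @ A @ B
    return (0.5 * np.sum(R ** 2) + l1 * np.abs(A).sum()
            + l2 * np.abs(B).sum() + l3 * np.sum(A ** 2))

def prox_linear_round(Y, X, A, B, A_prev, B_prev, l1, l2, l3, w):
    # One round of updates (27)-(28) with extrapolation weight w in [0, 1).
    XtX, XtY = X.T @ X, X.T @ Y
    # B-step: extrapolate (36), gradient step on (29), soft-threshold.
    B_t = B + w * (B - B_prev)
    beta = max(np.linalg.norm(A.T @ XtX @ A), 1e-12)   # Lipschitz bound (34)
    g = -A.T @ XtY + A.T @ XtX @ A @ B_t               # gradient (29) at B_t
    B_new = soft_threshold(B_t - g / beta, l2 / beta)
    # A-step: extrapolate (37), gradient step on (30), soft-threshold.
    A_t = A + w * (A - A_prev)
    alpha = np.linalg.norm(XtX) * np.linalg.norm(B_new @ B_new.T) + 2 * l3  # (35)
    h = -XtY @ B_new.T + XtX @ A_t @ B_new @ B_new.T + 2 * l3 * A_t         # (30)
    A_new = soft_threshold(A_t - h / alpha, l1 / alpha)
    return A_new, B_new

# Usage: prox-linear iterations with the no-descent restart rule.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((30, 8)), rng.standard_normal((30, 5))
A = rng.standard_normal((8, 3)); B = rng.standard_normal((3, 5))
A_prev, B_prev = A.copy(), B.copy()
l1 = l2 = 0.1; l3 = 0.05
f = objective(Y, X, A, B, l1, l2, l3)
f0 = f  # initial objective value, kept for reference
for _ in range(50):
    A_new, B_new = prox_linear_round(Y, X, A, B, A_prev, B_prev, l1, l2, l3, w=0.5)
    f_new = objective(Y, X, A_new, B_new, l1, l2, l3)
    if f_new > f:  # no descent: redo the round with no extrapolation
        A_new, B_new = prox_linear_round(Y, X, A, B, A_prev, B_prev, l1, l2, l3, w=0.0)
        f_new = objective(Y, X, A_new, B_new, l1, l2, l3)
    A_prev, B_prev, A, B, f = A, B, A_new, B_new, f_new
```

With `w=0.0` each step is a plain proximal-gradient step with step size $1/\beta_i$ (resp. $1/\alpha_i$), so the restart guarantees the objective never increases.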

### 3.2 Theoretical Results

The theoretical aspects of alternating minimization (also known as block coordinate descent) algorithms have been extensively studied in a wide range of settings with different convexity and differentiability assumptions (Powell, 1973; Grippo and Sciandrone, 2000; Tseng, 2001; Tseng and Yun, 2009; Attouch and Bolte, 2009; Attouch et al., 2010; Xu and Yin, 2013, 2014). A full review of this vast literature is beyond the scope of this paper. Here, we focus on the recent work of Xu and Yin (Xu and Yin, 2013), which considers a setting that matches the problem we study in this paper (a block multi-convex objective with non-smooth penalties). Xu and Yin show that the solutions of the proximal and prox-linear updates described above converge to a stationary point of the objective function, and that if the starting point is close to a global minimum, the alternating scheme converges globally. This conclusion is possible because the objective function in our problem is the sum of a real analytic function and a semi-algebraic function, and hence satisfies the Kurdyka–Łojasiewicz (KL) property. However, their results on convergence do not apply to the basic updates because, although the objective is convex in $B$, it is not strongly convex in $B$. In this section, we first present the results for the proximal and prox-linear updates from (Xu and Yin, 2013) and then provide some novel results for the basic updates.

#### 3.2.1 Definitions and convergence of the objective

###### Definition 1.

$(A^*, B^*)$ is called a partial optimum of $f$ if

$$\begin{aligned} f(A^*, B^*) &\le f(A^*, B), \quad \forall B \in \mathbb{R}^{m \times q}, && (39) \\ \text{and} \quad f(A^*, B^*) &\le f(A, B^*), \quad \forall A \in \mathbb{R}^{p \times m}. && (40) \end{aligned}$$

Note that a stationary point of the objective is a partial optimum but the reverse does not generally hold; see Section 4 of (Wen et al., 2012) for an example.

###### Definition 2.

A point $\bar{C}$ is an accumulation point, or a limit point, of a sequence $\{C_i\}$ if, for any neighbourhood of $\bar{C}$, there are infinitely many indices $i$ such that $C_i$ lies in that neighbourhood. Equivalently, $\bar{C}$ is the limit of a subsequence of $\{C_i\}$.

###### Proposition 1.

The sequence $\{f(C_i)\}$ generated by Algorithm 1 converges monotonically.

The value of $f$ is always non-negative and is non-increasing over each of the two main steps of Algorithm 1. Thus, it is guaranteed that the stopping criterion of Algorithm 1 will be reached.

#### 3.2.2 Proximal and prox-linear updates

Since the sequence of solutions generated by the alternating minimization stays in a bounded, closed, and hence compact set, it has at least one accumulation point. Assuming that the parameters for the updates are chosen as explained above, we can state the following theorem about the accumulation points of the sequence of solutions.

###### Theorem 1.

(Theorem 2.3 from (Xu and Yin, 2013).) For a given starting point $C_0$, let $\{C_i\}$ denote the sequence of solutions generated by the proximal or prox-linear updates. Then, any accumulation point of the sequence is a partial optimum of $f$.

The next theorem states that under some assumptions, if the sequence has a finite accumulation point, it converges to that point.

###### Theorem 2.

(Theorem 2.8 from (Xu and Yin, 2013).) For a given starting point $C_0$, let $\{C_i\}$ denote the sequence of solutions generated by the proximal or prox-linear updates, and assume that it has a finite accumulation point $\bar{C}$. Assuming that the gradient of the smooth part of the objective is Lipschitz continuous and that $f$ satisfies the Kurdyka–Łojasiewicz (KL) property at $\bar{C}$, $\{C_i\}$ converges to $\bar{C}$.

Both of these conditions hold. From (31)–(33) we get the Lipschitz continuity. Also, since $f$ is the sum of real analytic and semi-algebraic functions, it satisfies the KL property (see Section 2.2 of (Xu and Yin, 2013) for details of the KL property).

Finally, it is shown in (Xu and Yin, 2013, Lemma 2.6) that if the starting point is sufficiently close to a global minimizer of $f$, then the sequence of solutions will converge to that point.

The results of the previous section do not hold for the basic updates because the objective is not strongly convex in $B$. In this part, we provide some novel results for the basic updates, with the proofs in the Appendix.

###### Theorem 3.

If the entries of $X$ are drawn from a continuous probability distribution on $\mathbb{R}^{N \times p}$, then:
(i) The solution of (23) is unique if $A_i$ is full rank.
(ii) The objective of (24) is strongly convex, and its solution, if one exists, is unique.

In classical LASSO, this condition on the entries of $X$ is sufficient to achieve solution uniqueness (Tibshirani, 2013). For LASSO, continuity is used to argue that the columns of $X$ are in general position with probability 1 (see Appendix B, Definition 3 for a formal definition of general position). The affine span of any subset of the columns of $X$ has Lebesgue measure zero in $\mathbb{R}^N$ under a continuous distribution, so there is zero probability of drawing a column that lies in the span of the others. If we multiply $X$ by a matrix with full column rank, we retain the same property, and thus the solution of (23) is unique if $A_i$ is full rank. Also, since the objective function is strictly convex in $A$ (due to the elastic net penalty), if (24) has a solution, its solution is unique. Next, we study the properties of Algorithm 1 at convergence.

###### Theorem 4.

Assume that the entries of $X$ are drawn from a continuous probability distribution on $\mathbb{R}^{N \times p}$. For a given starting point $C_0$, let $\{C_i = (A_i, B_i)\}$ denote the sequence of solutions generated by Algorithm 1. Then:
(i) $\{C_i\}$ has at least one accumulation point.
(ii) All the accumulation points of $\{C_i\}$ are partial optima and have the same function value.
(iii) If $A$ is full rank for all accumulation points of $\{C_i\}$, then:

$$\lim_{i \to \infty} \|C_{i+1} - C_i\| = 0. \tag{41}$$

Part (i) follows from the fact that the solutions produced by Algorithm 1 are contained in a bounded, closed (and hence compact) set. Although Algorithm 1 converges to a specific value of $f$, this value can be achieved at different points $(A, B)$; thus, the sequence $\{C_i\}$ can have many accumulation points. Part (ii) of Theorem 4 shows that any accumulation point is a partial optimum, and Proposition 1 implies that for any given starting point, all the associated accumulation points have the same function value. Under the assumption that $A$ is full rank for all accumulation points of $\{C_i\}$, part (iii) provides a guarantee that the difference between successive solutions of the algorithm converges to zero, for both the factor and loading matrices. Although the condition in (41) does not guarantee the convergence of the sequence $\{C_i\}$, it is close enough for practical purposes. Also, note that for finding the number of factors, we require both $A$ and $B$ to be full rank for the final solution. When $A$ is full rank, the solutions to both (23) and (24) are unique, and thus $A$ and $B$ will not change in the following iterations, i.e., the algorithm converges.

### 3.3 Complete Algorithm

Now, we propose Algorithm 2 to solve the optimization problem described in (5)–(7), finding the number of latent factors as well as the factor and loading matrices. Optimal values of $\lambda_1$, $\lambda_2$, and $\lambda_3$ are found via 5-fold cross-validation. Although the performance of our algorithm with the different updates is roughly similar, our simulations show that the prox-linear updates provide the best results, both in terms of prediction accuracy and speed of convergence. This may be due to the local extrapolation, which helps the algorithm avoid small neighbourhoods around certain local minima. In the following, we present results for the prox-linear updates.

## 4 Fully Sparse PCA

In (Zou et al., 2006), Zou, Hastie, and Tibshirani propose sparse PCA, arguing that in regular PCA "each principal component is a linear combination of all the original variables, thus it is difficult to interpret the results". Assume that we have a data matrix $X$ with the singular value decomposition $X = U\Sigma V^T$. The principal components are defined as $Z = U\Sigma$, with the corresponding columns of $V$ as the loadings. Zou et al. show that solving the following optimization problem leads to exact PCA:

$$(\hat{A}, \hat{B}) = \underset{A,B}{\arg\min} \; \|X - XAB\|_F^2 + \lambda\sum_i\|A_{\cdot i}\|_2^2 \quad \text{s.t.} \quad BB^T = I,$$

where $A_{\cdot i}$ denotes the $i$-th column of $A$. Thus, we have $\hat{A}_{\cdot i} \propto V_{\cdot i}$, and $X\hat{A}_{\cdot i}$ corresponds to the $i$-th principal component. Sparse PCA (SPCA) is introduced by adding an $\ell_1$ penalty on matrix $A$ to the objective function. Zou et al. propose an alternating minimization scheme to solve this problem (Zou et al., 2006).

The ordinary principal components are uncorrelated and their loadings are orthogonal. SPCA imposes sparsity on the construction of the principal components; here, sparsity means that each component is a combination of only a few of the variables. By enforcing sparsity, the principal components become correlated and the loadings are no longer orthogonal. On the other hand, SPCA assumes, like regular PCA, that the contributions of these components are orthonormal ($BB^T = I$). In our algorithm, if we replace $Y$ with $X$, i.e., regress $X$ on itself, we get a similar algorithm. However, our work differs in two ways. First, we also impose sparsity on the contributions of the principal components (matrix $B$). This comes at the expense of higher computational cost but yields more interpretable results; the contributions are then no longer orthonormal. However, through the full-rank constraints we impose on the two matrices, we make sure that the principal components and their contributions are linearly independent. Moreover, the algorithm provides a mechanism for learning from the data how many principal components are sufficient to explain the data.

## 5 Simulation Study

In this section, we use synthetic data to compare the performance of our algorithm with several related multivariate regression methods reviewed in the introduction.

### 5.1 Simulation Setup

We generate the synthetic data in accordance with the model described in (2). First, we generate the predictor matrix $X$ with rows independently drawn from a zero-mean multivariate normal distribution whose covariance entries decay with $|i-j|$; this is a common model for predictors in the literature (Yuan et al., 2007; Peng et al., 2010; Rothman et al., 2010). The rows of the error matrix are sampled from a zero-mean multivariate normal distribution with a similar covariance structure, whose scale $\sigma_n$ is varied to attain different levels of signal-to-noise ratio (SNR). Each row of the factor matrix is chosen by first randomly selecting a subset of its elements and sampling them from a standard normal distribution, and then setting the rest of its elements to zero. Finally, we generate the loading matrix as the element-wise product of two matrices, one with elements drawn independently from a standard normal distribution and the other with elements drawn from a Bernoulli distribution with success probability $s$.
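The generation recipe above can be sketched as follows. Since the exact parameter values and matrix symbols are elided in the text, all dimensions, the correlation decay `rho`, the noise scale `sigma_n`, the density `s`, and the names `A` and `B` for the two matrices below are placeholder choices for illustration.

```python
import numpy as np

def generate_data(n=50, p=100, q=50, m=10, rho=0.5, sigma_n=1.0, s=0.3, seed=0):
    """Illustrative generator for the simulation setup described above."""
    rng = np.random.default_rng(seed)
    ip, iq = np.arange(p), np.arange(q)
    # Predictor rows ~ N(0, Sigma_X), with entries decaying as rho**|i - j|.
    X = rng.multivariate_normal(np.zeros(p), rho ** np.abs(ip[:, None] - ip), size=n)
    # Error rows from a similarly structured normal; sigma_n controls the SNR.
    E = sigma_n * rng.multivariate_normal(np.zeros(q), rho ** np.abs(iq[:, None] - iq), size=n)
    # Factor matrix A: a few randomly chosen nonzero entries per row.
    A = np.zeros((p, m))
    nnz = max(1, int(round(s * m)))
    for i in range(p):
        cols = rng.choice(m, size=nnz, replace=False)
        A[i, cols] = rng.standard_normal(nnz)
    # Loading matrix B: Gaussian entries masked by Bernoulli(s) draws.
    B = rng.standard_normal((m, q)) * rng.binomial(1, s, size=(m, q))
    Y = X @ A @ B + E
    return X, Y, A @ B
```

The third return value is the implied coefficient matrix, which is what the support-recovery metrics below compare against.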

We evaluate the performance of a given algorithm with three different metrics. First, we evaluate the predictive performance over a test set $(X_{\text{test}}, Y_{\text{test}})$, separate from the training set, in terms of the mean-squared error:

$$\text{MSE} = \frac{\|X_{\text{test}} \hat{D} - Y_{\text{test}}\|_F^2}{nq}, \qquad (42)$$

where $\hat{D}$ is the estimated coefficient matrix; in our case, $\hat{D}$ is the product of the estimated factor and loading matrices. We also compare different algorithms based on the signed sensitivity and specificity of their support recovery:

$$\text{Signed Sensitivity} = \frac{\sum_{i,j} \mathbb{1}[d_{i,j} \cdot \hat{d}_{i,j} > 0]}{\sum_{i,j} \mathbb{1}[d_{i,j} \neq 0]}, \qquad \text{Specificity} = \frac{\sum_{i,j} \mathbb{1}[d_{i,j} = 0] \cdot \mathbb{1}[\hat{d}_{i,j} = 0]}{\sum_{i,j} \mathbb{1}[d_{i,j} = 0]},$$

where $\mathbb{1}[\cdot]$ represents the indicator function.
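These metrics are straightforward to compute from the true and estimated coefficient matrices; a minimal sketch, with `D` denoting the true coefficient matrix and `D_hat` its estimate:

```python
import numpy as np

def mse(X_test, Y_test, D_hat):
    """Test MSE as in (42): squared Frobenius error averaged over n*q entries."""
    n, q = Y_test.shape
    return np.linalg.norm(X_test @ D_hat - Y_test, "fro") ** 2 / (n * q)

def signed_sensitivity(D, D_hat):
    """Fraction of true nonzeros recovered with the correct sign."""
    return np.sum(D * D_hat > 0) / np.sum(D != 0)

def specificity(D, D_hat):
    """Fraction of true zeros estimated as exactly zero."""
    return np.sum((D == 0) & (D_hat == 0)) / np.sum(D == 0)
```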

We compare the performance of our algorithm, SMFR, with the other algorithms reviewed in Section 1, as well as a baseline algorithm with a simple ridge penalty. We consider three different regimes: (i) high-dimensional problems with few instances (50) compared to the number of predictors or responses (50, 100, or 150); (ii) problems with an increased number of instances; and (iii) problems where the structural assumption of our technique is violated. In the first regime, which is of most interest to us due to the high dimensionality, we explore different parameter settings. The noise level $\sigma_n$ and the sparsity parameter $s$ affect the SNR: lower noise and denser coefficients correspond to higher values of SNR. In regime (iii), we violate the assumed factor structure of the coefficient matrix in two ways: in the first case, the coefficient matrix has element-wise sparsity; in the second, it has row-wise sparsity, where a fraction of the rows are all zeros and the rest contain non-zero elements. We consider these cases to compare our algorithm with others in an unfavourable setting.

### 5.2 Results

##### Predictive performance

The means and standard deviations of MSE for the different algorithms are presented in Table 1. We use five-fold cross-validation to find the tuning parameters of all algorithms, and we set the maximum number of factors to 20. For the first two simulation regimes, our algorithm outperforms the others, with lower MSE means and standard deviations. The improvements are most pronounced in the high-dimensional settings with high SNR; however, even in settings with low SNR or a high number of instances, we still observe lower errors for SMFR. On average, our algorithm reduces the test error relative to each of the competing methods, including LASSO, SRRR, RemMap, SPLS, the trace-norm penalty, and ridge regression.

In the last two simulations, where the assumed factor structure is abandoned, our algorithm has no advantage over simpler methods with no factor structure (such as LASSO) and gives a higher error.

##### Variable selection

In Figure 1, we compare the average signed sensitivity and specificity of the different algorithms (based on 20 simulation runs) as the number of instances increases (other parameters kept fixed). We observe that our algorithm has higher sensitivity and specificity, although the advantage in specificity shrinks as the number of instances increases. This shows that our algorithm is most advantageous in high-dimensional settings where the number of instances is comparable to or smaller than the number of predictors/responses. Although we only show the plots for a specific parameter setting, the results are similar for other parameters.

Figure 1: Sensitivity and specificity comparison of different algorithms as the number of instances increases ($p=150$, $q=50$, $m=10$, $m_0=1$, $\sigma_n=3$, $s=0.3$).
##### Number of latent factors

In Table 2, we compare the number of estimated factors for the three algorithms that perform dimensionality reduction. For SRRR and SPLS, we find the number of factors by 5-fold cross-validation. The true number of factors is mentioned in the fourth column. We observe that our algorithm provides better estimates of the number of factors.

##### Computation time

We also compare the computation times of the different algorithms for the parameter settings corresponding to the first row of Table 1. We report the median computation time (on a PC with 16GB RAM and a quad-core CPU at 3.4GHz) of each algorithm (excluding cross-validation) over 20 runs: SMFR (with prox-linear updates): 1.6s; SRRR: 5.6s; RemMap: 0.3s; SPLS: 0.3s; LASSO: 0.02s; and the remaining baseline: 0.33s. Since the algorithms are implemented in different programming languages and the stopping criteria vary slightly, care must be taken when interpreting these results. The main message is that SMFR and SRRR have longer computation times, since they solve more complicated problems which involve estimating the number of factors along with the loading and factor matrices. This extra information has value in itself and provides more insight into the structure of the data. Also, SMFR is almost three times faster than SRRR.

##### SMFR initialization

In all the experiments above, and in those of the next section, we use a random initialization for our algorithm, in which the elements of the two matrices are drawn independently at random. We compare this initialization with two other methods based on matrix factorizations of other solutions: (i) computing the SVD of the LASSO solution and initializing the factor and loading matrices with its leading left and right singular vectors (scaled by the singular values), keeping columns up to the maximum number of factors; and (ii) doing the same with the SVD of the trace-norm solution. We compare these three initializations for the case where the data is generated according to the first regime of Table 1 and vary the level of noise and sparsity. The results are shown in Table 3. The procedure to find the number of effective factors is the same for all three initializations. As a baseline comparison, we also show the results for ordinary LASSO regression in the last row. The results reported in Table 3 are based on 20 runs (i.e., 20 different realizations of the data). It is also interesting to examine how the results change across different random initializations for a fixed realization of the data. In Table 4, we show the results for 5 realizations of the data. For each realization, we run our algorithm with 20 different random initializations and report the mean and standard deviation of the MSE over the test set. For comparison, we also include the MSE achieved by the decompositions of the LASSO and trace-norm solutions.
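Both SVD-based initializations reduce to the same operation: factor a preliminary coefficient estimate (the LASSO or trace-norm solution) and truncate it to at most the maximum number of factors. A sketch, where the rank tolerance `1e-12` is an illustrative choice:

```python
import numpy as np

def svd_init(D_hat, m_max):
    """Initialize the long and wide matrices from the truncated SVD of a
    preliminary coefficient estimate D_hat (p x q)."""
    U, s, Vt = np.linalg.svd(D_hat, full_matrices=False)
    k = min(m_max, int(np.sum(s > 1e-12)))  # keep at most m_max effective factors
    A0 = U[:, :k] * s[:k]                   # p x k: left singular vectors, scaled
    B0 = Vt[:k, :]                          # k x q: right singular vectors
    return A0, B0
```

By construction, `A0 @ B0` is the best rank-`k` approximation of `D_hat`, so the initialization is exact whenever the preliminary estimate already has rank at most `m_max`.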

The results in both tables show that our algorithm is very robust to the choice of the starting point: random initialization gives similar (or slightly better) results compared to the more sophisticated initializations. Moreover, the very small standard deviations in the first row of Table 4, together with the similarity of the errors across all three types of initialization, show that for a given realization of the data, different initializations lead to very similar estimates.

## 6 Application to Real Data

We apply the proposed algorithm to real-world datasets and show that it exhibits predictive performance better than or similar to state-of-the-art algorithms. We also show that the factor structure identified by our algorithm provides valuable insight into the underlying structure of the datasets.

### 6.1 Montreal’s bicycle sharing system (BIXI)

The first dataset we consider provides information about Montreal's bicycle sharing system, BIXI. The data contains the number of available bikes at each of the 400 installed stations, recorded every minute. We use the data collected over the first four weeks of June 2012. From this dataset, we form the set of predictors and responses as follows. We allocate two features to each station, corresponding to the hourly numbers of bike arrivals at and departures from that station. The learning task is to predict the number of arrivals and departures for all the stations from the numbers of arrivals and departures in the previous hour (i.e., a vector autoregressive model). The choice of this model is a compromise between accuracy and complexity. Mathematically, we want to estimate a coefficient matrix that maps the vector of arrival and departure counts at all stations in hour $t-1$ to the corresponding vector of counts in hour $t$.
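The lag-one autoregressive design can be assembled by simply shifting the hourly count matrix by one row. The `(hours, 2 * stations)` layout assumed below, with arrival and departure columns per station, is an illustrative choice rather than the paper's exact data format.

```python
import numpy as np

def make_var_dataset(counts):
    """Build a first-order vector-autoregressive regression problem from a
    (T, 2 * n_stations) matrix of hourly arrival/departure counts."""
    X = counts[:-1]  # predictors: counts in hour t - 1
    Y = counts[1:]   # responses:  counts in hour t
    return X, Y
```

Each row of `X` is then paired with the next hour's row in `Y`, so a multivariate regression of `Y` on `X` estimates the autoregressive coefficient matrix directly.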

We perform the prediction task on each of the four weeks. For each week, we take the data for the first 5 days (120 data points; Friday to Tuesday) as the training set (with the fifth day's data as the validation set) and the last two days (48 data points; Wednesday and Thursday) as the test set. In Table 5, we compare the algorithms performing dimensionality reduction in terms of their predictive performance on the test sets and the number of chosen factors; we also include LASSO as a baseline algorithm. To avoid showing very small numbers, we present scaled values of the MSE defined in (42). In terms of prediction performance, we observe that our algorithm outperforms the others in all 4 weeks. We can also compare the algorithms in terms of the number of chosen factors. For our algorithm, we set the upper bound on the number of factors to 15, and for the others, we cross-validate over 1 to 15 factors (15 different values). SMFR always uses all 15 factors, while the other two algorithms choose fewer factors (using cross-validation).