1 Introduction
In this article we consider the pair , with and . Define and . We initially assume the linear model
(1) 
where
are realizations of an i.i.d. random variable with mean zero and covariance matrix
, and . We will refer to the matrix of error as . Under mild assumptions a consistent estimator ofis the ordinary least squares (OLS) estimator of
If are i.i.d. and the estimator is the MLE. This estimator does not use the other responses, ignoring potentially useful information.
Throughout this paper for a vector
define as the norm and for a matrix we define as the entrywise norm. If there is a priori information that the fitted values of response and should be close then we could impose a penalty on the difference in the fitted values and consider the estimators(2) 
where is a tuning parameter controlling the amount of agreement between the two fitted values vectors. We propose an objective function that generalizes (2) for multiple responses from multiple clusters that may not be known a priori. The proposed objective function also includes an penalty for simultaneous estimation and variable selection, which allows our method to be used to increase prediction accuracy, select relevant variables for each response, and detect groupings of response variables without assuming or estimating a covariance structure. In our theory, simulations, and applied examples we consider cases where . We extend the proposed method to the generalized linear model framework, specifically focusing on multiple binary responses. This extension allows the method to be used in many different contexts, such as understanding comorbidities related to patient information recorded in electronic medical records, or product level purchasing habits of customers based on information obtained from a loyalty program. We propose a coordinate descent algorithm for the least squares case and proximal coordinate descent algorithm for the binomial GLM case, which provides a general framework for extending the method to other GLM or Mestimator settings.
Our work has been influenced by previous work in estimating high dimensional models. When the penalty function is equivalent to a ridge penalty (Hoerl and Kennard, 1970) on the difference of the coefficient vectors for the two responses. We add the penalty as proposed in Tibshirani (1996) to do simultaneous variable selection and estimation. Similar to the work of Zou and Hastie (2005) we combine the ridge and penalties. The proposed estimator simultaneously estimates clusters of the response and fuses the fitted values of the clustered responses. Previous work has been done on clustering covariates for high dimensional regression with a univariate response. This work is most similar to the work of Witten et al. (2014) who proposed the cluster elastic net (CEN) that simultaneously estimates clusters of covariates and fuses the effects of covariates within the same cluster. Our proposed method is also similar to Grace estimators proposed in Li and Li (2008) and Li and Li (2010), which use regularization based on external network information to minimize the difference of coefficients for related predictors and use a lasso penalty for sparsity. Huang et al. (2011) proposed the sparse Laplacian shrinkage method, which preforms variable selection and promotes similarities among coefficients of correlated covariates. Zhao and Shojaie (2016) proposed the Grace Test, a testing framework for Grace estimators, that allows for some uncertainty in the graph and showed that if the external graph is informative it increases the power of the Grace test. Bühlmann et al. (2013) proposed two different penalized methods for clustered covariates in highdimensional regression: cluster representative lasso (CRL) and cluster group lasso (CGL). In CRL the covariates are clustered, dimension reduction is done by replacing the original covariates with the cluster centers and a lasso model is fit using the cluster centers as covariates. In CGL the group penalty of Yuan and Lin (2005) is applied using the previously found clusters as the groups. Zhou et al. (2017) demonstrated that averaging over models using different cluster centers for both responses and predictors can improve prediction accuracy of DNase I hypersensitivity using gene expression data. Kim et al. (2009) proposed graphguided fused lasso (GGFL) to the specific problem of association analysis to quantitative trait networks. GGFL presents a fused lasso framework in multivariate regression that leverages correlated traits based on a network structure. Our work is related to the fused lasso literature as well, though we do not achieve exact fusion (Tibshirani et al., 2005; Rinaldo, 2009; Hoefling, 2010; Tibshirani, 2014). The proposed method differs from the works mentioned in this setting because it focuses on using correlation between the response variables to improve estimation, however all of the works mentioned were instrumental in helping us derive our final estimator.
The idea of using information from different responses to improve estimation in multivariate regression is not new and our work builds upon previous works in this area. Breiman and Friedman (1997) introduced the Curds and Whey method whose predictions are an optimal linear combination of least squares predictions. Rothman et al. (2010) proposed multivariate regression with covariance estimation (MRCE), which is a penalized likelihood approach to simultaneously estimate the regression coefficients and the inverse covariance matrix of the errors. MRCE leverages correlation in unexplained variation to improve estimation, while our proposed method leverages correlation in explained variation to improve estimation. Other estimators assume both the response and covariates are multivariate normal and exploit this structure to derive estimators (Lee and Liu, 2012; Molstad and Rothman, 2016). Rai et al. (2012) proposed a penalized likelihood method for multivariate regression that simultaneously estimates regression coefficients, the inverse covariance matrix of the errors, and the covariance matrix of the regression coefficients across responses using lasso type penalties. Peng et al. (2010) introduced regularized multivariate regression for identifying master predictors (remMap), which relies on a priori information about valuable predictors and imposes a group and norm, across responses, on all covariates not prespecified as being useful predictors. Kim and Xing (2012) proposed the tree guided group lasso, which uses an a priorihierarchical clustering of the responses to define overlapping group lasso penalties for the multivariate regression model. They propose a weighting method that ensures all coefficients are penalized equally, while using the hierarchical structure to impose a similar sparsity structure across highly correlated responses.
Another approach to improving efficiency is by doing dimension reduction on to find a smaller subspace that retains the material information needed for estimation of the regression coefficients (Cook et al., 2010; Cook and Zhang, 2015; Sun et al., 2015). Cook et al. (2010) introduced the envelope estimator for the multivariate linear model, which projects the maximum likelihood estimator onto the estimated subspace with the material information. Cook and Zhang (2015) provided envelope models for GLMs and weighted least squares. Sun et al. (2015) proposed a sparse regression model (SPReM) for estimating models where is very large. SPReM projects the response variables into a lowerdimensional space while maintaining the structure needed for a specific hypothesis test. The key difference between our proposed method and these approaches is that we are interested in simultaneously estimating clustering of the response variables and fusing the fitted values from responses within the same cluster.
The proposed method simultaneously estimates clusters of the response and coefficients. Changes in cluster groups are discrete changes and as a result our objective function is discontinuous, similar to kmeans clustering, thus making it difficult to derive an efficient algorithm that will find the optimal estimates for coefficients and groups. Witten et al. (2014) dealt with a similar difficulty for the CEN estimator, but noticed that if the groups are fixed then the problem is convex, while if the regression coefficients are fixed the problem becomes a kmeans clustering problem. We modify the approach proposed in Witten et al. (2014) to our problem of grouping responses and extend the approach to the case of generalized linear models, specifically the binomial logistic model. In our theoretical results we assume the clustering groups are known, but the problem remains challenging as we are dealing with multiple responses, allow for and for to increase with .
In Section 2 we present our method for the multivariate linear regression model and provide theoretical results, including consistency of our estimator, to better understand the basic properties of the penalized likelihood solution. In Section 3 we provide details on the twostep iterative algorithm and show estimating the regression coefficients for the different clusters is an embarrassingly parallel problem, which is a property of our cluster fusion penalty that fuses within group fitted values. This avoids issues that would arise in fusing all possible combinations of regression coefficients, or having to specify a fusion set
a priori. Examples of the issues that can arise can be found in Price et al. (2017), who discussed the importance of choosing the fusion set, and the original fused lasso paper which fused only consecutive coefficients (Tibshirani et al., 2005). In Section 4 we present the model for binomial responses along with an algorithm, demonstrating how the use of the cluster fusion penalty can exploit relationships of response variables beyond the traditional Gaussian problem. Simulations for both conditional Gaussian and binomial responses are presented in Section 5. The least squares version of our method is applied to model baby birth weight, placental weight and cotinine levels given maternal gene expression and demographic information. The binomial case is applied to model concession stand purchases using customer information as covariates. Both applied analysis are presented in Section 6. We conclude with a summary in Section 7.2 Least Squares Model
2.1 Method
First, we consider estimating (1) when there are unknown clusters of the responses. We further assume that for all , and for all . The model requires parameters to be estimated for prediction, which is problematic when or are large. Let be a partition of the set . For a set define as the cardinality of that set. We propose the multivariate cluster elastic net (MCEN) estimator as
(3) 
where is the number of clusters and and are nonnegative user specified tuning parameters. In addition , the total number of clusters, can be considered a tuning parameter. The cluster fusion penalty, associated with tuning parameter , is used to exploit similarities in the fitted values. The lasso penalty, with tuning parameter , is used to perform simultaneous estimation and variable selection. When or , the optimization in (3) reduces to independent lasso penalized least squares problems with tuning parameter . If is known then the optimization in (3) can be split into independent optimizations that are similar to the optimizations presented in Li and Li (2008), Li and Li (2010), and Witten et al. (2014) and can be solved in parallel. We exploit this computational feature in our algorithm, which is a result of using the cluster fusion penalty.
The proposed method uses a combination of and penalties as proposed by Zou and Hastie (2005). Similar methods have been proposed for grouping the effects of predictors with a univariate response such as CEN (Witten et al., 2014) and Grace estimators (Li and Li, 2008, 2010; Zhao and Shojaie, 2016). Kim and Xing (2012) proposed a method that uses a predetermined hierarchical clustering of the responses that provides an penalty for all coefficients and a group penalty for responses that are grouped together. Chen et al. (2016) proposed a method using conjoint clustering to incorporate similarities in preferences between individuals in conjoint analysis. This method does not simultaneously estimate coefficients and groupings. It requires a twostep algorithm to estimate the number of clusters, and then estimates coefficients using regularization based on the estimated cluster. The proposed approach uses nonhierarchical clusters, allows for the clustering structure to be unknown before estimation of the coefficients and focuses more on imposing similar fitted values for grouped responses, compared to directly imposing a similar sparsity structure.
Selecting the triplet, , of tuning parameters can be done by Kfold cross validation minimizing the squared prediction error. Let be the set of indices in the th fold, , and be the estimated regression coefficient vector using , and for response produced from the training set with removed. Then select the triplet, , that minimizes
(4) 
2.2 Theoretical Results
For theoretical discussions we assume that is known for some fixed value of . This is because for unknown the objective function in (3) is discontinuous because of the discrete changes in groups, however if is known (3) is a convex function. In this section we will look at properties of the MCEN estimator for the special case of fixed and with . In addition, we present a consistency result that allows for when and .
Thus, the first two theorems refer to the following estimator
(5) 
The estimator does not simultaneously estimate the groups, it assumes they are known a priori, and thus is different than . There are instances where the grouping structure is known before data analysis and thus using would be preferable in practice. In addition is a key component to the algorithm discussed in Section 3. We begin by relating the estimator in (5) to ordinary least squares (OLS), for the special case of . Removing the penalty allows us to derive a closed form for the estimator.
Assume , , and and are fixed values. Define to be the OLS estimates for the response variables and be the solution to (5) with tuning parameter . Given then has the closed form solution of
(6) 
Theorem 2.2 provides some intuition about the MCEN estimator. As increases the MCEN estimator approaches a weighted average of the OLS coefficients within a cluster. In addition the results from Theorem 2.2
can be used to calculate the bias and variance of
, which are needed for proving Theorem 2.2. The proof of Theorem 2.2 and the following Theorems can be found in the appendix.Assume for all and and for , where . Set , then for a fixed and where there exists a positive such that
(7) 
where are the true regression coefficients, is as defined in Theorem 2.2 and is as defined in (5).
Similar to ridge regression Theorem
2.2 shows that for some positive the estimator from (5) has a smaller mean squared error than OLS. Note, we are not assuming that for that and unless this condition holds the estimator is biased. Thus, there exists a value of for which there is a favorable biasvariance trade off.Next we examine the asymptotic performance of the estimator with the penalty. At times it will be easier to refer to a vectorized version of a matrix and for any matrix , . Where is the vector formed by stacking the columns of . Define as the set of active predictors. That is, S is a subset of where if . The subspace for the active predictors is
The parameter space will be separated using projections of vectors into orthogonal complements. We define a projection of a vector into space as
The orthogonal complement of space is
The following set is central to our proof of consistency,
For our proof of the consistency of we make the following six assumptions:

Define to be the th column vector of , then has the condition that .

Define as the error vector for response . The error vector has a mean of zero and subGaussian tails for all . That is, there exists a constant such that for any , with ,
Define .

Define , where is the standard Kronecker product. There exists a positive constant such that

There exists a positive constant such that .

Given , if then , for all and .

Define
as the maximum eigenvalue of square matrix
and as the matrix of true predictors for cluster , where the th predictor is a true predictor if for any . There exists a positive constant such that
Assumption A1 is a standard assumption for lassotype penalties and can be achieved by appropriately scaling the covariates, which is commonly done in penalized regression. Assumption A2 is a generalization of the subGaussian error assumption for penalized regression for a univariate response. Assumption A1 could be relaxed to allow for certain unbounded covariates, but then A2 would be replaced by assuming the errors are normally distributed
(Candes and Tao, 2007; Meinshausen and Yu, 2009). Assumption A3 is a generalization of the common restricted eigenvalue assumption. Motivation for assumption A3 is discussed in great detail by Negahban et al. (2012) and a version for has been used in several works analyzing asymptotic behaviors of the lasso estimator (Bickel et al., 2009; van de Geer and Bühlmann, 2009; Meinshausen and Yu, 2009). Assumptions A4 and A5 provide that the true coefficients are similar for responses in the same group. Assumption A5 provides that they have the same sparsity structure. While, assumption A4 ensures that the difference in the nonzero elements can be bounded by a finite constant, even if the number of predictors increases with . Assumption A6 assumes the maximum eigenvalues of the sample covariance of the true predictors are bounded, a common assumption in highdimensional work. Assumptions A4A6 can be replaced by an assumption similar to assumption A2 from Witten et al. (2014) that if then , for all , thus the bias of the MCEN estimator only comes from the penalty.Using assumptions A3 and A5 we can provide a closed form definition of the asymptotic bias when . This relationship will be central to our proof of consistency of .
Let be an ssparse matrix, whose column vectors are all sparse and to be a positive definite matrix. Assume and are fixed values. Define,
Assume then has closed form solution,
Corollary 2.2 provides insight into what would converge to for a fixed . Knowing this exact relationship is used in our consistency proof because it allows us to understand the exact nature of the bias caused by the penalty and for going to zero at a given rate we can show that the bias is asymptotically negligible.
Let be an ssparse matrix, whose column vectors are all sparse and to be a positive definite matrix. Given , and assumptions A1A6 hold then there exist constants , , and such that
(8) 
with probability at least
. The convergence rate derived is similar to rates found in lassotype estimators with a univariate response, with replacing to accommodate for the multiple responses (Bickel et al., 2009; Candes and Tao, 2007; Meinshausen and Yu, 2009; Negahban et al., 2012). Thus, under the conditions of Theorem 2.2 if then . Our results prove consistency of our estimator when the group structure is known. Zhao and Shojaie (2016) propose the Grace test for an estimator with a similar penalty for grouping predictors with a univariate response and establish asymptotic results that allow for inference even if there is some uncertainty to the grouping structure.3 Algorithm
The optimization in (3) is discontinuous because of the estimation of cluster assignments. To simplify the optimization we propose an iterative algorithm that alternates between estimating the groups with the regression coefficients fixed, and estimating the regression coefficients with the groups fixed. If the clusters are known (5) then it is a convex optimization problem that can be solved by a coordinate descent algorithm. Let , define as the th column of . The super script denotes the th element of the vector has been removed, and is th diagonal element of . Define . To solve (5), we use a coordinate descent algorithm where each update is preformed by
(9) 
Thus, (5) is solved by iterating through and until the solution converges, similar to other coordinate descent solutions (Witten et al., 2014; Li and Li, 2010, 2008; Friedman et al., 2008). If is known then the solution to (3) reduces to the well studied kmeans clustering problem. Recognizing this, we propose a twostep iterative procedure to obtain a local minimum. To start the algorithm an initial estimate of or is needed. We propose initializing the regression coefficients for the different responses separately with the elastic net estimator of response of
(10) 
where represents the th iterative estimate of . Given a fixed we propose the following algorithm.

Begin with initial estimates, .

For the th step, where , repeat the steps below until the group estimates do not change:

Hold fixed and minimize,
(11) The above can be solved by performing means clustering on the dimensional vectors .

Convergence is reached once the groups at the th and th iteration are the same. The optimization in (5) is separable with respect to , and results in independent optimization problems that can be solved in parallel. The algorithm for (5) can be solved in solution path type form where we iterate across different values of in a similar fashion as proposed in the glmnet algorithm (Friedman et al., 2008)
. If all of the initial elastic net estimators are fully sparse, we set the solution to be a zero matrix and thus following
Friedman et al. (2008), initialize the algorithm by beginning the sequence with atOur twostep approach is closely related to the CEN algorithm proposed by Witten et al. (2014), who proposed a twostep algorithm where the two steps are solved by coordinate descent and kmeans algorithms. The major difference in our proposal is that we cluster the responses rather than the predictors, and have the ability to solve the optimization in parallel due to the nature of our regularization in a multiple response setting.
4 Binomial Model
4.1 Method
Next we extend the multivariate cluster elastic net to generalized linear models. We focus specifically on the binomial response case, but our discussion here will scale to other exponential families. A fusion penalty has been proposed for merging groups from a multinomial response (Price et al., 2017), but our method differs as it aims to leverage association between multiple binomial responses. Kasap et al. (2016)
proposed an ensemble method that combines association rule mining and binomial logistic regression via a multiple linear regression model. Our method differs from this by simultaneously estimating the clusters of the response variables and estimating the regression coefficients. An example is
customers, with covariates, such as demographic and historic purchasing variables, and indicators of product purchasing statuses for each customer. You could run independent models, but this would not allow for modeling the relationship between the different products. Extending the multivariate cluster elastic net to multiple binomial responses would allow us to group products by purchase probabilities to identify and use relationships between products. This could also be used to create a probabilistic model for diseases based on patient demographic and medical information.For the linear model we ignore the intercept term as it can be removed by appropriately scaling and . This is not possible in logistic regression, therefore the model needs an intercept term. We define , , as the kth column vector of and . The true coefficients for response is defined as , , is the matrix with the first row, the row of intercept coefficients, of removed and is the th column vector of . In this model is an independent draw from
(13) 
where
(14) 
The penalized negative loglikelihood function is
(15) 
4.2 Algorithm
We propose solving (15) by approximating it with a penalized quadratic function similar to the glmnet algorithm proposed by Friedman et al. (2008). Define,
(16) 
To implement this approximation we define
(17)  
(18)  
(19) 
Note that is just the first order Taylor approximation of , and that is the conditional variance of given . Define and .
The MCEN estimator for the binomial model is
(20) 
If the groups are known a priori the solution is
(21) 
For same length vectors and let represent the component wise multiplication of the two vectors. To solve (21), we use a proximal coordinate descent algorithm where each update is performed by
(22) 
where
The final solution is found by iterating through and until convergence. Again this is a solution similar to the glmnet algorithm proposed by Friedman et al. (2008).
To solve (20), we propose an algorithm that is similar in nature to the penalized least squares solution proposed in Section 3. The main difference is that we solve (20) with fixed using an iteratively reweighed least squares (IRWLS) solution with a proximal coordinate descent algorithm. The initial estimator for each response is done separately with
(23) 
The following is our proposed algorithm for estimating (20).

Begin with initial estimates of .

For the th step, where , repeat the steps below until the group estimates do not change:

Hold fixed and minimize
(24) The above can be solved by performing means clustering.

The triplet can be selected using KFold cross validation maximizing the validation loglikelihood. Let be the set of indices in the th fold and be the estimated probability for observation and response produced from the model with removed using , and . Specifically we select the triplet that maximizes
(26) 
The quadratic approximation defined by (19), is a standard technique used to estimate parameters in generalized linear models, making this framework and our algorithm scalable to other exponential family settings (Faraway, 2006). Tuning parameter selection would then be done by updating (26) with the appropriate likelihood.
5 Simulations
5.1 Gaussian Simulations
In this section we compare the performances of the MCEN estimator (3), the true MCEN (TMCEN) (5), with clustering structure known a priori, the separate elastic net (SEN) estimator (10), the joint elastic net (JEN) estimator
(27) 
and the treeguided group lasso (TGL) (Kim and Xing, 2012). Define as the th row vector of matrix . Given a tree with vertices , where each node is associated with group define as a vector of the th predictors from responses in group . The TGL estimator is
(28) 
where are weights that can vary with the nodes. See Kim and Xing (2012) for a detailed presentation of TGL, including how the weights, , are derived.
The JEN and SEN models are fit using the glmnet package in R (Friedman et al., 2008). Tuning parameters for all methods are selected using 10folds cross validation. For the MCEN and TMCEN methods cluster sizes of 2, 3 and 4 are considered. We include the TMCEN estimator for two reasons. First, in practice the TMCEN estimator could be used if the practitioner has a predetermined clustering of the responses. Second, the TMCEN is useful as a benchmark to compare with the MCEN estimator because if the grouping of responses is useful TMCEN provides the optimal grouping.In all of the simulations the sample size is 100 and the number of responses is 15. For the number of covariates we considered 12, 100 and 300. Next we define how the covariates are generated and then will present the generating process for the response variables.
Define with entries and , for . Let be a matrix with all entries equal to zero. The covariates are generated by , where for and otherwise
with .
For a group of responses we define the grouped coefficients as , where is a constant and with each element equal to . In the case of the matrix of coefficients is
otherwise
Define with being the entry for the th row and th column of . The generating process for the response is
(29) 
where , and , for not equal to . In all simulations we set the sample size to 100, perform 50 replications and with set consider the following 9 different combinations for the true coefficient matrix,
Models are fit using the training data with a sample size of 100. The tree for TGL is defined by performing completelinkage hierarchical clustering on the responses in the training data. In addition we generate 1000 additional testing samples to assess the prediction accuracies of the models. Let represent the th training sample for the th response and represent a predicted value of that sample and response. The average squared prediction error (ASPE) is defined as
(30) 
We also report the mean squared error (MSE) of the estimators where for an estimator
(31) 
In addition we report the number of true variables selected (TV), out of a maximum of 60 for and 150 otherwise, and the number of false variables selected (FV). Box plots of the statistics for and the different combinations of and are reported in Figures 1–4. These results show that TMCEN generally outperforms all methods in terms of ASPE and MSE. The one exception being when , particularly for larger values of , TGL is competitive with or outperforms TMCEN. For larger values of we expect more bias in the MCEN and TMCEN solutions and our simulation setting is favorable to TGL because the sparsity structure is the same for responses in the same cluster. With regards to ASPE and MSE, MCEN generally does better than TGL when . This suggests that the MCEN approach is advantageous with several smaller signals, but the signals need to be strong enough to correctly identify the clustering of the responses. The MCEN method also outperforms JEN and SEN in terms of ASPE and MSE, except in the case of where JEN outperforms MCEN. In this case the signal is too small resulting in the MCEN method not finding the true clustering structure, and thus the grouping penalty will not be optimal. The MCEN and TMCEN methods tend to pick a larger model than SEN, but a smaller model than JEN. This results in the MCEN and TMCEN methods correctly choosing more true predictors than SEN and fewer false positive predictors than JEN for weaker signal cases. In terms of variable selection MCEN and TMCEN tend to do better than TGL in terms of both true and false variable selection. For the stronger signal cases the SEN approach does the best in terms of variable selection, tending to have the maximum number of true covariates selected, while a smaller number of false covariates selected. Similar conclusions can be derived for the plots of and , which are available in the supplementary material.