1 Introduction
The rise of high-throughput technology has been a major boon for answering complex biological questions. Examples include measuring gene transcription on a genome-wide scale using microarrays, screening genetic mutant libraries, and screening for chemicals in human tissues using mass spectrometry. Analysis of these data poses statistical and computational challenges. In this note, we tackle the problem of fitting sparse multivariate linear models to high-throughput data. We induce model sparsity using an L1 penalty, and consider the case when the response matrix and the covariate matrices are large.
Suppose that $Y$ is an $n \times m$ response matrix, with rows annotated by an $n \times p$ covariate matrix $X$ and columns annotated by an $m \times q$ covariate matrix $Z$. Consider the linear model

(1) $Y = X B Z^\top + E,$

where the $p \times q$ coefficient matrix $B$ needs to be estimated and $E$ is the matrix of errors. For example, in a high-throughput chemical screen of a library of mutants, the response matrix $Y$ would consist of the colony sizes from growing the library of mutants in a variety of chemical conditions. Each row would be a separate run of the experiment; each column would represent a specific genetic mutant strain. The matrix $X$ would consist of information on the nature and doses of the chemical media in which the mutants were grown, and the matrix $Z$ would have information on which gene was mutated. The linear model allows us to model the effect of both the genes and the chemicals on colony size.
We consider the scenario when the entries in $E$ are independently distributed with mean zero and the same variance. If the rows are iid but the columns are correlated, we can transform the data so that the entries are uncorrelated. The estimation then reduces to finding the least-squares estimates, which have a closed-form solution that can be computed quickly even in high dimensions [Xiong et al., 2011]. However, in many problems $B$ is expected to be sparse, or we may want to use a sparse $B$ for prediction and interpretation. In such settings, the convex LASSO or L1 penalty is added to the least-squares criterion:

(2) $\min_B \; \tfrac{1}{2}\|Y - X B Z^\top\|_F^2 + \lambda \sum_{j,k} |b_{jk}|,$
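The unpenalized least-squares estimate mentioned above factors along the two covariate matrices, so the normal equations split into two small linear systems. A minimal NumPy sketch (our implementation is in Julia; the function and variable names here are illustrative):

```python
import numpy as np

def mlm_least_squares(Y, X, Z):
    """Least-squares estimate for Y = X B Z' + E.

    The normal equations (X'X) B (Z'Z) = X' Y Z split into two small
    linear systems, so no large design matrix is ever formed.
    """
    B = np.linalg.solve(X.T @ X, X.T @ Y @ Z)   # (X'X)^{-1} X' Y Z
    return np.linalg.solve(Z.T @ Z, B.T).T      # ... times (Z'Z)^{-1}

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                    # row covariates
Z = rng.normal(size=(30, 3))                    # column covariates
B_true = rng.normal(size=(4, 3))
Y = X @ B_true @ Z.T                            # noiseless responses
B_hat = mlm_least_squares(Y, X, Z)
print(np.allclose(B_hat, B_true))               # True: exact recovery
```

Because both systems are only as large as the number of row and column covariates, this remains fast even when the response matrix itself is large.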
for which no closed-form solution exists. Several approaches for solving the univariate problem are well-established. When the covariate matrix is high-dimensional (as in genome-wide association studies), Wu and Lange [2008] and Friedman et al. [2010] proposed cyclic coordinate descent. In the computer science literature, fast iterative shrinkage-thresholding algorithms (FISTAs) [Beck and Teboulle, 2009] have also been applied to univariate L1-penalized regression. More generally, FISTA has been proposed for the nonsmooth, convex optimization problem for a parameter vector $\beta$ given by

(3) $\min_\beta \; f(\beta) = g(\beta) + h(\beta),$
where the loss function $g$ is a smooth convex function that is continuously differentiable with a Lipschitz continuous gradient, and the penalty term $h$ is a continuous convex function that may be nonsmooth. We have implemented extensions of these two approaches for our multivariate setup, in which $g(B) = \tfrac{1}{2}\|Y - X B Z^\top\|_F^2$ and $h(B) = \lambda \sum_{j,k} |b_{jk}|$. Note that our model may be expressed in vectorized form as follows. If vec is the vectorization operator that stacks the columns of a matrix into a single column vector, we can write (1) as

(4) $\operatorname{vec}(Y) = (Z \otimes X)\operatorname{vec}(B) + \operatorname{vec}(E).$
This is in the form of the familiar linear regression model, with design matrix $Z \otimes X$. Although mathematically equivalent, the vectorized form is computationally cumbersome. For example, the R package glmnet [Friedman et al., 2010] fails even for moderate dimensions of $X$ and $Z$, because their Kronecker product has large memory requirements. Unlike general solvers, we exploit the fact that the design matrix is a Kronecker product and obtain a computationally efficient solution. Our approach also points to how one might fit penalized multivariate regression models for multidimensional (tensor-valued) responses.
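The equivalence between the matrix and vectorized forms, and the memory blow-up of the explicit Kronecker product, can be checked numerically. A small NumPy sketch (illustrative names; the package itself is written in Julia):

```python
import numpy as np

def vec(A):
    """Column-stacking vectorization operator."""
    return A.flatten(order="F")

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))    # row covariates
Z = rng.normal(size=(4, 2))    # column covariates
B = rng.normal(size=(3, 2))    # coefficient matrix

lhs = vec(X @ B @ Z.T)                 # matrix form, then vectorize
rhs = np.kron(Z, X) @ vec(B)           # vectorized form with explicit Kronecker
print(np.allclose(lhs, rhs))           # True

# The explicit Kronecker product here is only 20 x 6; for thousands of
# rows and columns it would not fit in memory, which is why the
# Kronecker product should never be formed explicitly.
```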
In the Methods section below, we outline the algorithms considered for fitting the model. In the Results section, we examine the performance of our method in simulated and real data. In the Discussion section, we summarize our conclusions and outline implications for future work.
2 Methods
Throughout, as in the univariate case, the intercept is omitted from the penalty term and is thus not regularized. Standardizing the covariate matrices by subtracting the row means and dividing by the row standard deviations is also recommended.
We outline three algorithms we used for fitting the L1-penalized model, beginning with the most stable algorithm: cyclic coordinate descent. Next, we describe two variants of FISTA, a considerably faster, but less stable, approach. We conclude the section with computational and implementation considerations.
2.1 Cyclic coordinate descent
Cyclic coordinate descent searches for the minimum of a multivariable function by minimizing it along one coordinate direction at a time and cyclically iterating through each direction until convergence. We first calculate the directional derivatives along the forward and backward directions for the coordinate direction $e_{jk}$ at each coefficient $b_{jk}$:

(5) $d_{e_{jk}} f(B) = \frac{\partial g(B)}{\partial b_{jk}} + \begin{cases} \lambda & \text{if } b_{jk} \ge 0 \\ -\lambda & \text{if } b_{jk} < 0, \end{cases} \qquad d_{-e_{jk}} f(B) = -\frac{\partial g(B)}{\partial b_{jk}} + \begin{cases} \lambda & \text{if } b_{jk} \le 0 \\ -\lambda & \text{if } b_{jk} > 0. \end{cases}$
This is possible because the nondifferentiable penalty has directional derivatives along $e_{jk}$ and $-e_{jk}$. Furthermore, the loss function $g$ is differentiable, so its forward and backward directional derivatives are simply the positive and negative ordinary partial derivatives stored in the gradient

(6) $\nabla g(B) = -X^\top (Y - X B Z^\top) Z = -X^\top R Z.$
Above, $X_j$ and $Z_k$ denote the $j$th and $k$th columns of $X$ and $Z$ respectively, and $R = Y - X B Z^\top$ is the matrix of residuals. Note that calculating the gradient involves only low-dimensional matrix multiplication. Like Wu and Lange [2008], our implementation organizes the cyclic updates around the residuals, which makes calculating $\nabla g(B)$ fast. For each coefficient, we compute $X_j^\top R Z_k$ and then update the corresponding coefficient and residuals as follows:

$\hat b_{jk} \leftarrow \frac{S\big(X_j^\top R Z_k + d_{jk}\, b_{jk},\; \lambda\big)}{d_{jk}}, \qquad R \leftarrow R - (\hat b_{jk} - b_{jk})\, X_j Z_k^\top,$

where $d_{jk} = (X_j^\top X_j)(Z_k^\top Z_k)$ and $S$ is the soft-thresholding operator

(7) $S(z, \lambda) = \operatorname{sign}(z)\, \max(|z| - \lambda,\; 0).$
Our implementation uses “warm starts”: we initialize the coefficients at zero and compute solutions for a decreasing sequence of $\lambda$ values, initializing the coefficients for each subsequent $\lambda$ at the previous converged solution. This strategy is faster and leads to a more stable algorithm [Friedman et al., 2010]. We also take advantage of sparsity by organizing iterations over the active set of coefficients: after performing a full cycle through all of the coefficients, we cyclically update only the active (nonzero) coefficients until convergence. Another full cycle is then run, and the process is repeated until the estimates stop changing. Iterating through the coefficients in random order instead of cyclically can also result in faster convergence in practice.
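The cyclic, residual-based updates above can be sketched as follows. This is a simplified NumPy version without warm starts or active-set cycling (the function and variable names are ours, not the package's):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) from Eq. (7)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def mlm_coord_descent(Y, X, Z, lam, n_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent for 0.5*||Y - X B Z'||^2 + lam * sum |b_jk|."""
    p, q = X.shape[1], Z.shape[1]
    B = np.zeros((p, q))
    R = Y.copy()                                            # residuals Y - X B Z'
    # d_jk = ||X_j||^2 * ||Z_k||^2, the curvature of each coordinate
    d = np.outer((X ** 2).sum(axis=0), (Z ** 2).sum(axis=0))
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            for k in range(q):
                u = X[:, j] @ R @ Z[:, k] + d[j, k] * B[j, k]
                b_new = soft_threshold(u, lam) / d[j, k]
                if b_new != B[j, k]:
                    # update residuals in place so the next gradient is cheap
                    R -= (b_new - B[j, k]) * np.outer(X[:, j], Z[:, k])
                    max_change = max(max_change, abs(b_new - B[j, k]))
                    B[j, k] = b_new
        if max_change < tol:
            break
    return B

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
Z = rng.normal(size=(20, 3))
B_true = np.zeros((5, 3))
B_true[0, 0], B_true[3, 2] = 2.0, -1.5                      # two true signals
Y = X @ B_true @ Z.T + 0.1 * rng.normal(size=(40, 20))
B_hat = mlm_coord_descent(Y, X, Z, lam=5.0)
```

Setting `lam` at or above $\max_{jk} |X_j^\top Y Z_k|$ zeroes out every coefficient, which is a handy sanity check and the natural starting point for a warm-start path.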
2.2 FISTA
Cyclic coordinate descent is a very stable approach with excellent performance for univariate L1-penalized regression [Wu and Lange, 2008]. However, it is too slow for multivariate linear models of moderately large dimensions, especially if cross-validation is used to tune the penalty parameter $\lambda$. Consider instead an iterative shrinkage-thresholding algorithm (ISTA), which calculates the gradient at the previous coefficient estimates and updates all of the coefficients simultaneously at each iteration [Beck and Teboulle, 2009]. The updates are also multiplied by a small, fixed step size less than 1. While ISTA may take more iterations than coordinate descent to converge, each iteration is faster because the gradient can be calculated efficiently as a matrix product.
Choosing the step size requires some care, as an overly small step size can result in slow convergence and an overly large one can lead to divergence. A suggested approach for choosing the step size is to use the reciprocal of the (smallest) Lipschitz constant of $\nabla g$, given by $L = \lambda_{\max}\big((Z \otimes X)^\top (Z \otimes X)\big)$. The maximum eigenvalue of $(Z \otimes X)^\top (Z \otimes X)$ is equal to the product of the maximum eigenvalues of $X^\top X$ and $Z^\top Z$, so $L$ can be computed without forming the Kronecker product.

Fast iterative shrinkage-thresholding algorithms (FISTAs) are an extension of ISTA [Beck and Teboulle, 2009, Nesterov, 1983] that calculate the gradient at extrapolated coefficients composed of a linear combination of the coefficients from the previous two iterations. If $B^{(k)}$ is the matrix of coefficient estimates from the most recent iteration and $B^{(k-1)}$ is that from the second-to-last iteration, the gradient is calculated at $A^{(k)} = B^{(k)} + \frac{t_{k-1} - 1}{t_k}\big(B^{(k)} - B^{(k-1)}\big)$, where $t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2}$ and $t_0 = 1$. This approach takes into account the change between the coefficients in previous iterations, leading to a “damped oscillation” convergence that reduces overshooting when the local gradient is changing quickly.
Even faster convergence can be achieved by backtracking: performing a line search to find a suitable step size at each iteration, instead of initializing a fixed step size. The idea is that the step size should be small enough that the decrease in the objective function matches the decrease predicted by the gradient. First, set the initial step size to 1 and choose a factor $\eta \in (0, 1)$ with which to iteratively shrink the step size. Then, at each update step, iteratively shrink the step size $t$ by multiplying it by $\eta$ until the proximal update $B^{+} = S\big(A - t \nabla g(A),\; t\lambda\big)$, where $A$ denotes the extrapolated coefficients, satisfies the property in Eq. 8; then update the coefficients:

(8) $g(B^{+}) \le g(A) + \big\langle \nabla g(A),\; B^{+} - A \big\rangle + \frac{1}{2t}\,\|B^{+} - A\|_F^2.$
Like coordinate descent, our FISTA implementation uses a path of “warm starts”. Coordinate descent, ISTA, and FISTA with a fixed step size or backtracking each trade off speed against stability.
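A compact sketch of FISTA with backtracking for this model, again in illustrative NumPy rather than the Julia implementation (the helper names and the specific form, without warm starts, are our choices):

```python
import numpy as np

def soft_threshold(A, lam):
    """Elementwise soft-thresholding: sign(a) * max(|a| - lam, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

def mlm_fista(Y, X, Z, lam, n_iter=500, shrink=0.5):
    """FISTA with backtracking for 0.5*||Y - X B Z'||_F^2 + lam * sum |b_jk|."""
    p, q = X.shape[1], Z.shape[1]
    B = B_prev = np.zeros((p, q))
    t, step = 1.0, 1.0
    g = lambda M: 0.5 * np.sum((Y - X @ M @ Z.T) ** 2)      # smooth loss
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        A = B + ((t - 1.0) / t_next) * (B - B_prev)         # extrapolation
        grad = -X.T @ (Y - X @ A @ Z.T) @ Z                 # gradient at A
        while True:                                         # backtracking line search
            B_new = soft_threshold(A - step * grad, step * lam)
            diff = B_new - A
            # accept the step once the quadratic upper bound holds
            if g(B_new) <= g(A) + np.sum(grad * diff) + np.sum(diff * diff) / (2.0 * step):
                break
            step *= shrink
        B_prev, B, t = B, B_new, t_next
    return B

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
Z = rng.normal(size=(20, 3))
B_true = np.zeros((5, 3))
B_true[1, 1] = 2.0
Y = X @ B_true @ Z.T + 0.1 * rng.normal(size=(40, 20))
B_hat = mlm_fista(Y, X, Z, lam=5.0)
```

Each iteration touches only the low-dimensional matrices $X$, $Z$, and the residuals, so the per-iteration cost stays far below that of any solver working on the vectorized Kronecker design.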
2.3 Computational Considerations
2.3.1 Shrinkage Parameter Tuning
To determine the optimal shrinkage/regularization parameter $\lambda$, $k$-fold cross-validation is recommended; a parallel implementation is straightforward. Various criteria can be used to identify optimal performance averaged across the folds, including mean squared error (MSE), test error, AIC, and BIC. We used MSE for the analyses presented here.
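The cross-validation loop can be sketched as follows; the fitting routine is a placeholder argument so that any of the algorithms above can be plugged in (all names are illustrative, and the simple ridge-style `dummy_fit` merely stands in for an L1 solver):

```python
import numpy as np

def dummy_fit(Y, X, Z, lam):
    """Placeholder fitter: ridge-style shrinkage standing in for the L1 solver."""
    B = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y @ Z)
    return np.linalg.solve(Z.T @ Z, B.T).T

def kfold_mse(Y, X, Z, lams, fit, k=10, seed=0):
    """Row-wise k-fold cross-validation; returns the held-out MSE for each lambda."""
    n = Y.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)                     # disjoint row folds
    mse = np.zeros(len(lams))
    for held_out in folds:                             # each fold can run in parallel
        train = np.setdiff1d(idx, held_out)
        for i, lam in enumerate(lams):
            B = fit(Y[train], X[train], Z, lam)
            resid = Y[held_out] - X[held_out] @ B @ Z.T
            mse[i] += np.mean(resid ** 2) / k
    return mse

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
Z = rng.normal(size=(10, 2))
B_true = rng.normal(size=(4, 2))
Y = X @ B_true @ Z.T + 0.5 * rng.normal(size=(60, 10))
lams = [0.1, 1.0, 10.0]
cv_mse = kfold_mse(Y, X, Z, lams=lams, fit=dummy_fit, k=5)
best_lam = lams[int(np.argmin(cv_mse))]
```

Since the folds are independent, the outer loop parallelizes trivially, which is the parallel implementation referred to above.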
2.3.2 Package Implementation
We implemented our algorithms in the high-level programming language Julia (www.julialang.org). Julia is a relatively young language with an active community that combines ease of prototyping with computational speed. It features a just-in-time compiler and strong data typing, which enable fast computation. It is an attractive candidate for numerical computing problems such as ours, since one does not need to switch between multiple programming languages for implementation, analysis, and visualization. Additionally, Julia has built-in support for parallelization.
3 Results
3.1 Environmental screen simulations
We simulated data modeled after an environmental screening study [Woodruff et al., 2011] using mass spectrometry. The study measured environmental chemical concentrations in pregnant women across various demographics in several tissues. We simulated data from 100 chemicals, each measured in 10 tissues for 108 women. The tissues, chemicals, and each unique combination of tissue and chemical were encoded in the $Z$ matrix. We simulated an $X$ matrix with 19 continuous demographic covariates drawn from the standard normal distribution. For each tissue, 1/4 of the chemicals, 1/2 of the demographic covariates, and 1/8 of the interactions between chemicals/tissues and demographics, we simulated effects drawn from a Normal(0, 2) distribution. The receiver operating characteristic (ROC) curves in Figure 1 compare the performance of L1-penalized multivariate linear models (MLMs) to the conventional approach of running a univariate linear model for each chemical and tissue combination.
The black solid line plots the results for the L1-penalized MLM. We obtained true positive rates (TPR) and false positive rates (FPR) by varying $\lambda$ and comparing the nonzero and zero interaction estimates to the true interactions.

The red solid line is from running the 1000 univariate linear regression models, one for each combination of the 100 simulated chemicals and 10 simulated tissues. We obtained adaptive Benjamini-Hochberg adjusted p-values for each model’s coefficient estimates and varied the cutoff for determining significant interactions. These were compared to the true interaction effects to calculate the TPR and FPR.

The blue lines offer an alternate visualization of the univariate linear models. For each chemical, there are 10 chemical × demographic interactions, one for each of the 10 tissues. We flag an interaction if at least 1/5, 2/5, 3/5, or 4/5 of the 10 different p-values fall below the cutoff. A plot with curves for the 10 tissues, each of which corresponds to p-values from 100 different linear models, yields similar results.
With the possible exception of the lower left, our method consistently outperforms variations of the conventional univariate approach. The L1-penalized MLM results in an area under the curve (AUC) of 0.905; the AUC for the univariate linear regression interpretations is at most 0.714, achieved when only one out of five significant univariate p-values (“hits”) is needed to detect a significant interaction.
To illustrate the speed of FISTA with backtracking, we ran the algorithm on simulated data while fixing the dimensions of the multivariate response matrix and varying the dimensions of the interaction matrix (Table 1), or vice versa (Table 2). The data were simulated with 1/2 nonzero row and column main effects and 1/8 nonzero interactions drawn from a Normal(0, 2) distribution, with standard normal errors. A laptop with 8 GB memory and a 2.7 GHz dual-core processor was used to obtain the times, averaged over 10 runs. Our approach remains fast even when scaling to greater dimensions.
           q = 200   q = 400   q = 600
p = 200      0.85      1.35      1.58
p = 400      0.97      1.40      1.92
p = 600      0.95      1.52      2.60
           m = 400   m = 800   m = 1200
n = 400      0.43      0.70      0.95
n = 800      0.89      1.40      1.85
n = 1200     1.29      2.02      2.83
3.2 Arabidopsis fitness adaptation QTL
Ågren et al. [2013] studied 398 Arabidopsis thaliana recombinant inbred lines derived by crossing populations from Italy and Sweden. The individuals were grown in six environments: two sites (Italy and Sweden) measured over three years (2009–2011). The investigators genotyped 348 markers with the goal of mapping quantitative trait loci (QTL) to explain genetic mechanisms of fitness adaptation to local environments. We ran L1-penalized MLMs to determine significant interactions between the markers and the two sites and six environments. The 348 markers were encoded as dummy variables in the $X$ matrix. The $Z$ matrix comprised a 6 × 6 identity matrix (for the six environments), plus a sum contrast to encode the two sites (Italy and Sweden). We performed 10-fold cross-validation with MSE as the criterion to determine an optimal penalty size of $\lambda = 7.4$. Figure 2 plots all interactions (either G × site in solid red or G × environment in dashed black) against marker position on the five chromosomes. Vertical lines separate the chromosomes, and the peaks correspond to loci with significant, nonzero interactions. Several are found on chromosome 5, and G × site (Italy vs. Sweden) QTL account for many, but not all, of the peaks. These results are largely aligned with the significant QTL found by Ågren et al. [2013].
To compare the speed of the FISTA variants to coordinate descent, Table 3 shows the times for running each algorithm on this dataset for a path of 50 $\lambda$ values, averaged over 10 runs. For this moderately sized dataset, the L1-penalized coefficients can be computed in minutes using FISTA, compared to nearly a day for cyclic coordinate descent.
Coord. Desc.   ISTA (fixed)   FISTA (fixed)   FISTA (backtr.)
23.63 h        7.83 m         2.85 m          1.77 m
3.3 Arabidopsis eQTL experiment
Lowry et al. [2013] examined the regulation and evolution of gene expression under drought stress. This expression quantitative trait locus (eQTL) mapping experiment studied 104 individuals from the Tsu-1 (Tsushima, Japan) × Kas-1 (Kashmir, India) recombinant inbred line population of Arabidopsis thaliana. It was conducted across wet and dry soil treatments with two replicates. Gene expression phenotypes were collected for 25,662 genes, and 450 markers were genotyped [Lowry et al., 2013]. The goal was to identify main-effect (G) and interaction (G × E) eQTLs for the environmental conditions. Here, the $X$ matrix encodes the 450 markers. The $Z$ matrix, which encodes the main effects and drought treatment interactions for the 25,662 expression phenotypes, can be expressed as

(9)
The large number of phenotypes in this dataset makes fitting L1-penalized MLMs a much more computationally intensive endeavor than it is for the Ågren et al. [2013] experiment. It takes 36 hours to run the FISTA algorithm for a path of 16 penalties. However, applying R/qtl’s stepwise function [Broman et al., 2003] one by one to each phenotype would take about twice as long, at 73 hours. Figure 3, which reproduces Figure 2 in Lowry et al. [2013] using our results, visually summarizes the main-effect and interaction eQTLs identified by our method at the selected penalty. Our method was able to detect many of the same main effects and G × E effects.
4 Discussion
We have developed a fast fitting procedure for L1-penalized multivariate regression models and demonstrated its use on several high-throughput data problems. Our approach takes advantage of the structure of the multivariate regression model to speed up the computational algorithms. The choice between cyclic coordinate descent and the various flavors of (F)ISTA is largely a trade-off between speed and stability. The former is a reasonably fast approach for computing L1-penalized estimates for univariate linear models, but is too slow for our multivariate scenario. Instead, we turned to FISTA and the exploitation of the matrix properties and sparsity of our model. Analysis of simulated and Arabidopsis QTL data illustrates our method’s ability to detect sparse interactions.
Our work shows the feasibility of fitting multivariate linear models with moderately large dimensions. It can be extended in two promising directions. First, we can extend the fitting to models with different loss functions ($g$) and penalty functions ($h$). For example, we can fit the elastic net [Zou and Hastie, 2005] by changing the penalty function, adding an L2 penalty to the L1 penalty. We can also make the solution less sensitive to outliers by using a robust loss function, such as Huber’s loss function, instead of the squared error loss.
A second promising direction would be to extend the models to multidimensional (tensor-valued) responses. In (4), the design matrix is a Kronecker product of two matrices. It can be extended to more than two matrices to handle a multidimensional $Y$. These models might be attractive for handling, for example, time-series high-throughput data or 3D image data. Further work is needed to explore the performance, scalability, and stability of the fitting algorithms.
Our algorithms have been implemented in the Julia programming language and are available at https://bitbucket.org/jwliang/mlm_packages.
References
 Ågren et al. [2013] Ågren, J., Oakley, C. G., McKay, J. K., Lovell, J. T., and Schemske, D. W. (2013). Genetic mapping of adaptation reveals fitness tradeoffs in Arabidopsis thaliana. Proceedings of the National Academy of Sciences, 110(52):21077–21082.
 Beck and Teboulle [2009] Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.
 Broman et al. [2003] Broman, K. W., Wu, H., Sen, Ś., and Churchill, G. A. (2003). R/qtl: QTL mapping in experimental crosses. Bioinformatics, 19(7):889–890.
 Friedman et al. [2010] Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.
 Lowry et al. [2013] Lowry, D. B., Logan, T. L., Santuari, L., Hardtke, C. S., Richards, J. H., DeRose-Wilson, L. J., McKay, J. K., Sen, S., and Juenger, T. E. (2013). Expression quantitative trait locus mapping across water availability environments reveals contrasting associations with genomic features in Arabidopsis. The Plant Cell, 25(9):3266–3279.
 Nesterov [1983] Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376.
 Woodruff et al. [2011] Woodruff, T. J., Zota, A. R., and Schwartz, J. M. (2011). Environmental chemicals in pregnant women in the United States: NHANES 2003–2004. Environmental Health Perspectives, 119(6):878.
 Wu and Lange [2008] Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244.
 Xiong et al. [2011] Xiong, H., Goulding, E. H., Carlson, E. J., Tecott, L. H., McCulloch, C. E., and Sen, Ś. (2011). A flexible estimating equations approach for mapping functionvalued traits. Genetics, 189(1):305–316.
 Zou and Hastie [2005] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.