Fast algorithms for fitting L_1-penalized multivariate linear models to structured high-throughput data

12/15/2017
by   Jane W. Liang, et al.

We present fast methods for fitting sparse multivariate linear models to high-throughput data. We induce model sparsity using an L_1 penalty and consider the case when the response matrix and the covariate matrices are large. Standard methods for estimation of these penalized regression models fail if the problem is converted to the corresponding univariate regression problem, motivating our fast estimation algorithms that utilize the structure of the model. We evaluate our method's performance on simulated data and two Arabidopsis genetic datasets with multivariate responses. Our algorithms have been implemented in the Julia programming language and are available at https://bitbucket.org/jwliang/mlm_packages.

1 Introduction

The rise of high-throughput technology has been a major boon for answering complex biological questions. Examples include measuring gene transcription on a genome-wide scale using microarrays; screening of genetic mutant libraries; and screening for chemicals in human tissues using mass spectrometry. Analysis of these data poses statistical and computational challenges. In this note, we tackle the problem of fitting sparse multivariate linear models to high-throughput data. We induce model sparsity using an L_1 penalty, and consider the case when the response matrix and the covariate matrices are large.

Suppose that Y is an n × m response matrix, with rows annotated by an n × p covariate matrix X and columns annotated by an m × q covariate matrix Z. Consider the linear model

Y = X B Z^T + E,   (1)

where the p × q coefficient matrix B needs to be estimated and E is a matrix of random errors. For example, in a high-throughput chemical screen of a library of mutants, the response matrix Y would consist of the colony sizes from growing the library of mutants in a variety of chemical conditions. Each row would be a separate run of the experiment; each column would represent a specific genetic mutant strain. The matrix X would consist of information on the nature and doses of the chemical media in which the mutants were grown, and the matrix Z would have information on which gene was mutated. The linear model allows us to model the effect of both the genes and the chemicals on colony size.

We consider the scenario when the entries in E are independently distributed with mean zero and the same variance. If the rows of E are iid but the columns are correlated, then we can transform the data so that the entries are uncorrelated. The estimation of B then reduces to finding the least squares estimates, which have a closed-form solution that can be computed quickly even in high dimensions [Xiong et al., 2011]. However, in many problems, B is expected to be sparse, or we may want to use a sparse estimate of B for prediction and interpretation. In such settings, the convex LASSO or L_1 penalty is added to the least squares criterion:

(1/2) ||Y − X B Z^T||_F^2 + λ Σ_{j,k} |b_jk|,   (2)

for which no closed-form solution exists. Several approaches for solving the univariate problem are well established. When the covariate matrix is high-dimensional (as in genome-wide association studies), Wu and Lange [2008] and Friedman et al. [2010] proposed cyclic coordinate descent. In the computer science literature, fast iterative shrinkage and thresholding algorithms (FISTAs) [Beck and Teboulle, 2009] have also been applied to univariate L_1-penalized regression. More generally, FISTA has been proposed for the nonsmooth, convex optimization problem for a parameter vector β given by

min_β f(β) + g(β),   (3)

where the loss function f is a smooth convex function that is continuously differentiable with a Lipschitz continuous gradient, and the penalty term g is a continuous convex function that may be nonsmooth. We have implemented extensions of these two approaches for our multivariate setup, in which f(B) = (1/2) ||Y − X B Z^T||_F^2 and g(B) = λ Σ_{j,k} |b_jk|.

Note that our model may be expressed in vectorized form as follows. If vec is the vectorization operator that stacks the columns of a matrix into a single column vector, we can write (1) as

vec(Y) = (Z ⊗ X) vec(B) + vec(E).   (4)

This is in the form of the familiar linear regression model y = Wβ + ε. Although mathematically equivalent, the vectorized form is computationally cumbersome. For example, the R package glmnet [Friedman et al., 2010] fails even for moderate dimensions of X and Z, because their Kronecker product has large memory requirements. Unlike general solvers, we utilize the fact that the design matrix is a Kronecker product and are able to obtain a computationally efficient solution. Our approach also points to how one might fit penalized multivariate regression models for multi-dimensional (tensor-valued) responses.
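To make the memory argument concrete, here is a minimal Julia sketch (illustrative; not taken from our package, and all names are ours) contrasting the naive vectorized gradient with the structured one:

    using LinearAlgebra

    n, m, p, q = 100, 80, 20, 10
    X, Z = randn(n, p), randn(m, q)
    B = randn(p, q)
    Y = X * B * Z' + randn(n, m)

    # Naive approach: materialize the (n*m) x (p*q) Kronecker design matrix.
    W = kron(Z, X)                        # memory grows as n*m*p*q
    g_vec = -W' * (vec(Y) - W * vec(B))   # gradient of the vectorized loss

    # Structured approach: never form the Kronecker product.
    R = Y - X * B * Z'                    # n x m residual matrix
    g_mat = -X' * R * Z                   # p x q gradient, cheap to compute

    @assert vec(g_mat) ≈ g_vec            # identical up to floating point

At the dimensions used in our timing experiments below, W would have hundreds of billions of entries, while the structured gradient only ever touches matrices of the original sizes.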

In the Methods section below, we outline the algorithms considered for fitting the model. In the Results section, we examine the performance of our method in simulated and real data. In the Discussion section, we summarize our conclusions and outline implications for future work.

2 Methods

Throughout, as in the univariate case, the intercept is omitted from the penalty term and is thus not regularized. Standardizing the covariate matrices by subtracting the row means and dividing by the row standard deviations is also recommended.

We outline three algorithms we used for fitting the L_1-penalized model, beginning with the most stable algorithm: cyclic coordinate descent. Next, we describe two variants of FISTA, a considerably faster, but less stable, approach. We conclude the section with computational and implementation considerations.

2.1 Cyclic coordinate descent

Cyclic coordinate descent searches for the minimum of a multivariable function by minimizing it along one coordinate direction at a time and cyclically iterating through each direction until convergence. We first calculate the directional derivatives of the penalized objective h(B) = f(B) + λ Σ_{j,k} |b_jk| along the forward and backward directions e_jk and −e_jk at each coefficient b_jk:

d_{e_jk} h(B) = ∂f/∂b_jk + λ  if b_jk ≥ 0,   ∂f/∂b_jk − λ  if b_jk < 0;
d_{−e_jk} h(B) = −∂f/∂b_jk + λ  if b_jk ≤ 0,   −∂f/∂b_jk − λ  if b_jk > 0.   (5)

This is possible because the nondifferentiable penalty has directional derivatives along e_jk and −e_jk. Furthermore, the loss function f is differentiable, so its forward and backward directional derivatives are simply the positive and negative ordinary partial derivatives stored in the gradient ∇f(B):

∂f/∂b_jk = −x_j^T R z_k,   where R = Y − X B Z^T.   (6)

Above, x_j and z_k denote the j-th and k-th columns of X and Z, respectively; R is the matrix of residuals. Note that calculating the gradient involves only low-dimensional matrix multiplication. Like Wu and Lange [2008], our implementation organizes the cyclic updates around the residuals, which makes calculating (6) fast. For each coefficient, we compute (5) and then update the corresponding coefficient and residual as follows.

B ← 0;
R ← Y;
given λ;
while not converged do
       for each coefficient b_jk do
             g ← −x_j^T R z_k;
             L_jk ← (x_j^T x_j)(z_k^T z_k);
             b_new ← S(b_jk − g/L_jk, λ/L_jk);
             R ← R − (b_new − b_jk) x_j z_k^T;
             b_jk ← b_new;
       end for
end while
Algorithm 1 Cyclic coordinate descent

where S is the soft-thresholding operator

S(a, c) = sign(a) · max(|a| − c, 0).   (7)

Our implementation uses “warm starts” by initializing the coefficients at zero and computing solutions for a decreasing sequence of λ values. The coefficients for each subsequent λ value are then initialized to the previous converged solution. This strategy is faster and leads to a more stable algorithm [Friedman et al., 2010]. We also took advantage of sparsity by organizing iterations over the active set of coefficients: after performing a full cycle through all of the coefficients, we cyclically update only the active (nonzero) coefficients until convergence. Another full cycle is then run, and the process is repeated until the estimates stop changing. Iterating through the coefficients in random order instead of cyclically can also result in faster convergence in practice.
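The update above can be made concrete in a few lines of Julia. The following sketch of a single cyclic pass (illustrative only; function and variable names are ours, not the packaged implementation) keeps the residual matrix in sync with each coefficient update:

    using LinearAlgebra

    soft_threshold(a, c) = sign(a) * max(abs(a) - c, zero(a))

    function cd_pass!(B, R, X, Z, lambda)
        for k in axes(B, 2), j in axes(B, 1)
            xj, zk = view(X, :, j), view(Z, :, k)
            L = dot(xj, xj) * dot(zk, zk)      # curvature along coordinate (j, k)
            g = -dot(xj, R * zk)               # partial derivative of the loss
            b_new = soft_threshold(B[j, k] - g / L, lambda / L)
            delta = b_new - B[j, k]
            if delta != 0
                R .-= delta .* xj .* zk'       # keep residuals in sync
                B[j, k] = b_new
            end
        end
        return B
    end

Wrapping this pass in the convergence, warm-start, and active-set logic described above yields the full algorithm.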

2.2 FISTA

Cyclic coordinate descent is a very stable approach with excellent performance for univariate L_1-penalized regression [Wu and Lange, 2008]. However, it is too slow for multivariate linear models of moderately large dimensions, especially if cross-validation is used to tune the penalty parameter λ. Consider instead an iterative shrinkage-thresholding algorithm (ISTA) that calculates the gradient at the previous coefficient estimates and updates all of the coefficients simultaneously at each iteration [Beck and Teboulle, 2009]. The updates are also multiplied by a small, fixed step size less than 1. While ISTA may take more iterations than coordinate descent to converge, each iteration is faster because the gradient can be calculated efficiently as a matrix product.

Choosing the step size requires some care, as an overly small step size can result in slow convergence and an overly large one can lead to divergence. A suggested approach is to use the reciprocal of the (smallest) Lipschitz constant of ∇f, given by L = λ_max(X^T X) · λ_max(Z^T Z). The maximum eigenvalue of (Z ⊗ X)^T (Z ⊗ X) = (Z^T Z) ⊗ (X^T X) is equal to the product of the maximum eigenvalues of Z^T Z and X^T X.
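For example, this step size can be computed in Julia without ever forming the Kronecker product; the following one-liners are an illustrative sketch:

    using LinearAlgebra

    # λmax((Z ⊗ X)'(Z ⊗ X)) = λmax(X'X) * λmax(Z'Z), so only the
    # small Gram matrices are ever needed.
    lipschitz(X, Z) = eigmax(Symmetric(X' * X)) * eigmax(Symmetric(Z' * Z))
    step_size(X, Z) = 1 / lipschitz(X, Z)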

Fast iterative shrinkage-thresholding algorithms (FISTAs) are an extension of ISTA [Beck and Teboulle, 2009, Nesterov, 1983] that calculate the gradient at extrapolated coefficients formed as a linear combination of the coefficients from the previous two iterations. If B^(k) is the matrix of coefficient estimates from the most recent iteration and B^(k−1) is that from the second-to-last iteration, then the gradient is calculated at the extrapolated point A = B^(k) + ((s_k − 1)/s_{k+1}) (B^(k) − B^(k−1)), where s_1 = 1 and s_{k+1} = (1 + √(1 + 4 s_k^2))/2. This approach takes into account the change between the coefficients in previous iterations, leading to a “damped oscillation” convergence that reduces overshooting when the local gradient is changing quickly.

B ← 0;
B_prev ← B; A ← B;
set step size: t ← 1/(λ_max(X^T X) · λ_max(Z^T Z));
s ← 1;
while not converged do
       G ← −X^T (Y − X A Z^T) Z;
       B_prev ← B;
       B ← S(A − t G, t λ);
       s_next ← (1 + √(1 + 4 s^2))/2;
       A ← B + ((s − 1)/s_next)(B − B_prev);
       s ← s_next;
end while
Algorithm 2 FISTA with fixed step size
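A minimal Julia sketch of this fixed-step loop, under the same updates as Algorithm 2 (illustrative; a fixed iteration count stands in for a real convergence check, and the names are ours):

    using LinearAlgebra

    soft_threshold(a, c) = sign(a) * max(abs(a) - c, zero(a))

    function fista(Y, X, Z, lambda; iters = 500)
        B = zeros(size(X, 2), size(Z, 2))
        B_prev, A, s = copy(B), copy(B), 1.0
        t = 1 / (eigmax(Symmetric(X' * X)) * eigmax(Symmetric(Z' * Z)))
        for _ in 1:iters
            G = -X' * (Y - X * A * Z') * Z              # gradient at extrapolated point
            B_prev, B = B, soft_threshold.(A .- t .* G, t * lambda)
            s_next = (1 + sqrt(1 + 4 * s^2)) / 2        # Nesterov momentum sequence
            A = B .+ ((s - 1) / s_next) .* (B .- B_prev)
            s = s_next
        end
        return B
    end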

Even faster convergence can be achieved by backtracking: performing a line search for an acceptable step size at each iteration, instead of fixing one in advance. The idea is that the step size should be small enough that the decrease in the objective function matches the decrease predicted by the gradient. First, set the initial step size to 1 and choose a multiplicative factor γ ∈ (0, 1) with which to iteratively shrink the step size. Then, at each update step, iteratively shrink the step size by multiplying it by γ until it satisfies the condition in Eq. 8, and then update the coefficients:

f(B_new) ≤ f(A) + ⟨∇f(A), B_new − A⟩ + (1/(2t)) ||B_new − A||_F^2,   (8)

where A is the extrapolated coefficient matrix, t is the current step size, and B_new = S(A − t ∇f(A), tλ).
B ← 0;
B_prev ← B; A ← B;
s ← 1;
t ← 1; choose shrinkage factor γ ∈ (0, 1);
while not converged do
       G ← −X^T (Y − X A Z^T) Z;
       B_new ← S(A − t G, t λ);
       while Eq. 8 not met do
             t ← γ t;
             B_new ← S(A − t G, t λ);
       end while
       B_prev ← B;
       B ← B_new;
       s_next ← (1 + √(1 + 4 s^2))/2;
       A ← B + ((s − 1)/s_next)(B − B_prev);
       s ← s_next;
end while
Algorithm 3 FISTA with backtracking
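The backtracking test in Eq. 8 might be implemented along the following lines; in this illustrative Julia sketch, the extrapolated point A and the gradient G at A are supplied by the surrounding FISTA loop:

    using LinearAlgebra

    f(Y, X, B, Z) = 0.5 * norm(Y - X * B * Z')^2   # smooth part of the objective

    function backtrack_step(Y, X, Z, A, G, lambda; t = 1.0, gamma = 0.5)
        soft(a, c) = sign(a) * max(abs(a) - c, zero(a))
        B_new = soft.(A .- t .* G, t * lambda)
        # Shrink t until the quadratic upper bound in Eq. 8 holds.
        while f(Y, X, B_new, Z) >
              f(Y, X, A, Z) + dot(G, B_new .- A) + norm(B_new .- A)^2 / (2 * t)
            t *= gamma
            B_new = soft.(A .- t .* G, t * lambda)
        end
        return t, B_new
    end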

Like coordinate descent, we implemented FISTA using a path of “warm starts”. Coordinate descent, ISTA, and FISTA with fixed step size or backtracking each make a different trade-off between speed and stability.

2.3 Computational Considerations

2.3.1 Shrinkage Parameter Tuning

To determine the optimal shrinkage/regularization parameter λ, k-fold cross-validation is recommended; a parallel implementation is straightforward. Various criteria can be used to identify optimal performance averaged across the folds, including mean squared error (MSE), test error, AIC, and BIC. We used MSE for the analyses presented.
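As an illustration, a simple k-fold cross-validation over a path of λ values might look like the sketch below, which reuses the fista sketch from Section 2.2; the fold assignment is an illustrative stand-in, and the loop over folds could be handed to pmap for a parallel version:

    using Random, Statistics

    function cv_lambda(Y, X, Z, lambdas; k = 10)
        n = size(Y, 1)
        folds = shuffle(repeat(1:k, outer = cld(n, k)))[1:n]    # random fold labels
        scores = map(lambdas) do lambda
            errs = map(1:k) do fold
                train, test = folds .!= fold, folds .== fold
                B = fista(Y[train, :], X[train, :], Z, lambda)
                mean((Y[test, :] .- X[test, :] * B * Z') .^ 2)  # held-out MSE
            end
            mean(errs)                                          # average across folds
        end
        return lambdas[argmin(scores)]
    end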

2.3.2 Package Implementation

We implemented our algorithms using the high-level programming language Julia (www.julialang.org). Julia is a relatively young language with an active community that combines ease of prototyping with computational speed. It features a just-in-time compiler and strong data typing, which enable fast computation. It is an attractive candidate for numerical computing problems such as ours, since one does not need to switch between multiple programming languages for implementation, analysis, and visualization. Additionally, Julia has built-in support for parallelization.

3 Results

3.1 Environmental screen simulations

We simulated data modeled after an environmental screening study [Woodruff et al., 2011] using mass spectrometry. The study measured environmental chemical concentrations in pregnant women across various demographics in several tissues. We simulated data from 100 chemicals, each measured in 10 tissues for 108 women. The tissues, chemicals, and each unique combination of tissues and chemicals were encoded in the Z matrix. We simulated an X matrix with 19 continuous demographic covariates drawn from the standard normal distribution. For each tissue, we simulated effects for 1/4 of the chemicals, 1/2 of the demographic covariates, and 1/8 of the interactions between chemicals/tissues and demographics, drawn from a Normal(0, 2) distribution. The receiver operating characteristic (ROC) curves in Figure 1 compare the performance of L_1-penalized multivariate linear models (MLM) to the conventional approach of running a univariate linear model for each chemical and tissue combination.

  • The black solid line plots the results for the L_1-penalized MLM. We obtained true positive rates (TPR) and false positive rates (FPR) by varying λ and comparing the nonzero and zero interaction estimates to the true interactions.

  • The red solid line is from running 1000 univariate linear regression models, one for each combination of the 100 simulated chemicals and 10 simulated tissues. We obtained the adaptive Benjamini-Hochberg adjusted p-values for each model’s coefficient estimates and varied the cutoff for determining significant interactions. These were compared to the true interaction effects to calculate the TPR and FPR.

  • The blue lines offer an alternate visualization of the univariate linear models. For each chemical, there are 10 chemical-demographic interactions, one for each of the 10 tissues. We flag an interaction if at least 1/5, 2/5, 3/5, or 4/5 of the 10 different p-values are below the cutoff. A plot with curves for the 10 tissues, each of which corresponds to p-values from 100 different linear models, yields similar results.

Figure 1: ROC curves for simulations comparing L_1-penalized multivariate linear models to univariate linear regression for identifying chemical interactions. The AUC for each method is given in parentheses in the legend. L_1-penalized multivariate linear models generally outperform the univariate approach.

With the possible exception of the lower left region, our method consistently outperforms variations of the conventional univariate approach. The L_1-penalized MLM results in an area under the curve (AUC) of 0.905; the AUC for the univariate linear regression interpretations is at most 0.714, achieved when only one out of five significant univariate p-values (“hits”) is needed to detect a significant interaction.

To illustrate the speed of FISTA with backtracking, we ran the algorithm on simulated data while fixing the dimensions of the multivariate response matrix and varying the dimensions of the interaction matrix (Table 1), or vice versa (Table 2). The data were simulated with 1/2 nonzero row and column main effects and 1/8 nonzero interactions drawn from a Normal(0, 2) distribution, with standard normal errors. A laptop with 8 GB of memory and a 2.7 GHz dual-core processor was used to obtain the reported times, each an average of 10 runs. Our approach remains fast even when scaling to greater dimensions.

          q = 200   q = 400   q = 600
p = 200     0.85      1.35      1.58
p = 400     0.97      1.40      1.92
p = 600     0.95      1.52      2.60
Table 1: Computation time (minutes) for running FISTA with backtracking over 20 λ values on simulated data, holding the response dimensions n and m fixed while varying p and q (the dimensions of the interaction matrix B).
          m = 400   m = 800   m = 1200
n = 400      0.43      0.70      0.95
n = 800      0.89      1.40      1.85
n = 1200     1.29      2.02      2.83
Table 2: Computation time (minutes) for running FISTA with backtracking over 20 λ values on simulated data, holding the interaction dimensions p and q fixed while varying n and m (the dimensions of the multivariate response matrix Y).

3.2 Arabidopsis fitness adaptation QTL

Ågren et al. [2013] studied 398 Arabidopsis thaliana recombinant inbred lines derived by crossing populations from Italy and Sweden. The individuals were grown in six environments: two sites (Italy and Sweden) measured over three years (2009-2011). The investigators genotyped 348 markers with the goal of mapping quantitative trait loci (QTL) to explain genetic mechanisms of fitness adaptation to local environments. We ran L_1-penalized MLMs to determine significant interactions between the markers and the two sites and six environments. The 348 markers were encoded as dummy variables in the X matrix. The Z matrix consisted of a 6 × 6 identity matrix (for the six environments), plus a sum contrast to encode the two sites (Italy and Sweden).
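For concreteness, a Z matrix of this form might be built as in the sketch below; the ordering of the six site-year environments (three Italy years followed by three Sweden years) is our assumption for illustration:

    using LinearAlgebra

    site_contrast = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # Italy = +1, Sweden = -1 (assumed ordering)
    Z = hcat(Matrix{Float64}(I, 6, 6), site_contrast)    # 6 x 7 design matrix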

We performed 10-fold cross-validation with MSE as the criterion to determine an optimal penalty size of 7.4. Figure 2 plots all interactions (either G × site in solid red or G × environment in dashed black) against marker position on the five chromosomes. Vertical lines separate the chromosomes, and the peaks correspond to loci with significant, nonzero interactions. Several are found on chromosome 5, and G × site (Italy vs. Sweden) QTL represent many, but not all, of the peaks. These results largely align with the significant QTL found by Ågren et al. [2013].

Figure 2: QTL plotted against marker position. G × site (Italy vs. Sweden) QTL are solid red and G × environment QTL (two sites over three years) are dashed black. Vertical lines separate the five chromosomes.

To compare the speed of the FISTA methods to coordinate descent, Table 3 shows the times for running each algorithm on this dataset over a path of 50 λ values, averaged over 10 runs. For this moderately sized dataset, the L_1-penalized coefficients can be computed in minutes using FISTA, compared to nearly a day for cyclic coordinate descent.

Coord. Desc. ISTA (fixed) FISTA (fixed) FISTA (backtr.)
23.63 h 7.83 m 2.85 m 1.77 m
Table 3: Computation time (h = hours, m = minutes) to obtain interaction estimates for the Ågren et al. [2013] data at 50 λ values, using cyclic coordinate descent, ISTA and FISTA with fixed step size, and FISTA with backtracking.

3.3 Arabidopsis eQTL experiment

Lowry et al. [2013] examined the regulation and evolution of gene expression under drought stress. This expression quantitative trait locus (eQTL) mapping experiment studied 104 individuals from the Tsu-1 (Tsushima, Japan) × Kas-1 (Kashmir, India) recombinant inbred line population of Arabidopsis thaliana. It was conducted across wet and dry soil treatments with two replicates. Gene expression phenotypes were collected for 25,662 genes, and 450 markers were genotyped [Lowry et al., 2013]. The goal was to identify main-effect (G) and interaction (G × E) eQTLs for the environmental conditions. Here, the X matrix encodes the 450 markers. The Z matrix, which encodes the main effects and drought-treatment interactions for the 25,662 expression phenotypes, can be expressed as

Z = I_{25662} ⊗ [ 1  1 ; 1  −1 ],   (9)

where each pair of rows of Z corresponds to the wet and dry measurements of one gene, and each pair of columns to that gene's main effect and its treatment (G × E) interaction, using the same sum-contrast convention as above.

The large number of phenotypes in this dataset makes fitting L_1-penalized MLMs a much more computationally intensive endeavor than it is for the Ågren et al. [2013] experiment. It takes 36 hours to run the FISTA algorithm for a path of 16 penalties. However, applying R/qtl’s stepwise function [Broman et al., 2003] one-by-one to each phenotype would take about twice as long, at 73 hours. Figure 3, which reproduces Figure 2 of Lowry et al. [2013] using our results, visually summarizes the main-effect and interaction eQTLs identified by our method at the selected penalty. Our method was able to detect many of the same main effects and G × E effects.

Figure 3: Distribution of eQTL across the genome. Main effects are shown as blue open circles and G × E interactions as red closed squares.

4 Discussion

We have developed a fast fitting procedure for L_1-penalized multivariate regression models and demonstrated its use for several high-throughput data problems. Our approach takes advantage of the structure of the multivariate regression model to speed up the computational algorithms. The choice between cyclic coordinate descent and the various flavors of (F)ISTA is largely a trade-off between speed and stability. The former is a reasonably fast approach for computing L_1-penalized estimates for univariate linear models, but is too slow for our multivariate scenario. Instead, we turned to FISTA, exploiting the matrix structure and sparsity of our model. Analysis of simulated and Arabidopsis QTL data illustrates our method’s ability to detect sparse interactions.

Our work shows the feasibility of fitting multivariate linear models with moderately large dimensions. It can be extended in two promising directions. First, we can extend the fitting to models with different loss (f) and penalty (g) functions. For example, we can fit the elastic net [Zou and Hastie, 2005] by changing the penalty function, adding an L_2 penalty to the L_1 penalty. We can also make the solution less sensitive to outliers by using a robust loss function, such as Huber’s loss, instead of the squared error loss.

A second promising direction would be to extend the models to multi-dimensional (tensor-valued) responses. In (4), the design matrix is a Kronecker product of two matrices. It can be extended to more than two matrices to handle a multi-dimensional Y. These models might be attractive for handling, for example, high-throughput time series data or 3-D image data. Further work is needed to explore the performance, scalability, and stability of the fitting algorithms.

Our algorithms have been implemented in the Julia programming language and are available at https://bitbucket.org/jwliang/mlm_packages.

References

  • Ågren et al. [2013] Ågren, J., Oakley, C. G., McKay, J. K., Lovell, J. T., and Schemske, D. W. (2013). Genetic mapping of adaptation reveals fitness tradeoffs in Arabidopsis thaliana. Proceedings of the National Academy of Sciences, 110(52):21077–21082.
  • Beck and Teboulle [2009] Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.
  • Broman et al. [2003] Broman, K. W., Wu, H., Sen, Ś., and Churchill, G. A. (2003). R/qtl: QTL mapping in experimental crosses. Bioinformatics, 19(7):889–890.
  • Friedman et al. [2010] Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.
  • Lowry et al. [2013] Lowry, D. B., Logan, T. L., Santuari, L., Hardtke, C. S., Richards, J. H., DeRose-Wilson, L. J., McKay, J. K., Sen, S., and Juenger, T. E. (2013). Expression quantitative trait locus mapping across water availability environments reveals contrasting associations with genomic features in Arabidopsis. The Plant Cell, 25(9):3266–3279.
  • Nesterov [1983] Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376.
  • Woodruff et al. [2011] Woodruff, T. J., Zota, A. R., and Schwartz, J. M. (2011). Environmental chemicals in pregnant women in the United States: NHANES 2003-2004. Environmental Health Perspectives, 119(6):878.
  • Wu and Lange [2008] Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244.
  • Xiong et al. [2011] Xiong, H., Goulding, E. H., Carlson, E. J., Tecott, L. H., McCulloch, C. E., and Sen, Ś. (2011). A flexible estimating equations approach for mapping function-valued traits. Genetics, 189(1):305–316.
  • Zou and Hastie [2005] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.