High-dimensional regression over disease subgroups

11/03/2016 · Frank Dondelinger, et al.

We consider high-dimensional regression over subgroups of observations. Our work is motivated by biomedical problems, where disease subtypes, for example, may differ with respect to underlying regression models, but sample sizes at the subgroup-level may be limited. We focus on the case in which subgroup-specific models may be expected to be similar but not necessarily identical. Our approach is to treat subgroups as related problem instances and jointly estimate subgroup-specific regression coefficients. This is done in a penalized framework, combining an ℓ_1 term with an additional term that penalizes differences between subgroup-specific coefficients. This gives solutions that are globally sparse but that allow information-sharing between the subgroups. We present algorithms for estimation and empirical results on simulated data and using Alzheimer's disease, amyotrophic lateral sclerosis and cancer datasets. These examples demonstrate the gains our approach can offer in terms of prediction and the ability to estimate subgroup-specific sparsity patterns.


1 Introduction

High-dimensional regression has been well studied in the case where all samples can reasonably be expected to follow the same model. However, in several current and emerging applications, observations span multiple subgroups that may not be identical with respect to the underlying regression models. Examples abound, from disease subtypes in biomedicine to customer subsets in business applications. We are specifically motivated by biomedical problems, where sets of samples, such as disease subtypes, although related, may differ with respect to underlying biology and therefore have different relationships between covariates and a response of interest.

Thus, we focus on high-dimensional regression in group-structured settings. In particular, we consider linear regression in a commonly-encountered scenario in which the same set of p covariates or predictors is available in each of G subgroups. That is, we consider subgroup-specific linear regression problems indexed by g ∈ {1, …, G}, each with subgroup-specific sample size n_g, a response vector y^(g) of length n_g, an n_g × p feature matrix X^(g) and a p-vector of regression coefficients β^(g). The problem we address is estimating the regression coefficients β^(1), …, β^(G).

We propose an approach to jointly estimate the regression coefficients that induces global sparsity and encourages similarity between subgroup-specific coefficients. We consider the following penalized formulation and its variants

argmin_B Σ_{g=1}^G (1/n_g) ‖y^(g) − X^(g)β^(g)‖₂² + λ₁‖B‖₁ + λ₂ Σ_{g<g′} τ(g,g′) ‖β^(g) − β^(g′)‖₂²

where B = (β^(1), …, β^(G)) is a p × G matrix that collects together all the regression coefficients, ‖·‖₁ denotes the ℓ_1 norm of its argument and λ₁, λ₂ ≥ 0 are tuning parameters. The last term is a fusion-type penalty between subgroups; note that the difference is taken between entire vectors of subgroup-specific coefficients. An ℓ_2 fusion penalty is shown above, although other penalties may be used; in this manuscript, we also consider an ℓ_1 variant. The parameters τ(g,g′) allow for the possibility of controlling the extent to which similarity is encouraged for specific pairs of subgroups.

Our proposal shares similarities with both the group lasso (Yuan and Lin, 2006) and the fused lasso (Tibshirani et al., 2005) but differs from both in important ways. In contrast to the group lasso, we consider subgroups of samples or observations rather than groups of coefficients and in contrast to the fused lasso, we consider fusion of entire (subgroup-specific) coefficient vectors, rather than successive coefficients under a pre-defined ordering. Obozinski et al. (2010) showed how the group lasso could be used in subgroup-structured settings, essentially by considering the global problem and defining groups (in the group lasso sense) corresponding to the same covariate across all subgroups. This means that each covariate tends either to be included in all subgroup-specific models or none. In contrast, our approach allows subgroups to have different sparsity patterns, whilst pulling subgroup-specific coefficients together and inducing global sparsity. Our work is also similar in spirit to recent work concerning joint estimation of graphical models over multiple problem instances (Danaher et al., 2014; Oates et al., 2014, 2015).

We are motivated by emerging problems in biomedical research and specifically in personalized medicine. High-dimensional regression problems are now becoming common in this area, with several high-dimensional data types already in mainstream use. In the personalized medicine setting, samples usually correspond to individuals and the subgroups

to e.g. diseases or disease subtypes. It is increasingly clear that many disease subtypes differ in their biology (see e.g. Weinstein et al., 2013; Akbani et al., 2014), suggesting that relationships between covariates and responses of interest may differ between them. However, sample sizes tend to be limited, especially at the subgroup level, posing problems for the subgroup-wise strategy of solving each problem separately. On the other hand, pooling all the data together into a single regression problem may lead to severe mis-specification if the underlying subgroup-specific models do indeed differ.

These issues may lead to losses both in predictive ability and, perhaps just as importantly, in the ability to efficiently estimate subgroup-specific influences that may themselves be of interest. In contrast to simple pooling, our approach allows subgroups to have different sparsity patterns and regression coefficients, but in contrast to the subgroup-wise approach it takes advantage of similarities between subgroups.

We show empirical results in the context of two neurodegenerative diseases - Alzheimer’s disease and amyotrophic lateral sclerosis (ALS) - and cancer (see below for full details of the applications and data). The responses concern disease progression in Alzheimer’s and ALS and therapeutic response in cancer cell lines. In the Alzheimer’s and ALS examples, subgroups are based on clinical factors, while in the cancer data they are based on the tissue type of the cell lines.

Across the three examples, data types include genetic, clinical and transcriptomic variables. We find that our approach can improve performance relative to pooling or subgroup-wise analysis. Importantly, in cases where pooling or subgroup-wise analyses do well (perhaps reflecting a lack of subgroup structure or insufficient similarity respectively) our approach remains competitive. This gives assurance that penalization is indeed able to share information appropriately in real-world examples. We emphasize that the goal of the empirical analyses we present is not to give the best predictions possible in these applications, but rather to explore joint estimation in group-structured biomedical problems.

2 Methods

2.1 Notation

Each subgroup has the same set of p covariates, but a subgroup-specific sample size n_g. Total sample size is n = Σ_g n_g. For subgroup g, X^(g) is the n_g × p feature matrix and y^(g) the corresponding vector of observed responses. Subgroup-specific regression coefficients are β^(g) ∈ ℝ^p. Where convenient we collect all regression coefficients together in a p × G matrix B = (β^(1), …, β^(G)) and accordingly we use β_j^(g) to denote the coefficient for covariate j in subgroup g.

2.2 Model Formulation

We seek to jointly estimate the regression coefficients whilst ensuring global sparsity and encouraging agreement between subgroup-specific coefficients. We propose the criterion

B̂ = argmin_B Σ_{g=1}^G (1/n_g) ‖y^(g) − X^(g)β^(g)‖₂² + λ₁‖B‖₁ + λ₂ Σ_{g<g′} τ(g,g′) ‖β^(g) − β^(g′)‖₂²   (1)

and a variant with an ℓ_1 norm in the last term

B̂ = argmin_B Σ_{g=1}^G (1/n_g) ‖y^(g) − X^(g)β^(g)‖₂² + λ₁‖B‖₁ + λ₂ Σ_{g<g′} τ(g,g′) ‖β^(g) − β^(g′)‖₁   (2)

Here, λ₁, λ₂ ≥ 0 are tuning parameters. The role of the last term is to encourage similarity between subgroup-specific regression coefficients. In the limit λ₂ → ∞ (with all τ(g,g′) > 0), the subgroup-specific coefficients are forced to agree, recovering the classical lasso (applied to all data pooled together). The tuning parameters τ(g,g′) give the possibility of controlling the extent of fusion between specific subgroups. By default all τ(g,g′)'s are set to unity (“unweighted fusion”), but they can also be set to specific values as discussed below (“weighted fusion”). In the above formulation, we assume that the y^(g)'s and X^(g)'s have been standardized (at the subgroup level) so that no intercept terms are required. Note that the regularization parameters λ₁, λ₂ are the same across subgroups.
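To make the criterion concrete, the ℓ_2-fusion objective can be evaluated directly. Below is a minimal NumPy sketch; the function and argument names are our own, not from the authors' implementation:

```python
import numpy as np

def joint_lasso_objective(B, Xs, ys, lam1, lam2, tau):
    """Evaluate criterion (1): per-subgroup squared error, l1 sparsity,
    and pairwise l2 fusion between subgroup coefficient vectors.

    B: (p, G) coefficient matrix; Xs, ys: lists of per-subgroup data;
    tau: (G, G) symmetric fusion weights.
    """
    G = B.shape[1]
    # sum_g (1/n_g) ||y^(g) - X^(g) beta^(g)||_2^2
    fit = sum(np.sum((ys[g] - Xs[g] @ B[:, g]) ** 2) / len(ys[g])
              for g in range(G))
    sparsity = lam1 * np.sum(np.abs(B))          # lambda_1 ||B||_1
    # lambda_2 sum_{g < g'} tau(g, g') ||beta^(g) - beta^(g')||_2^2
    fusion = lam2 * sum(tau[g, h] * np.sum((B[:, g] - B[:, h]) ** 2)
                        for g in range(G) for h in range(g + 1, G))
    return fit + sparsity + fusion
```

With λ₂ = 0 the objective decouples into G separate lasso problems, while large λ₂ pulls the columns of B together.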

The difference between the two variants is that the first, ℓ_2 fusion, encourages similarity between subgroup-specific coefficients, while the second, ℓ_1 version allows for exact equality. The ℓ_2 formulation has the computational advantage that the fusion part of the objective function is continuously differentiable, and the estimate of each coefficient at each step can be obtained by soft-thresholding, analogously to coordinate descent for regular lasso problems. In the ℓ_1 formulation, on the other hand, the fusion term is only piece-wise continuously differentiable, leading to a more difficult optimisation problem (see below).

2.3 Comparison with group and fused lasso

Our formulation resembles the group lasso and fused lasso, but differs from both in important ways. The original group lasso (Yuan and Lin, 2006) was designed to consider groups of covariates within a single regression problem. Let X be the feature matrix and y the vector of responses in a standard regression problem. Letting k = 1, …, K index groups of covariates, the group lasso criterion is

argmin_β ‖y − Σ_k X_k β_k‖₂² + Σ_k λ_k ‖β_k‖₂   (3)

where X_k is the submatrix of X corresponding to the covariates in group k, β_k the corresponding regression coefficients and the λ_k's tuning parameters. The ℓ_2 penalty on each β_k tends to include or exclude all members of a group from the model, i.e. all coefficients in a group may be set to zero, giving groupwise sparsity.

In our setting, the subgroups are subsets of samples rather than covariates. Nevertheless, as shown in Obozinski et al. (2010), one could use a group lasso-like criterion for estimation in the multiple subgroup setting by forming groups each comprising all the coefficients for a single covariate across all regression problems. This encourages covariates to either be included in all the subgroup-specific models or none.

The fused lasso (Tibshirani et al., 2005) is also aimed at a single regression problem, but assumes that the covariates can be ordered in such a way that successive coefficients may be expected to be similar. This leads to the following criterion

argmin_β ‖y − Xβ‖₂² + λ₁ Σ_{j=1}^p |β_j| + λ₂ Σ_{j=2}^p |β_j − β_{j−1}|   (4)

where λ₁, λ₂ are tuning parameters and we have assumed that the covariates are in a suitable order. The final term encourages similarity between successive coefficients. Efficient solutions for various classes of this problem exist (e.g. Hoefling, 2010; Liu et al., 2010; Ye and Xie, 2011).

Our approach shares the use of a fusion-type penalty, but focuses on a different problem, namely that of jointly estimating regression coefficients across multiple, potentially non-identical, problems. Accordingly, our penalty encourages agreement between entire coefficient vectors from different subgroups and does not require any ordering of covariates.

2.4 Setting the tuning parameters

For weighted fusion, the parameters τ(g,g′) could be set by cross-validation, but this may be onerous in practice. As an alternative, we consider setting τ(g,g′) using a distance function based on the covariates. The idea is to allow more fusion between subgroups that are similar with respect to the covariates, while allowing the τ(g,g′)'s to be set in advance of estimation proper. However, this assumes that similarity in the covariates reasonably reflects similarity between the underlying regression coefficients, which may or may not be the case in specific applications.

We consider two variants. The first sets d(g,g′) = ‖x̄^(g) − x̄^(g′)‖₂, where x̄^(g), x̄^(g′) are the sample means of the covariates in subgroups g, g′ respectively (we assume the data have been standardized). The second approach additionally takes the covariance structure into account by using the symmetrised Kullback-Leibler (KL) divergence, i.e. d(g,g′) = KL(P̂_g ‖ P̂_{g′}) + KL(P̂_{g′} ‖ P̂_g), where P̂_g, P̂_{g′} are estimated marginal distributions over the covariates in subgroups g, g′ respectively and KL(P ‖ Q) is the KL-divergence between distributions P and Q. In practice, this requires simplifying distributional assumptions. Below we use multivariate Normal models for this purpose, with the graphical lasso (Friedman et al., 2008) used to estimate the covariance matrices. For both approaches, we set τ(g,g′) = 1 − d(g,g′)/d_max, with d_max the largest distance between any pair of groups g, g′ (this scales the distances to the unit interval).
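As an illustration of the symmetrised-KL variant, the closed-form KL divergence between multivariate Normals can be used to compute the τ(g,g′) weights. This is a sketch under the Gaussian assumption described above; the 1 − d/d_max scaling follows the description in the text:

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form KL(N(mu0, S0) || N(mu1, S1)) for multivariate Normals."""
    p = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - p
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def fusion_weights(means, covs):
    """tau(g, g') = 1 - d(g, g') / d_max with symmetrised-KL distances.

    means, covs: per-subgroup Gaussian estimates (e.g. means and
    graphical-lasso covariance estimates). Illustrative sketch only.
    """
    G = len(means)
    d = np.zeros((G, G))
    for g in range(G):
        for h in range(g + 1, G):
            d[g, h] = d[h, g] = (gaussian_kl(means[g], covs[g], means[h], covs[h])
                                 + gaussian_kl(means[h], covs[h], means[g], covs[g]))
    dmax = d.max()
    return 1.0 - d / dmax if dmax > 0 else np.ones((G, G))
```

Identical subgroup distributions receive the maximal weight 1, and the most dissimilar pair receives weight 0, so fusion is strongest where the covariate distributions agree.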

2.5 Optimisation

We describe a coordinate descent approach for optimising equation (1). While it is possible to derive a block coordinate descent approach for equation (2) (e.g. following Friedman et al., 2007), this is generally inefficient for the high-dimensional problems that we consider. Instead, we will describe an optimization procedure based on a proximal gradient approximation derived in Chen et al. (2010).

2.5.1 Coordinate Descent for ℓ_2 Fusion

The ℓ_2 fusion penalty is continuously differentiable and we can obtain the optimal value for β_j^(g) in equation (1) at each step by first calculating the optimal value without the lasso penalty:

β̃_j^(g) = [ (1/n_g) Σ_i x_{ij}^(g) r_{i,−j}^(g) + λ₂ Σ_{g′≠g} τ(g,g′) β_j^(g′) ] / [ (1/n_g) Σ_i (x_{ij}^(g))² + λ₂ Σ_{g′≠g} τ(g,g′) ]   (5)

where r_{i,−j}^(g) = y_i^(g) − Σ_{k≠j} x_{ik}^(g) β_k^(g) denotes the partial residual excluding covariate j. Then β_j^(g) can be obtained by soft-thresholding on β̃_j^(g). The procedure is summarized in Algorithm 1.


1: procedure BlockDescentL2(X, y, λ₁, λ₂, τ, ε, maxiter)
2:     B ← 0
3:     iter ← 0
4:     while not_converged AND iter < maxiter do
5:         for all j in 1:p do
6:             for all g in 1:G do
7:                 β̃_j^(g) ← DescentUpdateL2(X, y, B, j, g, λ₂, τ) using eq. (5)
8:                 β_j^(g) ← SoftThreshold(β̃_j^(g), λ₁)
9:         iter ← iter + 1
Algorithm 1 Block Coordinate Descent
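For illustration, the block coordinate descent of Algorithm 1 can be written directly in NumPy. This is a naive sketch of update (5) followed by soft-thresholding, not the authors' optimized implementation:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l2_fusion_cd(Xs, ys, lam1, lam2, tau, n_iter=200):
    """Naive coordinate descent for the l2-fusion criterion (1).

    Xs, ys: lists of subgroup design matrices / responses.
    tau: (G, G) symmetric fusion weight matrix.
    """
    G = len(Xs)
    p = Xs[0].shape[1]
    B = np.zeros((p, G))
    for _ in range(n_iter):
        for j in range(p):
            for g in range(G):
                X, y, n_g = Xs[g], ys[g], Xs[g].shape[0]
                # partial residual excluding covariate j
                r = y - X @ B[:, g] + X[:, j] * B[j, g]
                others = [h for h in range(G) if h != g]
                num = (X[:, j] @ r) / n_g + lam2 * sum(tau[g, h] * B[j, h]
                                                       for h in others)
                den = (X[:, j] @ X[:, j]) / n_g + lam2 * sum(tau[g, h]
                                                             for h in others)
                # lasso step: soft-threshold the unpenalized optimum
                B[j, g] = soft_threshold(num, lam1 / 2.0) / den
    return B
```

With λ₁ = λ₂ = 0 and a single subgroup, the inner update reduces to Gauss-Seidel least squares; increasing λ₂ pulls the columns of B towards one another.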

While Algorithm 1 is easy to understand and implement, a naive implementation in most programming languages will still be slow due to the need for an inner for-loop over the covariates j = 1, …, p, where p can be in the tens of thousands for the kinds of problems we consider. In order to optimize efficiently, we reformulate (1) as a classical lasso problem and apply the glmnet software (Friedman et al., 2010). We transform the sum in the first part of the objective into matrix form by defining X̃ as a block-diagonal matrix with the scaled feature matrices X^(g)/√n_g along the diagonal. The vector β̃ is a flattened version of B with the vectors β^(g) stacked, and similarly ỹ stacks the scaled responses y^(g)/√n_g. So we have:

Σ_{g=1}^G (1/n_g) ‖y^(g) − X^(g)β^(g)‖₂² = ‖ỹ − X̃β̃‖₂²

Now we move the fusion penalty into the first squared term by defining the augmented matrix X̂ and augmented vector ŷ, such that

argmin_β̃ ‖ŷ − X̂β̃‖₂² + λ₁‖β̃‖₁   (6)

where

X̂ = (X̃ ; √λ₂ C),   ŷ = (ỹ ; 0)

with C a matrix encoding the pair-wise fusion constraints, and 0 a vector of zeros. Each block of p rows of C corresponds to the fusion constraint between two coefficient vectors β^(g) and β^(g′), with:

C_{(g,g′)} = √(τ(g,g′)) ( 0 ⋯ I_p ⋯ −I_p ⋯ 0 )   (7)

where the identity blocks I_p and −I_p appear in column blocks g and g′ respectively.

We can see that (6) is a classical lasso problem, to which glmnet can be directly applied.
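The construction of X̂ and ŷ can be sketched as follows: the code builds the block-diagonal part scaled by 1/√n_g and appends the √λ₂-weighted fusion rows, so that the augmented squared-error term reproduces the fit and fusion terms of (1). Names are illustrative:

```python
import numpy as np

def augmented_lasso_design(Xs, ys, lam2, tau):
    """Build the augmented (X_hat, y_hat) of equation (6).

    Stacks X^(g)/sqrt(n_g) block-diagonally, then appends
    sqrt(lam2 * tau(g, g')) * (I, -I) rows for each subgroup pair.
    """
    G, p = len(Xs), Xs[0].shape[1]
    blocks = np.zeros((sum(X.shape[0] for X in Xs), G * p))
    row = 0
    for g, X in enumerate(Xs):
        n_g = X.shape[0]
        blocks[row:row + n_g, g * p:(g + 1) * p] = X / np.sqrt(n_g)
        row += n_g
    y_aug = np.concatenate([y / np.sqrt(len(y)) for y in ys])
    fus_rows = []
    for g in range(G):
        for h in range(g + 1, G):
            R = np.zeros((p, G * p))
            w = np.sqrt(lam2 * tau[g, h])
            R[:, g * p:(g + 1) * p] = w * np.eye(p)    # +I_p block (group g)
            R[:, h * p:(h + 1) * p] = -w * np.eye(p)   # -I_p block (group g')
            fus_rows.append(R)
    X_hat = np.vstack([blocks] + fus_rows)
    y_hat = np.concatenate([y_aug, np.zeros(len(fus_rows) * p)])
    return X_hat, y_hat
```

Running any standard lasso solver (e.g. glmnet) on (X̂, ŷ) with penalty λ₁ then solves criterion (1).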

2.5.2 Proximal-Gradient Approach for ℓ_1 Fusion Penalty

Optimising equation (2) by block coordinate descent, while possible, is highly inefficient due to the discontinuities in the objective function. In Chen et al. (2010), the authors describe a proximal relaxation of this problem that introduces additional smoothing to turn the objective into a continuously differentiable function. Chen et al. deal with the multi-task regression setting (with a common feature matrix X for each task); it is straightforward to adapt their procedure for the subgroup regression setting with a different X^(g) per subgroup.

It is notationally convenient to first introduce a graph formulation of the fusion penalties. We think of the fusion constraints in terms of an undirected graph with vertex set V = {1, …, G} corresponding to the subgroups and edge set E containing an edge between every pair of vertices. Then the penalised objective function can be written as:

f(B) = Σ_{g=1}^G (1/n_g) ‖y^(g) − X^(g)β^(g)‖₂² + ‖BC‖₁   (8)

where the last term includes both sparsity and fusion penalties, via the G × (G + |E|) matrix C = (λ₁ I_G, λ₂ H), with I_G the identity matrix of size G, H a G × |E| matrix (|E| = G(G−1)/2 in this case) and, for each edge e = (g, g′), H_{g,e} = τ(g,g′), H_{g′,e} = −τ(g,g′), and H_{h,e} = 0 otherwise. Note that unlike in Chen et al. (2010), we require the explicit sum over g in the objective to account for different sample sizes in different groups.¹

¹ It would be possible to reformulate the first part of the objective in matrix form by defining X̃ as a block-diagonal matrix as in Section 2.5.1 and stacking the y^(g) and β^(g) correspondingly; however, this formulation is neither practical nor intuitive, and the gain in notational simplicity is negligible.

The graph formulation allows for zero edges by setting the corresponding τ(g,g′) to zero. We have implicitly assumed in the formulation of (1) and (2) that the relationship between subgroups is represented by an undirected graph. However, (8) is completely general, and it would be straightforward to incorporate a directed graph in our model. We have not pursued this avenue here, as there is no reason to suspect directionality in the subgroup relationships for the applications we consider below, and including directionality would double the number of tuning parameters that need to be considered.

Following Chen et al. (2010), we can introduce an auxiliary matrix A. Because of the duality between the ℓ_1 and ℓ_∞ norms, we can write ‖BC‖₁ = max_{‖A‖_∞ ≤ 1} ⟨A, BC⟩. A smooth approximation of this term is then obtained by writing:

f_μ(B) = max_{‖A‖_∞ ≤ 1} ( ⟨A, BC⟩ − (μ/2) ‖A‖_F² )   (9)

where μ is a positive smoothness parameter, and D = max_{‖A‖_∞ ≤ 1} (1/2)‖A‖_F², with ‖·‖_F the Frobenius norm. They show that for a desired accuracy ε, we need to set μ = ε/(2D). Theorem 1 in Chen et al. (2010) gives the gradient of f_μ as ∇f_μ(B) = A* Cᵀ, where A* is the optimal solution of (9). Replacing ‖BC‖₁ by f_μ(B) in equation (8), we obtain

f̃(B) = Σ_{g=1}^G (1/n_g) ‖y^(g) − X^(g)β^(g)‖₂² + f_μ(B)   (10)

which is now continuously differentiable with gradient

∇f̃(B) = ( −(2/n_1) X^(1)ᵀ(y^(1) − X^(1)β^(1)), …, −(2/n_G) X^(G)ᵀ(y^(G) − X^(G)β^(G)) ) + A* Cᵀ   (11)

where the first term collects the subgroup-wise gradients of the squared-error loss as the columns of a p × G matrix.

Chen et al. further show that A* = S(BC/μ), where the function S truncates each entry of its argument to the range [−1, 1] to ensure that ‖A*‖_∞ ≤ 1. An upper bound of the Lipschitz constant L of ∇f̃ can be derived as:

L = max_g (2/n_g) λ_max(X^(g)ᵀ X^(g)) + ‖C‖₂² / μ   (12)

where λ_max(X^(g)ᵀ X^(g)) is the largest eigenvalue of X^(g)ᵀ X^(g) and ‖C‖₂ is the spectral norm of C.

With the derivation of the gradient in (11) and the Lipschitz bound in (12), we can now apply Nesterov’s method (Nesterov, 2005) for optimizing (10). The procedure is summarized in Algorithm 2. For more details on the proximal approach see Chen et al. (2010).

1: procedure Proximal(X, y, C, λ₁, λ₂, μ, L, ε, maxiter)
2:     B⁰ ← 0; W⁰ ← 0
3:     t ← 0
4:     while not_converged AND t < maxiter do
5:         Compute ∇f̃(W^t) according to (11).
6:         B^{t+1} ← W^t − (1/L) ∇f̃(W^t)
7:         W^{t+1} ← B^{t+1} + (t/(t+3)) (B^{t+1} − B^t)
8:         t ← t + 1
9:     return B^t
Algorithm 2 Proximal Gradient Optimization
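The core quantities of the smoothing step, the approximation f_μ and its gradient via A* = S(BC/μ), can be sketched as follows. This is a generic illustration of equation (9) and the penalty part of the gradient (11), not the authors' code:

```python
import numpy as np

def smoothed_penalty_and_grad(B, C, mu):
    """Smooth approximation of ||B C||_1 (Chen et al., 2010) and its gradient.

    A* = S(B C / mu) truncates entries to [-1, 1]; then
    f_mu(B) = <A*, B C> - (mu/2) ||A*||_F^2  and  grad f_mu(B) = A* C^T.
    B: (p, G) coefficient matrix; C: (G, K) penalty matrix.
    """
    A = np.clip(B @ C / mu, -1.0, 1.0)                 # optimal A*
    val = np.sum(A * (B @ C)) - 0.5 * mu * np.sum(A ** 2)
    return val, A @ C.T
```

As μ decreases the approximation tightens, at the cost of a larger Lipschitz constant (12) and hence smaller gradient steps.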

3 Simulation Study

To test the performance of the proposed approach, we simulated data from a model based on characteristics of a recent cancer dataset, the Cancer Cell Line Encyclopedia (CCLE; Barretina et al., 2012). We treat cancer types as subgroups. To simulate data, we first estimated means μ^(g) and covariance matrices Σ^(g) for each of G = 9 subgroups (the eight cancer types with the largest sample sizes in CCLE plus a ninth group comprising all other cancer types; covariances were estimated using the graphical lasso). For each group g, we then sampled covariates from the multivariate Normal N(μ^(g), Σ^(g)). For a given total sample size n, subgroup sizes were kept consistent with those in the original data. We used a random subset of 200 gene expression levels (i.e. the dimensionality was fixed at p = 200). This parametric approach allowed us to vary sample sizes freely, including the case of total n larger than in the original dataset. The set-up is intended to roughly reflect the correlation structure of the covariates, but we do not expect it to capture all aspects of the real data.

We are interested in the situation in which it may be useful to share information between subgroups. But we are also interested in investigating performance in settings that do not agree with our model formulation (the extreme cases being where subgroups are either entirely dissimilar or identical). Let {1, …, G} be the set of subgroup indices (here, G = 9). We set regression coefficients to be identical in a subset S of the subgroups, such that the size s = |S| governs the extent to which fusion could be useful. Specifically, if s = G, all subgroups have the same regression coefficients (i.e. favoring a pooled analysis using a single regression model) and at the other extreme, if s ≤ 1, all groups have differently drawn coefficients. Intermediate values of s give differing levels of similarity.

For a given value s, we defined membership of S by considering the differences between the subgroup-specific models for the covariates. Specifically, we chose the s groups that minimized the sum of symmetrised KL divergences between the subgroup-specific covariate models. A coefficient vector β^(g) was then drawn separately for each subgroup g ∉ S and one shared coefficient vector drawn for all g ∈ S. Each draw was done as follows. We first sampled a binary vector z of length p from a Bernoulli distribution. Then, if z_j = 1, we drew β_j from a standard Normal with an interval around zero excluded, and set β_j = 0 otherwise (the exclusion ensures non-zero coefficients are not very small in magnitude). Note that in the case of s ≤ 1, all groups have separately drawn coefficients and the between-subgroup KL divergence plays no role.
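The coefficient-drawing step can be sketched as follows; the sparsity probability and excluded interval below are illustrative placeholders, since the text leaves the exact values unspecified:

```python
import numpy as np

def draw_coefficients(p, sparsity=0.2, exclude=0.1, rng=None):
    """Draw a sparse coefficient vector as in the simulation set-up.

    Non-zero entries follow a standard Normal with the interval
    (-exclude, exclude) excluded, so active coefficients are bounded
    away from zero. `sparsity` and `exclude` are assumed values.
    """
    rng = np.random.default_rng(rng)
    z = rng.random(p) < sparsity          # Bernoulli inclusion indicators
    beta = np.zeros(p)
    n_active = int(z.sum())
    draws = np.empty(n_active)
    for i in range(n_active):
        x = rng.standard_normal()
        while abs(x) < exclude:           # redraw until outside the interval
            x = rng.standard_normal()
        draws[i] = x
    beta[z] = draws
    return beta
```

One such vector would be shared across all subgroups in S, with independent draws for each subgroup outside S.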

We compare our approaches with pooled and subgroup-wise analyses. These are performed using classical lasso (we use the glmnet implementation) on respectively the whole dataset or each subgroup separately.

Figure 1: Simulated data, performance for varying s. Here, the number of subgroups is fixed at G = 9, of which s have shared coefficients in the underlying data-generating model (see text for details of the simulation set-up). A smaller s corresponds to less similarity between underlying subgroup-specific models, with s ≤ 1 representing the case where all subgroups have separately drawn coefficients while s = 9 represents an entirely homogeneous model in which each subgroup has exactly the same regression coefficients. The total sample size n is fixed. Upper panel: Weighted root mean squared error (RMSE). RMSE is weighted by subgroup sizes. Lower panel: Area under the ROC curve (AUROC; with respect to the true sets of active variables with non-zero coefficients).
Figure 2: Simulated data, performance at varying sample sizes. Here, the number of subgroups is fixed at G = 9, of which 4 have shared coefficients in the underlying data-generating model (see text for details of the simulation set-up). Top left: Root mean squared error (RMSE; weighted by subgroup sizes). Top right: Area under the ROC curve (AUROC; with respect to the true sets of active variables with non-zero coefficients). Bottom: Computational time taken in log seconds.

Figure 1 shows performance when varying the number s of subgroups with shared coefficients, with the total number of samples fixed. Here, a smaller value of s corresponds to less similarity between subgroup-specific coefficients in the underlying models. At intermediate values of s, the fusion approaches offer gains over pooled and subgroup-wise analyses. This is because the pooled analyses are mis-specified due to the inhomogeneity of the data, while the subgroup-wise analyses, although correctly specified, must confront limited sample sizes since they analyze each subgroup entirely separately. In contrast, the fusion approaches are able to pool information across subgroups, but also allow for subgroup-specific coefficients. Importantly, even at the extremes of s ≤ 1 (separately drawn coefficients for each subgroup) and s = 9 (all subgroups have exactly the same coefficients), the fusion approaches perform well. This demonstrates their flexibility in adapting the degree of fusion.

Figure 2 shows performance as a function of total sample size n. Here, the number of subgroups with identical coefficients is fixed at 4. This gives a relatively weak opportunity for information sharing, since 5/9 groups have separately drawn coefficients. Since the true coefficient vectors β^(g) are not identical, the pooled analysis is mis-specified and accordingly, even at large sample sizes, it does not catch up with the other approaches. As expected, subgroup-wise analyses perform increasingly well at larger sample sizes. However, at smaller sample sizes the fusion approaches show some gains.

The ℓ_1 and ℓ_2 fusion approaches seem similar in performance. Our ℓ_2 implementation leverages the glmnet package and is more computationally efficient than the ℓ_1 approach. For computational convenience, in the examples below we show results from the ℓ_2 approach only.

4 Alzheimer's disease: prediction of cognitive scores

Here, we use data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (Mueller et al., 2005) to explore the ability of fusion approaches to estimate regression models linking clinical and genetic covariates to disease progression, as captured by cognitive test scores.

In 2014, ADNI made a subset of its data available for a DREAM challenge (Allen et al., 2016) and we use these data here. The dataset consists of individuals who were followed up over at least 24 months. Cognitive function was evaluated using the mini-mental state examination (MMSE). At baseline, individuals were classified as cognitively normal (CN), early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI) or diagnosed with Alzheimer's disease (AD). These form clinically-defined subgroups for our analysis. For the present analysis, we use only genetic data (single nucleotide polymorphisms, or SNPs) and clinical profiles as covariates and disregard the neuroimaging data.

The task is to predict the slope of MMSE scores over a 24-month period. Filtering the available SNPs by linkage disequilibrium substantially reduces their number. For computational ease, we pre-selected 20,000 of the remaining SNPs, namely those that gave the smallest residuals when regressed (together with the clinical variables) against responses in the training set. We note that this biases our analyses, but we emphasize that our goal in this section is not biomarker discovery but comparison between approaches, all using the same (pre-selected) covariates.

Figure 3 shows root mean squared error (RMSE) separately for each of the four subgroups. The fusion approaches offer substantial gains compared with pooled and subgroup-wise analyses (the latter performed very badly and are not shown in the figure). The biggest gain is for the AD subgroup. For the weighted fusion analysis, the tuning parameters τ(g,g′) were set using the distance between the means of each subgroup (in the space of genetic and clinical variables). Weighting did not appear to improve performance.

Figure 4 shows scatter plots of predicted MMSE slopes versus the true slopes. The predictions shown were obtained in a held-out fashion via 10-fold cross-validation (CV), as were the RMSE and Pearson correlations shown.

Figure 5 shows a comparison of the estimated regression coefficients themselves. The subgroup-wise approach is much sparser than the other methods, likely due to the fact that it must operate entirely separately on each (relatively small-sample) subgroup. The pooled approach finds more influential variables but obviously there is no subgroup-specificity. The fused approach selects more variables than the subgroup-wise analysis, but there are many instances of subgroup-specificity in the estimates.


Figure 3: Alzheimer's disease prediction results, ADNI data. Box plots showing the difference in RMSE of the fused methods compared to the pooled linear regression model (higher values indicate better performance by the fused methods). [Subgroup-wise analysis performed less well than pooled and is not shown; boxplots are over 10-fold cross-validation.]
Figure 4: Alzheimer's disease data, predicted vs. observed responses. Scatter plots show predicted and observed 24-month slopes for each of the standard and fused linear regression models. All predictions were obtained via 10-fold cross-validation.
Figure 5: Alzheimer's disease data, estimated regression coefficients. Heatmap showing estimated regression coefficients for the clinical variables and a representative subsample of the SNPs. Absolute coefficients are thresholded to improve readability.

5 Prediction of therapeutic response in cancer cell lines

The Cancer Cell Line Encyclopedia (CCLE; Barretina et al., 2012) is a panel of 947 cancer cell lines with associated molecular measurements and responses to 24 anti-cancer agents. Here, we use these data to explore group-structured regression. We treat the area above the dose-response curve as the response and use expression levels of 20,000 human genes as covariates. We treat the cancer types as subgroups. After discarding cell lines with missing values, we arrive at the final set of samples.


Figure 6: Cancer cell line therapeutic response prediction. Difference in weighted RMSE between L2 fusion approach and a pooled analysis. Results shown over 24 responses (anti-cancer agents) using data from the Cancer Cell Line Encyclopedia (CCLE); the dashed vertical line at zero indicates no difference, boxplots to the right indicate improvement (lower RMSE) over pooled.

Figure 6 shows results over all 24 responses (anti-cancer agents). We observe that for most responses, the L2 fusion approach shows improved or similar prediction performance compared with the pooled analysis in terms of RMSE (weighted by subgroup size). Weighted fusion shows a similar performance to unweighted fusion.

Figure 7: Cancer cell line therapeutic response prediction, broken down by subgroup (cancer type) for agents PD-0332991 and PLX4720.

Figure 7 shows results broken down by subgroup for two examples (responses PD-0332991 and PLX4720). In the former case, the fusion approaches largely outperform pooled and subgroup-wise analyses. In the second, pooled is the best performer, although the fusion approaches are similar in most subgroups.

6 ALS: prediction of disease progression

Amyotrophic lateral sclerosis (ALS) is an incurable neurodegenerative disease that can lead to death within three to four years of onset. However, about ten percent of patients survive more than 10 years. Prediction of disease progression remains an open question. We use data from the PROACT database, specifically data that were used in the 2015 DREAM ALS Stratification Prize4Life Challenge (data were retrieved from the PROACT database on 22/06/2015). Our aim is not to optimize predictive performance per se, but rather to provide a case study exploring the use of fusion approaches in a moderate-dimensional, clinical data setting. In contrast to the Alzheimer's example above, here the data are less high-dimensional and the subgrouping less clear cut (see below).

The data consist of observations from patients enrolled in clinical trials, each followed up for a minimum of 12 months after the start of the trial. Disease progression is captured by a clinical scale called the ALS Functional Rating Scale (ALSFRS). The task is to predict the slope of the ALSFRS score from 3 to 12 months (after the start of the trial). For each patient, available covariates include ALSFRS scores for the 0-3 month period, demographic information and longitudinal measurements of clinical variables. We follow the featurization and imputation procedures devised by Mackay (see Küffner et al., 2015) to obtain the final set of covariates.

Subgroups were defined as follows. The first subgroup consists of patients with disease onset before the start of the trial. The second subgroup consists of patients for whom onset was after the start of the trial and who have negative ALSFRS slope. The third subgroup of patients also had onset after the start of the trial, but positive ALSFRS slope. Thus, the subgroups reflect severity of onset. As we believe that the pre-trial onset group is likely to differ most from the others, we manually set the distance between group 1 and each of the other two groups to 1 (the maximum), and set the distance between groups 2 and 3 to 0.1.

Figure 8 shows (held-out) RMSEs by subgroup; we see that the largest improvement in prediction performance is in subgroup 1. The fusion approach leads to a modest improvement. The difference between weighted and unweighted fusion is negligible.²

² This dataset is different from the one reported in Küffner et al. (2015), with larger variance in the slopes, and so RMSE values are not directly comparable; however, we note that performance for our methods compares favorably with that reported in the reference.


Figure 8: ALS prediction results. Box plots showing difference in RMSE of fused methods compared with the pooled linear regression model (higher values indicate better performance by the fused methods). [Subgroup-wise analysis performed less well than pooled and is not shown; boxplots are over 10-fold cross-validation.]

7 Conclusions

Many biomedical datasets are heterogeneous, spanning multiple disease types (or other biological contexts) that are related but also expected to have specific underlying biology. This means that large datasets are often usefully thought of as comprising several smaller datasets that have similarities but cannot be assumed to be identically distributed. Statistically efficient regression in these group-structured settings requires ways to pool information where it is useful to do so, whilst retaining the possibility of subgroup-specific parameters and sparsity patterns. We proposed a penalized likelihood approach for high-dimensional regression in the group-structured setting that provides group-specific estimates with global sparsity and allows information sharing between groups.
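The shape of the objective described above can be sketched as follows. This is an illustration of the general form (per-subgroup squared-error loss, a global ℓ1 penalty, and a term penalizing differences between subgroup coefficients), not the exact objective or code of our R package; in particular the weighted squared-ℓ2 fusion penalty here is one plausible choice among several.

```python
import numpy as np

def fused_objective(betas, Xs, ys, lam, gamma, W):
    """Illustrative penalized objective for K subgroups:
    sum of per-subgroup squared-error losses
    + lam * global l1 penalty (encourages global sparsity)
    + gamma * weighted fusion penalty on pairwise coefficient differences
      (encourages similar subgroups to share estimates).
    betas: list of K coefficient vectors; Xs, ys: per-subgroup data;
    W: symmetric matrix of fusion weights between subgroups.
    """
    K = len(Xs)
    loss = sum(np.sum((ys[k] - Xs[k] @ betas[k]) ** 2) for k in range(K))
    l1 = lam * sum(np.abs(b).sum() for b in betas)
    fusion = gamma * sum(
        W[j, k] * np.sum((betas[j] - betas[k]) ** 2)
        for j in range(K) for k in range(j + 1, K)
    )
    return loss + l1 + fusion
```

When all subgroup coefficient vectors coincide, the fusion term vanishes and the objective reduces to a pooled lasso-type criterion; as the fusion weight between two subgroups grows, their estimates are pulled together.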

In any given application, even when there are good scientific reasons to suspect differences in regression models between subgroups, it is hard to know in advance whether the nature of any differences is such that a specific kind of joint estimation would be beneficial. For example, if sample sizes are small and the groups only slightly different, pooling may be more effective; if the groups are entirely different, fusion of the kind we consider may not be useful. This means that in practice, either simple pooling or subgroup-wise analysis may outperform fusion. In our approach, the tuning parameter (set by cross-validation) determines the extent of fusion in a data-adaptive manner, and we saw in several examples that this appears successful in giving results that are at worst close to the best of the pooled and subgroup-wise analyses. For settings with widely divergent subgroup sample sizes, it may be important to allow the tuning parameters to depend on the subgroup (we did not do so) and to consider alternative formulations that allow for asymmetric fusion.

An appealing feature of our approach is that it allows for subgroup-specific sparsity patterns and parameter estimates that may themselves be of scientific interest. We discussed point estimation, but did not discuss uncertainty quantification for these subgroup-specific estimates. A number of recent papers have discussed significance testing for lasso-type models (see e.g. Wasserman and Roeder, 2009; Lockhart et al., 2014; Städler and Mukherjee, 2016) and we think some of these ideas could be used with the models proposed here.

8 Software Availability

The R code used for the experiments in this paper has been made available as R package fuser on GitHub: https://github.com/FrankD/fuser. Scripts for reproducing the results in this paper can be obtained at: http://fhm-chicas-code.lancs.ac.uk/dondelin/SubgroupFusionPrediction.

9 Acknowledgements

Data collection and sharing for the Alzheimer’s data application was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

References

  • Akbani et al. (2014) Rehan Akbani, Patrick Kwok Shing Ng, Henrica MJ Werner, Maria Shahmoradgoli, Fan Zhang, Zhenlin Ju, Wenbin Liu, Ji-Yeon Yang, Kosuke Yoshihara, Jun Li, et al. A pan-cancer proteomic perspective on the cancer genome atlas. Nature Communications, 5, 2014.
  • Allen et al. (2016) Genevera I Allen, Nicola Amoroso, Catalina Anghel, Venkat Balagurusamy, Christopher J Bare, Derek Beaton, Roberto Bellotti, David A Bennett, Kevin L Boehme, Paul C Boutros, et al. Crowdsourced estimation of cognitive decline and resilience in alzheimer’s disease. Alzheimer’s & Dementia, 12(6):645–653, 2016.
  • Barretina et al. (2012) Jordi Barretina, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A Margolin, Sungjoon Kim, Christopher J Wilson, Joseph Lehár, Gregory V Kryukov, Dmitriy Sonkin, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603–607, 2012.
  • Chen et al. (2010) Xi Chen, Seyoung Kim, Qihang Lin, Jaime G. Carbonell, and Eric P. Xing. Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv:1005.3579 [cs, math, stat], May 2010.
  • Danaher et al. (2014) Patrick Danaher, Pei Wang, and Daniela M Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):373–397, 2014.
  • Friedman et al. (2007) Jerome Friedman, Trevor Hastie, Holger Hoefling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, December 2007. ISSN 1932-6157, 1941-7330. doi: 10.1214/07-AOAS131. URL http://projecteuclid.org/euclid.aoas/1196438020.
  • Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008. ISSN 1465-4644, 1468-4357. doi: 10.1093/biostatistics/kxm045. URL http://biostatistics.oxfordjournals.org/content/9/3/432.
  • Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
  • Hoefling (2010) Holger Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4):984–1006, 2010.
  • Küffner et al. (2015) Robert Küffner, Neta Zach, Raquel Norel, Johann Hawe, David Schoenfeld, Liuxia Wang, Guang Li, Lilly Fang, Lester Mackey, Orla Hardiman, et al. Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression. Nature Biotechnology, 33(1):51–57, 2015.
  • Liu et al. (2010) Jun Liu, Lei Yuan, and Jieping Ye. An efficient algorithm for a class of fused lasso problems. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 323–332. ACM, 2010. URL http://dl.acm.org/citation.cfm?id=1835847.
  • Lockhart et al. (2014) Richard Lockhart, Jonathan Taylor, Ryan J Tibshirani, and Robert Tibshirani. A significance test for the lasso. Annals of Statistics, 42(2):413, 2014.
  • Mueller et al. (2005) Susanne G Mueller, Michael W Weiner, Leon J Thal, Ronald C Petersen, Clifford R Jack, William Jagust, John Q Trojanowski, Arthur W Toga, and Laurel Beckett. Ways toward an early diagnosis in alzheimer’s disease: the alzheimer’s disease neuroimaging initiative (adni). Alzheimer’s & Dementia, 1(1):55–66, 2005.
  • Nesterov (2005) Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127–152, 2005.
  • Oates et al. (2014) Chris J Oates, Jim Korkola, Joe W Gray, and Sach Mukherjee. Joint estimation of multiple related biological networks. The Annals of Applied Statistics, 8(3):1892–1919, 2014.
  • Oates et al. (2015) Chris J Oates, Jim Q Smith, Sach Mukherjee, and James Cussens. Exact estimation of multiple directed acyclic graphs. Statistics and Computing, pages 1–15, 2015.
  • Obozinski et al. (2010) Guillaume Obozinski, Ben Taskar, and Michael I Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.
  • Städler and Mukherjee (2016) Nicolas Städler and Sach Mukherjee. Two-sample testing in high-dimensional models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (To appear), 2016.
  • Tibshirani et al. (2005) Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
  • Wasserman and Roeder (2009) Larry Wasserman and Kathryn Roeder. High dimensional variable selection. Annals of Statistics, 37(5A):2178, 2009.
  • Weinstein et al. (2013) John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, Cancer Genome Atlas Research Network, et al. The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10):1113–1120, 2013.
  • Ye and Xie (2011) Gui-Bo Ye and Xiaohui Xie. Split bregman method for large scale fused lasso. Computational Statistics & Data Analysis, 55(4):1552–1569, 2011.
  • Yuan and Lin (2006) Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.