Merging versus Ensembling in Multi-Study Machine Learning: Theoretical Insight from Random Effects

05/17/2019 · Zoe Guan et al.

A critical decision point when training predictors using multiple studies is whether these studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We consider 1) merging all of the datasets and training a single learner, and 2) cross-study learning, which involves training a separate learner on each dataset and combining the resulting predictions. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than cross-study learning when the predictor-outcome relationships are relatively homogeneous across studies. However, as heterogeneity increases, there exists a transition point beyond which cross-study learning outperforms merging. We provide analytic expressions for the transition point in various scenarios and study asymptotic properties.


I Introduction

Prediction and classification models trained on a single study often perform considerably worse in external validation than in cross-validation [1, 2]. Their generalizability is compromised by overfitting, but also by various sources of study heterogeneity, including differences in study design, data collection and measurement methods, unmeasured confounders, and study-specific sample characteristics [3]. Using multiple training studies can potentially address these challenges and lead to more replicable prediction models. In many settings, such as precision medicine, multi-study learning is motivated by systematic data sharing and data curation initiatives. For example, the establishment of gene expression databases such as Gene Expression Omnibus [4] and ArrayExpress [5] and neuroimaging databases such as OpenNeuro [6] has facilitated access to sets of studies that provide comparable measurements of the same outcome and predictors (even if the original measurements are not directly comparable, they can often be made comparable through preprocessing and normalization procedures [7, 8]). For problems where such a set of studies is available, it is important to systematically integrate information across datasets when developing prediction and classification models.

One approach is to merge all of the datasets and treat the observations as if they are all from the same study (for example, see [9, 10]). The resulting increase in sample size can lead to improved training and better performance when the datasets are relatively homogeneous. Also, the merged dataset is often representative of a broader reference population than any of the individual datasets. Xu et al. [9] showed that a prognostic test for breast cancer metastases developed from merged data performed better than the prognostic tests developed using individual studies. Zhou et al. [11] proposed hypothesis tests for determining when it is beneficial to pool data across multiple sites for linear regression, compared to using data from a single site.

Another approach is to combine results from separately trained models. Meta-analysis and ensembling both fall under this approach. Meta-analysis combines summary measures from multiple studies to increase statistical power (for example, see [12, 13]). A common combination strategy is to take a weighted average of the study-specific summary measures. In fixed effects meta-analysis, the weights are based on the assumption that there is a single true parameter underlying all of the studies, while in random effects meta-analysis, the weights are derived from a model where the true parameter varies across studies according to a probability distribution. When learners are indexed by a finite number of common parameters, meta-analysis applied to these parameters can be used for multi-study learning, with useful results [12]. Various studies have compared meta-analysis to merging. For effect size estimation, Bravata and Olkin [14] showed that merging heterogeneous datasets can lead to spurious results while meta-analysis protects against such problematic effects. Taminau et al. [15] and Kosch and Jung [16] found that merging had higher sensitivity than meta-analysis in gene expression analysis, while Lagani et al. [17] found that the two approaches performed comparably in reconstruction of gene interaction networks. Ensemble learning methods [18], which combine predictions from multiple models, can also be used to leverage information in a multi-study setting. By combining predictions, ensembling leads to lower variance and higher accuracy, and is applicable to more general classes of learners than meta-analysis. Patil and Parmigiani [19] proposed cross-study learning, defined as weighted average ensembles of prediction models trained on different studies, as an alternative to merging. They showed empirically that when the datasets are heterogeneous, cross-study learning can lead to improved generalizability and replicability compared to merging and meta-analysis.

In this paper, we provide theoretical guidelines for determining whether it is more beneficial to merge or to ensemble. We consider both low-dimensional and high-dimensional linear regression settings by studying merged and cross-study learners (CSLs) based on ordinary least squares (LS) and ridge regression. We hypothesize a mixed effects model for heterogeneity and show that merging has lower prediction error than cross-study learning when heterogeneity is low, but as heterogeneity increases, there exists a transition point beyond which cross-study learning outperforms merging. We characterize this transition point analytically and study it via simulations. We also compare merging and cross-study learning in practice, using microbiome data.

II Problem Definition

We will use the following matrix notation: $I_n$ is the $n \times n$ identity matrix, $0_{m \times n}$ is an $m \times n$ matrix of 0's, $0_n$ is a vector of 0's of length $n$, $\mathrm{tr}(A)$ is the trace of matrix $A$, $\mathrm{diag}(a)$ is a diagonal matrix with the elements of $a$ along its diagonal, and $A_{ij}$ is the entry in row $i$ and column $j$ of matrix $A$. Other notation introduced throughout the paper is summarized in Table 1 of Appendix C.

Suppose we have $K$ comparable studies that measure the same outcome and the same predictors, and the datasets have been harmonized so that measurements across studies are on the same scale. For study $k = 1, \dots, K$, let $n_k$ denote the number of observations, $Y_k$ the outcome vector, and $X_k$ the $n_k \times (p+1)$ design matrix, where the first column of $X_k$ is a vector of 1's if there is an intercept. Assume the data are generated from the linear mixed effects model

$Y_k = X_k \beta + Z_k \gamma_k + \epsilon_k, \qquad k = 1, \dots, K, \qquad (1)$

where $\beta$ is the vector of fixed effects, $Z_k$ is the $n_k \times q$ design matrix for the random effects obtained by subsetting the columns of $X_k$, $\gamma_k$ is the vector of random effects with $E[\gamma_k] = 0_q$ and $\mathrm{Cov}(\gamma_k) = \Sigma$ where $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_q^2)$, $\epsilon_k$ is the vector of residual errors with $E[\epsilon_k] = 0_{n_k}$ and $\mathrm{Cov}(\epsilon_k) = \sigma_\epsilon^2 I_{n_k}$, and the $\gamma_k$'s and $\epsilon_k$'s are mutually independent. For $j = 1, \dots, q$, if $\sigma_j^2 > 0$, then the effect of the corresponding predictor differs across studies, and if $\sigma_j^2 = 0$, then the predictor has the same effect in each study.

The relationship between the predictors and the outcome in a given study can be seen as a perturbation of the population-level effect vector $\beta$. The degree of heterogeneity in predictor-outcome relationships across studies can be summarized by the sum of the variances of the random effects divided by the number of fixed effects: $\bar{\sigma}^2 = \sum_{j=1}^{q} \sigma_j^2 / (p+1)$. We are interested in comparing the performance of two multi-study learning approaches as $\bar{\sigma}^2$ varies: 1) merging all of the studies and fitting a single linear regression model, and 2) fitting a linear regression model on each study and forming a CSL by taking a weighted average of the predictions.
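The sketch below (not the authors' code; the dimensions, coefficients, and variance components are made up for illustration) simulates $K$ studies from Model (1) in R, taking the random-effects design $Z_k$ to be all of $X_k$ and encoding predictors with no random effect by setting the corresponding variance to 0.

# Minimal sketch: simulate K studies from Model (1) with hypothetical settings.
set.seed(1)
K   <- 4          # number of studies
n_k <- 40         # observations per study
p   <- 5          # number of predictors (plus an intercept)
beta    <- c(1, rnorm(p))            # fixed effects (intercept first)
sigma2  <- c(0.5, 0.5, 0, 0, 0, 0)   # random-effect variance for each column of X_k
sigma_e <- 1                         # residual standard deviation

simulate_study <- function() {
  X <- cbind(1, matrix(rnorm(n_k * p), n_k, p))       # design matrix with intercept
  gamma <- rnorm(p + 1, mean = 0, sd = sqrt(sigma2))  # study-specific random effects
  Y <- X %*% (beta + gamma) + rnorm(n_k, sd = sigma_e)
  list(X = X, Y = as.numeric(Y))
}
studies <- replicate(K, simulate_study(), simplify = FALSE)

# Heterogeneity summary: sum of random-effect variances over the number of fixed effects
sigma_bar2 <- sum(sigma2) / (p + 1)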

Learners

For low-dimensional settings where $n_k > p$ for all $k$, we consider merged and cross-study learners based on LS. The LS estimator of $\beta$ based on the merged data is

$\hat{\beta}_{m} = (X^\top X)^{-1} X^\top Y, \qquad (2)$

where $X = (X_1^\top, \dots, X_K^\top)^\top$ and $Y = (Y_1^\top, \dots, Y_K^\top)^\top$. The LS estimator based on study $k$ is

$\hat{\beta}_{k} = (X_k^\top X_k)^{-1} X_k^\top Y_k, \qquad (3)$

and the LS cross-study estimator is

$\hat{\beta}_{c} = \sum_{k=1}^{K} w_k \hat{\beta}_{k}, \qquad (4)$

where $w_1, \dots, w_K \geq 0$ and $\sum_{k=1}^{K} w_k = 1$.
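Continuing the sketch above, the closed forms in Equations 2-4 can be computed directly (again a hypothetical illustration rather than the paper's implementation), here with equal weights.

# Merged data
X_merged <- do.call(rbind, lapply(studies, `[[`, "X"))
Y_merged <- unlist(lapply(studies, `[[`, "Y"))

# Merged LS estimator: (X'X)^{-1} X'Y
beta_merged <- solve(crossprod(X_merged), crossprod(X_merged, Y_merged))

# Study-specific LS estimators
beta_k <- lapply(studies, function(s) solve(crossprod(s$X), crossprod(s$X, s$Y)))

# CSL estimator: weighted average of study-specific estimates (weights sum to 1)
w <- rep(1 / length(studies), length(studies))
beta_csl <- Reduce(`+`, Map(`*`, beta_k, w))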

For high-dimensional settings where $n_k \leq p$ for some $k$, we consider merged and cross-study learners based on ridge regression. When the number of predictors exceeds the number of observations, $X^\top X$ is not invertible, so $\beta$ is not estimable using LS. Ridge regression overcomes the non-invertibility problem [20], which can also arise in low-dimensional settings with highly correlated predictors, by penalizing the $\ell_2$-norm of the coefficient vector. The predictors are typically standardized prior to fitting the model since ridge regression shrinks all predictors proportionally (for example, Hoerl and Kennard's seminal paper [20] assumes the predictors have mean 0 and variance 1). The coefficient estimates based on the standardized data are then transformed back to the original scale. Note that ridge regression is location-invariant [21], so without loss of generality, we assume that the predictors are scaled but not centered prior to applying ridge regression. We first provide the form of the ridge regression estimators in the case where an intercept is included. Let the scaled versions of $X$ and $X_k$ be denoted by $\tilde{X} = X S$ and $\tilde{X}_k = X_k S_k$, where $S$ and $S_k$ are positive definite scaling matrices. If scaling is not necessary or desirable (for example, if the predictors are measured in the same units), then set $S = S_k = I_{p+1}$. Otherwise, let $S$ be diagonal with $S_{11} = 1$ and $S_{jj}$ equal to the inverse standard deviation of column $j$ of $X$ for $j = 2, \dots, p+1$, and let $S_k$ be diagonal with $(S_k)_{11} = 1$ and $(S_k)_{jj}$ equal to the inverse standard deviation of column $j$ of $X_k$ for $j = 2, \dots, p+1$. The merged ridge regression estimator of $\beta$ can be written as

$\hat{\beta}_{\lambda,m} = S \, \underset{\theta}{\arg\min} \left\{ \|Y - \tilde{X}\theta\|^2 + \theta^\top \Lambda \theta \right\} \qquad (5)$
$\phantom{\hat{\beta}_{\lambda,m}} = S \, (\tilde{X}^\top \tilde{X} + \Lambda)^{-1} \tilde{X}^\top Y, \qquad (6)$

where $\lambda$ is the regularization parameter and $\Lambda$ is obtained from $\lambda I_{p+1}$ by setting $\Lambda_{11} = 0$, so that the intercept is not regularized [21]. The estimator of $\beta$ from study $k$ is

$\hat{\beta}_{\lambda_k,k} = S_k \, \underset{\theta}{\arg\min} \left\{ \|Y_k - \tilde{X}_k\theta\|^2 + \theta^\top \Lambda_k \theta \right\} \qquad (7)$
$\phantom{\hat{\beta}_{\lambda_k,k}} = S_k \, (\tilde{X}_k^\top \tilde{X}_k + \Lambda_k)^{-1} \tilde{X}_k^\top Y_k, \qquad (8)$

where $\Lambda_k$ is obtained from $\lambda_k I_{p+1}$ in the same way, and the CSL estimator is

$\hat{\beta}_{\lambda,c} = \sum_{k=1}^{K} w_k \hat{\beta}_{\lambda_k,k}. \qquad (9)$

If there is no intercept, then we set the diagonal entries of $S$ and $S_k$ to be the inverse standard deviations of the columns of $X$ and $X_k$ respectively and replace $\Lambda$ and $\Lambda_k$ with $\lambda I_p$ and $\lambda_k I_p$ in the expressions above.

For simplicity, we assume the weights $w_k$ and the regularization parameters $\lambda$ and $\lambda_k$ are predetermined. Note that for linear regression, averaging predictions across study-specific learners is equivalent to averaging the estimated coefficient vectors across study-specific learners and then computing predictions. Thus, cross-study learning based on linear regression is similar to meta-analysis of effect sizes. In particular, when there is a single predictor, calculating the CSL estimate is equivalent to performing a standard univariate meta-analysis. With multiple predictors, the CSL weights each predictor equally within a given study, while meta-analytic approaches, which involve either performing separate univariate meta-analyses for each predictor or performing a multivariate meta-analysis (for example, see [22, 23]), do not impose this constraint.
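The ridge estimators above can be sketched along the same lines; the helper below hand-rolls the scaling matrices and the unregularized intercept instead of calling glmnet, so details such as the standard deviation convention are assumptions of this illustration rather than the paper's implementation.

# Sketch: ridge estimator with scaled predictors and an unpenalized intercept.
ridge_fit <- function(X, Y, lambda) {
  p1 <- ncol(X)
  s  <- c(1, 1 / apply(X[, -1, drop = FALSE], 2, sd))  # S: leave intercept, scale the rest
  S  <- diag(s)
  X_tilde <- X %*% S
  Lambda  <- diag(c(0, rep(lambda, p1 - 1)))           # intercept not regularized
  theta   <- solve(crossprod(X_tilde) + Lambda, crossprod(X_tilde, Y))
  S %*% theta                                          # back to the original scale
}

beta_ridge_merged <- ridge_fit(X_merged, Y_merged, lambda = 1)
beta_ridge_k   <- lapply(studies, function(s) ridge_fit(s$X, s$Y, lambda = 1))
beta_ridge_csl <- Reduce(`+`, Map(`*`, beta_ridge_k, w))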

Performance Comparison

Given a test set with design matrix $X_0$ and outcome vector $Y_0$, the goal is to identify conditions under which cross-study learning has lower mean squared prediction error (MSPE) than merging, i.e.

$E\left[\|Y_0 - X_0 \hat{\beta}_{c}\|^2\right] < E\left[\|Y_0 - X_0 \hat{\beta}_{m}\|^2\right],$

where the expectations are taken with respect to the random effects and residual errors of the training and test studies and $\|\cdot\|$ is the $\ell_2$ norm.
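As a concrete (and purely illustrative) check, the two MSPEs can be estimated by Monte Carlo with test data drawn from the same hypothetical model used in the earlier sketches.

# Sketch: compare merged and CSL prediction error on a simulated test study.
test <- simulate_study()
mspe <- function(beta_hat) mean((test$Y - test$X %*% beta_hat)^2)
c(merged = mspe(beta_merged), csl = mspe(beta_csl))
# Averaging these quantities over many simulated replicates approximates the
# expectations in the comparison above.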

III Results

We consider two cases for the structure of $\Sigma$: equal variances and unequal variances. Let $\sigma^2_{(1)}, \dots, \sigma^2_{(m)}$ be the distinct values on the diagonal of $\Sigma$ and let $q_l$ be the number of random effects with variance $\sigma^2_{(l)}$. In the equal variances case, where $m = 1$ and $\sigma_j^2 = \sigma^2_{(1)}$ for $j = 1, \dots, q$, we provide a necessary and sufficient condition for the CSL to outperform the merged learner. In the unequal variances case, we provide sufficient conditions under which the CSL outperforms the merged learner and vice versa. These conditions allow us to characterize a transition point in terms of $\bar{\sigma}^2$ between a regime that favors merging and a regime that favors cross-study learning.

In order to present the results more concisely, let , , , and . Also, let be a matrix where if random effect is the th random effect with variance and otherwise, so that subsets to the columns corresponding to .

Proofs of the results are provided in Appendix A.

i Least Squares

For the LS results, assume we are in a low-dimensional setting where $n_k > p$ for all $k$.

i.1 Equal Variances

Definition 1.

Define

(10)
Theorem 1.

Suppose the random effects have equal variances ( for ) and

(11)

Then if and only if .

By Theorem 1, for any fixed weighting scheme that does not depend on , satisfies Equation 11, and leads to in , represents a transition point from a regime where merging outperforms cross-study learning to a regime where cross-study learning outperforms merging. When equal weights are used and is not identical for all , it follows from Jensen’s operator inequality [24] that Equation 11 holds and the numerator of is positive, so the transition point always exists.

Corollary 1.1.

Suppose the random effects have equal variances and there exist positive definite matrices such that as ,

where denotes almost sure convergence. If we set , then

(12)

For example, suppose all study sizes are equal to , the predictors are independent and identically distributed within and across studies, and , , are positive definite. Then Corollary 1.1 applies with , , and . In the special case where and the predictor follows , the limit becomes

(13)

and the asymptotic transition point is controlled simply by the variance of the residuals, the variance of the predictor, and the study sample size.

i.2 Unequal Variances

Definition 2.

Define

(14)
(15)
Theorem 2.

Suppose

(16)

Then when .

Suppose

(17)

Then when .

Corollary 2.1.

Suppose there exist positive definite matrices such that as ,

  1. for

and we set .

If

(18)

then

(19)

If

(20)

then

(21)

In the unequal variances scenario, we can establish a transition interval such that the merged learner outperforms the CSL when is smaller than the lower bound of the interval and the CSL outperforms the merged learner when is greater than the upper bound of the interval. Note that the conditions and results for the equal variances scenario are special cases of the conditions and results for the unequal variances scenario. When , we have , , and .

i.3 Optimal Weights

It can be shown (see Appendix A) that the optimal weights for the CSL are given by

(22)

where the weight for study is proportional to the inverse MSPE of the LS learner trained on that study.

In the equal variances setting, . We saw previously that when the weights satisfy Equation 11 and do not depend on , characterizes the value of beyond which cross-study learning outperforms merging. The optimal weights depend on , so depends on under the optimal weighting scheme. Thus, it is difficult to obtain a closed-form expression for the transition point, though numerical methods can be used to solve for the value of such that . In Appendix A, we provide a closed-form approximation of the transition point. Note that the transition point under any fixed weighting scheme provides an upper bound for the transition point under the optimal weighting scheme. We also remark that in the special case where , the optimally weighted CSL has the same variance as the estimator from the true mixed effects model. If and are known, then the CSL will always be at least as efficient as the merged learner when , but in practice, and need to be estimated.
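As a rough illustration of inverse-MSPE weighting, the sketch below estimates each study's MSPE by within-study cross-validation and weights the study-specific LS learners accordingly; this is only a stand-in for the model-based optimal weights in Equation 22 and is not the authors' procedure.

# Sketch: weight each study-specific LS learner by 1 / (estimated MSPE), normalized.
estimate_mspe_cv <- function(X, Y, folds = 5) {
  n  <- nrow(X)
  id <- sample(rep(seq_len(folds), length.out = n))
  errs <- sapply(seq_len(folds), function(f) {
    b <- solve(crossprod(X[id != f, ]), crossprod(X[id != f, ], Y[id != f]))
    mean((Y[id == f] - X[id == f, ] %*% b)^2)
  })
  mean(errs)
}
mspe_k <- sapply(studies, function(s) estimate_mspe_cv(s$X, s$Y))
w_opt  <- (1 / mspe_k) / sum(1 / mspe_k)   # weights sum to 1
beta_csl_opt <- Reduce(`+`, Map(`*`, beta_k, w_opt))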

ii Ridge Regression

Below, we present results for ridge regression that are applicable to both low- and high-dimensional settings.

ii.1 Equal Variances

Definition 3.

Define

(23)
(24)
(25)
Theorem 3.

Suppose the random effects have equal variances and

(26)

Then if and only if .

ii.2 Unequal Variances

Definition 4.

Define

(27)
(28)
Theorem 4.

Suppose

(29)

Then when .

Suppose

(30)

Then when .

Again, the conditions and results for the equal variances scenario are special cases of the conditions and results for the unequal variances scenario.

iii Interpretation

The covariance matrices of linear regression coefficient estimators can be written as a sum of two components, one driven by between-study variability and one driven by within-study variability. For example, for LS we have

$\mathrm{Cov}(\hat{\beta}_{m}) = (X^\top X)^{-1} \left( \sum_{k=1}^{K} X_k^\top Z_k \Sigma Z_k^\top X_k \right) (X^\top X)^{-1} + \sigma_\epsilon^2 (X^\top X)^{-1},$

where the first term is the between-study component and the second is the within-study component, and

$\mathrm{Cov}(\hat{\beta}_{c}) = \sum_{k=1}^{K} w_k^2 \left[ (X_k^\top X_k)^{-1} X_k^\top Z_k \Sigma Z_k^\top X_k (X_k^\top X_k)^{-1} + \sigma_\epsilon^2 (X_k^\top X_k)^{-1} \right].$

Since the merged learner ignores between-study heterogeneity, the trace of its first component is generally larger than that of the CSL. However, since the merged learner is trained on a larger sample, the trace of its second component is generally smaller than that of the CSL. The merged and cross-study learners based on LS are unbiased, so the transition point depends on the trade-off between these two components. In the single-predictor special case, Expression 13 shows that having a higher-variance predictor favors cross-study learning over merging, since increasing the variance of the predictor amplifies the impact of the random effect.

Unlike LS estimators, ridge regression estimators are biased as a result of regularization. The transition point for ridge regression depends on the regularization parameters used on the merged and individual datasets. It also depends on the true coefficient vector $\beta$ through the squared bias terms in the MSPEs of the merged and cross-study learners, so an estimate of $\beta$ is needed to compute the expressions in Theorems 3 and 4. These expressions can vary considerably for different choices of regularization parameters and different values of $\beta$. We did not provide asymptotic results for ridge regression as $K \to \infty$ (with the study sizes held constant) because this scenario is not entirely fair to the CSL. For fixed study sizes and sufficiently large $K$, the merged learner will be in the low-dimensional setting while the CSL will remain in the high-dimensional setting. As $K \to \infty$, the bias term approaches 0 for the merged learner (assuming the regularization parameter does not grow too quickly with the total sample size) but not for the CSL, which suggests that when $K$ is sufficiently large, merging will always yield lower MSPE than cross-study learning. Also, due to the squared bias term in the MSPE, it is not straightforward to derive optimal CSL weights for ridge regression.

In general, the transition points for LS and ridge regression depend on the design matrix of the test set. However, the test design matrix drops out when it is a scalar multiple of an orthogonal matrix, for example when the columns of the test design matrix are orthogonal and have equal norms.

IV Simulations

We conducted simulations to verify the theoretical results for LS and ridge regression and to compare them to the empirical transition points for three methods for which we could not find a closed-form solution: LASSO, single hidden layer neural network, and random forest. We also made performance comparisons with a linear mixed effects model, univariate random effects meta-analyses, and multivariate random effects meta-analysis. We used the R packages glmnet, nnet, randomForest, nlme, metafor, and mvmeta for ridge regression/LASSO, neural networks, random forests, linear mixed effects models, univariate meta-analyses, and multivariate meta-analyses, respectively.

We considered four simulation scenarios corresponding to the settings in Theorems 1, 2, 3, and 4. We used 4 training studies and 4 test studies of size 40 for all scenarios, with the residual variance $\sigma_\epsilon^2$ held fixed. For the low-dimensional settings, we used 10 predictors and generated 5 of the true coefficients from one distribution and 5 from another, with 5 of the predictors having random slopes. For the high-dimensional settings, we used 100 predictors and generated 30 of the true coefficients from one distribution and 70 from another, with 10 of the predictors having random slopes. For each simulation scenario, we fixed the predictor values in the training and test sets and the model hyperparameters. Predictor values were sampled from datasets in the curatedOvarianData R package [25]. Model hyperparameters were tuned once using 5-fold cross-validation with outcomes generated under Model (1). For various values of $\bar{\sigma}^2$, including 0 and the theoretical transition point, we generated random slopes, residual errors, and outcomes for each training and test study according to Model (1), then trained and tested the following approaches: linear mixed effects model, random effects meta-analysis of univariate LS estimates, random effects multivariate meta-analysis, and merged learners and CSLs based on LS, ridge regression, LASSO, neural networks, and random forests. For ridge regression and LASSO, the predictors were standardized prior to model fitting. For linear mixed effects, we fit the true model, using restricted maximum likelihood to estimate the variance components. For meta-analysis of univariate LS estimates, we used the DerSimonian and Laird method. For multivariate meta-analysis, we used restricted maximum likelihood to estimate the between-study covariance, constraining the covariance matrix to be diagonal. LS, linear mixed effects, and meta-analysis were only applied in the low-dimensional setting. We performed 1000 replicates for each value of $\bar{\sigma}^2$ and estimated the MSPE of each estimator by averaging the squared error across replicates.
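A stripped-down version of this simulation logic is sketched below, reusing the quantities from the earlier sketches; unlike the paper's setup, it draws fresh Gaussian predictors rather than fixing values from curatedOvarianData, and the grid, replicate count, and equal-variance structure are arbitrary choices.

# Sketch: sweep a grid of heterogeneity levels and compare average MSPEs to locate
# the empirical transition point between merging and cross-study learning.
simulate_once <- function(sigma_bar2) {
  sig <- sqrt(rep(sigma_bar2, p + 1))   # equal variances: each sigma_j^2 = sigma_bar2
  gen <- function() {
    X <- cbind(1, matrix(rnorm(n_k * p), n_k, p))
    Y <- X %*% (beta + rnorm(p + 1, sd = sig)) + rnorm(n_k, sd = sigma_e)
    list(X = X, Y = as.numeric(Y))
  }
  tr <- replicate(K, gen(), simplify = FALSE)
  te <- gen()
  Xm <- do.call(rbind, lapply(tr, `[[`, "X"))
  Ym <- unlist(lapply(tr, `[[`, "Y"))
  b_m <- solve(crossprod(Xm), crossprod(Xm, Ym))
  b_c <- Reduce(`+`, lapply(tr, function(s) solve(crossprod(s$X), crossprod(s$X, s$Y)) / K))
  c(merged = mean((te$Y - te$X %*% b_m)^2), csl = mean((te$Y - te$X %*% b_c)^2))
}

grid   <- seq(0, 0.3, by = 0.03)
curves <- sapply(grid, function(h) rowMeans(replicate(300, simulate_once(h))))
# One column of `curves` per heterogeneity level; the empirical transition point is
# roughly where curves["csl", ] drops below curves["merged", ].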

As seen in Figures 1 and 2, the empirical transition points for LS and ridge regression agree with the theoretical results from Theorems 1 and 3 (similar figures for Theorems 2 and 4 are provided in Appendix B). The methods all have similar empirical transition points except for random forest, which performed considerably worse than all of the other approaches (see Figure 8 in the Appendix). The poor performance of random forest could be because the data were generated from a linear model. The univariate meta-analysis approach also performed poorly (Figure 8), which is unsurprising because the generating model is a multivariate model. The performance of the other models relative to the data-generating model is summarized in Figure 3 for three values of $\bar{\sigma}^2$. When $\bar{\sigma}^2 = 0$, the merged regression learners and multivariate meta-analysis perform as well as or slightly better than the mixed effects model and outperform the CSLs. The merged neural network learner does slightly worse than the regression learners. At the LS transition point, all models perform similarly. Beyond the transition point, the models continue to perform similarly (when heterogeneity is high, all models perform poorly), with the CSLs slightly outperforming the merged learners and multivariate meta-analysis performing as well as the mixed effects model. For each of the three values of $\bar{\sigma}^2$, LASSO performed best, even slightly outperforming the mixed effects model and multivariate meta-analysis. This is likely because several of the true coefficients were close to 0.

Figure 1: 4 training studies of size 40 with 10 predictors. The red lines correspond to the theoretical transition points from Theorems 1 and 3.
Figure 2: 4 training studies of size 40 with 100 predictors. The red line corresponds to the theoretical transition point from Theorem 3.
Figure 3: Performance comparisons for three values of $\bar{\sigma}^2$ (the random forest and univariate meta-analysis results are omitted to avoid stretching the y-axis but are provided in Appendix B). LS,M: merged LS learner; LS,C: CSL based on LS; R,M: merged ridge regression learner; R,C: CSL based on ridge regression; L,M: merged LASSO learner; L,C: CSL based on LASSO; NN,M: merged neural network; NN,C: CSL based on neural networks; MA: multivariate meta-analysis.

V Metagenomics Application

To illustrate in a practical example, we compared the performance of merging and cross-study learning using datasets from the curatedMetagenomicData R package [26], which contains a collection of curated, uniformly processed human microbiome data. We focused on three gut microbiome studies that measured cholesterol as well as gene marker abundance in stool, restricting to samples from female patients: 1) Qin et al.'s 2012 study of Chinese type 2 diabetes patients and non-diabetic controls (samples from independent female patients) [27], 2) Karlsson et al.'s 2013 study of middle-aged European women with normal, impaired, or diabetic glucose control (samples from independent female patients) [28], and 3) Heintz-Buschart et al.'s 2016 study of patients with a family history of type 1 diabetes (samples from 13 female patients) [29]. We used merging and cross-study learning to train linear regression models to predict cholesterol, calculated the theoretical transition interval, and evaluated the performance of the two approaches.

We considered two scenarios: 1) training on different subsets of the same study and testing on a held-out subset, and 2) training on different studies and testing on an independent study. In the first scenario, we randomly split the Qin et al. 2012 samples into five datasets of approximately equal size, using four for training and the remaining one for testing. We used age and the top five marker abundances most correlated with the outcome in the training set as the predictors. In the second scenario, we used the Qin et al. 2012 and Karlsson et al. 2013 datasets for training and the Heintz-Buschart et al. 2016 dataset for testing. We used age and the top twenty marker abundances most correlated with the outcome in the training set as the predictors. In each scenario, we fit merged and CSL versions of LS and ridge regression. We estimated the variance components by fitting a linear mixed effects model using restricted maximum likelihood, allowing each predictor to have a random effect. For the CSLs, we used the optimal weights given by Equation 22, plugging in the estimated variance components. We calculated the theoretical transition bounds from Theorems 2 and 4 and compared them to the estimate of $\bar{\sigma}^2$. We evaluated the performance of the models empirically by calculating the prediction error on the test set.
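A sketch of the variance-component estimation step is shown below using lme4 for brevity (the paper's analyses used nlme); the data frame construction and the hard-coded five predictors refer to the simulated studies from the earlier sketches, so this illustrates the workflow rather than reproducing the application.

# Sketch: REML estimation of the random-effect variances, one independent random
# slope per predictor plus a random intercept, across studies.
library(lme4)

dat <- do.call(rbind, lapply(seq_along(studies), function(k) {
  d <- as.data.frame(studies[[k]]$X[, -1])
  names(d) <- paste0("x", seq_len(p))
  d$y <- studies[[k]]$Y
  d$study <- factor(k)
  d
}))

fit <- lmer(y ~ x1 + x2 + x3 + x4 + x5 +
              (1 | study) + (0 + x1 | study) + (0 + x2 | study) +
              (0 + x3 | study) + (0 + x4 | study) + (0 + x5 | study),
            data = dat, REML = TRUE)

vc <- as.data.frame(VarCorr(fit))   # one row per variance component
sigma_bar2_hat <- sum(vc$vcov[vc$grp != "Residual"]) / (p + 1)
# With only a handful of studies, these variance estimates are imprecise; the
# resulting sigma_bar2_hat is what gets compared against the transition bounds.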

In the first scenario, the estimate of $\bar{\sigma}^2$ fell below the lower bound of the transition interval, suggesting that merging was expected to outperform cross-study learning. In the test set, the merged versions of LS and ridge regression both had lower prediction error than the respective CSL versions (Figure 4). In the second scenario, the estimate of $\bar{\sigma}^2$ was larger relative to the transition bounds. In the test set, the CSL versions of LS and ridge regression both had lower prediction error than the respective merged versions (Figure 5).

Figure 4:

Root mean square prediction error (RMSPE) for the first scenario with bootstrap confidence intervals. LS,M: merged LS learner; LS,C: CSL based on LS; R,M: merged ridge regression learner; R,C: CSL based on ridge regression.

Figure 5: Root mean square prediction error (RMSPE) for the second scenario with bootstrap confidence intervals. LS,M: merged LS learner; LS,C: CSL based on LS; R,M: merged ridge regression learner; R,C: CSL based on ridge regression.

VI Discussion

The availability of large and increasingly heterogeneous collections of data for training classifiers is challenging traditional approaches for training and validating prediction and classification algorithms. At the same time, it is also opening opportunities for new and more general paradigms. One of these is cross-study machine learning via CSLs, motivated by variation in the relation between predictors and outcomes across collections of similar studies. A natural benchmark for these methods is to combine all training studies, to exploit the power of larger training sample sizes. In previous work [19], merged learners performed better than CSLs in low-heterogeneity settings. As heterogeneity increases, however, our earlier simulations indicated a "transition point" in the heterogeneity scale beyond which acknowledging cross-study heterogeneity becomes preferable, and the CSLs outperform the merged learners.

In this paper, we approached this problem analytically for the first time by characterizing cross-study heterogeneity using a linear mixed effects model. We derived closed-form transition points for standard and ridge-regularized linear regression models. We confirmed the analytic results in simulation and demonstrated that when the data are generated by a linear model, the LS and ridge regression solutions can serve as proxies for the transition point under other learning strategies (LASSO, neural network) for which closed-form derivation is difficult. Finally, we estimated the transition point in cases of low and high cross-study heterogeneity in microbiome data and showed how it can be used as a guide for deciding when and when not to merge studies together in the course of learning a prediction rule.

We focused on deriving analytic results for LS and ridge regression because of the opportunity to pursue closed-form solutions. Other widely used methods such as LASSO, neural networks, and random forests are not as easily amenable to a closed-form solution, so we used simulations to study the performance of merging versus ensembling for these methods. In our simulation settings, the merged learners based on LS, ridge regression, LASSO, and neural networks had comparable accuracy, as did the corresponding CSLs. The methods all had similar empirical transition points, perhaps as a consequence of their similar performance. An exception is random forest, which did not reach a transition point within the specified heterogeneity levels, and also performed less well in general, as is expected in data generated by linear models. The analytic results for LS/ridge regression could potentially serve as an approximation for other methods that perform comparably, though it is important to consider how the reliability of such an approximation could be affected by the nature of the data and choice of model hyperparameters.

In practice, the analytic transition point and transition interval expressions could be used to help guide decisions about whether to merge data from multiple studies when there is potential heterogeneity in predictor-outcome relationships across the study populations. The heterogeneity summary $\bar{\sigma}^2$ can be estimated from the training data and compared to the theoretical transition points or bounds for LS and/or ridge regression. Various methods can be used to estimate the variance components, including maximum likelihood and method of moments-based approaches used in meta-analysis (for example, see [30]), with the caveat that estimates will be imprecise when the number of studies is small.
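For a single coefficient, the DerSimonian and Laird approach reduces to a simple moment estimator of the between-study variance; a minimal sketch with made-up inputs is given below (metafor::rma with method = "DL" returns the same quantity).

# Sketch: DerSimonian-Laird estimator of between-study variance for one coefficient,
# given study-specific estimates `b` and their within-study variances `v`.
dl_tau2 <- function(b, v) {
  w <- 1 / v
  b_bar <- sum(w * b) / sum(w)                 # fixed-effects pooled estimate
  Q <- sum(w * (b - b_bar)^2)                  # Cochran's Q statistic
  c_const <- sum(w) - sum(w^2) / sum(w)
  max(0, (Q - (length(b) - 1)) / c_const)      # truncate at zero
}

# Hypothetical inputs: four study-specific slope estimates and their variances
dl_tau2(b = c(0.8, 1.1, 1.4, 0.6), v = c(0.04, 0.06, 0.05, 0.08))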

Under Model (1), fitting a correctly specified mixed effects model will generally be more efficient than both the merged and cross-study versions of LS. However, more flexible machine learning algorithms can potentially yield better prediction accuracy than the true model. For example, in the low-dimensional simulations, the mixed effects model was outperformed by either the merged learner or the CSL based on LASSO for most levels of heterogeneity. Moreover, fitting a mixed effects model can be computationally difficult when the number of predictors is large, and standard mixed effects models are not appropriate for high-dimensional data, though there are methods for penalized mixed effects models [31, 32, 33, 34].

A limitation of our derivations is that they treat the following quantities as known: the subset of predictors with random effects, the CSL weights, and the regularization parameters for ridge regression. In practice, these are usually selected using statistical procedures that introduce additional variability. Furthermore, we obtained closed-form transition point expressions for cases where the CSL weighting scheme does not depend on the variances of the random effects. Such weighting schemes are generally sub-optimal (for example, the optimal weights for LS given by Equation 22 depend on the random effect variances), so the closed-form results are based on a conservative estimate of the maximal performance of cross-study learning. Another limitation is the assumption that the random effects are uncorrelated, which is often not true in practice.

In summary, although this work is predicated upon the assumption that cross-study heterogeneity manifests through random effects and assumes that weights and regularization parameters are known, we believe it provides a theoretical rationale for multi-study machine learning, and a strong foundation for developing practical rules and guidelines to implement it.

VII Reproducibility

Code to reproduce the simulations and data application is available at
https://github.com/zoeguan/transition_point

VIII Acknowledgements

Work supported by NIH grants 4P30CA006516-51 (Parmigiani) and 2T32CA009337-36 (Patil), NSERC PGS-D Scholarship (Guan) and NSF grant DMS-1810829 (Patil and Parmigiani). We thank Lorenzo Trippa and Boyu Ren for useful discussions.

References

  • [1] P. J. Castaldi, I. J. Dahabreh, and J. P. Ioannidis. An empirical assessment of validation practices for molecular classifiers. Briefings in bioinformatics, 12(3):189–202, 2011.
  • [2] C. Bernau, M. Riester, A.-L. Boulesteix, G. Parmigiani, C. Huttenhower, L. Waldron, and L. Trippa. Cross-study validation for the assessment of prediction algorithms. Bioinformatics, 30(12):i105–i112, Jun 2014. PMID: 24931973.
  • [3] Y. Zhang, C. Bernau, G. Parmigiani, and L. Waldron. The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. Biostatistics, page kxy044, 2018.
  • [4] R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research, 30(1):207–210, 2002.
  • [5] H. Parkinson, U. Sarkans, N. Kolesnikov, N. Abeygunawardena, T. Burdett, M. Dylag, I. Emam, A. Farne, E. Hastings, E. Holloway, et al. ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic acids research, 39(suppl_1):D1002–D1004, 2010.
  • [6] K. Gorgolewski, O. Esteban, G. Schaefer, B. Wandell, and R. Poldrack. OpenNeuro—a free online platform for sharing and analysis of neuroimaging data. Organization for Human Brain Mapping. Vancouver, Canada, page 1677, 2017.
  • [7] C. Lazar, S. Meganck, J. Taminau, D. Steenhoff, A. Coletta, C. Molter, D. Y. Weiss-Solís, R. Duque, H. Bersini, and A. Nowé. Batch effect removal methods for microarray gene expression data integration: a survey. Briefings in bioinformatics, 14(4):469–490, 2012.
  • [8] M. Benito, J. Parker, Q. Du, J. Wu, D. Xiang, C. M. Perou, and J. S. Marron. Adjustment of systematic microarray data biases. Bioinformatics, 20(1):105–114, 2004.
  • [9] L. Xu, A. C. Tan, R. L. Winslow, and D. Geman. Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC bioinformatics, 9(1):125, 2008.
  • [10] H. Jiang, Y. Deng, H.-S. Chen, L. Tao, Q. Sha, J. Chen, C.-J. Tsai, and S. Zhang. Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC bioinformatics, 5(1):81, 2004.
  • [11] H. H. Zhou, Y. Zhang, V. K. Ithapu, S. C. Johnson, G. Wahba, and V. Singh. When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, $\ell_2$-consistency and Neuroscience Applications. arXiv preprint arXiv:1709.00640, 2017.
  • [12] M. Riester, J. M. Taylor, A. Feifer, T. Koppie, J. E. Rosenberg, R. J. Downey, B. H. Bochner, and F. Michor. Combination of a novel gene expression signature with a clinical nomogram improves the prediction of survival in high-risk bladder cancer. Clinical Cancer Research, 18(5):1323–1333, March 2012.
  • [13] G. C. Tseng, D. Ghosh, and E. Feingold. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic acids research, 40(9):3785–3799, 2012.
  • [14] D. M. Bravata and I. Olkin. Simple pooling versus combining in meta-analysis. Evaluation & the health professions, 24(2):218–230, 2001.
  • [15] J. Taminau, C. Lazar, S. Meganck, and A. Nowé. Comparison of merging and meta-analysis as alternative approaches for integrative gene expression analysis. ISRN bioinformatics, 2014, 2014.
  • [16] R. Kosch and K. Jung. Conducting gene set tests in meta-analyses of transcriptome expression data. Research synthesis methods, 2018.
  • [17] V. Lagani, A. D. Karozou, D. Gomez-Cabrero, G. Silberberg, and I. Tsamardinos. A comparative evaluation of data-merging and meta-analysis methods for reconstructing gene-gene interactions. BMC bioinformatics, 17(5):S194, 2016.
  • [18] T. G. Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
  • [19] P. Patil and G. Parmigiani. Training replicable predictors in multiple studies. Proceedings of the National Academy of Sciences, 115(11):2578–2583, 2018.
  • [20] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
  • [21] P. J. Brown. Centering and scaling in ridge regression.