On Ensembling vs Merging: Least Squares and Random Forests under Covariate Shift

06/04/2021
by Maya Ramchandran, et al.

It has been postulated and observed in practice that, for prediction problems in which covariate data can be naturally partitioned into clusters, ensembling algorithms that suitably aggregate models trained on individual clusters often perform substantially better than methods that ignore the clustering structure in the data. In this paper, we provide theoretical support for these empirical observations by asymptotically analyzing linear least squares and random forest regressions under a linear model. Our main results demonstrate that the benefit of ensembling compared to training a single model on the entire data, often termed 'merging', might depend on the interplay between the bias and variance of the individual predictors to be aggregated. In particular, under both fixed- and high-dimensional linear models, we show that merging is asymptotically superior to optimal ensembling techniques for linear least squares regression because least squares prediction is unbiased. In contrast, for random forest regression under fixed-dimensional linear models, our bounds imply a strict benefit of ensembling over merging. Finally, we present numerical experiments that verify our asymptotic results across different situations.
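
The least squares half of this claim can be illustrated with a small simulation. The following is only a minimal sketch under assumptions that are not taken from the paper: Gaussian covariates whose cluster centers are drawn at random, a linear model with no intercept, and an ensemble that averages the per-cluster fits with equal weights (a simplification, not the optimal weighting analyzed in the paper). The function name `simulate` and all of its parameters are hypothetical. Random forests are not covered here.

```python
import numpy as np


def simulate(n_reps=100, n_clusters=4, n_per_cluster=50, dim=5,
             shift_scale=2.0, noise_sd=1.0, n_test=500, seed=0):
    """Compare merged OLS with an equal-weight ensemble of per-cluster OLS fits.

    Returns (merged_error, ensemble_error): mean squared error against the
    noiseless signal on a held-out cluster, averaged over n_reps replications.
    All settings are illustrative assumptions, not the paper's setup.
    """
    rng = np.random.default_rng(seed)
    merged_err, ensemble_err = [], []

    for _ in range(n_reps):
        beta = rng.normal(size=dim)  # true coefficients of the linear model

        # Training clusters: each has its own covariate center (covariate shift).
        Xs, ys = [], []
        for _ in range(n_clusters):
            center = rng.normal(scale=shift_scale, size=dim)
            X = center + rng.normal(size=(n_per_cluster, dim))
            y = X @ beta + rng.normal(scale=noise_sd, size=n_per_cluster)
            Xs.append(X)
            ys.append(y)

        # Merging: one least squares fit on the pooled data.
        beta_merged = np.linalg.lstsq(
            np.vstack(Xs), np.concatenate(ys), rcond=None)[0]

        # Ensembling: one fit per cluster, then an equal-weight average.
        # For linear predictors, averaging predictions is the same as
        # predicting with the averaged coefficients.
        per_cluster = [np.linalg.lstsq(X, y, rcond=None)[0]
                       for X, y in zip(Xs, ys)]
        beta_ens = np.mean(per_cluster, axis=0)

        # Test cluster with a new center, unseen during training.
        center = rng.normal(scale=shift_scale, size=dim)
        X_test = center + rng.normal(size=(n_test, dim))
        signal = X_test @ beta

        merged_err.append(np.mean((X_test @ beta_merged - signal) ** 2))
        ensemble_err.append(np.mean((X_test @ beta_ens - signal) ** 2))

    return float(np.mean(merged_err)), float(np.mean(ensemble_err))


if __name__ == "__main__":
    merged, ensemble = simulate()
    print(f"merged OLS error:   {merged:.4f}")
    print(f"ensemble OLS error: {ensemble:.4f}")
```

In this setting both predictors are linear and unbiased, so the pooled fit should have the smaller error on average by the Gauss-Markov theorem. That matches the direction of the least squares result stated in the abstract, but it does not reproduce the paper's asymptotic analysis or its optimal ensembling weights.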

research
03/01/2022

Results Merging in the Patent Domain

In this paper, we test machine learning methods for results merging in p...
research
10/10/2017

LinXGBoost: Extension of XGBoost to Generalized Local Linear Models

XGBoost is often presented as the algorithm that wins every ML competiti...
research
07/11/2022

Multi-Study Boosting: Theoretical Considerations for Merging vs. Ensembling

Cross-study replicability is a powerful model evaluation criterion that ...
research
05/07/2018

Complete Analysis of a Random Forest Model

Random forests have become an important tool for improving accuracy in r...
research
08/18/2023

On High-Dimensional Asymptotic Properties of Model Averaging Estimators

When multiple models are considered in regression problems, the model av...
research
10/31/2017

Partial Least Squares Random Forest Ensemble Regression as a Soft Sensor

Six simple, dynamic soft sensor methodologies with two update conditions...
research
04/24/2020

DeepMerge: Classifying High-redshift Merging Galaxies with Deep Neural Networks

We investigate and demonstrate the use of convolutional neural networks ...