Feature Selection for Huge Data via Minipatch Learning

10/16/2020
by Tianyi Yao, et al.

Feature selection often leads to increased model interpretability, faster computation, and improved model performance by discarding irrelevant or redundant features. While feature selection is a well-studied problem with many widely used techniques, two key challenges typically remain: i) many existing approaches become computationally intractable in huge-data settings with millions of observations and features; and ii) the statistical accuracy of selected features degrades in high-noise, high-correlation settings, thus hindering reliable model interpretation. We tackle these problems by proposing Stable Minipatch Selection (STAMPS) and Adaptive STAMPS (AdaSTAMPS). These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny (adaptively chosen) random subsets of both the observations and features of the data, which we call minipatches. Our approaches are general and can be employed with a variety of existing feature selection strategies and machine learning techniques. In addition, we provide theoretical insights on STAMPS and empirically demonstrate that our approaches, especially AdaSTAMPS, dominate competing methods in terms of feature selection accuracy and computational time.
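The minipatch idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the marginal-correlation base selector, the `top_k` rule, and the frequency threshold are all assumptions made for illustration. Only uniform random sampling is shown; the adaptive sampling of AdaSTAMPS is not covered.

```python
import random
from statistics import fmean


def abs_correlation(x, y):
    """Absolute Pearson correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def minipatch_selection(X, y, n_obs, n_feat, n_iter=200, top_k=2,
                        threshold=0.5, seed=0):
    """Hypothetical sketch of ensemble feature selection over minipatches.

    X is a list of rows (observations), y a list of responses.
    Each iteration draws a random subset of rows and columns (a minipatch),
    runs a simple base selector on it, and records which features it picked.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    sampled = [0] * n_cols   # times each feature appeared in a minipatch
    selected = [0] * n_cols  # times each feature was picked by the base selector

    for _ in range(n_iter):
        rows = rng.sample(range(n_rows), n_obs)
        cols = rng.sample(range(n_cols), n_feat)
        y_sub = [y[i] for i in rows]

        # Assumed base selector: keep the top_k features in this minipatch
        # ranked by absolute marginal correlation with the response.
        scores = {j: abs_correlation([X[i][j] for i in rows], y_sub)
                  for j in cols}
        chosen = sorted(cols, key=scores.get, reverse=True)[:top_k]

        for j in cols:
            sampled[j] += 1
        for j in chosen:
            selected[j] += 1

    # Selection frequency: fraction of a feature's appearances in which
    # the base selector picked it.
    freq = [selected[j] / sampled[j] if sampled[j] else 0.0
            for j in range(n_cols)]
    stable = [j for j in range(n_cols) if freq[j] >= threshold]
    return stable, freq
```

Because each base selector only ever sees an `n_obs` by `n_feat` slice of the data, the cost per iteration is independent of the full data size, which is the computational appeal of the approach. Any base selector that returns a subset of the sampled columns could be swapped in for the correlation ranking used here.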

