A Mathematical Framework for Feature Selection from Real-World Data with Non-Linear Observations

08/31/2016
by   Martin Genzel, et al.

In this paper, we study the challenge of feature selection based on a relatively small collection of sample pairs {(x_i, y_i)}_{1 ≤ i ≤ m}. The observations y_i ∈ ℝ are assumed to follow a noisy single-index model that depends on a certain set of signal variables. A major difficulty is that these variables usually cannot be observed directly, but rather arise as hidden factors in the actual data vectors x_i ∈ ℝ^d (feature variables). We prove that successful variable selection is still possible in this setup, even when the applied estimator has no knowledge of the underlying model parameters and takes only the 'raw' samples {(x_i, y_i)}_{1 ≤ i ≤ m} as input. The model assumptions of our results are fairly general, allowing for non-linear observations, arbitrary convex signal structures, and strictly convex loss functions. This is particularly appealing for practical purposes, since in many applications even standard methods, e.g., the Lasso or logistic regression, yield surprisingly good outcomes. Apart from a general discussion of the practical scope of our theoretical findings, we also derive a rigorous guarantee for a specific real-world problem, namely sparse feature extraction from (proteomics-based) mass spectrometry data.
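To make the setting concrete, the following is a minimal sketch of feature selection from raw sample pairs under a noisy single-index model. It is not the paper's estimator or experiment. The tanh link, the Gaussian design, the problem sizes, the regularization weight `lam`, and the helper names `simulate`, `soft_threshold`, and `lasso_coordinate_descent` are all assumptions chosen for illustration. For simplicity, the sketch also lets the feature variables coincide with the signal variables, whereas the abstract treats the signal variables as hidden factors. The estimator is a plain Lasso, solved by cyclic coordinate descent, that sees only the pairs (x_i, y_i) and never the link function.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2m)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    m, d = len(X), len(X[0])
    beta = [0.0] * d
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(m)) / m for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes beta[j].
            rho = sum(X[i][j] * residual[i] for i in range(m)) / m + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def simulate(m=200, d=30, support=(0, 1, 2), noise=0.1, seed=0):
    """Draw pairs (x_i, y_i) with y_i = tanh(<x_i, z0>) + noise (assumed toy model)."""
    rng = random.Random(seed)
    z0 = [1.0 if j in support else 0.0 for j in range(d)]  # sparse index vector
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    y = [
        math.tanh(sum(xj * zj for xj, zj in zip(x, z0))) + noise * rng.gauss(0.0, 1.0)
        for x in X
    ]
    return X, y, z0


X, y, z0 = simulate()
beta_hat = lasso_coordinate_descent(X, y, lam=0.1)  # the link function is never used here
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("true support:     ", [j for j, z in enumerate(z0) if z != 0.0])
print("selected features:", selected)
```

Comparing the two printed index lists shows whether the linear estimator picked up the variables that drive the non-linear observations. How well they agree depends on `lam`, the noise level, and the number of samples m, so the values above should be read as illustrative defaults rather than recommended settings.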


