Improving Sample and Feature Selection with Principal Covariates Regression

12/22/2020
by Rose K. Cersonsky, et al.

Selecting the most relevant features and samples out of a large pool of candidates is a task that arises frequently in automated data analysis, where it can improve the computational performance, and often also the transferability, of a model. Here we focus on two popular sub-selection schemes that have been applied to this end: CUR decomposition, which is based on a low-rank approximation of the feature matrix, and Farthest Point Sampling, which relies on the iterative identification of the most diverse samples and most discriminating features. We modify these unsupervised approaches, incorporating a supervised component in the same spirit as the Principal Covariates Regression (PCovR) method. We show that incorporating target information yields selections that perform better in supervised tasks, which we demonstrate with ridge regression, kernel ridge regression, and sparse kernel regression. We also show that incorporating aspects of simple supervised learning models can improve the accuracy of more complex models, such as feed-forward neural networks. We present adjustments that minimize the impact any sub-selection may incur on unsupervised tasks. We demonstrate the significant improvements associated with the use of PCov-CUR and PCov-FPS selections in applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples required to achieve a given level of regression accuracy.
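To make the idea concrete, below is a minimal sketch of a PCov-style Farthest Point Sampling for sample selection, assuming only NumPy. It builds an augmented kernel K~ = alpha * X X^T + (1 - alpha) * Yhat Yhat^T, where Yhat is a ridge-regression approximation of the targets, and then runs standard FPS on the corresponding squared distances. The function name pcov_fps and the parameters alpha and ridge are illustrative choices for this sketch, not the authors' API; reference implementations of the full PCov-CUR and PCov-FPS methods are provided by the authors in the scikit-matter package.

import numpy as np

def pcov_fps(X, Y, n_select, alpha=0.5, ridge=1e-8):
    """Select n_select sample indices by farthest point sampling in a
    PCovR-augmented space. alpha = 1 recovers plain unsupervised FPS;
    alpha = 0 selects on the (approximated) targets alone."""
    X = X - X.mean(axis=0)                      # center the features
    Y = Y.reshape(len(X), -1)
    Y = Y - Y.mean(axis=0)                      # center the targets

    # Ridge approximation Yhat of Y, so the target term lies in the span of X
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    Yhat = X @ W

    # Augmented kernel mixing structural and property information
    K = alpha * (X @ X.T) + (1.0 - alpha) * (Yhat @ Yhat.T)

    # Standard FPS on the kernel: d^2(i, j) = K_ii + K_jj - 2 K_ij
    diag = np.diag(K)
    selected = [int(np.argmax(diag))]           # seed with the largest-norm sample
    d_min = diag + diag[selected[0]] - 2.0 * K[selected[0]]
    for _ in range(n_select - 1):
        nxt = int(np.argmax(d_min))             # farthest from the selected set
        selected.append(nxt)
        d_min = np.minimum(d_min, diag + diag[nxt] - 2.0 * K[nxt])
    return np.asarray(selected)

# Toy usage: pick the 10 samples that are both diverse and target-relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
Y = X[:, :2] @ rng.normal(size=2) + 0.1 * rng.normal(size=200)
print(pcov_fps(X, Y, n_select=10, alpha=0.5))

Because alpha = 1 recovers the purely unsupervised selection, the supervised variant can be compared against the baseline by sweeping a single mixing parameter.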
