Variable selection with genetic algorithms using repeated cross-validation of PLS regression models as fitness measure

11/17/2017
by David Kepplinger, et al.

Genetic algorithms are a widely used method in chemometrics for extracting variable subsets with high prediction power. Most fitness measures used by these genetic algorithms are based on the ordinary least-squares fit of the resulting model to the entire data or a subset thereof. Because of multicollinearity, partial least squares regression is often more appropriate, but it is rarely considered in genetic algorithms because of the additional cost of estimating the optimal number of components. We introduce two novel fitness measures for genetic algorithms, explicitly designed to estimate the internal prediction performance of partial least squares regression models built from the variable subsets. Both measures estimate the optimal number of components using cross-validation and subsequently estimate the prediction performance by predicting the response of observations not included in model-fitting. This is repeated multiple times to estimate the variation of each measure across different random splits. Moreover, one of the measures is optimized for speed and for more accurate estimation of the prediction performance for observations not included during variable selection. This leads to variable subsets with high internal and external prediction power. Results on high-dimensional chemical-analytical data show that the variable subsets acquired by this approach have competitive internal prediction power and superior external prediction power compared to variable subsets extracted with other fitness measures.
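
The general idea of such a fitness measure can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fit_pls1`, `predict_pls1`, `fitness`), the default parameters, the single-response NIPALS-style PLS fit, and the use of the negative mean test RMSE as the fitness value are all assumptions made for illustration. For each random split, the number of components is chosen by cross-validation on the fitting part, and the prediction error is then measured on observations left out of model fitting.

```python
import random
from math import sqrt


def fit_pls1(X, y, n_comp):
    """Fit a single-response PLS model by successive deflation (illustrative)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered predictors
    f = [v - y_mean for v in y]                                # centered response
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # no covariance left between predictors and response
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate predictors and response before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps


def predict_pls1(model, x, n_comp):
    """Predict one observation using the first n_comp components."""
    x_mean, y_mean, comps = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps[:n_comp]:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat


def sq_err(model, X, y, rows, n_comp):
    return sum((y[i] - predict_pls1(model, X[i], n_comp)) ** 2 for i in rows)


def fitness(X, y, subset, max_comp=3, folds=3, reps=5, test_frac=0.25, seed=0):
    """Negative mean test RMSE over repeated random splits (higher is better)."""
    rng = random.Random(seed)
    Xs = [[row[j] for j in subset] for row in X]
    n, max_comp = len(Xs), min(max_comp, len(subset))
    errors = []
    for _ in range(reps):
        idx = list(range(n))
        rng.shuffle(idx)
        n_test = max(1, int(n * test_frac))
        test, train = idx[:n_test], idx[n_test:]
        # Inner cross-validation on the training part picks the component count.
        cv = [0.0] * max_comp
        for k in range(folds):
            val = train[k::folds]
            rest = [i for i in train if i not in val]
            m = fit_pls1([Xs[i] for i in rest], [y[i] for i in rest], max_comp)
            for a in range(max_comp):
                cv[a] += sq_err(m, Xs, y, val, a + 1)
        best = cv.index(min(cv)) + 1
        # Refit on the full training part and score on the held-out observations.
        m = fit_pls1([Xs[i] for i in train], [y[i] for i in train], best)
        errors.append(sqrt(sq_err(m, Xs, y, test, best) / len(test)))
    return -sum(errors) / reps


# Synthetic data (an assumption for the demo): only variables 0 and 1 matter.
gen = random.Random(1)
X = [[gen.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [2 * r[0] - r[1] + gen.gauss(0, 0.1) for r in X]
informative = fitness(X, y, [0, 1])
noise_only = fitness(X, y, [2, 3])
```

In a genetic algorithm, each chromosome would encode one candidate `subset`, and a function like `fitness` would rank the chromosomes. On the synthetic data above, the subset containing the two informative variables should score higher than a subset of pure noise variables.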


