An entropic feature selection method in perspective of Turing formula

02/19/2019
by   Jingyi Shi, et al.

Health data are generally complex in type and small in sample size. These domain-specific challenges make it difficult to capture information reliably and compound the problem of generalization. To support the analysis of healthcare datasets, we develop a feature selection method based on the concept of Coverage Adjusted Standardized Mutual Information (CASMI). The main advantages of the proposed method are: 1) it selects features more efficiently with the help of an improved entropy estimator, particularly when the sample size is small, and 2) it automatically learns the number of features to select from the information in the sample data. Additionally, the proposed method handles feature redundancy from the perspective of the joint distribution. The proposed method focuses on non-ordinal data, but it also works with numerical data given an appropriate binning method. A simulation study comparing the proposed method to six widely cited feature selection methods shows that the proposed method performs better as measured by the Information Recovery Ratio, particularly when the sample size is small.
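To make the idea of a coverage adjustment concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the score is a standardized mutual information (mutual information divided by the outcome entropy) multiplied by the sample coverage estimated with Turing's formula, 1 - N1/n, where N1 is the number of feature values observed exactly once. It also substitutes the simple plug-in entropy estimator for the improved estimator mentioned in the abstract. The function names (`plugin_entropy`, `turing_coverage`, `coverage_adjusted_smi`) and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def plugin_entropy(values):
    """Plug-in (maximum likelihood) Shannon entropy in nats.

    Assumption: used here only for simplicity; the abstract refers to an
    improved entropy estimator, which this sketch does not reproduce.
    """
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def turing_coverage(values):
    """Turing's formula: estimated sample coverage 1 - N1/n,
    where N1 counts the categories seen exactly once."""
    n = len(values)
    singletons = sum(1 for c in Counter(values).values() if c == 1)
    return 1.0 - singletons / n


def coverage_adjusted_smi(feature, outcome):
    """Illustrative score: standardized mutual information times the
    estimated coverage of the feature (an assumed form, see lead-in)."""
    h_x = plugin_entropy(feature)
    h_y = plugin_entropy(outcome)
    h_xy = plugin_entropy(list(zip(feature, outcome)))
    if h_y == 0:
        return 0.0  # a constant outcome carries no information to recover
    mutual_information = h_x + h_y - h_xy
    return (mutual_information / h_y) * turing_coverage(feature)


# Toy data (hypothetical): a binary outcome, a feature that determines it,
# and an ID-like feature in which every value is unique.
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
informative = ["a", "a", "a", "a", "b", "b", "b", "b"]
id_like = list(range(8))

score_informative = coverage_adjusted_smi(informative, outcome)
score_id_like = coverage_adjusted_smi(id_like, outcome)
```

In this toy example both features attain the maximum plug-in standardized mutual information, but the ID-like feature consists entirely of singletons, so its estimated coverage, and therefore its adjusted score, is zero. This is the kind of small-sample overfitting a coverage adjustment is meant to penalize.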

research
08/08/2017

An Effective Feature Selection Method Based on Pair-Wise Feature Proximity for High Dimensional Low Sample Size Data

Feature selection has been studied widely in the literature. However, th...
research
08/27/2020

Feature Selection from High-Dimensional Data with Very Low Sample Size: A Cautionary Tale

In classification problems, the purpose of feature selection is to ident...
research
09/12/2022

Bilevel Optimization for Feature Selection in the Data-Driven Newsvendor Problem

We study the feature-based newsvendor problem, in which a decision-maker...
research
01/21/2021

Orthogonal Least Squares Based Fast Feature Selection for Linear Classification

An Orthogonal Least Squares (OLS) based feature selection method is prop...
research
03/24/2021

Statistical Integration of Heterogeneous Data with PO2PLS

The availability of multi-omics data has revolutionized the life science...
research
09/25/2017

Understanding a Version of Multivariate Symmetric Uncertainty to assist in Feature Selection

In this paper, we analyze the behavior of the multivariate symmetric unc...
research
07/31/2017

Consistent Nonparametric Different-Feature Selection via the Sparsest k-Subgraph Problem

Two-sample feature selection is the problem of finding features that des...
