The Interaction of Entropy-Based Discretization and Sample Size: An Empirical Study

01/06/2012
by Casey Bennett, et al.

An empirical investigation of the interaction between sample size and discretization, in this case the entropy-based method CAIM (Class-Attribute Interdependence Maximization), was undertaken to evaluate the impact and potential bias introduced into data mining performance metrics when variation in sample size affects the discretization process. Of particular interest was the effect of discretizing within cross-validation folds as opposed to discretizing outside of them. Previous publications have suggested that discretizing externally can bias performance results; however, a thorough review of the literature found no empirical evidence to support such an assertion. The investigation involved constructing over 117,000 models on seven distinct datasets from the UCI (University of California-Irvine) Machine Learning Repository, using multiple modeling methods across a variety of configurations of sample size and discretization, with each unique "setup" independently replicated ten times. The analysis revealed a significant optimistic bias as sample sizes decreased and discretization was employed. The study also revealed a possible relationship between the interaction that produces such bias and the number and types of predictor attributes, extending the "curse of dimensionality" concept from feature selection into the discretization realm. Directions for further exploration are laid out, as well as some general guidelines for the proper application of discretization in light of these results.
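The internal-versus-external distinction the abstract describes can be sketched in a few lines. CAIM itself is not available in scikit-learn, so the sketch below substitutes the unsupervised `KBinsDiscretizer` purely to illustrate the data flow: "external" discretization learns cut points from the full dataset before cross-validation (letting test folds leak into the bins), while "internal" discretization refits the discretizer on each training fold via a pipeline. The classifier choice and bin count are illustrative assumptions, not the study's setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# "External" discretization: cut points are learned from ALL rows,
# including those that will later serve as test folds (information leak).
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_ext = disc.fit_transform(X)
ext_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_ext, y, cv=cv)

# "Internal" discretization: the pipeline refits the discretizer on each
# training fold only, so test folds never influence the cut points.
pipe = make_pipeline(
    KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile"),
    DecisionTreeClassifier(random_state=0),
)
int_scores = cross_val_score(pipe, X, y, cv=cv)

print(f"external CV accuracy: {ext_scores.mean():.3f}")
print(f"internal CV accuracy: {int_scores.mean():.3f}")
```

On a large, well-behaved dataset the two estimates may barely differ; the study's point is that the gap (an optimistic bias in the external protocol) grows as sample size shrinks.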


