Toward Optimal Feature Selection in Naive Bayes for Text Categorization

02/09/2016
by Bo Tang, et al.

Automated feature selection is important for text categorization: it reduces the size of the feature set and speeds up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank features by their discriminative capacity for classification. We first revisit two information measures, the Kullback-Leibler divergence and the Jeffreys divergence, for binary hypothesis testing, and analyze their asymptotic properties relating to the type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called the Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH divergence, we develop two efficient feature selection methods, termed maximum discrimination (MD) and MD-χ^2, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
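
The abstract names the divergence measures but not their closed forms. As a rough illustration only, the following is a minimal Python sketch of divergence-based feature ranking under a Bernoulli naive Bayes event model: per-feature Jeffreys divergences between class-conditional term distributions are summed over class pairs and used as ranking scores. The pairwise sum is a hypothetical stand-in for the JMH divergence (whose definition the paper gives but this page does not), and bernoulli_kl, jeffreys, and rank_features are illustrative names, not the authors' code.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence KL(Ber(p) || Ber(q)) between two Bernoulli distributions."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def jeffreys(p, q):
    """Symmetric Jeffreys divergence: J(p, q) = KL(p||q) + KL(q||p)."""
    return bernoulli_kl(p, q) + bernoulli_kl(q, p)

def rank_features(X, y):
    """Rank binary term features by summed pairwise Jeffreys divergence
    between class-conditional term distributions (a stand-in for JMH).

    X: (n_docs, n_terms) binary term-occurrence matrix.
    y: (n_docs,) class labels.
    Returns term indices sorted from most to least discriminative.
    """
    classes = np.unique(y)
    # P(term present | class) with Laplace smoothing, shape (n_classes, n_terms)
    cond = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                     for c in classes])
    scores = np.zeros(X.shape[1])
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            scores += jeffreys(cond[i], cond[j])
    return np.argsort(scores)[::-1]

# Toy usage: three terms, two classes; terms 0 and 1 separate the classes,
# term 2 occurs equally often in both and should rank last.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([0, 0, 1, 1])
print(rank_features(X, y))  # term 2 (uninformative) ranks last
```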

Related research

FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization (06/20/2016)
In this paper, we present a new wrapper feature selection approach based...

A Novel Feature Selection and Extraction Technique for Classification (12/26/2014)
This paper presents a versatile technique for the purpose of feature sel...

Empirically Estimable Classification Bounds Based on a New Divergence Measure (12/19/2014)
Information divergence functions play a critical role in statistics and ...

EEF: Exponentially Embedded Families with Class-Specific Features for Classification (05/11/2016)
In this letter, we present a novel exponentially embedded families (EEF)...

MOROCO: The Moldavian and Romanian Dialectal Corpus (01/19/2019)
In this work, we introduce the MOldavian and ROmanian Dialectal COrpus (...

A theoretical framework for evaluating forward feature selection methods based on mutual information (01/26/2017)
Feature selection problems arise in a variety of applications, such as m...

Unsupervised learning with GLRM feature selection reveals novel traumatic brain injury phenotypes (11/30/2018)
Baseline injury categorization is important to traumatic brain injury (T...
