Feature Selection Based on Term Frequency and T-Test for Text Categorization

05/03/2013
by Deqing Wang, et al.

Much work has been done on feature selection. Existing methods, such as the chi-square statistic and information gain, are based on document frequency. However, these methods have two shortcomings: they are not reliable for low-frequency terms, and they only count whether a term occurs in a document, ignoring its term frequency. In fact, terms that occur with high frequency within a specific category are often regarded as discriminators. This paper focuses on how to construct a feature selection function based on term frequency, and proposes a new approach based on the t-test, which is used to measure the diversity of the distributions of a term between a specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature selection methods (i.e., χ^2 and IG) in terms of macro-F_1 and micro-F_1.
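
To make the idea concrete, here is a minimal sketch of a t-test-style score computed from term frequencies. The abstract does not give the exact statistic, so the details below are assumptions for illustration only: the score compares a term's mean frequency in one category with its mean frequency in the whole corpus, and normalizes by a pooled within-category standard deviation times sqrt(1/N_k - 1/N). The function name `t_scores` and the toy data are hypothetical.

```python
import math

def t_scores(tf, labels, category):
    """Score each term for one category (illustrative sketch, not the paper's exact formula).

    tf       -- list of documents, each a list of raw term frequencies
    labels   -- category label of each document
    category -- the category to score terms against
    Assumes more documents than categories and at least one document
    outside `category`.
    """
    n = len(tf)
    classes = sorted(set(labels))
    in_cat = [row for row, y in zip(tf, labels) if y == category]
    n_k = len(in_cat)
    # Assumed scale factor for comparing a category mean with the corpus mean.
    scale = math.sqrt(1.0 / n_k - 1.0 / n)

    scores = []
    for j in range(len(tf[0])):
        corpus_mean = sum(row[j] for row in tf) / n
        cat_mean = sum(row[j] for row in in_cat) / n_k

        # Pooled within-category sum of squared deviations for term j.
        ss = 0.0
        for c in classes:
            vals = [row[j] for row, y in zip(tf, labels) if y == c]
            m = sum(vals) / len(vals)
            ss += sum((v - m) ** 2 for v in vals)
        pooled_sd = math.sqrt(ss / (n - len(classes)))

        denom = scale * pooled_sd
        # A term with zero variance carries no usable signal in this sketch.
        scores.append(abs(cat_mean - corpus_mean) / denom if denom > 0 else 0.0)
    return scores

# Toy corpus: 4 documents, 3 terms, 2 categories.
tf = [[5, 1, 0], [4, 1, 0], [0, 1, 2], [1, 1, 3]]
labels = ["sports", "sports", "politics", "politics"]
scores = t_scores(tf, labels, "sports")
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranked)
```

In this toy corpus the second term occurs exactly once in every document. A document-frequency view sees it everywhere, and the frequency-based score assigns it zero because its distribution does not differ across categories. Terms would then be selected by keeping the highest-scoring ones per category.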

research
01/27/2017

A Comparative Study on Different Types of Approaches to Bengali document Categorization

Document categorization is a technique where the category of a document ...
research
08/30/2019

Charge-Based Prison Term Prediction with Deep Gating Network

Judgment prediction for legal cases has attracted much research efforts ...
research
12/13/2010

Inverse-Category-Frequency based supervised term weighting scheme for text categorization

Term weighting schemes often dominate the performance of many classifier...
research
06/20/2016

FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization

In this paper, we present a new wrapper feature selection approach based...
research
11/05/2021

Feature Selective Likelihood Ratio Estimator for Low- and Zero-frequency N-grams

In natural language processing (NLP), the likelihood ratios (LRs) of N-g...
research
03/02/2017

Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ

Scattertext is an open source tool for visualizing linguistic variation ...
research
05/10/2016

Different approaches for identifying important concepts in probabilistic biomedical text summarization

Automatic text summarization tools help users in biomedical domain to ac...