DCDistance: A Supervised Text Document Feature extraction based on class labels

Text Mining is a field that aims at extracting information from textual data. One of the challenges in this field comes from the pre-processing stage, in which a structured vector representation must be extracted from unstructured data. The common extraction creates large and sparse vectors representing the importance of each term to a document. This usually leads to the curse of dimensionality that plagues most machine learning algorithms. To cope with this issue, in this paper we propose a new supervised feature extraction and reduction algorithm, named DCDistance, that creates features based on the distance from a document to a representative of each class label. The proposed technique can reduce the feature set by more than 99%, and it is also capable of improving the classification accuracy over a set of benchmark datasets when compared to traditional and state-of-the-art feature selection algorithms.
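The core idea, one feature per class label given by the distance from a document to that class's representative, can be illustrated with a minimal sketch. The abstract does not say how the representative is chosen or which distance is used, so the sketch assumes the class centroid (the mean of the training vectors of that class) and the Euclidean distance. The function names `class_representatives` and `distance_features` and the toy matrix are hypothetical and are not taken from the paper.

```python
import numpy as np


def class_representatives(X, y):
    """Return the sorted class labels and one representative vector per class.

    Assumption: the representative is the centroid (mean) of the training
    vectors of the class; the abstract does not specify this choice.
    """
    labels = np.unique(y)
    reps = np.vstack([X[y == label].mean(axis=0) for label in labels])
    return labels, reps


def distance_features(X, reps):
    """Map each document to its distance from every class representative.

    Assumption: Euclidean distance. The result has shape
    (n_documents, n_classes), however many terms X has.
    """
    diffs = X[:, None, :] - reps[None, :, :]
    return np.linalg.norm(diffs, axis=2)


# Toy term-weight matrix: 4 documents, 6 terms, 2 classes.
X_train = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
])
y_train = np.array([0, 0, 1, 1])

labels, reps = class_representatives(X_train, y_train)
features = distance_features(X_train, reps)
print(features.shape)  # (4, 2)
```

Under these assumptions the number of features equals the number of class labels, so a vocabulary of tens of thousands of terms collapses to a handful of dense columns, which is consistent with the size of the reduction the abstract reports.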


