Cluster Based Symbolic Representation for Skewed Text Categorization

06/24/2017
by Lavanya Narayana Raju, et al.

This work addresses the problem of imbalanced text corpora by presenting a method that converts an imbalanced corpus into a balanced one using a clustering algorithm. First, to avoid the curse of dimensionality, a representation scheme based on a term-class relevancy measure is adopted, which reduces the dimension to the number of classes in the corpus. Next, the samples of the larger classes are grouped into a number of smaller subclasses so that the corpus as a whole becomes balanced. Each subclass is then given a single symbolic vector representation using interval-valued features. Besides being compact, this symbolic representation reduces both the space requirement and the classification time. The proposed model is evaluated empirically on the benchmark datasets Reuters 21578 and TDT2 and compared against several existing contemporary models, including a model based on support vector machines. The comparative analysis indicates that the proposed model outperforms the other existing models.
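The pipeline summarized in the abstract can be illustrated with a minimal sketch. Several details here are assumptions, since the abstract does not specify them: documents are taken to be already reduced to one feature per class, a bare-bones k-means stands in for the unnamed clustering algorithm, each subclass interval is taken as mean ± standard deviation, and a test document is assigned to the class whose interval vector contains the most of its feature values. The function names (`kmeans`, `interval_vector`, `balance_and_summarize`, `classify`) and the `target_size` parameter are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means; returns a cluster label for every row of X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def interval_vector(samples):
    """Assumed interval form: [mean - std, mean + std] for each feature."""
    mu = samples.mean(axis=0)
    sigma = samples.std(axis=0)
    return np.stack([mu - sigma, mu + sigma])  # shape (2, n_features)


def balance_and_summarize(X, y, target_size):
    """Split each class into subclasses of roughly target_size samples and
    replace every subclass with a single interval-valued vector."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    summaries = []  # list of (class_label, interval_vector)
    for c in np.unique(y):
        Xc = X[y == c]
        k = max(1, int(np.ceil(len(Xc) / target_size)))
        labels = kmeans(Xc, k) if k > 1 else np.zeros(len(Xc), dtype=int)
        for j in range(k):
            members = Xc[labels == j]
            if len(members) > 0:
                summaries.append((c, interval_vector(members)))
    return summaries


def classify(x, summaries):
    """Assign x to the class of the interval vector that contains the
    largest number of its feature values (an assumed matching rule)."""
    x = np.asarray(x, dtype=float)
    scores = [int(((lo <= x) & (x <= hi)).sum()) for _, (lo, hi) in summaries]
    return summaries[int(np.argmax(scores))][0]
```

With a class of 40 samples, a class of 10 samples, and `target_size=10`, the larger class is split into up to four subclasses while the smaller one stays whole, so the classifier stores at most five interval vectors instead of 50 sample vectors.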


research
12/28/2016

Symbolic Representation and Classification of Logos

In this paper, a model for classification of logos based on symbolic rep...
research
05/31/2017

Class Specific Feature Selection for Interval Valued Data Through Interval K-Means Clustering

In this paper, a novel feature selection approach for supervised interva...
research
07/09/2020

Behavioral analysis of support vector machine classifier with Gaussian kernel and imbalanced data

The parameters of support vector machines (SVMs) such as the penalty par...
research
11/25/2019

A Self-Adaptive Synthetic Over-Sampling Technique for Imbalanced Classification

Traditionally, in supervised machine learning, (a significant) part of t...
research
10/16/2016

Term-Class-Max-Support (TCMS): A Simple Text Document Categorization Approach Using Term-Class Relevance Measure

In this paper, a simple text categorization method using term-class rele...
research
10/03/2016

Nonsymbolic Text Representation

We introduce the first generic text representation model that is complet...
