Text Classification based on Word Subspace with Term-Frequency

06/08/2018
by Erica K. Shimomoto, et al.

Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on bag-of-words (BOW) features. Although BOW features are simple to implement, they do not capture semantic meaning. To solve this problem, neural networks have been employed to learn word vectors, as in word2vec. Word2vec embeds the semantic structure of words into vectors, where the angle between vectors indicates the semantic similarity between words. To measure the similarity between texts, we propose the novel concept of word subspace, which can represent the intrinsic variability of features in a set of word vectors. Through this concept, it is possible to model text from word vectors while retaining semantic information. To incorporate word frequency directly in the subspace model, we further extend the word subspace to the term-frequency (TF) weighted word subspace. Based on these new concepts, text classification can be performed under the mutual subspace method (MSM) framework. The validity of our modeling is shown through experiments on the Reuters text database, comparing the results to various state-of-the-art algorithms.
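
The abstract does not spell out how a word subspace is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the subspace is spanned by the leading left singular vectors of the (uncentered) matrix of word vectors, that TF weighting scales each word vector by the square root of its count, and that MSM similarity is the mean squared cosine of the canonical angles between two subspaces. The names `word_subspace`, `subspace_similarity` and `classify` are hypothetical.

```python
import numpy as np

def word_subspace(vectors, counts=None, dim=2):
    """Orthonormal basis (d x dim) for the main directions of a set of word vectors.

    vectors: (n_words, d) array, one row per distinct word in the text.
    counts:  optional term frequencies, one per word (assumed weighting:
             each word vector is scaled by sqrt(count)).
    """
    X = np.asarray(vectors, dtype=float).T           # d x n_words
    if counts is not None:
        w = np.sqrt(np.asarray(counts, dtype=float))
        X = X * w                                    # scale column j by sqrt(count_j)
    # Left singular vectors of X are the eigenvectors of X @ X.T,
    # ordered by decreasing singular value.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim]

def subspace_similarity(U1, U2):
    """Mean squared cosine of the canonical angles between two subspaces."""
    # Singular values of U1^T U2 are the cosines of the canonical angles.
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)             # guard against round-off
    return float(np.mean(cosines ** 2))

def classify(text_basis, class_bases):
    """Return the label whose class subspace is most similar to the text subspace."""
    return max(class_bases,
               key=lambda label: subspace_similarity(text_basis, class_bases[label]))
```

In use, one subspace would be built per class from the word vectors of its training texts, and a new text would be assigned to the class returned by `classify`. The subspace dimension `dim` is a free parameter in this sketch and must not exceed the number of words or the vector dimension.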


