Word Vector Enrichment of Low Frequency Words in the Bag-of-Words Model for Short Text Multi-class Classification Problems

09/18/2017
by   Bradford Heap, et al.
0

The bag-of-words model is a standard representation of text for many linear classifier learners. In many problem domains, linear classifiers are preferred over more complex models due to their efficiency, robustness and interpretability, and the bag-of-words text representation can capture sufficient information for linear classifiers to make highly accurate predictions. However in settings where there is a large vocabulary, large variance in the frequency of terms in the training corpus, many classes and very short text (e.g., single sentences or document titles) the bag-of-words representation becomes extremely sparse, and this can reduce the accuracy of classifiers. A particular issue in such settings is that short texts tend to contain infrequently occurring or rare terms which lack class-conditional evidence. In this work we introduce a method for enriching the bag-of-words model by complementing such rare term information with related terms from both general and domain-specific Word Vector models. By reducing sparseness in the bag-of-words models, our enrichment approach achieves improved classification over several baseline classifiers in a variety of text classification problems. Our approach is also efficient because it requires no change to the linear classifier before or during training, since bag-of-words enrichment applies only to text being classified.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/06/2016

Bag of Tricks for Efficient Text Classification

This paper explores a simple and efficient baseline for text classificat...
research
01/28/2013

An alternative text representation to TF-IDF and Bag-of-Words

In text mining, information retrieval, and machine learning, text docume...
research
02/17/2017

Analysis and Optimization of fastText Linear Text Classifier

The paper [1] shows that simple linear classifier can compete with compl...
research
10/03/2016

Nonsymbolic Text Representation

We introduce the first generic text representation model that is complet...
research
12/11/2012

Bag-of-Words Representation for Biomedical Time Series Classification

Automatic analysis of biomedical time series such as electroencephalogra...
research
10/26/2018

Learning and Interpreting Multi-Multi-Instance Learning Networks

We introduce an extension of the multi-instance learning problem where e...
research
10/05/2022

Using Full-Text Content to Characterize and Identify Best Seller Books

Artistic pieces can be studied from several perspectives, one example be...

Please sign up or login with your details

Forgot password? Click here to reset