An alternative text representation to TF-IDF and Bag-of-Words

01/28/2013
by Zhixiang, et al.

In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag-of-Words (sBoW) vectors (e.g., TF-IDF). Although simple and intuitive, sBoW-style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g., in document classification) or the text documents are short (e.g., emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random subsets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high-dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved for in closed form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve classification accuracy across several document classification tasks.
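The idea of marginalizing out word removal and solving the reconstruction in closed form can be illustrated with a minimal sketch. It assumes a linear reconstruction under squared loss, with each word removed independently with probability `p`. The function name `marginalized_reconstruction`, the small ridge term, the omission of a bias term, and the final `tanh` squashing are illustrative assumptions here, not necessarily the paper's exact formulation.

```python
import numpy as np

def marginalized_reconstruction(X, target_idx, p=0.5, ridge=1e-5):
    """Closed-form linear map that reconstructs target words from inputs
    whose words were each removed independently with probability p.

    X          : (d, n) term-by-document matrix (e.g. TF-IDF), d words, n docs.
    target_idx : indices of the r words to reconstruct (e.g. frequent words).
    Returns W of shape (r, d) minimizing the expected squared error
    E || X[target_idx] - W @ X_corrupted ||^2 (plus a small ridge penalty).
    """
    d = X.shape[0]
    q = 1.0 - p                      # probability that a word survives
    S = X @ X.T                      # (d, d) scatter matrix of the clean data

    # E[x_corrupted x_corrupted^T]: off-diagonal entries are scaled by q^2,
    # diagonal entries by q (a Bernoulli mask squared equals itself).
    Q = S * (q * q)
    np.fill_diagonal(Q, q * np.diag(S))

    # E[x_target x_corrupted^T]: targets stay clean, inputs survive w.p. q.
    P = q * (X[target_idx] @ X.T)

    # Solve W (Q + ridge I) = P; the left-hand matrix is symmetric.
    return np.linalg.solve(Q + ridge * np.eye(d), P.T).T

# Toy usage: 6 words, 40 documents, reconstruct the 3 most frequent words.
rng = np.random.default_rng(0)
X_toy = rng.random((6, 40)) * (rng.random((6, 40)) < 0.4)   # sparse-ish counts
frequent = np.argsort(-(X_toy > 0).sum(axis=1))[:3]
W_toy = marginalized_reconstruction(X_toy, frequent, p=0.5)
dense_features = np.tanh(W_toy @ X_toy)    # (3, 40) dense representation
```

No corrupted copies of the corpus are ever sampled: the expectation over removals enters only through the rescaled matrices `Q` and `P`, which is what makes a closed-form solution possible in this sketch.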


research
07/05/2017

The Influence of Feature Representation of Text on the Performance of Document Classification

In this paper we perform a comparative analysis of three models for feat...
research
05/16/2014

Distributed Representations of Sentences and Documents

Many machine learning algorithms require the input to be represented as ...
research
11/28/2019

Metre as a stylometric feature in Latin hexameter poetry

This paper demonstrates that metre is a privileged indicator of authoria...
research
09/18/2017

Word Vector Enrichment of Low Frequency Words in the Bag-of-Words Model for Short Text Multi-class Classification Problems

The bag-of-words model is a standard representation of text for many lin...
research
08/30/2020

SOLAR: Sparse Orthogonal Learned and Random Embeddings

Dense embedding models are commonly deployed in commercial search engine...
research
11/26/2022

Searching for Discriminative Words in Multidimensional Continuous Feature Space

Word feature vectors have been proven to improve many NLP tasks. With re...
research
12/23/2016

"What is Relevant in a Text Document?": An Interpretable Machine Learning Approach

Text documents can be described by a number of abstract concepts such as...
