From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings

07/25/2017
by   Andrei M. Butnaru, et al.
0

In this paper, we propose a novel approach for text classification based on clustering word embeddings, inspired by the bag of visual words model, which is widely used in computer vision. After each word in a collection of documents is represented as word vector using a pre-trained word embeddings model, a k-means algorithm is applied on the word vectors in order to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a super word embedding that embodies all the semantically related word vectors in a certain region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid. In the end, each document is represented as a bag of super word embeddings by computing the frequency of each super word embedding in the respective document. We also diverge from the idea of building a single vocabulary for the entire collection of documents, and propose to build class-specific vocabularies for better performance. Using this kind of representation, we report results on two text mining tasks, namely text categorization by topic and polarity classification. On both tasks, our model yields better performance than the standard bag of words.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/23/2019

Vector of Locally-Aggregated Word Embeddings (VLAWE): A novel document-level embedding

In this paper, we propose a novel representation for text documents base...
research
10/30/2018

Word Mover's Embedding: From Word2Vec to Document Embedding

While the celebrated Word2Vec technique yields semantically rich represe...
research
08/07/2019

A Simple and Effective Approach for Fine Tuning Pre-trained Word Embeddings for Improved Text Classification

This work presents a new and simple approach for fine-tuning pretrained ...
research
11/29/2018

Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

Current state-of-the-art nonparametric Bayesian text clustering methods ...
research
04/05/2018

Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop

Most of the literature around text classification treats it as a supervi...
research
01/06/2020

Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

Keyword extraction has received an increasing attention as an important ...
research
02/23/2019

Vector of Locally-Aggregated Word Embeddings (VLAWE): A Novel Document-level Representation

In this paper, we propose a novel representation for text documents base...

Please sign up or login with your details

Forgot password? Click here to reset