SCDV : Sparse Composite Document Vectors using soft clustering over distributional representations

12/20/2016
by   Dheeraj Mekala, et al.

We present Sparse Composite Document Vector (SCDV), a feature vector formation technique for documents that overcomes several shortcomings of the current distributional paragraph vector representations widely used for text representation. In SCDV, word embeddings are clustered to capture the multiple semantic contexts in which words occur. They are then chained together to form document topic-vectors that can express complex, multi-topic documents. Through extensive experiments on multi-class and multi-label classification tasks, we outperform the previous state-of-the-art method, NTSG (Liu et al., 2015a). We also show that SCDV embeddings perform well on heterogeneous tasks such as topic coherence, context-sensitive learning, and information retrieval. Moreover, we achieve a significant reduction in training and prediction times compared to other representation methods. SCDV achieves the best of both worlds: better performance with lower time and space complexity.
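
The pipeline summarized in the abstract (soft-cluster the word embeddings, chain the cluster-weighted copies of each word vector together, then combine them into a document vector) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the soft assignment uses a softmax over negative squared distances to fixed centroids as a stand-in for a fitted soft-clustering model, per-word weighting is omitted, and the function names and the `temperature` and `threshold` parameters are hypothetical.

```python
import numpy as np

def soft_assignments(word_vecs, centroids, temperature=1.0):
    """Soft cluster memberships, shape (V, K); each row sums to 1.

    Assumption: softmax over negative squared distances stands in for the
    posterior probabilities of a fitted soft-clustering model.
    """
    d2 = ((word_vecs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def word_topic_vectors(word_vecs, probs):
    """Chain the K cluster-weighted copies of each word vector: (V, K*D)."""
    V, D = word_vecs.shape
    K = probs.shape[1]
    weighted = probs[:, :, None] * word_vecs[:, None, :]  # (V, K, D)
    return weighted.reshape(V, K * D)

def document_vector(word_ids, topic_vecs, threshold=0.0):
    """Sum the word-topic vectors of a document, then zero small entries.

    Assumption: a simple absolute-value threshold produces the sparsity.
    """
    doc = topic_vecs[word_ids].sum(axis=0)
    doc[np.abs(doc) < threshold] = 0.0
    return doc

# Toy data: 6 words with 4-dimensional embeddings and 2 clusters.
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(6, 4))
centroids = word_vecs[[0, 3]]           # hypothetical centroids
probs = soft_assignments(word_vecs, centroids)
topic_vecs = word_topic_vectors(word_vecs, probs)
doc_vec = document_vector(np.array([0, 2, 2, 5]), topic_vecs, threshold=0.05)
```

With K clusters and D-dimensional embeddings, each document vector has K*D entries, so a word that belongs to several clusters contributes to several blocks of the vector rather than being averaged into one.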

