Efficient Vector Representation for Documents through Corruption

07/08/2017
by Minmin Chen et al.

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings, and the training objective ensures that a representation built this way captures the semantics of the document. The framework includes a corruption model that randomly removes words from the document during learning; this introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common, non-discriminative words toward zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms: despite its simple architecture, Doc2VecC matches or outperforms the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine, and the model remains very efficient when generating representations of unseen documents at test time.
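The two mechanisms the abstract describes, averaging word embeddings and corrupting the input document, can be illustrated with a short sketch. The snippet below is not the authors' released implementation: the function name, the use of NumPy, and the unbiased rescaling of surviving words by 1/(1 - q) are assumptions made for illustration. During training each word is dropped independently with probability q; at test time (q = 0) the representation reduces to the plain average of word embeddings.

```python
import numpy as np

def doc2vecc_representation(tokens, embeddings, vocab, corrupt_rate=0.0, rng=None):
    """Doc2VecC-style document vector: average of word embeddings.

    With corrupt_rate > 0 (training-time corruption), each word is dropped
    independently with that probability and the survivors are rescaled by
    1 / (1 - corrupt_rate), so the corrupted average is an unbiased estimate
    of the clean average. With corrupt_rate == 0 (test time) the result is
    a plain average of the document's word embeddings.
    """
    rng = rng or np.random.default_rng()
    ids = [vocab[w] for w in tokens if w in vocab]
    if not ids:
        return np.zeros(embeddings.shape[1])
    vecs = embeddings[ids]                           # shape: (n_words, dim)
    if corrupt_rate > 0.0:
        keep = rng.random(len(ids)) >= corrupt_rate  # Bernoulli keep mask
        vecs = vecs[keep] / (1.0 - corrupt_rate)     # rescale survivors
        return vecs.sum(axis=0) / len(ids)           # divide by original length
    return vecs.mean(axis=0)


# Toy usage with a random embedding matrix (illustration only).
vocab = {"the": 0, "movie": 1, "was": 2, "excellent": 3}
emb = np.random.default_rng(0).normal(size=(len(vocab), 8))
train_vec = doc2vecc_representation(["the", "movie", "was", "excellent"], emb, vocab, corrupt_rate=0.3)
test_vec = doc2vecc_representation(["the", "movie", "was", "excellent"], emb, vocab)
```

Because the test-time representation is just an average of precomputed word vectors, encoding an unseen document costs one embedding lookup per word, which is the source of the efficiency claim in the abstract.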

