Log In Sign Up

Efficient Vector Representation for Documents through Corruption

by   Minmin Chen, et al.

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.


page 1

page 2

page 3

page 4


Sentiment Analysis by Joint Learning of Word Embeddings and Classifier

Word embeddings are representations of individual words of a text docume...

Semantic Regularities in Document Representations

Recent work exhibited that distributed word representations are good at ...

Hybrid Improved Document-level Embedding (HIDE)

In recent times, word embeddings are taking a significant role in sentim...

Document Embedding with Paragraph Vectors

Paragraph Vectors has been recently proposed as an unsupervised method f...

Multi-View Document Representation Learning for Open-Domain Dense Retrieval

Dense retrieval has achieved impressive advances in first-stage retrieva...

KeyVec: Key-semantics Preserving Document Representations

Previous studies have demonstrated the empirical success of word embeddi...

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

Since the amount of information on the internet is growing rapidly, it i...

Code Repositories


Doc2VecC from the paper "Efficient Vector Representation for Documents through Corruption"

view repo