Efficient Vector Representation for Documents through Corruption

07/08/2017
by   Minmin Chen, et al.

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings, and it ensures during learning that a representation generated this way captures the semantic meaning of the document. The framework includes a corruption model, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative words to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or outperforms the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
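
The averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy embedding table, and the specific corruption scheme (dropping each word with probability `drop_prob` and rescaling the survivors by `1 / (1 - drop_prob)` so the corrupted average stays unbiased) are assumptions made here for illustration, and the training objective is omitted entirely.

```python
import random


def average_embedding(tokens, embeddings, dim):
    """Represent a document as the plain average of its word embeddings."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings[tok]
        for i in range(dim):
            total[i] += vec[i]
    n = len(tokens)
    return [x / n for x in total] if n else total


def corrupted_average(tokens, embeddings, dim, drop_prob, rng):
    """Average after randomly dropping words (assumed corruption scheme).

    Each token is removed with probability drop_prob (must be < 1).
    Surviving tokens are scaled by 1 / (1 - drop_prob) so that the
    expected value equals the uncorrupted average.
    """
    scale = 1.0 / (1.0 - drop_prob)
    total = [0.0] * dim
    for tok in tokens:
        if rng.random() < drop_prob:
            continue  # this word is corrupted (dropped) for this sample
        vec = embeddings[tok]
        for i in range(dim):
            total[i] += scale * vec[i]
    n = len(tokens)
    return [x / n for x in total] if n else total


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings, for illustration only.
    toy = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
    doc = ["good", "movie"]
    print(average_embedding(doc, toy, 2))
    print(corrupted_average(doc, toy, 2, 0.5, random.Random(0)))
```

Because an unseen document needs only the averaging function, producing its representation at test time costs one pass over its tokens, which is consistent with the efficiency claim in the abstract.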

08/14/2017

Sentiment Analysis by Joint Learning of Word Embeddings and Classifier

Word embeddings are representations of individual words of a text docume...
03/24/2016

Semantic Regularities in Document Representations

Recent work exhibited that distributed word representations are good at ...
06/01/2020

Hybrid Improved Document-level Embedding (HIDE)

In recent times, word embeddings are taking a significant role in sentim...
07/29/2015

Document Embedding with Paragraph Vectors

Paragraph Vectors has been recently proposed as an unsupervised method f...
03/16/2022

Multi-View Document Representation Learning for Open-Domain Dense Retrieval

Dense retrieval has achieved impressive advances in first-stage retrieva...
09/27/2017

KeyVec: Key-semantics Preserving Document Representations

Previous studies have demonstrated the empirical success of word embeddi...
07/08/2018

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

Since the amount of information on the internet is growing rapidly, it i...

Code Repositories

iclr2017

Doc2VecC from the paper "Efficient Vector Representation for Documents through Corruption"

