Distributed Representations of Sentences and Documents

05/16/2014
by Quoc V. Le, et al.

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. For text, one of the most common fixed-length representations is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words, and they ignore the semantics of the words. For example, under bag-of-words, "powerful," "strong," and "Paris" are equally distant from one another. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector that is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representation. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
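
The two bag-of-words weaknesses named in the abstract can be illustrated with a minimal sketch. It is not code from the paper: the three-word vocabulary, the example sentences, and the helper names `one_hot` and `euclidean` are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def one_hot(word, vocab):
    """Return the bag-of-words (one-hot) vector of a single word."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical three-word vocabulary taken from the abstract's example.
vocab = ["powerful", "strong", "Paris"]
vectors = {w: one_hot(w, vocab) for w in vocab}

# Weakness 1: semantics are ignored. Every pair of distinct words is
# equally far apart, so "powerful" is no closer to "strong" than to "Paris".
d_powerful_strong = euclidean(vectors["powerful"], vectors["strong"])
d_powerful_paris = euclidean(vectors["powerful"], vectors["Paris"])
d_strong_paris = euclidean(vectors["strong"], vectors["Paris"])

# Weakness 2: word order is lost. Two sentences with different meanings
# map to the same bag of word counts.
bag_a = Counter("dog bites man".split())
bag_b = Counter("man bites dog".split())
same_bag = bag_a == bag_b
```

In this sketch all three pairwise distances are identical and `same_bag` is true, which is the behavior that a learned dense representation such as Paragraph Vector is meant to avoid.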


